MD5 checksums from downloaded pdfs to prevent duplication

This is a discussion on MD5 checksums from downloaded pdfs to prevent duplication within the Linux General forums, part of the Linux Forums category.


#11  04-21-2008  Rahul
Re: MD5 checksums from downloaded pdfs to prevent duplication

pk <pk@pk.invalid> wrote in news:fuhfjr$lfm$1@aioe.org:

> If you download stuff from the command line, then I see no major
> problems in doing what you want (apart from the obvious fact that, to
> calculate the md5 of a file, you still have to download it...but you
> can do that in a temporary directory).
> How were you thinking to implement that?


My bad! I should have phrased that more accurately. "Downloading" a
repeat is not my chief enemy. These are pretty small pdfs, so bandwidth
is not an issue. But it's important that I detect and delete a duplicate
download before I archive it in my repository. Otherwise, when I annotate,
I end up with half the comments on one copy and the other half
(unbeknownst to me) on another.

I was thinking of using a bash wrapper around the md5sum command: a
plain-text log of the MD5 sums of all my pdfs, plus a grep each time to
verify I don't already have a copy. I could fire that manually each time
I download a pdf, although it'd be nice to automate that step somehow.
Haven't thought of how to do that yet.
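
For what it's worth, the log-plus-lookup idea above can be sketched in a few
lines. This is only a rough sketch, written in Python rather than bash; the
function names md5_of and check_and_record and the log location are made up
for illustration. The log uses the same "checksum, two spaces, path" layout
that md5sum prints, so it can still be searched with grep.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_and_record(pdf_path, log_path):
    """Return the logged path of an identical earlier file, if any.

    If the checksum is not in the log yet, append it and return None.
    (Hypothetical helper, not part of any existing tool.)
    """
    checksum = md5_of(pdf_path)
    log = Path(log_path)
    if log.exists():
        for line in log.read_text().splitlines():
            known_sum, _, known_path = line.partition("  ")
            if known_sum == checksum:
                return known_path
    with log.open("a") as fh:
        fh.write(f"{checksum}  {pdf_path}\n")
    return None
```

A download wrapper could call check_and_record on each new pdf and delete the
file whenever the result is not None. Note that this only catches
byte-identical copies.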

> And, this is way OT, but I suggest you try zotero instead of endnote.
>


Oh! Good lead! Didn't know of that. The only ones I knew were EN, bibtex
and JabRef.


--
Rahul
#12  04-21-2008  Rahul
Re: MD5 checksums from downloaded pdfs to prevent duplication

Unruh <unruh-spam@physics.ubc.ca> wrote in
news:AnWOj.1041$XI1.766@edtnps91:

>
> The problem is that a single bit difference will give a different md5
> checksum.



Thanks Unruh! That's a very relevant issue if I use MD5. It'd be great if
I could find a "checksum" that was forgiving, i.e. based on some sort of
"average" bit value. If I have two versions of the same pdf, maybe
scanned at different times, or with the download time / IP imprinted on
one by the publisher (they often do that), then the checksum should still
remain identical (or "close").

I know that's not a real "checksum" but maybe you get the idea? Something
like a metric of similarity with a fuzz factor around it. A one to one
map such that all "reasonably similar" files end up with a metric close
to each other in checksum space.

Any ideas?
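
One way to picture such a "metric of similarity" is to compare the sets of
short byte sequences two files contain and report the fraction they share
(Jaccard similarity). The sketch below is only an illustration of that idea,
not an established checksum; the names shingles and similarity and the window
length k=8 are arbitrary choices. It assumes most bytes are shared between the
two files, as when a short stamp is appended. Because PDF streams are usually
compressed, a small visible change can alter many bytes, and two separate
scans will not match at the byte level at all.

```python
def shingles(data, k=8):
    """Return the set of all k-byte windows in data (at least one window)."""
    return {data[i:i + k] for i in range(max(len(data) - k + 1, 1))}


def similarity(a, b, k=8):
    """Jaccard similarity of the k-byte window sets: 1.0 identical, 0.0 disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = bytes(range(256)) * 4
    stamped = original + b"downloaded 2008-04-21 from 10.0.0.1"
    # A short appended stamp leaves most windows shared, so the score stays high.
    print(similarity(original, stamped))
```

The "fuzz factor" is then just a threshold on the score, chosen by trying it
on files known to be the same paper.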


--
Rahul
#13  04-21-2008  Rahul
Re: MD5 checksums from downloaded pdfs to prevent duplication

Javi <javibarroso@gmail.com> wrote in
news:a0994a7d-8568-4e2e-996e-877be073da6e@59g2000hsb.googlegroups.com:

>
>
> I'd use an application like fdupes
> (http://packages.debian.org/sid/fdupes), with a cron task perhaps
>



fdupes sounds great. I'm starting with that to discover my preexisting
duplicates. Later though, it might be easier to have an md5sum wrapper
script, since I want to test just one specific file against all that I
already have.
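
For the first step (finding duplicates that are already in the archive), the
general approach can be sketched as: group files by size, then compare MD5
sums only within groups of equal size. This is a simplified sketch in the
spirit of fdupes, not its actual implementation; find_duplicates is a made-up
name, and each file is read fully into memory on the assumption that the pdfs
are small.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Return sorted lists of paths under root that share size and MD5 sum."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_sum = defaultdict(list)
        for path in paths:
            with open(path, "rb") as fh:
                by_sum[hashlib.md5(fh.read()).hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_sum.values() if len(g) > 1)
    return groups
```

Each returned list is one set of identical files; which copy to keep (for
example, the one that already carries annotations) is left to the user.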

Now, if only I knew of a "fuzzy fdupes"! The test would be something
that can say two consecutive scanner images of the same page are the
"same"!

--
Rahul
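
The "fuzzy fdupes" wish above, together with the earlier idea of a checksum
based on an "average" value, is close to what is usually called an average
hash: shrink the page image to a small grid, mark each cell as brighter or
darker than the overall mean, and compare the resulting bit patterns. The
sketch below assumes the pdf page has already been rasterised by some other
tool into a grid of grayscale values (rows of numbers) at least 8 pixels wide
and high; that step is not shown, and average_hash and hamming are
illustrative names.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale image.

    pixels is a list of equal-length rows of brightness values; the image
    must be at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        for c in range(size):
            rows = range(r * height // size, (r + 1) * height // size)
            cols = range(c * width // size, (c + 1) * width // size)
            block = [pixels[y][x] for y in rows for x in cols]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)  # 1 if the cell is brighter than average
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two scans of the same page should give fingerprints that differ in only a few
bits, so "same" becomes "Hamming distance below some threshold". The right
threshold is a guess that would need tuning on real scans.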