This is a discussion on MD5 checksums from downloaded pdfs to prevent duplication within the Linux General forums, part of the Linux Forums category.
pk <pk@pk.invalid> wrote in news:fuhfjr$lfm$1@aioe.org:
> If you download stuff from the command line, then I see no major
> problems in doing what you want (apart from the obvious fact that, to
> calculate the md5 of a file, you still have to download it...but you
> can do that in a temporary directory).
> How were you thinking to implement that?

My bad! I should have phrased that more accurately. "Downloading" a repeat is not my chief enemy. These are pretty small pdfs, so bandwidth is not an issue. But it's important that I detect and delete a duplicate download before I archive it in my repository. Otherwise, when I annotate, I end up with half the comments on one copy and the other half (unbeknownst to me) on another.

I was thinking of using a bash wrapper around the md5sum command: a text-format log of all pdf MD5 sums, plus a grep each time to verify I don't have a preexisting copy. I could fire that manually each time I download a pdf, although it'd be nice to automate the fire-up somehow. Haven't thought of how to do that yet.

> And, this is way OT, but I suggest you try zotero instead of endnote.

Oh! Good lead! Didn't know of that. The only ones I knew were EN, bibtex and JabRef.

--
Rahul
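For illustration, the log-plus-lookup idea could be sketched as below. This is written in Python with `hashlib` rather than as the bash wrapper around md5sum mentioned in the post, and the function names `md5_of` and `check_and_record` are made up for this example. The log uses the same "digest, two spaces, path" layout that md5sum prints, which is a choice made here and not something the thread specifies.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_and_record(pdf_path, log_path):
    """Return the logged path of an earlier identical file, or None.

    When no earlier copy is found, append "digest  path" to the log
    (hypothetical log layout, modelled on md5sum output).
    """
    digest = md5_of(pdf_path)
    log = Path(log_path)
    if log.exists():
        for line in log.read_text().splitlines():
            known_digest, _, known_path = line.partition("  ")
            if known_digest == digest:
                return known_path
    with log.open("a") as fh:
        fh.write(f"{digest}  {pdf_path}\n")
    return None
```

A caller would run `check_and_record` on each fresh download and delete the new file whenever a path comes back instead of `None`. How to trigger it automatically after every download is left open, as in the post.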
Unruh <unruh-spam@physics.ubc.ca> wrote in news:AnWOj.1041$XI1.766@edtnps91:
> The problem is that a single bit difference will give a different md5
> checksum.

Thanks Unruh! That's a very relevant issue if I use MD5. It'd be great if I could find a "checksum" that was forgiving, i.e. based on some sort of "average" bit value. If I have two versions of the same pdf, maybe scanned at different times, or with the download time / IP imprinted on one of them by the publisher (they often do that), then the checksum should still remain identical (or "close").

I know that's not a real "checksum", but maybe you get the idea? Something like a metric of similarity with a fuzz factor around it. A one-to-one map such that all "reasonably similar" files end up with a metric close to each other in checksum space.

Any ideas?

--
Rahul
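One way to picture a "metric of similarity with a fuzz factor" is to compare the sets of short byte windows (shingles) that two files contain and take the Jaccard ratio of those sets. The Python sketch below does that; it is only an illustration, not a tool anyone in the thread proposed. The names `shingles` and `similarity` and the window size `k=8` are arbitrary choices for this example. It works on raw bytes, so it could tolerate a small stamp added to an otherwise identical file, but it would not recognise two separate scans of the same page, whose pixel data differs throughout.

```python
def shingles(data, k=8):
    """Return the set of all k-byte windows (shingles) found in data."""
    if len(data) < k:
        return {data}
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def similarity(a, b, k=8):
    """Jaccard similarity of the two shingle sets.

    1.0 means the sets are identical, 0.0 means no window is shared.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Two files would then count as "the same" when `similarity` exceeds some threshold. The threshold is the fuzz factor, and a sensible value would have to be found by trying it on real pdfs. Each file's shingle set is held in memory, which should be acceptable for small pdfs.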
Javi <javibarroso@gmail.com> wrote in news:a0994a7d-8568-4e2e-996e-877be073da6e@59g2000hsb.googlegroups.com:
> I'd use an application like fdupes
> (http://packages.debian.org/sid/fdupes), with a cron task perhaps

fdupes sounds great. I'm starting with that to discover my preexisting duplicates. Later, though, it might be easier to have an md5sum wrapper script, since I want to test just one specific file against everything I already have.

Now, if only I knew of a "fuzzy fdupes"! The test would be something that can say two consecutive scanner images of the same page are the "same"!

--
Rahul
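For the one-off pass over an existing repository, a rough Python stand-in is sketched below. It is not fdupes: it groups files by size, hashes only the files that share a size, and reports the groups whose MD5 digests match, with no byte-by-byte confirmation afterwards. The function name `find_duplicates` is made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root that share both size and MD5."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            # Whole-file read: assumes small files, as with the pdfs here.
            with open(path, "rb") as fh:
                by_digest[hashlib.md5(fh.read()).hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

Each returned group lists copies of one file, so all but one path in a group are candidates for deletion once any annotations have been merged.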