Mon 22 Jan 2007
As part of the purging process of becoming a nomad we are trying to make an archive of important data. This is not an easy task. We have about 30 Gigs of photos. I know there are duplicates in the data but I haven't done anything about it until now.
In integrity part1 I told how to check the md5deep database (just a text file with md5sum and filename) to see if there are any duplicates. Example: sort -n md5.test_data.txt | uniq -D -w 32 This will check the first 32 bytes of the md5 sums after sorting them. This works great for detecting duplicates.
But what can I do about them? Sometimes I have a duplicate on purpose. For example if I have a directory tree with 500 photos from a single shoot I want to make a directory with the "best of." I could do several things: Make symbolic links to the files, copy them to a new folder, or store best of stuff outside the main backup tree. I opt for just making a copy of the file into a "bob" folder. It is wasteful of space but this is the method that I choose.
So now that I have copied them I have files with duplicate md5sums. After pondering on this for a while I came up with the idea of changing the jpeg comment field to say something like: "I know this is a copy" or "Best Of Photos."
To accomplish this I am using my good friend jhead - the jpeg header manipulator. Example:
find ./best_of_best/ -type f | xargs -n1 jhead -cl \"Best of Photos\"
This does the trick… now each file even though it is really the same photo with the same image data and same file name has a different md5sum.
Some duplicates in the archive are caused by sloppy photo management. Sometimes I do not delete the files off of a card before taking more and end up having more than one copy of a photo in different directories. With the jpeg header trick I can now either delete them or just change the comment field.
I suppose in the future when I start using the comment field more this method could overwrite valuable comment info. I guess when that becomes a problem I will add checks to make sure the comment field is empty before overwriting.