As part of the purging process of becoming a nomad, we are trying to make an archive of our important data. This is not an easy task. We have about 30 GB of photos. I know there are duplicates in the data, but I haven't done anything about them until now.

In integrity part1 I explained how to check the md5deep database (just a text file with an md5sum and a filename on each line) to see if there are any duplicates. Example:

sort -n md5.test_data.txt | uniq -D -w 32

After sorting, uniq compares only the first 32 characters of each line, which is the md5 sum itself, and prints every line whose sum appears more than once. This works great for detecting duplicates.
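A variation on the same check, as a minimal sketch: it prints the duplicates in groups with a blank line between them, which is easier to read when there are many. The function name list_duplicates is made up for illustration, and it assumes GNU uniq and a database where every line starts with the 32-character md5 sum.

```shell
# Print files that share an md5 sum, grouped, with a blank line between groups.
# Assumes each line of the database looks like "<32 hex chars>  <path>".
# list_duplicates is a hypothetical helper name, not part of md5deep.
list_duplicates() {
    sort "$1" | uniq -w 32 --all-repeated=separate
}
```

Calling list_duplicates md5.test_data.txt should list only the files whose sums occur more than once.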

But what can I do about them? Sometimes I have a duplicate on purpose. For example, if I have a directory tree with 500 photos from a single shoot, I want to make a directory with the "best of." I could do several things: make symbolic links to the files, copy them to a new folder, or store the best-of stuff outside the main backup tree. I opt for just copying the files into a "bob" folder. It wastes space, but it is the method I have chosen.

So now that I have copied them, I have files with duplicate md5sums. After pondering this for a while I came up with the idea of changing the jpeg comment field to say something like "I know this is a copy" or "Best Of Photos."

To accomplish this I am using my good friend jhead - the jpeg header manipulator. Example:

find ./best_of_best/ -type f -print0 | xargs -0 -n1 jhead -cl "Best of Photos"

This does the trick. Each file now has a different md5sum from its original, even though it is really the same photo with the same image data and the same file name.
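To confirm that a tagged copy no longer collides with its original, the two sums can be compared directly. This is only a sketch: differs_by_md5 is a made-up helper name, and it assumes the md5sum command from GNU coreutils is available.

```shell
# Succeed (exit status 0) when the two files have different md5 sums.
# differs_by_md5 is a hypothetical helper; md5sum is assumed to be installed.
differs_by_md5() {
    [ "$(md5sum < "$1")" != "$(md5sum < "$2")" ]
}
```

For example, differs_by_md5 original.jpg copy.jpg && echo "no longer a duplicate", where both file names are placeholders.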

Some duplicates in the archive are caused by sloppy photo management. Sometimes I do not delete the files from a card before taking more, and I end up with more than one copy of a photo in different directories. With the jpeg header trick I can now either delete the extra copies or just change their comment field.

I suppose that in the future, when I start using the comment field more, this method could overwrite valuable comment info. When that becomes a problem I will add a check to make sure the comment field is empty before overwriting it.
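Such a check might look roughly like the sketch below. It rests on an assumption I have not verified against every jhead version: that running jhead on a file prints a line beginning with "Comment" when a comment is present. The names has_comment and tag_if_empty are made up for illustration.

```shell
# Succeed if jhead reports an existing comment for the file.
# Assumption: jhead prints a line starting with "Comment" when one is set.
has_comment() {
    jhead "$1" 2>/dev/null | grep -q '^Comment'
}

# Write the comment only when the file has none; otherwise leave it untouched.
# tag_if_empty is a hypothetical helper name.
tag_if_empty() {
    file=$1
    comment=$2
    if has_comment "$file"; then
        echo "skipped: $file"
    else
        jhead -cl "$comment" "$file" >/dev/null && echo "tagged: $file"
    fi
}
```

It could then be run over the folder with something like: for f in ./best_of_best/*.jpg; do tag_if_empty "$f" "Best of Photos"; done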