SmugMug Duplicate Image Hunting
February 5, 2010 Leave a comment
One of the many things the developers of Thumbsplus got right was a proper normalized database schema. When I first inspected the layout of a Thumbsplus database I knew I was in good hands. In Thumbsplus image files get unique keys and image galleries are simply lists of image keys. Images can appear in any number of galleries, without duplication, just the way the gods of database design intended.
Assigning unique keys and grouping by key lists is so correct that it was a shock to discover that SmugMug, until recently, eschewed this principle. Prior to a recent upgrade if you wanted to display an image in more than one gallery you had to … shudder with horror …. make copies! Whenever I made an image copy I felt like I was masturbating in an art museum.
This outrage is now fixed and you can place an image in as many galleries as you want without copying. Unfortunately there is a residual problem. How do you hunt down and exterminate all your bogus copies? In an acronym: MD5. SmugMug assigns MD5′ s to all images. If two MD5′s are the same there is an extremely high likelihood you are dealing with copies. So all you have to do is find images with identical MD5′s and delete the extra copies. The following J verb uses image tables created from the XML captured by my SmugMug metadata dumper to do just this.
SmugDupsFrMD5=:3 : 0
NB.*SmugDupsFrMD5 v-- duplicate SmugMug images from MD5.
NB.
NB. monad: btct =. SmugDupsFrMD5 clDirectory
NB.
NB. SmugDupsFrMD5 'c:\pd\docs\smugmug\data\'
NB. read table files
path=.tslash y
albums=. readtd2 path,SMUGALBUMTABLE
images=. readtd2 path,SMUGIMAGETABLE
images=. }. images [ imhead=. 0 { images
NB. all duplicate MD5's
pos=. imhead i. <'MD5'
md5=. pos {"1 images
dup=. md5 #~ -. ~:md5
images=. (md5 e. dup)#images
images=. (/: pos {"1 images) { images
NB. remove images with matching smugmug pids
NB. these are proper virtual images and not copies
pos=. imhead i. <'PID'
pid=. pos {"1 images
dup=. pid #~ -. ~:pid
if. #images=. (-.pid e. dup)#images do.
NB. retain selected columns and insert album names
images=. (imhead i. ;:'FILENAME GID PID MD5 ALBUMURL') {"1 images
albums=. ((0 {"1 albums) i. 1 {"1 images){ 1 {"1 albums
images=. albums (<a:;1)} images
NB. group by MD5
images=. (~:3 {"1 images) <;.1 images
images=. >&.>@:(<"1@|:) &> images
NB. order MD5 groups by galleries in groups
NB. this results in a good order for editing
NB. out the duplicates on SmugMug
images=. (\:&.> 1 {"1 images) {&.> images
(\: 0 {&> 1 {"1 images){images
else.
NB. no duplicates
0 5$''
end.
)
J source is not supported by the WordPress source code plugin so no syntax coloring for now.


Recent Comments