SmugMug Duplicate Image Hunting

One of the many things the developers of Thumbsplus got right was a properly normalized database schema. When I first inspected the layout of a Thumbsplus database I knew I was in good hands. In Thumbsplus, image files get unique keys and image galleries are simply lists of image keys. Images can appear in any number of galleries, without duplication, just the way the gods of database design intended.

Assigning unique keys and grouping by key lists is so correct that it was a shock to discover that SmugMug, until recently, eschewed this principle. Prior to a recent upgrade, if you wanted to display an image in more than one gallery you had to … shudder with horror … make copies! Whenever I made an image copy I felt like I was masturbating in an art museum.

This outrage is now fixed and you can place an image in as many galleries as you want without copying. Unfortunately there is a residual problem. How do you hunt down and exterminate all your bogus copies? In an acronym: MD5. SmugMug assigns an MD5 hash to every image. If two MD5s are the same there is an extremely high likelihood you are dealing with copies. So all you have to do is find images with identical MD5s and delete the extra copies. The following J verb uses image tables created from the XML captured by my SmugMug metadata dumper to do just this.

SmugDupsFrMD5=:3 : 0

NB.*SmugDupsFrMD5 v-- duplicate SmugMug images from MD5.
NB.
NB. monad:  btct =. SmugDupsFrMD5 clDirectory
NB.
NB.   SmugDupsFrMD5 'c:\pd\docs\smugmug\data\'

NB. read table files
path=.tslash y
albums=. readtd2 path,SMUGALBUMTABLE
images=. readtd2 path,SMUGIMAGETABLE
images=. }. images [ imhead=. 0 { images

NB. all duplicate MD5's
pos=. imhead i. <'MD5'
md5=. pos {"1 images
dup=. md5 #~ -. ~:md5
images=. (md5 e. dup)#images
images=. (/: pos {"1 images) { images

NB. remove images with matching smugmug pids
NB. these are proper virtual images and not copies
pos=. imhead i. <'PID'
pid=. pos {"1 images
dup=. pid #~ -. ~:pid
if. #images=. (-.pid e. dup)#images do.

  NB. retain selected columns and insert album names
  images=. (imhead i. ;:'FILENAME GID PID MD5 ALBUMURL') {"1 images
  albums=. ((0 {"1 albums) i. 1 {"1 images){ 1 {"1 albums
  images=. albums (<a:;1)} images

  NB. group by MD5
  images=. (~:3 {"1 images) <;.1 images
  images=. >&.>@:(<"1@|:) &> images

  NB. order MD5 groups by galleries in groups
  NB. this results in a good order for editing
  NB. out the duplicates on SmugMug
  images=. (\:&.> 1 {"1 images) {&.> images
  (\: 0 {&> 1 {"1 images){images
else.
  NB. no duplicates
  0 5$''
end.
)
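
The two filtering steps in the verb can be tried in isolation. The sketch below uses small made-up columns (the `md5` and `pid` values are hypothetical, not SmugMug data) and assumes, as the verb does, that each column is a list of boxed strings. It first keeps rows whose MD5 repeats, then drops rows whose PID repeats, since those are the same image shown in two galleries rather than real copies.

```j
NB. hypothetical toy columns, one entry per image
md5=: 'aa';'bb';'aa';'cc';'bb'
pid=: 'p1';'p2';'p1';'p3';'p4'

NB. ~: flags first occurrences, so -. ~: keeps only the repeats
dup=: md5 #~ -. ~: md5
mask=: md5 e. dup          NB. 1 1 1 0 1

NB. keep only rows with a repeated MD5
md5=: mask # md5
pid=: mask # pid

NB. a repeated PID is one image in two galleries, not a copy
copies=: -. pid e. pid #~ -. ~: pid
copies # md5
```

Here the two `aa` rows share a PID and are discarded, leaving the two `bb` rows as genuine copies to clean up.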

J source is not supported by the WordPress source code plugin so no syntax coloring for now.
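
The "group by MD5" step relies on the rows already being sorted by MD5, so that equal keys sit next to each other. A minimal sketch of that idea, again with hypothetical data and names (`keys`, `rows`, `frets` and `groups` are not part of the verb):

```j
NB. toy rows already sorted by key (hypothetical data)
keys=: 'aa';'aa';'bb';'bb';'bb'
rows=: keys ,. 'one.jpg';'two.jpg';'x.jpg';'y.jpg';'z.jpg'

NB. ~: flags the first row of each run of equal keys
frets=: ~: keys

NB. <;.1 starts a new boxed group at every flagged row
groups=: frets <;.1 rows

NB. number of rows in each group
#&> groups
```

Each box in `groups` holds the rows for one key, which is the shape the verb then reorders for editing.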
