
How do you detect duplicates? And how does IM Fingerprint work?

Posted: 2018-11-18T03:19:49-07:00
by chani
Hi there,

I tried searching for copy/duplicate/detect duplicate but didn't find anything here. If I overlooked something (that is, if this has already been answered somewhere on the board), please let me know. I also read https://www.imagemagick.org/Usage/compare/. I am looking for a way to automate this.

Long story short: about a year ago I lost my photos as well as my backups and ended up with a folder containing most of them, along with modified (denoised, gamma-corrected, sharpened, scaled) duplicates of the originals. Now I need to get rid of the duplicates. For now I really just want to detect duplicates; choosing which of them to keep isn't that important yet.

So I tried the following:

1. Simple IM fingerprint (storing every photo's fingerprint in an array and, while iterating over all my photos, checking whether one matches) - that seems to work quite well.
2. Downscale to 64x64 (I also tested 32x32), convert to grayscale, create three additional versions rotated in 90-degree steps, and take the fingerprints of those to check for duplicates.
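
For reference, approach 1 can be sketched roughly like this. It is only an illustration: the function name `groupByFingerprint` and the `$fingerprint` callback are made up for this example, and the callback stands in for whatever fingerprint is actually used.

```php
<?php
/**
 * Group file paths by fingerprint. Any group with more than one
 * entry is a set of candidate duplicates.
 *
 * @param string[] $files       paths to check
 * @param callable $fingerprint hypothetical callback: path -> fingerprint string
 * @return array fingerprint => list of paths sharing that fingerprint
 */
function groupByFingerprint(array $files, callable $fingerprint): array
{
    $groups = [];
    foreach ($files as $file) {
        // Files with the same fingerprint end up in the same bucket.
        $groups[$fingerprint($file)][] = $file;
    }
    // Keep only fingerprints that were seen more than once.
    return array_filter($groups, function (array $paths) {
        return count($paths) > 1;
    });
}
```

With Imagick installed, the callback could be something like `function ($f) { return (new \Imagick($f))->getImageSignature(); }`. Since the lookup is keyed by the fingerprint, each photo is only processed once.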

I could use a helping hand or an idea about 2. To downscale:

- first I used sample. That is pretty fast, but no copies are detected.
- then I used scale. That is a little slower, and still no copies are detected.
- then I used resize with POINT and BOX, which is a little slower again - still no copies.
- then I used resize with GAUSSIAN and HERMITE - GAUSSIAN is the slowest(!), HERMITE is a bit slower than the variants above. THIS one detects some duplicates (so yes, it does work; it's just a little too slow).

Using sample/scale followed by a Gaussian blur is still faster than using resize with GAUSSIAN, but it does not detect duplicates. So I'm curious: why do GAUSSIAN_RESIZE and HERMITE_RESIZE work, while SAMPLE/SCALE+GAUSSIAN/BLUR does not?

By the way, the fingerprint I am using is the one PHP's \Imagick::getImageSignature() returns. Is that perhaps the wrong thing to use for what I want to do? I'm not limited to PHP; Bash would be fine as well. How do you do it?

I noticed that auto-levels does not change the fingerprint. I am looking for a way to still detect color-distorted or gamma-corrected photos as copies; that is why I do the grayscale conversion. I also tried creating an edge mask and using that, but creating the mask takes far too long.

Thanks in advance,
Jean

Re: How do you detect duplicates? And how does IM Fingerprint work?

Posted: 2018-11-18T10:38:57-07:00
by chani
Okay, I wrote something that seems to work, based on what I read about aHash. Here's the PHP code:

Code: Select all

        
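        // Load the image and shrink it to 16x16 by point sampling.
        $im = new \Imagick($file);
        $im->sampleImage(16, 16);
        // Drop color information so only brightness contributes to the hash.
        $im->transformImageColorspace(\Imagick::COLORSPACE_GRAY);
        // The image is grayscale now, so the red channel mean is the overall mean.
        $data = $im->getImageChannelMean(\Imagick::CHANNEL_RED);
        $mean = $data['mean'];
        // Threshold at the mean: every pixel becomes black or white.
        $im->thresholdImage($mean);
        // Fingerprint of the thresholded image; it matches only if every pixel agrees.
        $hash = $im->getImageSignature();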
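
The signature above only matches when every thresholded pixel agrees. A more tolerant variant, sketched here under assumptions, keeps the bits themselves and counts how many of them differ (the Hamming distance). The helper names `averageHashBits` and `hammingDistance` are hypothetical, and the input is assumed to be a flat array of grayscale values rather than an Imagick object.

```php
<?php
/**
 * Average hash: one bit per pixel, '1' where the pixel is brighter than the mean.
 *
 * @param array $gray flat list of grayscale values (assumed input format)
 */
function averageHashBits(array $gray): string
{
    $mean = array_sum($gray) / count($gray);
    $bits = '';
    foreach ($gray as $value) {
        $bits .= $value > $mean ? '1' : '0';
    }
    return $bits;
}

/** Number of positions at which two equal-length bit strings differ. */
function hammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Hashes must have the same length.');
    }
    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

// Toy 2x2 "images": the second is a uniformly brightened copy of the first.
$original = [10, 200, 30, 220];
$brighter = [40, 230, 60, 250];
echo hammingDistance(averageHashBits($original), averageHashBits($brighter)), PHP_EOL; // prints 0
```

One way to get such an array from the 16x16 grayscale image might be `$im->exportImagePixels(0, 0, 16, 16, 'I', \Imagick::PIXEL_CHAR)`. Two photos would then count as duplicates when the distance stays below some small cutoff, which has to be tuned on the actual photo collection.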

Re: How do you detect duplicates? And how does IM Fingerprint work?

Posted: 2018-11-18T12:09:31-07:00
by fmw42
You could use perceptual hash techniques. ImageMagick has a color phash. See https://imagemagick.org/discourse-serve ... =4&t=24906

I have built some other perceptual hash scripts at http://www.fmwconcepts.com/imagemagick/ ... /index.php.