
Finding duplicate photos is something you can do reasonably well without resorting to the more advanced techniques, depending on how "different" the duplicates are. In general though, I'd only consider an image to be a duplicate if it's a scaled version of the original (and maybe rotated).

When I've done it in the past I've created a 5x5 greyscale thumbnail of each image and then done bitwise comparisons between them. You can do more complex stuff like normalisation of brightness. Really the main thing to focus on is the vectorisation so you can quickly compare truckloads of images.
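A minimal pure-Python sketch of that thumbnail-hash idea, assuming you already have greyscale pixel values (0-255) as a 2D list — actually loading and converting the image (e.g. with Pillow) is left out, and all names here are illustrative:

```python
def thumbnail_hash(pixels, size=5):
    """Block-average a 2D greyscale image down to size x size, then
    threshold each cell at the overall mean to get one bit per cell."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for by in range(size):
        for bx in range(size):
            # average the block of source pixels that falls in this cell
            y0, y1 = by * h // size, (by + 1) * h // size
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # pack into an int: one bit per cell, brighter-than-mean -> 1
    return sum(1 << i for i, v in enumerate(cells) if v >= mean)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

A scaled copy of an image block-averages to (nearly) the same 5x5 thumbnail, so its hash is identical or within a small Hamming distance of the original's — you'd compare with something like `hamming(h1, h2) <= threshold`.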

If I were doing it now I'd probably use the width:height ratio to reduce the search space and then quickly check the 4 rotations of my sample hash against all known hashes of the same ratio. And I'd probably start by finding all images that were under a certain size, assume they were thumbnails, and try to find matching originals so I could remove them from the data set.
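The bucketing-plus-rotations part could be sketched like this (again pure Python and purely illustrative — `thumbnail_hash`/`hamming` are the same toy 5x5 average-hash idea as above, and every name is made up):

```python
from collections import defaultdict

def thumbnail_hash(pixels, size=5):
    """Block-average a greyscale image to size x size, threshold at the mean."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for by in range(size):
        for bx in range(size):
            y0, y1 = by * h // size, (by + 1) * h // size
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return sum(1 << i for i, v in enumerate(cells) if v >= mean)

def hamming(a, b):
    return bin(a ^ b).count("1")

def rotations(pixels):
    """Yield the image rotated by 0, 90, 180 and 270 degrees."""
    cur = pixels
    for _ in range(4):
        yield cur
        cur = [list(row) for row in zip(*cur[::-1])]  # 90 degrees clockwise

def ratio_key(w, h, places=2):
    """Round the width:height ratio so near-identical ratios share a bucket."""
    return round(w / h, places)

def build_index(library):
    """library: iterable of (name, pixels) pairs -> {ratio: {hash: name}}."""
    index = defaultdict(dict)
    for name, px in library:
        index[ratio_key(len(px[0]), len(px))][thumbnail_hash(px)] = name
    return index

def find_match(pixels, index, max_dist=0):
    """Check the 4 rotations of a sample against only the hashes that
    share its aspect ratio (a 90-degree turn inverts the ratio, so each
    rotation looks up its own bucket)."""
    for rot in rotations(pixels):
        bucket = index.get(ratio_key(len(rot[0]), len(rot)), {})
        sample = thumbnail_hash(rot)
        for known, name in bucket.items():
            if hamming(sample, known) <= max_dist:
                return name
    return None
```

For a big library you'd want something better than the linear scan per bucket (e.g. a BK-tree or multi-index hashing on the hash bits), but the bucketing alone already cuts the candidate set down sharply.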

It depends on what exactly has happened to your library in the first place, but I'd be surprised if you had lots of images that had undergone arbitrary transformations. Though obviously, only you know what that data looks like to start with! :-)



