
This is pretty neat. I've been using pHash to find duplicate photos across my personal library, but this seems significantly better. I'd like to wrap it as a Python library - what are the bits that need improving?


Unfortunately it needs a huge amount of work to be production ready. I'd love to see it wrapped up in Python though, and I'll be continuing to work on it. The main issue you will run into is that the number of triangles computed from the keypoints grows as n^3 (where n is the number of keypoints), so currently it takes far too long to match large images. I have a few ideas on how to solve this, but none of them have led to quick and easy solutions... yet.
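To get a feel for that growth, here is a rough sketch. It assumes a triangle is any unordered triple of distinct keypoints, which may not be exactly how the project enumerates them; `triangle_count` is just an illustrative name.

```python
from math import comb


def triangle_count(n_keypoints: int) -> int:
    """Count unordered triples of distinct keypoints (assumed definition of a triangle)."""
    return comb(n_keypoints, 3)


# The count is n(n-1)(n-2)/6, so it grows cubically with the number of keypoints.
for n in (10, 100, 1000):
    print(f"{n} keypoints -> {triangle_count(n)} triangles")
```

Going from 100 to 1000 keypoints multiplies the triangle count by roughly a thousand, which is why large images become impractical.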

I'm currently working on an improved method for getting 2D affine invariant keypoints because the current one just barely works. After that I will be working on handling the scaling issue.


Finding duplicate photos is something you can do reasonably well without resorting to the more advanced techniques, depending on how "different" the duplicates are. In general though, I'd only consider an image to be a duplicate if it's a scaled version of the original (and maybe rotated).

When I've done it in the past I've created a 5x5 greyscale thumbnail of each image and then done bitwise comparisons. You can do more complex stuff like normalisation of brightness. Really the main thing to focus on is the vectorisation so you can quickly compare truckloads of images.
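A minimal sketch of that idea in plain Python, assuming the image is already decoded to a list of rows of 0-255 greyscale values. Thresholding each cell against the thumbnail mean is my own choice here (it doubles as a crude brightness normalisation) and is not necessarily what was done originally; all function names are illustrative.

```python
def thumbnail(pixels, size=5):
    """Downscale a greyscale image (rows of 0-255 values) to size x size by block averaging."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for ty in range(size):
        y0 = ty * h // size
        y1 = max((ty + 1) * h // size, y0 + 1)
        row = []
        for tx in range(size):
            x0 = tx * w // size
            x1 = max((tx + 1) * w // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def thumb_hash(pixels, size=5):
    """Pack the thumbnail into an integer: one bit per cell, set if the cell is above the mean."""
    cells = [v for row in thumbnail(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | int(v > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes (XOR, then count the set bits)."""
    return bin(a ^ b).count("1")


# A 10x10 gradient and the same image scaled up 2x by pixel repetition.
small = [[x * 10 + y for x in range(10)] for y in range(10)]
big = [[small[y // 2][x // 2] for x in range(20)] for y in range(20)]
print(hamming(thumb_hash(small), thumb_hash(big)))
```

For truckloads of images you would keep the hashes in a NumPy integer array and XOR the whole array at once rather than looping in Python.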

If I were to do it now I'd probably use the width:height ratio to reduce the search space and then quickly check the 4 rotations of my sample hash against all known hashes of the same ratio. And I'd probably start by finding all images that were under a certain size, assume they were thumbnails, and try to find matching originals so I could remove them from the data set.
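Roughly, the ratio bucketing and rotation check could look like the sketch below. It assumes each hash is a small square grid stored as a tuple of row tuples, uses exact matching only, and keys buckets on the long:short side ratio so that a 90 degree rotation lands in the same bucket. `HashIndex`, `ratio_key` and the rounding precision are all hypothetical choices.

```python
from collections import defaultdict


def rotate90(grid):
    """Rotate a square grid (tuple of row tuples) 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def rotations(grid):
    """Return the grid at 0, 90, 180 and 270 degrees."""
    out = [grid]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out


def ratio_key(width, height, digits=2):
    """Orientation-independent aspect ratio, rounded so near-equal ratios share a bucket."""
    return round(max(width, height) / min(width, height), digits)


class HashIndex:
    """Known hashes grouped by aspect ratio, so a lookup only scans one bucket."""

    def __init__(self):
        self.buckets = defaultdict(dict)  # ratio key -> {grid: image name}

    def add(self, name, width, height, grid):
        self.buckets[ratio_key(width, height)][grid] = name

    def find(self, width, height, grid):
        """Return the name of a stored image matching any rotation of grid, else None."""
        bucket = self.buckets.get(ratio_key(width, height), {})
        for candidate in rotations(grid):
            if candidate in bucket:
                return bucket[candidate]
        return None


index = HashIndex()
original = ((1, 0, 0), (0, 0, 0), (0, 0, 0))
index.add("orig.jpg", 400, 300, original)
print(index.find(300, 400, rotate90(original)))
```

A real version would want a Hamming-distance tolerance instead of exact dictionary lookups, and some care with ratios that fall right on a rounding boundary.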

It depends on what exactly has happened to your library in the first place but I'd be surprised if you had lots of images that had undergone arbitrary transformations. Though obviously, only you know what that data looks like to start with! :-)



