You'd just have the network generate a fingerprint for any given song, similar to how facial recognition is done.

Siamese networks are what you want: two identical stacks of layers with shared weights (one side's output can be cached in this case) that produce the fingerprints, and then the final layers do the similarity matching.
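To make that concrete, here's a rough sketch of the shape of it in plain Python. Everything in it is made up for illustration: the `make_encoder`, `similarity`, and `best_match` names, the tiny 4-number "songs", and especially the weights, which are random rather than trained. A real setup would train the shared encoder (for example with a contrastive or triplet loss) on actual audio features, and might learn the final matching layers instead of using plain cosine similarity as I do here.

```python
import math
import random


def make_encoder(input_dim, embed_dim, seed=0):
    """Build one encoder; both 'twins' of the Siamese pair reuse it (shared weights)."""
    rng = random.Random(seed)
    # Assumption: random, untrained weights stand in for a trained network.
    weights = [[rng.gauss(0, 0.3) for _ in range(input_dim)] for _ in range(embed_dim)]

    def encode(features):
        # One dense layer with tanh, then L2-normalise to get the fingerprint.
        raw = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        return [v / norm for v in raw]

    return encode


def similarity(a, b):
    """Matching step: cosine similarity (inputs are already unit length)."""
    return sum(x * y for x, y in zip(a, b))


encode = make_encoder(input_dim=4, embed_dim=4)

# Hypothetical catalog: each song is reduced to a toy feature vector.
catalog = {
    "song_a": [1.0, 0.0, 0.0, 0.0],
    "song_b": [0.0, 1.0, 0.0, 0.0],
}

# The cached side: fingerprints for known songs are computed once.
cache = {name: encode(features) for name, features in catalog.items()}


def best_match(query_features):
    """Encode the query with the same encoder and compare against the cache."""
    query = encode(query_features)
    return max(cache, key=lambda name: similarity(query, cache[name]))
```

The point of the split is that only the query goes through the network at lookup time; the catalog side is just a table of precomputed fingerprints, which is the same trick face recognition systems use for enrolled faces.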