You'd just have the network generate a fingerprint for any given song, similar to how facial recognition is done.
Siamese networks are what you want: two identical stacks of layers with shared weights (one side's output cached in this case) that produce the fingerprints, and then the final layers do the similarity matching.
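Rough sketch of the idea in plain Python, just to show the shape of it. Everything here is made up for illustration: the layer sizes, the random (untrained) weights, the `song_N` names, and the random feature vectors standing in for real audio features. Cosine similarity is used as a stand-in for the learned matching layers, so this is not a trained or complete Siamese setup.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: 16 input features per clip, 8-dim fingerprint.
N_FEATURES, N_HIDDEN, N_EMBED = 16, 32, 8

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# One set of weights used by both branches; sharing them is what makes it Siamese.
# They are random here; a real system would train them.
W1 = random_matrix(N_FEATURES, N_HIDDEN)
W2 = random_matrix(N_HIDDEN, N_EMBED)

def matvec(x, w):
    """Multiply a vector (len rows) by a rows x cols matrix."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def fingerprint(features):
    """Run one branch: features -> ReLU hidden layer -> unit-length embedding."""
    hidden = [max(v, 0.0) for v in matvec(features, W1)]
    embedding = matvec(hidden, W2)
    norm = math.sqrt(sum(v * v for v in embedding))
    return [v / norm for v in embedding]

def similarity(a, b):
    """Cosine similarity of two unit-length fingerprints (stand-in for learned matching layers)."""
    return sum(x * y for x, y in zip(a, b))

# Cached branch: fingerprint every known song once and keep the results.
catalog = {
    f"song_{i}": [random.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
    for i in range(5)
}
cache = {name: fingerprint(feats) for name, feats in catalog.items()}

# Live branch: fingerprint a slightly noisy copy of song_2 with the same weights,
# then pick the cached fingerprint it is most similar to.
query = [v + random.gauss(0.0, 0.01) for v in catalog["song_2"]]
query_fp = fingerprint(query)
best_match = max(cache, key=lambda name: similarity(query_fp, cache[name]))
print(best_match)
```

The point of the cache is that at lookup time you only run the network once, on the query, and compare against stored fingerprints instead of re-running every song through the second branch.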