If the dataset fits in memory (under roughly 1-10 million entries), it might be faster to do a full matrix multiply in numpy than an approximate nearest-neighbor (ANN) search, since it avoids disk reads. I haven't benchmarked this, though.
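Something like this is what I have in mind (a minimal sketch, assuming unit-normalized float32 embeddings so cosine similarity reduces to a single matmul; the shapes and k are made up for illustration):

```python
import numpy as np

def brute_force_topk(queries, database, k=10):
    """Exact top-k neighbors by cosine similarity via one matrix multiply.

    Assumes both inputs are L2-normalized, so dot product == cosine sim.
    """
    # (n_queries, dim) @ (dim, n_db) -> (n_queries, n_db) similarity matrix
    sims = queries @ database.T
    # argpartition puts the k largest sims in the first k slots, O(n_db) per row
    idx = np.argpartition(-sims, k, axis=1)[:, :k]
    # sort only those k candidates into descending order
    rows = np.arange(queries.shape[0])[:, None]
    order = np.argsort(-sims[rows, idx], axis=1)
    return idx[rows, order]

# 1M database vectors of dim 128 is ~512 MB in float32, comfortably in RAM.
db = np.random.randn(1_000_000, 128).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[:5]  # query with known vectors; each should rank itself first
print(brute_force_topk(q, db, k=3))
```

The similarity matrix is (n_queries, n_db), so you'd batch the queries to keep memory bounded, but the database side stays a single in-memory array.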
Also, I might have skipped over it, but most implementations run PCA on the high-dimensional feature vectors first, since the data tends to be sparse. Is there a reason that's not done here?
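For what it's worth, on sparse inputs the usual variant is TruncatedSVD rather than plain PCA, since true PCA mean-centers and would densify the matrix. A rough sketch with made-up shapes and component count:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for a sparse high-dimensional feature matrix:
# 10k samples x 50k features at 0.1% density.
X = sparse_random(10_000, 50_000, density=0.001, random_state=0, format="csr")

# TruncatedSVD operates on the sparse matrix directly (no centering step).
svd = TruncatedSVD(n_components=128, random_state=0)
X_reduced = svd.fit_transform(X)  # dense (10000, 128) output

print(X_reduced.shape, svd.explained_variance_ratio_.sum())
```

The reduced dense vectors are then what you'd feed to the NN index (or the matmul above), which also makes the brute-force option cheaper.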