So it really depends on what you feed the clustering. In this case I'm clustering on the original embeddings, so the cluster assignments won't necessarily line up with the layout in the UMAP plot. I've also seen:
1- Clustering on the UMAP output. Here the plot would show clean separation of topics, but the clustering algorithm would be working on highly compressed data (from the 1024 dimensions of the embedding down to the 2 of the UMAP projection).
2- BERTopic's approach: UMAP down to 5 dimensions, clustering in that 5-dimensional space, then UMAP again from 5 down to 2 for plotting. Which is an interesting approach.
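A minimal numpy sketch of that second pipeline (reduce, cluster, reduce again for plotting). To keep it dependency-free, PCA stands in for UMAP and a tiny Lloyd's-algorithm loop stands in for whatever clusterer you'd actually use; the shapes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "embeddings": 3 well-separated topic blobs in 64 dimensions.
centers = rng.normal(size=(3, 64)) * 5
X = np.vstack([c + rng.normal(size=(50, 64)) for c in centers])

def reduce_dim(X, k):
    """PCA projection onto the top-k components (stand-in for UMAP here)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with a naive deterministic init
    (one seed point per stride; fine for this toy data)."""
    cent = X[:: len(X) // k][:k].copy()
    for _ in range(iters):
        d = ((X[:, None] - cent[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        cent = np.array([X[labels == j].mean(0) if (labels == j).any() else cent[j]
                         for j in range(k)])
    return labels

X5 = reduce_dim(X, 5)    # 64 -> 5: clustering happens here
labels = kmeans(X5, 3)   # cluster in the 5-D space
X2 = reduce_dim(X5, 2)   # 5 -> 2: plot this, colored by `labels`
```

The point of the intermediate 5-D step is exactly the trade-off discussed above: less compression than going straight to 2-D, but cheaper and denser for the clusterer than the raw embedding space.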
I've heard of people getting good results with all three. It's kinda hard to compare objectively, but my leaning was to give the clustering algorithm the representation containing the most information about the text.
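One way people do try to compare objectively is to score each candidate partition with something like a silhouette score. A from-scratch sketch (simplified silhouette, toy data; in practice you'd likely reach for sklearn's `silhouette_score`):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: (b - a) / max(a, b) per point, where
    a = mean distance to own cluster, b = lowest mean distance to another."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    scores = []
    for i, l in enumerate(labels):
        same = (labels == l)
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette undefined, skip
        a = D[i][same].mean()
        b = min(D[i][labels == m].mean() for m in set(labels) if m != l)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
# Three tight clusters along a line in 16-D.
X = np.vstack([rng.normal(c, 0.3, size=(30, 16)) for c in (-2.0, 0.0, 2.0)])
true = np.repeat([0, 1, 2], 30)
shuffled = rng.permutation(true)
# A partition matching the real structure scores far higher than a shuffled one.
```

The catch, of course, is choosing which space to evaluate in: scoring in the 2-D UMAP space flatters option 1, scoring in the original embedding space flatters clustering on raw embeddings.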
Right, BERTopic's two-stage UMAP is interesting. I've also seen people combine that with Louvain instead of k-means.
My intuition was: UMAP itself tries to optimize for separation in the 2D projection, so we should expect at least some correspondence between the k-means results and the layout in the UMAP plot (except perhaps in some pathological edge cases).
You can actually create a graph by using k-nearest-neighbor similarities as edge weights, then run graph clustering on it. Any community detection algorithm works, but Louvain is one of the saner ones; clique percolation, Girvan-Newman etc. all have known problems.
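A rough sketch of that graph route, numpy-only: build a kNN similarity graph, then run community detection on it. Weighted label propagation stands in for Louvain here (Louvain proper optimizes modularity; you'd normally use `networkx.community.louvain_communities` or `python-louvain` rather than roll your own).

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric kNN adjacency; edge weights = 1 / (1 + distance)."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    np.fill_diagonal(D, np.inf)
    W = np.zeros_like(D)
    for i, nbrs in enumerate(np.argsort(D, axis=1)[:, :k]):
        for j in nbrs:
            w = 1.0 / (1.0 + D[i, j])
            W[i, j] = W[j, i] = w
    return W

def label_propagation(W, iters=30):
    """Toy community detection (stand-in for Louvain): each node adopts
    the label with the largest total edge weight among its neighbors."""
    labels = np.arange(len(W))
    for _ in range(iters):
        for i in range(len(W)):
            nbrs = np.nonzero(W[i])[0]
            if len(nbrs) == 0:
                continue
            totals = {}
            for j in nbrs:
                totals[labels[j]] = totals.get(labels[j], 0.0) + W[i, j]
            # Ties broken deterministically by smallest label.
            labels[i] = max(sorted(totals), key=totals.get)
    return labels

rng = np.random.default_rng(0)
# Two well-separated blobs in 8-D: the kNN graph has no cross-blob edges.
X = np.vstack([rng.normal(c, 0.3, size=(40, 8)) for c in (-3.0, 3.0)])
labels = label_propagation(knn_graph(X, k=5))
```

A nice property of the graph formulation is that k (the neighborhood size) replaces the number-of-clusters parameter: community structure falls out of the graph rather than being specified up front.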
All these dimension reduction methods are extremely similar. The math essentially just preserves nearest neighbors, with a setting for how 'tight' you want the clusters to be.
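That "preserves nearest neighbors" claim is easy to check numerically: plant low-dimensional structure in a high-dimensional space, reduce, and measure how many of each point's nearest neighbors survive. PCA is used below purely to keep the sketch dependency-free; UMAP targets the same neighbor sets, just via a nonlinear optimization.

```python
import numpy as np

rng = np.random.default_rng(42)
# 3-D data rotated isometrically into 32-D: intrinsically 3-dimensional.
Z = rng.normal(size=(100, 3))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal matrix
X = Z @ Q[:3]                                   # 100 points in 32-D

def knn_sets(X, k=10):
    """The k nearest neighbors of each point, as index sets."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    np.fill_diagonal(D, np.inf)
    return [set(np.argsort(d)[:k]) for d in D]

def reduce_dim(X, k):
    """PCA projection onto the top-k components."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

before = knn_sets(X)
after = knn_sets(reduce_dim(X, 3))
# Fraction of 10-NN preserved by the reduction; ~1.0 here because the
# 32-D data really does live on a 3-D subspace.
overlap = np.mean([len(a & b) / 10 for a, b in zip(before, after)])
```

When the target dimension drops below the intrinsic dimension (e.g. 1024-D embeddings down to 2), that overlap necessarily degrades, which is the whole trade-off being debated above.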
Check out this image [1] and accompanying paper [2] for further reference