
So it really depends on what you use for clustering. In this case, I'm clustering on the original embeddings, so the cluster assignments don't necessarily line up with the UMAP layout. I've also seen two other approaches:

1- Clustering on the 2-D UMAP projection. Here the plot would show clean separation of topics, but the clustering algorithm would be working on highly compressed data (from the 1024 dimensions of the embedding down to the 2 of UMAP).

2- BERTopic's approach of doing UMAP down to 5 dimensions, clustering in that space, then running UMAP again from 5 to 2 for plotting, which is an interesting approach (sketched below).

I've heard of people getting good results with all three. It's kinda hard to compare objectively, but my leaning was to give the clustering algorithm the representation containing the most information about the text.
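A minimal sketch of the BERTopic-style pipeline (approach 2), assuming the umap-learn and scikit-learn packages; the embedding matrix is a random placeholder, the cluster count and metric are arbitrary, and note that BERTopic itself defaults to HDBSCAN rather than k-means:

    import numpy as np
    import umap  # umap-learn
    from sklearn.cluster import KMeans

    # Placeholder embeddings: (n_docs, 1024), matching the setup above.
    embeddings = np.random.rand(1000, 1024).astype(np.float32)

    # Stage 1: reduce to 5 dimensions and cluster in that space.
    X5 = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)
    labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X5)

    # Stage 2: reduce again from 5 down to 2, purely for plotting.
    X2 = umap.UMAP(n_components=2, random_state=42).fit_transform(X5)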



Right, BERTopic's two-stage UMAP is interesting. I've also seen people combine that with Louvain instead of k-means.

My intuition was: UMAP itself tries to optimize for 2-D separation in the projection. So we should expect at least some correspondence between the k-means results and the layout in the UMAP plot (except in some pathological edge cases, perhaps).

Nevertheless, nice example and blog post!


TIL Louvain clustering! I've seen it used for graphs. Can it also be used for vectors/points?

Thank you!


You're welcome!

You can actually create a graph from the vectors by using k-nearest-neighbor similarities as edge weights. Then you do graph clustering on it (using any algorithm, but Louvain is one of the saner ones; clique percolation, Girvan-Newman, etc. all have known problems).
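A minimal sketch of that idea, assuming scikit-learn for the k-NN graph and a recent networkx for Louvain; the neighbor count and the distance-to-similarity conversion are arbitrary choices, not anything canonical:

    import numpy as np
    import networkx as nx
    from sklearn.neighbors import kneighbors_graph

    embeddings = np.random.rand(500, 64)  # placeholder vectors

    # k-NN graph with distances on the edges, then convert each
    # distance to a similarity (1 / (1 + d) is one arbitrary choice).
    knn = kneighbors_graph(embeddings, n_neighbors=15, mode="distance")
    knn.data = 1.0 / (1.0 + knn.data)

    G = nx.from_scipy_sparse_array(knn)  # edge attribute "weight" = similarity

    # Louvain community detection on the weighted graph.
    communities = nx.community.louvain_communities(G, weight="weight", seed=0)
    labels = np.empty(len(embeddings), dtype=int)
    for c, nodes in enumerate(communities):
        labels[list(nodes)] = c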


Try t-SNE. I used to scoff at cluster plots until I saw them redone with t-SNE… wow, those clusters are actually separated!


Are you sure t-SNE and UMAP actually perform very differently? Last I looked, they were somewhat comparable.

[edit]: Seems they are similar for some purposes: https://blog.bioturing.com/2022/01/14/umap-vs-t-sne-single-c...

Also interesting: RAPIDS has a CUDA-accelerated version of UMAP that is very fast (HDBSCAN as well, BTW).
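For reference, a sketch of the cuML versions, assuming RAPIDS is installed and a CUDA GPU is available; the APIs are designed to mirror umap-learn and the hdbscan package, and the sizes here are placeholders:

    import numpy as np
    from cuml.manifold import UMAP
    from cuml.cluster import HDBSCAN

    embeddings = np.random.rand(100_000, 1024).astype(np.float32)  # placeholder

    X2 = UMAP(n_components=2).fit_transform(embeddings)            # GPU UMAP
    labels = HDBSCAN(min_cluster_size=25).fit_predict(embeddings)  # GPU HDBSCAN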


All these dimension reduction methods are extremely similar. The math essentially just preserves nearest neighbors, with a setting for how 'tight' you want the clusters to be.
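To make the knobs concrete, a sketch assuming scikit-learn and umap-learn; the values shown are just the usual defaults, not recommendations:

    import numpy as np
    import umap
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 64)  # placeholder data

    # Both expose a "how many neighbors matter" setting
    # (perplexity vs. n_neighbors)...
    X_tsne = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
    # ...and UMAP adds min_dist, which directly controls how
    # tightly points are packed within a cluster in the 2-D layout.
    X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)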

Check out this image [1] and the accompanying paper [2] for further reference.

[1] https://www.semanticscholar.org/paper/A-Unifying-Perspective...

[2] https://arxiv.org/abs/2007.08902



