So it really depends on what you feed the clustering. In this case I'm clustering on the original embeddings, so the cluster assignments won't necessarily line up with the layout in the UMAP plot. I've also seen:
1- Clustering on the UMAP output. Here the plot would show clean separation of topics, but the clustering algorithm would be working on highly compressed data (from the 1024 dimensions of the embedding down to the 2 of the UMAP projection).
2- BERTopic's approach: UMAP down to 5 dimensions, clustering in that 5-dimensional space, then UMAP again from 5 down to 2 for plotting. Which is an interesting approach.
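A minimal numpy sketch of that second pipeline (reduce, cluster, reduce again for plotting). To keep it dependency-free, PCA stands in for UMAP and a tiny Lloyd's-algorithm loop stands in for whatever clusterer you'd actually use; the shapes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "embeddings": 3 well-separated topic blobs in 64 dimensions.
centers = rng.normal(size=(3, 64)) * 5
X = np.vstack([c + rng.normal(size=(50, 64)) for c in centers])

def reduce_dim(X, k):
    """PCA projection onto the top-k components (stand-in for UMAP here)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with a naive deterministic init
    (one seed point per stride; fine for this toy data)."""
    cent = X[:: len(X) // k][:k].copy()
    for _ in range(iters):
        d = ((X[:, None] - cent[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        cent = np.array([X[labels == j].mean(0) if (labels == j).any() else cent[j]
                         for j in range(k)])
    return labels

X5 = reduce_dim(X, 5)    # 64 -> 5: clustering happens here
labels = kmeans(X5, 3)   # cluster in the 5-D space
X2 = reduce_dim(X5, 2)   # 5 -> 2: plot this, colored by `labels`
```

The point of the intermediate 5-D step is exactly the trade-off discussed above: less compression than going straight to 2-D, but cheaper and denser for the clusterer than the raw embedding space.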
I've heard of people getting good results with all three. It's kinda hard to compare objectively, but my leaning was to give the clustering algorithm the representation containing the most information about the text.
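One way people do try to compare objectively is to score each candidate partition with something like a silhouette score. A from-scratch sketch (simplified silhouette, toy data; in practice you'd likely reach for sklearn's `silhouette_score`):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: (b - a) / max(a, b) per point, where
    a = mean distance to own cluster, b = lowest mean distance to another."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    scores = []
    for i, l in enumerate(labels):
        same = (labels == l)
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette undefined, skip
        a = D[i][same].mean()
        b = min(D[i][labels == m].mean() for m in set(labels) if m != l)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
# Three tight clusters along a line in 16-D.
X = np.vstack([rng.normal(c, 0.3, size=(30, 16)) for c in (-2.0, 0.0, 2.0)])
true = np.repeat([0, 1, 2], 30)
shuffled = rng.permutation(true)
# A partition matching the real structure scores far higher than a shuffled one.
```

The catch, of course, is choosing which space to evaluate in: scoring in the 2-D UMAP space flatters option 1, scoring in the original embedding space flatters clustering on raw embeddings.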
Right, BERTopic's two-stage UMAP is interesting. I've also seen people combine that with Louvain instead of k-means.
My intuition was: UMAP itself tries to optimize for separation in the 2D projection, so we should expect at least some correspondence between the k-means results and the layout in the UMAP plot (except perhaps in some pathological edge cases).
You can actually create a graph by using k-nearest-neighbor similarities as edge weights, then run graph clustering on it. Any community detection algorithm works, but Louvain is one of the saner ones; clique percolation, Girvan-Newman etc. all have known problems.
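A rough sketch of that graph route, numpy-only: build a kNN similarity graph, then run community detection on it. Weighted label propagation stands in for Louvain here (Louvain proper optimizes modularity; you'd normally use `networkx.community.louvain_communities` or `python-louvain` rather than roll your own).

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric kNN adjacency; edge weights = 1 / (1 + distance)."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    np.fill_diagonal(D, np.inf)
    W = np.zeros_like(D)
    for i, nbrs in enumerate(np.argsort(D, axis=1)[:, :k]):
        for j in nbrs:
            w = 1.0 / (1.0 + D[i, j])
            W[i, j] = W[j, i] = w
    return W

def label_propagation(W, iters=30):
    """Toy community detection (stand-in for Louvain): each node adopts
    the label with the largest total edge weight among its neighbors."""
    labels = np.arange(len(W))
    for _ in range(iters):
        for i in range(len(W)):
            nbrs = np.nonzero(W[i])[0]
            if len(nbrs) == 0:
                continue
            totals = {}
            for j in nbrs:
                totals[labels[j]] = totals.get(labels[j], 0.0) + W[i, j]
            # Ties broken deterministically by smallest label.
            labels[i] = max(sorted(totals), key=totals.get)
    return labels

rng = np.random.default_rng(0)
# Two well-separated blobs in 8-D: the kNN graph has no cross-blob edges.
X = np.vstack([rng.normal(c, 0.3, size=(40, 8)) for c in (-3.0, 3.0)])
labels = label_propagation(knn_graph(X, k=5))
```

A nice property of the graph formulation is that k (the neighborhood size) replaces the number-of-clusters parameter: community structure falls out of the graph rather than being specified up front.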
All these dimension reduction methods are extremely similar. The math essentially just preserves nearest neighbors, with a setting for how 'tight' you want the clusters to be.
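That "preserves nearest neighbors" claim is easy to check numerically: plant low-dimensional structure in a high-dimensional space, reduce, and measure how many of each point's nearest neighbors survive. PCA is used below purely to keep the sketch dependency-free; UMAP targets the same neighbor sets, just via a nonlinear optimization.

```python
import numpy as np

rng = np.random.default_rng(42)
# 3-D data rotated isometrically into 32-D: intrinsically 3-dimensional.
Z = rng.normal(size=(100, 3))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal matrix
X = Z @ Q[:3]                                   # 100 points in 32-D

def knn_sets(X, k=10):
    """The k nearest neighbors of each point, as index sets."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    np.fill_diagonal(D, np.inf)
    return [set(np.argsort(d)[:k]) for d in D]

def reduce_dim(X, k):
    """PCA projection onto the top-k components."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

before = knn_sets(X)
after = knn_sets(reduce_dim(X, 3))
# Fraction of 10-NN preserved by the reduction; ~1.0 here because the
# 32-D data really does live on a 3-D subspace.
overlap = np.mean([len(a & b) / 10 for a, b in zip(before, after)])
```

When the target dimension drops below the intrinsic dimension (e.g. 1024-D embeddings down to 2), that overlap necessarily degrades, which is the whole trade-off being debated above.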
Check out this image [1] and accompanying paper [2] for further reference