My work in statistical learning is always about multicollinearity, variable cardinality, and model selection. Understanding the topology of the data space is critical, and day-to-day concerns of data cleaning had made me lose sight of that. In that respect, this article was a fantastic reminder.
I recently saw a talk on this, and I didn't understand the barcode charts then either. For example, I understand that the zeroth order Betti number gives the number of connected components. So, why does the barcode chart show multiple numbers for each radius value? Shouldn't it look more like a plot of some non-increasing function where, for each radius, there is a single zeroth order Betti number on the y-axis (since the number of connected components is non-increasing as a function of the radius)?
The barcode is not usually showing the zeroth Betti numbers. It's usually showing the first Betti number, which intuitively counts the 1-dimensional holes in the space. For each radius, there could be lots of those (imagine several circles that all have a single point in common, then the first Betti number will be equal to the number of circles). At a given radius, there will be one bar over it for each 1-dimensional hole. If a bar is really long, i.e. the same hole exists for many radii, then we assume that it must represent actual structure in the data, rather than just noise.
You could do a barcode for the holes of any fixed dimension, but as you point out, the 0-th dimension case is relatively uninteresting, and as you get to higher dimensions, it's harder to visualize and interpret what is going on. So dimension 1 is most common.
So, let me try to understand: Let's take the order 1 case. If I pick a value on the y-axis and hold it constant, this refers to a particular "1D hole". As I move left to right, increasing the radius, the graph is colored black if this particular hole is present for the given radius and left uncolored otherwise. Is it not misleading, then, to label the y-axis as the Betti number, since that is a single, global number?
> the 0-th dimension case is relatively uninteresting
It was explained to me that the zeroth order Betti numbers have applications for clustering.
Your description is correct. And calling the y-axis the Betti number is a bit misleading. If you are looking at a barcode for 1-dimensional holes, then at a given radius, the number of bars over that radius is the first Betti number (at that radius). So the Betti number counts the number of holes, but the barcode graph is keeping track of each hole's "lifetime" as the radius changes.
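To make the "count the bars over a radius" idea concrete, here is a minimal Python sketch. The bar values are invented purely for illustration, and `betti_at` and `persistence` are hypothetical helper names, not functions from any library. Each bar is a (birth, death) pair, and I'm assuming half-open intervals, which is one common convention.

```python
# Hypothetical barcode for 1-dimensional holes: each bar is (birth, death).
# These values are made up for illustration; they do not come from real data.
bars = [(0.2, 0.3), (0.25, 1.4), (0.5, 0.6), (0.55, 1.1)]

def betti_at(bars, r):
    """Count the bars alive at radius r, treating each bar as [birth, death)."""
    return sum(1 for birth, death in bars if birth <= r < death)

def persistence(bar):
    """Length of a bar: how long the hole survives as the radius grows."""
    birth, death = bar
    return death - birth

# The Betti number is a single number per radius, read off vertically.
for r in (0.1, 0.28, 0.58, 1.2):
    print(r, betti_at(bars, r))

# Long bars are the ones usually read as real structure rather than noise.
# The 0.5 cutoff is an arbitrary choice for this example.
long_bars = [b for b in bars if persistence(b) > 0.5]
print(long_bars)
```

So the y-axis is really just an index over holes; the Betti number only appears when you slice the chart at one radius and count.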
> It was explained to me that the zeroth order Betti numbers have applications for clustering.
That is correct, so perhaps "uninteresting" was too strong :) The 0-th Betti number counts the number of connected components of the space. So if we are at radius, say, 1 and the 0-th Betti number is 3, then we know the data points can be put into 3 "dense" clusters. By dense, I mean that for every two data points A and B in the cluster, there is a sequence of data points that you could step on going from A to B where each step has distance at most 1. I don't know if that explanation made any sense.
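If it helps, the "stepping stones" description can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the points are made up, `betti0` is a hypothetical name, and I link two points whenever their distance is at most `max_step` (how "radius" maps to step length varies between presentations). It uses a small union-find to merge linked points and then counts the groups.

```python
from itertools import combinations
from math import dist

def betti0(points, max_step):
    """Number of connected components when points within max_step are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair that is one "step" apart.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= max_step:
            parent[find(i)] = find(j)

    # Each distinct root is one connected component.
    return len({find(i) for i in range(len(points))})

# Made-up data: a chain of three points, a close pair, and one outlier.
points = [(0, 0), (0.8, 0), (1.6, 0), (5, 5), (5.5, 5), (10, 0)]

for step in (0.6, 1.0, 20.0):
    print(step, betti0(points, step))
```

Running it over increasing step sizes also shows the non-increasing behavior asked about earlier in the thread: components can only merge as the radius grows, never split.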
Also very cool is the recognition of branching in the data by the computation of a persistent Borel-Moore homology. This is the method that was used in their cancer study.
I second Hatcher. His AT book is a go-to for me. He also has introductory notes on point-set topology on his site, which are appropriate for someone without a math background.
Recently I've been reading 'Introduction to Topology' by Mendelson. It's very good so far, although I'm skipping the exercises on the first reading (first reading to get the concepts, second reading to solidify them).
It's actually available free from archive.org[0], which is a huge plus.