If you think that "compression" somehow means "non-intelligent", consider this:
The best compression of data that is theoretically achievable (see Kolmogorov complexity) is an algorithm that approximates the process that produces the data. And which process produces texts on the internet? The activity of human brains. (I've described this a bit sloppily: we are really dealing with the probability distribution of the data, not the data itself. But the general idea still holds.)
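Here's a toy sketch of the distributional version of that claim (my own illustration, not anything from the original argument): an ideal entropy coder achieves the shortest expected code length exactly when the model it codes against matches the distribution that actually generated the data. Modeling the generating process well and compressing well are the same optimization target.

```python
import math

def expected_bits(source, model):
    """Expected bits per symbol when data drawn from `source` is
    encoded by an ideal entropy coder built for `model`.
    This is the cross-entropy H(source, model)."""
    return sum(p * -math.log2(model[s]) for s, p in source.items())

# Hypothetical source distribution over three symbols.
source = {"a": 0.7, "b": 0.2, "c": 0.1}

# Coder whose model matches the true generating process...
matched = expected_bits(source, source)  # equals the source entropy
# ...versus a coder built on the wrong model of the process.
uniform = expected_bits(source, {"a": 1/3, "b": 1/3, "c": 1/3})

print(f"matched model: {matched:.3f} bits/symbol")
print(f"wrong model:   {uniform:.3f} bits/symbol")
```

By Gibbs' inequality the matched model always wins (cross-entropy is minimized when the coding distribution equals the source distribution), so a better approximation of the process is, literally, better compression.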
Using chain-of-thought removes the constraint that the resulting algorithm must spend a fixed amount of compute per output token.