
Where will training data come from for new tech & programming languages if SO dies?



This was my question. There's a weird sort of self-cannibalism that this hints at. The LLM is only as good as it is because it's been able to train on existing SO answers. But if over time, SO content production declines, then the LLM results will be less reliable. It seems that a new equilibrium could be one in which -- for newer questions/concerns -- both SO and LLMs will be worse than they are now.


All the same places human programmers get it from before a language has many answers on SO:

- Documentation
- Open source projects using it
- Github issues
- Source code
- Blogs
- Youtube videos

The list goes on.


To add a bit more nuance, SO has a question-answer format, which maps very well onto the prompt-reply format used to train these chat applications. Most of the other sources do not, except maybe Github issues. Without this question-answer format, there'll be a need for a bigger data-labeling effort to train LLMs on new stuff, no?
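As a rough illustration of what that labeling effort might look like, here is a minimal sketch of turning a GitHub issue thread into a prompt-reply pair for fine-tuning. The field names and the heuristic for picking an "answer" (resolving comment, else most-reacted comment) are assumptions for illustration, not any real pipeline or API.

```python
# Hypothetical sketch: convert a GitHub issue thread into a prompt/reply
# training pair. Field names and the answer-picking heuristic are assumed.
import json


def issue_to_training_pair(issue: dict) -> dict | None:
    """Treat the issue text as the question and one comment as the answer."""
    prompt = f"{issue['title']}\n\n{issue['body']}".strip()

    comments = issue.get("comments", [])
    if not comments:
        return None  # no candidate answer, skip this issue

    # Prefer a comment flagged as resolving the issue; else take the
    # comment with the most reactions as a crude quality signal.
    resolving = [c for c in comments if c.get("resolved_issue")]
    best = resolving[0] if resolving else max(
        comments, key=lambda c: c.get("reactions", 0)
    )

    return {"prompt": prompt, "reply": best["text"]}


if __name__ == "__main__":
    example_issue = {
        "title": "How do I enable strict null checks?",
        "body": "The compiler accepts obviously null values. What flag am I missing?",
        "comments": [
            {"text": 'Set "strictNullChecks": true in your config.',
             "reactions": 12, "resolved_issue": True},
            {"text": "+1, same problem here.", "reactions": 2},
        ],
    }
    print(json.dumps(issue_to_training_pair(example_issue), indent=2))
```

The point of the sketch: the raw sources exist, but someone still has to decide which comment counts as the "answer", which is exactly the labeling work SO's vote-and-accept mechanism did for free.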



