I have a pet theory that there are two forms of the web: the document web and the application web. SPAs have some very attractive properties for the application web but complicate/break the document web.
That being said, with sites like HN, Reddit, LinkedIn, Twitter, news outlets, etc. the lines between “document” and “application” get blurred. In some ways they’ve built a micro-application that hosts documents. Content can be submitted by users in the browser, and “engaged with” in the browser too. Some handle this blurring better than others. HN is, IMO, an example of getting it right: nearly everything that should be addressable (like comments) can be linked to. Others, not so much.
For application websites like the ones you listed, you'd typically end up building a special integration that crawls their API or data dumps. This is also true for GitHub, Stack Overflow, and even document-y websites like Wikipedia.
It's simply not feasible to treat them as any other website if you wanna index their data.
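To make that concrete, here's a rough sketch of what the HN side of such an integration might look like, going through the public Firebase-hosted HN API rather than scraping HTML. The `Document` type and the helper names are made up for illustration, and a real crawler would also need the actual HTTP fetch, rate limiting, and a walk over the `kids` tree:

```python
import json
from dataclasses import dataclass

# Base URL of the public HN API; each item is served as JSON at /item/<id>.json
HN_API = "https://hacker-news.firebaseio.com/v0"


@dataclass
class Document:
    """Hypothetical shape an indexer might store; not part of any real API."""
    url: str     # human-facing permalink for the item
    author: str
    text: str    # HN returns comment text as HTML


def item_api_url(item_id: int) -> str:
    # Where the crawler would fetch the item from
    return f"{HN_API}/item/{item_id}.json"


def item_permalink(item_id: int) -> str:
    # Every story and comment is addressable by its id
    return f"https://news.ycombinator.com/item?id={item_id}"


def to_document(raw_json: str) -> Document:
    # Map one API payload to an indexable document; deleted items
    # may lack "by" and "text", so fall back to empty strings
    item = json.loads(raw_json)
    return Document(
        url=item_permalink(item["id"]),
        author=item.get("by", ""),
        text=item.get("text", ""),
    )
```

The point is that the integration is per-site: the payload shape and the permalink scheme here are specific to HN, and you'd write a different adapter for each of the others.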
(As an aside, I love Marginalia!)