Where was this data sourced from? As far as I know, Stack Overflow does not publish internal analytics at this granularity. If these figures are real, were they leaked?


Users with over 25,000 reputation can see site analytics that include charts that look a lot like these. https://stackoverflow.com/help/privileges/site-analytics


Looking at that source, there's a pretty straightforward explanation of where the page views come from.

- Last month, Stack Overflow had ~142,575,642 visits.

- Last month, Google sent 127,896,508 visits to Stack Overflow.

- Last month, Bing sent 7,491,274 visits to Stack Overflow.

So you could say that Stack Overflow's pageview count depends mainly on:

- How often people search Google/Bing for answers.

- How often Google/Bing rank Stack Overflow high enough for people to click through.
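
To put numbers on that, here is a quick back-of-the-envelope calculation using only the three figures quoted above. The variable and function names are mine, and it assumes the Google and Bing counts are subsets of the total visit count, which the analytics page may or may not define that way.

```python
# Visit counts as quoted in the comment above (last month).
total_visits = 142_575_642
google_visits = 127_896_508
bing_visits = 7_491_274


def share(part: int, whole: int) -> float:
    """Return `part` as a fraction of `whole`."""
    return part / whole


google_share = share(google_visits, total_visits)
bing_share = share(bing_visits, total_visits)
search_share = share(google_visits + bing_visits, total_visits)

# Print each share as a percentage with one decimal place.
print(f"Google: {google_share:.1%}")
print(f"Bing:   {bing_share:.1%}")
print(f"Both:   {search_share:.1%}")
```

On those figures Google alone accounts for roughly 90% of visits and Bing for about 5%, so the two search engines together drive nearly all of the traffic.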


That must be it. It would have been nice for the original post to include this information.



We can also see that lately there are more questions than answers, which suggests either that many experts are no longer as active, or that the user base has shifted toward more beginners and fewer experts.


Besides the Archive.org data dump, there is also the Stack Exchange Data Explorer (SEDE), which hosts thousands of user-written queries[1].

For instance, this user query by Starball tracks network contributions over time[2][3].

[1]: https://data.stackexchange.com/meta.stackexchange/queries

[2]: https://data.stackexchange.com/meta.stackexchange/query/1759...

[3]: Static image if the query times out: https://i.stack.imgur.com/LYZQm.png


Stack Overflow used to release their data archives quarterly on BigQuery. The BQ datasets were last updated in Nov 2022, so they don't include the 2023 data shown in the submission.


FWIW, I now analyze the Stack Overflow dumps on Snowflake:

https://medium.com/snowflake/how-to-load-the-stack-overflow-...


Thanks for sharing, good to see alternative options popping up. My wish is that the Stack Exchange dataset could one day be provided as a streaming Parquet or Arrow table, so that underfunded grads and post-grads could more easily and selectively sample the datasets (similar to how Huggingface provides some of its datasets)[1][2].

The Huggingface repo unfortunately prefilters some of the tables/rows according to some criteria, making it less usable for the general analytical queries that the BQ or SEDE datasets enable. If anyone knows of an 'XML-streaming' solution that samples directly from the Internet Archive's data dumps, I am all ears.

[1]: https://huggingface.co/docs/datasets-server/rows

[2]: https://huggingface.co/datasets/HuggingFaceGECLM/StackExchan...
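
Not a full answer to the 'XML-streaming' wish, but the sampling half can be sketched with the Python standard library alone. This is a minimal sketch, assuming the XML has already been decompressed into a file-like object and that each record is a `<row>` element carrying its fields as attributes; fetching and unpacking the archives from the Internet Archive is not covered here. `sample_rows` is a hypothetical helper name, and the tiny `demo` document is made up for illustration.

```python
import io
import random
import xml.etree.ElementTree as ET


def sample_rows(xml_stream, k, seed=None):
    """Reservoir-sample up to k <row> attribute dicts from an XML stream.

    The stream is parsed incrementally, so memory use stays roughly
    proportional to k rather than to the size of the file.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0

    context = ET.iterparse(xml_stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(dict(elem.attrib))
        else:
            # Replace a random slot with probability k / seen (Algorithm R).
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = dict(elem.attrib)
        root.clear()  # drop rows that have already been processed

    return reservoir


# Made-up miniature document standing in for a real dump file.
demo = (
    b'<posts>'
    b'<row Id="1" PostTypeId="1"/>'
    b'<row Id="2" PostTypeId="2"/>'
    b'<row Id="3" PostTypeId="2"/>'
    b'</posts>'
)
picked = sample_rows(io.BytesIO(demo), k=2, seed=0)
print(picked)
```

Reservoir sampling fits here because the number of rows is not known up front and the file is read exactly once; any file-like object that yields the XML bytes should work in place of the `io.BytesIO` wrapper.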



