
I haven't tried that. I'm looking for comments tied to usernames, arranged in a tree under the articles or other comments they reply to, along with the subreddit that contains them. The goal is to model each user as an agent with a probabilistic membership in each of the interest groups (subpopulations) in the data set, as defined by a mixture model.
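To make "probabilistic membership" concrete, here is a minimal sketch assuming a mixture of multinomials over subreddits. That particular model, the function name `membership`, and the toy numbers are my assumptions for illustration; nothing here is fitted to real data. Given mixing weights and per-group subreddit probabilities, Bayes' rule gives the posterior probability that a user belongs to each group:

```python
import math


def membership(counts, weights, group_probs):
    """Posterior P(group | user's per-subreddit comment counts).

    counts:      {subreddit: number of comments by this user}
    weights:     prior probability of each group (sums to 1)
    group_probs: one {subreddit: probability} dict per group; every
                 subreddit in `counts` must have a nonzero probability
    """
    log_post = []
    for weight, probs in zip(weights, group_probs):
        # log prior + multinomial log-likelihood (the multinomial
        # coefficient is the same for every group, so it cancels)
        total = math.log(weight)
        for subreddit, n in counts.items():
            total += n * math.log(probs[subreddit])
        log_post.append(total)

    # Normalise in log space to avoid underflow on long comment histories.
    peak = max(log_post)
    unnorm = [math.exp(value - peak) for value in log_post]
    norm = sum(unnorm)
    return [value / norm for value in unnorm]


# Toy example with two hypothetical interest groups.
weights = [0.5, 0.5]
group_probs = [
    {"python": 0.8, "cooking": 0.2},
    {"python": 0.2, "cooking": 0.8},
]
user_counts = {"python": 3, "cooking": 1}
posterior = membership(user_counts, weights, group_probs)
```

In practice the weights and group distributions would have to be estimated from the scraped comments (for example with EM), which this sketch does not attempt.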

My pipeline right now looks like:

list of the most popular SFW subreddits of all time -> GNU parallel -> curl on the comment listings of each subreddit's top articles -> JSON processing with jq -> gzip -> store on filesystem

So at the end I should have a directory of 5000 subreddits, each containing a gzipped JSON file of the comment threads of its top 500 posts.
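For anyone who would rather do the fetch, flatten, and gzip steps in one script, here is a rough Python equivalent of the curl/jq/gzip stages. It is a sketch, not what I actually run: the function names, the `USER_AGENT` string, and the output layout are made up, and it assumes Reddit's public `.json` comment endpoint returns a two-element list (the post listing, then the comment listing) with comments as `t1` nodes.

```python
import gzip
import json
import os
import urllib.request

# Hypothetical value; use a descriptive User-Agent of your own.
USER_AGENT = "comment-scrape-sketch/0.1"


def fetch_json(url):
    """Rough stand-in for the curl step: GET a URL and parse the JSON body."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def flatten_comments(listing, subreddit, parent=None):
    """Rough stand-in for the jq step: walk a comment tree and yield one
    flat record per comment, keeping author, parent, and subreddit."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # skip "more" stubs and non-comments
            continue
        data = child["data"]
        yield {
            "subreddit": subreddit,
            "id": data.get("id"),
            "parent": parent,
            "author": data.get("author"),
            "body": data.get("body"),
        }
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string when there are no replies
            yield from flatten_comments(replies, subreddit, parent=data.get("id"))


def save_gzipped(path, records):
    """Stand-in for the gzip and store steps."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(records, handle)


def scrape_post(subreddit, post_id, out_dir):
    """Fetch one post's comment thread and store it under out_dir/subreddit/."""
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    _post, comments = fetch_json(url)  # assumed shape: [post listing, comment listing]
    records = list(flatten_comments(comments, subreddit))
    save_gzipped(os.path.join(out_dir, subreddit, f"{post_id}.json.gz"), records)
    return len(records)
```

Running `scrape_post` over every (subreddit, post id) pair, with some rate limiting, would produce one gzipped file per post rather than one per subreddit; merging them is a small change if a single file per subreddit is preferred.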

I'm new to this kind of data processing, but this is the easiest way I have found to do a short-term, one-time scrape of this data.



