I haven't tried that. I am looking for comments attached to usernames, attached to articles or other comments in a tree, along with the subreddit they are contained in. The goal is to define each user as an agent with a probabilistic membership in each of the interest groups (subpopulations) in the data set, as defined by a mixture model.
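To make "probabilistic membership" concrete, here is a rough sketch of one way it could be computed, assuming a mixture of multinomials over subreddits. The model choice, the function name `membership`, and the toy numbers are all my own assumptions for illustration, not something I have settled on or fitted.

```python
import math


def membership(counts, weights, components):
    """Posterior P(group k | user) under an assumed mixture of multinomials.

    counts:     {subreddit: number of comments by this user}
    weights:    mixture weight pi_k for each group
    components: per group, {subreddit: probability theta_k[subreddit]}
                (every probability is assumed to be non-zero)
    """
    log_post = []
    for pi_k, theta_k in zip(weights, components):
        # log of pi_k * prod_s theta_k[s] ** n_s
        lp = math.log(pi_k)
        for subreddit, n in counts.items():
            lp += n * math.log(theta_k[subreddit])
        log_post.append(lp)
    # Subtract the max before exponentiating to avoid underflow.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy example with two hypothetical interest groups.
weights = [0.5, 0.5]
components = [
    {"python": 0.8, "cooking": 0.2},
    {"python": 0.2, "cooking": 0.8},
]
probs = membership({"python": 3, "cooking": 1}, weights, components)
```

In practice the weights and components would have to be estimated from the scraped data (for example with EM); the sketch only shows the membership step for a single user.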
My pipeline right now looks like:
list of most popular SFW subreddits of all time -> GNU parallel -> curl on comment listings of top articles within subreddit -> JSON processing with jq -> gzip -> store on filesystem
So at the end I should have a directory of 5000 subreddits, each containing a gzipped JSON file of the comment threads of its top 500 posts.
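For anyone who would rather do the JSON and gzip steps in one script, here is a minimal Python sketch of that part of the pipeline. It assumes the fetched comment listing is already parsed and has the usual Reddit shape (a Listing whose children are `t1` comments with `id`, `parent_id`, `author` and a nested `replies` Listing). The function names, the `comments.json.gz` file name and the one-record-per-line layout are hypothetical choices, not what my jq step actually does.

```python
import gzip
import json
from pathlib import Path


def flatten_comments(listing, subreddit):
    """Yield one flat record per comment in a Reddit-style comment Listing."""
    stack = list(listing.get("data", {}).get("children", []))
    while stack:
        child = stack.pop()
        if child.get("kind") != "t1":
            # Skip "more" stubs and anything else that is not a comment.
            continue
        data = child["data"]
        yield {
            "subreddit": subreddit,
            "id": data["id"],
            "parent_id": data["parent_id"],
            "author": data["author"],
        }
        replies = data.get("replies")
        # A comment without replies carries an empty string, not a Listing.
        if isinstance(replies, dict):
            stack.extend(replies["data"]["children"])


def write_subreddit(out_dir, subreddit, listings):
    """Write all comment records for one subreddit as gzipped JSON lines."""
    path = Path(out_dir) / subreddit / "comments.json.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for listing in listings:
            for record in flatten_comments(listing, subreddit):
                fh.write(json.dumps(record) + "\n")
    return path
```

Keeping `parent_id` on every record means the comment tree can be rebuilt later, while the flat layout makes it easy to group comments by author.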
I'm new to this kind of data processing, but this is the easiest way I have found to do a short-term, one-time scrape of this data.