I'm working on a project relevant to this. Does anyone know if the author has shared this data set anywhere? Or does anyone know of any data sets that could be used for developing mixture models to classify users into interest groups (like photographers, programmers etc)?
If I'm remembering correctly, the raw data was given under an NDA/DND as a one-time-only deal. There was a subreddit associated with the data and collection, but it's since been banned.
I have been working on a small shell script to get the top 500 posts of the top 5000 subreddits. The big issue is that the API call limit is 60 req/min for OAuth2 tokens. It would save a lot of time if someone made a torrent of a data set like that. I will probably post the data I collect somewhere as well, but it would be easier if reddit just hosted a data set like the one I described. They could probably post one on AWS public data sets: https://aws.amazon.com/datasets/
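For anyone hitting the same ceiling, here is a minimal sketch of client-side throttling that keeps a scraper under 60 requests per minute. The class name `MinIntervalThrottle` and the injectable `clock`/`sleep` arguments are illustrative assumptions, not part of any reddit client library.

```python
import time


class MinIntervalThrottle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock    # injectable so the throttle can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Call `wait()` immediately before each request. Assuming one request per comment thread, 500 posts across 5000 subreddits is 2.5 million requests, which at 60 req/min works out to roughly 29 days of continuous fetching per token.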
I'm a little new to this, but my understanding is that that would be the same as doing a GET request to the REST API, which would have the same penalty for going over 30 req/s without an OAuth token.
For example, this is what my get request to the REST endpoint looks like without URL parameters:
GET reddit.com/r/explainlikeimfive/comments/2w1xzo.json
Which returns the same response as appending .json to the URL:
I haven't tried that. I am looking for comments attached to usernames, attached in turn to articles or other comments in a tree, along with the subreddit they are contained within. The goal is to define each user as an agent with a probabilistic membership in each of the interest groups (subpopulations) in the data set, as defined by a mixture model.
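A rough sketch of flattening one comment tree into (author, subreddit) rows that a mixture model could consume. It assumes the usual shape of reddit's listing JSON (`kind` of `"t1"` for comments, nested `replies` that are either an empty string or another Listing). The function names are made up for illustration.

```python
from collections import Counter


def iter_comments(listing):
    """Yield (author, subreddit, comment_id, parent_id) for every comment in a Listing."""
    stack = [listing]
    while stack:
        node = stack.pop()
        if not isinstance(node, dict):
            continue  # leaf comments carry replies == "" rather than a Listing
        for child in node.get("data", {}).get("children", []):
            if child.get("kind") != "t1":
                continue  # skip "more" placeholders and anything that is not a comment
            data = child["data"]
            yield (data.get("author"), data.get("subreddit"),
                   data.get("id"), data.get("parent_id"))
            stack.append(data.get("replies"))


def membership_counts(rows):
    """Count comments per (author, subreddit) pair as raw input for a mixture model."""
    return Counter((author, subreddit) for author, subreddit, _, _ in rows)
```

The traversal uses an explicit stack rather than recursion so that very deep threads do not hit Python's recursion limit. The rows come out in no particular order.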
My pipeline right now looks like:
list of most popular SFW subreddits of all time -> GNU parallel -> curl on comment listings of top articles within subreddit -> JSON processing with jq -> gzip -> store on filesystem
So at the end I should have a directory of 5000 subreddits containing a gzipped json file of the comment threads of their top 500 posts.
I'm new to this kind of data processing but this is the easiest way I have found to do a short term, one time scrape of this data.
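The last two stages of that pipeline (gzip, then store on the filesystem) could look something like this in Python. This is a sketch only: the one-directory-per-subreddit layout follows the description above, but the file name `top_threads.json.gz` and both function names are assumptions.

```python
import gzip
import json
from pathlib import Path


def store_threads(root, subreddit, threads):
    """Write one gzipped JSON file of comment threads under root/<subreddit>/."""
    out_dir = Path(root) / subreddit
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "top_threads.json.gz"  # hypothetical file name
    with gzip.open(out_path, "wt", encoding="utf-8") as fh:
        json.dump(threads, fh)
    return out_path


def load_threads(path):
    """Read back a file written by store_threads."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping one file per subreddit means a later analysis pass can be parallelized per directory, much like the fetch step.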
The admins released a bunch of anonymized voting data once. And several people have distributed datasets of scraped comments. Sorry I don't have links handy. Check through /r/redditdev.
Though, technology-wise, it is one use-case where SVG beats pixel graphics, both in terms of usability and interface (whether it is custom D3.js or something graph-oriented such as http://sigmajs.org/).
Fun fact: We did this exact analysis at reddit many years ago, and used it to figure out which subreddits were related to each other. We never got around to productizing it, unfortunately, but the idea was to use it to suggest new reddits to you.
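The comment doesn't say how relatedness was actually computed. One common approach, assumed here purely for illustration, is to score two subreddits by the Jaccard overlap of their commenter sets. The names `jaccard` and `most_related` are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_related(target, users_by_subreddit, top_n=3):
    """Rank the other subreddits by commenter overlap with `target`."""
    scores = [
        (jaccard(users_by_subreddit[target], users), name)
        for name, users in users_by_subreddit.items()
        if name != target
    ]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by name
    return [(name, score) for score, name in scores[:top_n]]
```

A raw overlap count would always favor the huge default subreddits. Dividing by the union is one simple way to stop them dominating every list.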
All the reddit subreddit recommenders I've seen produce garbage recommendations. Outside of a handful of popular, general subreddits which everyone already knows about, everything is niche special interest stuff that you need to find on your own.
Subreddit specificity is so messily complex that it would be very difficult to do any recommendations based on your own subscriptions. Without reddit's cooperation in categorization (unlikely) it's probably not going to happen.
As others already said below, the feedback loop was dangerous.
Also it took a lot of resources to calculate and we just didn't have the time to build efficient map/reduce jobs to do it regularly. It was done by hand in Mathematica.
The very difficult trick is to find a virtuous feedback loop. Positive feedback loops might appear to be virtuous, but without balancing mechanisms they encourage viral dynamics (when did the virus become our role model for success?), groupthink and gaming.
Google's PageRank is an example of an incestuous positive feedback loop. Web pages with many inbound links get ranked higher, and pages that rank higher get more inbound links (because that's how many people find things they link to).
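To make the "more inbound links, higher rank" half of that loop concrete, here is a textbook power-iteration sketch of PageRank. It is the commonly published formulation with a damping factor of 0.85, not a claim about what Google runs. The function name and the toy link graph in the usage note are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank; `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outbound links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With `links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}`, the page with three inbound links ends up with the largest score. The algorithm itself only covers the first half of the loop. The second half (people discovering and then linking to whatever already ranks well) happens outside it, in the behavior of the people writing the links.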
I find these network visualizations nice to look at, but not all that insightful. They're generally hard to read and track relations outside of the main clusters. Am I missing something?
No, I don't think so. A lot of people call these things "hairballs," and probably a more useful interface would be some kind of faceted browser that allows you to do pivots and look at aggregate stats of the various lenses you can put on top of a graph. Additionally, measurements such as node separation, "betweenness," or perhaps even looking at common chain patterns are probably more useful ways of trying to dissect graph structures.
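As a small illustration of the simplest of those measurements, node separation can be computed with a breadth-first search over an adjacency list. This is a sketch on a toy graph. The function name and node labels are made up, and betweenness would need a fuller algorithm than this.

```python
from collections import deque


def separation(graph, source, target):
    """Return the number of hops on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy undirected graph stored as an adjacency list (each edge listed both ways).
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b"],
    "e": [],
}
```

A table of pairwise separations, or a histogram of them, is often easier to read than the drawn hairball because it does not depend on the layout.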
What interests reddit? Casual racism and misogyny. Also cats.
Seriously though, it's interesting how interconnected some things can be in this view. I'm not sure what sense I can make of those interconnections, though. Mousing around, while being a very frequent redditor (so my own neural network is making connections based on experience), I can kinda infer order out of things like the "government->state" topics connected to "force" and "property" among others (hints at the libertarian-leaning general population), and the "women" topic connecting to a whole host of stuff...the cyan colored section off to the top right might even kinda hint at the casual misogyny thing (which was a "ha ha only serious" kind of joke), with words like "bullshit", "logic", "proof", "assumption", "reasoning", and "evidence", being connected to "women" but not to "men".
But, without having spent years on reddit, and without my particular flavor of reddit (the subs I'm subscribed to), maybe I'd interpret the data very differently. I never quite know how to interpret network graphs like this, honestly, except for things that actually are networks. For example, a computer network topology on a graph shows useful data: the hops from one machine to the next. When connecting up one word to the next, it seems difficult to draw meaningful conclusions. Take my interpretation of "government->state->property" as a hint at the libertarian leanings of many subreddits, or of "women->reasoning->evidence" as a hint of many redditors' belief that women are illogical liars (which is the impression many of my female friends have of reddit, in general, particularly when topics like date rape or the "friend zone" come up). Is that actually the context in which these connections are made? I wouldn't really know how to check. It'd be cool to be able to drill down to conversations in which the connections were made, but presenting that in a coherent UI seems challenging.
I agree. It would be nice to drill down into the data in order to further analyze everything. I also feel there was a particular sort of censorship in the dataset, an indicator of which was the explicit racist and misogynistic words that were absent. There was a large lack of swears and bad terms in this analysis (bitch being a particularly obvious cut) and I, for one, see examples of these slurs prevalently used by young men far too often on the site.
Perhaps the data was tailored when it was provided to the analyst, or it was censored after reception, but this felt too "PG-13" for an analysis of reddit's "interests".
Porn is a big part of reddit, as well, and some of it (that found in the gonewild subs) is even among the more ethical sources of free porn on the web. But, as another commenter noted, porn seems to be absent from this network graph.
I get the feeling that Conde Nast may not like this type of approach when they're not directly profiting from it. A study of the language between the SFW and NSFW type tags might be pretty interesting, or, well, not very pleasant. I did participate in a couple music communities for a while, but there's something in the stew over there that I'm glad I closed my account and never looked back. YMMV.
Reddit is an independent entity, not a subsidiary of Condé Nast (like it used to be) and not a subsidiary of Advance Publications (like it used to be).
It is an independent corporation, with its own board of directors, and control of its own finances.
Just being a majority stakeholder doesn't mean you control the company either. There are a lot of details like share types and company by-laws that determine that.
You're confusing a subsidiary with a division or at least a wholly-owned subsidiary.
A subsidiary is a company whose majority stakeholder is another company, which is exactly what reddit is. If reddit weren't a distinct corporate entity, they'd be a division.
Now, since AP doesn't have 100%, reddit isn't a wholly-owned subsidiary, but a company doesn't have to be wholly-owned to be a subsidiary at all.
Thanks, I see in your other clarifications that AP is the majority shareholder - though there shouldn't be an assumption that they affect reddit's independence.
They were spun off as an independent company, but Advance Publications still has a large amount of equity in Reddit. I think it's because Reddit wanted to try a bunch of things that were too risky for AP.
I know they spun off Condé Nast, but according to that they're still owned by Advance Publications. I'm looking for info about them being completely separate.
No, it's not safe to say that at all. Condé Nast is a subsidiary of Advance Publications, one of MANY (AP is huge). Advance Publications is the company that owns the majority share of Reddit, Inc. It's unlikely a small subsidiary of AP has much influence on one of AP's investments.
We don't even know how much influence AP has, since owning a majority share doesn't mean they have a lot of control. Also Reddit, Inc. just went through a round of investments, so those investors likely have a lot of influence too.
Not Conde Nast, Conde Nast's owners, Advance Publications, or when you really get down to it, the Newhouse family.
And the grandparent saying that Advance Publications is "only" a majority shareholder is a little deceptive. The shareholders are Advance Publications, current and former employees (as part of an ESOP) and a small residual ownership of angels in the original company.
While it is true that a majority owner can't just do whatever it wants, the rules protect the financial interests of minority shareholders, mostly in the context of takeovers, not the editorial independence of employees. If Si and Donald decided they really didn't like the NSFW part of reddit I think they could get rid of it.
Did they not just raise many millions of dollars in a new investment round? Did those new investors not receive a percentage of equity in Reddit Inc. as it is organized today?
You're right, mea culpa. It was a $50M investment on a reported $500M valuation. So 10% to the new investors (with a possibly defunct plan to give 10% of that, i.e. 1%, to the site's users), and 90% split between Advance Publications, ESOP, and the legacy angels (reported at less than 1% of the pre-investment total).
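The arithmetic behind those percentages, as a quick sketch. It assumes the reported $500M is a post-money valuation; the function and key names are made up for illustration.

```python
def round_split(investment, post_money_valuation, user_fraction_of_round=0.10):
    """Split ownership after a round, treating the valuation as post-money."""
    new_investors = investment / post_money_valuation
    return {
        "new_investors": new_investors,
        # The reported plan: a tenth of the new investors' stake goes to users.
        "users_carve_out": new_investors * user_fraction_of_round,
        # Everyone who held equity before the round (AP, ESOP, legacy angels).
        "existing_holders": 1.0 - new_investors,
    }


shares = round_split(50_000_000, 500_000_000)
for holder, fraction in shares.items():
    print(f"{holder}: {fraction:.1%}")
```

If the $500M were instead a pre-money figure, the new investors' stake would be 50/550, or about 9.1%, and the other numbers would shift accordingly.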
As for the ESOP percentage, all I've found is a reference in Forbes that describes it as a "sizable minority".
Small point: Is it really considered "scraping" ("I scraped approximately 84 million comments") if you used a Python library that uses the Reddit API, not the actual site directly?
Maybe I'm just stupid, but I don't see anything thought-provoking.
In fact I feel that a better way to see what redditors are interested in would be to just find (there may even be stats on reddit on this) the ~50 most active subreddits.
If the 50 most active only constituted 10% of all reddit activity, though, they wouldn't necessarily paint an accurate picture of all of reddit.
Maybe nothing could paint that picture, but if there are themes prevalent independent of the subreddit topics themselves, this kind of analysis could shed light on them.