What interests Reddit? A network analysis of 84M comments by 200K users

faizshah · on Feb 18, 2015

I'm working on a project relevant to this. Does anyone know if the author has shared this data set anywhere? Or does anyone know of any data sets that could be used for developing mixture models to classify users into interest groups (like photographers, programmers etc)?

aroch · on Feb 18, 2015

If I'm remembering correctly, the raw data was given under an NDA/DND as a one-time-only deal. There was a subreddit associated with the data and collection, but it's since been banned.

minimaxir · on Feb 18, 2015

I was the one with this data. The subreddit that discussed and provided the data went private.

I have the raw data, but it's infeasible to distribute due to its size, and the original source got in trouble for making it easily accessible.

placeybordeaux · on Feb 18, 2015

Yeah the thing I most wanted out of the post is a torrent to the data.

faizshah · on Feb 18, 2015

I have been working on a small shell script to get the top 500 posts of the top 5000 subreddits. The big issue is that the API call limit is 60 req/min for OAuth2 tokens. It would save a lot of time if someone made a torrent of a data set like that. I will probably post the data I collect somewhere as well, but it would be easier if reddit just hosted a data set like the one I described. They could probably post one on AWS public data sets: https://aws.amazon.com/datasets/
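To make the 60 req/min constraint concrete, here is a minimal Python sketch of a client-side throttle plus a back-of-the-envelope time estimate. The names `Throttle` and `estimated_hours` are hypothetical, and the estimate assumes one request per comment thread (5000 subreddits times 500 posts), ignoring the listing requests needed to find those posts.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # minimum seconds between requests
        self.clock = clock                 # injectable so it can be tested
        self.sleep = sleep
        self.next_allowed = None           # earliest time the next call may run

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def estimated_hours(n_requests, per_minute=60):
    """Lower bound on wall-clock hours for n_requests at the given rate."""
    return n_requests / per_minute / 60.0
```

Under those assumptions, `estimated_hours(5000 * 500)` comes to roughly 694 hours, or about 29 days of continuous requests from a single token, which is why a hosted data set would save so much time.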

HarryN · on Feb 18, 2015

Could you not just request the .json feed whenever you like without limits? ( by appending .json to the reddit URL )

faizshah · on Feb 18, 2015

I'm a little new to this, but my understanding is that that would be the same as doing a GET request to the REST API, which would have the same penalty for going over 30 req/min without an OAuth token.

For example, this is what my get request to the REST endpoint looks like without URL parameters:

GET reddit.com/r/explainlikeimfive/comments/2w1xzo.json

Which returns the same response as appending .json to the URL:

https://www.reddit.com/r/explainlikeimfive/comments/2w1xzo/e...
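As a small illustration of the "append .json" trick, here is a sketch in Python that turns a page URL into its JSON counterpart. The helper name `to_json_url` is made up for this example; it only rewrites the URL and makes no request, so the same rate limits discussed above still apply when the result is fetched.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Return the URL of the JSON view of a reddit page."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")  # drop a trailing slash before appending
    if not path.endswith(".json"):
        path += ".json"
    # Keep any query string (e.g. ?limit=5) and fragment unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For example, `to_json_url("https://www.reddit.com/r/explainlikeimfive/comments/2w1xzo/")` gives the same `.../comments/2w1xzo.json` address as the GET request above.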

minimaxir · on Feb 18, 2015

Note that almost all subreddits only let you travel back 1,000 posts using the .json feed.

The exception is /r/all, which had infinite pagination, but the low API limits prevent much usability.
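For anyone unfamiliar with how the .json listings paginate, here is a rough Python sketch of following the `after` cursor that each listing page returns. `iter_listing` and the `fetch_page` callback are hypothetical names; `fetch_page(after, limit)` is assumed to perform the actual (rate-limited) request and return the decoded JSON of one page.

```python
def iter_listing(fetch_page, limit=100, max_items=None):
    """Yield items from a reddit-style listing by following the `after` cursor.

    fetch_page(after, limit) must return one decoded page shaped like
    {"data": {"children": [...], "after": <cursor string or None>}}.
    """
    after = None
    seen = 0
    while True:
        page = fetch_page(after, limit)
        children = page["data"]["children"]
        for child in children:
            yield child
            seen += 1
            if max_items is not None and seen >= max_items:
                return
        after = page["data"]["after"]
        if not after or not children:
            return  # the server reports no further pages
```

On most subreddits the cursor simply stops coming back after roughly 1,000 items, so the loop ends on its own; `max_items` is only there to cap a crawl of something like /r/all.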

bduerst · on Feb 18, 2015

What kind of data are you pulling? Have you tried scraping the subreddit RSS feeds?

faizshah · on Feb 18, 2015

I haven't tried that. I am looking for comments attached to usernames, attached to articles or other comments in a tree, along with the subreddit they are contained within. The goal is to define each user as an agent with a probabilistic membership in each of the interest groups (subpopulations) within the data set, as defined by a mixture model.

My pipeline right now looks like:

list of most popular SFW subreddits of all time -> gnu parallel -> curl on comment listings of top articles within subreddit -> json processing with jq -> gzip -> store on filesystem

So at the end I should have a directory of 5000 subreddits containing a gzipped json file of the comment threads of their top 500 posts.
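The last two stages of that pipeline (serialize, gzip, store per subreddit) could look roughly like this in Python. This is only a stand-in for the jq and gzip steps, not the actual shell script; the function names and the `threads.json.gz` file name are assumptions made for the example.

```python
import gzip
import json
from pathlib import Path


def store_threads(root, subreddit, threads):
    """Write one subreddit's comment threads to <root>/<subreddit>/threads.json.gz."""
    out_dir = Path(root) / subreddit
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "threads.json.gz"
    # "wt" opens the gzip stream in text mode so json.dump can write to it.
    with gzip.open(out_path, "wt", encoding="utf-8") as fh:
        json.dump(threads, fh)
    return out_path


def load_threads(path):
    """Read back a file written by store_threads."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping one compressed file per subreddit means a later analysis step can load a single community at a time instead of holding the whole scrape in memory.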

I'm new to this kind of data processing but this is the easiest way I have found to do a short term, one time scrape of this data.

stared · on Feb 20, 2015

The author released data, links are at the bottom of the article (great thanks to him!).

faizshah · on Feb 21, 2015

Thanks for the tip, I just started downloading the data.

Houshalter · on Feb 18, 2015

The admins released a bunch of anonymized voting data once. And several people have distributed datasets of scraped comments. Sorry I don't have links handy. Check through /r/redditdev.

faizshah · on Feb 18, 2015

Thanks for the tip, I found the link you're referring to:

https://github.com/umbrae/reddit-top-2.5-million

stared · on Feb 18, 2015

I like such analyses based on the network structure a lot (not long ago I made something similar for Stack Exchange - http://stared.github.io/tagoverflow/; a continuation of my older one https://github.com/stared/tag-graph-map-of-stackexchange/wik...).

Though, technology-wise, it is one use case where SVG beats pixel graphics, both in terms of usability and interface (whether it is custom D3.js or something graph-oriented such as http://sigmajs.org/).

wamatt · on Feb 18, 2015

if you set tag coloring to "% answered", an interesting pattern emerges

http://i.imgur.com/ZLmWHrq.png

responsiveness of the community in order of most to least

- oldschool hacker (c/c++,bash,perl,regex)

- web dev (jquery, javascript, html, css)

- app dev (ios, objectivec, android, java)

stared · on Feb 18, 2015

^ edit: http://stared.github.io/tagoverflow/

Link "ate" the semicolon.

jedberg · on Feb 18, 2015

Fun fact: We did this exact analysis at reddit many years ago, and used it to figure out which subreddits were related to each other. We never got around to productizing it, unfortunately, but the idea was to use it to suggest new reddits to you.

sinemetu11 · on Feb 18, 2015

I guess this might get into some special sauce territory, but was there a specific reason why this type of recommendation system was deprecated?

Houshalter · on Feb 18, 2015

All the reddit subreddit recommenders I've seen produce garbage recommendations. Outside of a handful of popular, general subreddits which everyone already knows about, everything is niche special interest stuff that you need to find on your own.

bduerst · on Feb 18, 2015

Subreddit specificity is so messily complex that it would be very difficult to do any recommendations based on your own subscriptions. Without reddit's cooperation in categorization (unlikely) it's probably not going to happen.

jedberg · on Feb 18, 2015

As others already said below, the feedback loop was dangerous.

Also, it took a lot of resources to calculate, and we just didn't have the time to build efficient map/reduce jobs to do it regularly. It was done by hand in Mathematica.

swalsh · on Feb 18, 2015

One potential downside of using an algorithm like that is the possibility of a feedback loop.

eevilspock · on Feb 18, 2015

The very difficult trick is to find a virtuous feedback loop. Positive feedback loops might appear to be virtuous, but without balancing mechanisms they encourage viral dynamics (when did the virus become our role model for success?), groupthink and gaming.

Google's PageRank is an example of an incestuous positive feedback loop. Web pages with many inbound links get ranked higher, and pages that rank higher get more inbound links (because that's how many people find things they link to).
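To illustrate the mechanism, here is a textbook power-iteration version of PageRank in Python. It is a simplified sketch, not Google's production ranking; the function name, the toy graph, and the damping factor of 0.85 (the commonly quoted default) are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outbound links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its rank equally among its outbound links.
                new[target] += damping * rank[page] / len(targets)
        rank = new
    return rank
```

Running it on a toy graph such as `{"a": ["c"], "b": ["c"], "c": ["a"], "d": ["a"]}` and then again with one extra link into `"b"` shows the first half of the loop: every new inbound link raises a page's score. The second half (higher-ranked pages attracting more links) is human behavior and is not modeled here.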

hooo · on Feb 18, 2015

I find these network visualizations nice to look at, but not all that insightful. They're generally hard to read and track relations outside of the main clusters. Am I missing something?

th0ma5 · on Feb 18, 2015

No I don't think so. A lot of people call these things "hairballs," and probably a more useful interface would be some kind of faceted browser that allows you to do pivots and look at aggregate stats of the various lenses you can put on top of a graph. Additionally, measurements such as node separation, "betweenness," or perhaps even looking at common chain patterns are probably more useful ways of trying to dissect graph structures.
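For reference, "betweenness" can be computed without any graph library. Below is a sketch of Brandes' algorithm for an unweighted, undirected graph in plain Python; the function name and the adjacency-dict input format are choices made for this example.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph.

    `graph` maps each node to its neighbours; every edge must be listed
    in both directions and every neighbour must also be a key.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}
```

A node scores high when many shortest paths between other nodes pass through it, so in a topic graph it would pick out the "bridge" words or subreddits that connect otherwise separate clusters.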

Real_S · on Feb 18, 2015

You might enjoy mapequation.org

The methods used there are awesome! Also, open source with a great website.

Chronic31 · on Feb 18, 2015

Let's say you compute how distant/similar two concepts are. Then what? You update a link on Wikipedia?

th0ma5 · on Feb 18, 2015

Yeah I don't know! :D I guess I was talking about graph processing in general: https://en.wikipedia.org/wiki/Betweenness_centrality

SwellJoe · on Feb 18, 2015

What interests reddit? Casual racism and misogyny. Also cats.

Seriously though, it's interesting how interconnected some things can be in this view. I'm not sure what sense I can make of those interconnections, though. Mousing around, while being a very frequent redditor (so my own neural network is making connections based on experience), I can kinda infer order out of things like the "government->state" topics connected to "force" and "property" among others (hints at the libertarian-leaning general population), and the "women" topic connecting to a whole host of stuff...the cyan colored section off to the top right might even kinda hint at the casual misogyny thing (which was a "ha ha only serious" kind of joke), with words like "bullshit", "logic", "proof", "assumption", "reasoning", and "evidence", being connected to "women" but not to "men".

But, without having spent years on reddit, and without my particular flavor of reddit (the subs I'm subscribed to), maybe I'd interpret the data very differently. I never quite know how to interpret network graphs like this, honestly, except for things that are actually networks, e.g. a computer network topology on a graph shows useful data...the hops from one machine to the next. When connecting up one word to the next, it seems difficult to draw meaningful conclusions. Like my interpretation of the meaning of "government->state->property" as being a hint at the libertarian leanings of many subreddits, or the connection of "women->reasoning->evidence" as being a hint of many redditors' belief that women are illogical liars (which is the impression many of my female friends have of reddit, in general, particularly when topics like date rape or the "friend zone" come up). Is that actually the context in which these connections are made? I wouldn't really know how to check. It'd be cool to be able to drill down to conversations in which the connections were made, but presenting that in a coherent UI seems challenging.

thejaredhooper · on Feb 18, 2015

I agree. It would be nice to drill down into the data in order to further analyze everything. I also feel there was a particular sort of censorship in the dataset, an indicator of which was the explicit racist and misogynistic words that were absent. There was a large lack of swears and bad terms in this analysis (bitch being a particularly obvious cut) and I, for one, see examples of these slurs prevalently used by young men far too often on the site.

Perhaps the data was tailored when it was provided to the analyst, or it was censored after reception, but this felt too "PG-13" for an analysis of reddit's "interests".

on Feb 18, 2015

[deleted]

SwellJoe · on Feb 18, 2015

No, I said exactly what I meant.

Porn is a big part of reddit, as well, and some of it (that found in the gonewild subs) is even among the more ethical sources of free porn on the web. But, as another commenter noted, porn seems to be absent from this network graph.

6stringmerc · on Feb 18, 2015

I get the feeling that Conde Nast may not like this type of approach when they're not directly profiting from it. A study of the language between the SFW and NSFW type tags might be pretty interesting, or, well, not very pleasant. I did participate in a couple of music communities for a while, but there's something in the stew over there, and I'm glad I closed my account and never looked back. YMMV.

brandonwamboldt · on Feb 18, 2015

Contrary to popular belief, Condé Nast no longer owns Reddit.

Since 2012, Reddit has operated as an independent company (Advance Publications, the parent company of Condé Nast, is a majority shareholder though).

See: http://www.redditblog.com/2013/08/reddit-myth-busters_6.html...

thieving_magpie · on Feb 18, 2015

So they don't own it, but they are the majority shareholder? That doesn't feel very independent. Maybe I'm misunderstanding.

brandonwamboldt · on Feb 18, 2015

You are indeed misunderstanding.

Reddit is an independent entity, not a subsidiary of Condé Nast (like it used to be) and not a subsidiary of Advance Publications (like it used to be).

It is an independent corporation, with its own board of directors, and control of its own finances.

Just being a majority stakeholder doesn't mean you control the company either. There are a lot of details like share types and company by-laws that determine that.

amyjess · on Feb 18, 2015

You're confusing a subsidiary with a division or at least a wholly-owned subsidiary.

A subsidiary is a company whose majority stakeholder is another company, which is exactly what reddit is. If reddit weren't a distinct corporate entity, they'd be a division.

Now, since AP doesn't have 100%, reddit isn't a wholly-owned subsidiary, but a company doesn't have to be wholly-owned to be a subsidiary at all.

thieving_magpie · on Feb 18, 2015

Thanks, I see in your other clarifications that AP is the majority shareholder - though there shouldn't be an assumption that they affect reddit's independence.

It's still something to keep in mind.

nols · on Feb 18, 2015

Since when is Reddit not a subsidiary of Advance Publications?

bduerst · on Feb 18, 2015

Since last year.

They were spun off as an independent company, but Advance Publications still has a large amount of equity in Reddit. I think it's because Reddit wanted to try a bunch of things that were too risky for AP.

nols · on Feb 18, 2015

Do you have an article or their announcement? I can't find anything and I'm interested in Reddit's business structure.

fragmede · on Feb 19, 2015

http://www.redditblog.com/2011/09/independence.html

nols · on Feb 20, 2015

I know they were spun off from Condé Nast, but according to that they're still owned by Advance Publications. I'm looking for info about them being completely separate.

bduerst · on Feb 18, 2015

Not on hand, but they're privately held, so there isn't going to be much information on the equity breakdown or other business structure details.

bhayden · on Feb 18, 2015

It is probably safe to say Condé Nast has a huge influence in who the board of directors are, and therefore controls reddit still.

brandonwamboldt · on Feb 18, 2015

No, it's not safe to say that at all. Condé Nast is a subsidiary of Advance Publications, one of MANY (AP is huge). Advance Publications is the company that owns the majority share of Reddit, Inc. It's unlikely a small subsidiary of AP has much influence on one of AP's investments.

We don't even know how much influence AP has, since owning a majority share doesn't mean they have a lot of control. Also Reddit, Inc. just went through a round of investments, so those investors likely have a lot of influence too.

bradleyjg · on Feb 18, 2015

Not Conde Nast, Conde Nast's owners, Advance Publications, or when you really get down to it, the Newhouse family.

And the grandparent saying that Advance Publications is "only" a majority shareholder is a little deceptive. The shareholders are Advance Publications, current and former employees (as part of an ESOP) and a small residual ownership of angels in the original company.

While it is true that a majority owner can't just do whatever it wants, the rules protect the financial interests of minority shareholders, mostly in the context of takeovers, not the editorial independence of employees. If Si and Donald decided they really didn't like the NSFW part of reddit I think they could get rid of it.

dublinben · on Feb 18, 2015

Did they not just raise many millions of dollars in a new investment round? Did those new investors not receive a percentage of equity in Reddit Inc. as it is organized today?

bradleyjg · on Feb 18, 2015

You're right, mea culpa. It was a $50M investment on a reported $500M valuation. So 10% to the new investors (with a possibly defunct plan to give 10% of that, i.e. 1%, to the site's users), and 90% split between Advance Publications, ESOP, and the legacy angels (reported at less than 1% of the pre-investment total).
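Spelled out as a quick Python calculation, under the assumption (implied by the 10% figure) that the reported $500M is a post-money valuation. The function name and dictionary keys are made up for this sketch.

```python
def cap_table(investment, post_money_valuation, user_share_of_new=0.10):
    """Split ownership after a round, treating the valuation as post-money."""
    new_investors = investment / post_money_valuation
    users = new_investors * user_share_of_new  # portion earmarked for users
    existing = 1.0 - new_investors             # AP, ESOP and legacy angels
    return {
        "new_investors": new_investors,
        "of_which_users": users,
        "existing_holders": existing,
    }
```

`cap_table(50e6, 500e6)` gives about 10% to new investors, about 1% of the company for users, and about 90% for existing holders. If the $500M were pre-money instead, the new investors' share would be 50/550, or roughly 9.1%.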

As for the ESOP percentage, all I've found is a reference in Forbes that describes it as a "sizable minority".

shillster · on Feb 18, 2015

Or even better, if we could systematically measure the brigades, moderator manipulation and psyops.

probably_wrong · on Feb 18, 2015

I've been thinking about that for a while in my free time. However, it's pretty much useless without actual data.

I wish there was a publicly available dataset, but given all the privacy implications, I doubt it.

erroneousfunk · on Feb 18, 2015

Small point: Is it really considered "scraping" (" I scraped approximately 84 million comments") if you used a Python library that uses the Reddit API, not the actual site directly?

fspacef · on Feb 18, 2015

Salute the effort put into this, quite thought-provoking.

okasaki · on Feb 18, 2015

Maybe I'm just stupid, but I don't see anything thought provoking.

In fact I feel that a better way to see what redditors are interested in would be to just find (there may even be stats on reddit on this) the ~50 most active subreddits.

zipppy · on Feb 18, 2015

If the 50 most active only constituted 10% of all reddit activity, though, they wouldn't necessarily paint an accurate picture of all of reddit.

Maybe nothing could paint that picture, but if there are themes prevalent independent of the subreddit topics themselves, this kind of analysis could shed light on them.

hbex5 · on Feb 18, 2015

Misogyny seems to be big in Reddit comments.

seany · on Feb 18, 2015

You misspelled misandry.

sliverstorm · on Feb 18, 2015

I'm thinking misanthropy is more accurate