I once went on a date with someone who did research at OkCupid and told me they were doing NLP-style analysis of the messages people sent to each other. Still not really sure what to think of the date itself, but it was a fucked up admission.
If you remember the old OkCupid blog, they used to post interesting articles about online dating. I know their article about whether you should smile in your profile picture was eventually debunked [1], but it was nonetheless nice to have objective, data-based, non-PUA advice on how to be successful in online dating.
There was an actual effort at data science going on here before the marketing team took it over in the latter years. See the published book Dataclysm by one of the founders for more of the good stuff.
They did tons of data analysis across all aspects of profiles, and had a popular blog where they published the results.
They were heavily involved in researching what factors more reliably led to not just better matches, but better relationships -- when you disabled your account, they'd ask if it was because you'd met someone through OkC and ask you to pick who, if you were willing to share.
I don't think there was anything fucked up about it, as long as it was all anonymized and at scale. Trying to understand what messaging strategies worked better or worse could be a major part of figuring out how to improve matches.
Like, one obvious factor could be to match people who send lots of long messages with lots of questions with each other, while separately matching people whose messaging style is one sentence at a time with each other. I'm not saying that would necessarily work well, but it's not crazy to research whether NLP analysis of messages can produce additional compatibility signals.
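As a rough illustration of what such a signal might look like, here's a minimal sketch with made-up feature names, formula, and thresholds -- purely hypothetical, not anything OkCupid actually described:

    # Hypothetical sketch: derive a per-user "messaging style" from anonymized
    # message texts, then score pairs by style similarity. Feature choices and
    # the similarity formula are illustrative assumptions only.
    from statistics import mean

    def style_features(messages):
        # messages: list of message texts sent by one (anonymized) user
        lengths = [len(m.split()) for m in messages]
        questions = [m.count("?") for m in messages]
        return {
            "avg_length": mean(lengths) if lengths else 0.0,
            "avg_questions": mean(questions) if questions else 0.0,
        }

    def style_similarity(a, b):
        # crude 0..1 signal: closer styles score higher
        length_gap = abs(a["avg_length"] - b["avg_length"]) / max(a["avg_length"], b["avg_length"], 1)
        question_gap = abs(a["avg_questions"] - b["avg_questions"]) / max(a["avg_questions"], b["avg_questions"], 1)
        return 1.0 - 0.5 * (length_gap + question_gap)

    verbose = style_features(["How was your trip? What did you like most about Lisbon?"])
    terse = style_features(["cool", "nice pic"])
    print(style_similarity(verbose, terse))  # low score -> weaker compatibility signal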
The whole point of OkC back then was to try to develop as many data-based signals as possible to improve matches.
You realize that you're responding in a thread about OkCupid deceiving users and sharing data with third parties, right?
> 34. In response to this request, Humor Rainbow gave the Data Recipient access to nearly three million OkCupid user photos. Humor Rainbow’s President and Chief Technology Officer were directly involved in facilitating the data transfer.
> 35. In addition to user photos, Humor Rainbow shared other personal data with the Data Recipient, including each user’s demographic and location information.
> I don't think there was anything fucked up about it, as long as it was all anonymized and at scale
I'm confused why "as long as" carries so much weight here considering the article that started this discussion. You seem to trust that they stopped their privacy fuckups with third parties. I don't know where your trust comes from.
I'm saying that NLP over messages and sharing with third parties are completely orthogonal. It's not about trust; they're two different topics, and the other topic was started by the original commenter. People often make comments on HN that are related but separate, as NLP and third-party sharing are.
I did like that they shared a lot of hard data with insightful analysis. At the time, there were a lot of narratives about what women wanted and it was refreshing to see them post what was actually working. I remember being skeptical about anything being private online at the time, but I guess that perspective wasn't as pervasive.
No, it totally wasn't a fucked up admission; it was actually a useful and pro-user measure (all this good stuff was before the 2011 acquisition).
Christian Rudder's OKTrends blog (and Sam Yagan's presentation at their acquisition celebration) even spelled out the reason why: some women on OKC (or, more rarely, men) would acquire the "Replies rarely"/red color on their profile for almost never (<10%) replying to initial messages, which was generally considered undesirable behavior. Even a negative reply has value ("Thanks, but I'm not interested due to age/location/some other factor"), and replies also let OKC measure whether users' stated preferences mismatched the preferences inferred from whom they actually messaged, e.g. people who say they're looking for 30-55 for an LTR but tend to message people 21-35 for something short-term. And before anyone points out that younger, more attractive female profiles would get far more initial messages than male ones (up to 200:1), OKC used to let you set filters on the other user's age/distance/other criteria, so you could automatically filter those out. Also, factor in the usual caveat that many users on dating sites lie about their age/weight/height/location/status/etc.
Anyway, to avoid getting labeled with the dread "Replies rarely", some (mostly female) users got in the habit of sending one-liner responses that were ambiguous/non-committal/cryptic/negging, and then not responding further (but without unmatching, which only took a single click). This made their profiles look less undesirable but generated pointless message traffic and reduced the platform's overall ability to actually match people (for compatibility, not just initial attraction). Hence, OKC tried to measure initial exchanges and figure out which ones led to genuine back-and-forth conversations of 3+ messages, which is a decent proxy for inferring a match, and certainly a better one than just counting initial messages/likes/votes on photos. (Yagan jokingly referred to this as "Every Monday morning, we ask 'How many three-ways did we set up over the weekend?'")
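For anyone curious what that 3+ message proxy might look like in practice, here's a minimal sketch assuming a simple per-pair message log; the data layout, threshold, and function names are my own illustrative assumptions, not OKC's actual implementation:

    # Count conversations where both sides participate and at least three
    # messages are exchanged -- the "genuine back-and-forth" proxy described above.
    def is_genuine_conversation(thread, min_messages=3):
        # thread: chronological list of (sender_id, text) for one anonymized pair
        senders = {sender for sender, _ in thread}
        return len(thread) >= min_messages and len(senders) == 2

    def weekend_match_proxy(threads):
        # rough "how many three-ways did we set up over the weekend" count
        return sum(is_genuine_conversation(t) for t in threads)

    threads = [
        [("a", "Hi, loved your bookshelf photo"),
         ("b", "Thanks! Which spine caught your eye?"),
         ("a", "The Calvino, obviously")],
        [("c", "DTF?")],  # one-sided opener, never counted
    ]
    print(weekend_match_proxy(threads))  # 1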
(PS, Rudder and Yagan both stressed that users' names/identities/identifying characteristics were kept out of the analysis.)
After the 2011 IAC acquisition, most of this platform quality control (and looking for constructive insights) went out the window pretty quickly and the three cofounders moved to OKCupid Labs. But it was good for the brief while it lasted. By 2013 a chainsaw had been taken to most of OKC's unique features, esp. for free users.
Hey, my point is that it's fucked up that I went on a date with someone who admitted their job was to read other people's messages. If you don't think that's fucked up, then we simply have a difference in opinion. I don't know what the rest of your post is about.
OKC analyzed message traffic in an anonymized way to infer when matching was/wasn't working, and what insights that revealed about people's inferred vs stated preferences.
As such it wasn't "reading other people's messages". That's according to the OKC founders' description of what they did, pre-acquisition. Since they were reasonably upfront about what they did, and since that functionality worked even for free users and they didn't aggressively push premium or gate the features, I believe they were being truthful. Furthermore, after the 2011 acquisition, when they stopped being active on OKC itself, there was a palpable degradation in site quality. (And post-2014, IAC went on to sell personal information about users' substance use etc. to insurers, to which users had never given informed consent.)
So I think your date explained things badly and you picked up the wrong end of the stick. It's trivially easy to write a script that strips usernames and identifying information. And it's not too hard to distinguish "Hey baby" or "DTF?" from more meaningful messages. For the platform to do that with the intent of improving matches was strongly positive, not negative.
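To be fair, "trivially easy" undersells how hard real de-identification is, but a toy version of the kind of stripping meant here might look like this (field names and patterns are made up for illustration):

    # Toy de-identification pass: drop obvious identifiers and blank out
    # @-mentions inside message text. Real de-identification needs far more care.
    import re

    def strip_identifiers(record):
        # record: dict with hypothetical keys; only non-identifying fields survive
        return {
            "text": re.sub(r"@\w+", "@[redacted]", record["text"]),
            "word_count": len(record["text"].split()),
        }

    msg = {"username": "alice_92", "city": "Austin", "text": "Hey @bob_77, coffee this week?"}
    print(strip_identifiers(msg))  # {'text': 'Hey @[redacted], coffee this week?', 'word_count': 5}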
We're talking past each other. You think the ends justify the means and I don't. Like, if there's a "Read" marker for a person viewing my messages, then I should have the same for analysts reading my messages, too. Like, when someone is running grep over everyone's messages, they're not just viewing that in isolation -- the context is what's important, and that requires actually reading the messages and their outcomes.
> For the platform to do that with an intent to improving matches was strongly positive, not negative.
Do you realize how much work OkCupid has put into their product to make matching intentionally worse (through analysis!)? Take off your rose-tinted glasses!!
Edit: aaaand there's a new article about OkCupid giving facial recognition data to 3rd parties.
makes me wonder if the person you went on a date with cherry-picked you due to your data. (anyone who would post on hacker news is obviously a good catch!)
I think the "only thing" that would make me cherry-pickable from their data is that I used an autoclicker to give everyone 5 stars... I have mixed feelings about doing that, but I got a couple of (surprisingly nice) dates out of it that never went anywhere.
If only they had the long term data too. It might make for easier discussions on the first date, but maybe there's more to opposites attracting/different roles in a relationship.
Like I said in a different post, there are legal reasons why you would want the original data. Deleting the data from the dataset is negligent.
If you want something to blame, blame the system that allowed the data to be bad in the first place. You're pointing your finger at the wrong people and it's unreasonable of you to call them negligent.
Sure, we should indeed expect them to do that. But look at enough data and you'll learn that those expectations are a path toward never-ending frustration. I've been there, spending >100 hours cleaning data... that never got published because I was too damn focused on the dozens of years of errors that many, many people created.
To be clear, I'm not saying that we should accept messy data. Just, reality is messy and it's naive to think we can catch and remove all of reality's messiness -- which includes the bureaucratic slop that led to the data being published in the first place.
In fact, government agencies will argue that they have zero legal obligation to clean the data, let alone figure anything out about the data, and that they're just giving you the data as-is. This happened to me on a FOIA call where I was trying to get data from the county state's attorney. They insisted they could only run a specific report and that they had no obligation to run any query, meaning I couldn't even get access to the data I needed.
"obviously wrong" is a never ending rabbit hole and you'll never, ever be satisfied because there will always be something "obviously wrong" with the data.
Messy data is a signal. You're wrong to omit signal.
Exactly. This is a big problem with "open data". A lot goes into cleaning it up to make it publishable, which often includes removing data so that the public "doesn't get confused". Now I have to spend months and months fighting FOIA battles to get the original raw, messy data because someone, somewhere had opinions on what "clean data" is. I'll pass -- give me the raw, messy data.
I do not disagree with that, but I am not sure what "raw data" means in some cases like the ones the article talks about. The 1.700.000 is no more or less raw than 1.700,000. Most probably somebody messed up some decimals somewhere, or somebody imported a CSV into Excel and it misinterpreted the numbers due to different locale settings. Similar to swapped longitude/latitude. That sounds different to me than, let's say, noisy temperature data from sensors. Rather, it seems more like issues that arose at the point of merging datasets together, which is already far from the data being raw.
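A toy illustration of that separator ambiguity (the normalize helper below is hypothetical, not from any particular tool): the same string parses to very different numbers depending on which locale convention you assume.

    # The same raw string means different things under different separator conventions.
    def normalize(value, decimal_sep):
        thousands_sep = "," if decimal_sep == "." else "."
        return float(value.replace(thousands_sep, "").replace(decimal_sep, "."))

    raw = "1.700"
    print(normalize(raw, decimal_sep="."))  # 1.7     (US/UK reading)
    print(normalize(raw, decimal_sep=","))  # 1700.0  (many European locales)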
The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it than a random person looking into that dataset. So I do not think it is unreasonable to have people in organisations take a second look at the datasets they publish.
When I say "raw", what I'm referring to is the preservation of the data's chain of custody. If I'm looking at the data with an intent to sue the respective government agency, then I have strong legal reasons to make sure that the data isn't modified. If I start from open data for example, the gov agency will have their data person sign an affidavit making this very clear and I will lose my case basically immediately.
> The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it
You'd think so, but like most other systems, these ones are often inherited or not well thought out, so the understanding sits outside and we can't assume expertise within.
I have mixed feelings about this. On one hand, yeah, stop publishing garbage data; but as a FOIA nerd... I'll take the data in whatever state it's in. I'm not personally going to be able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data knowing that it has garbage data within? Hell no. Instead, we should learn and cultivate techniques to work with shit data. Should I attempt to clean it? Sure. But it becomes a liability problem very, very quickly.
One of those people can republish their cleaned and validated version and the 999 others can compare it to the original to decide whether they agree with the way it was cleaned or not.
Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?
If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?
Working with data like this has unknown error bars, and I've had weird shit happen where I fixed the tracing pipeline, and then the metrics people complained: they had been correcting for the errors downstream, and with the fix in place those corrections made the whole thing look out of shape.
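One concrete version of the "where do you draw the line" problem: even a textbook rule like the 1.5×IQR fence just moves the judgment call into a magic constant. A sketch (the fines list is made up):

    # Flag values outside the Tukey 1.5*IQR fences. The k=1.5 multiplier is
    # exactly the arbitrary "line" being asked about -- change it and the
    # set of "outliers" changes with it.
    import statistics

    def iqr_outliers(values, k=1.5):
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return [v for v in values if v < lo or v > hi]

    fines = [75, 80, 80, 90, 100, 120, 250_000]  # is 250,000 a typo or a real fine?
    print(iqr_outliers(fines))                   # [250000] -- flagged, but only a human can decide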
This isn't possible to answer generally, but I'm sure you know that.
Look -- I've been in nonstop litigation for data through FOIA for the past ten years. During litigation I can definitely push back on messy data, and I have, but if I were to do that on every little "obviously wrong" point, then my litigation would get thrown out for me being a twat of a litigant.
Again, I'd rather have the data and publish it with known gotchas.
Should I have told the Department of Finance to fuck off with their messy data? No -- even if I wanted to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here -- once I and others started publishing stuff about the ticket data and more journalists got involved, the data became cleaner over time.
Sorry, I meant to say that it's not always possible to clean the data if it's corrupt in the first place, because it was collected in a buggy manner. And a few inexplicable outliers in a dataset can often erode confidence in the rest.
Since this is not data you collected, I understand you have to work with what you have. By the way, very interesting post, and nice job!
Quick, without doing any kind of Googling or calculation: if I asked you to count 3B grains of rice by hand, how long would that take you? How big would that pile of rice be? How long would it take you to eat it?
A billion is already unfathomably large. If you think it isn't, you just haven't tried imagining what a billion of anything would be like.
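For scale, a quick back-of-envelope, assuming the optimistic rate of one grain counted per second, nonstop:

    grains = 3_000_000_000
    seconds_per_year = 60 * 60 * 24 * 365
    print(grains / seconds_per_year)  # ~95 years just to count them, with no sleep or breaks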
The problem with this exercise is that I have a few million in wealth, and while I cannot actually visualize a few million grains of rice, I am fully aware of my total capacity to allocate capital to problems.
Neither $3b nor $300b is realistically unfathomable to me. I find them easy to consider in terms of the projects I could build if I achieved each of these amounts.
As an example I’d have to allocate somewhere between $50m to $250m to get people to vote on a California proposition. I’d need to spend $1.5b to create a wing of a major hospital. I’d need between $100m and half a billion to create a new K-12 school in my city.
These are large sums of money and are currently out of my reach so if AI doesn’t destabilize everything my best bet is to take the same approach each of my ancestors did. Move my children one level up the wealth ladder and hopefully give them the values that help them prioritize these actions and the optimal way of getting there. I think that involves some amount of compounding and then some amount of spending.
The point is that you’re deluding yourself if you think that there is any difference in terms of relative “unfathomability” between 3 billion and 300 billion.
3 billion generates more in interest per day than 99.99% of people make in a year. That’s an unfathomable volume of wealth for even the very rich.
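To put a rough figure on the interest point (the 5% yield is purely an illustrative assumption):

    principal = 3_000_000_000
    annual_yield = 0.05  # illustrative assumption, not a claim about any actual portfolio
    print(principal * annual_yield / 365)  # roughly $410,000 per day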
If anyone's interested in this sort of puzzle, the game Noita is filled with them. A large chunk of the code is in Lua and you can inspect it!
The final puzzle for the game is a cryptographic puzzle that's been unsolved for five years now. Folk have done just about everything imaginable to solve the puzzle.
The wand-building community is also a fun dive. The subreddit and Discord are filled with people doing everything from animating the Rickroll music video to exploiting int overflows to kill all loaded enemies.
More important than healing is avoiding damage entirely.
The keys to getting through Noita are a fast wand to repel enemies and a wand that can dig through stone quickly. There are spell components with negative mana cost and cast time; stack as many of those as you can with a single projectile-type spell and you'll make a blaster that can kill enemies safely. The digging wand will let you skip tough sections, tunnel to areas with good loot, return to the "sanctuary" zones to edit your wands (if you haven't gotten the "edit anywhere" perk), and so on.
Once you have those, you can make more utility wands that let you fly via recoil and whatever else you need.
It took me a long time, too. Caution's the name of the game. Collect as many hearts and as much gold as you can in the beginning. Perk shuffling is important, especially early on, so you'll want to be able to afford it.