How to parse a sentence and decide if to answer "that's what she said"? (quora.com)
256 points by fogus on April 27, 2011 | 44 comments


This submission reminded me why I love HN: on Reddit, there would have been a flowchart (at best) if I followed the link. Here, there's an engineering solution, with practical tips for machine learning.

Now I want to build a "that's what she said" bot for Twitter, which will parse new tweets and reply accordingly. (Maybe I will, as a learning exercise.)


http://www.reddit.com/r/programming/comments/gyook/how_would...

Actually, a fair share of the programming links are the same both here and on Reddit.


I don't think so, considering that the 1st comment thread on Reddit is a series of "that's what she said" jokes.

Edit: All the same, I'll try giving /r/programming a chance. As a whole, Reddit looks less civil than HN, but maybe I haven't been fair. People tend to criticize things they don't know about or have never experienced first-hand, and I'm guilty of that here.


> I don't think so, considering that the 1st comment thread on Reddit is a series of "that's what she said" jokes.

Well, the thing is that people on Reddit know better than to take themselves too seriously, and as a result you get plenty of amusing things mixed into the comments.

For HN, on the other hand, commenting is Serious Business(TM), not to be wasted on such frivolities.

Some people like the HN approach. Personally I prefer to have a little fun every now and again.


On a programming topic, on HN, maybe 20% of the comments are interesting, or at least informative. On a programming topic, on Reddit, maybe 2% of the comments are interesting or informative, but there are 10 times as many comments, the interesting ones are voted to near the top (not necessarily right up there), and some of the interesting Reddit comments don't have an analogue on HN.


/agree. SNR on Reddit is maybe 5%, Slashdot 1%, here, I don't know, 50%? I see some less-than-useful posts (like this one), but almost no outright trolling or patently wrong posts.

HN makes my daily read list, even if it is a quick glance. Great community.


Try /r/coding and /r/compsci as well.


Also grab r/python, r/ruby, r/java, r/cplusplus, r/lisp, r/haskell, r/php, etc to suit your fancy.


r/machinelearning, r/types r/reviewmycode, r/datasets, r/statistics


r/perl ;)


I don't get it - why is a great answer posted on Quora to this question a testament to HN?


Their argument is that HN's community links to and upvotes something useful, while Reddit's community links to and upvotes something funny but otherwise worthless.

Not claiming it's accurate - I'm not a Redditor. Just clarifying.


I think the main difference between HN and Reddit is not the quality of the links (in the stories they share they have similar links and most of them come to HN later than Reddit) but the ensuing discussion. In fact most of the time I don't even click on the link on HN (for topics I already know about) but go directly to the discussion. I also tend to save the HN discussion page to my Delicious, not just the link, because of the value of the discussion.

That being said, I don't think the discussion on Reddit is worthless at all. Yes, there's quite a bit of joking, which is good to read at 4pm when you're stuck on a bug. But this particular topic is an excellent example of the value of the Reddit discussion: reading through the jokes, etc. provides interesting thoughts about implementing the TWSS bot.

So, the two discussions are complementary. For particular stories, I head over to proggit and see what the Redditors have commented.


It was on Reddit before it was here.


Be careful. This is exactly the comment I would have seen on Reddit a year or two ago (replacing HN with Reddit, and Reddit with Digg).


It'd be pretty easy to build, especially if you use something like Python and NLTK. You could then extend to a "your mom" bot too.

I'd actually be intrigued by how simple the algorithm could be while still being mostly on the mark, rather than by how much NLP you could squeeze in...
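
For the curious, here's a minimal sketch of roughly how simple it could be: a plain bag-of-words Naive Bayes classifier with NLTK. The two corpus files (one sentence per line) are placeholders you'd have to supply yourself, and it assumes NLTK's 'punkt' tokenizer data is installed.

    # Bag-of-words Naive Bayes sketch; requires nltk and its 'punkt' data.
    import nltk
    from nltk.tokenize import word_tokenize

    def bag_of_words(sentence):
        # Presence of each lowercased token is the only feature.
        return {word: True for word in word_tokenize(sentence.lower())}

    def load_labelled(path, label):
        with open(path) as f:
            return [(bag_of_words(line.strip()), label) for line in f if line.strip()]

    # 'twss.txt' and 'other.txt' are hypothetical one-sentence-per-line files.
    train_set = load_labelled('twss.txt', 'twss') + load_labelled('other.txt', 'other')
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(bag_of_words("That was a lot harder than I expected.")))

Whether a bag of words is actually enough is debatable (see the discussion further down), but it's about the least NLP you can squeeze in.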


Slightly OT, but I was really intrigued to read about the "Switchboard" Corpus (http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.re...), especially given that the transcripts were made from audio recordings.

It seems, unfortunately, that the recordings aren't publicly available, although archive.org seems to have a small sample of the transcripts: http://www.archive.org/details/SwitchboardCorpusSample

One particularly interesting section of the above readme was the section on technical issues (http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.re...) including this example:

"ii.) The third problem was small changes in synchrony between A and B, due to a pseudorandom dropping of 2 ms chunks of data on either side. Over the course of a 10 minute conversation, these could accumulate to a differential of 30 or 40 msec between sides--enough to change a cross-channel echo from inaudible to audible, for example, or from barely audible to very noticeable, for a human listener.

When this bug was finally run down, it turned out to be a piece of code in the utility which extracts conversations ('messages') from the Robotoperator message master file. The code performed a check at each data block boundary to see if the first two bytes had the values 'FF FF'; if so, these were interpreted as header information, and the 16 bytes beginning with "FF FF" were discarded as not part of the speech data. This code was a relic from an earlier version of the Robotoperator which did not deal with mu-law values, and thus never encountered FF in data. In mu-law data, FF is one of two ways of representing zero signal level ('minus zero'). The offending lines of code were removed and the problem ceased."


That corpus was frequently used some time ago for performing automatic speaker recognition tests/system evaluations. Today, however, there are more challenging corpora, like those provided by NIST for its bi-annual speaker recognition evaluations (http://www.nist.gov/itl/iad/mig/sre10.cfm). These provide more mismatch conditions which require more advanced channel compensation mechanisms.


One of the commenters on Quora linked to an academic paper providing a solution to this problem. It was apparently published in ACL-HLT this year.

http://www.cs.washington.edu/homes/brun/pubs/pubs/Kiddon11.p...

Enjoy. It's easily the most hilarious academic paper I've read.


I've seen several better ones, but it's an interesting read nonetheless :) ("applying a novel approach - Double Entendre via Noun Transfer (DEviaNT) ")

As an example of other funny papers, this legal article on the word "fuck" is one of the best I've read recently (I found it linked here on HN previously):

http://moritzlaw.osu.edu/faculty/articles/fairman_fuck

Of course, the Annals of Improbable Research and the Ig Nobel Prizes yield several other examples :)


"Explaining a joke is like dissecting a frog. You understand it better but the frog dies in the process." — E. B. White

I think that answer at Quora, while awesome, should sufficiently kill TWSS.


It's a lot harder than I was expecting.


That's what she said.


I scanned through this page to see if this exact comment would appear here. Now that I see it, I realize it's less funny than I imagined. Maybe because the comment you replied to was a trap you fell for.


"Pre-trained That's-What-She-Said (TWSS) classifier in Ruby":

https://github.com/bvandenbos/twss


Here's my algorithm: Return false


Probably a pretty good baseline.


I would gem install twss

http://rubygems.org/gems/twss


The answer ignores a very public and growing corpus of data: IRC channels.


The author makes a great point, however, about being able to link the two together. I'm sure you'd end up with a lot of false positives from IRC: even when the reply is attributed (ircuser: TWSS!), the flow of IRC conversation is such that the reply isn't always to the last thing ircuser said (because of network lag and general reply lag in IRC).

Although now that I think about it... Twitter probably often suffers from the exact same issue, though it's somewhat mitigated because people often include the original tweet in their reply.


Unless I'm missing something, this only occurs if someone sends two messages and the receiving user responds to the first message but gets the second message before sending their reply.

Using average read and write times for a message, taking its length into account, it should be fairly easy to check whether someone could have read a message and written a response to it in the time available. Assuming it is a response if they could have written one in time should be fairly accurate.

You could probably even cheat and use a constant time while still getting good results.
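
A minimal sketch of that heuristic, with made-up reading and typing rates (the constants are assumptions, not measured values):

    # Could the second speaker plausibly have read the candidate message and
    # typed their reply within the time that elapsed between the two messages?
    READ_CPS = 30.0   # characters read per second (assumed)
    TYPE_CPS = 5.0    # characters typed per second (assumed)

    def could_be_reply(candidate_text, reply_text, seconds_elapsed):
        needed = len(candidate_text) / READ_CPS + len(reply_text) / TYPE_CPS
        return seconds_elapsed >= needed

    # A "TWSS!" that arrived 4 seconds after the candidate line:
    print(could_be_reply("I can't believe how long this thing is", "TWSS!", 4))

The constant-time cheat is then just replacing `needed` with a fixed number of seconds.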


I initially thought that myself but this is covered in end note [4]:

"[4] I think I once heard of an MSN chat transcript dataset that was really awesome, but I can't seem to find mention of it anymore. Let me know if you know where I can find that or any other instant message datasets. I know that some IRC rooms get publicly logged — is there a single place where one could grab all of them at once?"


I wouldn't assume you have to grab an existing log; wouldn't it be just like Twitter? (i.e., just "follow" a bunch of conversations/channels and build your own corpus by "tapping into the firehose").

You could use pre- or post-filtering to weed out connects/disconnects and other noise.

I only thought of this because there were a couple of "TWSS" replies in the past few days on #rubyonrails.


You could write a simple client that just ignores connect/disconnect messages from the server.
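
Either way, the filtering is cheap. Here's a sketch of the post-filtering approach on an existing log, assuming irssi-style lines ("12:34 <nick> message" for chat, "-!- ..." for joins/parts); the log format and file name are assumptions:

    # Keep only ordinary chat lines from an IRC log; drop joins/parts/quits
    # and other server noise.
    import re

    CHAT_LINE = re.compile(r'^\d{2}:\d{2}\s+<[^>]+>\s+(.*)$')

    def chat_messages(log_path):
        with open(log_path) as f:
            for line in f:
                match = CHAT_LINE.match(line)
                if match:
                    yield match.group(1)

    # for msg in chat_messages('rubyonrails.log'):
    #     print(msg)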

Reading this post I remembered a small project called sociograph that a friend of mine created. It's an IRC bot that logs messages and draws a graph of the people communicating with each other in real time; for an example, see http://www.youtube.com/watch?v=A_ah-SE-cNY.


I'm working on this; if anyone wants a corpus (positive examples from twssstories.com, negative ones from fmylife.com):

https://gist.github.com/945614
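
With a labelled corpus like that in hand, a quick held-out accuracy check is only a few lines. This sketch assumes the same hypothetical one-sentence-per-line files as earlier in the thread, not the actual format of the gist:

    import random
    import nltk
    from nltk.tokenize import word_tokenize

    def bag_of_words(sentence):
        return {word: True for word in word_tokenize(sentence.lower())}

    def load_labelled(path, label):
        with open(path) as f:
            return [(bag_of_words(line.strip()), label) for line in f if line.strip()]

    data = load_labelled('twss.txt', 'twss') + load_labelled('other.txt', 'other')
    random.shuffle(data)
    cut = int(len(data) * 0.8)
    classifier = nltk.NaiveBayesClassifier.train(data[:cut])
    print("held-out accuracy:", nltk.classify.accuracy(classifier, data[cut:]))
    classifier.show_most_informative_features(10)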


A British equivalent to 'that's what she said' would probably be 'said the actress to the bishop'.


2 suggestions:

1) Ask people on Mechanical Turk to write such sentences, then ask others to verify them - you can get far with a few dollars.

2) Include higher-level features - for example bigrams; there is more information in them (a quick sketch follows below).

Also: there's a corpus at http://thatswhatshesaid.com/ (I have no relation to this site).
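
On point 2, here's a sketch of mixing bigram features in with the unigrams (just the feature extraction; the example sentence is only an illustration):

    # Unigram + bigram feature extraction with NLTK.
    from nltk.tokenize import word_tokenize
    from nltk.util import bigrams

    def features(sentence):
        tokens = word_tokenize(sentence.lower())
        feats = dict(('UNI_' + t, True) for t in tokens)
        feats.update(('BI_' + a + '_' + b, True) for a, b in bigrams(tokens))
        return feats

    print(features("it is too big to fit in here"))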


Hey, let's see Google's Prediction API take on this challenge, if it's so general. Just find a decent corpus and pop it into a GAE app.

In any case, I think most of us are now primed to try out some implementation or another. It would be a lot of fun regardless of the quality. Actually, the false positives are probably funnier.


I would be very surprised if a simple bag-of-words approach worked in this case. Intuitively, it's not the presence of certain groups of words that's important; it's something much more subtle and structural. Something that might be promising (and I'm being very handwavy here) is to discover 'template' sentence structures, along with the particular words that populate those templates and result in a TWSS.
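
Being equally handwavy, one cheap approximation of 'templates' is to reduce a sentence to its POS-tag skeleton and keep the content words that fill the slots. This sketch uses NLTK's tokenizer and tagger data ('punkt' and the default POS tagger), which differ from what was available in 2011, so treat it as illustrative only:

    import nltk

    def template(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        skeleton = tuple(tag for _, tag in tagged)               # structural template
        fillers = [word for word, tag in tagged
                   if tag.startswith(('NN', 'JJ', 'VB'))]        # slot fillers
        return skeleton, fillers

    print(template("It's too big to fit in my mouth"))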


I don't believe templates would work very well. Sentences vary too much, so you would end up with very low recall.

An alternative is to attack the problem backward: train on terms (words or phrases) from sex-related conversations (such as adult chatroom transcripts), then, from a general corpus (Twitter or generic chats), identify terms that co-occur strongly with those sex-related terms. I would still use a Bayesian classifier, with a strong prior against labelling something as a TWSS.
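
A rough sketch of the co-occurrence step, over a hypothetical one-message-per-line general corpus; the seed terms and file name are placeholders, not real data:

    from collections import Counter
    from nltk.tokenize import word_tokenize

    SEED_TERMS = {'bed', 'naked'}   # stand-in seed terms from the sex-related corpus

    def cooccurrence_counts(corpus_path):
        counts = Counter()
        with open(corpus_path) as f:
            for line in f:
                tokens = set(word_tokenize(line.lower()))
                if tokens & SEED_TERMS:
                    counts.update(tokens - SEED_TERMS)
        return counts

    # counts = cooccurrence_counts('general_chat.txt')
    # print(counts.most_common(20))

The strong prior then just becomes a high decision threshold: only answer TWSS when the classifier's posterior for the TWSS label is well above 0.5.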


If this were to emulate some of my friends then it could respond to just about anything with "that's what she said."


I think this is marvelous but overthought. Here's my Python code (2.7 and 3.2-compatible):

twss = lambda sentence: True


Nice try Skynet.


Skynet? Who's heard anything of Skynet? I'm GLaDOS.

Now, will you stand over.... there?



