Allegedly (according to HP Labs, anyway) my 15-year-old software[0] was state-of-the-art at this on the CNN dataset[1].
But my thing was very heuristic-based, and couldn't generate new sentences like this can. I'm pretty impressed - I'd say some of the machine-generated summaries are better than the human ones.
This is way too difficult to get right, even for humans.
I dream of a not-so-smart news summarization engine that will not try to rewrite the news, but only pick up all the numbers and quotations, then present them in a table of who-said-what and how-many-what, along with the title.
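Something as dumb as the sketch below would already get most of the way there; the regexes and the who-said-what heuristic are just my own illustrative guesses, not a worked-out design:

```python
import re

# Minimal sketch of the "not-so-smart" engine described above: no rewriting,
# just pull out quotations and numbers and tabulate them. The regexes and the
# attribution heuristic are placeholders, not a robust implementation.

QUOTE_RE = re.compile(r'"([^"]+)"')  # quoted speech
NUMBER_RE = re.compile(r'\b\d[\d,.]*\s*(?:%|percent|million|billion|litres|liters)?', re.I)

def extract_facts(article_text):
    rows = []
    for sentence in article_text.split(". "):
        for quote in QUOTE_RE.findall(sentence):
            # crude who-said-what: assume the speaker is named in the same sentence
            rows.append(("quote", quote, sentence.strip()))
        for number in NUMBER_RE.findall(sentence):
            rows.append(("figure", number.strip(), sentence.strip()))
    return rows

if __name__ == "__main__":
    text = ('Australian wine exports hit a record high of 52.1 million liters '
            'in September. "We are delighted," said the trade minister.')
    for kind, value, context in extract_facts(text):
        print(f"{kind:6} | {value:<20} | {context}")
```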
I wish you wouldn't be so dismissive of journalism and journalists. What they provide is not filler. Controversial though this opinion may be around here, there is serious value in having an actual carbon-based life form -- one who has spent years or decades covering whatever beat -- provide context and insight for the quotes and data. That they have become a dying breed spells real trouble for our civic life.
The journalists you describe are so few and far between that there needs to be a different term for them. The vast majority of the 'news' out there isn't anything close to what you described.
The HN crowd does not need context and insight, they can just google a few keywords and then skim a wikipedia article to achieve the expertise necessary to argue with others on a public forum...
Are you serious? Journalists have templates for news articles they just fill with some new data every time statistics numbers are released or a politician speaks.
> This would put an end to filler-based journalism.
No, it might put an end to the filler-producing journalists; the so-called journalism would still get produced, albeit by a bot.
The real journalists (in terms of a better differentiation) would then be even more drowned out in an ever-growing desert of CGH (computer generated headlines).
There's SMMRY (http://smmry.com). I don't recall it being very smart -- that is, it doesn't rewrite sentences. It extracts the most important/relevant ones to basically shorten an article. It's definitely useful.
As an example, here is the Google article summarized by SMMRY.
===
Research Blog: Text summarization with TensorFlow
Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form, and is a long-term goal of the Google Brain team.
One approach to summarization is to extract parts of the document that are deemed interesting by some metric and join them to form a summary.
Above we extract the words bolded in the original text and concatenate them to form a summary.
It turns out for shorter texts, summarization can be learned end-to-end with a deep learning technique called sequence-to-sequence learning, similar to what makes Smart Reply for Inbox possible.
In this case, the model reads the article text and writes a suitable headline.
In those tasks training from scratch with this model architecture does not do as well as some other techniques we're researching, but it serves as a baseline.
We hope this release can also serve as a baseline for others in their summarization research.
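For what it's worth, the sequence-to-sequence idea mentioned in that excerpt can be sketched as a bare encoder-decoder. The following is only an illustration of the shape of such a model (made-up layer sizes, no attention or beam search, so not the actual TextSum architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 20000   # hypothetical vocabulary size
embed_dim = 128
hidden_dim = 256

# Encoder: reads the article tokens and compresses them into a final state.
enc_inputs = layers.Input(shape=(None,), dtype="int32", name="article_tokens")
enc_embed = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(hidden_dim, return_state=True)(enc_embed)

# Decoder: writes the headline one token at a time, conditioned on that state.
dec_inputs = layers.Input(shape=(None,), dtype="int32", name="headline_tokens")
dec_embed = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_out = layers.LSTM(hidden_dim, return_sequences=True)(
    dec_embed, initial_state=[state_h, state_c])
next_token = layers.Dense(vocab_size, activation="softmax")(dec_out)

model = Model([enc_inputs, dec_inputs], next_token)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```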
Here's my approach when I built my text-summary app with TensorFlow's SyntaxNet.
SyntaxNet (Parsey) gives the part of speech for each word and a parse tree showing which less-important words/phrases modify higher-level words. "She sells sea shells, down by the sea shore" => (down by the sea shore) is tagged by SyntaxNet as a lower-level modifier of "sells", so it can be removed from the sentence. Removing adjectives and prepositional phrases gives us simpler sentences easily.
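To make that concrete, here is a rough equivalent of the pruning step using spaCy's dependency parser instead of SyntaxNet (my substitution, so take the details with a grain of salt):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for SyntaxNet/Parsey

def simplify(sentence):
    doc = nlp(sentence)
    drop = set()
    for token in doc:
        # Prepositional phrases hang off their head via the "prep" relation;
        # dropping the whole subtree removes e.g. "by the sea shore".
        if token.dep_ == "prep":
            drop.update(t.i for t in token.subtree)
        # Plain adjectives modifying nouns can usually go as well.
        elif token.pos_ == "ADJ" and token.dep_ == "amod":
            drop.add(token.i)
    return "".join(t.text_with_ws for t in doc if t.i not in drop).strip()

print(simplify("She sells sea shells, down by the sea shore"))
# exact output depends on the parse, but the prepositional phrase is pruned
```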
Next, we find key words (central to sentences) for the news article based on n-grams, and then score the key sentences in which they appear. We use MIT's ConceptNet for common-sense linking of nouns, the most likely relations between them, and similar words based on vectors. Finally, we generate the article summary from the grammatically simple sentences.
My question is how well the trained models interpret human meaning in joined sentences. I discovered that by simplifying sentences you lose the original meaning when that grammatically-low-importance word is central to the meaning. "Clinton may be the historically first nominee, who is a woman, from the Dem or GOP party to win presidency" means something very different if you remove "who is a woman". I am also interested in how it makes sense to join up nouns/entities across sentences. This will produce the wrong meaning unless you build the human meaning structures, as in ConceptNet, by learning from the article itself, as opposed to pretrained models based on grammar or word vectors from Gigaword.
My future work is using a tf-idf-style approach for deciding the key words in a sentence, which I would recommend over relying on grammar/vectors alone. In the example in your blog post ("australian wine exports hit record high in september") you left out that it's 52.1 million liters; but if the article went on to mention or emphasize that number, by comparing it to past records or giving its price and so on, you can see that the "52.1 million liters" phrase in this one sentence has a higher score relative to the collection of all sentences. As opposed to probabilistic word cherry-picking based on prior data, this approach will enable you to extract named entities and phrases and build sentences from the phrases in any sentence that grammatically refers to them.
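A small sketch of that scoring idea, treating each sentence as its own "document" so repeated figures and phrases push their sentence up the ranking (the sentence splitting and tokenization here are deliberately naive):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(sentences):
    vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vec.fit_transform(sentences)            # one row per sentence
    scores = np.asarray(tfidf.sum(axis=1)).ravel()  # total term weight per sentence
    return sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)

article = [
    "Australian wine exports hit a record high in September.",
    "Exports reached 52.1 million liters, beating the previous record.",
    "The previous record of 50 million liters was set two years ago.",
]
for sentence, score in rank_sentences(article):
    print(f"{score:.2f}  {sentence}")
```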
You're pointing out what's already obvious. You still need some way to find what's "less important", which is what the topic is all about, e.g. by using grammar dependencies or keyword infrequency.
I'm not trying to be hostile, I just don't understand what you mean.
> You still need some way to find what's "less important", which is what the topic is all about, e.g. by using grammar dependencies or keyword infrequency.
I have some experience in this area[1]. I found keyword frequency worked quite well.
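For anyone curious what that looks like in practice, here is a bare-bones frequency-based extractor in the same spirit (my reconstruction of the general technique, not classifier4j's actual code):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on", "that"}

def summarize(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    top_words = {w for w, _ in Counter(words).most_common(20)}

    def score(sentence):
        # a sentence is "important" if it contains many of the article's frequent words
        return sum(1 for w in re.findall(r"[a-z']+", sentence.lower()) if w in top_words)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # keep the original sentence order
    return " ".join(sentences[i] for i in chosen)
```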
Does this include the state of the network in its already-trained state? It looks like we need to train it with the $6000 dataset if we want to get good results like those mentioned. Is it possible for its state to be saved to disk and restored (so we don't all need a copy of the dataset)?
The Hainan example in the article is especially impressive, the generated summary uses completely different expressions compared to the source text, yet it is spot-on. Of course those are probably cherry-picked results, but still. As a side note, it would be interesting to see how the algorithm performs with longer sources.
> We've observed that due to the nature of news headlines, the model can generate good headlines from reading just a few sentences from the beginning of the article.
This illustrates the importance of taking the trouble to understand a domain before trying to model it. I was taught in journalism class that the first paragraph of a newspaper article should summarize the story, the next 3-5 paragraphs summarize it again, and the rest of the article fills in the details. Not only do the authors spend time discovering what should have been known from the outset, they reverse cause and effect: the model can generate good headlines due to the nature of newspaper writing, not due to the nature of headlines.
> we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries
I was rather hoping to get some more insight on this.
Because when looking at the examples given, I wonder whether we really need machine learning to summarize single sentences. Just by cutting all adjectives, replacing words with abbreviations and multi-word phrases with broader category terms, we should get similar results. Maybe it's just a start, or did I miss anything?
Author of post here. I'd say most of the examples generated from the best model were good. However we chose examples that were not too gruesome, as news can be :)
We encourage you to try the code and see for yourself.
How does the model deal with dangling anaphora[1]? I wrote a summarizer for Spanish following a recent paper as a side project, and it looks as if I'll need a month of work to solve the issue.
[1] That is, the problem of selecting a sentence such as "He approved the motion" and then realising that "he" is now undefined.
Wouldn't it suffice to do a coreference pass before extracting sentences? Obviously you'll compound coref errors with the errors in your main logic, but that seems somewhat unavoidable.
I am working on this in my kbsportal.com NLP demo. With accurate coreference substitutions (e.g., substituting a previously mentioned NP like 'San Francisco' for 'there' in a later sentence, substituting full previously mentioned names for pronouns, etc.), extractive summarization should provide better results, and my intuition is that this preprocessing should help abstractive summarization as well.
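A sketch of what that substitution pass might look like once you have coreference clusters from some resolver (the cluster format and the example below are my own assumptions, just to keep it self-contained):

```python
def substitute_mentions(tokens, clusters):
    """tokens: list of words; clusters: list of lists of (start, end) spans,
    where the first span in each cluster is the representative mention."""
    replacement = {}
    for spans in clusters:
        rep_start, rep_end = spans[0]
        rep_text = tokens[rep_start:rep_end]
        for start, end in spans[1:]:
            replacement[start] = (end, rep_text)

    out, i = [], 0
    while i < len(tokens):
        if i in replacement:
            end, rep_text = replacement[i]
            out.extend(rep_text)   # swap the pronoun/adverb for its antecedent
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = "Mary visited San Francisco . She loved it there .".split()
clusters = [[(0, 1), (5, 6)],    # "Mary"          <- "She"
            [(2, 4), (8, 9)]]    # "San Francisco" <- "there"
print(substitute_mentions(tokens, clusters))
# Mary visited San Francisco . Mary loved it San Francisco .
```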
>>"In those tasks training from scratch with this model architecture does not do as well as some other techniques we're researching, but it serves as a baseline."
Can you elaborate a little on that? Is the training the problem or is the model just not good at longer texts?
Agreed, it seems they really hand-picked some shining examples for this post, and it would have been more interesting to see the full spectrum of when it works and when it doesn't. Perhaps the README in the GitHub repo is a bit more honest in terms of representativeness, though it only has 4 examples, one of which is an interesting failure:
article: novell inc. chief executive officer eric schmidt has been named chairman of the internet search-engine company google .
human: novell ceo named google chairman
machine: novell chief executive named to head internet company
I don't see that as a failure. It did produce a sentence that is shorter, grammatical (though "named to head" is a bit weird) and essentially true — calling Google an "internet company" would make sense in its early days (back when Google would be prefixed by "internet search-engine company").
I didn't think it was a failure either until I realized that I was letting future knowledge leak into the past! There is more than one internet company, so upon reading that headline, given that it must be a novel event, my question would be: "Which company?" Now I have to read the article until I find out. The human summary is better because I don't have to ask that question.
Most humans would interpret "novell chief executive named to head internet company" to mean "novell is an internet company and its chief executive just became its head", which is incorrect (and a little nonsensical, since the CEO is already in charge).
That's pretty interesting. It's taken me 5 readings of that sentence, including once out loud, to get your reading of it.
I thought the generated summary was really, really good. But I knew that Novell wasn't considered an internet company, so it wasn't until I made myself ignore that that I could see the other reading.
I think the holiday pay example is more glaring. Seen in isolation, it would leave me confused as to what on earth it was getting at. Furthermore, the summary is no good, and the abstract isn't either. My summary would be: British Gas continues to fight the EU court's decision that commission be included in holiday pay. "Continues" is used here to emphasize that the case is not yet over.
On the other hand, the football summary is exemplary; better than the provided abstract.
IMO at least the second example shown is already poor, or at least not much better than what sites like SMMRY[1] have been providing for years.
> hainan to curb spread of diseases
That sentence conveys pretty much no useful information - every city wants to "curb the spread of diseases", so what has actually changed? The news here is about restrictions on livestock, and even a student journalist would be expected to do better than this headline.
To be clear I'm excited about the idea and believe machine learning has much better potential for enormous refinements compared to SMMRY's method (as described by them[2]), I just don't think it's as "done" as a lot of people here seem to assume it to be.
Maybe the wrong context here, but if I am starting out learning machine learning with the Stanford course or some other one, is TensorFlow a good candidate to look into? Or does it only contain advanced algorithms?
I remember a Hacker News comment I read once where the user had created an Emacs plugin to do just this. It would generate a single-sentence summary of whatever text was input into it.
I would love to use this kind of tool and look for gotchas or key hidden elements in legal documents. It's not exactly summarisation, but very useful. Something like bubbling up the important elements of the fine print.
I seem to recall someone used Microsoft's AutoSummarize feature to repeatedly reduce classical works of literature down to a few lines. The results were pretty hilarious, but I can't find it now.
I wonder what it'd be like on a novel, say something like 'Pride and Prejudice'. Would it be able to essentially summarise the plot, or would it end up like 'movie plots explained badly'?
Either way, this is great research with a ton of real world applications!
> Although this task serves as a nice proof-of-concept, we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries. In those tasks training from scratch with this model architecture does not do as well as some other techniques we’re researching, but it serves as a baseline.
That would suggest that this method doesn't work well for long documents.
> by classifying the emotional arcs for a filtered subset of 1,737 stories from Project Gutenberg's fiction collection, we find a set of six core trajectories which form the building blocks of complex narratives. We strengthen our findings by separately applying optimization, linear decomposition, supervised learning, and unsupervised learning
[0] http://classifier4j.sourceforge.net/ (yes, Sourceforge! Shows how old it is!!)
[1] http://dl.acm.org/citation.cfm?id=2797081