I would love to see more tutorials explaining how you can use these basic building blocks (also covered by the great documentation) to find semantic meaning, e.g. for comparing sentences or understanding basic intent like https://explosion.ai/blog/chatbot-node-js-spacy, or some other useful application besides finding all the nouns in a document.
Thanks! I'll answer this, since Ines doesn't have an account here (I'm Matt).
We're really happy with how Prodigy's being received. It's only been on sale two months, so I'm looking forward to hearing more success stories as people finish their projects (and of course, feedback to change what needs to be changed!).
We're also working on more tutorial videos. This video shows the workflow for training a new entity type: https://www.youtube.com/watch?v=l4scwf8KeIA . This is one of the bits of the tool we're particularly proud of --- you can start off with a couple of seed terms, use word vectors to build up a larger terminology list, and then turn that list into a set of pattern rules to start bootstrapping a classifier. Prodigy will suggest phrases that match the patterns as entities, and your answers are used to train the statistical model. As you keep annotating, the model will start suggesting phrases too, which you'll say yes or no to. Eventually the model basically takes over, and you're mostly correcting its suggestions.
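To make the pattern step concrete, here is a minimal sketch of how a terminology list can be turned into entity candidates with spaCy's PhraseMatcher (v3-style API). The label, terms, and example sentence are made up for illustration; in the real workflow the list would come from the word-vector expansion step rather than being hard-coded.

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")

    # Made-up terminology list; in practice this is the list you build up
    # from a few seed terms via word-vector similarity.
    terms = ["golden retriever", "labrador", "poodle"]

    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("DOG_BREED", [nlp.make_doc(term) for term in terms])

    doc = nlp("She adopted a Labrador and a toy poodle last year.")
    for match_id, start, end in matcher(doc):
        # These spans are the candidate entities you accept or reject,
        # and the accepted ones feed the statistical model.
        print(doc[start:end].text)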
After reading "Backprop as Functor: A compositional perspective on supervised learning", I noticed that the spaCy backend (Thinc) is built with higher-order functions instead of a computational graph (unlike TensorFlow, Chainer, or PyTorch).
You know, I'm still not 100% certain whether there's a substantive difference between the "computational graph" perspective and this "functor" approach. Actually the feeling is sort of eerily familiar, because I spent most of my PhD confused about whether the grammar formalisms I was working with were really just notational variants, or whether there were significant differences. About the grammar formalisms, I ended up deciding that in theory there wasn't a difference, but in practice there sort of was.
We have the forward and backward pass organized as class methods here, and the intermediate state from the forward pass is saved into attributes in the instance. So on each call to the network, we make an instance of this LinearFunction class.
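For reference, here's roughly what that class-based style looks like with the current torch.autograd.Function API. This is a sketch: the modern API passes a ctx object to static methods rather than saving state on an instance, as the older style described above did.

    import torch

    class LinearFunction(torch.autograd.Function):
        # The forward pass stashes whatever the backward pass will need.
        @staticmethod
        def forward(ctx, input, weight, bias):
            ctx.save_for_backward(input, weight)
            return input @ weight.t() + bias

        # The backward pass reads the saved tensors and returns one
        # gradient per forward input.
        @staticmethod
        def backward(ctx, grad_output):
            input, weight = ctx.saved_tensors
            grad_input = grad_output @ weight
            grad_weight = grad_output.t() @ input
            grad_bias = grad_output.sum(0)
            return grad_input, grad_weight, grad_bias

    x = torch.randn(4, 3, requires_grad=True)
    w = torch.randn(2, 3, requires_grad=True)
    b = torch.zeros(2, requires_grad=True)
    y = LinearFunction.apply(x, w, b)
    y.sum().backward()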
In terms of what's being computed, there's really no difference between this and what happens when you call a layer in Thinc. It's just that the state gets captured in the outer scope of the closure. Maybe Thinc's way has a little less overhead, if there are fewer levels of indirection. Thinc uses the Chainer folks' GPU library --- so, unsurprisingly if you define the same network, the benchmarks are very similar.
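For contrast, here's a rough sketch of the closure style, written in plain NumPy. It illustrates the idea rather than Thinc's actual API: the layer is just a function that returns the output together with a backprop callback, and the intermediate state lives in the enclosing scope instead of on an object.

    import numpy as np

    def linear(W, b):
        # A "layer" is just a function; calling it returns (output, backprop).
        def forward(X):
            Y = X @ W.T + b
            def backprop(dY):
                # X, W and b are captured by the closure, so nothing needs
                # to hold state between the forward and backward passes.
                dX = dY @ W
                dW = dY.T @ X
                db = dY.sum(axis=0)
                return dX, dW, db
            return Y, backprop
        return forward

    layer = linear(np.random.randn(2, 3), np.zeros(2))
    Y, backprop = layer(np.random.randn(4, 3))
    dX, dW, db = backprop(np.ones_like(Y))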
On the other hand... I do think the implementation matters! Here's a difference for you: if the library approaches it as "we're going to build a computational graph, and execute it", then the library is going to steal the control flow. If the library tells you "here are some functions, and some higher-order functions to compose them", you have more access. PyTorch and Chainer don't steal the control flow to nearly the extent that TensorFlow does, but they still build up and tear down the state in their objects, and that makes it harder to intrude.
To clarify the grandparent comment: PyTorch also supports a functional approach rather than a computational graph. There's even a PyTorch functional model zoo :)
There is no such thing as a sentence, or a phrase, or a part of speech, or even a "word"---these are all pareidolic fantasies occasioned by glints of sunlight we see reflected on the surface of the ocean of language; fantasies that we comfort ourselves with when faced with language's infinite and unknowable variability.
This is a kind of tautology. Words and phrases exist not because of the whims of grammarians but because they are psychological realities for people. They are concepts represented in the mind and brain which exist whether or not people are explicitly aware of their existence--and this is demonstrated in a large psycho- and neurolinguistic literature. This is different from the man in the moon, where we are talking about the anthropomorphism of bits of rock and dust--whether the ontology of such a thing is a human face or just bits of rock and dust.
I don't know, the moon face seems real enough at the moment of seeing it, too.
Linking this to signal theory and the Fourier transform, one point to consider is that the solutions are only true in the infinite limit, so a word or a phrase is never enough to represent reality. A sense of continuity is real enough, but so is discontinuity, although I can't position that in a psychological frame, or a neurological one. But speaking with the Y combinator in mind, I don't think words are the fixpoints of thought; feelings are. Maybe onomatopoetic names are, and familiar faces are close enough.
Is there anyone out there who has tried linking named entities with an ontology? All I've seen is research but I'm curious if anyone's done any practical work in this area. Even if it was a narrowly-scoped ontology it might be pretty interesting.
There are a number of tools that do entity linking (that's the phrase you're looking for), including some open source ones.
We've evaluated and used quite a few over the years: there's Dexter [0] by Diego Ceccarelli, Semanticizer by UvA [1] and DBpedia Spotlight [2] and a few others. We've used them for various linking tasks, such as detecting "work skills" in plain text (HR domain) or detecting drug names (medical domain).
The extent to which these tools allow "customization" (ease of plugging in your own ontology, support for input formats and disambiguation signals) differs. Either way, even though this research includes open source code, the code is more of the "research prototype" kind. Don't expect a plug-and-play, optimized production tool.
At datamaran [1] we do some of that work for assessing non-financial risks with our internal ontology. We're also using spaCy and trying out Prodigy. Send me an email and I can put you in contact with one of our NLP scientists if you want (they're planning on releasing a paper sometime soon, btw).
That's definitely deployed in production, e.g. to link entities to a summary page in news articles, or in information extraction systems.
In practice it's very driven by information retrieval, especially the coverage of the synonym list provided in the ontology. Relevance scores for each ontology entry are also super important.
Disambiguation is getting better with neural networks, but it's still really hard. If you have two entities matching for some text, picking the most frequent one gives you a very strong baseline.
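As a toy illustration of that most-frequent-sense baseline (the candidate table, entity names, and counts below are invented for the example):

    # Map each surface form to candidate entities and their corpus frequency.
    candidates = {
        "jaguar": [("Jaguar_Cars", 4200), ("Jaguar_(animal)", 1300),
                   ("Jacksonville_Jaguars", 900)],
    }

    def most_frequent_entity(mention):
        # Ignore context entirely: just pick the candidate with the highest prior.
        options = candidates.get(mention.lower())
        if not options:
            return None
        return max(options, key=lambda pair: pair[1])[0]

    print(most_frequent_entity("Jaguar"))  # -> Jaguar_Cars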
I don't understand the hype around spaCy, especially when it relies on a (closed-source) annotated corpus to do its job. The default models fail in simple cases, and it takes a lot to train a new model on a new corpus.
I am wondering how people solve actual problems with spaCy. What are the use cases? Is spaCy used in question answering systems or summarization pipelines? Maybe conceptual search?
I prefer the approach of link grammar / RelEx, which are based on dictionaries and grammars; it seems easier and less error-prone.
Prodigy is genius! More tools like that will be built in the next few years...
It is truly groundbreaking, and a major improvement over NLTK. I also recommend gensim, another phenomenal library for NLP.