I would love to see more tutorials explaining how you can use these basic building blocks (also covered by the great documentation) to find semantic meaning, e.g. for comparing sentences or understanding basic intent like https://explosion.ai/blog/chatbot-node-js-spacy, or some other useful application besides finding all the nouns in a document.
Thanks! I'll answer this, since Ines doesn't have an account here (I'm Matt).
We're really happy with how Prodigy's being received. It's only been on sale two months, so I'm looking forward to hearing more success stories as people finish their projects (and of course, feedback to change what needs to be changed!).
We're also working on more tutorial videos. This video shows the workflow for training a new entity type: https://www.youtube.com/watch?v=l4scwf8KeIA . This is one of the bits of the tool we're particularly proud of --- you can start off with a couple of seed terms, use word vectors to build up a larger terminology list, and then turn that list into a set of pattern rules to start bootstrapping a classifier. Prodigy will suggest phrases that match the patterns as entities, and your answers are used to train the statistical model. As you keep annotating, the model will start suggesting phrases too, which you'll say yes or no to. Eventually the model basically takes over, and you're mostly correcting its suggestions.
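To make the pattern step concrete, here is a minimal sketch of how a terminology list can be turned into entity candidates with spaCy's PhraseMatcher (v3-style API). The label, terms, and example sentence are made up for illustration; in the real workflow the list would come from the word-vector expansion step rather than being hard-coded.

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")

    # Made-up terminology list; in practice this is the list you build up
    # from a few seed terms via word-vector similarity.
    terms = ["golden retriever", "labrador", "poodle"]

    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("DOG_BREED", [nlp.make_doc(term) for term in terms])

    doc = nlp("She adopted a Labrador and a toy poodle last year.")
    for match_id, start, end in matcher(doc):
        # These spans are the candidate entities you accept or reject,
        # and the accepted ones feed the statistical model.
        print(doc[start:end].text)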
After reading "Backprop as Functor: A compositional perspective on supervised learning", I noticed that the spaCy backend (Thinc) is built with higher-order functions instead of a computational graph (unlike TensorFlow, Chainer, or PyTorch).
You know, I'm still not 100% certain whether there's a substantive difference between the "computational graph" perspective and this "functor" approach. Actually the feeling is sort of eerily familiar, because I spent most of my PhD confused about whether the grammar formalisms I was working with were really just notational variants, or whether there were significant differences. About the grammar formalisms, I ended up deciding that in theory there wasn't a difference, but in practice there sort of was.
We have the forward and backward pass organized as class methods here, and the intermediate state from the forward pass is saved into attributes in the instance. So on each call to the network, we make an instance of this LinearFunction class.
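For reference, here's roughly what that class-based style looks like with the current torch.autograd.Function API. This is a sketch: the modern API passes a ctx object to static methods rather than saving state on an instance, as the older style described above did.

    import torch

    class LinearFunction(torch.autograd.Function):
        # The forward pass stashes whatever the backward pass will need.
        @staticmethod
        def forward(ctx, input, weight, bias):
            ctx.save_for_backward(input, weight)
            return input @ weight.t() + bias

        # The backward pass reads the saved tensors and returns one
        # gradient per forward input.
        @staticmethod
        def backward(ctx, grad_output):
            input, weight = ctx.saved_tensors
            grad_input = grad_output @ weight
            grad_weight = grad_output.t() @ input
            grad_bias = grad_output.sum(0)
            return grad_input, grad_weight, grad_bias

    x = torch.randn(4, 3, requires_grad=True)
    w = torch.randn(2, 3, requires_grad=True)
    b = torch.zeros(2, requires_grad=True)
    y = LinearFunction.apply(x, w, b)
    y.sum().backward()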
In terms of what's being computed, there's really no difference between this and what happens when you call a layer in Thinc. It's just that the state gets captured in the outer scope of the closure. Maybe Thinc's way has a little less overhead, if there are fewer levels of indirection. Thinc uses the Chainer folks' GPU library --- so, unsurprisingly if you define the same network, the benchmarks are very similar.
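For contrast, here's a rough sketch of the closure style, written in plain NumPy. It illustrates the idea rather than Thinc's actual API: the layer is just a function that returns the output together with a backprop callback, and the intermediate state lives in the enclosing scope instead of on an object.

    import numpy as np

    def linear(W, b):
        # A "layer" is just a function; calling it returns (output, backprop).
        def forward(X):
            Y = X @ W.T + b
            def backprop(dY):
                # X, W and b are captured by the closure, so nothing needs
                # to hold state between the forward and backward passes.
                dX = dY @ W
                dW = dY.T @ X
                db = dY.sum(axis=0)
                return dX, dW, db
            return Y, backprop
        return forward

    layer = linear(np.random.randn(2, 3), np.zeros(2))
    Y, backprop = layer(np.random.randn(4, 3))
    dX, dW, db = backprop(np.ones_like(Y))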
On the other hand... I do think the implementation matters! Here's a difference for you: if the library approaches it as "we're going to build a computational graph, and execute it", then the library is going to steal the control flow. If the library tells you "here are some functions, and some higher-order functions to compose them", you have more access. PyTorch and Chainer don't steal the control flow to nearly the extent that TensorFlow does, but they still build up and tear down the state in their objects, and that makes it harder to intrude.
To clarify the grandparent comment: PyTorch also supports a functional approach rather than a computational graph. There's even a PyTorch functional model zoo :)
There is no such thing as a sentence, or a phrase, or a part of speech, or even a "word"---these are all pareidolic fantasies occasioned by glints of sunlight we see reflected on the surface of the ocean of language; fantasies that we comfort ourselves with when faced with language's infinite and unknowable variability.
This is a kind of tautology. Words and phrases exist not because of the whims of grammarians but because they are psychological realities for people. They are concepts represented in the mind and brain which exist whether or not people are explicitly aware of their existence--and this is demonstrated in a large psycho- and neurolinguistic literature. This is different from the man in the moon, where we are talking about the anthropomorphism of bits of rock and dust--whether the ontology of such a thing is a human face or just bits of rock and dust.
I don't know, the moon face seems real enough at the moment of seeing it, too.
Linking this to signal theory and the Fourier transform, one point to consider is that the solutions are only true in the infinite limit, so a word or a phrase is never enough to represent reality. A sense of continuity is real enough, but so is discontinuity, although I can't position that in a psychological frame, or a neurological one. But speaking with the Y combinator in mind, I don't think words are the fixpoints of thought; feelings are. Maybe onomatopoetic names are, and familiar faces are close enough.
Is there anyone out there who has tried linking named entities with an ontology? All I've seen is research but I'm curious if anyone's done any practical work in this area. Even if it was a narrowly-scoped ontology it might be pretty interesting.
There are a number of tools that do entity linking (that's the phrase you're looking for), including some open source ones.
We've evaluated and used quite a few over the years: there's Dexter [0] by Diego Ceccarelli, Semanticizer by UvA [1] and DBpedia Spotlight [2] and a few others. We've used them for various linking tasks, such as detecting "work skills" in plain text (HR domain) or detecting drug names (medical domain).
The extent to which these tools allow "customization" (ease of plugging in your own ontology, support for input formats and disambiguation signals) differs. Either way, even though this research includes open source code, the code is more of the "research prototype" kind. Don't expect a plug-and-play, optimized production tool.
At datamaran [1] we do some of that work for assessing non-financial risks with our internal ontology. We're also using spaCy and trying out Prodigy. Send me an email and I can put you in contact with one of our NLP scientists if you want (they're planning on releasing a paper sometime soon, btw).
That's definitely deployed in production, e.g. to link entities to a summary page in news articles, or in information extraction systems.
In practice it's very driven by information retrieval, especially the coverage of the synonym list provided in the ontology. Relevance scores for each ontology entry are also super important.
Disambiguation is getting better with neural networks, but it's still really hard. If you have two entities matching for some text, picking the most frequent one gives you a very strong baseline.
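As a toy illustration of that most-frequent-sense baseline (the candidate table, entity names, and counts below are invented for the example):

    # Map each surface form to candidate entities and their corpus frequency.
    candidates = {
        "jaguar": [("Jaguar_Cars", 4200), ("Jaguar_(animal)", 1300),
                   ("Jacksonville_Jaguars", 900)],
    }

    def most_frequent_entity(mention):
        # Ignore context entirely: just pick the candidate with the highest prior.
        options = candidates.get(mention.lower())
        if not options:
            return None
        return max(options, key=lambda pair: pair[1])[0]

    print(most_frequent_entity("Jaguar"))  # -> Jaguar_Cars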
I don't understand the hype around spaCy, especially when it relies on a (closed-source) annotated corpus to do its job. The default models fail in simple cases, and it takes a lot to train a new model on a new corpus.
I am wondering how people solve actual problems with spaCy. What are the use cases? Is spaCy used in question answering systems or summarization pipelines? Maybe conceptual search?
I prefer the approach of link grammar / RelEx, which are based on dictionaries and grammars; it seems easier and less error-prone.
Prodigy is genius! More tools like that will be built in the next few years...
It is truly groundbreaking, and a major improvement over NLTK. I also recommend gensim, another phenomenal library for NLP.