I don't disagree with your point, but the unsupervised aspect of NLP typically isn't useful on its own. Usually it's a form of pre-training to help supervised models perform better with less data.
From Google in 2018:
"One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch."
As I said, I'm an NLP researcher and practitioner, so you don't need to quote this at me.
The unsupervised aspect is the engine driving all modern NLP advancements. Your comment suggests that it is incidental, which is far from the case. Yes, it is often ultimately then used for a downstream supervised task, but it wouldn't work at all without unsupervised training.
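To make the pre-train-then-fine-tune pattern concrete, here's a toy sketch (all corpora and names below are made up for illustration, not from any real system): "pre-train" word representations from co-occurrence counts on unlabeled text, then "fine-tune" with a tiny labeled set. Real systems learn embeddings with neural models rather than raw counts, but the division of labor is the same.

```python
# Hypothetical toy corpora: a larger unlabeled corpus and a tiny labeled one.
unlabeled = [
    "the movie was great and the acting was wonderful",
    "a great film with wonderful acting",
    "the movie was terrible and the plot was awful",
    "a terrible film with an awful plot",
    "wonderful great acting",
    "awful terrible plot",
]

# Unsupervised step: within-sentence co-occurrence counts stand in for
# learned embeddings -- no labels involved.
vocab = sorted({w for s in unlabeled for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
cooc = [[0.0] * len(vocab) for _ in vocab]
for s in unlabeled:
    words = s.split()
    for w in words:
        for c in words:
            if w != c:
                cooc[idx[w]][idx[c]] += 1.0

def embed(sentence):
    """Average the pre-trained vectors of in-vocabulary words."""
    vecs = [cooc[idx[w]] for w in sentence.split() if w in idx]
    if not vecs:
        return [0.0] * len(vocab)
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Supervised step: a nearest-centroid classifier "fine-tuned" on just
# two labeled examples, which works only because the representations
# already encode which words cluster together.
labeled = [("great wonderful movie", "pos"), ("terrible awful movie", "neg")]
centroids = {lab: embed(text) for text, lab in labeled}

def classify(sentence):
    v = embed(sentence)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(v, centroids[lab])))
```

With only two labeled examples, the classifier still generalizes to unseen sentences like "wonderful acting", because the unsupervised step did most of the representational work.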
Indeed, one of the biggest applications of deep NLP in recent times, machine translation, is (arguably) entirely unsupervised.

I didn't mean to make it sound incidental, although I do see your point. I just wanted to chime in on how important having a labeled dataset is for a successful ML project.
I think the point is that labeling itself is very difficult except in special, limited domains. Manually constructed labels, like hand-engineered features, are not robust and do not advance the field in general.
That makes sense. I'm coming from the angle of applied ML, where solutions need to solve a business problem rather than advance the field of ML. In consulting, many problems can't be solved well without a labeled dataset, and when one is lacking, less credible data scientists will claim they can solve the problem in an unsupervised manner.
For sure. There are counter-examples, however: fully unsupervised machine translation for resource-poor languages comes to mind, and it is increasingly finding business applications.
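One core trick behind unsupervised MT is aligning two independently trained monolingual embedding spaces with an orthogonal map. A minimal sketch of that alignment step is below, using synthetic vectors; note that fully unsupervised systems must also bootstrap the word correspondences (e.g. adversarially or via iterative refinement), whereas here the correspondences are assumed known so the orthogonal Procrustes step itself stays short.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5     # embedding dimension (toy value)
n = 200   # "vocabulary" size (toy value)

# Synthetic source-language word vectors.
X = rng.normal(size=(n, d))

# Fabricate a ground-truth orthogonal map via QR, then build the
# target-language vectors as a rotation of the source space.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ Q

# Orthogonal Procrustes: the map R minimizing ||X R - Y||_F over
# orthogonal matrices is U V^T, where U S V^T = svd(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
R = U @ Vt

recovered = np.allclose(R, Q)  # True: the learned map matches the rotation
```

Once the spaces are aligned, nearest neighbors across languages yield a seed bilingual lexicon without any parallel data, which is what makes this attractive for resource-poor language pairs.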
I think that in the future, more and more clever unsupervised approaches will be the path to huge AI advances. We've essentially run out of labeled data for a wide variety of tasks.