
You don't need parallel corpora -- the OP was generating incorrect versions of the training data through random perturbation (dropping articles, etc.)
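A minimal sketch of that perturbation idea, assuming English text and a plain whitespace tokenizer; the function and corpus here are illustrative, not the OP's actual code:

    import random

    ARTICLES = {"a", "an", "the"}

    def drop_articles(sentence: str, p: float = 0.5) -> str:
        """Randomly drop articles to synthesize an 'incorrect' version."""
        tokens = sentence.split()
        kept = [t for t in tokens if t.lower() not in ARTICLES or random.random() > p]
        return " ".join(kept)

    # (correct, incorrect) training pairs from a monolingual corpus
    corpus = ["She adopted a cat from the shelter."]
    pairs = [(s, drop_articles(s)) for s in corpus]
    print(pairs)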


The author can probably get a lot of traction by doing random perturbation, but they won't be getting things in the correct ratios (i.e. lots of dropped hyphens but not many dropped apostrophes) and probably won't make all the types of errors that humans actually make. It will work, but a huge part of doing machine learning and NLP is getting those ratios right. This is one reason Google slays with their translator: they have huge corpora that let them capture fine-grained distinctions in their models.
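One hedged way to approximate those ratios is to sample error types from a weighted distribution rather than uniformly; the weights below are made-up placeholders, and in practice you would estimate them from real human error data:

    import random

    def drop_hyphens(s: str) -> str:
        return s.replace("-", " ")

    def drop_apostrophes(s: str) -> str:
        return s.replace("'", "")

    def drop_articles(s: str) -> str:
        return " ".join(t for t in s.split() if t.lower() not in {"a", "an", "the"})

    # Hypothetical relative frequencies of each error type.
    ERRORS = [(drop_hyphens, 0.5), (drop_apostrophes, 0.1), (drop_articles, 0.4)]

    def perturb(sentence: str) -> str:
        funcs, weights = zip(*ERRORS)
        f = random.choices(funcs, weights=weights, k=1)[0]
        return f(sentence)

    print(perturb("It's a well-known trade-off in the field."))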




