Thinking about high-quality human data (lilianweng.github.io)
103 points by tim_sw on Feb 9, 2024 | 4 comments


> The NCV (Noisy Cross-Validation) method (Chen et al. 2019) divides the dataset into half at random, and then identifies data samples as “clean” if its label matches the predicted label provided by the model that is only trained on the other half of the dataset.

I was doing this trick in 2018 but didn't write anything up. If you repeat the process a few times with different random splits, you get a finer-grained per-example difficulty signal, so you can validate only the hard examples by hand, or just skip them.
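
A rough sketch of what that repeated split could look like, in case it helps. This is not the code from the Chen et al. paper; the function names, the toy nearest-centroid model, and the demo data are all made up for illustration. The idea is just: split in half at random, train on one half, predict the other, swap, repeat, and count how often each example's label agrees with the held-out prediction.

```python
import random
from collections import defaultdict


def nearest_centroid_fit_predict(train_X, train_y, test_X):
    """Hypothetical stand-in model: predict the label of the closest class centroid."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(train_X, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        # Squared Euclidean distance to each centroid; pick the nearest class.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return [predict(x) for x in test_X]


def agreement_scores(X, y, fit_predict, rounds=10, seed=0):
    """Fraction of rounds in which each label matched its held-out prediction.

    Each round shuffles the data, splits it in half, and predicts each half
    with a model trained only on the other half. Low scores flag examples
    that are hard or possibly mislabeled.
    """
    rng = random.Random(seed)
    n = len(X)
    hits = [0] * n
    for _ in range(rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        half_a, half_b = idx[: n // 2], idx[n // 2:]
        for train, test in ((half_a, half_b), (half_b, half_a)):
            preds = fit_predict(
                [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
            )
            for i, p in zip(test, preds):
                hits[i] += int(p == y[i])
    return [h / rounds for h in hits]


# Toy data (an assumption for the demo): two well-separated clusters,
# with the label of example 0 deliberately flipped.
demo_X = [(i * 0.1, 0.0) for i in range(20)] + [(10 + i * 0.1, 0.0) for i in range(20)]
demo_y = [0] * 20 + [1] * 20
demo_y[0] = 1

demo_scores = agreement_scores(demo_X, demo_y, nearest_centroid_fit_predict, rounds=5)
# Examples sorted from least to most agreement; review the front of this list by hand.
review_order = sorted(range(len(demo_scores)), key=lambda i: demo_scores[i])
print(review_order[:5])
```

With a single round this reduces to the clean/not-clean split from the quote; more rounds turn that into a score between 0 and 1 per example. In practice you would swap the toy model for whatever you are actually training.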


For those who don't know, Lilian Weng wrote one of the best "prompt engineering" how-tos on the planet. It is beautifully succinct.

https://lilianweng.github.io/posts/2023-03-15-prompt-enginee...


More than that, she writes great reviews of all sorts of ML stuff. They're about as good as it gets for getting started on whatever topic.


Discussed here:

Prompt Engineering: Steer a large pretrained language model to do what you want - https://news.ycombinator.com/item?id=35227358 - March 2023 (49 comments)



