I've been doing this for one of the major companies in the space for a few years now. It has been interesting to watch how much more complex the projects have gotten, and how many issues the models still have. I have a humanities background, which has actually served me well here, as what constitutes a "better" AI model response is often so subjective.
I can answer any questions people have about the experience (within code of conduct guidelines so I don't get in trouble...)
Thank you, I'll bite. If within your code of conduct:
- Are you providing reasoning traces, responses or both?
- Are you evaluating reasoning traces, responses or both?
- Has your work shifted towards multi-turn or long-horizon tasks?
- If you also work with chat logs of actual users, do you think that they are properly anonymized? Or do you believe that you could de-anonymize them without major effort?
- Do you have contact with other evaluators?
- How do you (and your colleagues) feel about the work (e.g., moral qualms because you're "training your replacement", pride because you're furthering civilization, or is it just about the money...)?
What kinds of data are you working on? Coding? Something else?
I've been curious how much these AI companies are looking for more niche coding-language expertise, and what other knowledge frontiers they're focusing on (like law, medicine, finance, etc.).