Currently, we simply use GPT-4 as a judge, with a general, task-agnostic system prompt that we wrote ourselves. The judge's scores are then used to train the neural scoring function, which predicts quality ahead of time. However, it's on our roadmap to make the judging more flexible, potentially with task-specific judge prompts and in-context examples, and perhaps with a jury of judges [https://arxiv.org/pdf/2404.18796].
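
To make the roadmap items concrete, here is a minimal sketch of how a generic prompt, optional task-specific prompts, in-context examples, and a jury could fit together. It is an illustration under stated assumptions, not our actual implementation: `JudgeConfig`, `jury_score`, the task labels, the prompt wording, and the 0 to 1 score scale are all hypothetical, and each judge is modeled as a plain callable so that no particular LLM API is assumed. Averaging is just one simple way to combine jury votes.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A judge maps (system_prompt, user_message) to a numeric score.
# In a real setup this would wrap an LLM call; that wrapper is assumed, not shown.
Judge = Callable[[str, str], float]


@dataclass
class JudgeConfig:
    # Placeholder wording, not the actual system prompt.
    generic_prompt: str = "Score the response to the query from 0 (bad) to 1 (good)."
    # Optional task-specific overrides, keyed by a hypothetical task label.
    task_prompts: Dict[str, str] = field(default_factory=dict)
    # Optional in-context examples as (query, response, score) triples.
    examples: List[Tuple[str, str, float]] = field(default_factory=list)

    def system_prompt(self, task: Optional[str] = None) -> str:
        # Fall back to the generic prompt when no task-specific one exists.
        base = self.task_prompts.get(task, self.generic_prompt) if task else self.generic_prompt
        parts = [base]
        for query, response, score in self.examples:
            parts.append(f"Query: {query}\nResponse: {response}\nScore: {score}")
        return "\n\n".join(parts)


def jury_score(
    judges: Sequence[Judge],
    config: JudgeConfig,
    query: str,
    response: str,
    task: Optional[str] = None,
) -> float:
    """Average the scores of several judges for one (query, response) pair."""
    if not judges:
        raise ValueError("at least one judge is required")
    system = config.system_prompt(task)
    user = f"Query: {query}\nResponse: {response}"
    return mean([judge(system, user) for judge in judges])
```

With a single judge and an empty `task_prompts`, this reduces to the current setup of one judge with one task-agnostic prompt. The resulting scores would serve as training labels for the scoring function in the same way as today.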