(Disclaimer: I'm the founder of OpenPipe, one of the fine-tuning services OP tried and ultimately the one that produced the highest performing model, it appears.)
Data extraction is a use case that fine-tuned models are fantastic at, so I'm not surprised that OP got good results. That said, I've also found it's pretty easy to beat GPT-4 across many task types if you have a way of getting strong training data. We published some research[1] a week ago where we found that across 4 example tasks (creative summarization, question answering, data extraction, and classification), a fine-tuned Llama 3 8B outperformed GPT-4 on 3 of them. The key was creating a repeatable way of generating high-quality training data, which is also addressed in the post.
Is this something that, as a tech enthusiast who's no expert, I can easily fine-tune and run?
My use case would be fine-tuning on technical docs: specific news, 2 years of blog posts, primary source material, and Twitter explainer threads. I want to gather all the niche information on a topic from the last two years, dump it in, and end up with an LLM that's a subject-matter expert.
Fine-tuning doesn't quite work that way. You have to format the training data as request/response pairs. The point of fine-tuning is to get the model to output things in a specific format, style, or structure.
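To make the request/response point concrete, here's a minimal sketch of what a training set typically looks like, one JSON object per line (JSONL), using the OpenAI-style chat format; the task and field names are just illustrative:

```python
import json

# Each training example is a request/response pair: the prompt the model
# will see, and the exact completion you want it to learn to produce.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract the product name and price as JSON."},
            {"role": "user", "content": "The Acme Widget is on sale for $19.99."},
            {"role": "assistant", "content": '{"product": "Acme Widget", "price": 19.99}'},
        ]
    },
]

# Serialize as JSONL: one example per line, ready to upload to a
# fine-tuning service.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Note there's no natural request/response structure in a pile of blog posts and tweets, which is why raw docs don't slot directly into fine-tuning.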
Your use case is better suited to RAG. This is where you retrieve data from a large dataset and inject it into the user's request so the AI model has the context it needs to answer accurately.
But that's not a silver bullet either: you'd need to spend significant time on chunking strategy and result ranking to get decent answer accuracy.
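The retrieve-and-inject loop can be sketched in a few lines. This is a toy version using bag-of-words overlap as the retrieval score; a real system would use embeddings, proper chunking, and a reranker, and the documents here are made up:

```python
import math
from collections import Counter

# Toy corpus standing in for "2 years of blog posts and threads".
docs = [
    "The Foo API rate limit is 100 requests per minute.",
    "Blog post: how we migrated from REST to gRPC in 2023.",
    "Twitter thread: the new auth flow uses rotating tokens.",
]

def score(query: str, doc: str) -> float:
    # Crude relevance: word overlap, length-normalized.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values()) / math.sqrt(len(doc.split()) + 1)

def retrieve(query: str, k: int = 1) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # Inject the retrieved chunks into the request so the model
    # has the context it needs to answer accurately.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

prompt = build_prompt("What is the Foo API rate limit?")
```

The hard parts the parent mentions (chunking, ranking) all live inside `score` and `retrieve` in a real system.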
Here is an example on the Predibase platform, which is referenced in the article for the Solar model but can also train Llama-3, Phi-3, and Mistral: https://www.youtube.com/watch?v=R2JQhzfaOFw&themeRefresh=1 I think you can judge for yourself whether it's easy enough for you. (Predibase founder here)
Why isn't someone providing a "meta model" that uses an LLM to choose between various fine-tuned models depending on the question, to get better overall results than GPT-4?
Founding AI Engineer at OpenPipe here. Using a fine-tuned "router LLM" to route between various specialized models (fine-tuned or not) depending on the input is becoming a common pattern in more modern "graph-like" LLM applications.
You can see how that "routing function" could include a call to a "router LLM." And yes, fine-tuning is a great way to improve the routing intelligence of said router LLM.
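A minimal sketch of that routing function, assuming `call_router_llm` stands in for the fine-tuned router model (stubbed here with keyword rules so the sketch runs standalone; the model names are invented):

```python
# Map of router labels to hypothetical specialized models.
SPECIALISTS = {
    "extraction": "ft-llama3-extract",
    "summarization": "ft-llama3-summarize",
    "general": "gpt-4",
}

def call_router_llm(user_input: str) -> str:
    # In production this would be a call to a small fine-tuned router LLM
    # that returns exactly one label; stubbed with keyword matching here.
    text = user_input.lower()
    if "extract" in text:
        return "extraction"
    if "summarize" in text:
        return "summarization"
    return "general"

def route(user_input: str) -> str:
    label = call_router_llm(user_input)
    return SPECIALISTS.get(label, SPECIALISTS["general"])

model = route("Extract all email addresses from this page")
```

Fine-tuning the router is attractive because the label space is small and fixed, which is exactly the kind of classification task small fine-tuned models handle well.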
Worth mentioning that you don't even need separate models to implement this. Dynamically loading LoRA adapters onto a single base model is much more efficient, and is the approach Apple took.
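A toy illustration of why adapter swapping is so cheap: each LoRA adapter is just a low-rank pair (A, B) whose product patches the shared base weight matrix, so per-task state is tiny compared to a full model copy. Shapes and task names are illustrative, not any real model's:

```python
import random

d, r = 8, 2  # hidden size, adapter rank (real models: d in the thousands, r ~ 8-64)
random.seed(0)

def mat(rows: int, cols: int) -> list[list[float]]:
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W_base = mat(d, d)  # shared base weights, loaded once

# One small (A, B) pair per task, swapped in per request.
adapters = {
    "extraction": (mat(r, d), mat(d, r)),
    "summarization": (mat(r, d), mat(d, r)),
}

def effective_weights(task: str):
    A, B = adapters[task]
    return matadd(W_base, matmul(B, A))  # W + BA, no reload of W_base
```

Here each adapter costs 2*r*d = 32 floats versus d*d = 64 for another full matrix, and the gap widens as d grows; that's why serving many adapters on one base model beats hosting many separate fine-tuned models.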
It seems like you don't work for OpenPipe (OP), so it probably doesn't matter for you, but it could (and should) matter a whole lot to OpenPipe and/or their customers.
[1]: https://openpipe.ai/blog/mixture-of-agents