I've been increasingly wondering whether the field treating LLMs as a continuum, as opposed to a set of distinct thresholds, is leading to erroneous "rules of thumb", since most research on methodology is concentrated in smaller and more accessible model experimentation right now.
We generally recognize (nearly ad nauseam) that mouse models of medical research don't necessarily translate to humans.
Similarly, I'd imagine most would laugh at the idea that a neurology researcher who found the best way to get a fruit fly's brain to navigate a maze should extrapolate that methodology to a dolphin or a chimp's brain.
Maybe we should be defining "weight classes" for LLMs and grouping research based on those classes. For example, "these are the techniques that work best for lightweight models," without assuming they hold as a general rule of thumb for "heavyweight models."
Even the discussion of synthetic data and model collapse is a good example: the effect on model quality might differ significantly between a cheaper, less sophisticated model generating synthetic data to feed back into itself and a much more complex and sophisticated model doing the same. Maybe the lesson is actually "recursive training on synthetic data leads to model collapse in lightweight and medium weight models."
So while the writeup is a great one on fine tuning 7B models with LoRA, I would be curious just what % of the recommendations hold true in replication for even just a 65B model.
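To make the "weight classes" idea concrete, here is a rough sketch of tagging findings by model size so that a result is only claimed for the class it was tested in. The class names, the cutoffs (13B and 70B), and the helper names `weight_class` and `group_findings` are all assumptions made up for illustration, not an established taxonomy.

```python
from collections import defaultdict

# Hypothetical cutoffs in billions of parameters; purely illustrative.
WEIGHT_CLASSES = [
    ("lightweight", 0, 13),
    ("middleweight", 13, 70),
    ("heavyweight", 70, float("inf")),
]


def weight_class(params_billions):
    """Return the (assumed) weight class for a model of the given size."""
    for name, low, high in WEIGHT_CLASSES:
        if low <= params_billions < high:
            return name
    raise ValueError(f"invalid parameter count: {params_billions}")


def group_findings(findings):
    """Group (technique, params_billions) pairs by weight class."""
    grouped = defaultdict(list)
    for technique, size in findings:
        grouped[weight_class(size)].append(technique)
    return dict(grouped)
```

With these made-up cutoffs, a 7B result and a 65B result land in different classes, so a recommendation validated at 7B would not automatically be listed for 65B.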
> research on methodology is concentrated in smaller and more accessible model experimentation
Let's pull out a part of the article: "Note that my experiments also included two arithmetic benchmarks (they are included in my other more technical write-up), on which LoRA-finetuned models performed significantly worse than the pretrained base models. My hypothesis is that the model unlearned arithmetic because the Alpaca dataset did not contain corresponding examples. Whether the model completely lost the knowledge or whether it's because the model can't handle the instructions anymore would require further investigation. However, a takeaway here is that it's probably a good idea to include examples of each task you care about when finetuning LLMs."
The LLM for your average customer service bot isn't going to have a lot of need for arithmetic, or an ability to code! Can we get smaller models without math, or take a big model and prune that section, or better yet tell it "see Wolfram Alpha" ...
Stuff like this is how we go from PhD projects that are black boxes to opaque systems that engineers can beat on till they are clear and well understood.
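A minimal sketch of the "see Wolfram Alpha" idea, assuming the simplest possible setup: anything that looks like plain arithmetic is computed by ordinary code, and everything else goes to the model. The `llm` argument is a hypothetical stand-in for whatever model call you use, and the local evaluator stands in for an external tool; this is not how any particular product does it.

```python
import ast
import operator
import re

# Only these binary operators are allowed in routed expressions.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Digits, whitespace, decimal points, parentheses and + - * / only.
_ARITH = re.compile(r"^[\d\s.+\-*/()]+$")


def safe_eval(expr):
    """Evaluate a plain arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expr, mode="eval"))


def answer(query, llm):
    """Route plain arithmetic to code; send everything else to the model."""
    text = query.strip()
    if _ARITH.match(text) and any(op in text for op in "+-*/"):
        try:
            return str(safe_eval(text))
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass  # not valid arithmetic after all; fall back to the model
    return llm(query)
```

For example, `answer("2 + 3 * 4", llm)` never touches the model, while `answer("Where is my package?", llm)` is passed straight through. A real deployment would need something smarter than a regex to spot math buried inside a sentence.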
I'm not saying that smaller model task specialization is a bad thing. If anything the research kicked off with Orca into using more complex models to jumpstart fine tuning of much smaller models is probably my pick for the most important ML research trend of 2023.
But even in the example you bring up and your comments on it, I'd strongly recommend considering Goodhart's Law - turning a handful of measurements into the target by which we are throwing other things away to improve model scores on those measurements doesn't necessarily represent a path to best in class production feasibility.
I can imagine a number of edge cases where a customer service bot not having basic math capabilities could lead to issues ("Did the package come with at least 4 screws?" "It only had three" "Ok, great - I don't see any issues with the shipment and am denying the return request").
Further, many qualities which probably do matter for applications like customer service (patience, empathy, de-escalation) don't happen to be part of the measurements any LLMs are being optimized to hit, even though they are almost certainly represented at least in part in the pretrained models given their presence in the data.
We've become a bit too focused on optimizing LLMs around measurements reflecting our anchoring biases of what AI should look like as imagined decades ago rather than evaluating the starting point and use cases as they actually occur as we might for a tool by any other name.
Though this is all an entirely different issue from whether different model sizes require their own best practices.