I've been increasingly wondering if the field considering LLMs as a continuum as...

zer00eyz · on Nov 20, 2023

> research on methodology is concentrated in smaller and more accessible model experimentation

Let's pull out a part of the article: "Note that my experiments also included two arithmetic benchmarks (they are included in my other more technical write-up), on which LoRA-finetuned models performed significantly worse than the pretrained base models. My hypothesis is that the model unlearned arithmetic because the Alpaca dataset did not contain corresponding examples. Whether the model completely lost the knowledge or whether it's because the model can't handle the instructions anymore would require further investigation. However, a takeaway here is that it's probably a good idea to include examples of each task you care about when finetuning LLMs."

The LLM for your average customer service bot isn't going to have a lot of need for arithmetic, or an ability to code! Can we get smaller models without math, or take a big model and prune that section, or better yet tell it "see Wolfram Alpha" ...
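
A rough sketch of the "see Wolfram Alpha" idea, with everything here being an assumption for illustration: a tiny local calculator stands in for the external tool (no real Wolfram Alpha API is called), and `ask_model` is a hypothetical callable wrapping whatever small model you run. Plain arithmetic goes to the calculator; anything else falls through to the model.

```python
import ast
import operator

# Binary operators the stand-in calculator "tool" supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate plain arithmetic such as '3 * (4 + 5)' without using eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("not plain arithmetic")

    return _eval(ast.parse(expression, mode="eval"))


def route(message: str, ask_model) -> str:
    """Send arithmetic to the calculator; hand everything else to the model.

    `ask_model` is a hypothetical callable (str -> str), not a real API.
    """
    try:
        return str(calculate(message))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return ask_model(message)
```

Something like `route("3 * (4 + 5)", ask_model)` never touches the model, while `route("Where is my package?", ask_model)` does. A real router would need to pull the arithmetic out of natural-language messages, which this sketch doesn't attempt.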

Stuff like this is how we go from PhD projects that are black boxes to opaque systems that engineers can beat on till they are clear and well understood.

kromem · on Nov 20, 2023

I'm not saying that smaller model task specialization is a bad thing. If anything, the research kicked off by Orca into using more complex models to jumpstart fine-tuning of much smaller models is probably my pick for the most important ML research trend of 2023.

But even in the example you bring up and your comments on it, I'd strongly recommend considering Goodhart's Law - turning a handful of measurements into the target, and throwing other things away to improve model scores on those measurements, doesn't necessarily represent a path to best-in-class production feasibility.

I can imagine a number of edge cases where a customer service bot not having basic math capabilities could lead to issues ("Did the package come with at least 4 screws?" "It only had three" "Ok, great - I don't see any issues with the shipment and am denying the return request").

Further, many qualities that probably do matter for applications like customer service, such as patience, empathy, or de-escalation, don't happen to be among the measurements any LLMs are being optimized to hit (even though they are almost certainly represented at least in part in the pretrained models, given their presence in the data).

We've become a bit too focused on optimizing LLMs around measurements that reflect our anchoring biases about what AI should look like, as imagined decades ago, rather than evaluating the starting point and use cases as they actually occur, as we might for a tool by any other name.

Though this is all an entirely different issue from whether different model sizes require their own best practices.