We can only speculate what data closed LLMs were trained on, but I'd be highly surprised if Google/OpenAI had exclusive access to a bigger repository of written data than, well, the internet, as it presents itself to the world at large.
Which is the reason why things like https://pile.eleuther.ai/ exist.
So no, there is no data-moat.
> , and both require productization.
Products can be developed by basically every group with the passion to do so, even in an OSS setting. A great example is InvokeAI, a stable diffusion implementation that, while it doesn't (yet) offer the customization and extensibility of AUTOMATIC1111, has a pretty superb UX.
So no, there is no productization-moat either.
> All of the specialized and local models need access to user data for their task
What exactly would they require "user data" for?
The LLM plugin I use for coding tasks requires access to my current vim-buffer, which the plugin provides. My script for generating documentation from API code requires only the API code. When I use an LLM to write an email, the data it requires is the prompt and some datetime information, which my script provides.
Even the existing cloud-based solutions don't need access to user data to perform their functions.
> Everything needs to be deployed, and BigCorp can just push it as an OS update.
And app providers can just update an app. LLMs don't have some special requirements that would make updating integrated versions any more difficult than upgrading other software.
> BigCorp could achieve a strong product moat and improve its own performance beyond GPT-4.
By doing what, deploying ever larger models? Attention-based transformers scale as O(n^2) in sequence length, so that's unlikely to happen unless there is some architectural breakthrough. Which is far more likely to happen in OSS first, due to the aforementioned next to limitless talent pool.
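To make the quadratic blow-up concrete, here's a back-of-the-envelope sketch (constants and the rest of the transformer layer are ignored; the numbers are illustrative, not taken from any specific model):

```python
def attention_cost(n: int, d: int = 64):
    """Back-of-the-envelope cost of one vanilla self-attention head.

    Materializing the n x n score matrix QK^T is what gives attention
    its O(n^2) scaling in sequence length n; d is the per-head dimension.
    """
    score_entries = n * n          # entries in the attention matrix
    matmul_flops = 2 * n * n * d   # QK^T plus the weighted sum over V
    return score_entries, matmul_flops

# Doubling the context length quadruples both numbers:
small = attention_cost(2048)
large = attention_cost(4096)
```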
> Right now, OSS is behind where it matters, and there's no guarantee this would change
OSS powers basically everything in the world of computing minus office software, desktops, and gaming PCs, and that isn't for a lack of capability. So, purely based on experience and history, I think it's very unlikely that this won't change, and quickly.
>We can only speculate what data closed LLMs were trained on, but I'd be highly surprised if Google/openai had exclusive access to a bigger repository of written data than, well, the internet, as it presents itself to the world at large.
>What exactly would they require "user data" for?
There are several classes here:
A) Total internet data. Google/OpenAI may have more data from Google Books/GSuite/etc. but maybe not. No way to know. Maybe even if they do, it's not significant compared to total data volume. Since we can't meaningfully compare, let's just ignore it.
B) Global usage data. This is useful to further tune the model - we saw what the open models could do with a partial log of ChatGPT. OpenAI of course has the entire log. For example, it's possible that users in country X ask for stuff in a different manner, or that terms have a local meaning the model may not be aware of. Language evolves after all. A local model can at most update on current user data, or by much slower updates from the origin, and OSS has fewer resources here.
C) Local usage data. For example, a company may wish an LLM to access all its documents to create a knowledge base. There's a good chance all the documents are stored in Office 365/GSuite. You can guess who has easy access and who gets the scary permission prompts. Another example: The LLM writing an email may wish to be aware of the previous communication in the thread and your general tone. Or replace Spotlight/Windows search with an LLM, but the LLM needs access to all your data to properly search it. Some of this can be emulated with really long prompts, but it's more efficient to just let the LLM have access.
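The knowledge-base case is essentially retrieval-augmented generation: search the documents first, then hand only the top hits to the model. A toy sketch of that access pattern, with word-overlap scoring standing in for real vector embeddings (the documents and function are invented for illustration):

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.

    A real system would use vector embeddings, but the shape of the
    pipeline is the same: retrieve first, then pass only the top hits
    to the model instead of the whole corpus.
    """
    q = set(query.lower().split())
    scored = sorted(documents, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

docs = [
    "Q3 budget: marketing spend increased by 12 percent.",
    "Office party scheduled for December 15th.",
    "Q3 headcount: two new hires in engineering.",
]
context = retrieve("what happened to the Q3 budget", docs)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + "\nQuestion: what happened to the Q3 budget?")
```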
>Even the existing cloud based solutions don't need access to user-data either to perform their functions.
Currently no, but the personal assistant they want to build will require it.
>Products can be developed by basically every group with the passion to do so
>By doing what, deploying ever larger models?
Alas, OSS devs tend to get bored on 'non-sexy' subjects. Meanwhile, Microsoft and Google will embed LLMs in all their apps. The apps have their own moats (data migration, UX) and in turn act as a moat for the LLM.
Moat in action:
Imagine Thunderbird worked with an OSS LLM and Outlook works with OpenAI GPT. A user has meetings in Outlook and uses GPT to do various related planning. Say the user was willing to migrate to the OSS LLM. But the OSS LLM doesn't have an easy interface with Outlook (Microsoft 'competitive' behaviour), and manually importing all the time is too messy. The user may even consider switching to Thunderbird, but Thunderbird doesn't do ActiveSync, and IT refuses to even consider allowing IMAP in its Exchange, so the user is stuck with Outlook and in turn with OpenAI GPT. Doing ActiveSync is boring for OSS devs, so Microsoft gets an indirect moat: Exchange <=> Outlook <=> OpenAI.
SD is way more in tune; the subject is way more popular with devs, I guess. These people have a chance. They don't, however, need to brag about the inevitable victory of SD or how Adobe is going down.
>> Everything needs to be deployed, and BigCorp can just push it as an OS update.
>And app providers can just update an app.
Deploying an app involves more friction. How do you get users to install it in the first place? Not impossible (see Google Chrome over Internet Explorer), but a struggle where the OS maker has a built-in advantage.
>By doing what, deploying ever larger models?
A bit of that, but I expect more effective tuning because they have way more usage data.
>Attention based transformers have O(n^2) scaling
There are numerous papers trying to improve that. We'll see.
Firstly, this would require re-training and tuning the model CONSTANTLY. Which is computationally expensive, on top of the already expensive running of the trained model. So this isn't happening, least of all in a local context.
Secondly, we haven't even talked about the legal implications of using global usage data for training the next generation of models. I would love seeing corporations trying to explain that to, say, the EU regulators, with regards to the GDPR.
> Alas, OSS devs tend to get bored on 'non-sexy' subjects.
I have already given an example for an OSS software product in the generative AI space with superb UX. I can produce countless other examples across all realms of software. Take a look at Krita. The Dolphin file browser. The entire KDE desktop environment. LibreOffice. Firefox. Blender.
And btw, there are also countless commercial products with horrible UX.
> SD is way more in tune, subject is way more popular with devs I guess.
SD benefits from having a base model that already meets or exceeds the performance of closed source models. I see no reason why devs wouldn't be equally motivated when a sufficiently advanced LLM base model comes along.
There are definitely challenges, but your own link shows they're already trying, and that's following the more famous Tay failure. The incentives are obvious, while I doubt the challenge - at least regarding global user data - is insurmountable. It's rather well suited to BigCorp capabilities (and more difficult for OSS).
BigCorps are perfectly willing and able to deploy an army of moderators if required. Not too different from what OpenAI used to jumpstart its GPT. The reward is a market valued in the billions; the moderators get minimum or third-world wages. If I were a BigCorp I'd jump on it.
[EDIT: we can see from the front page MEZO article that tuning does not have to be computationally expensive]
>I would love seeing corporations trying to explain that to, say, the EU regulators, with regards to the GDPR.
I don't think it's a big problem: For one, BigCorp is truly not interested in PII for training the model. Compared to what they're already doing in other fields, there's no reason they shouldn't be able to pass retraining easily.
>SD benefits from having a base model that already meets or exceeds the performance of closed source models.
I wanted to avoid saying it, but there are obvious SD use cases which the typical commercial interests would rather avoid. There are very motivated existing communities, which are far more likely to have a GPU. Adobe is weaker overall. The model is more accessible compared to still non-trivial LLM initial training. IMHO, these are more likely reasons than the raw technical comparison, which I don't think the regular user or even regular dev bothers with.
>I have already given an example for an OSS software product in the generative AI space with superb UX
Which is why I bother writing these comments. Because OSS can compete by being good enough. But I see BigCorp strategies which give the incumbents a good chance to keep a stranglehold given the way it's going currently. Right now the ecosystem tends towards overconfidence (dumb dumb Google memo), and I think highlighting the challenges may help correction in time.
>there are also countless commercial products with horrible UX.
True. Which shows moats have more reason than technical comparisons.
>Why? App stores exist.
You still need visibility to get users to install your model. There's the (surmountable) technical challenge of deployment across varied configurations. Did I mention the biggest App stores are run by the closed competition? [Insert million HN threads about App store policies]
It should be obvious that the companies who get to install their model API by default without asking the user have an advantage, and the devs having to submit their models for approval by these companies are at a disadvantage.
> I wanted to avoid saying it, but there are obvious SD use cases which the typical commercial interests would rather avoid.
There are also many more use cases that don't fall into these categories.
My point still stands: SD is a prime example for what happens when a desirable OSS technology becomes competitive in quality and is then tinkered with by a near limitless amount of creative and talented developers.
> to keep a stranglehold given the way it's going currently.
> Which shows moats have more reason than technical comparisons.
This isn't office software; there is no "we always used X" factor, since the technology is still in its early phase, and my thoughts about interactions with "user data" in the LLM space have been outlined above.
Again: Strangleholds, moats, whatever we want to call it, only work if there is a competitive advantage that the competition cannot reach itself. So far, that's better model performance and ease of use. The former gap is shrinking with every week, the latter will resolve itself the same way it did for SD once the performance is good enough.
When that happens, OSS solutions are the ones with advantages that cannot be easily imitated: They run on premises, only cost utilities, can work offline, and can be endlessly tinkered with and improved upon by a near limitless talent pool.
But, as has been said before, much remains to be seen, and there are many unknown factors that will influence the outcome.
Therefore I thank you for the discussion. I look forward to seeing the next developments in this tech, and I'm confident that we're going to see OSS being as successful in the LLM space as it is in most areas of computing.