> Before I begin I would like to credit the thousands or millions of unknown artists, coders and writers upon whose work the Large Language Models (LLMs) are trained, often without due credit or compensation
I like this. If we insist on pushing forward with GenAI we should probably at least make some digital or physical monument like "The Tomb of the Unknown Creator".
There's a difference between "I'm willing to exert myself to help people" and "I'm willing to exert myself so that somebody else can slap their name on my effort and make money, then deny me the right to do the same to them".
Also, at a social level - the worst kind of user has always been a help vampire, and LLMs are really good at increasing the number of help vampires.
An ungodly amount of paid software is built on the back of free Stack Exchange answers. Complaining about lack of compensation for answers that ultimately lead to revenue was just as valid 20 years ago as today.
Your intentions might not have been to make money, but you were creating social credit that could be redeemed for a higher and better paying job. With GitHub, Stack Overflow, etc., you are adding to your resume, but with AI, you literally get nothing in return for contributing.
My contributions to Stack Overflow have all been done anonymously, and I haven't ever felt even the slightest bit of desire to link that identity to my real one, add it to my resume, brag to friends.
Having and sharing knowledge to me is its own reward, and I have no intention of profiting from it in any way.
Your own personal way of thinking about the world shines through when you search for ulterior motivations and state as fact that those factors, like a desire for fame or money, must be present.
I understand the perspective and generally have the same stance. But, a subtlety is that you know that "user425712" (made up) is you, and you can see your contributions being upvoted, quoted, discussed, your overall karma increase, etc.
Given that, as a thought experiment, would you be OK with your answers/comments being attributed to others, or that once you've submitted them, there is no linkage to you at all (e.g. you can't even know what you submitted)? Would it be as satisfying of a process if your contributions are just dissolved into a soup of other data irreversibly?
That doesn't sound like a system I'd be as keen to contribute to. Maybe the ulterior motive is at least being able to find my body of work as a source of personal fulfillment. Where is my work in the various LLMs? I have no idea, and will likely never know.
> Given that, as a thought experiment, would you be OK with your answers/comments being attributed to others, or that once you've submitted them, there is no linkage to you at all? Would it be as satisfying of a process if your contributions are just dissolved into a soup of other data irreversibly?
Yes. Wikipedia _almost_ operates like this. I have no expectations of anyone digging into who wrote what, it turns into a soup of information. I still do know I contributed, but I don't care if what I wrote gets rewritten, replaced, improved.
4chan does operate like this, and back in the days that /prog/ had meaningful discussions, I enjoyed participating in threads there.
I have spent the last year in a new area (SQL) and I've written a lot of questions to LLMs, which have been able to answer them well enough for me to make speedy progress.
I'm a big fan of StackOverflow, and Google, and before that reference books to gather and learn.
Each technology builds on the layer before. Information is disseminated, repackaged, reauthored.
I get that some people feel like their contribution should be the end of the line. Despite perhaps that they got that knowledge from somewhere. Do they credit their college professor when posting on Reddit?
So again, thank you for your contribution. Your willingness to answer questions, and the answers you provided, will exist long after you and I do not.
>Your intentions might not have been to make money, but you were creating social credit that could be redeemed for a higher and better paying job. With GitHub, Stack Overflow, etc., you are adding to your resume, but with AI, you literally get nothing in return for contributing.
I guess I never really thought about having a dog in this race (effectively being a tradesman as I am, and not a "content creator"). I did write a ton on ServerFault.com. I guess I am in this unwittingly.
I'm a little salty about the LLM training on my Stack Exchange answers, but I knew what I was getting into when I signed up. I don't really subscribe to notions of "intellectual property" so I don't feel strongly on that front.
It just feels impolite and rude. More like plagiarism and less like copyright infringement. A matter of tact between people, versus a legal matter.
The way LLMs turn the collective human expression into "slop" that "they" then "speak" with a tone of authority about feels scummy. It feels like a person who has read a few books and picked up the vernacular and idiom of a trade confidently lying about being an expert.
I can't attribute that scumminess to the LLM itself, since it's just a pile of numbers. I absolutely attribute that scumminess to the companies making money from them.
re: Stack Exchange social credit and redeeming it - I'm not a good self-promoter, and admittedly ServerFault.com is a much smaller traffic Stack Exchange site than Stack Overflow, but being the top-ranked user on the site for 5+ years didn't confer much in the way of real-world benefits. I had a ton of fun though.
(I got a tiny bit of name recognition from some IRL people and a free trip to the Stack Overflow offices in NYC one time. I definitely got a boost of happiness every time a friend related a story to the effect of: "I ran into an issue, search-engined it, and came up with something you wrote on Server Fault that solved my problem.")
> I absolutely attribute that scumminess to the companies making money from them.
so if they weren't making money (or weren't planning on making any), then would it still be "scummy"?
In other words, do you feel that they're only scummy because they're able to profit off the work (whereas you didn't or couldn't)? Why isn't this sour grapes?
"sour grapes" assumes that both I, and the person using this information, intended to make money and they were better at it.
The reality is many Wikipedia, Stack Overflow, etc. contributors want information to be free and correct, and don't want money, so it's not sour grapes; it's annoyance at a perversion of the intent and vision.
I contributed to wikipedia because I want a free reservoir of human knowledge to benefit all, I want the commons to be rich with information.
Anyone making money off it is scummy not because I couldn't figure out how to, but because they are perverting the intention of information to be free.
Instead, we've ended up with one of the main interfaces to Wikipedia being a paid, often inaccurate chatbot from a for-profit company which doesn't attribute Wikipedia and burns down forests as a side effect.
This isn't sour grapes, this is recognizing exploitation of the commons.
I might suggest that exploiting the commons does not diminish the value or accessibility of the commons. Indeed, it spreads knowledge faster.
Equally I'd suggest that the commons is not free. It has to be paid for by someone. Wikipedia exists by begging for donations. Google sells advertising (as does StackOverflow as job listings) etc.
I mean, the first carpenter who took "common knowledge" and wrote a (paid for) book did the same thing. Knowledge is definitely not free, and it costs money to spread it.
(As an aside, I've been using LLMs for free all year.)
All through history people have exploited the commons. The printing press, books, universities, education, radio, television, through computers, Google, sites like SO. LLMs are just the latest step in a long long line of history.
If you know of the commons, you near certainly know of the "tragedy of the commons". It is VERY clear that exploiting the commons diminishes its value.
There is no such thing as a free lunch. Overgrazing common pasture land results in its decimation.
There are national and international level bodies required to ensure we don't kill all the rhinos. Hell - that we don't kill all the people.
The printing press, universities, education - these are NOT commons in many places, nor do they function as commons. Let alone function as LLMs.
> It is VERY clear that exploiting the commons diminishes its value.
Does it diminish if the commons is knowledge-based, such as online sources? Those sources do not truly disappear after the information is extracted and placed into an LLM.
Unlike a physical commons, which has limitations on use, informational commons don't.
So the fact that someone else is able to gain more value out of the knowledge than others is not a reason to make them scummy - as if they alone don't deserve access to the knowledge that you claim should be free.
If contributors, after seeing how someone else is able to make profits off previously freely available knowledge, feel that they somehow now suddenly deserve to be paid after the fact, then I don't know what to call it but sour grapes.
that social credit remains though? you can still point at your history of answers for how well you understand the thing, and how well you can communicate that knowledge. a portfolio is still a portfolio
These systems will collapse over time because the incentives are being removed for them to exist. So you won’t be able to point to your answers in quora or whatever but they’ll live in the training records and data and in some shape in neural nets being monetized.
I’m not like anti what’s happening or for it, it’s just, that social credit depends on those institutions surviving.
With StackOverflow, GitHub, etc. you would likely have people reach out to you for opportunities. With AI, if you contribute to StackOverflow and if it gets picked up by AI, people may or may not know.
That's a very small price to pay for what we all stand to gain.
I pay for o1-pro cheerfully, but I wouldn't pay anything at all for Stack Overflow. ChatGPT certainly generates its share of BS, but I have yet to have a question rejected because somebody who was using a different language or OS asked about something vaguely similar 8 years ago.
> That's a very small price to pay for what we all stand to gain.
Sure, if AI were made free for everybody (or charged only at the cost to run it).
With Stack Overflow, GitHub and others, there is a mutual understanding that contributing can benefit the contributor. What is the incentive to continue contributing if the social agreement is, you get to help define a statistical weight for the next token and nobody will know?
I think the future business model may require AI companies to pay people to contribute, or it might not be a technology roadblock, but rather a data roadblock that prevents further advancement.
>> With Stack Overflow, GitHub and others, there is a mutual understanding that contributing can benefit the contributor.
I think lots of people contribute everywhere without getting any benefit at all.
I don't doubt that some do contribute hoping for, or expecting, some ancillary benefit.
I'd suggest that pretty much the only tangible benefit I can see is those searching for a job. Contributing in public spaces is a good technique for self-promotion as being skilled in an area.
Then again I'd suggest that the majority of people participating on those sites are already employed, so they're not doing it for that benefit. I'd even argue that their day job accomplishments are likely to be more impressive than their github account when it comes to their next interview.
So perhaps I can reassure you. I'm pretty sure people will continue to absorb information, and skills, and will continue to share that with others. This has been the way for thousands of years. It has survived the inventions of writing, printing, radio, television and the internet. It will survive LLMs.
> This has been the way for thousands of years. It has survived the inventions of writing, printing, radio, television and the internet. It will survive LLM
Guilds used to jealously guard their secrets. Metallurgy techniques were lost when their creators died, or were silenced.
And most crucially - the audience has always been primarily humans. There has never been an audience composition, where authors have to worry about plagiarism as the default.
The idea of free exchange of ideas is something that we enjoyed only recently.
This isn’t naysaying or doom and gloom - this is simply reality. Placing our hopes on the wrong things leads to disappointment, anger and resentment when reality decides our hopes are an insufficient argument to change its ways.
> Then again I'd suggest that the majority of people participating on those sites are already employed, so they're not doing it for that benefit.
But the company benefits from less confusion and a better user experience. Companies are literally paying employees to provide content as it benefits the company.
I do believe there are people who freely choose to contribute with no strings attached, and I guess we'll learn in the coming years if people will contribute their time and effort for benevolent reasons.
This assumes that everything is a single move game; that people will not adapt to the new GenAI internet, nor the new behaviors of corporations.
Heck - how many people will go to stack overflow when they can get pseudo good answers from GenAI in the first place?
And when Stack Overflow is filled with bots, and not humans?
Why would they contribute, when every action they take will simply mean OpenAI or someone will benefit, and some random bot will answer?
Signal vs Noise is what the internet is all about. Plastic was a godsend when it was invented. It’s a plague found at the bottom of the Mariana Trench today.
Just FYI, I'm in the top 250 users on stack overflow and I think I've been contacted like 3 times in over 10 years. I'm not exactly getting a lot of opportunities from it, not that I've advertised as looking either.
> I didn't start contributing to make money, just to share information and learn from each other. that can and should also be good enough for the AI era.
Sure; my content was contributed under CC-BY-SA, and if AI honors the rather simple terms of that license, then it's also good enough for the AI era, just as I had the same expectations of human consumers.
I wonder what would happen if we purposely started polluting the training data? It may be too late for older technology, but technology is always changing. If this is a way for me to protect my job and increase my value, I would actually consider doing this.
performative wokeness. the main factor is whether you are using it or not. let's not claim OP is a better person because they said grace before sinning like the rest of us
I agree, I'm not claiming to be a better person for giving credit where it's due; It's just my habit.
I stayed away from using LLMs for a long time, but then came a point when it became a disadvantage not to use one. Especially as a person belonging to a disadvantaged section of society, I'm forced to use any & all forms of technology which help me & others like me to have some equity.
Hey Abishek I really did not mean to derail the comments with my offhand observation. I actually really liked the attribution and it did not strike me as a (as I've come to learn the term) "land acknowledgement". It came across as sincere.
Besides I don't want a land acknowledgement, I want a statue!!
Not at all. I really like constructive criticism on HN. Maybe the way I wrote it came off as insincere to some, but I credit even the memes I share on social media.
Why can't we train a model only on public domain materials and, where anything is copyrighted, only on materials for which the rights-owners have granted permission?
Because copyright lasts for 75+ years, so there is relatively little that is public domain.
Why should training be subject to copyright (and at what stages of the process)? People learn most of what they know from copyrighted media; giving copyright owners more control over that might be a bad idea.
But AI companies have recently been paying content owners for access (partially to reduce legal risk, partially to get access to material not publicly available), but then giving deep-pocketed incumbents a monopoly might be a bad idea.
Mostly I see two outcomes. Either the holders "lose" and it basically follows "human" rules meaning you can train a model on any (legally obtained) material but there's restrictions on use and regurgitation. I say "human" rules because that's basically how people work. Any artist or writer worth their salt has been trained on gobs of copyrighted material, and you can totally hire people to use their knowledge to violate copyright now.
The other option is the holders "win" and these models must only be trained on owned material, in which case the market will collapse into a handful of models controlled by companies that already own huge swaths of intellectual property. Basically think DisneyDiffusion or RandomHouse-LLM. Nobody is getting paid more but it's all above board since it's been trained on all the data they have rights to. You might see some holders benefit if they have a particularly large and useful dataset, like Reddit or the Wall Street Journal.
People with power and money can get paid. Artists who have no reach and recognition get exploited. Especially those from countries which aren't in North America and Europe.
Should we extend that model to what university professors can teach?
Should we extend that model to text books? If I learn about a topic from a book, can I never write a book of my own on that topic?
Should we extend that model to the web? If I learned CSS and JavaScript reading StackOverflow, am I banned from writing a book, giving classes, or indeed even answering questions on those topics?
I ask this in seriousness. I get that LLM training is new, and it's causing concerns. But those concerns have existed forever - the dissemination of information has been going on a long time.
I'm sure the same moral panic existed the first time someone started making marks in clay to describe what is the best time of year to plant the crops.
No, because none of those things involve creating a copy on a computer which will then be regurgitated w/o acknowledgement of what went before, and w/o any sort of compensation to the previous rights bearer.
Time was if a person read multiple books to write a new text, they either purchased them, or borrowed them from a library which had purchased them, and then acknowledged them in a footnote or reference section.
At least one author noted that there was a concern that writing would lead to a diminishment of human memory/loss of oral tradition (Louis L'Amour in _The Walking Drum_).
> At least one author noted that there was a concern that writing would lead to a diminishment of human memory/loss of oral tradition (Louis L'Amour in _The Walking Drum_).
Really? I can't tell if you're joking, so I'll take it at face value.
See, I associate the earliest famous (I thought) expression of that concern with Plato, and before today I couldn't remember any other associated details enough to articulate them with confidence. ChatGPT tells me, using the above quote without the citation as a prompt, that it was Plato, in his dialogue Phaedrus, and offers additional succinct contextual information and a better quote from that work. I probably first learned to associate that complaint about writing with Plato in college, and probably got it from C.D.C. Reeve, who was a philosophy professor and expert on Plato at the college I attended. But I feel no need to cite any of Reeve's works when dropping that vague reference. If I were to use any of Reeve's original thoughts related to analysis of Plato, then a reference would be merited.
It seems to me that there are different layers of abstraction of knowledge and memory, and LLMs mostly capture and very effectively synthesize knowledge at layers of abstraction that are above that of grammar checkers and below that of plagiarism in most cases. It's true that it is the nature of many of today's biggest transformers that they do in some cases produce output that qualifies as plagiarism by conventional standards. Every instance of that plagiarism is problematic, and should be a primary focus of innovation going forward. But in this conversation no one seems to acknowledge that the bar has been moved. The machine looked upon the library, and produced some output, therefore we should assume it is all theft? I am not persuaded.
Okay maybe maybe but hear me out -- it would be hilarious if this started happening often enough that the LLMs started echoing it XD. That would certainly cause some legal headaches.
both this and land acknowledgements are making sure people are keeping those people in mind. that makes it harder to keep justifying screwing over those people.
I like this. If we insist on pushing forward with GenAI we should probably at least make some digital or physical monument like "The Tomb of the Unknown Creator".
Cause they sure as sh*t ain't gettin paid. RIP.