Can we assume this is a product of the biased real world training data? Feed an ...

latexr · 2025-07-17T17:30:19 1752773419

> Can we assume this is a product of the biased real world training data?

Of course it is. And we’ve known that to be a problem since before the current rise of LLMs.

https://www.technologyreview.com/2019/01/21/137783/algorithm...

Note the 2019 date. But I’m certain I’ve seen reports earlier.

And as a sibling comment put it: “There is no unbiased training data. That's the problem.” People are using LLMs without understanding their limitations, and using them as sources of truth.

roxolotl · 2025-07-17T17:41:56 1752774116

Weapons of Math Destruction[0] is a 2016 book on this topic. And that’s building on years of prior research. It’s not a new concept at all.

[0]: https://en.m.wikipedia.org/wiki/Weapons_of_Math_Destruction

throwaway290 · 2025-07-17T17:21:24 1752772884

There is no unbiased training data. That's the problem.

Think about it... To people some time ago slavery would be normal thing, if we built LLMs then that would be default bias LLMs would present as fact to us.

potato3732842 · 2025-07-17T17:38:58 1752773938

>There is no unbiased training data. That's the problem.

Exactly. Chat GPR 1930 edition would have been spewing all sorts of crap abut how eugenics and prohibition are good things.

advisedwang · 2025-07-17T17:31:16 1752773476

That is a safe assumption. And it is useful to think about if you are working to improve LLMs.

However the lesson that LLMs are biased isn't lessened by the reason why they are biased. Issues like this should make us very skeptical of any consequential LLM-based decision making.

GuinansEyebrows · 2025-07-17T17:24:55 1752773095

can we assume that the LLM has been trained on not just real-world data but also content that discusses things like gender/ethnic pay gaps, the causes thereof and ameliorative strategies? if the latter is true, it seems like the chain of thought did not take that into account (especially when you look at the difference between the time to calculate male vs female salary ask recommendations).

im3w1l · 2025-07-17T17:44:47 1752774287

It certainly seems plausible, but I wouldn't entirely rule out other possibilities.

Do to give an example if you present the LLM with two people that are exactly the same except they have different color shirts I think it will suggest slightly different salary for one than the other for no clear reason and without any obvious bias in the training set.

andrewmcwatters · 2025-07-17T17:23:11 1752772991

I think the current state of the art consensus is that you don't want to filter training data so much as you want to fine-tune behavior out of models in the same way that you probably don't want to shelter children too much, but to explain what is right and wrong to them.

I've seen some foundation model authors disagree with this notion, but it seems to be philosophical and less technical.

Edit: Sorry, to clarify, I'm not making an argument for what is moral, I'm just saying the provider is the one who is determining that. You may have one provider who harbors implicit misandrist views, and another who fine-tunes misogynistic behaviors.

frizlab · 2025-07-17T17:27:35 1752773255

And what is right and wrong, please do tell? Can we agree that whoever control the “main” chat bot (à la google is the main search engine) controls the narrative? Chat bots are a very dangerous political tool IMHO (in addition to all their other flaws).

gessha · 2025-07-17T17:34:23 1752773663

I do agree that the LLM provider is controlling the narrative.

The difficult part is you can’t (it’s not responsible to) go full libertarian on this. You have to draw the line somewhere and LLM providers are tiptoeing around morals/ethics/regulations and things can change day to day as to what’s allowed for these models to output.

frizlab · 2025-07-17T19:40:24 1752781224

I agree you cannot go full libertarian either. But I’m not trusting a company with the ethics of OpenAI to do the right thing. Nor any company for that matter…

belter · 2025-07-17T17:33:48 1752773628

> Can we assume this is a product of the biased real world training data? Feed an LLM data that shows women (unfairly) earn less on average and you’ll get advice that they should earn less than average.

And this is the best argument to demonstrate they are not smart