The documents provide smoking-gun evidence that Meta, Google, Snap, and TikTok all purposefully designed their social media products to addict children and teens, with no regard for known harms to their wellbeing, and that mass youth addiction was core to the companies’ business models. The documents contain internal discussions among company employees, presentations from internal meetings, expert testimony, and evidence of Big Tech coordination with tech-funded groups, including the National Parent Teachers Association (PTA) and Family Online Safety Institute (FOSI), in attempts to control the narrative in response to concerned parents.
“These unsealed documents prove Big Tech has been gaslighting and lying to the public for years
How did that work mechanically though? At YT we were banned from doing basically anything with pre-18yo data, even if we only suspected they might possibly not be an adult -- no A/B tests, no ML, no ad targeting, no nada. Did leadership design a system where those sorts of things would happen anyway? Were there just enough rogue teams to cause problems?
For business, government, and religion: achieving scale and centralization necessarily leads to corrupt outcomes. This is also where Marx’s legitimate criticisms of capitalism turn into a solution that is essentially their doppelgänger: a scaled system of corruption with absolute authority and the rhetorical veneer of democracy.
Are people surprised by this? Clearly this was a tactic widely used in the tech industry. Their aim is to keep people on the platform, specifically teens. Why else would you need curated algorithms for users?
Indeed, I've been paying attention to the market share of the biggest companies in advertising, and it's clear that Google and Meta hold the largest share by a wide margin, almost not even comparable to other players, except Reddit recently, whose exposure depends on search engines like Google and Bing. Users within the platform are a different story. I personally think the internet is not being utilized wisely in the current context. There is still so much to be done and innovated, and there are gatekeepers keeping this from happening.
Companies don't necessarily have to suffer when restrictions are placed on them.
Ask any educator what the biggest positive change was to U.S. high schools in the 1970s and they'll probably answer that it was the ban on smoking in schools.
I expect a similar response in the future regarding bans on social media.
I can only imagine what it's like right now in schools. I can't see how anybody could argue that letting students use social media in school is an okay activity. I know some countries in Europe and elsewhere are banning it, though I can't think of which ones right now.
> Why does open source have to mean that you, the user, are on equal footing with where the copyright holder was before they even started? Where is that written?
It's written on the OSI page about license approval:
"""
The license does not have terms that structurally put the licensor in a more favored position than any licensee.
"""
People still chose to compile emacs from scratch rather than modify the binary. The source code was the preferred form for modifications.
The same is not true of these models. To my knowledge, no company has retrained a model from scratch to make a modification to it. They make new models, but those are fundamentally different works with different parameter counts and architectures. When they want to improve a model they already built, they fine-tune the weights.
If that's what the companies that own all the IP do, that tells me that the weights themselves are the preferred form for making modifications, which makes them source code under the GPL's definition.
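A toy sketch of that workflow (not any company's actual pipeline, just an illustration with made-up numbers): you start from released weights and continue gradient descent on new data, never touching the original training set.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" weights for a tiny linear model y = x @ w.
w = rng.normal(size=3)

# Fine-tuning: start from the released weights and take SGD steps
# on new data -- the original training set is never needed.
X_new = rng.normal(size=(32, 3))
y_new = X_new @ np.array([1.0, -2.0, 0.5])  # hypothetical new task

lr = 0.1
for _ in range(300):
    grad = 2 * X_new.T @ (X_new @ w - y_new) / len(X_new)
    w -= lr * grad

# The modified model now fits the new task; the weights were the
# only artifact we needed to edit.
assert np.allclose(X_new @ w, y_new, atol=1e-4)
```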
The problem with the hype around LLMs is that people without much experience in the field can't think of anything else.
So much so that they forget the basics of the discipline.
What do you think cross validation is for?
To compare different weights obtained from different initializations, different topologies, different hyper-parameters... all trained from the same training dataset.
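A from-scratch sketch of that use of cross-validation: same training dataset, two different hyper-parameter settings (here ridge regularization strengths, standing in for topologies or initializations), compared by held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear target plus a little noise.
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validation error for one hyper-parameter setting."""
    folds = np.array_split(np.arange(len(X)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errs))

# Same training dataset, two different settings: CV tells them apart.
for lam in (0.1, 100.0):
    print(f"lambda={lam}: CV MSE = {cv_mse(X, y, lam):.4f}")
```

The lightly-regularized fit should win here, since the data really is linear with small noise; heavy shrinkage underfits.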
Even for LLM, have you ever tried to reduce the size of the vocabulary of, say, Llama?
No?
Yet it's a totally reasonable modification.
What's the preferred form to make modifications like this?
Can you do it by fine-tuning Llama's weights?
No.
You need training data.
That's why the training data is the preferred form for making modifications: whatever the AI (hyped or not), it's the only form that lets you make every modification you want.
It's not just marketing: the European AI Act imposes several compliance obligations on corporations building AI systems, including serious scientific scrutiny of the whole training process.
Such obligations are designed to mitigate the inherent risks that AI can pose to individuals and society.
The AI Act exempts open source from such scientific scrutiny because it's already transparent.
BUT if OSI defines black boxes as "open source", they open a loophole that will be exploited to harm people without being held accountable.
So it's not just marketing, but dangerous corporate capture.
Exactly. Without models being truly open source, (training data, training procedures, alignment etc.), there is no way for auditors to assess, for example, whether a model was trained on data exhibiting certain forms of selection bias (anything from training data or alignment being overly biased towards Western culture, controversial political or moral viewpoints, particular religions, gender stereotypes, even racism) which might lead to dangerous outcomes later on, whether by contamination of derived models or during inference.
> if OSI defines black boxes as "open source", they open a loophole that will be exploited to harm people without being held accountable
The OSI’s definition matches the legal definition in the EU and California (and common use). If the OSI says open data only, it will just be ignored. (If people are upset about the current use, they can make the free vs. open distinction we do in software to keep the pedantic definition contained.)
Obviously I was not aware of this, so the whole decompilation process was a waste of computation time, but it neither proves nor disproves anything about the "model"'s relation to the source dataset.
The term "absorbed" was not for the people in the field, but for people who don't know what folding means.
IMHO it's a better metaphor than "learning", because learning is a _subjective_ experience that everyone has, and using that term inevitably leads to anthropomorphisation.
"Absorb" matches the intuition of filters and pipelines, which can be easily understood by any CS student, any "ML expert", any lawyer, and any other citizen.
____
As for the network, my argument is simple: if I get back the source dataset from the executable, I think we can agree that the dataset is projected onto the numerical matrices that the executable records.
Now where is the dataset?
You might argue that it is recorded _only_ in the gradients logged there (the gradients applied to one single "neuron" for each "layer"), but if so you could reconstruct the source dataset from the logs alone, and in fact, you cannot. You need both the "model" and those gradients in the correct order (and the encodings of the inputs and outputs, obviously).
You might ask: "fine, but how much of the source dataset is projected into the gradients and how much is projected into the model?"
To answer, we need to consider that
- the vector space that constitutes the executable is non-linear (the "model" part) and hierarchical (the vectors of the gradients are independent neither between layers nor between samples)
- (initialization apart) all the values (and the operative value) that the "model" contains come from the source dataset
Thus I argue that a substantial portion of the source dataset is contained in the "model".
This does not exclude that another substantial portion of the source dataset is also contained in the few logged gradients!
And in fact I've never stated that the "model" contained the whole source dataset.
But if the portion contained in the "model" were negligible, you would be able to get back the sources from those logged gradients alone with negligible error.
AFAIK, it is not possible, but if you can, please teach me how!
I'm always more than happy to be proven wrong if I can learn how to do something that I previously thought impossible!
> And in fact I've never stated that the "model" contained the whole source dataset.
Apologies for seeming rude, but I feel that the abstract is disingenuous (I'm assuming you're the author of the article).
The abstract states:
> we provide a ... decompiler that reconstruct the source dataset from the cryptic matrices that constitute the software executed by them.
But that's not what's happening here. Instead, what's happening is (correct me if I'm wrong) that the decompiler uses the gradient information along with the network itself (which is very close to the penultimate network) to reconstruct the input. If we consider, for instance, MSE loss, that reconstruction appears trivial given all the information available. As I said before, this reconstruction (while interesting) does not show copyright violation, because the training-process information is not available once the network is deployed. Obviously the model contains information about its training set; if it didn't, it would be useless.
I'm not saying there aren't obvious copyright issues, and I'm also not saying that the approach to recover the training set is not interesting. I'm just saying that the overall copyright argument has a major gap (and there are more direct alternative arguments).
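To make the "trivial under MSE loss" point concrete, here is a toy single-layer sketch (not the paper's actual pipeline; all names and the logging format are hypothetical): if the log stores a per-step weight update plus the output error, the weight update is a rank-1 outer product of error and input, so the input falls right out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one linear layer y = W @ x, trained with MSE loss.
d_in, d_out, lr = 8, 4, 0.1
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)   # the "training sample" we try to recover
t = rng.normal(size=d_out)  # its target

# One SGD step: dL/dW = 2 * outer(y - t, x), a rank-1 matrix.
err = W @ x - t
W_update = -lr * 2.0 * np.outer(err, x)  # what a gradient log would hold

# Reconstruction: given the logged update and the output error,
# the input is recoverable exactly (up to known scale factors).
x_rec = -(W_update.T @ err) / (2.0 * lr * (err @ err))

assert np.allclose(x_rec, x)
```

Which is exactly the parent's objection: the recovery needs the logged update and error, i.e. information from the training process, not the deployed weights alone.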
Beyond the global output and error of each sample from the last epoch, the log also includes the weight update of one single (fully connected) node for each layer.
During the compilation phase, the training dataset is projected on a complex vector space that is constituted by both the "model" of the "neural network" and these logs.
It's just like projecting a shadow onto a two-dimensional surface: if you discard the data pertaining to one dimension, you have no hope of guessing what projected it; you need both dimensions.
The logs that are preserved in the compilation process are the part of the vector space that is usually discarded during the "training".
But discarding the "model" would have exactly the same effect: you cannot get back the source dataset from those logs alone. That's why this does not "smuggle the training dataset back".
Indeed, the fact that the source dataset is obtainable from the pair "these logs" + "final model", but neither from the logs alone nor from the final model alone, proves that a substantial portion of the source dataset is always embedded in the "model", which becomes a derivative work of the sources.
The last iteration (or epoch) of SGD is not shipped with the trained model. The point just does not stand. There are other (better) arguments for why such models are derivative works.
Basically the argument starts with a claim (you can reconstruct the training set of model X from its weights alone) and then shows something totally different. Of course you can reconstruct from the gradient updates plus the weights—that's not interesting, nor does it support the claim.
It's worth noticing that while they didn't target specific people or make the attack undetectable through cache-control trickery, the tools they are looking for can be used to detect such evidence removal.
So they are probably building a government database of the IPs/people using such tools!
This mainly targets Russians, but you know, as Mozilla used to say... "this is the Web functioning as designed"!