
Yet again the word "open source" is being used in a way that doesn't make any sense. We're going to wind up with a weird situation where "open source" means "free and open source" for software specifically but just means "available free-of-charge" for data and ML model weights. Which is strange. The word "free" is right there. This is not "source code", and it certainly isn't "open source" even if it was.

I know this is a tangent, but unfortunately it bears repeating.



Right or wrong, licensing source code separately from data isn't a new thing. I can think of some very famous video games that have released their source code under a Free Software license but kept the game data proprietary.

According to the FSF there is a separation between data and code [1] (search for "data" on that page). They specifically say that data inputted to or outputted by a program isn't affected by the program's license, which indicates a separation from their perspective.

[1] https://www.gnu.org/licenses/gpl-faq.en.html


For any given data, it can be used as code, and vice versa. But for any given program, it should be very clear what's code and what's data.

If I send a Python file over SSH, it should most definitely be data for all software involved. And I for sure should be able to send a Python file via OpenSSH, no matter what either is licensed as.
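The code/data duality above can be sketched in a few lines of Python (the payload string is a made-up example): the exact same bytes are inert data to one program and executable code to another.

```python
# The same bytes can be treated as inert data or executed as code.
payload = "result = 21 * 2\n"   # a string: pure data

# Treated as data: measure it, copy it, send it over a socket, etc.
assert len(payload) == 16

# Treated as code: hand the identical bytes to the interpreter.
namespace = {}
exec(payload, namespace)
assert namespace["result"] == 42
```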


According to the FSF there is a separation between data and code

Which of course is a complete denial of the reality. Code is data, and data is code. That duality is the crucial reason why general-purpose computers are so powerful. The only ones to profit from trying to make a distinction, as usual, are the lawyers and corporations behind them who seek to restrict instead of empower.

Especially in this era when decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!), and with the rise of AI, IMHO the whole idea of "source code" being somehow more special than the executable binary is quickly losing relevance.


  decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!)
Citation needed.


I have definitely read some teammates’ code that felt like it would be more readable doing a compiler-decompiler round-trip. Never actually did it, but I doubt it would be less readable than that seemingly intentionally obfuscated garbage.


Can't wait for the JetBrains "deabstract" plugin, which compiles it, decompiles it, and reconstructs an indirection-free AST and then cleaner code from that AST via AI. De-Tech-Bro-My-Code. Pull the plug on all-the-patterns-in-one-project devs and get cleaner code today.

Refactor> ThrowIt> IntoTheBin


Personal experience.

There is a lot of decompiler research which isn't public.

A sibling comment mentions Hex-Rays and Ghidra. Those are only now slowly approaching the capabilities of what I've used.

The fact that the majority of code tends not to be intentionally obfuscated, and is compiler-generated and thus easily pattern-matched, also makes it quite straightforward. Of course, the fact that decompilers are often used on code that is obfuscated (e.g. malware, DRM) skews a lot of people's perceptions.


Just to be completely clear, the conditions I have been using Ghidra/Hex-Rays/BN under were not that bad. I wasn't analyzing malware or heavily-DRM'd software. Even with symbols and full debug info, many of those gripes still apply. (Hex-Rays is able to do a lot more with debug info. It can usually get a lot of the vtable indirections typed correctly, including, with a bit of effort, multi-inheritance offset `this` pointers.)

I'd love to see this non-public decompiler research but I have some skepticism, as a lot of the information that is lost would require domain-specific reconstruction to get back to anywhere near full fidelity. I do not deny that you have seen impressive results that I have not, but I really do wonder if the results are as generalizable as you're making it sound. That sounds like quite a breakthrough that I don't think Ghidra or IDA are slowly approaching.

But since it's non-public, I suppose I'll just have to take you at your word. I'll be looking forward to it some day.


> Especially in this era when decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!)

As someone who is knee-deep in a few hobby reverse engineering projects, I certainly wish this was the case :)

Hex-Rays and Ghidra both do a very commendable job, but when it comes to compiled languages, it is almost never better than reading the original source code. Even the easier parts of reversing C++ binaries still aren't fully automated; nothing that I'm aware of is going to automatically pull your vtables and start inferring class hierarchies.

Variable names are lost in executable code. When it comes to naming variables, most of the tools support working backwards from "known" API calls to infer decent function names, but only Binary Ninja offers a novel approach to providing variable names. They have an LLM service called Sidekick which offers suggestions to improve the analysis, including naming variables. Of course, it isn't very impressive if you were to just drop into a random function in a random binary where you have no annotations and no debug information.

Most of the "framework" stuff that compiles down, by some form of metaprogramming, is nearly nonsense and requires you to know the inner workings of the frameworks that you're touching. In my case I spend a lot of time on Win32 binaries, so the tricky things I see often are a result of libraries like MFC/ATL/WTL/etc. And I'll grant you that in some cases the original source code wouldn't exactly be the most scrutable thing in the world, but I'd still really rather have the MFC message handler mapping in its original form :) COM becomes a complete mess, as it's all vtable-indirected and there's just no good way for a decompiler to know which vtable(s) are involved or (to some degree) the function signatures of the vtable slots, so you have to determine this by hand.

Vectorized code is also a nightmare. Even if the code was originally written using intrinsics, you are probably better off sticking to the graph view in the disassembly. Hex-Rays did improve this somewhat but last I checked it still struggled to actually get all the way through.

The truth is that the main benefit of the decompiler view in IDA/Ghidra/etc. is actually the control flow reconstruction. The control flow reconstruction makes it vastly easier to read than even the best graph view implementation, for me. And this, too, is not perfect. Switch statements that compile down to jump tables tend to be reconstructed correctly, but many switch statements decompile down to a binary tree of conditionals; this is the case a lot of the time for Win32 WndProc functions, presumably because the WM_* values are almost always too sparse to be efficient for a jump table. So I'd much rather have the original source code, even for that.

Of course it depends a bit on the target. C code on ELF platforms probably yields better results if I had to guess, due to the global offset table and lack of indirection in C code. Objective C is probably even better. And I know for a fact that Java and C# "decompiling" is basically full fidelity, since the bytecode is just a lot less far away from the source code. But in practice, I would say we're a number of major breakthroughs away from this statement in general not being a massive hyperbole.

(I'm not complaining either. Hex-Rays/Ghidra/BN/etc. are all amazing tools that I'm happy to have at my disposal. It's just... man. I wish. I really wish.)


The repo contains some source code, so therefore it's open source


These files are source assets, which is as close to source code as you can get with non-code stuff. For regular people who didn't drink the OSI Kool-Aid, this is a perfectly valid and logical use of the term "open source". I don't know if that's the angle you're coming from, or if you just didn't know what USD was, but either way this is a good release.


The phrase "open source" is itself open source and is freely available for use, modification and redistribution.



Open Source, with the capitals, however, is not, and is a trademark of the Open Source Initiative (OSI).

https://opensource.org/trademark-guidelines


No, it's not. From the page you linked to:

> OSI, Open Source Initiative, and OSI logo (“OSI Logo”), either separately or in combination, are hereinafter referred to as “OSI Trademarks” and are trademarks of the Open Source Initiative.


> In all cases, use is permitted only provided that:

> the use of the term “Open Source” is used solely in reference to software distributed under OSI Approved Licenses.


The map data is provided in the USD format, which is a 3D authoring and interchange format that can be used with a lot of software. Unlike the final optimized data used by the game, this doesn't require reverse engineering and can be seen as source data that is in fact useful for graphics researchers and game developers.


I’m confused as to why the convention isn’t to consider ML weights datasets instead of any type of code (closed or open).


Model weights are functions, in the same way `lambda x: x > 0.25` is a function.
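A slightly less trivial sketch of that point, with made-up weight values: the numbers alone are inert, but paired with inference code they define a function.

```python
# A hypothetical "model": the weights are just numbers, but combined with
# the inference code they define a function, like `lambda x: x > 0.25`.
weights = [0.5, -0.25]
bias = 0.1

def model(x):
    # One linear "layer": dot product plus bias, then a threshold.
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return s > 0.0

print(model([1.0, 0.0]))  # 0.5 + 0.1 > 0   -> True
print(model([0.0, 1.0]))  # -0.25 + 0.1 > 0 -> False
```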


The article claims it’s open source (which it clearly isn’t, especially since they say things like “open source for non-commercial use”, which is a bit of a contradiction), but the GitHub makes no such claim, only stating that the OpenUSD format is open source.


Feels like a covert way to destroy the term “open source” by making it meaningless over time.


Yes it's all one big conspiracy.


No, it's a Schelling point but evil


I find it rather odd that after all the years of exposed and revealed conspiracies too numerous and pervasive to even necessitate listing any of them, people like you just reject the notion that any additional, unknown conspiracies may exist.

It is an odd phenomenon among humans that I at least don’t quite understand, the seeming tendency to ignore or dismiss possibilities of proven negative outcomes … for whatever reason. “I know all those other conspiracies I dismissed all turned out to be true, but I am sure I would know if there were any additional conspiracies” … totally ignoring one’s track record.

It appears to be the same kind of mentality of “hey, you know who we should trust with our lives … the government made up of people who lie to us, steal from us, and mass murder on a regular basis; that’s who we should give control over to.”

People conspire, I’ve witnessed it personally numerous times; sometimes for greedy business reasons, at other times to mass murder and commit genocide on a scale not seen since. Humans conspire, even if sometimes only because they’re not prevented from doing so naturally.


It’s probably convenient for them to dismiss this as it aligns with whatever goals they have..


ML weights are code


I'll accept that if you would like. However, they are not source code if so. They are object code. And open source is about source code, not object code. (And this particular press release isn't about ML weights anyways, at least unless I'm grossly misunderstanding; it is just a dataset. So even failing this, it still doesn't really make any sense.)


No, it is not object code, unless you want to get so stupidly pedantic that you would argue a Python script in a zip file can’t be considered open source because it’s compressed.

The model pickles unpack back to their original form. The pickled binary forms are merely for convenience.
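That round-trip claim is easy to sketch with Python's own pickle module (the weight values here are made up):

```python
import pickle

# Hypothetical "weights" as a nested structure of floats.
weights = {"layer1": [0.5, -0.25], "bias": 0.1}

# The binary form is just a serialization for convenience...
blob = pickle.dumps(weights)
assert isinstance(blob, bytes)

# ...and it unpacks back to exactly the original form.
restored = pickle.loads(blob)
assert restored == weights
```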


Look, please go do research as to what "object code" and "source code" are before saying my argument is "stupidly pedantic". I'm not elaborating because the example you gave has nothing to do with what I said.


Your analogy does not make sense. ML weights are distributed in binary form, like object code, but they are nothing like a compiled binary. They are just temporarily in binary form for convenience; they unpack directly into their original form.

This is not a technicality like “technically you can reverse engineer or modify binary code”. The binary form of model weights is just a fancy zip file format, useful because the weights are so large that text is impractical.


Source code is human readable. Object code is not, and is produced by some mechanical process.

Model weights are not written by hand. You don't manually tweak individual weights. You have to run a training process that has multiple "raw" inputs. Trying to read model weights directly is no better than trying to read object code directly. Heck, reading object code directly is probably easier, because at least it's just machine code at the bottom; I will never be able to comprehend what's going on in an ML model just by reading the weights.

The closest thing to "source code" in ML models would be the inputs to the training process, because that's the "source" of the model weights that pop out the other end. If the analogy doesn't make sense, that's because ML models are probably not really code in the same sense that source code and object code are.

(It may be tempting to look at "ML weights" as source code because of the existence of "closed-weight" API services. Please consider the following: If Amazon offers me a unique database service that I can only use with Amazon Web Services, and then releases a closed-source binary that you can run locally, that is still closed-source, because you don't have the source code.)
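The "inputs to the training process" point can be sketched with a toy example: the hand-written artifact is the script and data below; the weight is what mechanically falls out of the optimization (the data and learning rate are made up for illustration).

```python
# A toy training loop: the human-written "source" is this script plus the
# data; the weight that pops out is a mechanical product, not hand-written.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # learn y = 2x

w = 0.0  # the "model weight"
for _ in range(200):
    for x, y in data:
        grad = 2 * (w * x - y) * x   # d/dw of the squared error
        w -= 0.01 * grad

print(round(w, 3))  # converges to ~2.0
```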


“Human readable” is not a requirement. Visual programming code breaks down to some obtuse data structure, but with the right tools it’s easy for humans to interact with it. Visual programming node workflows can be open sourced. ML models are the same: tooling is required to interact with them. The limits of your human understanding do not determine whether something is open source; otherwise a really complicated traditional program might be argued to not be open source. You can individually explore specific vectors and layers of a model and their significance.

Being produced by a non-mechanistic process is not a requirement either. I can generate a hello world script with code, and open source the hello world script. It does not matter how it was formed. I do not need to open source the hello world generator either.

Data and training code are not the source code of the model. They are the source code of a model maker. That’s `make_hello_world.py`, not `hello_world.py`.
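A sketch of that distinction, using the hypothetical filenames from the analogy: the generator is its own program with its own source; its output stands alone regardless of provenance.

```python
# make_hello_world.py: a hypothetical generator whose output is itself a
# program. The generator's source is not the source code of its output.
generated = 'print("hello world")\n'
with open("hello_world.py", "w") as f:
    f.write(generated)

# The generated file stands alone; how it was made doesn't change what it is.
import subprocess
import sys

out = subprocess.run([sys.executable, "hello_world.py"],
                     capture_output=True, text=True)
print(out.stdout.strip())  # hello world
```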

The closed source database is not a correct analogy. Excluding unreasonably difficult efforts to decompile the binary, you CANNOT modify the program without expecting it to break. With an ML model, the weights are the PREFERRED method of modifying the program. You do NOT want the original data and training code; that would just be a huge expense to get you what you already have. If you want the model to be different, you take the model weights and change them, not recreate them differently from scratch. Which is the same for all traditional code: open source does not mean I provide you with the design documents and testing feedback that demonstrate how the code base got created. It means you get the code base. Recreating the codebase is not something we think about, because it doesn’t make sense: we have the code, and we have the models.


Human readable is a requirement. The existence of things that don't fit into this paradigm doesn't invalidate it entirely, it just proves that it is imperfect. However, it being imperfect does not mean that 1 + 1 != 2. Semantics debates don't grant you the power to just invalidate the entire purpose of words.

What you are proving repeatedly is that model weights are not code, not that they are "source" code.

- The existence (barely, btw) of visual programming does not prove that model weights are code. It proves that there are forms of code other than source code that are useful to humans. There are not really forms of model weights that are directly useful to humans. I can't open any set of model weights in some software and get a useful visualization of what's going on. It's not source code. (Any visual programming language can output some useful human readable equivalent if it wants to. For some of them, the actual on-disk format is in fact human-readable source code.)

(A key point here: if you write assembly code, it's source code. If you assemble it, it's object code. This already stresses the paradigm a bit, because disassembly is reversible... but it's only reversible to some degree. You lose macros, labels, and other details that may not be possible to recover. Even if it was almost entirely reversible though, that doesn't mean that object code is source code. It just means that you can convert the object code into meaningful source code, which is not normally the case, but sometimes it is.)

- The existence of fine-tuning doesn't have anything to do with source code versus object code. Bytecode is easy to modify. Minecraft is closed source but the modding community has absolutely no trouble modifying it to do literally anything without almost any reverse engineering effort. This is a reflection of how much information is lost during the compilation process, which is a lot more for most AOT-compiled languages (where you lose almost all symbols, relocations, variable and class names, etc.) than it is for some other languages (and it's not even split on that paradigm, either; completely AOT languages can still lose less information depending on a lot of factors.) The mechanical process of producing model weights loses some information too; in some models, you can even produce models that are less suitable for fine-tuning (by pruning them and removing meta information that is useful for training). A closer analogy here would be closed source with or without symbols.


> Human readable is a requirement. The existence of things that don't fit into this paradigm doesn't invalidate it entirely, it just proves that it is imperfect. However, it being imperfect does not mean that 1 + 1 = 2. Semantics debates don't grant you the power to just invalidate the entire purpose of words.

well first of all, 1+1 does actually equal 2

Secondly, contradictions to your supposed hard rules absolutely means you don’t have hard rules. If you want to play the semantic game of saying words can mean whatever you want them to mean then sure. But then that’s pointless and you’re just saying you just want to be stubborn.

> I can't open any set of model weights in some software and get a useful visualization of what's going on. It's not source code.

Yes you can. Do you actually have any experience with what you’re talking about? This is a huge red flag that you do not.

Your Minecraft example is a straw man. I did not claim that the existence of fine tuning meant models are source code. I claimed that because fine tuning models is the preferred form of modifying models means that it meets the definitional requirement of being called open source.

Minecraft can be modified, but it is not the preferred form to do so, so it is not open source.

You are still failing to address helloworldmaker vs hello world. Helloworldmaker is explicitly not the source code of hello world. Model maker is not the source code of model.

Appealing to your own lack of capabilities to understand something doesn’t make it not source code.


> well first of all, 1+1 does actually equal 2

Sigh. That's a typo. I almost feel like it's not important to fix it considering that it's pretty obvious what I meant, but alas.

> Secondly, contradictions to your supposed hard rules absolutely means you don’t have hard rules. If you want to play the semantic game of saying words can mean whatever you want them to mean then sure. But then that’s pointless and you’re just saying you just want to be stubborn.

The "semantics game" I'm using is the long-understood definition of the term 'source code'.

American Heritage® Dictionary of the English Language, 5th Edition:

> source code, noun

> 1. Code written by a programmer in a high-level language and readable by people but not computers. Source code must be converted to object code or machine language before a computer can read or execute the program.

> 2. Human-readable instructions in a programming language, to be transformed into machine instructions by a compiler, assembler or other translator, or to be carried out directly by an interpreter.

> 3. Program instructions written as an ASCII text file; must be translated by a compiler or interpreter or assembler into the object code for a particular computer before execution.

Oxford Languages via Google:

> source code /ˈsôrs ˌkōd/

> noun: source code; plural noun: source codes; noun: sourcecode; plural noun: sourcecodes

> a text listing of commands to be compiled or assembled into an executable computer program.

Merriam-Webster:

> source code, noun

> : a computer program in its original programming language (such as FORTRAN or C) before translation into object code usually by a compiler

Wikipedia:

> In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer.

So every source pretty much agrees. Merriam-Webster falls short of actually specifying that it must be "human readable", but all of them specify in enough detail that you can say with certainty that ML model weights simply don't come anywhere near the definition of source code. It's just not even close.

> Yes you can. Do you actually have any experience with what you’re talking about? This is a huge red flag that you do not.

I'm trying to be patient but having to explain things in such verbosity that you actually understand what I'm trying to say is so tiring that it should be a violation of the Hacker News guidelines.

YES, I am aware that tools which can input model weights and visualize them exist. NO, that doesn't mean that what you see is useful the way that a visual programming language is. You can not "see" the logic of model weights. This is the cornerstone of an entire huge problem with AI models in the first place: they're inherently opaque.

(P.S.: I will grant you that escalating my tone here is not productive, but this arguing goes nowhere if you're just going to take the weakest interpretation of everything I say and run with it. I have sincerely not been doing the same for you. I accepted early on that one could argue that model weights could be considered "code" even though I disagree with it, because there's absolutely zero ambiguity as to whether or not it's "source code", and yet here we are, several more comments deep and the point nowhere to be found.)

> Your Minecraft example is a straw man. I did not claim that the existence of fine tuning meant models are source code. I claimed that because fine tuning models is the preferred form of modifying models means that it meets the definitional requirement of being called open source.

First of all, to be called "open source", it first needs to meet the definition of being "source code". That's what the "source" part of "open source" means.

Secondly, to be called "open source", it also needs to meet the definition of being "open". That's the "open" part of open source.

Open-weight models that have actual open source licenses attached to them do meet the criteria for "open", but many models, like Meta's recent releases, do not. They have non-commercial licenses that don't even come close to meeting the requirements.

> Minecraft can be modified, but it is not the preferred form to do so, so it is not open source.

Whether or not source code is the preferred form to modify something is entirely beside the point. I'm not sure where you got this, but it's simply wrong. Please stop spreading blatant misinformation.

> You are still failing to address helloworldmaker vs hello world. Helloworldmaker is explicitly not the source code of hello world. Model maker is not the source code of model.

I'm not addressing it because it's not 100% agreed upon. If you read my above definitions, you will see that in some of them, the results of "Helloworldmaker" would qualify as source code, and in some of them, they wouldn't. Likewise, you can compile any Wasm blob down to C code, and I'd strongly argue that the resulting C code is not human-readable source code; it's just in a programming language. This definition, though, has a degree of fallibility to it. Unfortunately, a rigid set of logic cannot determine what should be considered source code.

That's OK though, because it actually has nothing to do with whether or not model weights are source code. They don't even come remotely close to anything resembling source code in this entire debate. Model training doesn't produce human-readable source code; it produces model weights, a bunch of data that is, on its own, not even particularly useful, much less readable.

> Appealing to your own lack of capabilities to understand something doesn’t make it not source code.

With all due respect, I am not concerned about your judgement of my capabilities. (And it has nothing to do with this anyways. This is a pretty weak jab.)


> Whether or not source code is the preferred form to modify something is entirely beside the point. I'm not sure where you got this, but it's simply wrong. Please stop spreading blatant misinformation.

I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries are NOT open source because, unlike model weights, they are NOT the preferred way to modify Minecraft.

> I'm not addressing it because it's not 100% agreed upon. If you read my above definitions, you will see that in some of them, the results of "Helloworldmaker" will qualify as source code,

Helloworldmaker is 100% source code. But it’s not the source code for helloworld. To make this even simpler: if I wrote a hello world program by rolling literal dice, SURELY you would not pretend that the fully functional program’s source code is the worldly logic by which I rolled dice to generate the characters of code.

Or if we had an LLM spit out doom, we would not claim that the doom source code is the LLM model, or the training code for the model originally; it’s the doom code itself.

The origin of a program has no bearing on whether the program’s source code can be considered source code.

Given that we have established this, you cannot argue that the training program and data, which are not required to make a given set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb. It is a structured file of logic, in a form that is convenient to modify. That’s open source! The only reason we felt the need to gatekeep “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because they don’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process that makes a program is the source code of the program.

Getting obsessive over antiquated definitions here entirely misses the point of why source code and open source are defined the way they are.


> I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries are NOT open source because, unlike model weights, they are NOT the preferred way to modify Minecraft.

Minecraft "binaries" cannot be open source because binaries are not source code.

> Helloworldmaker is 100% source code. But it’s not the source code for helloworld. To make this even simpler: if I wrote a hello world program by rolling literal dice, SURELY you would not pretend that the fully functional program’s source code is the worldly logic by which I rolled dice to generate the characters of code.

What I said is that the results of "helloworldmaker" would not be universally considered source code. This is because whether generated code is source code is already debated. Most accurately, the source code for "helloworld" would be a script that generates it, by calling "helloworldmaker" with some set of parameters, not the result of that generation. That is source code, by every definition past, present and future. (Whether the resulting "helloworld" is also source code is unclear and depends on your definitions.)

> Or if we had an LLM spit out doom, we would not claim that the doom source code is neither the doom code, nor the LLM model but the training code for the model originally.

If you overfit an LLM to copy data in a roundabout way, then you're just having it spit out copies of human code in the first place, which isn't particularly novel. The only real wrench in the cogs re: LLMs is that LLMs are effectively 'equivalent' to humans in this case, as they can generate "novel" code that I agree would qualify as source code.

> The origin of a program has no bearing on whether the program’s source code is considerable to be source code.

I would advise you to check the definition of the word "source" before claiming asinine things like this.

> Given that we have established this, you cannot argue that the training program and data, which are not required to make a random set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb.

Yes that is correct, ML weights do not have source code, because they are data, not code. This isn't particularly stunning as computers perform all kinds of computational operations over datasets that don't involve things that are called source code. Database data in general is not source code. If you paint something in Photoshop, there is no source code for your painting; you can save it with greater or less fidelity, but none of those things are "source code", they're just different degrees of fidelity to the original files you worked on.

I am not thusly claiming, though, that computer graphics can't involve source code; it can, for example, by producing graphics by writing SVG code. Rendering this to raster is not producing "object code", though; "object code" would be more like converting the SVG into some compiled form like a PDF. This is a great example of how "source code" and "object code" are not universal terms. They have fairly specific meanings tied to programming that, while not universally 100% agreed upon, have clear bounds on what they are not.

> It is a structured file of logic, in a form that is convenient to modify. That’s open source!

No, it isn't "open source". Open source as it's used today was coined in the late 90s and refers to a specific, well-defined concept. Even if we ignore the OSI, dictionary definitions generally agree. Oxford says that "open source" is an adjective "denoting software for which the original source code is made freely available and may be redistributed and modified." Wikipedia says "Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose."

Importantly, "open source" refers to computer software and in particular, computer software source code. It also has a myriad of implications about what terms software is distributed under. Even ignoring the OSI definition, "free for non-commercial use" is not a concept that has ever been meaningfully recognized as "open source", especially not by the experts who use this definition.

> The only reason we felt the need to gatekeep “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because it doesn’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process to make a program is the source code of the program.

Frankly I have no idea what you're on about with how it is ridiculous to argue there is no source code. I mean obviously, the software that does inference and training has "source code", but it is completely unclear to me why it's "ridiculous" that I don't consider ML model weights, which are quite literally just a bunch of numbers that we do statistics on, to be "source code". Again, ML weights don't even come close to any definition of source code that has ever been established.
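To make the "bunch of numbers" point concrete, here is a minimal sketch (using numpy; the layer names and shapes are made up for illustration, and real checkpoint formats like safetensors or GGUF are the same idea at much larger scale):

```python
import numpy as np

# A "model" here is nothing but named arrays of floating-point numbers.
weights = {
    "layer0.weight": np.random.randn(4, 8).astype(np.float32),
    "layer0.bias": np.zeros(4, dtype=np.float32),
}

for name, tensor in weights.items():
    print(name, tensor.shape, tensor.dtype)

# There is no human-readable logic in here to "read": inference is just
# arithmetic over these arrays, e.g. a matrix multiply plus a bias.
x = np.ones(8, dtype=np.float32)
y = weights["layer0.weight"] @ x + weights["layer0.bias"]
print(y.shape)  # (4,)
```

Nothing in that dictionary resembles any established definition of source code; it is numeric data that a separate program (which does have source code) operates on.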

> Getting obsessive over antiquated definitions here is entirely missing the point of why source code and open source is defined the way it is.

The reasoning behind why open source is defined the way it is has been well documented, but I'm not sure what part of it to point to here, because no part of it has anything in particular to do with this. The open source movement is about software programs.

I am not against an "open weight" movement, but co-opting the term "open source" is stupid, when it has nothing to do with it. The only thing that makes "open source" nice is that it has a good reputation, but it has a good reputation in large part because it has been gatekept to all hell. This was no mistake: in the late 90s when Netscape was being open sourced, a strategic effort was made to gatekeep the definition of open source.

But otherwise, it's unclear how these "free for non-commercial usage" ML weights and especially datasets have anything to do with open source at all.

It's not that the definition of the word "source code" has failed to keep up with the times. It has kept up just fine and still refers to what it always has. There is no need to expand the definition to some literally completely unrelated stuff that you feel bears some analogical resemblance.

(P.S.: The earliest documentation I was able to dig up for the definitions of the words "source code" and "object code" goes back to about the 1950s. The Federal Register covers some disputes relating to how copyright law applies to computer code. At the time, it was standard to submit 50 pages of source code when registering a piece of software for copyright: the first 25 pages and last 25 pages. Some companies were hesitant to submit this, so exceptions were made to allow companies to submit the first and last 25 pages of object code instead. The definitions of "source code" and "object code" in these cases remain exactly the same as they are today.)


No, they really aren’t and I’m not sure why I keep seeing this take. ML weights are binary and it’s painfully obvious.

They are the end result of a compilation process in which the training data and model code are compiled into the resulting weights. If you can’t even theoretically recreate the weights on your own hardware it isn’t open source.
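The "compilation" analogy can be sketched in a few lines. This is a hypothetical toy (plain linear regression with numpy, made-up data), not any real training pipeline, but it shows the shape of the argument: weights are the *output* of running training code over training data, and without both inputs you cannot reproduce them:

```python
import numpy as np

# The "source": training code plus training data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# The "compilation": gradient descent turns code + data into weights.
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= 0.1 * grad

# The resulting artifact is just numbers. Without X and y you can only
# tweak it after the fact, not rebuild it.
print(np.round(w, 2))  # close to [1.0, -2.0, 0.5]
```

On this view, shipping only `w` while withholding `X`, `y`, and the loop is the moral equivalent of shipping a compiled binary.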


ML weights are not binary. They are modifiable.

If I produce a program that outputs a hello world script, I can open source the hello world script without open sourcing the hello world generator.


We can also say binaries are code, but if we're being pedantic, a binary likely isn't the source code that generated it (and I doubt anyone intends to hand-write binaries or manually input billions of weights). I'd reckon that's why it's called open source, not open code or open binary: what gets distributed is the source that generates the artifact. I'd actually just call this for what it is: open weights.


Binary is not the equivalent of models. Source code is the equivalent of models.

It doesn’t matter if a machine generated source code or a human did for it to be open source code.


You keep asserting this but without any reason. Do you have a reason? It seems to go against the general open source idea of source code being convenient for people to modify.


ML weights ARE convenient for people to modify. You can go look at the dozens of modifications of diffusion models being produced, daily, on civit ai. It’s very easy.
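One common way people modify weights directly, without retraining, is simple interpolation between two checkpoints of the same architecture (the "model merging" popular on sites like Civitai). A minimal sketch with numpy, using tiny made-up arrays in place of real checkpoints:

```python
import numpy as np

# Two "checkpoints" of the same architecture: same names, same shapes.
model_a = {"w": np.full((2, 2), 1.0), "b": np.zeros(2)}
model_b = {"w": np.full((2, 2), 3.0), "b": np.ones(2)}

def merge(a, b, alpha=0.5):
    """Linear interpolation of every tensor: the core of naive model merging."""
    return {name: (1 - alpha) * a[name] + alpha * b[name] for name in a}

merged = merge(model_a, model_b, alpha=0.5)
print(merged["w"])  # every entry is 2.0, halfway between 1.0 and 3.0
```

Whether this counts as "modifying source" or as the weight-level equivalent of binary patching is exactly what's in dispute here.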


Would you say that once a model is trained, there's no need to go back and re-train it, even if you want to, say, remove some material from the training set? Anything can be done just with the weights? That's a big surprise to me.

Of course people hack binaries too, and binaries are obviously not source code. I once edited a book in PDF form because we didn't have the original Word/whatever document. It's not hard but a PDF still isn't considered to be source code for documentation despite that.


Technically, but it feels like you're intentionally missing the point being made. Sure, providing the weights is very useful given the cost of generating them, but you can't exactly read through the 'code', make changes, and gain an in-depth understanding the way you can with the code provided by an actual open source project.


You absolutely can and people do all the time. There are mountains of forks and dissections and improvements on open source models.



