Your analogy does not make sense. ML weights are distributed in binary form, lik...

jchw · 2024-07-31T15:56:39 1722441399

Source code is human readable. Object code is not, and produced from some mechanical process.

Model weights are not written by hand. You don't manually tweak individual weights. You have to run a training process that has multiple "raw" inputs. Trying to read model weights directly is no better than trying to read object code directly. Heck, reading object code directly is probably easier, because at least it's just machine code at the bottom; I will never be able to comprehend what's going on in an ML model just by reading the weights.
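As a rough illustration of what "reading the weights directly" amounts to, here is a minimal Python sketch. The three values are made up for illustration and do not come from any real model; actual checkpoints hold millions or billions of such numbers, usually inside a container format.

```python
import struct

# Hypothetical "weights": three made-up float32 values, not from any real model.
weights = [0.5, -1.25, 3.0]

# Serialize as little-endian float32, roughly how raw tensor data sits on disk.
blob = struct.pack("<3f", *weights)
hex_dump = blob.hex()

# Reading the bytes back recovers the numbers exactly, but nothing in them
# says what role each number plays in the model's behavior.
recovered = list(struct.unpack("<3f", blob))
print(hex_dump)
print(recovered)
```

The round trip is lossless, which is the point: having every number is not the same as being able to read the logic.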

The closest thing to "source code" in ML models would be the inputs to the training process, because that's the "source" of the model weights that pop out the other end. If the analogy doesn't make sense, that's because ML models are probably not really code in the same sense that source code and object code are.

(It may be tempting to look at "ML weights" as source code because of the existence of "closed-weight" API services. Please consider the following: If Amazon offers me a unique database service that I can only use with Amazon Web Services, and then releases a closed-source binary that I can run locally, that is still closed-source, because I don't have the source code.)

jncfhnb · 2024-07-31T16:22:42 1722442962

“Human readable” is not a requirement. Visual programming code breaks down to some obtuse data structure. But with the right tools, it’s easy for humans to interact with it. Visual programming node workflows can be open sourced. ML models are the same. Tooling is required to interact with it. The limits of your human understanding do not determine if something is open source. Otherwise a really complicated traditional program might be argued as not open source. You can individually explore specific vectors and layers of a model and their significance.

Produced by a non mechanistic process is not a requirement. I can generate a hello world script with code, and open source the hello world script. It does not matter how it was formed. I do not need to open source the hello world generator either.

Data and training code are not the source code of the model. They are the source code of a model maker. That’s `make_hello_world.py`, not `hello_world.py`.
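To make the distinction concrete, here is a minimal sketch of the generator side. The function name and greeting are hypothetical, chosen only to mirror the `make_hello_world.py` / `hello_world.py` naming above.

```python
import contextlib
import io

# Sketch of a hypothetical make_hello_world.py: a program whose output is
# the text of another program.
def make_hello_world(greeting: str = "Hello, world!") -> str:
    """Return the text of a tiny program that prints `greeting`."""
    return f"print({greeting!r})\n"

# This string is what would be saved as hello_world.py.
generated = make_hello_world()

# The generated text runs on its own; the generator is not needed to
# execute it or to edit it afterwards.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(generated)
output = buffer.getvalue()
```

Whether `generated` counts as the "source" of the program, or only the generator does, is exactly what the two sides of this thread disagree about; the sketch shows only that they are two separate artifacts.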

The closed source database is not a correct analogy. Excluding unreasonably difficult efforts to decompile the binary, you CANNOT modify the program without expecting it to break. With an ML model, the weights are the PREFERRED method of modifying the program. You do NOT want the original data and training code. That will just be a huge expense to get you what you already have. If you want the model to be different, you take the model weights and change them. Not recreate them differently from scratch. Which is the same for all traditional code. Open source does not mean I provide you with the design documents and testing feedback to demonstrate how the code base got created. It means you get the code base. Recreating the codebase is not something we think about because it doesn’t make sense because we have the code and we have the models.
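The workflow described here, starting from released weights and adjusting them with new data instead of retraining from scratch, can be sketched with a toy model. Everything below is an assumption for illustration: a two-parameter linear model and plain gradient descent, standing in for what in practice would be a large network and a framework-specific training loop.

```python
# Toy "released" weights for y = w * x + b (hypothetical values).
released = {"w": 2.0, "b": 0.0}

# New task data, unrelated to whatever produced the released weights: y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def loss(params):
    """Mean squared error of the linear model on the new data."""
    return sum(
        (params["w"] * x + params["b"] - y) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)

def fine_tune(params, steps=200, lr=0.05):
    """Plain gradient descent that starts from the released weights."""
    w, b = params["w"], params["b"]
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}

tuned = fine_tune(released)
print(loss(released), loss(tuned))
```

Note that the original training data never appears: the only inputs are the released weights and the new examples.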

jchw · 2024-07-31T17:10:07 1722445807

Human readable is a requirement. The existence of things that don't fit into this paradigm doesn't invalidate it entirely, it just proves that it is imperfect. However, it being imperfect does not mean that 1 + 1 != 2. Semantics debates don't grant you the power to just invalidate the entire purpose of words.

What you are proving repeatedly is that model weights are not code, not that they are "source" code.

- The existence (barely, btw) of visual programming does not prove that model weights are code. It proves that there are forms of code other than source code that are useful to humans. There are not really forms of model weights that are directly useful to humans. I can't open any set of model weights in some software and get a useful visualization of what's going on. It's not source code. (Any visual programming language can output some useful human readable equivalent if it wants to. For some of them, the actual on-disk format is in fact human-readable source code.)

(A key point here: if you write assembly code, it's source code. If you assemble it, it's object code. This already stresses the paradigm a bit, because disassembly is reversible... but it's only reversible to some degree. You lose macros, labels, and other details that may not be possible to recover. Even if it was almost entirely reversible though, that doesn't mean that object code is source code. It just means that you can convert the object code into meaningful source code, which is not normally the case, but sometimes it is.)
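The lossy round trip described here has a loose analogue that is easy to try in Python. This sketch uses the standard library's `ast` module (`ast.unparse` requires Python 3.9 or later) in place of a real assembler and disassembler, so it is only an analogy: comments play the role of the macros and labels that do not survive. The function `to_fahrenheit` is a made-up example.

```python
import ast

# Original "source": includes comments that exist only for human readers.
source = (
    "# Convert Celsius to Fahrenheit.\n"
    "def to_fahrenheit(celsius):\n"
    "    return celsius * 9 / 5 + 32  # scale, then offset\n"
)

# Parse to a syntax tree, then regenerate text from the tree.
recovered = ast.unparse(ast.parse(source))

# Both versions behave the same when executed...
original_ns, recovered_ns = {}, {}
exec(source, original_ns)
exec(recovered, recovered_ns)
same_behavior = (
    original_ns["to_fahrenheit"](100) == recovered_ns["to_fahrenheit"](100)
)

# ...but the comments are gone from the regenerated text.
print(recovered)
print(same_behavior)
```

The regenerated text is still meaningful code, but it is not the text the author wrote, which is the same gap that separates disassembly output from the original assembly source.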

- The existence of fine-tuning doesn't have anything to do with source code versus object code. Bytecode is easy to modify. Minecraft is closed source but the modding community has absolutely no trouble modifying it to do literally anything with almost no reverse engineering effort. This is a reflection of how much information is lost during the compilation process, which is a lot more for most AOT-compiled languages (where you lose almost all symbols, relocations, variable and class names, etc.) than it is for some other languages (and it's not even split on that paradigm, either; completely AOT languages can still lose less information depending on a lot of factors.) The mechanical process of producing model weights loses some information too; in some cases, you can even produce models that are less suitable for fine-tuning (by pruning them and removing meta information that is useful for training). A closer analogy here would be closed source with or without symbols.

jncfhnb · 2024-07-31T17:27:28 1722446848

> Human readable is a requirement. The existence of things that don't fit into this paradigm doesn't invalidate it entirely, it just proves that it is imperfect. However, it being imperfect does not mean that 1 + 1 = 2. Semantics debates don't grant you the power to just invalidate the entire purpose of words.

well first of all, 1+1 does actually equal 2

Secondly, contradictions to your supposed hard rules absolutely mean you don’t have hard rules. If you want to play the semantic game of saying words can mean whatever you want them to mean then sure. But then that’s pointless and you’re just saying you just want to be stubborn.

> I can't open any set of model weights in some software and get a useful visualization of what's going on. It's not source code.

Yes you can. Do you actually have any experience with what you’re talking about? This is a huge red flag that you do not.

Your Minecraft example is a straw man. I did not claim that the existence of fine tuning meant models are source code. I claimed that, because fine tuning is the preferred form of modifying models, it meets the definitional requirement of being called open source.

Minecraft can be modified, but it is not the preferred form to do so, so it is not open source.

You are still failing to address helloworldmaker vs hello world. Helloworldmaker is explicitly not the source code of hello world. Model maker is not the source code of model.

Appealing to your own lack of capabilities to understand something doesn’t make it not source code.

jchw · 2024-07-31T17:57:32 1722448652

> well first of all, 1+1 does actually equal 2

Sigh. That's a typo. I almost feel like it's not important to fix it considering that it's pretty obvious what I meant, but alas.

> Secondly, contradictions to your supposed hard rules absolutely mean you don’t have hard rules. If you want to play the semantic game of saying words can mean whatever you want them to mean then sure. But then that’s pointless and you’re just saying you just want to be stubborn.

The "semantics game" I'm using is the long-understood definition of the term 'source code'.

American Heritage® Dictionary of the English Language, 5th Edition:

> source code, noun

> 1. Code written by a programmer in a high-level language and readable by people but not computers. Source code must be converted to object code or machine language before a computer can read or execute the program.

> 2. Human-readable instructions in a programming language, to be transformed into machine instructions by a compiler, assembler or other translator, or to be carried out directly by an interpreter.

> 3. Program instructions written as an ASCII text file; must be translated by a compiler or interpreter or assembler into the object code for a particular computer before execution.

Oxford Languages via Google:

> source code /ˈsôrs ˌkōd/

> noun: source code; plural noun: source codes; noun: sourcecode; plural noun: sourcecodes

> a text listing of commands to be compiled or assembled into an executable computer program.

Merriam-Webster:

> source code, noun

> : a computer program in its original programming language (such as FORTRAN or C) before translation into object code usually by a compiler

Wikipedia:

> In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer.

So every source pretty much agrees. Merriam-Webster falls short of actually specifying that it must be "human readable", but all of them specify in enough detail that you can say with certainty that ML model weights simply don't come anywhere near the definition of source code. It's just not even close.

> Yes you can. Do you actually have any experience with what you’re talking about? This is a huge red flag that you do not.

I'm trying to be patient but having to explain things in such verbosity that you actually understand what I'm trying to say is so tiring that it should be a violation of the Hacker News guidelines.

YES, I am aware that tools which can input model weights and visualize them exist. NO, that doesn't mean that what you see is useful the way that a visual programming language is. You can not "see" the logic of model weights. This is the cornerstone of an entire huge problem with AI models in the first place: they're inherently opaque.

(P.S.: I will grant you that escalating my tone here is not productive, but this arguing goes nowhere if you're just going to take the weakest interpretation of everything I say and run with it. I have sincerely not been doing the same for you. I accepted early on that one could argue that model weights could be considered "code" even though I disagree with it, because there's absolutely zero ambiguity as to whether or not it's "source code", and yet here we are, several more comments deep and the point nowhere to be found.)

> Your Minecraft example is a straw man. I did not claim that the existence of fine tuning meant models are source code. I claimed that, because fine tuning is the preferred form of modifying models, it meets the definitional requirement of being called open source.

First of all, to be called "open source", it first needs to meet the definition of being "source code". That's what the "source" part of "open source" means.

Secondly, to be called "open source", it also needs to meet the definition of being "open". That's the "open" part of open source.

Open-weight models that have actual open source licenses attached to them do meet the criteria for "open", but many models, like Meta's recent releases, do not. They have non-commercial licenses that don't even come close to meeting the requirements.

> Minecraft can be modified, but it is not the preferred form to do so, so it is not open source.

Whether or not source code is the preferred form to modify something is entirely beside the point. I'm not sure where you got this, but it's simply wrong. Please stop spreading blatant misinformation.

> You are still failing to address helloworldmaker vs hello world. Helloworldmaker is explicitly not the source code of hello world. Model maker is not the source code of model.

I'm not addressing it because it's not 100% agreed upon. If you read my above definitions, you will see that in some of them, the results of "Helloworldmaker" will qualify as source code, and in some of them, they won't. Likewise, you can compile any Wasm blob down to C code, and I'd strongly argue that the resulting C code is not human readable source code, it's just in a programming language. This definition, though, has a degree of fallibility to it. Unfortunately, a rigid set of logic can not determine what should be considered source code.

That's OK though, because it actually has nothing to do with whether or not model weights are source code. They don't even come remotely close to anything resembling source code in this entire debate. Model training doesn't produce human-readable source code, it produces model weights, a bunch of data that is, on its own, not even particularly useful, much less readable.

> Appealing to your own lack of capabilities to understand something doesn’t make it not source code.

With all due respect, I am not concerned about your judgement of my capabilities. (And it has nothing to do with this anyways. This is a pretty weak jab.)

jncfhnb · 2024-07-31T19:55:27 1722455727

> Whether or not source code is the preferred form to modify something is entirely beside the point. I'm not sure where you got this, but it's simply wrong. Please stop spreading blatant misinformation.

I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries are NOT open source because, unlike model weights, they’re NOT the preferred way to modify Minecraft.

> I'm not addressing it because it's not 100% agreed upon. If you read my above definitions, you will see that in some of them, the results of "Helloworldmaker" will qualify as source code,

Helloworldmaker is 100% source code. But it’s not the source code for helloworld. To make this even simpler, if I wrote a hello world program by rolling literal dice, SURELY you would pretend that the fully functional program’s source code is the worldly logic by which I rolled dice to generate the characters of code.

Or if we had an LLM spit out doom, we would not claim that the doom source code is neither the doom code, nor the LLM model but the training code for the model originally.

The origin of a program has no bearing on whether the program’s source code can be considered source code.

Given that we have established this, you cannot argue that the training program and data, which are not required to make a random set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb. It is a structured file of logic, in a form that is convenient to modify. That’s open source! The only reason we felt the need to gatekeep “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because it doesn’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process to make a program is the source code of the program.

Getting obsessive over antiquated definitions here is entirely missing the point of why source code and open source is defined the way it is.

jchw · 2024-07-31T23:47:35 1722469655

> I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries are NOT open source because, unlike model weights, they’re NOT the preferred way to modify Minecraft.

Minecraft "binaries" can not be open source because binaries are not source code.

> Helloworldmaker is 100% source code. But it’s not the source code for helloworld. To make this even simpler, if I wrote a hello world program by rolling literal dice, SURELY you would pretend that the fully functional program’s source code is the worldly logic by which I rolled dice to generate the characters of code.

What I said is that the results of "helloworldmaker" would not be universally considered source code. This is because whether generated code is source code is already debated. Most accurately, the source code for "helloworld" would be a script that generates it, by calling "helloworldmaker" with some set of parameters, not the result of that generation. That is source code, by every definition past, present and future. (Whether the resulting "helloworld" is also source code is unclear and depends on your definitions.)

> Or if we had an LLM spit out doom, we would not claim that the doom source code is neither the doom code, nor the LLM model but the training code for the model originally.

If you overfit an LLM to copy data in a roundabout way, then you're just having it spit out copies of human code in the first place, which isn't particularly novel. The only real wrench in the cogs re: LLMs is that LLMs are effectively 'equivalent' to humans in this case, as they can generate "novel" code that I agree would qualify as source code.

> The origin of a program has no bearing on whether the program’s source code can be considered source code.

I would advise you to check the definition of the word "source" before claiming asinine things like this.

> Given that we have established this, you cannot argue that the training program and data, which are not required to make a random set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb.

Yes that is correct, ML weights do not have source code, because they are data, not code. This isn't particularly stunning as computers perform all kinds of computational operations over datasets that don't involve things that are called source code. Database data in general is not source code. If you paint something in Photoshop, there is no source code for your painting; you can save it with greater or less fidelity, but none of those things are "source code", they're just different degrees of fidelity to the original files you worked on.

I am not thusly claiming, though, that computer graphics can't involve source code; it can, for example when producing graphics by writing SVG code. Rendering this to raster is not producing "object code" though; "object code" would be more like converting the SVG into some compiled form like a PDF. This is a great example of how "source code" and "object code" are not universal terms. They have fairly specific meanings tied to programming that, while not universally 100% agreed upon, have clear bounds on what they are not.

> It is a structured file of logic, in a form that is convenient to modify. That’s open source!

No, it isn't "open source". Open source as it's used today was coined in the late 90s and refers to a specific, well-defined concept. Even if we ignore the OSI, dictionary definitions generally agree. Oxford says that "open source" is an adjective "denoting software for which the original source code is made freely available and may be redistributed and modified." Wikipedia says "Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose."

Importantly, "open source" refers to computer software and in particular, computer software source code. It also has a myriad of implications about what terms software is distributed under. Even ignoring the OSI definition, "free for non-commercial use" is not a concept that has ever been meaningfully recognized as "open source", especially not by the experts who use this definition.

> The only reason we felt the need to gatekeep “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because it doesn’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process to make a program is the source code of the program.

Frankly I have no idea what you're on about with how it is ridiculous to argue there is no source code. I mean obviously, the software that does inference and training has "source code", but it is completely unclear to me why it's "ridiculous" that I don't consider ML model weights, which are quite literally just a bunch of numbers that we do statistics on, to be "source code". Again, ML weights don't even come close to any definition of source code that has ever been established.

> Getting obsessive over antiquated definitions here is entirely missing the point of why source code and open source is defined the way it is.

The reasoning for why Open Source is defined the way it is is quite well-documented, but I'm not sure what part of it to point to here, because there is no part of it that has anything in particular to do with this. The open source movement is about software programs.

I am not against an "open weight" movement, but co-opting the term "open source" is stupid, when it has nothing to do with it. The only thing that makes "open source" nice is that it has a good reputation, but it has a good reputation in large part because it has been gatekept to all hell. This was no mistake: in the late 90s when Netscape was being open sourced, a strategic effort was made to gatekeep the definition of open source.

But otherwise, it's unclear how these "free for non-commercial usage" ML weights and especially datasets have anything to do with open source at all.

It's not that the definition of the word "source code" has failed to keep up with the times. It has kept up just fine and still refers to what it always has. There is no need to expand the definition to some literally completely unrelated stuff that you feel bears some analogical resemblance.

(P.S.: The earliest documentation I was able to dig up for the definitions of the words "source code" and "object code" goes back to about the 1950s. The Federal Register covers some disputes relating to how copyright law applies to computer code. At the time, it was standard to submit 50 pages of source code when registering a piece of software for copyright: the first 25 pages and last 25 pages. Some companies were hesitant to submit this, so exceptions were made to allow companies to submit the first and last 25 pages of object code instead. The definitions of "source code" and "object code" in these cases remain exactly the same as they are today.)