> I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries is NOT open source because, unlike model weights, it’s NOT the preferred way to modify Minecraft.
Minecraft "binaries" cannot be open source because binaries are not source code.
> Helloworldmaker is 100% source code. Buts it’s not the source code for helloworld. To make this even simpler, if I wrote a hello world program by rolling literal dice, SURELY you would pretend that the fully functional program’s source code is the worldly logic by which I rolled dice to generate the characters of code.
What I said is that the results of "helloworldmaker" would not be universally considered source code. This is because whether generated code is source code is already debated. Most accurately, the source code for "helloworld" would be a script that generates it, by calling "helloworldmaker" with some set of parameters, not the result of that generation. That is source code, by every definition past, present and future. (Whether the resulting "helloworld" is also source code is unclear and depends on your definitions.)
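To make the distinction concrete, here is a minimal sketch. The names `hello_world_maker` and `build_helloworld` are hypothetical stand-ins I made up for illustration, not anything from a real project. The point is only that the script calling the generator with its parameters is unambiguously source code, while the emitted text is the debatable part.

```python
# Hypothetical stand-in for "helloworldmaker": a generator that emits program text.
def hello_world_maker(greeting: str, target: str) -> str:
    """Return the text of a tiny Python program that prints a greeting."""
    message = f"{greeting}, {target}!"
    return f"print({message!r})\n"


# The script that calls the generator with some set of parameters.
# This is the part that is source code for "helloworld" by any definition.
def build_helloworld() -> str:
    return hello_world_maker(greeting="Hello", target="world")


if __name__ == "__main__":
    # The generated text; whether this also counts as source code is the debated part.
    print(build_helloworld(), end="")
```

If you wanted a different "helloworld", you would edit the parameters in `build_helloworld` and regenerate, rather than patching the output by hand.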
> Or if we had an LLM spit out doom, we would not claim that the doom source code is neither the doom code, nor the LLM model but the training code for the model originally.
If you overfit an LLM to copy data in a roundabout way, then you're just having it spit out copies of human code in the first place, which isn't particularly novel. The only real wrench in the cogs re: LLMs is that LLMs are effectively 'equivalent' to humans in this case, as they can generate "novel" code that I agree would qualify as source code.
> The origin of a program has no bearing on whether the program’s source code is considerable to be source code.
I would advise you to check the definition of the word "source" before claiming asinine things like this.
> Given that we have established this, you cannot argue that the training program and data, which are not required to make a random set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb.
Yes, that is correct: ML weights do not have source code, because they are data, not code. This isn't particularly stunning, as computers perform all kinds of computational operations over datasets that don't involve anything called source code. Database data in general is not source code. If you paint something in Photoshop, there is no source code for your painting; you can save it with greater or lesser fidelity, but none of those things are "source code"; they're just different degrees of fidelity to the original files you worked on.
I am not claiming, though, that computer graphics can't involve source code; it can, for example when you produce graphics by writing SVG code. Rendering this to raster is not producing "object code", though; "object code" would be more like converting the SVG into some compiled form like a PDF. This is a great example of how "source code" and "object code" are not universal terms. They have fairly specific meanings tied to programming that, while not universally 100% agreed upon, have clear bounds on what they are not.
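As a rough illustration of the SVG case, here is a toy sketch. The `rasterize_rects` helper is something I made up for this comment; it only handles integer-sized `<rect>` elements and is not how any real renderer works. The SVG text is what a person writes and edits, and the raster that comes out is just a grid of pixel values.

```python
import xml.etree.ElementTree as ET

# Hand-written SVG: the thing a person would actually edit.
SVG_SOURCE = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="4" height="3">'
    '<rect x="1" y="1" width="2" height="1" fill="black"/>'
    '</svg>'
)

SVG_NS = "{http://www.w3.org/2000/svg}"


def rasterize_rects(svg_text: str) -> list[list[int]]:
    """Toy rasterizer (illustrative only): fill integer-sized rects into a 0/1 grid."""
    root = ET.fromstring(svg_text)
    width, height = int(root.get("width")), int(root.get("height"))
    grid = [[0] * width for _ in range(height)]
    for rect in root.iter(SVG_NS + "rect"):
        x, y = int(rect.get("x")), int(rect.get("y"))
        w, h = int(rect.get("width")), int(rect.get("height"))
        # Clip to the canvas so an oversized rect cannot index out of range.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                grid[row][col] = 1
    return grid


if __name__ == "__main__":
    for row in rasterize_rects(SVG_SOURCE):
        print("".join("#" if px else "." for px in row))
```

The grid is the rendered result, not a compiled form of the SVG: you can't recover the `<rect>` from it in general, and nobody would call it "object code".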
> It is a structured file of logic, in a form that is convenient to modify. That’s open source!
No, it isn't "open source". Open source as it's used today was coined in the late 90s and refers to a specific, well-defined concept. Even if we ignore the OSI, dictionary definitions generally agree. Oxford says that "open source" is an adjective "denoting software for which the original source code is made freely available and may be redistributed and modified." Wikipedia says "Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose."
Importantly, "open source" refers to computer software and in particular, computer software source code. It also has a myriad of implications about what terms software is distributed under. Even ignoring the OSI definition, "free for non-commercial use" is not a concept that has ever been meaningfully recognized as "open source", especially not by the experts who use this definition.
> The only reason we felt the need to gatekeep “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because it doesn’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process to make a program is the source code of the program.
Frankly I have no idea what you're on about with how it is ridiculous to argue there is no source code. I mean obviously, the software that does inference and training has "source code", but it is completely unclear to me why it's "ridiculous" that I don't consider ML model weights, which are quite literally just a bunch of numbers that we do statistics on, to be "source code". Again, ML weights don't even come close to any definition of source code that has ever been established.
> Getting obsessive over antiquated definitions here is entirely missing the point of why source code and open source is defined the way it is.
The reasoning for why Open Source is defined the way it is is quite well-documented, but I'm not sure what part of it to point to here, because there is no part of it that has anything in particular to do with this. The open source movement is about software programs.
I am not against an "open weight" movement, but co-opting the term "open source" is stupid, when it has nothing to do with it. The only thing that makes "open source" nice is that it has a good reputation, but it has a good reputation in large part because it has been gatekept to all hell. This was no mistake: in the late 90s when Netscape was being open sourced, a strategic effort was made to gatekeep the definition of open source.
But otherwise, it's unclear how these "free for non-commercial usage" ML weights and especially datasets have anything to do with open source at all.
It's not that the definition of the word "source code" has failed to keep up with the times. It has kept up just fine and still refers to what it always has. There is no need to expand the definition to some literally completely unrelated stuff that you feel bears some analogical resemblance.
(P.S.: The earliest documentation I was able to dig up for the definitions of the words "source code" and "object code" goes back to about the 1950s. The Federal Register covers some disputes relating to how copyright law applies to computer code. At the time, it was standard to submit 50 pages of source code when registering a piece of software for copyright: the first 25 pages and last 25 pages. Some companies were hesitant to submit this, so exceptions were made to allow companies to submit the first and last 25 pages of object code instead. The definitions of "source code" and "object code" in these cases remain exactly the same as they are today.)