Hacker News | anilgulecha's comments

> translation.

It's not technically a translation; it's a re-implementation, with test suites acting as the destination. If it were a file-by-file translation, your argument would have been valid.


Git is part of the LLM's training set though, so simply asking it to recreate git in another language is pretty much equivalent. Like, you can almost certainly get these LLMs to output git's full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI generated code has no copyright implications).

As mentioned in another comment, it's even more clear cut in this case. They actually put the original git sources in their project repo and instructed the agent to use it as the "source of truth".

Simple thought experiment. If you handed this same agents.md file (https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...) to a human software developer and let them work on exactly the same goal, would their output be considered a derivative work?


That's something I have been wondering. If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation. I don't see why this shouldn't apply to LLMs as well. If an LLM might have been trained on the original source code, it should be considered "tainted".

Yes, and realistically any code that LLMs produce is a derivative work of its training data. There's going to be a huge disaster licensing-wise.

I have absolutely no idea how LLMs got through anyone's legal departments. I guess the hope is that if everyone breaks the law enough, it'll just be fine.


> the hope is that if everyone breaks the law enough, it'll just be fine

Ever since the early 2010s when companies were started with the business idea "unlicensed hotels" and "unlicensed taxis" and made the owners really, really rich, this is said pretty much out loud. Look for words like "regulatory risks" and similar.

Maybe it started with the unlicensed gambling fad before that? That also made a lot of people filthy rich. Every time you have something under special license or insurance requirements, there is of course a margin for you if you can skimp on the license and hire gig workers instead.

The LLM situation with copyright and derived works in the 2020s is similar. Someone is likely to get rich, but there is a clear regulatory risk to it.


> if everyone breaks the law enough, it'll just be fine

That's pretty much what happened, isn't it? These concerns were all discussed in the beginning back in 2022, and I recall answers from many here on HN along the lines of "oh well, we can't stop it now or we'll risk falling behind China in AI development."

So yeah, the laws went out the window a long time ago the moment our government and the people decided to just look the other way willingly in the name of "progress."


Problem is there's a lot more than a single repo in training data, the corpus is massive... Should the author of a blog post on cats also be compensated for simply being in the same training data as the git repo?

Honestly? Yes. This is why it's such a problem that most of the training data was used without permission, and without the correct copyright status or license associated with it.

There are a lot of arguments about humans doing the same thing, but the reality is that humans and robots don't enjoy the same legal protection. It's clearly a derivative work of all of its training data.


> Honestly? Yes.

Then it works both ways. Say I manage to generate essentially a ripoff of your copyrighted song, release it and make a ton of money: you now have to split that royalty with keyboard cat. And Joe Bloggs. You'd end up with fractions of pennies.


> If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation.

That is the difference between necessary and sufficient. Clean-room is sufficient to guarantee avoiding copyright infringement, but it is not necessary. The legal line is south of there, but that position was chosen because they didn't want to risk crossing it, and it was easier to argue for in court.

tl;dr: clean room is overkill for avoiding copyright infringement


> Like, you can almost certainly get these LLMs to output git's full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI generated code has no copyright implications)

Are you sure? LLMs are in some way a compressed version of their input but it's a pretty lossy compression (arguably this makes them more like a compression algorithm than a compressed version of the data). I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.


> I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

Granted, these are some of the most widely spread texts, and not codebases, but just fyi: https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).


That paper is basically using the LLM as a compression algorithm: it's prompting with some section of the book and it's reprompting if it doesn't give the right output. Notably this only works if you already have a copy of the book in question!

Distributing a compressed copy of something is still copyright infringement.

You misunderstand my point: the LLM is not a losslessly compressed version of the text: you need to supply additional information from the original in order to 'extract' it from the LLM (and from that point of view, the extra information would be the compressed form).
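A toy sketch of the point above, not a claim about how any real LLM or the cited paper works: a tiny bigram predictor stands in for the "model", and the hypothetical helpers `extract_with_corrections` and `reconstruct` show that the text only comes back if you also keep the corrections, which can only be computed from the original.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Toy stand-in for a model: the most frequent next character after each character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


def extract_with_corrections(model, original):
    """Walk the original text and record a correction wherever the model guesses wrong.

    Computing these corrections requires the original, which is the whole point.
    """
    corrections = {}
    for i, ch in enumerate(original):
        guess = model.get(original[i - 1]) if i > 0 else None
        if guess != ch:
            corrections[i] = ch
    return corrections


def reconstruct(model, corrections, length):
    """Rebuild the text from the model plus the side information."""
    out = []
    for i in range(length):
        if i in corrections:
            out.append(corrections[i])
        else:
            out.append(model[out[-1]])
    return "".join(out)


text = "the cat sat on the mat and the cat ate the rat"
model = train_bigram(text)
corrections = extract_with_corrections(model, text)
restored = reconstruct(model, corrections, len(text))
print(restored == text, len(corrections), "corrections for", len(text), "characters")
```

In this framing the corrections dictionary, not the model, is the compressed form of the text; the model just makes it smaller than the original.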

Wouldn't a re-implementation be akin to 'here's how it works, write the code' rather than 'here's the code, redo it in Rust'?

> We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology. The Anthropic Institute will conduct research—in collaboration with many others—and take actions to help build the systems that a credible slowdown or pause would require.

Interesting - they're committing to kick off policy conventions to organize a worldwide slowdown of frontier LLM building. If they are actually able to crack it, this will give a much-needed breather IMO. As exciting as the last ~6 months have been, there are some bigger questions to answer now.


We should be skeptical of any major player that advocates for regulating their own industry. In practice, this just means increasing barriers to entry and making it harder to compete with them.

In my mind we should be trying to push AI along the Linux trajectory. You have a free and open source product, developed by a decentralized team with a strong code of ethics, running on commodity hardware. There can still be trillion dollar industries built on top of it, but the core technology is democratized and available to everybody. I don't see how we get there if we allow a handful of companies to dictate where development of the technology goes.


The regulation that is being argued for here is against pushing the frontier. Entering the market with, say, a new speech-to-text model is not subject to such regulation. What's needed is something qualitatively different from entry barriers, and of the frontier model companies, at least Anthropic and DeepMind seem to have enough self-awareness to speak about it. They find themselves in a race with a possibly catastrophic outcome for humanity and would like to stop, but that needs international cooperation on a level that no single company can provide.

it's a cartel looking to end competition though

the actual race is to keep having revenue, since everyone is still willing to pay more for the best model.

we as consumers of LLM models lose out by the arms race ending by the creation of a cartel

what happens if they get this regulatory capture is that all the frontier labs put effort into making inference cheaper, and become extraordinarily profitable, at the expense of us consumers, who really want better models, at a subsidized price


Wouldn’t this align with their financial interests? In theory the thing that’s keeping them from being profitable (or one of the big things) is the periodic capex of building new frontier models.

I don't think there's anything inherently bad about Anthropic making a profit. Red Hat makes a profit off of Linux. I'm interested in the democratization of the underlying technology.

I read this differently: they are actually seeing that it's hard to keep advancing frontier models, and now are moving the goal posts so that when they start getting evaluated more harshly, they can point to something like this.

They're probably looking for a way to slow down the capex required to keep up, so they can be more profitable.

> organize a world-slowdown of frontier LLM building

i don't want to be a negative nancy but i'm sure this "slowdown" will only be in effect until the infrastructure buildout is done or largely done. If they weren't hardware constrained there'd be no slowdown at all. Whoever gets there first wins everything ("there" being defined as AGI or a similar scale leap in capability).


Disagree is such a loose/wimpy study. Add in a grounded/expected response, and then it becomes a better benchmark (because it'll force the author to actually think about choices presented to the LLM).


Will add a human-labelled expected response and measure against it in follow-up research. This one only captures the disagreement between the models, but not which model is right/wrong.


IMO, I read 2 faulty assumptions:

1) That LLM/Agents are being pushed and not adopted. I see plenty of deep adoption by junior folks.

2) The unit economics don't work out. From the details on every model so far, each model is wildly profitable over its amortized time-frame. It's just that money is used upfront for the next model, and each next model is significantly more costly to train. The best-case argument instead is: this will not last, and we'll pour more into some models than we see in their revenue.

I think realistically these form the core of the thesis, and hence, IMO, its conclusions are a bit off the mark.


Juniors reverting to something even less useful is not a selling point. Often most adoption is forced by bosses. Repo code change metrics tell you all you need to know about how well it 'works'.

The economics do not work, not even close. Even if they ever did (probably a decade or so after the bubble pops), all parts of the stack (with the exception of Nvidia, maybe) are interchangeable. That means people can easily swap out foundation models, and creating new wrappers isn't very hard either. It will be a race to the bottom; I doubt anyone will make much money.

Last I checked, ycombinator will not fund your start-up if you shill for AI hard enough.


1) there is more to the world than software development

2) there is no profit. There is barely any revenue; the only money is continuous injections of VC cash and some frankly Enron-like bookkeeping.


apparently their goal is to take part of the 60 trillion dollar labor market that includes fast food order takers, IT professionals, hotel front desk workers, etc. There is plenty of payroll they can target


> I see plenty of deep adoption by junior folks

Where mate? Details?

> From the details on every model so far - each model is wildly profitable over its amortized time-frame

Is it? Is that why all of them are switching their users from the subsidized flat-rates to billing based on usage?

> hence its conclusions are a bit off the mark

You're funny - they are spot on and any dreamer who is working for equity in these LLM-wrapper-product companies who dreams of getting rich in the next few years or so, is in for a nasty surprise.


The younger devs have largely been the ones showing us old farts (e.g. millennials) how to do the really sophisticated stuff with claude - custom skills, plugins, tooling like openspec, things that have had massive benefits over stock claude.

We are nowhere near the ceiling in terms of process either.

Re: flat rate going to by-usage, I believe this is largely a long tail problem. You have a small number of power users that capitalize on the flat rate to use the service orders of magnitude more than the average user.


> really sophisticated stuff with claude - custom skills, plugins, tooling

You mean the "make-no-mistakes.md" and linter pipeline? Did not know that was now considered top-notch stuff.


Iceberg surface


Rubbish. The license change was the reason for the fork of the community, and for people switching. Quality was never cited as the issue.


It is of course a multiplier. The worries are:

- Fewer engineers needed overall -> less demand for human engineers -> lower compensation

- insufficient training at junior levels.

- longer time to productive human engineering skill.

These are playing out right now, and are a concern for all engineers in the industry. IronMan amplification doesn't address the above.


I have anosmia, triggered by AERD/polyps. I have been mostly without the sense for the past ~12 years, but in the past year have had bouts of smell again, via a doc who finally diagnosed AERD, and suggested steroid intervention + mepolizumab.


What a blast from the past. Couple decades ago I had submitted my first bit to this place.

https://github.com/Planet-Source-Code/anil-gulecha-bat-man-b...

VB6: What made a generation fall in love with programming.


Same - I stole my friend’s VB6 textbook in high school and couldn’t put it down. Transferred into CS classes the next day.

I used to go through PSC and download anything that looked interesting, and I would read the code to figure out how it worked. Learned so much from there! VB6 for apps, ASP for web.


Did he at least get the book back? The one time I ever let anyone borrow one of my books, it was an HTML 4 book, and I got it back in such a poor state I never did again.


He did! I think I borrowed it over a weekend and that was all the convincing I needed


You can still download VS6 from Microsoft and clone a repo from there, and the chances it’ll compile and run are higher than for a JS project that’s two weeks old.


RADBasic is also compatible: https://www.radbasic.dev/


Wonder how this compares to TwinBasic. I know back in my teens we were using what became Xojo as an alternative. I don't think it's drop-in with VB6, at least not anymore, but I still eyeball it every few years.


IMO, not too many people are being discovered through Substack. Twitter and other social media are where you have to have conversations to slowly build up your subscriber base.


I agree. A lot of discovery there is just people writing notes and posts “how to discover and boost discoverability” - just an endless loop of growing and talking about growing without substantial non-growth content.


> (again)

I can completely empathize - sometimes some problems never leave us.. like that piece of food stuck b/w teeth. There's a force within us asking us to right that problem in the world.

All the best to your project.

