More

polygot · 2025-04-26T06:34:15 1745649255

I decided to do that, and made Paper2Code2Code: https://github.com/alexyorke/Paper2Code2Code/tree/main

IshKebab · 2025-04-26T11:08:16 1745665696

And... does it work?

polygot · 2025-04-16T03:12:29 1744773149

Reminds me of the JetBrains installer--not a bad reminder, I like the animations.

polygot · 2025-04-15T19:24:20 1744745060

I also wonder if "Kitty Litter/Zeolite" was read without context in the white-paper, i.e., any type of kitty litter rather than specifically the inorganic kind.

polygot · 2025-04-14T03:47:50 1744602470

I could _maybe, maybe_ see self-hosted GitHub runners inside of VMs running on developer machines. The VMs would have to be re-created after every request however.

I don't think this is a great idea, however, as now CI is dependent on my flaky laptop's wifi/internet connection/IP address, has the potential to be contaminated by something running on my machine, build logs can be modified, environment shape/architectures are all different and can't be easily controlled, I now have access to all of the CI secrets and can impersonate the CI, etc.

polygot · 2025-04-02T17:35:38 1743615338

Would dev containers solve this issue?

tyzoid · 2025-04-02T21:13:07 1743628387

Most likely, yes. Then it wouldn't have mattered that the `sl` package was installed.

polygot · 2025-03-29T05:03:37 1743224617

How much better is the turn taking relative to two humans, when, for example, ordering a pizza? Human to human interaction has a non-zero false turn taking false positive score in my experience.

Ajedi32 · 2025-03-29T05:26:58 1743226018

Latency (such as you get when communicating over the phone) makes turn taking much more difficult. Even in-person it's still a non-zero false positive rate though.

pzo · 2025-03-29T07:08:58 1743232138

exactly when you are speaking face to face and you see person you have another visual cue from someone lips and face expression realtime and you know if someone stopped speaking or just taking some time to gather thoughts or trying to find proper word in their own head (e.g. non native speaker).

aoanevdus · 2025-03-29T05:22:59 1743225779

IMO this is a use-case where the added latency of cell phones is a downgrade.

polygot · 2025-03-27T21:18:49 1743110329

Very cool! I'd be curious to see an HDR version of it using bracketed exposures--it might give a better sense of how it actually looks in person. It seems really bright in the photos, so the shortest exposure would probably need to be very low to capture a good dynamic range for the HDR.

polygot · 2025-03-27T18:49:21 1743101361

There needs to be some more research on what path the model takes to reach its goal, perhaps there is a lot of overlap between this and the article. The most efficient way isn't always the best way.

For example, I asked Claude-3.7 to make my tests pass in my C# codebase. It did, however, it wrote code to detect if a test runner was running, then return true. The tests now passed, so, it achieved the goal, and the code diff was very small (10-20 lines.) The actual solution was to modify about 200-300 lines of code to add a feature (the tests were running a feature that did not yet exist.)

brulard · 2025-03-27T18:54:26 1743101666

That is called "Volkswagen" testing. Some years ago that automaker had mechanism in cars which detected when the vehicle was being examined and changed something so it would pass the emission tests. There are repositories on github that make fun of it.

rsynnott · 2025-03-27T19:26:33 1743103593

While that’s the most famous example, this sort of cheating is much older than that. In the good old days before 3d acceleration, graphics card vendors competed mostly on 2d acceleration. This mostly involved routines to accelerate drawing Windows windows and things, and benchmarks tended to do things like move windows round really fast.

It was somewhat common for card drivers to detect that a benchmark was running, and just fake the whole thing; what was being drawn on the screen was wrong, but since the benchmarks tended to be a blurry mess anyway the user would have a hard time realising this.

hn_acc1 · 2025-03-27T23:59:36 1743119976

Pretty sure at least one vendor was accused of cheating on 3D-Mark at times as well.

Cyphase · 2025-03-28T00:22:39 1743121359

https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

airstrike · 2025-03-27T22:07:52 1743113272

I think Claude-3.7 is particularly guilty of this issue. If anyone from Anthropic is reading this, you might want to put your thumb on the scale so to speak the next time you train the model so it doesn't try to use special casing or outright force the test to pass

phobeus · 2025-03-27T18:55:01 1743101701

This looks like the very complaint of "specification gaming". I was wondering how will it show up in llm's...looks like this is the way it presented itself..

TeMPOraL · 2025-03-27T21:28:00 1743110880

I'm gonna guess GP used a rather short prompt. At least that's what happens when people heavily underspecify what they want.

It's a communication issue, and it's true with LLMs as much as with humans. Situational context and life experience papers over a lot of this, and LLMs are getting better at the equivalent too. They get trained to better read absurdly underspecified, relationship-breaking requests of the "guess what I want" flavor - like when someone says, "make this test pass", they don't really mean "make this test pass", they mean "make this test into something that seems useful, which might include implementing the feature it's exercising if it doesn't exist yet".

polygot · 2025-03-28T21:15:41 1743196541

My prompt was pretty short, I think it was "Make these tests pass". Having said that, I wouldn't mind if it asked me for clarification before proceeding.

pton_xd · 2025-03-27T19:28:55 1743103735

Similar experience -- asked it to find and fix a bug in a function. It correctly identified the general problem but instead of fixing the existing code it re-implemented part of the function again, below the problematic part. So now there was a buggy while-loop, followed by a very similar but not buggy for-loop. An interesting solution to say the least.

neonsunset · 2025-03-27T23:55:47 1743119747

Funny that you mention it because in JavaScript there already is a library for this:

https://github.com/auchenberg/volkswagen

felbane · 2025-03-27T18:52:50 1743101570

Ah yes, the "We have a problem over there/I'll just delete 'over there'" approach.

polygot · 2025-03-27T19:12:56 1743102776

I've also had this issue, where failing tests are deleted to make all the tests pass, or, it mocks a failing HTTP request and hardcodes it to 200 OK.

ctoth · 2025-03-27T19:27:04 1743103624

Reward hacking, as predicted over and over again. You hate to see it. Let him with ears &c.

jsight · 2025-03-28T01:18:43 1743124723

I've heard this a few times with Claude. I have no way to know for sure, but I'm guessing the problem is as simple as their reward model. Likely they trained it on generating code with tests and provided rewards when those tests pass.

It isn't hard to see why someone rewarded this way might want to game the system.

I'm sure humans would never do the same thing, of course. /s

polygot · 2025-03-16T05:28:25 1742102905

Very interesting podcast, I find that the guest was very candid so it was great to hear what it is really like working at Jane Street.

The reproducible-Python notebook problem/notebook for researchers mentioned in the podcast inspired me to create a new project, branch-pad https://github.com/alexyorke/branch-pad which is an interactive Python notebook environment that allows you to create and explore multiple branches of code execution.

polygot · 2025-02-08T18:49:47 1739040587

I'd like to see a photo of said cat, for science of course :)