Hacker News | overfeed's comments

Chinese companies, starting with CXMT, will own the consumer segment, at least until they are sanctioned/banned in the US. The rest of the world will be fine, but consumer desktop computing in the US will be akin to the cars in Cuba.

I use local dev containers: the worst an agent can do is delete its working copy. It has no access to my home directory, access tokens or sudo.

> Now I'm really curious. What field are you in that ndjson files of that size are common?

I'm not OP, but structured JSON logs can easily result in humongous ndjson files, even with a modest fleet of servers over a not-very-long period of time.


So what's the use case for keeping them in that format rather than something more easily indexed and queryable?

I'd probably just shove it all into Postgres, but even a multi terabyte SQLite database seems more reasonable.


Replying here because the other comment is too deeply nested to reply.

Even if it's a once-off, some people handle a lot of once-offs, and that's exactly where you need good CLI tooling to support it.

Sure jq isn't exactly super slow, but I also have avoided it in pipelines where I just need faster throughput.

rg was insanely useful in a project I once got where they had about 5GB of source files, a lot of them auto-generated. And you needed to find stuff in there. People were using Notepad++ and waiting minutes for a query to find something in the haystack. rg returned results in seconds.


You make some good points. I've worked in support before, so I shouldn't have discounted how frequent "once-offs" can be.

The use case could be exactly that: processing an old trove of logs into something more easily indexed and queryable, where you might want to use jq as part of the processing pipeline.
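To make that concrete, here is a minimal sketch of such a one-off conversion in Python rather than jq, loading ndjson log lines into an indexed SQLite table. The `ts` and `level` field names, the `logs` table, and the `load_ndjson` helper are all assumptions for illustration; real logs will have a different schema, and this assumes those two fields hold plain scalar values.

```python
import json
import sqlite3
from typing import Iterable


def load_ndjson(lines: Iterable[str], conn: sqlite3.Connection) -> int:
    """Stream ndjson lines into an indexed SQLite table; return rows inserted."""
    # Hypothetical schema: two extracted fields plus the raw line.
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, level TEXT, raw TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS logs_level ON logs (level)")
    inserted = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort a long run
        if not isinstance(record, dict):
            continue  # only JSON objects are treated as log records
        conn.execute(
            "INSERT INTO logs (ts, level, raw) VALUES (?, ?, ?)",
            (record.get("ts"), record.get("level"), line),
        )
        inserted += 1
    conn.commit()
    return inserted


# Tiny in-memory demo; a real run would iterate over an open file instead.
sample = [
    '{"ts": "2024-01-01T00:00:00Z", "level": "error", "msg": "disk full"}',
    '{"ts": "2024-01-01T00:00:01Z", "level": "info", "msg": "ok"}',
    "not json",
]
conn = sqlite3.connect(":memory:")
count = load_ndjson(sample, conn)
errors = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE level = ?", ("error",)
).fetchone()[0]
print(count, errors)
```

Because the loader takes any iterable of lines, it never holds the whole file in memory, which is the property that matters for very large inputs.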

Fair, but for a once-off thing performance isn't usually a major factor.

The comment I was replying to implied this was something more regular.

EDIT: why is this being downvoted? I didn't think I was rude. The person I responded to made a good point, I was just clarifying that it wasn't quite the situation I was asking about.


At scale, low performance can very easily mean "longer than the lifetime of the universe to execute." The question isn't how quickly something will get done, but whether it can be done at all.

Good point. I said it above, but I'll repeat it here: I shouldn't have discounted how frequent once-offs can be. I've worked in support before, so I really should've known better.

Certain people/businesses deal with one-off things every day. Even for something truly one-off, if one tool is too slow it might still be the difference between being able to do it once or not at all.

> I feel like we are just inching closer and closer to a world where rapid iteration of software will be by default.

There's lots of experimentation right now, but one thing that's guaranteed is that the data gatekeepers will slam the door shut[1] - or install a toll-booth - once there's less money sloshing about and the winners and losers are clear. At some point in the future, Atlassian and GitHub may not grant Anthropic access to your tickets unless you're on the relevant tier with the appropriate "NIH AI" surcharge.

1. AI does not suspend or supplant good old capitalism and the cult of profit maximization.


What were you using 6 months ago?

Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6.

The models don’t change.

On paper. There's huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.

And there’s an incentive to publish evidence of this to discourage it, do you have any?

Models aren't just big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.

There really always is a man behind the curtain eh?


It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).

ETA: reminds me of biology, too. In life, it turns out the simpler some functional component looks, the more stupidly overcomplicated it is when you look at it under a microscope.


There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.

[1]: https://marginlab.ai/trackers/claude-code/


So - as the charts say - no statistical difference?

Isn't this link an argument against the point you are making?


The chart doesn't cover the 4.6 release, which was in the late December/early January time frame. So it's hard to tell from existing data.

That isn't true. The whole point is to pick up statistically significant variations quickly, and with the volume of tests they are doing there is plenty of data.

If you turn on the 95% CI bands you can see there is plenty of statistical significance.


Unless you and I are looking at different web pages… it only goes back to February, not December or January.

Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different, which is the extraordinary claim in this case...

Or just change the reasoning levels.

Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.

Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
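As a rough check on that claim, here is a sketch using the textbook normal-approximation formula for comparing two proportions. The 50% vs 60% pass rates, the 5% significance level and the 80% power are illustrative assumptions, not numbers from any benchmark.

```python
import math
from statistics import NormalDist


def standard_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate p measured over n samples."""
    return math.sqrt(p * (1.0 - p) / n)


def samples_per_group(p1: float, p2: float, alpha: float = 0.05,
                      power: float = 0.80) -> int:
    """Approximate samples needed per model to tell p1 from p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Quadrupling the sample size only halves the standard error.
se_ratio = standard_error(0.5, 100) / standard_error(0.5, 400)

# Assumed scenario: a 10 percentage point gap, 50% vs 60% pass rate.
n_needed = samples_per_group(0.50, 0.60)
print(se_ratio, n_needed)
```

Under these assumptions the answer lands in the hundreds of samples per model, and smaller gaps push it up quickly because the required count grows with the inverse square of the difference.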

https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
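To put a number on "very wide", here is a sketch of a 95% Wilson score interval for a single day of 50 tasks. The 28-out-of-50 result is a made-up illustrative day, and I'm not claiming this is the interval method the tracker itself uses.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    ) / denom
    return center - half, center + half


# Hypothetical day: 28 of 50 tasks passed (a 56% observed pass rate).
low, high = wilson_interval(28, 50)
width = high - low
print(f"{low:.3f} to {high:.3f} (width {width:.3f})")
```

With only 50 tasks the interval spans roughly a quarter of the whole 0-100% range, so day-to-day swings of a few points sit well inside it.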


It's hard to trust public, high profile benchmarks, because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro, so everything that gets released will perform well on this benchmark.

Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.

They do. I'm currently seeing a degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".

Make that 2, I told my friends yesterday "Opus got dumb, new model must be coming".

I swear that different sessions will route to different quants. Sometimes it's good, sometimes not.


Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.

And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...

50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
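The arithmetic behind that range, as a quick sketch. It assumes every prompt fails independently at the quoted pass rates, which is a simplification; `extra_failures` is just a helper name for this example.

```python
def extra_failures(baseline_pct: int, today_pct: int, prompts: int) -> float:
    """Expected additional failed prompts when the pass rate drops."""
    # Work in whole percentage points to avoid float rounding noise.
    return (baseline_pct - today_pct) * prompts / 100


# 56% baseline vs 53% today, over the quoted 50-200 prompts per 5 hours.
low_end = extra_failures(56, 53, 50)
high_end = extra_failures(56, 53, 200)
print(low_end, high_end)
```

So the 3-point gap works out to about 1.5 extra failures at 50 prompts and 6 at 200.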


Only nominally...

Oh yes, they do.

I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.

No conspiracy theories. Just companies being scumbags, cutting corners, and doctoring benchmarks while denying it. It's been happening forever.

> ...that really isn't that hard.

Until the AI scrapers[1] come for you at 5k requests per second and you're doing operations in hard-mode.

1. Most forges have HTTP pages for discoverability. I suppose one could hypothetically set up an ssh-only forge and statically generate an HTML site periodically, but this is already advanced ops for the average GitHub user.


This isn't a real thing and if it ever becomes a thing you can sue them for DDOS and send Sam Altman to jail. AI scraping is in the realm of 1-5 requests per second, not 5000.

I wasn't proposing a full on forge, just a VM with a (key auth only) ssh server to push code to/from.

fail2ban
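For anyone unfamiliar, the general idea is to ban a client once it exceeds some number of hits within a time window. Below is a toy sketch of that idea only: it is not fail2ban's code or configuration, and the `RateBanner` class and its thresholds are invented for illustration.

```python
from collections import defaultdict, deque


class RateBanner:
    """Ban an IP once it makes more than max_hits requests within the window."""

    def __init__(self, max_hits: int, window_seconds: float):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()

    def observe(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if the IP is banned."""
        if ip in self.banned:
            return True
        recent = self.hits[ip]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned.add(ip)
            del self.hits[ip]
            return True
        return False


banner = RateBanner(max_hits=3, window_seconds=1.0)
# A burst of four requests in 0.3 seconds trips the ban on the fourth.
burst = [banner.observe("10.0.0.1", t) for t in (0.0, 0.1, 0.2, 0.3)]
# A client making one request every two seconds is never banned.
slow = [banner.observe("10.0.0.2", t) for t in (0.0, 2.0, 4.0, 6.0)]
print(burst, slow)
```

A per-IP threshold like this does nothing against a scraper spread across many addresses, which is worth keeping in mind for the scenario upthread.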

> why would israel want to annex territory in Lebanon?

Why are Israeli settlers annexing land in the West Bank? Why is the right wing government letting them?


These two issues are completely different. Judea and Samaria do not equal Lebanon, ideologically or geopolitically.

The Israeli military launching incursions into Lebanon to fight Hezbollah and prevent them from launching rockets randomly into Israel (these rockets killing many Arabs as well) is not the same as the squabbles of a small minority of civilians in disputed territory within Israel proper.


Hence the name Samson: caving the roof over one's self while taking down the enemies.

Do go on - what were his instructions on what they ought to do after the bombing stopped?

Overthrow the regime. Has the bombing stopped?

The CIA, as its tradition demands, never meddles when the conditions are ripe to promote American interests. They just let nature take its course from afar.

> CIA, as its tradition demands, never meddles when the conditions are ripe to promote American interests

Straw man. Nobody argued American interests were unrepresented on the ground.

