SchemaLoad's comments | Hacker News

"Good enough" bridges still last 50+ years. We could design a bridge to last 200 years, but we don't even know whether the design we have today will still be needed in 200 years. Maybe by then we'll all use trains in underground tunnels.

I don't think that's true. Engineers would largely want to build the best bridge, costs be damned. But they would end up undercut by anyone who cuts corners, so the only companies getting contracts would be the ones who cut the most. Even if no one wants to build bridges that collapse, preventing that would be impossible without counter-forces like laws and accountability.

Microsoft has had a lot of naming blunders in the past, but this has to be their worst. Copilot is currently a tool to review PRs on GitHub, the new name for Windows Cortana, the new name for Microsoft Office, a new class of Windows laptops/PCs, a plugin for VS Code that can use many models, and probably a number of other things. None of these products/features have any relation to each other.

So if someone says they use Copilot, that could mean anything from using Word to using Claude in VS Code.


>Microsoft has had a lot of naming blunders in the past but this has to be their worst.

Nah, I still rate "Windows App": the Windows app that lets you remotely access Windows apps. I hate it to death; it's like a black hole that sucks all meaning from conversations about it.


"Microsoft Remote Desktop" was such a good and distinct name. RIP.

It’s probably a useful feature: if it’s named copilot, assume it’s slop and avoid it.

This feels like an AI-generated comment, but I'll reply anyway. AI has been a massive negative for open source, since every project is now drowning in AI-generated PRs that don't work, reports for issues that don't exist, and a general mountain of time-wasting automated slop.

We are getting to the point where many projects may have to close submissions from the general public since they waste far more time than they help.


And then you get a new hire who already knows the common SaaS products but has to relearn your vibe-coded version that no one else uses and for which no information exists online.

There is a reason why large proprietary products remain prevalent even when cheaper, better alternatives exist. Being "industry standard" matters more than being the best.


The new hire will just vibe code a new solution that translates your solution into something he prefers. Every new hire will have his own.

This will all end well I'm sure

It will. By translation I mean like a front end client that translates the api into a user interface they prefer. They will build something localized to their own workflow. If it doesn't end well it's localized to them only.

As the kids say: "let them cook"

Maybe, but I don't really believe users can or want to start designing software, even if it were possible, which today it isn't unless you already have software dev skills.

That would basically make users product managers and UX designers, which they aren't really equipped to be currently. At most they will discover that what they think they want isn't what they actually want.


Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the training dataset. Only the private problems actually matter.

In this case the code is public and you can see they are not cheating in that sense.

The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.

Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You could try something like 10^10 variations of harnesses and select the one that performs best. If you then look at that harness, it probably won't look like cheating, but you have biased the estimator by selecting the harness according to its benchmark score.
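To make the selection effect concrete, here is a toy simulation (not the actual benchmark; all numbers are made up): even when every harness variation has identical real capability, reporting the best of many inflates the measured score well above the true rate.

```python
import random

random.seed(1)
TRUE_SKILL = 0.50    # hypothetical true per-task success rate
N_TASKS = 100        # benchmark size
N_HARNESSES = 1000   # harness variations tried

def measured_score():
    # Every harness variation has the same real capability;
    # scores differ only by sampling noise over the task set.
    return sum(random.random() < TRUE_SKILL for _ in range(N_TASKS)) / N_TASKS

scores = [measured_score() for _ in range(N_HARNESSES)]
mean = sum(scores) / len(scores)
best = max(scores)
print(f"mean over harnesses: {mean:.2f}")   # close to the true 0.50
print(f"best harness:        {best:.2f}")   # noticeably higher, from noise alone
```

The gap between the mean and the maximum is pure selection bias: picking the winner after the fact overstates what the system can actually do.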

Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.

They aren't training new models for this. This is an agent harness for Opus 4.6.

All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.

ok! So if someone uses an existing, checkpointed, open source model then the answer is yes the results are valid and it doesn't matter that the tests are public.

Yes, assuming the checkpoint was before the announcement & public availability of the test set.

You live in a conspiracy world. Those AI providers don't update their models that fast. You can ask them to solve ARC-AGI-3 without a harness yourself and watch them struggle, as of yesterday.

Which part is the conspiracy? Be as concrete as possible.

They are definitely cheating: they have crafted prompts[1] that explain the game rules rather than having the model explore and learn.

1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


Where do you see that? I only skimmed the prompts but don't see any aspects of any of the games explained in there. There are a few hints which are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series; read the reports.

What is the use in keeping it open when no one will ever look at it again after it goes stale? It still exists in the system if you ever wanted to find it again or if someone reports the same issue again. But after a certain time without reconfirming the bug exists, there is no point investigating because you will never know if you just haven't found it yet or if it was fixed already.

See my reply to eminence32 - bug tracking serves as a list of known defects, not as a list of work the engineers are going to do this [day/month/year].

The primary purpose is not usually a list of known defects and many ‘bugs’ are not actually bugs but feature requests or misunderstandings from users (e.g. RFC disallows the data you want my html parser to allow).

> The primary purpose is not usually a list of known defects and many ‘bugs’ are not actually bugs but feature requests

IME there are separate mechanisms to track feature work, bug trackers are for... bugs.

> or misunderstandings from users (e.g. RFC disallows the data you want my html parser to allow).

Again, this is a class of bug report that nobody is arguing should stay open.


The people who filed them would disagree and many would vehemently argue that their bug is in fact a bug, and is the most important bug and how dare you close it.

There are also going to be mountains of bugs resulting from cosmic rays hitting the computer, defective RAM chips, and weird modifications to the system that the reporter hasn't mentioned.

You could sink an infinite amount of time investigating and find nothing. At some point you have to cut off the time investment when only one person has reported it and no devs have been able to reproduce it.


If a bug report contains reproduction instructions, most of that will be eliminated.

99.99% of bug reports do not

You're lucky if there is an accurate description of what produced the bug on the customers' specific setup at the time of reporting.


Well then, closing it for inactivity is fine.

Quite often the reproduction information will only reproduce the bug in the customer's environment, so there is a lot of incomplete state about what is actually causing the problem.

It's pretty terrible in enterprise, because there is so much 3rd-party crap touching things it shouldn't, and it loves to cause fun problems.


There's obviously some nuance here, but the fact is that much modern software is riddled with bugs, and this is sub-optimal for everyone (both software users and software builders). Most of the bugs which frustrate and irritate software users are not due to uncontrollable events such as cosmic rays flipping a bit. Most of them are plain old code defects.

But, you do have a valid point. Allow me to rephrase it this way: The answer is not for software companies to spend unbounded amounts of engineer time chasing every reported bug.

But there are ways that we, as an industry, can do better, and it's not by pouring all our time into chasing hard-to-diagnose bugs. Here are a few ways that I personally see:

1. Some very powerful technologies for finding many bugs with little engineering effort already exist, but are not widely used. As an example, coverage-guided fuzzing is amazingly good at finding all kinds of obscure bugs. The idea has been known since the 1990s, but it took AFL (~2013) to take it mainstream. Even now, much of the industry is not benefiting from the awesome power of coverage-guided fuzzing. And there are other, equally powerful techniques which have been known for a long time but are even less accessible to most software developers.

So: spread the word about such techniques, and for programming language/platform developers, work on making them more easily applicable. This could help many software companies to catch a great number of bugs before they ever go to production.
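For anyone unfamiliar, the core loop of coverage-guided fuzzing is tiny. A toy sketch (the target and its "FUZ" crash are invented for illustration; real tools like AFL or libFuzzer get coverage from compile-time instrumentation rather than a returned set):

```python
import random

def target(data: bytes):
    """Toy program under test: crashes only on inputs starting with "FUZ".
    Returns the set of branches the input reached."""
    cov = set()
    if len(data) > 0 and data[0] == ord("F"):
        cov.add(1)
        if len(data) > 1 and data[1] == ord("U"):
            cov.add(2)
            if len(data) > 2 and data[2] == ord("Z"):
                raise ValueError("crash: reached the buggy branch")
    return cov

def mutate(data: bytes) -> bytes:
    # Flip one random byte to a random value.
    b = bytearray(data)
    b[random.randrange(len(b))] = random.randrange(256)
    return bytes(b)

def fuzz(max_iters=200_000, seed=0):
    random.seed(seed)
    corpus = [b"AAA"]          # seed input
    seen = set()               # coverage observed so far
    for _ in range(max_iters):
        child = mutate(random.choice(corpus))
        try:
            cov = target(child)
        except ValueError:
            return child       # crashing input found
        if not cov <= seen:    # new coverage: keep this input for mutation
            seen |= cov
            corpus.append(child)
    return None

crash = fuzz()
```

Blind random input would need on the order of 256^3 tries to hit "FUZ"; keeping inputs that reach new branches lets the fuzzer climb one byte at a time, which is the whole trick.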

2. Similarly, there are extant historical computing systems which had very powerful debugging facilities, much better than what is currently available to most developers. The ideas on how to make our platforms more debuggable are already out there; it's now a matter of popularizing those ideas and making them readily accessible and applicable.

3. Since it's widely known that many bugs (real bugs, not "cosmic rays") are extremely hard to reproduce, an admirable target for us to aim for as developers is to implement debug logging in a way which allows us to root-cause most obscure bugs just by examining the logs (i.e. no need to search for a reproducer). Some real-world systems have achieved that goal, with very good results.
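One cheap pattern toward that goal is a bounded in-memory trace that is only dumped when something fails, so verbose tracing stays affordable in production. A minimal sketch (TraceLog and its API are hypothetical, not from any particular library):

```python
from collections import deque

class TraceLog:
    """Keep only the last `capacity` debug events in memory."""
    def __init__(self, capacity=256):
        self._events = deque(maxlen=capacity)

    def trace(self, msg, **ctx):
        # Record the message plus structured context (ids, sizes, states).
        self._events.append((msg, ctx))

    def dump(self):
        # Called from an error handler: the recent history often
        # root-causes the failure without needing a reproducer.
        return list(self._events)

log = TraceLog(capacity=3)
for i in range(5):
    log.trace("processing item", item=i)
events = log.dump()  # only the 3 most recent events survive
```

Attaching a dump like this to every crash report means the log arrives with the bug, instead of asking the reporter to reproduce it later.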

4. While there is currently much buzz about using LLM-based coding agents to write code, I think an arguably better use case for coding agents is triaging bug reports, diagnosing the bugs, finding reproducers, and so on.

I've recently had a couple of shocking experiences where, just given a written description of an intermittent, hard-to-diagnose bug, a coding agent was able to search an entire codebase, identify the exact cause, and write a reproducer test case. (And this after multiple experienced human programmers had looked at the issue but failed to identify the cause.)

In summary, I think there are ways to "cut the Gordian knot" of bug reports.


What if no devs even tried to reproduce it, and they have no reason to believe they've fixed the bug with any other changes?

That seems to be the case described in the article. In such a situation, I think it's dishonest to ask the reporter to expend even more effort when you've spent zero. Just close it if you don't want to do it, you don't have to be a jerk to your customers, too, by sending them off on a wild goose chase.

Otherwise, why not ask the reporter to reproduce the issue every single day until you choose to fix it at some unknown point in the future, and if they miss a day, it gets closed? That seems just as arbitrary.


> Otherwise, why not ask the reporter to reproduce the issue every single day until you choose to fix it at some unknown point in the future, and if they miss a day, it gets closed? That seems just as arbitrary.

TrueNAS literally takes this approach to bugs.


Right. The problem isn’t closing the ticket, it’s pretending more work is happening than actually is.

“Needs verification” is fine if someone has actually tried to reproduce it. Otherwise it’s just a nicer way of saying “we’re not going to look at this.”


Most of the time there is some reason to believe that the bug could be fixed though, i.e. there were non-trivial code changes around that area.

Most of the time, there isn't any reason to believe it could be fixed, i.e. there were not any non-trivial changes around that area. What you're describing happens less frequently, and in such cases, the devs should discuss that with the reporter.

In this specific example, it looks like Apple gave no indications that such changes had happened, and no indications they had even spent a nonzero amount of effort following the reproduction instructions with either the old code or the new code.


I'll steal this for my project's bug template! /s

"Please consider cosmic rays hitting the computer, defective RAM chips, and weird modifications to your system before submitting a bug. Unless you explicitly acknowledge this, your bug will be closed automatically in 30 days. Thank you very much."


iOS added "Call screening", which asks unknown callers to explain who they are and what they want before the phone rings for the receiver.

The tricky part for scammers is that there is no good answer here: if you claim to be a plumber and the victim hasn't booked a plumber, they won't answer.


Google Pixel phones have also had this feature for at least five years. Spammers usually just hang up instantly.
