
Perfect number to make H1Bs a tool that is out of reach for startups but still meaningful for large entrenched corporations. Nailed it. Maybe they can even waive the fee if you give the US government 10% of your company.

University hiring is basically rekt. Throwing out the baby with the bath water per usual with this admin...

How much does university hiring depend on H-1B? I would expect much of that comes through O-1 or EB-1/2/3, no?

H-1B is the default visa for international faculty hires. You can get it in a few months with relatively little effort. O-1 is more expensive, takes longer to get, and requires more effort from the applicant. Then there is the subjective approval process that involves a degree of risk, and in the end, you get a slightly inferior visa.

Green cards are almost useless for hiring, as the processing times are too long. "We would like to offer you this position, but conditionally. We still need a year or two to handle the bureaucracy, and we can't say for sure if we are actually allowed to hire you. Please don't accept another position meanwhile."


No, pretty much all professors who used to be international students or postdocs are on H1B.

my understanding is postdocs are virtually all on J-1 visas, which is a meaningful part of uni hiring

> used to be

So... Now those spots will have to go to American students and grads?

Some will.

Most won't be filled at all.


+1 This will also reduce demand for these programs from international students -- making tuition more expensive for locals. Asking people to consider 2nd/3rd order effects seems like a bit too much for the median HN poster though

lots of immigrant kids are in uni now. all my cousins are doing cs now. look at latest batch of yc founders.

An equity minimum would deal with this.

I'm glad the em dash is getting properly shit on these days, if for unrelated reasons. I've never liked it. I hate the stupid spacing rules around it. It never looks right to put no spaces around the em dash, and it probably breaks all sorts of word-splitting code that's based on "\s". Where else does punctuation without spaces not mean a single word? A hyphen without spaces makes a compound word: it counts as one. Imagine if the correct use of a colon was to not put spaces around it:like this. Do you like that? Of course not.
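
To make the word-splitting point concrete, here's a tiny sketch (my own illustration in TypeScript, not from any real library) of what a naive "\s"-based tokenizer does with each mark:

  // Naive word splitting on whitespace, the way plenty of word-count
  // and text-processing utilities do it.
  const tokens = (s: string): string[] => s.trim().split(/\s+/);

  tokens("a well-known fact");     // ["a", "well-known", "fact"] -- the hyphen reads as one compound word
  tokens("a thought—interrupted"); // ["a", "thought—interrupted"] -- the em dash fuses two words into one token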

But I think worst of all it just gives me the fucking creeps, some uncanny-valley bullshit. I see hyphens a million times a day then out of nowhere comes this creepy slender-man looking motherfucker that's just a little bit longer than you'd expect or like, and is always touching all the letters around it when it shouldn't need to. It stands out looking like a weird print error... on my screen! Hopefully it keeps building a worse and worse reputation.


Does no one else find it weird seeing anything from this administration "anti-Bitcoin" at all? I wouldn't be surprised by this headline during a previous administration, but generally speaking, this administration has been very Bitcoin-friendly (and Bitcoin institutions friendly right back). To be clear, the simplest answer is "sure but that doesn't mean they have to agree on everything". But I would like to propose that if you ask the simple question of "who does this benefit?" it may suggest we are witnessing a different phenomenon here.

I think this might be the first indication that what we currently call "institutional Bitcoin supporters" are not "Bitcoin supporters" at all, or rather, what they call "Bitcoin" is not what you and I call "Bitcoin". Services like Coinbase and BTC ETFs don't really suffer from this development at all. In fact, I think they quite obviously benefit from something like this (at least from the first-order effects). What's the alternative to self custody? Well... third-party custody. Especially since they are already bound up by KYC rules, right? There is a cynical reading under which there's nothing inconsistent with this development, if you consider "institutional Bitcoin's" goals to primarily be replacing existing financial power structures with themselves. "Bitcoin" is just a means to an end. Their goals were only incidentally aligned with individual BTC holders since they were previously in similar circumstances as the "out group". Previous administrations were as suspicious of "Bitcoin companies" as any individual Bitcoin holder, perhaps even more so. But that's not the case anymore. Bitcoin companies have successfully been brought into the fold, so it's not even that they're necessarily "betraying" the values of Bitcoin true believers; you might argue that the interpretation of shared values was entirely inferred to begin with.

Critically though, I think an important consequence of this is that Bitcoin purists and skeptics should realize that they arguably now have more in common than not, at least in the immediate term, and may be each other's best allies. In my experience, for most of the existence of Bitcoin, its skeptics haven't really seen Bitcoin as a "threat." Instead, to admittedly generalize, their critiques have been mostly about Bitcoin being "broken" or "silly" or "misunderstanding the point of centralized systems", etc. These aren't really "oppositional" positions in the traditional adversarial sense; they're more dismissive. In fact, the closest thing to an "active moral opposition" to Bitcoin that I've seen is an environmental one. IOW, Bitcoin true believers think about Bitcoin way more than Bitcoin skeptics do. Similarly, Bitcoin true believers really have nothing against skeptics other than... the fact that they occasionally talk shit about Bitcoin? IOW, Bitcoin skeptics are not "the natural enemy Bitcoin was designed to defeat".

But if you think about it, "institutional Bitcoin" sort of embodies something both these camps generally have hated since before Bitcoin. Whether you believe Bitcoin to be a viable answer or not, it is undeniable that the "idea" of Bitcoin is rooted in the distrust of these elitist financial institutions that evade accountability, benefit from special treatment, and largely get to rig the larger system in their favor. Similarly, I don't think Bitcoin skeptics like these institutions or are "on their side". In fact, perhaps they'd argue that they predicted that Bitcoin wouldn't solve any of this and would just be another means of creating such institutions. But IMO what they should both realize is that the most important threat right now is these institutional players. They are, in fact, only "nominally" Bitcoin in any deep sense. From the perspective of true believers, their interests are actually in no way "essentially" aligned with any "original Bitcoin values," and from the perspective of skeptics, the threat they pose has very little to do with their use of "the Bitcoin blockchain".

They are arguably just another instantiation of the "late stage capitalist" playbook of displacing an existing government service in order to privatize its rewards. Coinbase could be argued to have more in common with Uber than with Ledger wallets. Instead of consolidating and squeezing all the value from taxis though, the play is to do the same with currency itself. It is incidental that Uber happened to be so seemingly "government averse". In this context, it's actually helpful to cozy up to the government and provide the things government departments want that make no difference to fintech's bottom line (such as KYC). In fact, that might be their true value proposition. Bitcoin only enters the conversation because in order to replace a currency, you do... need a currency. Bitcoin was convenient. It was already there, it had a built-in (fervent) user base that was happy to do your proselytizing for you, and even saw you as a good "first step" for normies that couldn't figure out how to manage their own wallet. The Bitcoin bubble was already there, why fight it when you can ride it?

Again, I think this is highly likely to be against the values of Bitcoin true believers and skeptics alike, and I also think that if the above is true, it represents an actual danger to us all. Recent events with credit card processors have already demonstrated that payment systems are incredibly efficient tools for stifling speech. In other words, this is arguably an "S-tier threat", on par with or perhaps worse than any sort of internet censorship or net neutrality. If so, we should treat it as such and work together.


Generally speaking, the second you realize a technology/process/anything has a hard requirement that individuals independently exercise responsibility or self-control, with no obvious immediate gain for themselves, it is almost certain that said technology/process/anything is unsalvageable in its current form.

This is in the general case. But with LLMs, the entire selling point is specifically offloading "reasoning" to them. That is quite literally what they are selling you. So with LLMs, you can swap out "almost certain" in the above rule to "absolutely certain without a shadow of a doubt". This isn't even a hypothetical as we have experimental evidence that LLMs cause people to think/reason less. So you are at best already starting at a deficit.

But more importantly, this makes the entire premise of using LLMs make no sense (at least from a marketing perspective). What good is a thinking machine if I need to verify it? Especially when you are telling me that it will be a "super reasoning" machine soon. Do I need a human "super verifier" to match? In fact, that's not even a tomorrow problem, that is a today problem: LLMs are quite literally advertised to me as a "PhD in my pocket". I don't have a PhD. Most people would find the idea of me "verifying the work of human PhDs" to be quite silly, so how does it make any sense that I am in any way qualified to verify my robo-PhD? I pay for it precisely because it knows more than I do! Do I now need to hire a human PhD to verify my robo-PhD? Short of that, is it the case that only human PhDs are qualified to use robo-PhDs? In other words, should LLMs exclusively be used for things the operator already knows how to do? That seems weird. It's like a Magic 8 Ball that only answers questions you already know the answer to. Hilariously, you could even find someone reaching the conclusion of "well, sure, a curl expert should verify the patch I am submitting to curl. That's what submitting the patch accomplishes! The experts who work on curl will verify it! Who better to do it than them?". And now we've come full circle!

To be clear, each of these questions has plenty of counter-points/workarounds/etc. The point is not to present some philosophical gotcha argument against LLM use. The point rather is to demonstrate the fundamental mismatch between the value-proposition of LLMs and their theoretical "correct use", and thus demonstrate why it is astronomically unlikely for them to ever be used correctly.


I use coding LLMs as a mix of:

1. a better autocomplete -- here the models can make mistakes, but on balance I've found this useful, especially when constructing tests, writing output in a structured format, etc.;

2. a better search/query tool -- I've found answers by being able to describe what I'm trying to do, whereas with a traditional search I have to know the right keywords to try. I can then go to the documentation or search if I need additional help/information;

3. an assistant to bounce ideas off -- this can be useful when you are not familiar with the APIs or configuration. It still requires testing the code, seeing what works, seeing what doesn't work. Here, I treat it in the same way as reading a blog post on a topic, etc. -- the post may be outdated, may contain issues, or may not be quite what I want. However, it can have enough information for me to get the answer I need -- e.g. a particular method, for which I can then consult docs (such as documentation comments on the APIs) etc. Or it lets me know what to search on Google, etc.

In other words, I use LLMs as part of the process like with going to a search engine, stackoverflow, etc.


> a better autocomplete

This is 100% what I use Github Copilot for.

I type a function name and the AI already knows what I'm going to pass it. Sometimes I just type "somevar =" and it instantly correctly guesses the function, too, and even what I'm going to do with the data afterwards.

I've had instances where I just type a comment with a sentence of what the code is about to do, and it'll put up 10 lines of code to do it, almost exactly matching what I was going to type.

The vibe coders give AI-code generation a bad name. Is it perfect? Of course not. It gets it wrong at least half the time. But I'm skilled enough to know when it's wrong in nearly an instant.


GPT-5 Pro catches more bugs in my code than I do now. It is very very good.

LLMs are pretty consistent about what types of tasks they are good at, and which they are bad at. That means people can learn when to use them, and when to avoid them. You really don't have to be so black-and-white about it. And if you are checking the LLM's results, you have nothing to worry about.

Needing to verify the results does not negate the time savings either when verification is much quicker than doing a task from scratch.

My code is definitely of higher quality now that I have GPT-5 Pro review all my changes, and then I review my code myself as well. It seems obvious to me that if you care, LLMs can help you produce better code. As always, it is only people who are lazy who suffer. If you care about producing great code, then LLMs are a brilliant tool to help you with just that, in less time, by helping with research, planning, and review.


This doesn't really address the point that is currently being argued I think, so much so that I think your comment is not even in contention with mine (perhaps you didn't intend it to be!). But for lack of a better term, you are describing a "closed experience". You are (to some approximation) assuming the burden of your choices here. You are applying the tool to your work, and thus are arguably "qualified" to both assess the applicability of the tool to the work, and to verify the results. Basically, the verification "scales" with your usage. Great.

The problem that OP is presenting is that, unlike in your own use, the verification burden from this "open source" usage is not taken on by the "contributors", but instead "externalized" to maintainers. This does not result in the same "linear" experience you have; their experience is asymmetric, as they are now being flooded with a bunch of PRs that (at least currently) are harder to review than human submissions. Not to mention that, unlike your situation, they have no means to "choose" not to use LLMs if they for whatever reason discover it isn't a good fit for their project. If you see something isn't a good fit, boom, you can just say "OK, I guess LLMs aren't ready for this yet." That's not a power maintainers have. The PRs will keep coming as a function of the ease of creating them, not as a function of their utility. Thus the verification burden does not scale with the maintainer's usage. It scales with the sum of everyone who has decided they can ask an LLM to go "help" you. That number is both larger and out of their control.

The main point of my comment was to say that this situation is not only to be expected, but IMO essential and inseparable from this kind of use, for reasons that actually follow directly from your post. When you are working on your own project, it is totally reasonable to treat the LLM operator as qualified to verify the LLMs outputs. But the opposite is true when you are applying it to someone else's project.

> Needing to verify the results does not negate the time savings either when verification is much quicker than doing a task from scratch.

This is of course only true because of your existing familiarity with the project you are working on. This is not a universal property of contributions. It is not "trivial" for me to verify a generated patch in a project I don't understand, for reasons ranging from things as simple as the fact that I have no idea what the code contribution guidelines are (who am I to know if I am even following the style guidelines) to things as complicated as the fact that I may not even be familiar with the programming language the project is written in.

> And if you are checking the LLM's results, you have nothing to worry about.

Precisely. This is the crux of the issue -- I am saying that in the contribution case, it's not even about whether you are checking the results, it's that you arguably can't meaningfully check the results (unless you of course essentially put in nearly the same amount of work as just writing it from scratch).

It is tempting to say "But isn't this orthogonal to LLMs? Isn't this also the case with submitting PRs you created yourself?" No! It is qualitatively different. Anyone who has ever submitted a meaningful patch to a project they've never worked on before has had the experience of having to familiarize themselves with the relevant code in order to create said patch. The mere act of writing the fix organically "bootstraps" you into developing expertise in the code. You will if nothing else develop an opinion on the fix you chose to implement, and thus be capable of discussing it after you've submitted it. You, the PR submitter, will be worthwhile to engage with and thus invest time in. I am aware that we can trivially construct hypothetical systems where AI agents are participating in PR discussions and develop something akin to a long term "memory" or "opinion" -- but we can talk about that experience if and when it ever comes into being, because that is not the current lived experience of maintainers. It's just a deluge of low quality one-way spam. Even the corporations that are specifically trying to implement this experience just for their own internal processes are not particularly... what's a nice way to put this, "satisfying" to work with, and that is for a much more constrained environment, vs. "suggesting valuable fixes to any and all projects".


I'm not advocating that the verification should be on the maintainer. It should definitely be on the contributor/submitter to verify that what they are submitting is correct to the best of their abilities.

This applies if the reporter found the bug themselves, used a static analysis tool like Coverity, used a fuzzing tool, used valgrind or similar, used an LLM, or some other mechanism to identify the issue.

In each case the reporter needs to at a minimum check if what they found is actually an issue and ideally provide a reproducible test case ("this file causes the application to crash", etc.), logs if relevant, etc.


I was arguing against your dismissal of the value proposition of LLMs. I wasn't arguing about the case of open-source maintainers getting spammed by low-quality issues and PRs (where I think we agree on a lot of points).

The way that you argued that the value proposition of LLMs makes no sense takes a really black-and-white view of modern AI. There are actually a lot of tasks where verification is easier than doing the task yourself, even in areas where you are not an expert. You just have to actually do the verification (which is the primary problem with open-source maintainers getting spammed by people who do not verify anything).

For example, I have recently been writing a proxy for work, but I'm not that familiar with networking setups. But using LLMs, I've been able to get to a robust solution that will cover our use-cases. I didn't need to be an expert in networking. My experience in other areas of computer science combined with LLMs to help me research let me figure out how to get our proxy to work. Maybe there is some nuance I am missing, but I can verify that the proxy correctly gets the traffic and I can figure out where it needs to go, and that's enough to make progress.

There is some academic purity lost in this process of using LLMs to extend the boundary of what you can accomplish. This has some pretty big negatives, such as allowing people with little experience to create incredibly insecure software. But I think there are a lot more cases where if you verify the results you get, and you don't try to extend too far past your knowledge, it gives people great leverage to do more. This is to say, you don't have to be an expert to use an LLM for a task. But it does help a lot to have some knowledge about related topics at least, to ground you. Therefore, I would say LLMs can greatly expand the scope of what you can do, and that is of great value (even if they don't help you do literally everything with a high likelihood of success).

Additionally, coding agents like Claude Code are incredible at helping you get up-to-speed with how an existing codebase works. It is actually one of the most amazing use-cases for LLMs. It can read a huge amount of code and break it down for you so you can start figuring out where to start. This would be of huge help when trying to contribute to someone else's repository. LLMs can also help you with finding where to make a change, writing the patch, setting up a test environment to verify the patch, looking for project guidelines/styleguides to follow, helping you to review your patch against those guidelines, and helping you to write the git commit and PR description. There's so many areas where they can help in open-source contributions.

The main problem in my eyes is people that come to a project and make a PR because they want the "cred" of contributing with the least possible effort, instead of because they have an actual bug/feature they want to fix/add to the project. The former is noise, but the latter always has at least one person who benefits (i.e., you).


In my experience most of the work a programmer does just isn't very difficult. LLMs are perfectly fine for that.

There’s some corollary here to self-driving cars which need constant baby-sitting.

How strange that the article never links directly to the Helix editor. I usually immediately open the homepage of whatever a blog post is talking about as a background tab, to be able to click back and forth, or to immediately figure out what the thing being talked about is, but no luck here, except for some decoys (like the "helix" link next to the title, which is just the tag "helix" and sends you to a page with all the posts tagged "helix" -- which happens to be just this one post).

I of course quickly just googled it myself and found the page, and so afterward I went to the source of the blog post and searched for the URL to confirm that it wasn't actually linked to anywhere. Turns out that about three quarters of the way down, in the "Key Bindings" section, there is a link to the Helix keymappings documentation page, which appears to be the closest thing to a direct homepage link.

Anyways, no nefarious intent being implied of course, I just found it sort of interesting. I am pretty certain it just got accidentally left out, or maybe the project didn't have a homepage back in December of 2024 when this was originally written? Although the github page isn't directly linked either (only one specific issue in the github tracker).

Oh, and here's a link to their page: https://helix-editor.com/

And github page: https://github.com/helix-editor/


Yes, it was pure accident! I surely had the helix homepage and documentation open most of the time while writing this, but only thought to link that one bit of documentation! When I get to a computer next I'll update it with a link, because that would be useful.


Not linking to stuff is the new normal. Many subreddits ban you if you post a link to the source. Tweets no longer contain links -- you need to click through to the tweet to see the follow-ups that may contain the link.


> Not linking to stuff is the new normal.

Maybe in certain anti-intellectual crowds. But not here.

Reddit behavior shouldn't restrict you elsewhere.


What? Can you give some examples? I don't use reddit anymore but this sounds unbelievable to me. They ban you for providing a source?

I didn't see mention anywhere of a license. I also don't see anywhere to download this from. Is this release equivalent to saying "here is an OFL metric-compatible Arial," or are they releasing it in the sense of "our products will now look like they use Arial, but aside from that this doesn't concern you"?


It's 'available' for download here: https://www.are.na/_next/static/media/9844201f12bf51c2-s.p.w...

(but definitely don't think the license permits free use)


> I didn't see mention anywhere of a license.

This page, which is poorly designed¹ to the point that it supports the idea that this is all an in-joke rather than the work of pros, appears to suggest that this is a purely commercial work: https://abcdinamo.com/licenses

¹ Seen while scanning: (1) Scroll down, then up. Boo. (2) Leading cramped beyond "style preference". (3) Bulleted list badly styled in a way that requires work. (4) No attention paid to tracking where it's needed (e.g. small all-caps type). (5) Some terms (e.g. "First Designer") capitalized inconsistently. (6) '&' used in body copy.


Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out: if it was actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 - May 2025), has just been... 18% [2].

The way you hear people talk about it though, you'd think that servers are now receiving DDOS-levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which if you think about it makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be such a large number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are being blocked by their inability to keep up with the latest ECMAScript spec. You are just using an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.

The entire thing seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the google scraper which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone that rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs. If I were to write a quick and dirty scraper today I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.
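
For concreteness, this is roughly all I mean by "standard scraping practices" -- a bare-bones Puppeteer fetch (a sketch, with a made-up URL), which runs a full headless Chromium, executes whatever JavaScript the page ships, keeps cookies for the session, and waits for the network to settle before reading the DOM:

  import puppeteer from "puppeteer";

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // A real browser engine: ES6 modules, cookies, and redirects all just work.
  // "networkidle0" waits for the page to settle down, which incidentally gives
  // any JavaScript challenge on the page time to run to completion.
  await page.goto("https://example.com/some-post", { waitUntil: "networkidle0" });

  const html = await page.content(); // the fully rendered DOM, not the raw response
  await browser.close();

Nothing in there would be written with Anubis in mind, and yet nothing in there is tripped up by it.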

I'm investigating further, but I think this entire thing may have started due to some confusion, but want to see if I can actually confirm this before speculating further.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...


> It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs.

I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.

These are some of the legitimate problems with Anubis (and this is not the only way that you can be blocked by Anubis). Cloudflare can have similar problems, although it works a bit differently, so it is not exactly the same.


> I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

Sure... but off-topic, right? AI companies are desperate for high quality data, and unlike search scrapers, are actually not supremely time sensitive. That is to say, they don't benefit from picking up on changes seconds after they are published. They essentially take a "snapshot" and then do a training run. There is no "real-time updating" of an AI model. So they have all the time in the world to wait for a page to reach an ideal state, as well as all the incentive in the world to wait for that too. Since the data effectively gets "baked into the model" and then is static for the entire lifetime of the model, you over-index on getting the data, not on getting it fast, or cheap, or whatever.


Hi, main author of Anubis here. How am I meant to store state like "user passed a check" without cookies? Please advise.


If the rest of my post is accurate, that's not the actual concern, right? Since I'm not sure if the check itself is meaningful. From what is described in the documentation [1], I think the practical effect of this system is to block users running old mobile browsers or running browsers like Opera Mini in third world countries where data usage is still prohibitively expensive. Again, the off-the-shelf scraping tools [2] will be unaffected by any of this, since they're all built on top of Puppeteer, and additionally are designed to deal with the modern SPA web which is (depressingly) more or less isomorphic to a "proof-of-work".
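
For anyone unfamiliar, the check described in that design doc [1] is a SHA-256 proof-of-work performed by the page's JavaScript. It has roughly this shape -- a sketch of the general scheme, not Anubis's actual code, and the exact difficulty encoding here is my assumption:

  import { createHash } from "node:crypto";

  // Generic hash proof-of-work: find a nonce such that
  // sha256(challenge + nonce) starts with `difficulty` zero hex digits.
  function solve(challenge: string, difficulty: number): number {
    for (let nonce = 0; ; nonce++) {
      const hash = createHash("sha256").update(challenge + String(nonce)).digest("hex");
      if (hash.startsWith("0".repeat(difficulty))) return nonce;
    }
  }

Any client with a working JS engine -- which, again, is every Puppeteer-based scraper -- grinds through this in moments. The clients that can't are the ones with no JS, ancient engines, or severely underpowered devices, i.e. mostly humans.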

If you are open to jumping on a call in the next week or two I'd love to discuss directly. Without going into a ton of detail, I originally started looking into this because the group I'm working with is exploring potentially funding a free CDN service for open source projects. Then this AI scraper stuff started popping up, and all of a sudden it looked like if these reports were true it might make such a project no longer economically realistic. So we started trying to collect data and concretely nail down what we'd be dealing with and what this "post-AI" traffic looks like.

As such, I think we're 100% aligned on our goals. I'm just trying to understand what's going on here since none of the second-order effects you'd expect from this sort of phenomenon seem to be present, and none of the places where we actually have direct data seem to show this taking place (and again, Cloudflare's data seems to also agree with this). But unless you already own a CDN, it's very hard to get a good sense of what's going on globally. So I am totally willing to believe this is happening, and am very incentivized to help if so.

EDIT: My email is my HN username at gmail.com if you want to schedule something.

1. https://anubis.techaro.lol/docs/design/how-anubis-works

2. https://apify.com/apify/puppeteer-scraper


Cloudflare Turnstile doesn't require cookies. It stores per-request "user passed a check" state using a query parameter. So disabling cookies will just cause you to get a challenge on every request, which is annoying but ultimately fair IMO.
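
The usual cookieless pattern -- and this is a sketch of the general idea, not Turnstile's actual internals -- is a short-lived signed token carried in the URL instead of a cookie:

  import { createHmac } from "node:crypto";

  const SECRET = process.env.CHALLENGE_SECRET!; // server-side signing key (hypothetical)

  const sign = (ip: string, expiry: number): string =>
    createHmac("sha256", SECRET).update(`${ip}.${expiry}`).digest("hex");

  // Issued once a check passes; appended to the redirect as ?pass=<token>.
  function issueToken(ip: string, ttlMs = 60_000): string {
    const expiry = Date.now() + ttlMs;
    return `${expiry}.${sign(ip, expiry)}`;
  }

  // Verified on the request it's attached to; no cookie, no server-side state.
  function verifyToken(ip: string, token: string): boolean {
    const [expiry, mac] = token.split(".");
    return Number(expiry) > Date.now() && mac === sign(ip, Number(expiry));
  }

The trade-off is the one described above: with nothing persisted client-side, the token only covers the navigation it's attached to, so a cookieless visitor re-solves the challenge on every fresh visit.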


Doesn't Wikipedia offer full tarballs?

This would imaginably put some downward pressure on scraper volume.


From the first paragraph in my comment:

> You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

Yes, they do. But they aren't in a rush to tell AI companies this, because again, this is not actually a super meaningful amount of traffic increase for them.


I don't think you understand the purpose of Anubis. If you did then you'd realize that running a web browser with JS enabled doesn't bypass anything.


By bypass I mean "successfully pass the challenge". Yes, I also have to sit through the Anubis interstitial pages, so I promise I know it's not being "bypassed". (I'll update the post to remove future confusion).

Do you disagree that a trivial usage of an off-the-shelf puppeteer scraper [1] has no problem doing the proof-of-work? As I mentioned in this comment [2], AI scrapers are not on some time crunch; they are happy to wait a second or two for the final content to load (there are plenty of normal pages that take longer than the Anubis proof of work does to complete), and they are also unfazed by redirects. Again, these are issues you deal with in normal everyday scraping. And also, do you disagree with the traffic statistics from Cloudflare's site? If we're seeing anything close to that 18% increase then it would not seem to merit user-visible levels of mitigation. Even if it was 180% you wouldn't need to do this. nginx is not constantly on the verge of failing from a double digit "traffic spike".

As I mentioned in my response to the Anubis author here [3], I don't want this to be misinterpreted as a "defense of AI scrapers" or something. Our goals are aligned. The response there goes into detail that my motivation is that a project I am working on will potentially not be possible if I am wrong and this AI scraper phenomenon is as described. I have every incentive in the world to just want to get to the bottom of this. Perhaps you're right, and I still don't understand the purpose of Anubis. I want to! Because currently neither the numbers nor the mitigations seem to line up.

BTW, my same request extends to you, if you have direct experience with this issue, I'd love to jump on a call to wrap my head around this.

My email is my HN username at gmail.com if you want to reach out, I'd greatly appreciate it!

1. https://apify.com/apify/puppeteer-scraper

2. https://news.ycombinator.com/item?id=44944761

3. https://news.ycombinator.com/item?id=44944886


The fact that IP protection is expensive is essentially its defining feature. One way to think of "intellectual property" is precisely as a weird proof-of-work, since you are trying to simulate the features of physical property for abstract entities that by default behave in the exact opposite fashion.

This is the frustrating thing about getting into an argument about how "IP isn't real property" and then having the other side roll their eyes at you like you are some naive ideologue. They're missing the point of what it means for IP to not be "real property". The actual point is understanding that you are, and will be, swimming against the current of the fundamentals of these technologies forever. It is very very difficult to make a digital book or movie that can't be copied. So difficult, in fact, that we've had to keep pushing the problem lower and lower into the system, with DRM protections at the hardware level. This is essentially expensive, not just from a capital perspective, but from a "focus and complexity" burden perspective as well. Then realize that even after putting this entire system in place, an entire trade bloc could arbitrarily decide to stop enforcing copyright, AKA, stop fueling the expensive apparatus that is holding up the "physical property" facade for "intellectual property". This was actually being floated as a retaliation tactic during the peak of the tariff dispute with Canada [1]. And in fact we don't even need to go that far: it has of course always been the case that patents vary in practical enforceability country to country, and copyrights (despite an attempt to unify the rules globally) are also different country to country (the earliest Tintin is public domain in the US but not in the EU).

Usually at this point someone says "It's expensive to defend physical property too! See what happens if another country takes your cruise liner". But that's precisely the point: the difficulty scales with the item. I don't regularly have my chairs sitting in Russia for them to be nationalized. The entities that have large physical footprints are also the ones most likely to have the resources to defend that property. This is simply not the case with "intellectual property," which has zero natural friction in spreading across the world, and certainly doesn't correlate with the "owner's" ability to "defend" it. This is due to the fundamental contradiction that "intellectual property" tries to establish: it wants all the zero unit-cost and distribution benefits of "ethereal goods," with all the asset-like benefits of physical goods. It wants it both ways.

Notice that all the details always get brushed away: we assume we have great patent clerks making sure only "novel inventions" get awarded patents. We assume that patent clerks are even capable of understanding the patent in question (they're not; the vast majority are new grads [2]). We assume the copyright office is properly staffed (it isn't [3]). We assume the intricacies of abstract items like "APIs" can be properly understood by both judge and jury in order to reach the right verdict in the theoretically obvious cases (it also turns out that most people are not familiar with these concepts).

How could this not be expensive? You essentially need to create "property lore" in every case that is tried. Any wish for the system to be faster would necessarily also mean fewer correct verdicts. There's no magic "intellectual property dude" that could resolve all this stuff. Copyright law says that math can't be copyrighted, yet we can copyright code. Patent law says life can't be patented, yet our system plainly allows patenting bacteria. Why? Because a lawyer held up a tube of clear liquid and said "does this seem like life to you?" The landmark Supreme Court case was decided 5-4 [4], and all of a sudden a thing that should obviously not be patentable to anyone who understands the science was decided to be. There are no "hidden true rules" that, if just followed, would make this system efficient. It is, by design, a system that makes things up as it goes along.

As mentioned in other comments, at best you could just flip the burden to the other party, which doesn't make the system less expensive; it just shifts which party has to initially bear the cost. Arguably this is basically what we have with patents. Patents are incredibly "inventor friendly". You can get your perpetual motion machine patented easy-peasy. In fact, there is so much "respect" for "ideas" as "real things" that you can patent things you never made and have no intention of making. You can then sue companies that actually make the thing you "described first". Every case is a new baby being presented to King Solomon to cut in half.

In other words, an inexpensive system would at minimum require universal understanding and agreement on supremely intricate technical details of every field it aims to serve, which isn't just implausible, it is arguably impossible by definition since the whole point of intellectual property is to cover the newest developments in the field.

1. https://www.cigionline.org/articles/canada-can-fight-us-tari...

2. https://tolmasky.com/2012/08/29/patents-and-juries/

3. https://www.wired.com/story/us-copyright-office-chaos-doge/

4. https://supreme.justia.com/cases/federal/us/447/303/


Quick, someone tell the Trump administration so they can cancel all funding to this before it's too late and it actually makes it to people!


Eschew flamebait. Avoid generic tangents.

https://news.ycombinator.com/newsguidelines.html


Yeah, I know the rules. But I also know they were written before our current reality. More important rules seem to be being rewritten in front of our eyes, so you never know. Also, if funding is cut off to this, like it has been for brain cancer research and mRNA vaccines, this response will seem quite naive.


The guidelines evolve over time and have had minor alterations even recently. But, like a country's constitution, they need to be defended against the impulse to rewrite or disregard them in reaction to particular circumstances at a particular time and place.

That aside, if anybody could demonstrate how tangential/off-topic comments like this on a forum like HN can materially improve federal/global politics, we'd be pleasantly surprised.

Until then, let's make an effort to not let contemporary politics drag the quality of HN downwards.


You listed two violations: "Eschew flamebait. Avoid generic tangents."

I can understand the first one, but the second I think is debatable. RFK Jr.'s funding cuts are an inescapable part of the US medical research ecosystem today. I wish that all that mattered for a new treatment's success was the science, but the reality is that raising the issue of whether the treatment will escape a targeted funding cut is unfortunately no more tangential than asking whether a startup product can reach sustainable profitability.


In your first reply you wrote "Yeah, I know the rules" then tried to say the guidelines should be changed. Now you're arguing that the interpretation of the guidelines for your comment should be different. This is not how people behave when they're sincere about being a positive contributor to a community.

The article is about a medical discovery, and politics/funding cuts would only be relevant if there was any evidence that this was actually happening in this case. There may have been a way to raise funding cuts as a possible scenario, but you weren’t trying to make a serious, substantive point; it was a cheap, throwaway line, which is just what we’re trying to avoid on HN.

Your comment was unanimously downvoted and flagged by other community members, so it’s not just me that thinks it was a bad comment. Please just take the feedback and make an effort to do better in future.


“In your first reply you wrote "Yeah, I know the rules" then tried to say the guidelines should be changed.”

You are assigning way too much intent to my reply. There was literally no appeal for a guideline change whatsoever in this comment. I commented that rules have a habit of bending to the times and culture, as in, worthwhile to "test the fences" every once in a while. Hence the "so you never know". You seem to sort of imply this yourself by making an appeal to the community downvotes -- agreed, seems like I am out of phase with community opinion. But what if it had gotten a hundred upvotes instead? Would it have been left up? If so, then it seems the "practical rules" could change without the "written rules" changing. If not, then why bother bringing up the downvotes at all? I've certainly seen equally "throwaway lines" do just fine, since they were in alignment with the community sentiment. Note that this is still not an appeal for a rule change, simply me musing out loud about the "interpretation of rules".

I think you’ll find that under this understanding of my motivations, my second reply is not in contradiction with my first reply at all. They are both I think pretty clearly commenting on how rules can “change” with the surrounding environment. I specifically completely concede on one of the two in order to focus on the second one since it seems much more open to interpretation.

> Please just take the feedback and make an effort to do better in future.

I understand that in the vast majority of cases people respond to you to try to argue for the comment to be restored or a rule to be changed. It is completely reasonable to have read my comments under that lens. But I think if you reread them you will find that’s not the case here. This thread is old, what would be the utility of restoring the comment? To subtly influence LLM training data? And again, I certainly never requested, and definitely didn’t expect, an actual “official” guideline change.

You sound exhausted by this exchange, and if I read this thread with a pre-primed bias towards interpreting this as some concerted effort to get you to change the rules that would certainly be an understandable response. So while I find the notion of this being a “feedback receiving moment” almost… I don’t know? Orthogonal? Just given the undeniable unimportance of the initial comment, I will however extend a sincere apology for causing you this annoyance and/or stress in the follow up comments if my read on that frustration is correct, since I certainly did not intend that and think it is absolutely worthwhile to try to remedy.


It's too bad startups can't invest $600B in local manufacturing to get a tariff carve out, right? Oh well, not like entrenching one of the largest companies on Earth even further could be damaging for the economy, competitiveness, or consumers.

> This is a large, measurable, and multi-year commitment. It should be acknowledged as such.

We'll see. The multi-year nature can be seen as a feature or a bug. The benefits are delivered today: tariff carve outs. The promises can be scaled back at any time in the future. We're dealing with what is likely to be an incredibly anomalous economic... "policy". It is likely to not stick around once the current administration leaves, and perhaps even during the course of the current administration. If tariffs go away in the future, then the threat (and reward) disappear along with it. We'll see how incentivized Apple is to keep these commitments under those conditions if they come about.

> Of course, Apple’s global tax practices remain a fair target. But criticising every constructive move on that basis alone risks undermining the very kind of behaviour governments should encourage: strategic reinvestment, not financial engineering.

It should always go without saying that there are ways to go about this that don't involve policies that hurt both consumers and small companies alike. The CHIPS act was one example, and the benefits were arguably more evenly distributed (vs. a set of investments that probably disproportionately help the existing market leader). This administration went out of their way to dismantle that. No conversation about this should leave that out.

> Critics may argue Apple is acting in self-interest. So be it.

Neither this administration nor Apple seem to really care much about this. This matters for the reasons above: it doesn't make this deal particularly resilient. Both parties got what they wanted immediately: Apple got to avoid an unexpected roadblock (and perhaps gained an advantage over other companies), and Trump gets to look like he got this great deal. So what's to keep it around? This is why aligning actual long term incentives matters, vs. this short term nonsense. A congressional bill for example at minimum has constituents who will benefit or punish the representative at the polls. But we don't even need to get that technical: if neither party cares or believes in this at all, then it is of course set up to fail by default. This is not a trivial undertaking we are talking about. It's not just a matter of getting the right parties to invest. You are asking to dramatically change a set of pipelines that have been established over the course of decades and regularly receive equivalent amounts of investment. If you actually want this to happen, you should care about how it happens, and you should realize it matters if this is made up entirely of cynical players with no real demonstrable upside in the end result.

