Using HTTP meta-headers is actually something we seem to have forgotten how to do.
The one that annoys me most is the accept-language header which is almost entirely ignored in favour of GeoIP lookups to figure out regionality... which I find super odd; as if people are walking around using a browser in a language they don't speak. (or, an operating system configured for a language they don't speak).
ETags, though, are a bit fraught: if you're a company, a security scan will fire whenever an ETag is detected, because you might be able to figure out the file's inode on the filesystem from it... which, I don't know why that's a security problem either way[0], but false positives are common[1]... which makes people not respect the header.
Last-Modified should work, though. I love the idea of checking headers and not content.
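For what it's worth, the client side of this is tiny. A sketch of building a revalidation request from a previously cached response (the cache dict shape is made up for illustration; the header names are the standard ones):

```python
def conditional_headers(cached_response_headers: dict) -> dict:
    """Build validators for the next request from the headers of a
    previously cached response (dict shape is illustrative)."""
    headers = {}
    if "ETag" in cached_response_headers:
        headers["If-None-Match"] = cached_response_headers["ETag"]
    if "Last-Modified" in cached_response_headers:
        headers["If-Modified-Since"] = cached_response_headers["Last-Modified"]
    return headers

# A server that honours these answers 304 Not Modified with an empty
# body when nothing changed, and the client re-serves its cached copy.
print(conditional_headers({"ETag": '"abc123"'}))
```

When both validators are present, If-None-Match takes precedence, so a client can safely send the pair.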
I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near unlimited computing power to just avoid thinking about consequences.
To add to the language problem: when I travelled to Europe, some websites (like YouTube) switched to the regional language of wherever I was, despite me being logged in and Google knowing full well which languages I speak. Even the ads changed language, as if advertising in a language I don't speak will help anyone.
almost all of my spam is in french, which is an assumption on the part of the spammers based on the email username. almost all my gmail is spam, because i have directed most real email elsewhere. therefore, almost all the mail i receive at gmail is in french. this has led to google blocking things (like voter registration confirmation!) that are in english because they're "not in your normal language."
I'm Canadian and I switched from an ISP headquartered in Ontario to one headquartered in Quebec (TekSavvy to Bell, because of a bulk agreement my building got) and now half my youtube ads are in french, despite me living in Ontario.
Don't care either way, but it does make you think...
I don't know offhand whether "target audience's spoken language" is one of the signals an advertiser can key into in targeting an ad (at a glance, it looks like it might be). But (a) advertisers don't always have that signal and (b) advertisers themselves aren't always savvy enough to set it (how many American advertisers targeting Iowa actually tag their ads as "in English?"), so you'll end up with region targeting as a proxy for language targeting.
In your case, it's probably that the ad engine doesn't have enough info on you so it's falling back to geotargeting and hoping for the best (are you running with JavaScript disabled? Clearing cookies frequently? Avoiding logins? If so, these are all things known to decrease ad signal quality).
That's the point. Youtube/Google knows who I am and I'm logged in - my preferences are all set to english and my searches, videos watched, etc are all 99% english, with the rest being with english subtitles.. I'm not talking about banner ads on random websites.
Good point. Another hypothesis is that individual advertisers aren't using that data.
Back in the day, I had a front-row-seat to this process and I observed how often advertisers simply misconfigure a campaign and under-target it. If you don't set a targeting preference for a given indicator, the default can be to target everyone regardless of what that indicator says about them.
It might be the case that advertisers are saying they want to target you anyway (or failing to say one way or the other) even though they should have enough signal to know it's a wasted impression.
Yeah, that makes sense as well. The french ads come to me in waves and I can see lazy advertisers just "targeting quebec" or "all french regions" for the french runs.
IIUC, accept-language is mostly ignored because the tooling to configure it on the user agent is really poor for most user agents. So users log into a site, they get the site in the wrong language, and because only the site is visible they blame the site, not their UA.
It's the "Your site's broken if IE won't load it" problem.
Can someone attest that this is actually the issue?
FWIW Outlook does accept the "Accept-Language" header and I don't think anyone is saying that outlook is wrong for doing that or claiming it to be broken?
Are you totally sure that this isn't a backwards myth?
I think the most likely situation is that locale information for English speaking countries would be incorrect if the default (en_US) was used to install the operating system, which happens on occasion.
I couldn't speak to Outlook, but Outlook is both popular enough and has enough people locked-in to its ecosystem (as business users) that it doesn't have to worry about losing users if Accept-Language doesn't do the right thing. Companies have IT departments to fix that stuff.
I'm talking more like https://www.buerklin.com/. If that site comes up in the wrong language and the only way to fix it is to change the user agent's Accept-Language header, the user isn't going to just figure it out; they're going to navigate elsewhere. So the site has a button in the top-left to toggle English or German.
Something you mentioned up-thread that I should have commented on but overlooked:
> which I find super odd; as if people are walking around using a browser in a language they don't speak
... yes, all the time. In libraries and Internet cafes, schools, and other shared spaces.
But even then, that's no worse: GeoIP would already be forcing those users into a language they potentially don't speak by ignoring what their device says.
Or your library has somehow misconfigured their PCs when setting them up?
A cookie-based override already exists; forcing GeoIP is strictly worse as a default, except for localising currency, I guess.
Growing up in Belgium I feel your pain about GeoIPs and accept-language.
I lived in Flanders, with my accept-language set to en-US, en.
Ads would pop up in Dutch, Flemish, French and sometimes German. When you think about it, from a brick-and-mortar point of view, it makes sense. I'm more likely to buy <physical product advertised> at the <local chain grocery store> vs buying it anywhere in the USA, based on my IP.
Next to that, imagine browsing Reuters.com with a Berlin IP and accept-language set to en-US, en.
What SHOULD they show you? Local news auto-translated into English? Local news in German? Or a redirect to the US page?
Locality is different from language. In your example, it would have to show you the local German news, as that's local to you, and it would have to show it to you in the first supported language in your accept-language header.
Personally I would prefer, for example, Reuters.com to be a "hub", and all the regional variants on de.reuters.com. Then just let the user choose what they want.
Even when ETags have nothing to do with the filesystem they can still be a security vector. Some APIs use ETags to identify what has changed since the last time you called a particular API. This means the ETag values are probably stored in a database, which means the API server needs to protect against SQL injection in the request headers.
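Header values deserve the same treatment as any other user-controlled input. A minimal sketch with an invented schema, using placeholder binding instead of string concatenation:

```python
import sqlite3

# In-memory stand-in for whatever store the API uses; the schema
# and names here are made up for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE resources (etag TEXT, changed INTEGER)")
db.executemany("INSERT INTO resources VALUES (?, ?)",
               [('"v1"', 0), ('"v2"', 1)])

def changed_since(client_etag: str) -> bool:
    """Look up a client-supplied If-None-Match value safely: the
    placeholder keeps a hostile header from being parsed as SQL."""
    row = db.execute(
        "SELECT changed FROM resources WHERE etag = ?", (client_etag,)
    ).fetchone()
    # Unknown ETags (including injection attempts) are just a miss,
    # which we treat as "changed" so the client refetches.
    return row is None or bool(row[0])
```

The same rule holds for any header that ends up near a query: bind it, never splice it.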
>as if people are walking around using a browser in a language they don't speak. (or, an operating system configured for a language they don't speak
Well, yes, they are! Computers translated into my native language sound dumb. That's how a whole generation of my world learned better English than native speakers, ffs!
Half of the time it's just translated wrong. You think anyone has any incentive to translate any technology to a language with a couple million speakers, all of whom are obligate pirates?
And it seems like you might be surprised to hear that people speak more than one language. Then where's my global setting to tell the browser what languages I speak, so it'd know what header to send? Same place that lets me configure what ads I'm actually interested in. Nowhere.
>I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near unlimited computing power to just avoid thinking about consequences.
This, friend, is what computers are for in the XXI century. "Bicycle for the mind", ha...
Just so you're aware, the way in which you've chosen to conduct this conversation is not in-keeping with the desired environment.
I'm not entirely sure why you are coming across as deeply emotional about this topic, but it's really not worth getting angry over.
However, you made a specific point:
> And it seems like you might be surprised to hear that people speak more than one language.
And I was mentioning that, actually, the header has support for multiple languages, and the point is to fall back to one the site actually supports; so if a site is translated into Spanish but not Catalan, a person living in Barcelona might send ["ca-ES", "es-ES"] and actually be served es-ES Spanish despite requesting the former as a preference.
Also, there are mechanisms for changing this locale away from the operating system's choice, but I would wager a sane default is to use the localisation of the operating system, as that is largely going to be localised for the person already; more so than browser fingerprinting(?) or GeoIP lookups, as computers move just like humans do throughout our world.
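To illustrate the fallback behaviour, the negotiation the header enables is roughly this (a simplified sketch that ignores wildcards and prefix matching; the tags are hypothetical):

```python
def negotiate(accept_language: str, supported: list) -> str:
    """Pick the best supported language for a header value like
    'ca-ES, es-ES;q=0.9'. Simplified: no '*', no prefix matching,
    and the matched tag is returned lowercased."""
    prefs = []
    for part in accept_language.split(","):
        tag, _, q = part.strip().partition(";q=")
        prefs.append((float(q) if q else 1.0, tag.lower()))
    # Walk preferences from highest quality value downwards.
    for _, tag in sorted(prefs, reverse=True):
        if tag in (s.lower() for s in supported):
            return tag
    return supported[0]  # site default when nothing matches

print(negotiate("ca-ES, es-ES;q=0.9", ["es-ES", "en-US"]))  # es-es
```

So the Barcelona user gets Spanish rather than an error, which is the whole point of listing preferences in order.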
> I would wager a sane default is to use the localisation of the operating system, as that is largely going to be localised for the person already
I already explained why in many parts of the world this would not be the good wager you think it is.
>moreso than browser fingerprinting(?)
I'm not saying browser fingerprinting is a good way to determine what language to serve to the user.
I'm saying that setting the headers to non-default values (especially ones that represent actual facts about the user, such as which languages they can be expected to understand) can be used for fingerprinting, and that's probably the most sensible reason to avoid making use of a feature that would otherwise have been, as you say, benign and quite convenient.
>I'm not entirely sure why you are coming across as deeply emotional about this topic, but it's really not worth getting angry over.
I'm not getting angry! I am having fun fun FUN! Your culture requires me to be having fun fun FUN in order to not be gradually destroyed from the inside! Or from the outside!
That's an interesting point. However, given that this is a language preference users are quite likely to manually select the correct language if the incorrect one loads up. At which point you have shared that information anyway.
I guess there's an argument to be made against blasting that information out to every last third party though. Perhaps it should only ever be sent to the target that appears in the address bar.
You really don't need to have Accept-Language overlap with language of browser. I'm sorry but reading-comprehension department seems out today. I suggest you try reading without assistance.
Also, would you please tone it down and stop being so confrontational and aggressive in your comments?
As someone who lives between 3 languages, what I'd really like is a browser setting for language per-site. E.g. I want my Swedish bank's site in Swedish, not the English translation; I want Google Maps in Japanese so I can see the kanji for the station names; but I want the AWS console in English. Each of these sites has its own toggle, but they are very inconsistent and keep resetting; the browser would have done a much better job.
I wonder if this can be done with a browser extension.
Yes, and this is the actual issue: user agent configuration for preferred language is poor, users blame sites when they can't read the site, so it's in the site's best interest to ignore the broken thing and use a heuristic.
Seems like a reasonable case for disregarding the client preference. If you're able to speak TLS then you're able to load up a public domain (de)compression library.
I always appreciate Rachel's writings. I don't know much about her, but my takeaway is that she has worked at some of the hardest sysadmin jobs in the past few decades and writes to her experience super well.
Especially as the cost to serve this content approaches zero.
I find the take in the blog to be relatively hostile. It's a "technically correct" rant. Not wrong, but mostly missing the point, and being a bit of a dick in the process.
Sure - block the readers that make a request every 10 seconds. It's perfectly reasonable to block clients if they hit a limit like 20 to 50 requests in a day.
It's damn hostile to block for 24 hours after a single request. If the 10MB of traffic for 20 requests is going to break the bank... maybe don't host an atom or RSS feed at all?
---
That said - weirdos can weird on their own sites as they like. It's not a public service.
But I bucket this into the same category of weird as posting a whole bunch of threatening "no trespassing", "beware of dog", "homeowner is armed", "Solicitors not welcome", etc style signs all over their property.
Like - point out on the doll where the rss client hurt you. Because something's up.
Rachel makes an excellent point here about feed change frequency.
Seems like it'd be straightforward to build into most readers a backoff strategy based on how frequently the feed content changes. For a regular, periodic fetch, if the content has proven it doesn't update often, just lengthen the period for that endpoint.
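A sketch of that idea, deriving the next poll period from the gaps between a feed's recent posts (the clamp constants are arbitrary choices, not from the comment):

```python
from datetime import datetime, timedelta

def interval_from_history(post_times: list,
                          floor: timedelta = timedelta(hours=1),
                          ceiling: timedelta = timedelta(days=7)) -> timedelta:
    """Poll at roughly half the average gap between recent posts,
    clamped to sane bounds (constants are illustrative)."""
    if len(post_times) < 2:
        return floor  # not enough history: poll conservatively
    times = sorted(post_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    average = sum(gaps, timedelta()) / len(gaps)
    return max(floor, min(ceiling, average / 2))

# A feed that posts daily ends up polled about twice a day.
print(interval_from_history(
    [datetime(2025, 1, 1), datetime(2025, 1, 2), datetime(2025, 1, 3)]))
```

Halving the average gap keeps latency for a new post bounded by roughly half the feed's typical cadence without hammering quiet feeds.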
If-Modified-Since and ETag are nice and everyone should implement them, but IME the implementation status is much better on the reader side than on the feed side. Trim your (main) feed to only recent posts and use Atom's pagination to link to the rest for new subscribers, and the difference in data transferred becomes much smaller.
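For reference, that pagination mechanism is RFC 5005 (Feed Paging and Archiving): the current feed keeps only recent entries and points backwards at archive documents, roughly like this (the URLs are placeholders):

```xml
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link rel="self" href="https://example.com/feed.atom"/>
  <!-- New subscribers can walk the archives once via prev-archive,
       then never need anything but the short current feed. -->
  <link rel="prev-archive"
        href="https://example.com/feed/archive-2024.atom"/>
  <!-- ...only the most recent entries appear below... -->
</feed>
```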
> Besides that, a well-behaved feed will have the same content as what you will get on the actual web site. The HTML might be slightly different to account for any number of failings in stupid feed readers in order to save the people using those programs from themselves, but the actual content should be the same. Given that, there's an important thing to take away from this: there is no reason to request every single $(&^$(&^@#* post that's mentioned in the feed.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
Unfortunately there are too many feeds that don't include the full content for this to work. And a reader won't know if the feed has the full content before fetching the HTML page. This can also change from post to post so it can't just determine this when subscribing.
> Then there are the user-agents who lie about who they are or where they are coming from because they think it's going to get them special treatment somehow.
These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.
> Sending referrers which make no sense is just bad manners.
HTTP Referer should not exist. And has been abused by spammers for ages.
> These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.
That's a niche. It's about 1 million percent more likely a fake request is coming from an overzealous AI scraper nowadays. I have blocked hundreds of them and I'm on the verge of giving up and handing over money to Cloudflare just for their AI scraping protection.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
People probably do this because some sites only give you a preview in the feed, to force you to go to the site and view the ads.
So if you want the full post in the feed reader, you need to pull the post as well.
This. My feed reader pulls a "reader" view so I don't have to leave the app. I normally wouldn't mind going to the website, except that to do so would mean waiting for it to fully load, dealing with javascript popups, and often bad scrolljacking.
This adds a nice publish-subscribe model to RSS. Ping the WebSub server when there are changes; subscribing services are easily notified; nobody has to worry about excessive polling. Hooray.
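The publisher half of that is just a form-encoded POST to the hub when content changes (the hub and topic URLs below are placeholders):

```python
from urllib.parse import urlencode

# When the feed changes, the publisher notifies the hub; the hub then
# fans the update out to its subscribers. Both URLs are placeholders.
HUB = "https://hub.example.com/"
TOPIC = "https://example.com/feed.atom"

body = urlencode({"hub.mode": "publish", "hub.url": TOPIC})

# POSTed as application/x-www-form-urlencoded, e.g.:
#   import urllib.request
#   urllib.request.urlopen(HUB, data=body.encode())
print(body)
```

Subscribers then receive the new content pushed to their callback URLs, so nobody polls at all.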
My strategy is to have an exponential backoff. I start a new feed set to query every 30 minutes, and if there's no new post, I double the period. If there is a new post, I halve it. My reader goes through every feed every half hour, and randomizes which feeds it checks: a 1/4 chance for a 2-hour feed, a 1/48 chance for a 24-hour feed, etc.
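That scheduler might be sketched like this (the daily cap and the clamps are my assumptions, not from the comment):

```python
import random

BASE = 30  # minutes between scheduler wake-ups

def due(period_minutes: int, rng=random.random) -> bool:
    """On each half-hour pass, poll a feed with probability
    BASE/period: a 2-hour feed is hit on ~1/4 of passes, a
    24-hour feed on ~1/48."""
    return rng() < BASE / period_minutes

def adjust(period_minutes: int, new_post: bool) -> int:
    """Halve the period when a check finds a new post, double it
    when it doesn't; clamped so even a dormant feed still gets a
    daily check (the cap is an assumption)."""
    nxt = period_minutes // 2 if new_post else period_minutes * 2
    return max(BASE, min(nxt, 24 * 60))
```

The random check has the same expected fetch rate as tracking exact per-feed timers, but needs only one integer of state per feed.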
Several people have mentioned exponential back off. What upper limit would you suggest? Someone might not post for 6 months, then resume posting daily. You might miss those posts for months.
I’m 100% sure there are many badly written inefficient crawlers that are wasting server resources and resources where they run but I use feed readers a lot and it is very hard to find well maintained feeds. Many servers also use cache related headers incorrectly or don’t use them at all.
This is a good lesson on being a good citizen of the Internet.
It's easy to just curl a feed every second, but should you? (Of course not)
Take it as a challenge to make your reader as fancy as possible, use every trick in the book to optimise how it fetches content. Analyse the patterns of releasing new content per feed and adjust the fetch frequency based on that.
And if you're building a reader for distribution, don't let the user set a refresh interval that doesn't make sense.
Thinking about it, it's not an easy problem to define because it's got a tradeoff. Getting the content quickly (easy: poll in a loop) vs not using server resources (easy: never poll). We have to define "better" before we can decide which solution is better.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
In some cases the reader should fetch both the feed and the pages. Unfortunately, none do.
[0]: https://www.pentestpartners.com/security-blog/vulnerabilitie...
[1]: https://github.com/sullo/nikto/issues/469