Using HTTP meta-headers is actually something we seem to have forgotten how to do.
The one that annoys me most is the accept-language header which is almost entirely ignored in favour of GeoIP lookups to figure out regionality... which I find super odd; as if people are walking around using a browser in a language they don't speak. (or, an operating system configured for a language they don't speak).
ETags, though, are a bit fraught: if you're a company, a security scan will fire whenever an ETag is detected, because you might be able to figure out the file's inode on the filesystem from it... which, I don't know why that's a security problem either way[0], but false positives are common[1]... which makes people not respect the header.
Last-Modified should work, though. I love the idea of checking headers and not content.
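For what it's worth, the client side of this is tiny. A sketch of building a revalidation request from a previously cached response (the cache dict shape is made up for illustration; the header names are the standard ones):

```python
def conditional_headers(cached_response_headers: dict) -> dict:
    """Build validators for the next request from the headers of a
    previously cached response (dict shape is illustrative)."""
    headers = {}
    if "ETag" in cached_response_headers:
        headers["If-None-Match"] = cached_response_headers["ETag"]
    if "Last-Modified" in cached_response_headers:
        headers["If-Modified-Since"] = cached_response_headers["Last-Modified"]
    return headers

# A server that honours these answers 304 Not Modified with an empty
# body when nothing changed, and the client re-serves its cached copy.
print(conditional_headers({"ETag": '"abc123"'}))
```

When both validators are present, If-None-Match takes precedence, so a client can safely send the pair.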
I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near unlimited computing power to just avoid thinking about consequences.
To add to the language problem: when I travelled to Europe, some websites (like YouTube) switched to the regional language of wherever I was, despite me being logged in and Google knowing full well which languages I speak. Even the ads changed language, as if advertising in a language I don't speak will help anyone.
almost all of my spam is in french, which is an assumption on the part of the spammers based on the email username. almost all my gmail is spam, because i have directed most real email elsewhere. therefore, almost all the mail i receive at gmail is in french. this has led to google blocking things (like voter registration confirmation!) that are in english because they're "not in your normal language."
I'm Canadian and I switched from an ISP headquartered in Ontario to one headquartered in Quebec (TekSavvy to Bell, because of a bulk agreement my building got) and now half my youtube ads are in french, despite me living in Ontario.
Don't care either way, but it does make you think...
I don't know offhand whether "target audience's spoken language" is one of the signals an advertiser can key into in targeting an ad (at a glance, it looks like it might be). But (a) advertisers don't always have that signal and (b) advertisers themselves aren't always savvy enough to set it (how many American advertisers targeting Iowa actually tag their ads as "in English?"), so you'll end up with region targeting as a proxy for language targeting.
In your case, it's probably that the ad engine doesn't have enough info on you so it's falling back to geotargeting and hoping for the best (are you running with JavaScript disabled? Clearing cookies frequently? Avoiding logins? If so, these are all things known to decrease ad signal quality).
That's the point. Youtube/Google knows who I am and I'm logged in - my preferences are all set to english and my searches, videos watched, etc are all 99% english, with the rest being with english subtitles.. I'm not talking about banner ads on random websites.
Good point. Another hypothesis is that individual advertisers aren't using that data.
Back in the day, I had a front-row-seat to this process and I observed how often advertisers simply misconfigure a campaign and under-target it. If you don't set a targeting preference for a given indicator, the default can be to target everyone regardless of what that indicator says about them.
It might be the case that advertisers are saying they want to target you anyway (or failing to say one way or the other) even though they should have enough signal to know it's a wasted impression.
Yeah, that makes sense as well. The french ads come to me in waves and I can see lazy advertisers just "targeting quebec" or "all french regions" for the french runs.
IIUC, accept-language is mostly ignored because the tooling to configure it on the user agent is really poor for most user agents. So users log into a site, they get the site in the wrong language, and because only the site is visible they blame the site, not their UA.
It's the "Your site's broken if IE won't load it" problem.
Can someone attest that this is actually the issue?
FWIW Outlook does accept the "Accept-Language" header and I don't think anyone is saying that outlook is wrong for doing that or claiming it to be broken?
Are you totally sure that this isn't a backwards myth?
I think the most likely situation is that locale information for English speaking countries would be incorrect if the default (en_US) was used to install the operating system, which happens on occasion.
I couldn't speak to Outlook, but Outlook is both popular enough and has enough people locked-in to its ecosystem (as business users) that it doesn't have to worry about losing users if Accept-Language doesn't do the right thing. Companies have IT departments to fix that stuff.
I'm talking more like https://www.buerklin.com/. If that site comes up in the wrong language and the only way to fix it is to change the user agent's Accept-Language header, the user isn't going to just figure it out; they're going to navigate elsewhere. So the site has a button in the top-left to toggle English or German.
Something you mentioned up-thread that I should have commented on but overlooked:
> which I find super odd; as if people are walking around using a browser in a language they don't speak
... yes, all the time. In libraries and Internet cafes, schools, and other shared spaces.
But even then, that's no worse: GeoIP would already be forcing those users into a language they potentially don't speak by ignoring what their device says.
Or your library has somehow misconfigured their PCs when setting them up?
A cookie-based override already exists; forcing GeoIP is strictly worse as a default, except for localising currency, I guess.
Growing up in Belgium I feel your pain about GeoIPs and accept-language.
I lived in Flanders, with my accept-language set to en-US, en.
Ads would pop up in Dutch, Flemish, French and sometimes German. When you think about it, from a brick-and-mortar point of view, it makes sense. I'm more likely to buy <physical product advertised> at the <local chain grocery store> vs buying it anywhere in the USA, based on my IP.
Next to that, imagine browsing Reuters.com with a Berlin IP and accept-language set to en-US, en.
What SHOULD they show you? Local news auto-translated into English? Local news in German? Or a redirect to the US page?
Locality is different from language. In your example, it would have to show you the local German news, as that's local to you, and it would have to show it to you in the first supported language in your accept-language header.
Personally I would prefer, for example, Reuters.com to be a "hub", and all the regional variants on de.reuters.com. Then just let the user choose what they want.
Even when ETags have nothing to do with the filesystem they can still be a security vector. Some APIs use ETags to identify what has changed since the last time you called a particular API. This means the ETag values are probably stored in a database, which means the API server needs to protect against SQL injection in the request headers.
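Header values deserve the same treatment as any other user-controlled input. A minimal sketch with an invented schema, using placeholder binding instead of string concatenation:

```python
import sqlite3

# In-memory stand-in for whatever store the API uses; the schema
# and names here are made up for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE resources (etag TEXT, changed INTEGER)")
db.executemany("INSERT INTO resources VALUES (?, ?)",
               [('"v1"', 0), ('"v2"', 1)])

def changed_since(client_etag: str) -> bool:
    """Look up a client-supplied If-None-Match value safely: the
    placeholder keeps a hostile header from being parsed as SQL."""
    row = db.execute(
        "SELECT changed FROM resources WHERE etag = ?", (client_etag,)
    ).fetchone()
    # Unknown ETags (including injection attempts) are just a miss,
    # which we treat as "changed" so the client refetches.
    return row is None or bool(row[0])
```

The same rule holds for any header that ends up near a query: bind it, never splice it.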
>as if people are walking around using a browser in a language they don't speak. (or, an operating system configured for a language they don't speak
Well, yes, they are! Computers translated into my native language sound dumb. That's how a whole generation of my world learned better English than native speakers, ffs!
Half of the time it's just translated wrong. You think anyone has any incentive to translate any technology to a language with a couple million speakers, all of whom are obligate pirates?
And it seems like you might be surprised to hear that people speak more than one language. Then where's my global setting to tell the browser what languages I speak, so it'd know what header to send? Same place that lets me configure what ads I'm actually interested in. Nowhere.
>I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near unlimited computing power to just avoid thinking about consequences.
This, friend, is what computers are for in the XXI century. "Bicycle for the mind", ha...
Just so you're aware, the way in which you've chosen to conduct this conversation is not in-keeping with the desired environment.
I'm not entirely sure why you are coming across as deeply emotional about this topic, but it's really not worth getting angry over.
However, you made a specific point:
> And it seems like you might be surprised to hear that people speak more than one language.
And I was mentioning that, actually, the header has support for multiple languages, and the point is to fall back to one the site actually supports; so if a site is translated into Spanish but not Catalan, a person living in Barcelona might send ["ca-ES", "es-ES"] and actually be served es-ES Spanish despite requesting the former as a preference.
Also, there are mechanisms for changing this locale away from the operating system's choice, but I would wager a sane default is to use the localisation of the operating system, as that is largely going to be localised for the person already; more so than browser fingerprinting(?) or GeoIP lookups, as computers move just like humans do throughout our world.
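To illustrate the fallback behaviour, the negotiation the header enables is roughly this (a simplified sketch that ignores wildcards and prefix matching; the tags are hypothetical):

```python
def negotiate(accept_language: str, supported: list) -> str:
    """Pick the best supported language for a header value like
    'ca-ES, es-ES;q=0.9'. Simplified: no '*', no prefix matching,
    and the matched tag is returned lowercased."""
    prefs = []
    for part in accept_language.split(","):
        tag, _, q = part.strip().partition(";q=")
        prefs.append((float(q) if q else 1.0, tag.lower()))
    # Walk preferences from highest quality value downwards.
    for _, tag in sorted(prefs, reverse=True):
        if tag in (s.lower() for s in supported):
            return tag
    return supported[0]  # site default when nothing matches

print(negotiate("ca-ES, es-ES;q=0.9", ["es-ES", "en-US"]))  # es-es
```

So the Barcelona user gets Spanish rather than an error, which is the whole point of listing preferences in order.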
> I would wager a sane default is to use the localisation of the operating system, as that is largely going to be localised for the person already
I already explained why in many parts of the world this would not be the good wager you think it is.
>moreso than browser fingerprinting(?)
I'm not saying browser fingerprinting is a good way to determine what language to serve to the user.
I'm saying that setting the headers to non-default values (especially ones that represent actual facts about the user, such as which languages they can be expected to understand) can be used for fingerprinting, and that's probably the most sensible reason to avoid making use of a feature that would otherwise have been, as you say, benign and quite convenient.
>I'm not entirely sure why you are coming across as deeply emotional about this topic, but it's really not worth getting angry over.
I'm not getting angry! I am having fun fun FUN! Your culture requires me to be having fun fun FUN in order to not be gradually destroyed from the inside! Or from the outside!
That's an interesting point. However, given that this is a language preference users are quite likely to manually select the correct language if the incorrect one loads up. At which point you have shared that information anyway.
I guess there's an argument to be made against blasting that information out to every last third party though. Perhaps it should only ever be sent to the target that appears in the address bar.
You really don't need to have Accept-Language overlap with language of browser. I'm sorry but reading-comprehension department seems out today. I suggest you try reading without assistance.
Also, would you please tone it down and stop being so confrontational and aggressive in your comments?
As someone who lives between 3 languages, what I'd really like is a browser setting for language per-site. E.g. I want my Swedish bank's site in Swedish, not the English translation; I want Google Maps in Japanese so I can see the kanji for the station names; but I want the AWS console in English. Each of these sites has its own toggle, but they are very inconsistent and keep resetting; the browser would have done a much better job.
I wonder if this can be done with a browser extension.
Yes, and this is the actual issue: user agent configuration for preferred language is poor, users blame sites when they can't read the site, so it's in the site's best interest to ignore the broken thing and use a heuristic.
Seems like a reasonable case for disregarding the client preference. If you're able to speak TLS then you're able to load up a public domain (de)compression library.
I always appreciate Rachel's writings. I don't know much about her, but my takeaway is that she has worked at some of the hardest sysadmin jobs in the past few decades and writes to her experience super well.
Especially as the cost to serve this content approaches zero.
I find the take in the blog to be relatively hostile. It's a "technically correct" rant. Not wrong, but mostly missing the point, and being a bit of a dick in the process.
Sure - block the readers that make a request every 10 seconds. It's perfectly reasonable to block clients if they hit a limit like 20 to 50 requests in a day.
It's damn hostile to block for 24 hours after a single request. If the 10MB of traffic for 20 requests is going to break the bank... maybe don't host an atom or RSS feed at all?
---
That said - weirdos can weird on their own sites as they like. It's not a public service.
But I bucket this into the same category of weird as posting a whole bunch of threatening "no trespassing", "beware of dog", "homeowner is armed", "Solicitors not welcome", etc style signs all over their property.
Like - point out on the doll where the rss client hurt you. Because something's up.
Rachel makes an excellent point here about feed change frequency.
Seems like it'd be straightforward to build into most readers a backoff strategy based on how frequently the feed content changes. For a regular, periodic fetch, if the content has proven it doesn't update often, just lengthen the period for that endpoint.
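A sketch of that idea, deriving the next poll period from the gaps between a feed's recent posts (the clamp constants are arbitrary choices, not from the comment):

```python
from datetime import datetime, timedelta

def interval_from_history(post_times: list,
                          floor: timedelta = timedelta(hours=1),
                          ceiling: timedelta = timedelta(days=7)) -> timedelta:
    """Poll at roughly half the average gap between recent posts,
    clamped to sane bounds (constants are illustrative)."""
    if len(post_times) < 2:
        return floor  # not enough history: poll conservatively
    times = sorted(post_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    average = sum(gaps, timedelta()) / len(gaps)
    return max(floor, min(ceiling, average / 2))

# A feed that posts daily ends up polled about twice a day.
print(interval_from_history(
    [datetime(2025, 1, 1), datetime(2025, 1, 2), datetime(2025, 1, 3)]))
```

Halving the average gap keeps latency for a new post bounded by roughly half the feed's typical cadence without hammering quiet feeds.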
If-Modified-Since and ETag are nice and everyone should implement them, but IME the implementation status is much better on the reader side than on the feed side. Trim your (main) feed to only recent posts and use Atom's pagination to link to the rest for new subscribers, and the difference in data transferred becomes much smaller.
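For reference, that pagination mechanism is RFC 5005 (Feed Paging and Archiving): the current feed keeps only recent entries and points backwards at archive documents, roughly like this (the URLs are placeholders):

```xml
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link rel="self" href="https://example.com/feed.atom"/>
  <!-- New subscribers can walk the archives once via prev-archive,
       then never need anything but the short current feed. -->
  <link rel="prev-archive"
        href="https://example.com/feed/archive-2024.atom"/>
  <!-- ...only the most recent entries appear below... -->
</feed>
```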
> Besides that, a well-behaved feed will have the same content as what you will get on the actual web site. The HTML might be slightly different to account for any number of failings in stupid feed readers in order to save the people using those programs from themselves, but the actual content should be the same. Given that, there's an important thing to take away from this: there is no reason to request every single $(&^$(&^@#* post that's mentioned in the feed.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
Unfortunately there are too many feeds that don't include the full content for this to work. And a reader won't know if the feed has the full content before fetching the HTML page. This can also change from post to post so it can't just determine this when subscribing.
> Then there are the user-agents who lie about who they are or where they are coming from because they think it's going to get them special treatment somehow.
These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.
> Sending referrers which make no sense is just bad manners.
HTTP Referer should not exist. And has been abused by spammers for ages.
> These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.
That's a niche. It's about 1 million percent more likely a fake request is coming from an overzealous AI scraper nowadays. I have blocked hundreds of them and I'm on the verge of giving up and handing over money to Cloudflare just for their AI scraping protection.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
People probably do this because some sites only give you a preview in the feed, to force you to go to the site and view the ads.
So if you want the full post in the feed reader, you need to pull the post as well.
This. My feed reader pulls a "reader" view so I don't have to leave the app. I normally wouldn't mind going to the website, except that to do so would mean waiting for it to fully load, dealing with javascript popups, and often bad scrolljacking.
This adds a nice publish-subscribe model to RSS. Ping the WebSub server when there are changes; subscribing services are easily notified; nobody has to worry about excessive polling. Hooray.
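The publisher half of that is just a form-encoded POST to the hub when content changes (the hub and topic URLs below are placeholders):

```python
from urllib.parse import urlencode

# When the feed changes, the publisher notifies the hub; the hub then
# fans the update out to its subscribers. Both URLs are placeholders.
HUB = "https://hub.example.com/"
TOPIC = "https://example.com/feed.atom"

body = urlencode({"hub.mode": "publish", "hub.url": TOPIC})

# POSTed as application/x-www-form-urlencoded, e.g.:
#   import urllib.request
#   urllib.request.urlopen(HUB, data=body.encode())
print(body)
```

Subscribers then receive the new content pushed to their callback URLs, so nobody polls at all.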
My strategy is to have an exponential backoff. I start a new feed set to query every 30 minutes, and if there's no new post, I double the period. If there is a new post, I halve it. My reader goes through every feed every half hour, and randomizes which feeds it checks: a 1/4 chance for a 2-hour feed, a 1/48 chance for a 24-hour feed, etc.
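That scheduler might be sketched like this (the daily cap and the clamps are my assumptions, not from the comment):

```python
import random

BASE = 30  # minutes between scheduler wake-ups

def due(period_minutes: int, rng=random.random) -> bool:
    """On each half-hour pass, poll a feed with probability
    BASE/period: a 2-hour feed is hit on ~1/4 of passes, a
    24-hour feed on ~1/48."""
    return rng() < BASE / period_minutes

def adjust(period_minutes: int, new_post: bool) -> int:
    """Halve the period when a check finds a new post, double it
    when it doesn't; clamped so even a dormant feed still gets a
    daily check (the cap is an assumption)."""
    nxt = period_minutes // 2 if new_post else period_minutes * 2
    return max(BASE, min(nxt, 24 * 60))
```

The random check has the same expected fetch rate as tracking exact per-feed timers, but needs only one integer of state per feed.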
Several people have mentioned exponential back off. What upper limit would you suggest? Someone might not post for 6 months, then resume posting daily. You might miss those posts for months.
I’m 100% sure there are many badly written inefficient crawlers that are wasting server resources and resources where they run but I use feed readers a lot and it is very hard to find well maintained feeds. Many servers also use cache related headers incorrectly or don’t use them at all.
This is a good lesson on being a good citizen of the Internet.
It's easy to just curl a feed every second, but should you? (Of course not)
Take it as a challenge to make your reader as fancy as possible, use every trick in the book to optimise how it fetches content. Analyse the patterns of releasing new content per feed and adjust the fetch frequency based on that.
And if you're building a reader for distribution, don't let the user set a refresh interval that doesn't make sense.
Thinking about it, it's not an easy problem to define because it's got a tradeoff. Getting the content quickly (easy: poll in a loop) vs not using server resources (easy: never poll). We have to define "better" before we can decide which solution is better.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
In some cases the reader should fetch both the feed and the pages. Unfortunately, none do.
[0]: https://www.pentestpartners.com/security-blog/vulnerabilitie...
[1]: https://github.com/sullo/nikto/issues/469