
This website is well-meaning, but will be difficult to parse if you don't have elementary Arabic and can't tell apart ال from ل ا.

What it really needs is a simple reference input string, examples of how it gets broken, and what to do to fix them. The middle case, where the sentence is correctly rendered RTL but the individual words are LTR (breaking the ligatures), is particularly common and insidious because it looks plausible to non-Arabic speakers.




I actually thought that part was perfectly clear.

> In general, the letter combination ال should be common.

So if your text has more than a few words, you should be able to look through your text and see that somewhere.

I can't read Arabic but I can recognize that pattern. I went to https://www.bbc.com/arabic and could find numerous occurrences.

It's a bit like saying "if you have a paragraph in 'English' and it has no e's in it, it's probably not English."
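
If you hold the text as a string rather than as pixels, the same idea can be roughed out in code. Here is a minimal Python sketch, assuming the failure is that the stored codepoints themselves were reversed (a rendering-only bug will not show up in the string at all). The names `al_prefix_ratio`, `lam_alef_suffix_ratio`, and `SAMPLE` are made up for illustration, and this is a heuristic, not a validator.

```python
import re

ALEF = "\u0627"  # ا
LAM = "\u0644"   # ل

# Runs of basic Arabic letters (U+0621..U+064A); diacritics and digits are ignored.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

# Illustrative sample in logical (typing) order: "الولد في البيت"
SAMPLE = "\u0627\u0644\u0648\u0644\u062f \u0641\u064a \u0627\u0644\u0628\u064a\u062a"


def al_prefix_ratio(text: str) -> float:
    """Fraction of Arabic words that start with alef + lam in logical order."""
    words = ARABIC_WORD.findall(text)
    if not words:
        return 0.0
    return sum(w.startswith(ALEF + LAM) for w in words) / len(words)


def lam_alef_suffix_ratio(text: str) -> float:
    """Fraction of Arabic words that end with lam + alef in logical order."""
    words = ARABIC_WORD.findall(text)
    if not words:
        return 0.0
    return sum(w.endswith(LAM + ALEF) for w in words) / len(words)


if __name__ == "__main__":
    for label, text in (("as typed", SAMPLE), ("reversed", SAMPLE[::-1])):
        print(label, al_prefix_ratio(text), lam_alef_suffix_ratio(text))
```

In this sample, two of the three words start with ال as typed. Once the codepoints are reversed, none do, and the same two words end in lam + alef instead. A low first number together with a high second number is the "probably" signal, nothing more.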


> It's a bit like saying "if you have a paragraph in 'English' and it has no e's in it, it's probably not English."

This is true. You should also be able to see the word "the" in several places in English text, and if it is rendered as "ehT" or something like that, the ligature code may have a bug -- similar with reversing the ال (read as "al") pattern.

> I can't read Arabic but I can recognize that pattern. I went to https://www.bbc.com/arabic and could find numerous occurrences.

Really? I only ever learned to read just enough Arabic to read parts of the Qur'an and am by no means fluent, but I couldn't see any mistakes on that website myself.


The comment was that they found numerous cases of ال, not ل ا.


> In general, the letter combination ال should be common.

It was a really smart way of saying you don't need to know anything at all to pattern match on this and spot a very common problem.

Also, if you haven't checked out the YouTube video on that page, I highly recommend it. It gives a great concise summary of the issue, and it's impressive how much I was able to learn about visually parsing Arabic script only a couple of minutes into the video.

[1] - https://youtu.be/X1ynZm1wI18?t=65


A group of flying kookaburras flap about a body of liquid. In this group, individuals distinguish from distinct sorts. Small, big, colorful or plain, all flap wings as a way to maintain afloat. It's a sight to watch as this group zips, zooms, and turns all around. It's a natural habitat for this flying squad, and you can find it in many spots around our world. This group adds to our natural world's charm and, through its distinct traits, brings a lot of joy to many.


Ah yes, the good old "your heuristic doesn't work on my carefully constructed pathological case" observation, wouldn't be the Internet without that.

GP also said "probably". That's how heuristics work.


You would almost think what you described was our profession :-)


Wait, you get paid for Internet snark?


> your heuristic doesn't work on my carefully constructed pathological case

This is a bit like writing unit tests, debugging code, and so on.


It's a heuristic (claims to be probably helpful). You might as well complain that checking string size first as part of an equality check of strings that demonstrably have a very large distribution of sizes is obviously dumb because obviously you can construct same-length strings at will. Breaking rules of thumb is the easiest thing in the world.
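
For concreteness, here is a minimal Python sketch of that length-first check. `strings_equal` is a hypothetical name, and this illustrates the rule of thumb rather than how any particular runtime implements string comparison.

```python
def strings_equal(a: str, b: str) -> bool:
    """Equality check that rejects on length before looking at any character."""
    # Cheap rejection: strings of different lengths can never be equal.
    if len(a) != len(b):
        return False
    # Same length: the heuristic did not help, so compare character by character.
    return all(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    print(strings_equal("kookaburra", "kookaburras"))  # rejected by the length check
    print(strings_equal("kookaburra", "kookaburro"))   # same length, full comparison needed
```

Same-length inputs defeat the shortcut every time, and it is still worth having when lengths vary widely.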


Great example, because if that text is on your corporate website then someone upstream is not doing their job.

Edit: obviously it is English but it's definitely not correct for a company website.


Hey, you now want corporate website standard English with no Es? Talk about scope creep!


But it may be totally correct for a website about kookaburras.

There's more than just corporate websites, and frankly, if a company of any meaningful size offers content in Arabic, I'd expect them to hire someone for that. Even part-time or freelance.


> But it may be totally correct for a website about kookaburras.

It isn't; "to maintain afloat" is not grammatical.

You could replace that with "stay afloat" and it would fix the grammatical error without introducing an E.

There's a similar unforced error in referring to a "group" of kookaburras rather than a "flock".

"Individuals distinguish from distinct sorts" is gibberish. I cannot tell what it's supposed to mean.

"All flap wings as a way to [stay] afloat" is, at best, very awkward; fluent English would require "flap their wings", but that would introduce an E.

"Flapping about a body of liquid" is a very odd thing to say unless the body of liquid happens to be suspended in midair, since midair is the only location where you can find birds flapping.


Now, find anothEr 20% of commEnts hErE that follow the samE rulE as yours.


The 'e' is in your username. :)


> It's a bit like saying "if you have a paragraph in 'English' and it has no e's in it, it's probably not English."

For the exception, see https://en.wikipedia.org/wiki/Gadsby_(novel)


Wow, having read the example prose, it is quite an interesting sounding book.


https://en.wikipedia.org/wiki/Ella_Minnow_Pea is a riot of a read. Starts with a full complement of letters and drops one letter chapter by chapter.


The article even links to a site where you can practice: https://notarabic.com/

Thanks to this training, I can now identify that numerous Arabic strings on that site are backwards. He wasn't joking, IJ is everywhere.


> ال from ل ا.

Who can’t tell these two character sequences from each other? Genuine question.

I think the main point the author is making is, if you are including Arabic script somewhere, take a bit of time to either do it right or hire someone to do it right for you.


I believe the difference is simply that the two sequences are in reverse order of each other.


Not only that, but the reverse of ال is لا, because Arabic letters change shape based on their position in the word (initial, medial, final). So ل ا is never going to be a word, since it lacks joining: the ل would look like لـ if joined, and لا is a distinct ligature.
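
You can see the positional forms and the ligature in Unicode's legacy presentation-forms block. Here is a small Python sketch using only the standard `unicodedata` module. The variable names are mine, and normal text should store the plain letters and leave shaping to the renderer; the presentation codepoints are shown only to make the forms visible.

```python
import unicodedata

LAM = "\u0644"   # ل, the letter as normally stored
ALEF = "\u0627"  # ا

# Legacy presentation forms of lam: isolated, final, initial, medial.
LAM_FORMS = [chr(cp) for cp in (0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0)]

# Lam followed by alef is drawn as one ligature glyph, not two joined letters.
LAM_ALEF_LIGATURE = "\uFEFB"

if __name__ == "__main__":
    for ch in LAM_FORMS + [LAM_ALEF_LIGATURE]:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Compatibility normalization folds the ligature back to the two stored letters.
DECOMPOSED = unicodedata.normalize("NFKC", LAM_ALEF_LIGATURE)
```

`DECOMPOSED` comes out as lam followed by alef in logical order, which is why a correct renderer draws that pair as a single shape while a broken one shows two disconnected letters.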


> What it really needs is a simple reference input string, examples of how it gets broken, and what to do to fix them

I don't think it does. Just hire someone.

I guess really, the point Rami tries to make is that not a single person who reads Arabic was involved in the video game/advertising/website/etc. The errors are often so basic that a child could point them out.

It would be like if someone wrote English without any spaces between words. It's so painfully obviously wrong.


It's one thing for a small personal project to get this stuff wrong, and there's somewhat of a baseline you can try yourself, but nobody is really going to care that much. It's like those funny pictures of restaurants in China with an English name of TRANSLATION SERVER ERROR.

What the site is about and where hiring someone makes sense is anything "big budget", especially if your target market includes Arabic-speaking or Arabic-adjacent countries.


> a simple reference input string, examples of how it gets broken, and what to do to fix them.

I really like this idea. Just have a standard set of strings covering all edge cases (even the sprawling labyrinth that is bidi) with a visual reference that shows what the correct rendering of each string would look like. Each entry would also have a description of the problem and suggested solutions.

Unlike the solutions in OP, this one is pragmatic and is actually actionable for the vast majority of developers. I'm kinda surprised that something like this doesn't already exist given the substantial amount of material and visual examples already available that covers the bidi algorithm.

- https://www.w3.org/International/articles/inline-bidi-markup...
- https://www.w3.org/International/articles/inline-bidi-markup...
- https://www.w3.org/International/articles/inline-bidi-markup...


I'm not sure I understand the idea. It sounds insane to me because I feel like there's probably trillions of combinations (and it would be insane to expect to be able to cover every specific example of incorrect text), and I thought the website was pretty clear and provided good examples.


As I understand it, the idea is just to make an Arabic test suite with enough examples (maybe a few dozen to a few hundred) such that if your program correctly renders all those examples, it’ll probably work fine with most Arabic text found in the wild. It sounds like there’s a lot of very broken software out there. Testing any Arabic input would be a big improvement for a lot of software.
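
A very rough Python sketch of what the string-level half of such a suite could look like. Everything here is hypothetical: the case names, the three reference strings, and the `failures` helper are invented for illustration, and it only catches code that mangles the stored text. Checking the actual rendering would still need visual references such as screenshots.

```python
from typing import Callable, List

# Hypothetical reference strings, stored in logical (typing) order.
REFERENCE_CASES = {
    "definite_article": "\u0627\u0644\u0628\u064a\u062a",  # البيت
    "lam_alef": "\u0644\u0627",                            # لا
    "mixed_with_latin": "\u0645\u0644\u0641 readme.txt",   # ملف readme.txt
}


def failures(process: Callable[[str], str]) -> List[str]:
    """Names of reference cases that `process` does not return unchanged.

    `process` stands in for whatever pipeline is under test (save and reload,
    templating, sanitizing) that is supposed to leave text untouched.
    """
    return [name for name, text in REFERENCE_CASES.items() if process(text) != text]


if __name__ == "__main__":
    print(failures(lambda s: s))        # a pass-through pipeline
    print(failures(lambda s: s[::-1]))  # a pipeline that reverses the codepoints
```

A real suite would want many more cases (numbers, punctuation, nested bidi), but even a handful like this would flag the reversed-string bug.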


> and can't tell apart ال from ل ا.

Who can't tell these apart? I know literally no Arabic - these characters look very much like latin ones I and J, and it's just an order thing.

It seems like an excellent quick test to me to see if there's ordering problems.


Exactly if you aren't comfortable with the idea that glyphs are distinct and order can be assumed to matter.

This article can't possibly be scoped for someone like that.

There's a side note here, that dyslexia can become apparent in very different written languages.[0]

In which case, don't try to handle multilingual text yourself; pay someone else, even if that's on Fiverr.

[0] https://blogs.scientificamerican.com/observations/its-all-ch...


> Exactly if you aren't comfortable with the idea that glyphs are distinct and order can be assumed to matter.

But isn't that exactly what it's just telling you? The order is important and if you see this very simple pattern it's wrong.

If you read any latin character based language then you surely must be OK with the idea that glyphs can be distinct? Are there many people who exclusively read languages where the order of symbols is not important?

> There's a side note here, that dyslexia can become apparent in very different written languages.[0]

If the point is that dyslexia means some people can't see the difference then that's fair, I'd not come at it from that angle. I don't see any surprising pre-assumed knowledge in this IJ/JI distinction however.


Yea we are in agreement.


Sorry, I'd not read the usernames and was trying to tie your response to the original comment. Makes a lot more sense now.


I don't think the intent is to fix every issue, but rather to tell you if you are using busted translation tools. The thing about left-to-right/right-to-left is something that a non-native Arabic speaker may not even know to look out for, even though it is a core aspect of the way the language is written.


The author was more making the point that one order occurs commonly and the other should never occur, since it would require connecting the letters. The stick character does not connect to the left, but will on the right if that letter connects to the left.


I think the author is aiming for "You're making a prop for a film with Arabic text, or making a multilingual sign, are you doing it right?" rather than "You're writing a text editor, are you doing it right?"


If you can’t tell the difference between a stylized lJ and Jl you probably didn’t end up on that page to begin with.

I don’t think that is an unreasonable assumption about the reader.


I think that's the point though? It's trying to educate us that we should pay attention to that difference as we scan the block of text.


> This website is well-meaning, but will be difficult to parse if you don't have elementary Arabic and can't tell apart ال from ل ا.

That's a really weird comment. Just Ctrl+F ل ا in your supposedly Arabic text?


I agree. It doesn't take long to get to the author's thoughts on the most common cause: lack of subject matter expertise. That's cold comfort for those who are working against a tight deadline, though.



