Writing UTF-8 Programs in Plan 9

tialaramex · on Feb 27, 2022

The "An aside on whitespace" section is bafflingly wrong.

It notices that for some reason the Go code here just lists all the characters it considers to be whitespace (rather than asking its Unicode database if they're White_Space, perhaps as a speed-up) but then it forgets how UTF-8 works and considers U+0085 and U+00A0 to be a single UTF-8 byte because their Rune is less than 256.
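
To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder in standard C (not Plan 9 C, and not the article's code). The name `utf8_encode` is made up for illustration. It shows that any code point from U+0080 up takes at least two bytes, so U+0085 and U+00A0 cannot be matched as single bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode scalar value as UTF-8.
 * buf must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not a scalar value.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	assert(buf != NULL);

	if (cp < 0x80) {			/* ASCII: one byte */
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {			/* U+0080..U+07FF: two bytes */
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {			/* U+0800..U+FFFF: three bytes */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;		/* surrogates are not scalar values */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {			/* U+10000..U+10FFFF: four bytes */
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;				/* beyond the Unicode range */
}
```

With this sketch, U+0085 comes out as the bytes C2 85 and U+00A0 as C2 A0, while U+0020 stays a single byte.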

Beyond that though almost everything is confused about what's going on because Plan9's documentation keeps talking about "characters" which might actually mean a Unicode code point, or a Unicode scalar value, or that Plan9 doesn't know the difference because it is old, and is unlikely to be a "character" in any sense you understand. The problem is that even Unicode's scalar values aren't necessarily as this article claims, "individual" or "legible".

This just isn't how human writing systems work, if you don't like it don't blame Unicode, or Plan9, blame all your ancestors back to whoever first figured out tally marks.

lifthrasiir · on Feb 27, 2022

It is clear that the article simply translates the K&R exercise into runes without much consideration, but I should note that reversing a codepoint (or rune) sequence is not identical to reversing the human-readable text. The hiragana example is outright wrong because no Japanese speaker would regard `ゃき` (the codepoint-wise reversal of `きゃ` kya) as meaningful Japanese text.

ynfnehf · on Feb 27, 2022

How is that any different than reversing the English digraphs? "th", "ch", "wh", etc. Reversing English doesn't usually produce something meaningful either.

nerdponx · on Feb 27, 2022

Some languages consider (or used to consider) "ch" as a single "letter".

lifthrasiir · on Feb 27, 2022

You are entirely correct that that K&R exercise is flawed in the first place. I'm more concerned about the language-specific rules for (somehow) reversing a text therefore.

donatj · on Feb 27, 2022

Can you explain further, I’m not sure I understand the objection. Doesn’t any language backwards become unintelligible to its readers? Doesn’t seem specific to Japanese.

How would you wish `きゃ` to reverse if not `ゃき`?

https://emoji.boats/s/きゃ

Seeing as they are both normal runes with no modifiers attached, I don’t know what alternative there would be?

lifthrasiir · on Feb 27, 2022

I think my usage of the word "meaningful" was far too generous. Let me clarify my intention then: the "correctly" (that is, preferred by speakers) reversed text is unlikely to have a meaning, but it is still likely to adhere to the language's phonotactic rules. ゃ (small ya, as opposed to ordinary や ya) is something you never expect to be the first character because it modifies the preceding syllable. The use of such small characters (すてがな sutegana) as the first character in words is very limited in Japanese.

thewakalix · on Feb 27, 2022

I don’t think English is such that every sequence of letters is a valid encoding of phonemes, either.

charcircuit · on Feb 27, 2022

I think it is only you who have a problem with it. Writing Japanese horizontally has only been a thing for about 100 years. About 100 years ago the direction Japanese was written horizontally was mixed between left to right and right to left.

Writing Japanese reversed aka right to left isn't a weird concept.

lifthrasiir · on Feb 27, 2022

I'm aware that RTL Japanese was a thing. In fact, what you have described is not really RTL; its inline direction is top to bottom and its block direction is right to left [1], while what we typically refer to as RTL is about the inline direction. RTL Japanese is actually a specific variation of this TTB/RTL writing where each line consists of only a single character. Very rare (typically signage), but it did exist.

But that doesn't mean your RTL text should be preemptively reversed in memory, which is exactly what the exercise asks you to do. Historically this was the case for some early character encodings, where the reversed text is called the "visual order" as opposed to the "logical order". This caused enough problems that Unicode now primarily uses logical order only, except for a few scripts where the historical visual order is retained (e.g. Thai).

[1] See https://www.w3.org/TR/css-writing-modes-4/#text-flow for more information.

eqvinox · on Feb 27, 2022

If the goal was to properly reverse Unicode text, this doesn't do that — it completely fails to consider combining characters. Those need to stay in-order… otherwise the combining accent jumps to the next/previous character.
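
As a rough sketch of what "staying in-order" could look like, here is a standard C version that reverses an array of code points while keeping each base character together with the marks that follow it. The names `is_combining` and `reverse_clusters` are made up for illustration, and the check is a loud simplification: it only treats U+0300..U+036F (Combining Diacritical Marks) as combining, where a real implementation would consult the UCD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: only the Combining Diacritical Marks block counts as
 * combining. Real code needs the Unicode Character Database here.
 */
static int
is_combining(uint32_t cp)
{
	return cp >= 0x0300 && cp <= 0x036F;
}

/*
 * Hypothetical helper: copy len code points from in to out in reverse
 * order, but keep every base + combining-marks cluster in its original
 * internal order. in and out must not overlap. Returns out.
 */
uint32_t *
reverse_clusters(const uint32_t *in, uint32_t *out, size_t len)
{
	size_t end = len;	/* one past the end of the current cluster */
	size_t to = 0;		/* next write position in out */

	assert(in != NULL && out != NULL);

	while (end > 0) {
		size_t start = end - 1;

		/* walk back over combining marks to find the base */
		while (start > 0 && is_combining(in[start]))
			start--;

		/* copy the whole cluster forwards */
		for (size_t i = start; i < end; i++)
			out[to++] = in[i];

		end = start;
	}
	return out;
}
```

Given the code points `a`, `e`, U+0301, `b`, this produces `b`, `e`, U+0301, `a`, so the acute accent stays on the `e` instead of jumping onto a neighbour.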

Generally speaking, Unicode text can't be reversed without the UCD (Unicode Character Database) at hand.

Also, as soon as you consider that arbitrary groups of characters (or "Rune"s) need to stay in-order, you might as well stay on UTF-8 since the variable length encoding no longer really matters.

(Also-also, some text just can't be reversed, or might change characters in reverse. For example, Greek Sigma [σ] changes to [ς] if it is at the end of a word. Do you readjust that after flipping a word around?)

lillywastaken · on Feb 27, 2022

A lot of this you can find out from reading the programming guide to plan 9 - http://doc.cat-v.org/plan_9/programming/c_programming_in_pla...

e12e · on Feb 27, 2022

Interesting. Gives me a better intuition for strings in zig as well.

I'm curious about the reverse()-function - it requires the caller to allocate and pass in the "out" buffer as a mutable (well, mutable in the sense that the buffer is written to) - yet returns a pointer at the end (rather than void, or an error code).

Is that a typical c/plan9 idiom?

I would probably prefer the function allocating and returning, or the caller allocating and the function (procedure) just writing to the buffer it got as arguments?

    Rune*
    reverse(Rune *in, Rune *out, usize len)
    {
        int to = 0, from = len-1;

        /* copy runes from the end of in to the start of out */
        while(from >= 0){
            out[to] = in[from];
            from--;
            to++;
        }

        return out;
    }

henesy · on Feb 28, 2022

Author here.

While the other comments are correct, the exact reason for passing both in and out is that the reversal is called on a subset of the incoming array.

  line = Brdstr(in, '\n', 1);

will give us a null-terminated string, but we don't want to flip the null and truncate the string, so to lazily avoid that we do:

  rlen = runestrlen(rstr);
  rev = calloc(rlen+1, sizeof (Rune));
  reverse(rstr, rev, rlen);

so we get the number of runes in the input, add 1 for the \0, then reverse the pre-\0 characters.
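
For readers without a Plan 9 system at hand, the same pattern can be sketched in standard C. This is an approximation, not the article's code: `Rune` is a `uint32_t` stand-in, `rune_strlen` is a hypothetical replacement for runestrlen, and `reversed_copy` is an invented wrapper around the three lines above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Rune;	/* ASSUMPTION: stand-in for Plan 9's Rune type */

/* Hypothetical stand-in for runestrlen: count runes before the 0 terminator. */
size_t
rune_strlen(const Rune *s)
{
	size_t n = 0;

	while (s[n] != 0)
		n++;
	return n;
}

/* Copy the first len runes of in into out in reverse order. Returns out. */
Rune *
reverse(const Rune *in, Rune *out, size_t len)
{
	for (size_t i = 0; i < len; i++)
		out[i] = in[len - 1 - i];
	return out;
}

/*
 * Allocate rlen+1 zeroed runes, then reverse only the rlen runes that
 * precede the terminator. The extra zeroed slot from calloc is the
 * terminator of the result. The caller must free the returned buffer.
 */
Rune *
reversed_copy(const Rune *rstr)
{
	size_t rlen;
	Rune *rev;

	assert(rstr != NULL);
	rlen = rune_strlen(rstr);
	rev = calloc(rlen + 1, sizeof(Rune));
	if (rev == NULL)
		return NULL;
	return reverse(rstr, rev, rlen);
}
```

Because calloc zeroes the buffer and reverse only touches the first rlen slots, the terminator ends up in the right place without any special handling.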

We could have the reverse() function allocate n+1 elements for the string and return an always null-terminated string, but then we need to pass it a string that doesn't have a \0, or make it assume that it will always get a \0 and treat that some way.

Passing in both items and the number to iterate felt less noisy for a quick solution :)

e12e · on Feb 28, 2022

Thank you, but that's still not an answer to my question :)

You pass in *out as an argument (allocated by caller, so caller has a handle on it), then the result/value of reverse ends up in *out - and then reverse returns *out as a return value in addition to having done its work.

I was wondering why (I guess you could say I ask why it's a "function" not a "procedure").

I guess that the contract is that "reverse" takes ownership of *out as it's passed in, and just happens to not return a pointer to a different buffer. But then my question is why it can't do its own allocation too... (which you did explain).

In other words - why does reverse return a pointer rather than, say a success/error (or void - can it error? Maybe on arbitrary binary data?)

henesy · on Feb 28, 2022

The first version I implemented did allocate and return the allocated output buffer, but I scrapped that design.

I never changed the return value after I did this and didn’t see an issue at the time.

For the record, this program was written casually on a live stream and is hardly a textbook definition of correctness :)

e12e · on Feb 28, 2022

Ok, thank you. I didn't mean it as criticism, just genuinely curious about the (c language) idioms involved - especially after struggling a bit with Zig strings in relation to advent of code - and after seeing the approach rust takes on ownership.

In zig the idiomatic approach (as far as I can tell) is to manage the allocator in main/the caller and pass it down to the callee. Thus allocation strategy (heap/stack/arena etc) is global/caller determined, while resources are "locally" managed.

I generally work in high level languages like ruby, where one rarely worries about the details of (especially string) allocation (beyond trying to not create way too many copies/buffers).

So it's interesting to see what kind of balance on encapsulation/delegation of this concern can/should be found in C.

Thank you for taking the time to comment.

henesy · on Feb 28, 2022

The other comments are correct that there's value in returning a value, even one that was passed in, because it makes chaining function calls easier.

This can get a little tricky when ownership comes into play. In general (not in the above program's situation), you don't want the callee to hand back an allocated value that the caller never frees, since that easily creates memory leaks.

You wouldn't want to do, for example:

  print("%s\n", smprint("%S", L"世界"));

Since smprint(2)'s return value was allocated and will need to be freed, but we have no way of doing that after the value is passed.

You'd need to do:

  char *buf = smprint("%S", L"世界");
  print("%s\n", buf);
  free(buf);

kevin_thibedeau · on Feb 27, 2022

Caller provided objects are a standard idiom that offers greater flexibility to use static/global vars, objects with a FAM, and custom allocators.

e12e · on Feb 27, 2022

I was more thinking about the combination of caller provided object and callee returning a pointer (in this case to the same object).

But I suppose it decouples "call time" resources from "return time" resources (the current api does not guarantee that the result will be in caller's "out", just that a buffer will be returned that the caller owns and has to de-allocate/manage).

eqvinox · on Feb 27, 2022

That's a common (I wouldn't necessarily call it "standard") C idiom, to allow chaining calls. You could e.g. do

  puts(reverse(fwdbuf, reverse(revbuf, input)));
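
A minimal sketch of a function with that (dst, src) shape, in standard C. The name `reverse_into` is made up for illustration, and it reverses bytes, so it is only meaningful for ASCII input; the point here is the return value, not the Unicode handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write the byte-wise reversal of src into dst and
 * return dst so calls can be nested. dst must hold strlen(src)+1 bytes
 * and must not overlap src. ASSUMPTION: ASCII-only input.
 */
char *
reverse_into(char *dst, const char *src)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);
	for (size_t i = 0; i < len; i++)
		dst[i] = src[len - 1 - i];
	dst[len] = '\0';
	return dst;
}
```

Since the function returns dst, a nested call such as reverse_into(fwdbuf, reverse_into(revbuf, input)) reverses twice and hands back the original text in fwdbuf. If it returned void, the same thing would take two statements plus a separate use of fwdbuf.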

0x20cowboy · on Feb 28, 2022

I’ve been playing with UTF8 in c99 and the examples here have helped me understand how it works a bit better.