
Programming languages are too obsessed with Unicode, in my opinion:

String operations should have bytes and UTF-{8,16} versions. The string value would carry is_valid_utf_{8,16} flags, and operations should unset them if they end up breaking the format (e.g. str[i] = 0xff would always mark the string as not valid Unicode, while str[i] = 0x00 would check whether the flag was set and, if so, whether the assignment broke a code point, unsetting the flag if it did).
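
A minimal sketch of how such a flag could work, in Python for illustration. The ByteString type and its names are made up, not from any real library, and validity is rechecked lazily rather than tracked incrementally as the proposal suggests:

    # Hypothetical sketch: a byte string that tracks UTF-8 validity lazily.
    class ByteString:
        # Bytes that can never appear anywhere in valid UTF-8.
        _NEVER_IN_UTF8 = {0xC0, 0xC1} | set(range(0xF5, 0x100))

        def __init__(self, data: bytes):
            self._buf = bytearray(data)
            self._valid_utf8 = None                  # None = unknown, recompute on demand

        @property
        def is_valid_utf8(self) -> bool:
            if self._valid_utf8 is None:
                try:
                    self._buf.decode("utf-8")
                    self._valid_utf8 = True
                except UnicodeDecodeError:
                    self._valid_utf8 = False
            return self._valid_utf8

        def __setitem__(self, i: int, byte: int) -> None:
            self._buf[i] = byte
            if byte in self._NEVER_IN_UTF8:
                self._valid_utf8 = False             # s[i] = 0xff can never be valid
            else:
                self._valid_utf8 = None              # may have broken a code point; recheck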



There is zero reason not to go full UTF-8, everywhere, all the time. Strings should not be allowed to be built directly from bytes, only by explicitly converting bytes into UTF-8 (or, for API-specific needs, other encodings, should you want to enter the fun world of UCS-2 for some reason), and that conversion can and should be a cheap wrapper to minimize costs.

Bytes are not strings. Bytes can represent strings, numbers, pointers, absolutely anything.
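
For what it's worth, Python 3's bytes/str split already follows roughly this model, where text only exists as the result of an explicit, validating conversion:

    raw = b"caf\xc3\xa9"                 # bytes: could represent anything
    text = raw.decode("utf-8")           # explicit, validating conversion to text
    assert text == "café"

    try:
        b"\xff\xfe".decode("utf-8")      # invalid UTF-8 is rejected at the boundary
    except UnicodeDecodeError:
        pass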


That would only work if you never deal with legacy (pre-UTF-8) data, or if you are American and all your legacy data is in ASCII.

If you are actually dealing with legacy data, you want your programs to not care about encodings at all.

File names are sequences of bytes... if a name is not valid UTF-8, why would the program even care? A C program from 1980 can print "Cannot open file x<ff><ff>.txt" without knowing what encoding that is, so why can't Python do this today without lots of hoops? Sure, the terminal might not show it, but the user will likely redirect errors to a file anyway and use the right tools to view it.
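
Python can be coaxed into doing this through its bytes APIs, though it is more ceremony than the C program needed. A rough sketch, with the directory scan purely for illustration:

    import os, sys

    # Treat names as opaque bytes: bytes in, bytes out, never decoded.
    for name in os.listdir(b"."):
        try:
            with open(name, "rb"):
                pass
        except OSError:
            # Print the name byte-for-byte, like the 1980 C program would.
            sys.stderr.buffer.write(b"Cannot open file " + name + b"\n")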

File contents are sequences of bytes. Every national encoding keeps the 0-31 range as control characters, and most of them (except systems like Shift-JIS) keep the lower half as well. Which means a program should be able to parse an .ini file which says "name=<ff><ff>", store <ff><ff> in a string variable, and later print "Parsing section <ff><ff>" to stdout, all without ever knowing which encoding this is.
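
A sketch of that kind of encoding-agnostic parsing, assuming a hypothetical config.ini and relying only on '=' and the line endings meaning what they mean in ASCII:

    import sys

    with open("config.ini", "rb") as f:              # bytes in, never decoded
        for line in f:
            key, sep, value = line.rstrip(b"\r\n").partition(b"=")
            if sep and key == b"name":
                # value may be b"\xff\xfe"; store and echo it untouched
                sys.stdout.buffer.write(b"Parsing section " + value + b"\n")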

There are very few operations which _really_ care about encoding (text rendering, case-insensitive matching, column alignment), and those need careful handling and full-blown Unicode libraries anyway, with megabytes of data tables. Forcing all other strings to be UTF-8 just for the sake of it is counterproductive and makes national-language support much harder.
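
Case-insensitive matching shows why those few operations genuinely need the tables; it cannot be done byte-wise:

    # Full case folding needs Unicode data; bytes.lower() only touches ASCII.
    assert "Straße".casefold() == "STRASSE".casefold()      # both fold to "strasse"
    assert b"Stra\xc3\x9fe".lower() != b"STRASSE".lower()   # byte-wise comparison fails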


I agree that there needs to be UTF-8-specific functionality, but not everything `is` UTF-8: file names and file paths, for example. A JSON document should be UTF-8 encoded, but JSON strings should be able to encode arbitrary bytes as "\x00"..."\xff"; since they can already contain garbage UTF-16, we would not lose much.
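
JSON as specified only has \uXXXX escapes, but something close to this already works by smuggling bytes through lone surrogates (Python's surrogateescape handler); a sketch, with a made-up document shape:

    import json

    raw = b"name=\xff\xfe"                                  # not valid UTF-8
    smuggled = raw.decode("utf-8", "surrogateescape")       # bytes -> lone surrogates
    doc = json.dumps({"path": smuggled})                    # escaped as \udcff\udcfe
    back = json.loads(doc)["path"].encode("utf-8", "surrogateescape")
    assert back == raw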


Does your language use any form of accented letters?



