
Programming languages are too obsessed with Unicode, in my opinion:

String operations should have bytes and UTF-{8,16} versions. The string value would carry is_valid_utf_{8,16} flags, and operations should unset them if they end up breaking the format (e.g. str[i] = 0xff would always mark the string as not valid Unicode, while str[i] = 0x00 would check whether the flag was set and, if so, whether the assignment broke a code point, unsetting the flag if it did).
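
A minimal sketch of how such a flag could work, in Python for illustration. The ByteString type and its names are made up, not from any real library, and validity is rechecked lazily rather than tracked incrementally as the proposal suggests:

    # Hypothetical sketch: a byte string that tracks UTF-8 validity lazily.
    class ByteString:
        # Bytes that can never appear anywhere in valid UTF-8.
        _NEVER_IN_UTF8 = {0xC0, 0xC1} | set(range(0xF5, 0x100))

        def __init__(self, data: bytes):
            self._buf = bytearray(data)
            self._valid_utf8 = None                  # None = unknown, recompute on demand

        @property
        def is_valid_utf8(self) -> bool:
            if self._valid_utf8 is None:
                try:
                    self._buf.decode("utf-8")
                    self._valid_utf8 = True
                except UnicodeDecodeError:
                    self._valid_utf8 = False
            return self._valid_utf8

        def __setitem__(self, i: int, byte: int) -> None:
            self._buf[i] = byte
            if byte in self._NEVER_IN_UTF8:
                self._valid_utf8 = False             # s[i] = 0xff can never be valid
            else:
                self._valid_utf8 = None              # may have broken a code point; recheck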



There is zero reason not to go full UTF-8, everywhere, all the time. Strings should not be allowed to be built directly from bytes, only by explicitly converting bytes into UTF-8 (or, for API-specific needs, other encodings, should you want to enter the fun world of UCS-2 for some reason), and that conversion can and should be a cheap wrapper to minimize costs.

Bytes are not strings. Bytes can represent strings, numbers, pointers, absolutely anything.
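
For what it's worth, Python 3's bytes/str split already follows roughly this model, where text only exists as the result of an explicit, validating conversion:

    raw = b"caf\xc3\xa9"                 # bytes: could represent anything
    text = raw.decode("utf-8")           # explicit, validating conversion to text
    assert text == "café"

    try:
        b"\xff\xfe".decode("utf-8")      # invalid UTF-8 is rejected at the boundary
    except UnicodeDecodeError:
        pass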


That would only work if you never deal with legacy (pre-UTF-8) data, or if you are American and all your legacy data is in ASCII.

If you are actually dealing with legacy data, you want your programs to not care about encodings at all.

File names are sequences of bytes... if a name is not valid UTF-8, why would the program even care? A C program from 1980 can print "Cannot open file x<ff><ff>.txt" without knowing what encoding that is, so why can't Python do this today without lots of hoops? Sure, the terminal might not show it, but the user will likely redirect errors to a file anyway and use the right tools to view it.
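
Python can be coaxed into doing this through its bytes APIs, though it is more ceremony than the C program needed. A rough sketch, with the directory scan purely for illustration:

    import os, sys

    # Treat names as opaque bytes: bytes in, bytes out, never decoded.
    for name in os.listdir(b"."):
        try:
            with open(name, "rb"):
                pass
        except OSError:
            # Print the name byte-for-byte, like the 1980 C program would.
            sys.stderr.buffer.write(b"Cannot open file " + name + b"\n")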

File contents are sequences of bytes. Every national encoding keeps the 0-31 range as control characters, and most of them (except systems like Shift-JIS) keep the lower half as well. Which means a program should be able to parse an .ini file which says "name=<ff><ff>", store <ff><ff> in a string variable, and later print "Parsing section <ff><ff>" to stdout, all without ever knowing which encoding this is.
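
A sketch of that kind of encoding-agnostic parsing, assuming a hypothetical config.ini and relying only on '=' and the line endings meaning what they mean in ASCII:

    import sys

    with open("config.ini", "rb") as f:              # bytes in, never decoded
        for line in f:
            key, sep, value = line.rstrip(b"\r\n").partition(b"=")
            if sep and key == b"name":
                # value may be b"\xff\xfe"; store and echo it untouched
                sys.stdout.buffer.write(b"Parsing section " + value + b"\n")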

There are very few operations which _really_ care about encoding (text rendering, case-insensitive matching, column alignment), and those need careful handling and full-blown Unicode libraries anyway, with megabytes of data tables. Forcing all other strings to be UTF-8 just for the sake of it is counterproductive and makes national-language support much harder.
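
Case-insensitive matching shows why those few operations genuinely need the tables; it cannot be done byte-wise:

    # Full case folding needs Unicode data; bytes.lower() only touches ASCII.
    assert "Straße".casefold() == "STRASSE".casefold()      # both fold to "strasse"
    assert b"Stra\xc3\x9fe".lower() != b"STRASSE".lower()   # byte-wise comparison fails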


I agree that there needs to be UTF-8-specific functionality, but not everything `is` UTF-8: file names and file paths, for example. A JSON document should be UTF-8 encoded, but JSON strings should be able to encode arbitrary bytes as "\x00"..."\xff"; since they can already contain garbage UTF-16, we would not lose much.
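
JSON as specified only has \uXXXX escapes, but something close to this already works by smuggling bytes through lone surrogates (Python's surrogateescape handler); a sketch, with a made-up document shape:

    import json

    raw = b"name=\xff\xfe"                                  # not valid UTF-8
    smuggled = raw.decode("utf-8", "surrogateescape")       # bytes -> lone surrogates
    doc = json.dumps({"path": smuggled})                    # escaped as \udcff\udcfe
    back = json.loads(doc)["path"].encode("utf-8", "surrogateescape")
    assert back == raw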


Does your language use any form of accented letters?



