
That would only work if you never deal with legacy (pre-UTF-8) data, or if you are American and all your legacy data is in ASCII.

If you are actually dealing with legacy data, you want your programs to not care about encodings at all.

File names are sequences of bytes... if a name is not valid UTF-8, why would the program even care? A C program from 1980 can print "Cannot open file x<ff><ff>.txt" without knowing what encoding it is in, so why can't Python do this today without jumping through hoops? Sure, the terminal might not show it, but the user will likely redirect errors to a file anyway and use the right tools to view it.
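For what it's worth, Python 3 can do this, but only via the "surrogateescape" error handler (PEP 383), which is exactly the kind of hoop I mean. A minimal sketch, with a hypothetical non-UTF-8 filename:

```python
import sys

# Hypothetical filename whose bytes are not valid UTF-8.
name = b"x\xff\xff.txt"

# surrogateescape smuggles the invalid bytes through str as lone
# surrogates, and round-trips them back to the identical bytes.
as_str = name.decode("utf-8", errors="surrogateescape")
assert as_str.encode("utf-8", errors="surrogateescape") == name

# The error message can be emitted byte-for-byte on the raw stream:
sys.stdout.buffer.write(
    b"Cannot open file " + as_str.encode("utf-8", "surrogateescape") + b"\n"
)
```

So the bytes survive, but every boundary (decode, print, re-encode) has to opt into this explicitly, whereas the 1980 C program never had to think about it.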

File contents are sequences of bytes. Every national encoding keeps the 0-31 range as control characters, and most of them (except systems like Shift-JIS) keep the lower half as well. That means a program should be able to parse an .ini file which says "name=<ff><ff>", store <ff><ff> in a string variable, and later print "Parsing section <ff><ff>" to stdout - all without ever knowing which encoding this is.
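The point is that only the structural characters ("=", newline) need to be understood; the value is opaque. A sketch of such encoding-agnostic parsing, using bytes end to end (the .ini content here is made up for illustration):

```python
import sys

# Raw file contents; the value bytes are in some unknown encoding.
ini = b"name=\xff\xff\n"

section = {}
for line in ini.splitlines():
    if b"=" in line:
        key, _, value = line.partition(b"=")
        section[key] = value  # value stays raw bytes, never decoded

# Later, echo it back verbatim without ever knowing the encoding:
sys.stdout.buffer.write(b"Parsing section " + section[b"name"] + b"\n")
```

Since "=" and "\n" sit in the control/lower-half range that the legacy encodings agree on, splitting on them is safe regardless of what the value bytes mean.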

There are very few operations which _really_ care about encoding (text rendering, case-insensitive matching, column alignment), and those need careful handling and full-blown Unicode libraries anyway, with megabytes of data tables. Forcing all other strings to be UTF-8 just for the sake of it is counterproductive and makes national-language support much harder.


