Haha. I guess if I had to ask any question about python, it would be, "why on earth did they do that ridiculous unicode thing in version 3?" Most builtins should use bytes, and the few that can't should just use bytes annotated with an encoding.
Most builtins can't use bytes instead of strings safely for all data, and the problem propagates - in Python 2 you often hit the case where library A uses library B, which uses a builtin that treats a piece of text as a string of bytes, and so you can't use library A at all because it gives you broken results under certain conditions. We've spent time updating some third-party open source libraries to support Python 3, and it was well worth it to avoid the programmer time we'd waste working in Python 2 without sane handling of unicode strings.

And if you're working on user-facing software (as opposed to, say, scripts for system administration or physics calculations), pretty much every string you encounter nowadays is a unicode string: names of people, names of files, contents of files, results of HTTP requests, results of database queries. All of those can be treated as streams of bytes only if you treat them as a single atomic token and never look inside that stream in any way whatsoever. As soon as you do the first index, substring, or split operation, you can't safely treat them as bytes anymore.
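To make that concrete, a quick Python 3 sketch (the name "Zoë" is just an example; any non-ASCII text behaves the same way):

    name = "Zoë"                   # ordinary user-facing text: 3 characters
    data = name.encode("utf-8")    # the same text as bytes: b'Zo\xc3\xab', 4 bytes

    print(len(name))   # 3  - characters, what the user sees
    print(len(data))   # 4  - bytes, an encoding detail

    print(name[2])     # 'ë' - the third character
    print(data[2])     # 195 - half of a UTF-8 sequence, meaningless on its own

    # Slicing the bytes in the middle of that sequence produces mojibake:
    print(data[:3].decode("utf-8", errors="replace"))  # 'Zo<replacement char>'

Indexing or slicing the str always lands on a character boundary; indexing or slicing the bytes can land in the middle of one.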
However, you can simply think of Python 3's str as syntactic sugar that manages those encoding annotations for you in the default case. Where is it creating problems for you? Is it a performance hit, or something else?
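Roughly, the "bytes annotated with an encoding" idea done by hand versus what str gives you (a sketch; the byte literal is just an example of incoming data):

    raw = b'Zo\xc3\xab'            # bytes arriving from a file, socket, etc.

    # Manual annotation: carry the encoding alongside the bytes and
    # remember to apply it at every use site.
    annotated = (raw, "utf-8")
    text_by_hand = annotated[0].decode(annotated[1])

    # Python 3: decode once at the boundary, work with characters inside,
    # encode again only when the data leaves the program.
    text = raw.decode("utf-8")
    print(text.upper())            # 'ZOË' - character-level operations just work
    print(text.encode("utf-8"))    # b'Zo\xc3\xab' again on the way out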