I find the stronger differentiation between bytes and strings leads to a lot of ...

mkesper · on Nov 14, 2019

You at least notice it's wrong. In Python 2, sometimes it "worked" until it broke, and you had to figure out why.

snagglegaggle · on Nov 14, 2019

See, it just broke when UTF-8 was interpreted as ASCII. It's entirely possible to treat bytes as bytes and leave encoding out of it for the vast majority of programs. If you're dealing with text editing and so on, then you know you need to be UTF-8 aware, and the broken programs would still be broken in either language.

The visibility of the errors is a minor point, but I think it more appropriate that it be solved by e.g. the windowing toolkit API.

stefco_ · on Nov 14, 2019

> the broken programs would still be broken in either language.

You need to slap a decode on reads from subprocesses anyway in Python 3, and files open in text mode (decoded to Unicode strings) by default. Wouldn't that fix the majority of silly UTF-8 compat bugs? Or am I missing a class of bugs that's not avoided automatically by Python 3 strings?
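For concreteness, here is a rough sketch of the two cases I mean, assuming Python 3.7+ (for `capture_output`). The child command and the file name `sample.txt` are made up for illustration, and I pass `encoding="utf-8"` explicitly because the default text encoding depends on the platform and locale.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child process: writes the UTF-8 bytes for "naïve" to stdout.
child_code = r"import sys; sys.stdout.buffer.write(b'na\xc3\xafve')"
result = subprocess.run(
    [sys.executable, "-c", child_code], capture_output=True, check=True
)

raw = result.stdout          # bytes: pipes are binary unless text=True is passed
text = raw.decode("utf-8")   # the explicit decode you have to write anyway

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # illustrative file name
    with open(path, "wb") as f:
        f.write(raw)
    # open() defaults to text mode, so read() returns str, not bytes.
    with open(path, encoding="utf-8") as f:
        from_file = f.read()

print(type(raw).__name__, type(text).__name__, type(from_file).__name__)
```

Either way you end up holding a `str`, and the place where the encoding was assumed is visible in the code.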

snagglegaggle · on Nov 14, 2019

Well, the summary of the argument is that Python 3's Unicode strings do not actually solve the fundamental problem of multiple encodings existing. Think: Do you know that the process actually returns UTF-8, or that the file is actually encoded in UTF-8? No, you're just guessing. This puts people in the habit of trying to turn everything into UTF-8, which is something that could happen automatically and not require so much boilerplate.

On the other end, most programs don't actually care what the data encoding is. They just move it.

int_19h · on Nov 15, 2019

> Think: Do you know that the process actually returns UTF-8, or that the file is actually encoded in UTF-8? No, you're just guessing.

Well, no, not really. You go read the docs and try to find out. Most of the time, there is a definitive encoding - if there weren't, a lot more things would be broken. Sometimes, it is not guaranteed, even though de facto that is the case - and this highlights broken interface specifications. When it is truly unknown, you explicitly treat it as raw bytes.
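As a small sketch of that last case: if you open files in binary mode, Python 3 never tries to decode anything, so data in an unknown encoding passes through untouched. The payload and the file names `in.dat` and `out.dat` below are invented for the example.

```python
import os
import tempfile

# Bytes in some unknown encoding; deliberately not valid UTF-8.
payload = b"caf\xe9 \xff\xfe"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.dat")    # illustrative names
    dst = os.path.join(tmp, "out.dat")
    with open(src, "wb") as f:
        f.write(payload)
    # "rb"/"wb" move raw bytes; no codec is involved at any point.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())
    with open(dst, "rb") as f:
        copied = f.read()

# Guessing UTF-8 for the same data would have failed loudly instead.
try:
    payload.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(copied == payload, decodable)
```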

And the good thing about Python 3 is that it forces you to think about this. In Python 2, most of the time, data processing code can be hacked together, and it "just works", right until the point the input happens to include something unanticipated. Like, say, the word "naïve".
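To make the "naïve" case concrete, here is a minimal Python 3 sketch. The explicit `decode("ascii")` stands in for the implicit ASCII conversion that Python 2 performed behind your back whenever byte strings met Unicode strings.

```python
word = "naïve"
data = word.encode("utf-8")   # 6 bytes for 5 characters: ï takes two bytes

# ASCII-only input would sail through this; the ï is what breaks it.
try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(len(word), len(data))
print(type(error).__name__)
print(data.decode("utf-8") == word)   # the right codec round-trips cleanly
```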

> On the other end, most programs don't actually care what the data encoding is. They just move it.

It doesn't necessarily mean that they get to dodge the bullet. In Python 2, if you read data from a file, you get raw bytes, but if you read data from parsed JSON, you get Unicode strings - because JSON itself is guaranteed to be Unicode. Guess what happens when the byte string you've read from the file, and the Unicode string you've read from a JSON HTTP response, are concatenated?
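In Python 2, that concatenation implicitly decodes the byte string as ASCII, so it works for pure-ASCII input and raises `UnicodeDecodeError` the first time a non-ASCII byte shows up. A hedged sketch of how Python 3 handles the same situation (the JSON document and variable names are made up for illustration):

```python
import json

# Stand-in for bytes read from a file opened in binary mode.
from_file = "naïve".encode("utf-8")
# Parsed JSON always yields str in Python 3.
from_json = json.loads('{"word": "caf\\u00e9"}')["word"]

# Mixing bytes and str fails immediately, regardless of the content.
try:
    combined = from_file + from_json
    failure = None
except TypeError as exc:
    failure = exc

# The fix is to state the encoding once, at the boundary.
combined = from_file.decode("utf-8") + " " + from_json

print(type(failure).__name__)
print(combined)
```

The point is that the failure no longer depends on what the data happens to contain: the type mismatch shows up on the first run, not on the first unlucky input.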