
I find the stronger differentiation between bytes and strings leads to a lot of gotchas, like when I forget to decode bytes or pass a string where bytes are expected.

I understand why it's the way it is, but when it comes to the typical unixy things I need to do (shuffling files around, tar'ing stuff, etc.), it definitely trips me up more than I'd wish.



You at least notice it's wrong. In Python2 sometimes 'it worked' until it broke and you had to figure out why.


See, it just broke when UTF-8 was interpreted as ASCII. It's entirely possible to treat bytes as bytes and leave encoding out of it for the vast majority of programs. If you're dealing with text editing and so on, then you know you need to be UTF-8 aware, and the broken programs would still be broken in either language.

The visibility of the errors is a minor point, but I think it more appropriate that it be solved by e.g. the windowing toolkit API.


> the broken programs would still be broken in either language.

You need to slap a decode on reads from subprocesses in Python 3 anyway, and files open in text (Unicode) mode by default. Wouldn't that fix the majority of silly UTF-8 compat bugs? Or am I missing a class of bugs that's not avoided automatically by Python 3 strings?
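For concreteness, here is a rough sketch of what I mean. The child process is a hypothetical stand-in for any external tool, and UTF-8 is an assumption about its output, not something Python can verify for you. Note that open() in text mode uses a platform-dependent default encoding unless you pass one, so the sketch passes encoding explicitly.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child process: writes the UTF-8 bytes for "naïve\n" to stdout.
child_code = "import sys; sys.stdout.buffer.write('na\\u00efve\\n'.encode('utf-8'))"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    check=True,
)

raw = result.stdout          # bytes: subprocess output is not decoded by default
text = raw.decode("utf-8")   # str: valid only if the tool really emits UTF-8

# Text-mode files decode on read; passing encoding= avoids relying on the
# platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()
```

Alternatively, subprocess.run accepts encoding="utf-8" (or text=True for the locale default) and hands back str directly, which is the same decode in a different place.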


Well, the summary of the argument is that Python 3's Unicode handling does not actually solve the fundamental problem that multiple encodings exist. Think: do you know that the process actually returns UTF-8, or that the file is actually encoded in UTF-8? No, you're just guessing. This puts people in the habit of attempting to turn everything into UTF-8, which could happen automatically and not require so much boilerplate.

On the other end, most programs don't actually care what the data encoding is. They just move it.


> Think: Do you know that the process actually returns UTF-8, or that the file is actually encoded in UTF-8? No, you're just guessing.

Well, no, not really. You go read the docs and try to find out. Most of the time, there is a definitive encoding - if there weren't, a lot more things would be broken. Sometimes, it is not guaranteed, even though de facto that is the case - and this highlights broken interface specifications. When it is truly unknown, you explicitly treat it as raw bytes.
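A minimal sketch of the "treat it as raw bytes" case, assuming all the program does is move data. The helper name copy_bytes and the sample payload are made up for illustration. Nothing is decoded, so bytes that are not valid UTF-8 pass through untouched.

```python
import os
import tempfile


def copy_bytes(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in binary mode; return the number of bytes copied."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied


# Latin-1 encoded "café" followed by two bytes that are invalid in UTF-8.
payload = b"caf\xe9 \xff\xfe"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")
    dst = os.path.join(tmp, "out.bin")
    with open(src, "wb") as f:
        f.write(payload)
    n = copy_bytes(src, dst)
    with open(dst, "rb") as f:
        copied_payload = f.read()

# Decoding the same data as UTF-8 would fail, which is why the copy never tries.
try:
    payload.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

In real code shutil.copyfile does the same job; the loop is spelled out only to show that no encoding is involved at any step.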

And the good thing about Python 3 is that it forces you to think about this. In Python 2, most of the time, data processing code can be hacked together, and it "just works", right until the point the input happens to include something unanticipated. Like, say, the word "naïve".

> On the other end, most programs don't actually care what the data encoding is. They just move it.

It doesn't necessarily mean that they get to dodge the bullet. In Python 2, if you read data from a file, you get raw bytes, but if you read data from parsed JSON, you get Unicode strings - because JSON itself is guaranteed to be Unicode. Guess what happens when the byte string you've read from the file, and the Unicode string you've read from a JSON HTTP response, are concatenated?
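To spell out the answer: Python 2 implicitly decodes the byte string as ASCII before concatenating, so it works for ASCII-only data and raises UnicodeDecodeError as soon as a non-ASCII byte shows up. Below is a sketch in Python 3 that emulates that coercion with an explicit decode("ascii"); the function names are made up for illustration, and this is an emulation, not Python 2 itself.

```python
import json


def concat_py3(raw, text):
    """Python 3 semantics: bytes + str is a TypeError regardless of content."""
    return raw + text


def concat_py2_style(raw, text):
    """Emulate Python 2's implicit coercion: decode the bytes as ASCII first."""
    return raw.decode("ascii") + text


def outcome(func, raw, text):
    """Return the result, or the name of the exception that was raised."""
    try:
        return func(raw, text)
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__


# A Unicode string as it would come out of a parsed JSON response.
json_text = json.loads('{"word": "na\\u00efve"}')["word"]

ascii_bytes = b"naive "                              # ASCII-only file contents
utf8_bytes = "na\u00efve ".encode("utf-8")           # file contents with "ï"

py2_ascii = outcome(concat_py2_style, ascii_bytes, json_text)   # works
py2_utf8 = outcome(concat_py2_style, utf8_bytes, json_text)     # breaks on input
py3_ascii = outcome(concat_py3, ascii_bytes, json_text)         # always an error
py3_utf8 = outcome(concat_py3, utf8_bytes, json_text)           # always an error
```

The point of the comparison: the Python 2 style failure depends on the data, so it can stay hidden through testing, while the Python 3 failure depends only on the types and shows up on the first run.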



