I doubt you can handle UTF-8 properly with that attitude.
The problem is, there is one very popular OS on which it is very hard to enforce UTF-8 everywhere: Microsoft Windows.
It's very hard to ensure that the entire software stack you depend on uses the Unicode version of the Win32 API. The native character encoding in Windows is actually UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure that all the low-level code you depend on does the same for you.
Oh, and don't forget about Unicode normalization. There is no THE UTF-8. There are a bunch of UTF-8s with different Unicode normalizations. Apple's macOS uses NFD while others mostly use NFC.
These are just some examples. When people living in the ASCII world casually say "I just assume UTF-8", in reality they still assume it's ASCII.
> The native character encoding in Windows is actually UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back.
Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
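As a minimal sketch of how cheap that conversion is in a modern environment, here it is in JavaScript (whose strings are UTF-16 code units internally, with TextEncoder/TextDecoder handling the UTF-8 side):

```javascript
// JS strings are sequences of UTF-16 code units; TextEncoder/TextDecoder
// convert to and from UTF-8, so the two encodings never get confused.
const s = 'héllo';
const utf8 = new TextEncoder().encode(s);                   // string -> Uint8Array of UTF-8 bytes
const roundTripped = new TextDecoder('utf-8').decode(utf8); // UTF-8 bytes -> string

console.log(utf8.length);        // 6: 'é' takes two bytes in UTF-8
console.log(roundTripped === s); // true
```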
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
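To make that concrete, here's a small JavaScript sketch showing how even a single-codepoint emoji forces multi-unit handling in both encodings:

```javascript
const face = '\u{1F600}'; // 😀 U+1F600, one code point outside the BMP
console.log(face.length);                           // 2 UTF-16 code units (a surrogate pair)
console.log([...face].length);                      // 1 code point
console.log(new TextEncoder().encode(face).length); // 4 UTF-8 bytes
```

Any code that assumes "one unit = one character" in either encoding breaks on this immediately, which is exactly the point.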
> When people living in the ASCII world casually say "I just assume UTF-8", in reality they still assume it's ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.
My native language uses some additional CJK characters on Plane 2, and before the ~2010s a lot of software had glitches beyond the Basic Multilingual Plane. I am forever grateful to the "Gen Z" who pushed for emojis.
Javascript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...
There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.
> So probably better to let the application provide its own implementation.
I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:
- Length in bytes of the utf-8 encoded form. Eg useful for http’s content-length field.
- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.
- Number of grapheme clusters in the text when displayed.
These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ascii text. (I’m not sure how many grapheme clusters \0 or a bell is).
Javascript’s string.length is particularly useless because it isn’t even any of the above methods. It returns the number of bytes needed to encode the string as UTF16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ascii. Last I checked, C# and Java strings have the same bug.
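A quick JavaScript illustration of how far apart those measures drift on one string (the polar bear emoji mentioned elsewhere in the thread, built from escapes so the code points are explicit):

```javascript
// Polar bear emoji: U+1F43B (bear) + U+200D (ZWJ) + U+2744 (snowflake) + U+FE0F (VS16)
const bear = '\u{1F43B}\u200D\u2744\uFE0F'; // 🐻‍❄️

// 1. Bytes of the UTF-8 encoded form (e.g. for Content-Length):
console.log(new TextEncoder().encode(bear).length); // 13

// 2. Number of Unicode code points:
console.log([...bear].length); // 4

// What JS's .length actually reports: UTF-16 code units
console.log(bear.length); // 5
```

Three different answers, all "the length" of the same single glyph on screen.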
The built-in string.length method is useless (it returns the number of char objects) and I agree that's a problem, but the solution is also built into the language, unlike in JS.
JS these days also has ways to iterate over codepoints and grapheme clusters. If you treat the string as an iterator, then its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.
The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.
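For example, a small grapheme-counting helper on top of Intl.Segmenter (available in current major engines; `graphemes` is a name I've made up here):

```javascript
// Split a string into grapheme clusters with Intl.Segmenter.
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = (s) => [...seg.segment(s)].map((x) => x.segment);

// 'e' + combining acute accent is 5 code points but 4 clusters:
console.log(graphemes('cafe\u0301').length); // 4

// The polar bear ZWJ sequence (4 code points) is a single cluster:
console.log(graphemes('\u{1F43B}\u200D\u2744\uFE0F').length); // 1
```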
JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.
Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. Its almost never what you want. Especially now that javascript supports unicode through [...str].
String.length is fine, since it counts UTF-16 (UCS-2?) code units. The attribute was just accidentally useful for telling how many characters were in a string for a long time, so people think that's how it should work.
> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)
The main place I've seen it get annoying is searching for some text within other text. It breaks unless you normalize the data you're searching through the same way as you normalize your search string.
After reading the above comment I went looking for Unicode documentation talking about the different normalisation formats. One point surprised me because I hadn't ever thought of it: they said search should be insensitive to normalisation form, so generally you should normalise all text before running a search.
That’s a great tip - obvious in hindsight but one I’d never considered.
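A tiny JavaScript illustration of the failure mode (the strings are just hypothetical test data):

```javascript
const haystack = 'caf\u00E9';  // 'é' as one precomposed code point (NFC)
const needle = 'cafe\u0301';   // 'e' + combining acute accent (NFD)

console.log(haystack === needle);        // false - same text, different code points
console.log(haystack.includes(needle));  // false - a naive search misses it

// Normalize both sides before searching:
console.log(haystack.normalize('NFC').includes(needle.normalize('NFC'))); // true
```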
> your software should handle it correctly. There's no excuse!
It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up to date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with a variation selector suffix.
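For reference, a small sketch of the variation-selector trick (VS15/VS16 are the standard selectors; whether the request is honored depends on the font and renderer):

```javascript
const umbrella = '\u2602';           // U+2602 UMBRELLA, a legacy symbol with both presentations
const asText = umbrella + '\uFE0E';  // VS15 (U+FE0E): request text presentation ☂︎
const asEmoji = umbrella + '\uFE0F'; // VS16 (U+FE0F): request emoji presentation ☂️
// Both are two code points; only the rendering should differ.
```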
> You can't expect every platform to have an up to date database of every novel combination.
On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.
Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.
> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with a variation selector suffix.
I didn't know about that one. I'll have to try it out.
I think being continuously updated should be tied to receiving new external content.
It's fine to have an embedded device that's never updated, but never receives new content - it doesn't matter that a system won't be able to show a new emoji because it doesn't have any content that uses that new emoji.
However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated, and actually get updated; there's no acceptable excuse for that. If it's going to pull new content, it must also pull new updates for itself.
As the user/owner of the device, no thanks. It should have code updates if and only if I ask it to, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.
The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead, effectively nobody sells CD players or devices like CD players, effectively nobody distributes digital content on CDs or things like CDs (like, CD sales are even below vinyl sales) - almost every currently existing product receives content through a channel where updates could trivially be sent as well.
And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.
> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with a variation selector suffix.
I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?
Possibly because the major software companies made it work on phones. Thus users see it working in many apps and complain when your app fails to do this.
Google, Apple, IBM and MS also did a lot of localisation so their code bases deal with encoding.
It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are byte-oriented and many programmers want to treat strings as arrays. The macOS and Windows APIs do take UTF as their input, not char * (agreed, earlier versions did not, but they have provided the UTF encodings for at least 25 years).
Those are ok, but both of those emoji are represented as a single unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.
I just tried it in gnome-terminal, and while the crying emoji works fine, polar bear or a country flag causes weird issues.
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it is Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find`, and type the name of the file the exact same way, but it can't find the file because find got an NFC form while the actual file name is in NFD.
OTOH, in many applications you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent an accented e.
Thanks, yet another quantum of knowledge that makes one's life irreversibly ever so slightly worse. But not as bad as encryption (and learning all the terrible ways most applications have broken implementations of it).
We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there is none, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; otherwise assume Windows-1252. Worked well for us so far.
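A rough sketch of that heuristic in JavaScript (`sniffEncoding` is a made-up name, and the original utility was presumably not JS; a strict TextDecoder stands in for the invalid-sequence scan):

```javascript
// Hypothetical sketch of the described heuristic: BOM first, then a strict
// UTF-8 decode; fall back to Windows-1252 if the bytes aren't valid UTF-8.
function sniffEncoding(bytes) {
  // UTF-8 BOM is EF BB BF.
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  try {
    // fatal: true makes the decoder throw on any invalid UTF-8 sequence.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch {
    return 'windows-1252';
  }
}
```

For example, `sniffEncoding(new Uint8Array([0x63, 0xE9]))` falls back to Windows-1252, because a lone 0xE9 ('é' in that code page) is not a valid UTF-8 sequence.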
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing, then any delimiters will be one of the usual suspects and the rest is data.
Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.
I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough, when you combine two UTF-8/ASCII code points into one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly written a C-style ASCII string onto the filesystem without realizing that NTFS is UTF-16.
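The mojibake is easy to reproduce; here's a sketch using Node's Buffer to reinterpret two ASCII bytes as one UTF-16LE code unit:

```javascript
// Two ASCII bytes misread as a single UTF-16LE code unit.
const bytes = Buffer.from('ab', 'ascii');  // 0x61 0x62
const misread = bytes.toString('utf16le'); // little-endian: 0x62 << 8 | 0x61 = 0x6261

console.log(misread.length);                     // 1 "character" where there were 2
console.log(misread.charCodeAt(0).toString(16)); // '6261' - inside CJK Unified Ideographs (U+4E00..U+9FFF)
```

Any run of paired ASCII letters misread this way lands in that block, hence whole folder names of "Chinese characters".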
Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, they browbeat UTF-8 upon us, which, okay, I'll make sure that declaration is in my metadata, but no one bothers to explain how or why we got to that point, or why it seems so splintered.
This Joel On Software article [0] is a good starting point. Incredibly it's now over 20 years old so that makes me feel ancient! But still relevant today.
The suggestion that the web should just use utf-8 everywhere is largely true today. But we still have to interact with other software that may not use utf-8 for various legacy reasons - the CSV file example in the original article is a good example. Joel's article also mentions the solution discussed in the original article, i.e. use heuristics to deduce the encoding.
Why would it break? If you just assume that the system codepage is UTF-8, then sure. If you specifically say in your manifest that you want UTF-8, then Windows (10+) will give you UTF-8 regardless of which locale it is:
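For concreteness, this is roughly what that manifest setting looks like per Microsoft's documentation (the activeCodePage element; you'd merge this into your existing application manifest):

```xml
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

With this set (on Windows 10 1903+), the ANSI "-A" Win32 APIs interpret and return UTF-8 for your process regardless of the system locale.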
Some [1] may also consider working for any company/app that needs to display an emoji, to be a waste of at least one life (your life, and all your users' lives).