I doubt you can handle UTF-8 properly with that attitude.
The problem is, there is one very popular OS on which it is very hard to enforce UTF-8 everywhere: Microsoft Windows.
It's very hard to ensure that the entire software stack you depend on uses the Unicode version of the Win32 API. The native character encoding in Windows is actually UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure that all the low-level code you depend on does the same for you.
Oh, and don't forget about Unicode normalization. There is no THE UTF-8. There are a bunch of UTF-8s with different Unicode normalizations. Apple's macOS uses NFD while others mostly use NFC.
These are just some examples. When people living in the ASCII world casually say "I just assume UTF-8", in reality they still assume it's ASCII.
> The native character encoding in Windows is actually UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back.
Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
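As a minimal sketch of how cheap that conversion is in a modern environment, here it is in JavaScript (whose strings are UTF-16 code units internally, with TextEncoder/TextDecoder handling the UTF-8 side):

```javascript
// JS strings are sequences of UTF-16 code units; TextEncoder/TextDecoder
// convert to and from UTF-8, so the two encodings never get confused.
const s = 'héllo';
const utf8 = new TextEncoder().encode(s);                   // string -> Uint8Array of UTF-8 bytes
const roundTripped = new TextDecoder('utf-8').decode(utf8); // UTF-8 bytes -> string

console.log(utf8.length);        // 6: 'é' takes two bytes in UTF-8
console.log(roundTripped === s); // true
```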
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
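To make that concrete, here's a small JavaScript sketch showing how even a single-codepoint emoji forces multi-unit handling in both encodings:

```javascript
const face = '\u{1F600}'; // 😀 U+1F600, one code point outside the BMP
console.log(face.length);                           // 2 UTF-16 code units (a surrogate pair)
console.log([...face].length);                      // 1 code point
console.log(new TextEncoder().encode(face).length); // 4 UTF-8 bytes
```

Any code that assumes "one unit = one character" in either encoding breaks on this immediately, which is exactly the point.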
> When people living in the ASCII world casually say "I just assume UTF-8", in reality they still assume it's ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.
My native language uses some additional CJK characters on Plane 2, and before the ~2010s a lot of software had glitches beyond the Basic Multilingual Plane. I am forever grateful to the "Gen Z" who pushed for emojis.
Javascript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...
There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.
> So probably better to let the application provide its own implementation.
I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:
- Length in bytes of the utf-8 encoded form. Eg useful for http’s content-length field.
- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.
- Number of grapheme clusters in the text when displayed.
These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ascii text. (I’m not sure how many grapheme clusters \0 or a bell is).
Javascript’s string.length is particularly useless because it isn’t even any of the above methods. It returns the number of bytes needed to encode the string as UTF16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ascii. Last I checked, C# and Java strings have the same bug.
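A quick JavaScript illustration of how far apart those measures drift on one string (the polar bear emoji mentioned elsewhere in the thread, built from escapes so the code points are explicit):

```javascript
// Polar bear emoji: U+1F43B (bear) + U+200D (ZWJ) + U+2744 (snowflake) + U+FE0F (VS16)
const bear = '\u{1F43B}\u200D\u2744\uFE0F'; // 🐻‍❄️

// 1. Bytes of the UTF-8 encoded form (e.g. for Content-Length):
console.log(new TextEncoder().encode(bear).length); // 13

// 2. Number of Unicode code points:
console.log([...bear].length); // 4

// What JS's .length actually reports: UTF-16 code units
console.log(bear.length); // 5
```

Three different answers, all "the length" of the same single glyph on screen.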
The built-in string.length method is useless (it returns the number of char objects) and I agree that's a problem, but the solution is also built into the language, unlike in JS.
JS these days also has ways to iterate over codepoints and grapheme clusters. If you treat the string as an iterator, then its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.
The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.
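For example, a small grapheme-counting helper on top of Intl.Segmenter (available in current major engines; `graphemes` is a name I've made up here):

```javascript
// Split a string into grapheme clusters with Intl.Segmenter.
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = (s) => [...seg.segment(s)].map((x) => x.segment);

// 'e' + combining acute accent is 5 code points but 4 clusters:
console.log(graphemes('cafe\u0301').length); // 4

// The polar bear ZWJ sequence (4 code points) is a single cluster:
console.log(graphemes('\u{1F43B}\u200D\u2744\uFE0F').length); // 1
```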
JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.
Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. Its almost never what you want. Especially now that javascript supports unicode through [...str].
String.length is fine, since it counts UTF-16 (UCS-2?) code units. The attribute was just accidentally useful for telling how many characters were in a string for a long time, so people think that's how it should work.
> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)
The main place I've seen it get annoying is searching for some text within other text. It breaks unless you normalize the data you're searching through the same way as you normalize your search string.
After reading the above comment I went looking for Unicode documentation talking about the different normalisation formats. One point surprised me because I hadn't ever thought of it: they said search should be insensitive to normalisation form, so generally you should normalise all text before running a search.
That’s a great tip - obvious in hindsight but one I’d never considered.
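A tiny JavaScript illustration of the failure mode (the strings are just hypothetical test data):

```javascript
const haystack = 'caf\u00E9';  // 'é' as one precomposed code point (NFC)
const needle = 'cafe\u0301';   // 'e' + combining acute accent (NFD)

console.log(haystack === needle);        // false - same text, different code points
console.log(haystack.includes(needle));  // false - a naive search misses it

// Normalize both sides before searching:
console.log(haystack.normalize('NFC').includes(needle.normalize('NFC'))); // true
```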
> your software should handle it correctly. There's no excuse!
It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up to date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with a variation selector suffix.
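For reference, a small sketch of the variation-selector trick (VS15/VS16 are the standard selectors; whether the request is honored depends on the font and renderer):

```javascript
const umbrella = '\u2602';           // U+2602 UMBRELLA, a legacy symbol with both presentations
const asText = umbrella + '\uFE0E';  // VS15 (U+FE0E): request text presentation ☂︎
const asEmoji = umbrella + '\uFE0F'; // VS16 (U+FE0F): request emoji presentation ☂️
// Both are two code points; only the rendering should differ.
```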
> You can't expect every platform to have an up to date database of every novel combination.
On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.
Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.
> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with a variation selector suffix.
I didn't know about that one. I'll have to try it out.
I think being continuously updated should be tied to receiving new external content.
It's fine to have an embedded device that's never updated, but never receives new content - it doesn't matter that a system won't be able to show a new emoji because it doesn't have any content that uses that new emoji.
However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated, and actually get updated; there's no acceptable excuse for that. If it's going to pull new content, it must also pull new updates for itself.
As the user/owner of the device, no thanks. It should have code updates if and only if I ask it to, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.
The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead, effectively nobody sells CD players or devices like CD players, effectively nobody distributes digital content on CDs or things like CDs (like, CD sales are even below vinyl sales) - almost every currently existing product receives content through a channel where updates could trivially be sent as well.
And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.
> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with a variation selector suffix.
I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?
Possibly because the major software companies made it work on phones. Thus users see it working in many apps and complain when your app fails to do this.
Google, Apple, IBM and MS also did a lot of localisation so their code bases deal with encoding.
It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are byte-oriented and many programmers want to treat strings as arrays. The macOS and Windows APIs do take UTF as their input, not char * (agreed, earlier versions did not, but they have provided the UTF encodings for at least 25 years).
Those are ok, but both of those emoji are represented as a single unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.
I just tried it in gnome-terminal, and while the crying emoji works fine, polar bear or a country flag causes weird issues.
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it is Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find`, and type the name of the file the exact same way, but it can't find the file because find got an NFC form while the actual file name is in NFD.
OTOH, in many applications you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent an accented e.
Thanks, yet another quantum of knowledge that makes one's life irreversibly ever so slightly worse. But not as bad as encryption (and learning all the terrible ways most applications have broken implementations of it).
We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there is none, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; otherwise assume Windows-1252. Worked well for us so far.
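A rough sketch of that heuristic in JavaScript (`sniffEncoding` is a made-up name, and the original utility was presumably not JS; a strict TextDecoder stands in for the invalid-sequence scan):

```javascript
// Hypothetical sketch of the described heuristic: BOM first, then a strict
// UTF-8 decode; fall back to Windows-1252 if the bytes aren't valid UTF-8.
function sniffEncoding(bytes) {
  // UTF-8 BOM is EF BB BF.
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  try {
    // fatal: true makes the decoder throw on any invalid UTF-8 sequence.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch {
    return 'windows-1252';
  }
}
```

For example, `sniffEncoding(new Uint8Array([0x63, 0xE9]))` falls back to Windows-1252, because a lone 0xE9 ('é' in that code page) is not a valid UTF-8 sequence.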
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing, then any delimiters will be one of the usual suspects and the rest is data.
Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.
I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough, when you combine two UTF-8/ASCII code points into one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly written a C-style ASCII string onto the filesystem without realizing that NTFS is UTF-16.
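The mojibake is easy to reproduce; here's a sketch using Node's Buffer to reinterpret two ASCII bytes as one UTF-16LE code unit:

```javascript
// Two ASCII bytes misread as a single UTF-16LE code unit.
const bytes = Buffer.from('ab', 'ascii');  // 0x61 0x62
const misread = bytes.toString('utf16le'); // little-endian: 0x62 << 8 | 0x61 = 0x6261

console.log(misread.length);                     // 1 "character" where there were 2
console.log(misread.charCodeAt(0).toString(16)); // '6261' - inside CJK Unified Ideographs (U+4E00..U+9FFF)
```

Any run of paired ASCII letters misread this way lands in that block, hence whole folder names of "Chinese characters".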
Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, they browbeat UTF-8 upon us, which, okay, I'll make sure that declaration is in my metadata, but no one bothers to explain how or why we got to that point, or why it seems so splintered.
This Joel On Software article [0] is a good starting point. Incredibly it's now over 20 years old so that makes me feel ancient! But still relevant today.
The suggestion that the web should just use utf-8 everywhere is largely true today. But we still have to interact with other software that may not use utf-8 for various legacy reasons - the CSV file example in the original article is a good example. Joel's article also mentions the solution discussed in the original article, i.e. use heuristics to deduce the encoding.
Why would it break? If you just assume that the system codepage is UTF-8, then sure. If you specifically say in your manifest that you want UTF-8, then Windows (10+) will give you UTF-8 regardless of which locale it is:
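For concreteness, this is roughly what that manifest setting looks like per Microsoft's documentation (the activeCodePage element; you'd merge this into your existing application manifest):

```xml
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

With this set (on Windows 10 1903+), the ANSI "-A" Win32 APIs interpret and return UTF-8 for your process regardless of the system locale.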
Some [1] may also consider working for any company/app that needs to display an emoji, to be a waste of at least one life (your life, and all your users' lives).