You're mixing up ucs-2 and utf-16.

Robin_Message · on Nov 17, 2020

To expand on this comment, UCS-2 defines a fixed-length, 2-byte encoding of Unicode. It can therefore only represent the first 65536 characters in the Basic Multilingual Plane (BMP).

UTF-16 allows representing characters outside of the BMP by using a reserved area to split a single codepoint into two surrogates that form a pair.

This makes UTF-16 complicated and in some ways worse than UTF-8: the encoding is longer for many typical texts, but is still not fixed-width. The bug you typically see is that codepoints outside of the BMP are munged when clipping the text to a certain length (or reversing it, but that doesn't happen in real systems generally.)

seba_dos1 · on Nov 17, 2020

The reason why some older mobile phones struggle with SMS containing emojis instead of just displaying tofus in place of unsupported characters is that there's no way to send emojis in accordance to SMS standard - it defines the encoding to be UCS-2. In order to put emojis in SMS, newer phones send the messages as UTF-16 instead, technically violating the standard, which can break some parsers that only expect UCS-2 to be there.

lokedhs · on Nov 18, 2020

UTF-16 is the worst of both worlds when compared to UTF-8 and UTF-32. The only reason it exists (and, unfortunately prevalent) is because a number of popular technologies (Java, Javascript, Windows) thought they were being smart when building their Unicode support on UCS-2, and now here we are.

Now, the issue of clipping or reversing strings is a problem not just because of encoding. It simply doesn't work even with UTF-32. You're going to end up cutting off combining characters for example. Manipulating strings is very difficult, and software should never really try to do it unless they know what they are doing, and even then you need to use a library to help you do it.

innocenat · on Nov 18, 2020

> thought they were being smart when building their Unicode support on UCS-2, and now here we are.

Not sure what you mean by 'being smart' when all of those were released before Unicode 2.0.

lokedhs · on Nov 18, 2020

I said they thought they were smart. I'm not going to judge whether it actually was smart based on the situation then.

That said, UTF-8 was already 4 years old by the time Java came out. Surrogate pairs was added to Unicode in 1996, one year prior to the release of Java.

I joined Sun Microsystems around that time, and Unicode really wasn't a thing in the Solaris world for a few more years, so the fact that people wasn't aggressively pushing good Unicode support at the time is understandable. People just didn't have much experience with it.

ucarion · on Nov 17, 2020

To that point, what are systems supposed to make of UTF-8 strings encoding codepoints in the surrogate pair range? Is that well-defined?

In other words, to what extent are surrogate pairs a UTF-16 thing, rather than a Unicode thing that exists to accommodate for UCS-2 -> UTF-16?

SloopJon · on Nov 18, 2020

Surrogates are technically a UTF-16 only thing. Realizing that sometimes they nevertheless escape out into the wild, WTF-8 defines a superset of UTF-8 that encodes them:

https://simonsapin.github.io/wtf-8/

To be clear, this is not an official Unicode spec. It's a hack (albeit a pretty natural and obvious one) to deal with systems that don't do Unicode quite right.

I recently came across some old code that narrows wchar_t to UCS-2 by zeroing out the high-order bytes. Even though my test was careful not to generate any surrogates in the input, they showed up in the output when a randomly generated code point like U+1DF7C was mangled into U+DF7C.

A corrupted value like that is not necessarily a great example of something you want to preserve, but it's the sort of thing that late 90s code assumed about Unicode.

account42 · on Nov 19, 2020

Specifically, filenames on Windows are not UTF-16 (or UCS-12) but rather WTF-16 - like UTF-16 but with possibly unmatched surragate pairs. WTF-8 provides an 8-bit encoding for such filenames that matches UTF-8 wherever the original was valid UTF-16 while converting the rest in the most straightforward way possible, menaing you need less code to go from WTF-16 to WTF-8 than going from UTF-16 to UTF-8 while rejecting invalid characters.

ChrisSD · on Nov 18, 2020

It's invalid according to the spec. They are permanently reserved code points for use in UTF-16.

> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

They could be replaced by the replacement character to produce a valid string.

a1369209993 · on Nov 17, 2020

Nitpick: UCS-2 actually isn't fixed-length either, eg "ẍ̊" (small x+umlaut+ring above) is two code units (1E8D 030A) or possibly three (0078 0308 030A).

ygra · on Nov 17, 2020

UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point). Of course, to represent a grapheme cluster, more than one code point may be needed, but that's true of Unicode in general.

a1369209993 · on Nov 17, 2020

> that's true of Unicode in general.

Yes, that was rather my point: if you're using a Unicode-based character encoding, you're going to have variable-width characters regardless, so you might as well use UTF-8.

> UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point).

Sure, but that's a implementaion detail of the mapping from characters (at the application level) to bytes (at the physical(-ish) representation level).