UTF-16 and Surrogate Pairs: How JavaScript Handles Text

JavaScript strings are encoded internally using UTF-16, a legacy inherited from the days when Unicode was expected to fit within 65,536 characters. Understanding UTF-16 — and its mechanism for handling characters outside that range — is essential for writing correct string-handling code in JavaScript, Java, and C#.

How UTF-16 Works

UTF-16 uses 2 bytes (one 16-bit code unit) for characters in the Basic Multilingual Plane (U+0000–U+FFFF) and 4 bytes (two 16-bit code units, called a surrogate pair) for characters in the supplementary planes (U+10000–U+10FFFF). The supplementary planes include nearly all emoji and many historic scripts.
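
A quick way to see this split is to dump a string's raw code units. The codeUnits helper below is an illustrative name, not a built-in; it simply reads each 16-bit unit with charCodeAt():

```js
// Illustrative helper: list a string's UTF-16 code units as hex
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

codeUnits("é"); // ["00E9"]         -> BMP: one code unit, 2 bytes
codeUnits("𝄞"); // ["D834", "DD1E"] -> supplementary: a surrogate pair, 4 bytes
```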

Surrogate Pairs Explained

The Unicode range U+D800–U+DFFF is permanently reserved for surrogate code points: they have no character assignments and exist solely as UTF-16 encoding machinery. In well-formed UTF-16, a high surrogate (U+D800–U+DBFF) is always immediately followed by a low surrogate (U+DC00–U+DFFF), and together the two encode one supplementary character. For example, the 😀 emoji (U+1F600) encodes as the surrogate pair 0xD83D 0xDE00 in UTF-16.
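
The mapping itself is plain arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves, one per surrogate. Here is a minimal sketch of both directions (the function names are illustrative, not standard APIs):

```js
// Encode a supplementary code point (U+10000–U+10FFFF) as a surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// Decode a surrogate pair back into the original code point
function fromSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

toSurrogatePair(0x1f600).map(u => u.toString(16)); // ["d83d", "de00"]
fromSurrogatePair(0xd83d, 0xde00).toString(16);    // "1f600"
```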

JavaScript's Surrogate Problem

JavaScript's string methods count UTF-16 code units, not Unicode characters. "😀".length returns 2, not 1. Methods like charAt() and charCodeAt() operate on code units, so they can return half a surrogate pair. ES2015 introduced String.prototype.codePointAt() and the for...of loop, which handle supplementary characters correctly.
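
Concretely, in any modern JavaScript console:

```js
const face = "😀"; // U+1F600, stored as the pair 0xD83D 0xDE00

face.length;        // 2: counts code units, not characters
face.charAt(0);     // "\uD83D": a lone high surrogate, not a usable character
face.charCodeAt(0); // 55357 (0xD83D)

// ES2015 tools are surrogate-aware:
face.codePointAt(0); // 128512 (0x1F600): reads the whole pair
for (const ch of face) {
  console.log(ch);   // logs "😀" once; for...of iterates by code point
}
[...face].length;    // 1: spread uses the same code-point iterator
```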

Comparing Encodings

For most transmission and storage purposes, UTF-8 is preferred over UTF-16: it is more space-efficient for ASCII-heavy text, and because its code unit is a single byte, it has no endianness issues. UTF-16 files, by contrast, typically begin with a Byte Order Mark (BOM) to signal whether the 16-bit code units are little- or big-endian. You can compare how specific characters encode in both formats using our character browser.
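
One way to check the size difference, assuming an environment with the standard TextEncoder API (modern browsers and Node.js), is to compare byte counts directly. TextEncoder only emits UTF-8, so the UTF-16 sizes below are derived from .length, since each code unit occupies 2 bytes; the two helper names are illustrative:

```js
const utf8Bytes  = s => new TextEncoder().encode(s).length; // TextEncoder always emits UTF-8
const utf16Bytes = s => s.length * 2;                       // 2 bytes per UTF-16 code unit

utf8Bytes("hello");  // 5:  ASCII is 1 byte per character in UTF-8
utf16Bytes("hello"); // 10: twice the size for ASCII-heavy text

utf8Bytes("😀");     // 4: supplementary characters cost 4 bytes in both encodings
utf16Bytes("😀");    // 4
```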
