UTF-16 and Surrogate Pairs: How JavaScript Handles Text

JavaScript strings are encoded internally using UTF-16, a legacy inherited from the days when Unicode was expected to fit within 65,536 characters. Understanding UTF-16 — and its mechanism for handling characters outside that range — is essential for writing correct string-handling code in JavaScript, Java, and C#.

How UTF-16 Works

UTF-16 uses 2 bytes (one 16-bit code unit) for characters in the Basic Multilingual Plane (U+0000–U+FFFF) and 4 bytes (two 16-bit code units, called a surrogate pair) for characters in the supplementary planes (U+10000–U+10FFFF). The supplementary planes include nearly all emoji and many historic scripts.
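
A quick way to see this split is to dump a string's raw code units. The codeUnits helper below is an illustrative name, not a built-in; it simply reads each 16-bit unit with charCodeAt():

```js
// Illustrative helper: list a string's UTF-16 code units as hex
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

codeUnits("é"); // ["00E9"]         -> BMP: one code unit, 2 bytes
codeUnits("𝄞"); // ["D834", "DD1E"] -> supplementary: a surrogate pair, 4 bytes
```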

Surrogate Pairs Explained

The Unicode range U+D800–U+DFFF is permanently reserved for surrogate code points: they have no character assignments and exist solely as UTF-16 encoding machinery. In well-formed UTF-16, a high surrogate (U+D800–U+DBFF) is always immediately followed by a low surrogate (U+DC00–U+DFFF), and together the two encode one supplementary character. For example, the 😀 emoji (U+1F600) encodes as the surrogate pair 0xD83D 0xDE00 in UTF-16.
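
The mapping itself is plain arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves, one per surrogate. Here is a minimal sketch of both directions (the function names are illustrative, not standard APIs):

```js
// Encode a supplementary code point (U+10000–U+10FFFF) as a surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// Decode a surrogate pair back into the original code point
function fromSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

toSurrogatePair(0x1f600).map(u => u.toString(16)); // ["d83d", "de00"]
fromSurrogatePair(0xd83d, 0xde00).toString(16);    // "1f600"
```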

JavaScript's Surrogate Problem

JavaScript's string methods count UTF-16 code units, not Unicode characters. "😀".length returns 2, not 1. Methods like charAt() and charCodeAt() operate on code units, so they can return half a surrogate pair. ES2015 introduced String.prototype.codePointAt() and the for...of loop, which handle supplementary characters correctly.
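
Concretely, in any modern JavaScript console:

```js
const face = "😀"; // U+1F600, stored as the pair 0xD83D 0xDE00

face.length;        // 2: counts code units, not characters
face.charAt(0);     // "\uD83D": a lone high surrogate, not a usable character
face.charCodeAt(0); // 55357 (0xD83D)

// ES2015 tools are surrogate-aware:
face.codePointAt(0); // 128512 (0x1F600): reads the whole pair
for (const ch of face) {
  console.log(ch);   // logs "😀" once; for...of iterates by code point
}
[...face].length;    // 1: spread uses the same code-point iterator
```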

Comparing Encodings

For most transmission and storage purposes, UTF-8 is preferred over UTF-16: it is more space-efficient for ASCII-heavy text, and because its code unit is a single byte, it has no endianness issues. UTF-16 files, by contrast, typically begin with a Byte Order Mark (BOM) to signal whether the 16-bit code units are little- or big-endian. You can compare how specific characters encode in both formats using our character browser.
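
One way to check the size difference, assuming an environment with the standard TextEncoder API (modern browsers and Node.js), is to compare byte counts directly. TextEncoder only emits UTF-8, so the UTF-16 sizes below are derived from .length, since each code unit occupies 2 bytes; the two helper names are illustrative:

```js
const utf8Bytes  = s => new TextEncoder().encode(s).length; // TextEncoder always emits UTF-8
const utf16Bytes = s => s.length * 2;                       // 2 bytes per UTF-16 code unit

utf8Bytes("hello");  // 5:  ASCII is 1 byte per character in UTF-8
utf16Bytes("hello"); // 10: twice the size for ASCII-heavy text

utf8Bytes("😀");     // 4: supplementary characters cost 4 bytes in both encodings
utf16Bytes("😀");    // 4
```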
