UTF-16 LE

Unicode

Little-endian UTF-16. Used internally by Windows, Java, and .NET. Variable-width: 2 bytes for BMP characters, 4 bytes (surrogate pairs) for supplementary characters.

Name: UTF-16LE
Width: Variable (2 or 4 bytes)
BOM: FF FE
Introduced: 1996

Byte Structure

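UTF-16 LE encodes each character in the Basic Multilingual Plane (U+0000–U+FFFF, excluding the surrogate range U+D800–U+DFFF) as one 2-byte code unit, stored low byte first. Supplementary characters (U+10000 and above) use a surrogate pair: two consecutive 2-byte code units, a high surrogate in D800–DBFF followed by a low surrogate in DC00–DFFF.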

Codepoint range | Encoding | Bytes
U+0000–U+FFFF | Direct 2-byte value | 2
U+10000–U+10FFFF | Surrogate pair (D800–DBFF + DC00–DFFF) | 4
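
As a rough illustration of the table above, the following Python sketch derives the byte count from the codepoint range alone and compares it with Python's built-in "utf-16-le" codec. The helper name utf16_byte_count is hypothetical and exists only for this example.

```python
def utf16_byte_count(codepoint: int) -> int:
    """Return how many bytes UTF-16 needs for one codepoint (hypothetical helper)."""
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate codepoints are not characters")
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    # BMP codepoints fit in one code unit; everything else needs a surrogate pair.
    return 2 if codepoint <= 0xFFFF else 4


# Compare the range rule with what the built-in codec actually produces.
for ch in ("A", "\u4e2d", "\U0001F600"):
    actual = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X}: rule={utf16_byte_count(ord(ch))} codec={actual}")
```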

When to Use UTF-16 LE

UTF-16 is the string representation exposed by the Windows API, Java, .NET, and JavaScript. You'll encounter it when reading Windows files (for example, .txt files saved from Notepad with the "Unicode" or "UTF-16 LE" option), parsing Java or .NET strings at the binary level, or working with APIs that return UTF-16 encoded data. For storage or transmission, UTF-8 is almost always a better choice unless your target platform requires UTF-16.
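
When reading such files, the byte order mark needs some care. The Python sketch below, which uses a made-up byte string standing in for the start of a Windows text file, shows one way the two built-in codecs differ: "utf-16" consumes the BOM, while "utf-16-le" keeps it as a leading U+FEFF character.

```python
# Illustrative input: BOM (FF FE) followed by "Hi" in UTF-16 LE.
raw = b"\xff\xfeH\x00i\x00"

# "utf-16" reads the BOM to pick the byte order, then discards it.
with_bom_handling = raw.decode("utf-16")

# "utf-16-le" assumes the byte order and keeps the BOM as U+FEFF.
without_bom_handling = raw.decode("utf-16-le")

print(repr(with_bom_handling))
print(repr(without_bom_handling))
```

If a file is known to start with a BOM, decoding with "utf-16" avoids having to strip the extra character afterwards.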

Sample Characters in UTF-16 LE

The table below shows how a selection of characters is represented in UTF-16 LE. Bytes are shown in hexadecimal and decimal. UTF-16 LE can encode every Unicode character, so every row is marked as supported and no replacement or transliteration is needed when converting from another Unicode encoding.

Character | Codepoint | Name | Bytes (Hex) | Bytes (Decimal) | Supported
A | U+0041 | LATIN CAPITAL LETTER A | 41 00 | 65 0 | Yes
a | U+0061 | LATIN SMALL LETTER A | 61 00 | 97 0 | Yes
0 | U+0030 | DIGIT ZERO | 30 00 | 48 0 | Yes
$ | U+0024 | DOLLAR SIGN | 24 00 | 36 0 | Yes
£ | U+00A3 | POUND SIGN | A3 00 | 163 0 | Yes
© | U+00A9 | COPYRIGHT SIGN | A9 00 | 169 0 | Yes
€ | U+20AC | EURO SIGN | AC 20 | 172 32 | Yes
α | U+03B1 | GREEK SMALL LETTER ALPHA | B1 03 | 177 3 | Yes
А | U+0410 | CYRILLIC CAPITAL LETTER A | 10 04 | 16 4 | Yes
中 | U+4E2D | CJK UNIFIED IDEOGRAPH-4E2D | 2D 4E | 45 78 | Yes
あ | U+3042 | HIRAGANA LETTER A | 42 30 | 66 48 | Yes
☺ | U+263A | WHITE SMILING FACE | 3A 26 | 58 38 | Yes
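
Rows like these can be reproduced with a few lines of Python. The sketch below is one possible way to do it; the helper name utf16le_row and its output layout are assumptions made for this example.

```python
def utf16le_row(ch: str) -> str:
    """Format one character as codepoint, hex bytes, and decimal bytes (hypothetical helper)."""
    data = ch.encode("utf-16-le")
    hex_bytes = " ".join(f"{b:02X}" for b in data)
    dec_bytes = " ".join(str(b) for b in data)
    return f"U+{ord(ch):04X} | {hex_bytes} | {dec_bytes}"


# A, EURO SIGN, and the CJK ideograph U+4E2D from the table.
for ch in ("A", "\u20ac", "\u4e2d"):
    print(utf16le_row(ch))
```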

Working with UTF-16 LE in Code

Every major language has built-in support for encoding conversion. The examples below show how to encode a string to UTF-16 LE bytes and decode it back to a Unicode string. Always specify the encoding explicitly — never rely on system defaults, which vary by OS and locale.

Python

# Encode a string to UTF-16 LE bytes (no BOM is written)
text = "Hello, 世界"
encoded = text.encode("UTF-16LE")

# Decode bytes back to a string
decoded = encoded.decode("UTF-16LE")

PHP

// Convert a UTF-8 string to UTF-16 LE
$bytes = mb_convert_encoding(
    "Hello, 世界",
    "UTF-16LE",
    "UTF-8"
);

// Convert back to UTF-8
$text = mb_convert_encoding(
    $bytes,
    "UTF-8",
    "UTF-16LE"
);
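
JavaScript

// TextEncoder only produces UTF-8, so write the UTF-16 LE code units by hand
const str = "Hello, 世界";
const bytes = new Uint8Array(str.length * 2);
const view = new DataView(bytes.buffer);
for (let i = 0; i < str.length; i++) {
  view.setUint16(i * 2, str.charCodeAt(i), true); // true = little-endian
}

// Decode UTF-16 LE bytes back to a string
const decoder = new TextDecoder("utf-16le");
const text = decoder.decode(bytes);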
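
SQL (PostgreSQL)

-- PostgreSQL has no UTF-16 server encoding, so ENCODING 'UTF-16LE' fails.
-- Store text as UTF8 and convert to or from UTF-16 LE in the application.
CREATE DATABASE mydb
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8'
  TEMPLATE template0;

-- Check database encoding
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();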

UTF-16 LE FAQ

What is the difference between UTF-16LE and UTF-16BE?

The difference is byte order. UTF-16LE (little-endian) stores the low byte of each 2-byte code unit first; UTF-16BE (big-endian) stores the high byte first. For the letter A (U+0041), UTF-16LE is 41 00 and UTF-16BE is 00 41. A byte order mark (BOM, U+FEFF) at the start of a file indicates which variant is in use: it appears as FF FE in little-endian data and FE FF in big-endian data. Windows and most Windows APIs use UTF-16LE.
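
The byte-order difference and the BOM check can be sketched in Python as follows. The function name sniff_utf16_byte_order is hypothetical, and the sketch deliberately ignores the case where FF FE 00 00 is a UTF-32 LE BOM.

```python
import codecs
from typing import Optional


def sniff_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the UTF-16 variant from a leading BOM (hypothetical helper).

    Returns None when the data does not start with a UTF-16 BOM.
    Caveat: FF FE 00 00 is also the UTF-32 LE BOM, which is not handled here.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


# The same letter in both byte orders.
print("A".encode("utf-16-le").hex(" "))
print("A".encode("utf-16-be").hex(" "))
print(sniff_utf16_byte_order(b"\xff\xfeA\x00"))
```

Without a BOM the byte order cannot be determined from the first two bytes alone, which is why the helper returns None instead of guessing.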

What are UTF-16 surrogate pairs?

Characters above U+FFFF require two 2-byte UTF-16 code units called a surrogate pair: a high surrogate (D800–DBFF) followed by a low surrogate (DC00–DFFF). Together they encode codepoints in the range U+10000–U+10FFFF. Code that processes UTF-16 strings must handle surrogate pairs to correctly count characters or slice strings — a common source of bugs with emoji.
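
The arithmetic behind a surrogate pair is short: subtract 0x10000 from the codepoint, put the top 10 bits of the result into the high surrogate and the bottom 10 bits into the low surrogate. The Python sketch below shows this under the assumption that the input is a single supplementary codepoint; the function names are hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a supplementary codepoint into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)    # GRINNING FACE emoji
# In UTF-16 LE each code unit is written low byte first.
le_bytes = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(f"{high:04X} {low:04X}", le_bytes.hex(" "))
```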

Does Java use UTF-16?

Java's char type is a single UTF-16 code unit (2 bytes), and String is a sequence of code units. Characters above U+FFFF require two char values (a surrogate pair). Java's String.length() returns the number of UTF-16 code units, not the number of Unicode codepoints — a common source of bugs when processing emoji or supplementary characters.
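
The gap between code units and codepoints can be illustrated outside Java as well. In the Python sketch below, len() counts codepoints, while the hypothetical helper utf16_code_units counts UTF-16 code units, which is the quantity Java's String.length() reports. In Java itself, String.codePointCount(0, s.length()) returns the codepoint count.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, the quantity Java's String.length() returns."""
    # Each code unit is exactly 2 bytes in UTF-16 LE (no BOM is written).
    return len(text.encode("utf-16-le")) // 2


sample = "a\U0001F600"  # one BMP letter plus one emoji outside the BMP

code_points = len(sample)              # Python strings count codepoints
code_units = utf16_code_units(sample)  # the emoji takes two code units

print(code_points, code_units)
```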