UTF-16 LE

Unicode

Little-endian UTF-16. Used internally by Windows, Java, and .NET. Variable-width: 2 bytes for BMP characters, 4 bytes (surrogate pairs) for supplementary characters.

IANA Name

UTF-16LE

Width

Variable (2 or 4 bytes)

BOM

FF FE

Introduced

1996

Byte Structure

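UTF-16 uses a single 2-byte code unit for characters in the Basic Multilingual Plane (U+0000–U+FFFF, excluding the surrogate range U+D800–U+DFFF). For supplementary characters (U+10000 and above), it uses surrogate pairs: a 2-byte high surrogate (D800–DBFF) followed by a 2-byte low surrogate (DC00–DFFF). In UTF-16 LE, each 2-byte code unit is stored low byte first.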

Codepoint range	Encoding	Bytes
U+0000–U+FFFF	Direct 2-byte value	2
U+10000–U+10FFFF	Surrogate pair (D800–DBFF + DC00–DFFF)	4
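
The two rows above can be sketched as a small hand-written encoder. This is an illustrative sketch only: the helper name `utf16le_bytes` is an assumption made for this example, and in practice Python's built-in `utf-16-le` codec does the same job.

```python
import struct

def utf16le_bytes(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-16 LE by hand (hypothetical helper)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if codepoint <= 0xFFFF:
        # BMP: the codepoint itself is the single 16-bit code unit
        return struct.pack("<H", codepoint)
    # Supplementary: subtract 0x10000, then split the 20-bit result in two
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return struct.pack("<HH", high, low)  # "<" means little-endian

print(utf16le_bytes(0x20AC).hex(" "))   # ac 20
print(utf16le_bytes(0x1F600).hex(" "))  # 3d d8 00 de
```

The second call encodes U+1F600 as the surrogate pair D83D DE00, with each code unit written low byte first.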

When to Use UTF-16 LE

UTF-16 is the internal string representation in Windows, Java, .NET, and JavaScript (Swift used it too before switching its native string storage to UTF-8 in Swift 5). You'll encounter it when reading Windows files (e.g. many .txt files saved by Notepad), parsing Java or .NET strings at the binary level, or working with APIs that return UTF-16 encoded data. For storage or transmission, UTF-8 is almost always a better choice unless your target platform requires UTF-16.
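
As a rough sketch of the Windows-file case, the snippet below writes a small BOM-prefixed UTF-16 LE file and reads it back in Python. The file name `notes.txt` and its contents are made up for this example; the point is the difference between the `utf-16` codec, which consumes the BOM, and `utf-16-le`, which does not.

```python
import codecs
import os
import tempfile

# Simulate a Windows-style text file: BOM (FF FE) followed by UTF-16 LE text
path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # hypothetical file
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "Hello, 世界".encode("utf-16-le"))

# The "utf-16" codec reads the BOM, picks the byte order, and strips it
with open(path, encoding="utf-16") as f:
    text = f.read()

# Decoding as "utf-16-le" keeps the BOM as a leading U+FEFF character
with open(path, encoding="utf-16-le") as f:
    raw = f.read()

print(repr(text))
print(repr(raw))
```

If a file is known to carry a BOM, `utf-16` is usually the more convenient choice; `utf-16-le` fits data that is documented as little-endian with no BOM.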

Sample Characters in UTF-16 LE

The table below shows how a selection of characters is represented in UTF-16 LE. Bytes are shown in hexadecimal and decimal. UTF-16 can encode every Unicode character, so every entry is marked as supported; no replacement or transliteration is needed when converting from another Unicode encoding.

Character	Codepoint	Name	Bytes (Hex)	Bytes (Decimal)	Supported
A	U+0041	LATIN CAPITAL LETTER A	41 00	65 0	Yes
a	U+0061	LATIN SMALL LETTER A	61 00	97 0	Yes
0	U+0030	DIGIT ZERO	30 00	48 0	Yes
$	U+0024	DOLLAR SIGN	24 00	36 0	Yes
£	U+00A3	POUND SIGN	A3 00	163 0	Yes
©	U+00A9	COPYRIGHT SIGN	A9 00	169 0	Yes
€	U+20AC	EURO SIGN	AC 20	172 32	Yes
α	U+03B1	GREEK SMALL LETTER ALPHA	B1 03	177 3	Yes
А	U+0410	CYRILLIC CAPITAL LETTER A	10 04	16 4	Yes
中	U+4E2D	CJK UNIFIED IDEOGRAPH-4E2D	2D 4E	45 78	Yes
あ	U+3042	HIRAGANA LETTER A	42 30	66 48	Yes
☺	U+263A	WHITE SMILING FACE	3A 26	58 38	Yes
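
Rows like those above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `table_row` and the tab-separated output format are assumptions chosen to mirror the table, not part of any library.

```python
import unicodedata

def table_row(ch: str) -> str:
    """Build one tab-separated row: character, codepoint, name, hex, decimal."""
    data = ch.encode("utf-16-le")
    hex_bytes = " ".join(f"{b:02X}" for b in data)
    dec_bytes = " ".join(str(b) for b in data)
    name = unicodedata.name(ch, "")  # empty string if the character is unnamed
    return f"{ch}\tU+{ord(ch):04X}\t{name}\t{hex_bytes}\t{dec_bytes}"

for ch in "A€中":
    print(table_row(ch))
```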

Working with UTF-16 LE in Code

Most major languages have built-in support for encoding conversion. The examples below show how to encode a string to UTF-16 LE bytes and decode it back to a Unicode string. Always specify the encoding explicitly; never rely on system defaults, which vary by OS and locale.

Python

# Encode a string to utf-16le bytes
text = "Hello, 世界"
encoded = text.encode("UTF-16LE")

# Decode bytes back to a string
decoded = encoded.decode("UTF-16LE")

PHP

// Convert to utf-16le
$bytes = mb_convert_encoding(
    "Hello, 世界",
    "UTF-16LE",
    "UTF-8"
);

// Convert back to UTF-8
$text = mb_convert_encoding(
    $bytes,
    "UTF-8",
    "UTF-16LE"
);

JavaScript

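// TextEncoder only produces UTF-8, so write the UTF-16 code units by hand
const str = "Hello, 世界";
const bytes = new Uint8Array(str.length * 2);
const view = new DataView(bytes.buffer);
for (let i = 0; i < str.length; i++) {
  view.setUint16(i * 2, str.charCodeAt(i), true); // true = little-endian
}

// Decode UTF-16 LE bytes back to a string
const decoder = new TextDecoder("utf-16le");
const text = decoder.decode(bytes);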

SQL (PostgreSQL)

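-- PostgreSQL has no UTF-16 server encoding; create the database as UTF8
-- and convert to or from UTF-16 LE in the application layer
CREATE DATABASE mydb
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  TEMPLATE template0;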

-- Check database encoding
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();

UTF-16 LE FAQ

What is the difference between UTF-16LE and UTF-16BE?

The difference is byte order. UTF-16LE (Little Endian) stores the low byte of each 2-byte code unit first; UTF-16BE (Big Endian) stores the high byte first. For the letter A (U+0041), UTF-16LE is 41 00 and UTF-16BE is 00 41. A Byte Order Mark (BOM, U+FEFF) at the start of a file indicates which variant is in use: it appears as FF FE in little-endian data and FE FF in big-endian data. Windows and most Windows APIs default to UTF-16LE.
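
A short sketch of the byte-order difference and of BOM sniffing follows. The helper name `sniff_utf16_byte_order` is hypothetical, and the check is deliberately simple: it returns `None` when no UTF-16 BOM is present rather than guessing.

```python
import codecs

# The same character in both byte orders
print("A".encode("utf-16-le").hex(" "))  # 41 00
print("A".encode("utf-16-be").hex(" "))  # 00 41

def sniff_utf16_byte_order(data: bytes):
    """Return a codec name based on a leading UTF-16 BOM, or None if absent."""
    # Caveat: FF FE 00 00 is also the UTF-32 LE BOM; this sketch ignores that
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None

sample = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(sniff_utf16_byte_order(sample))  # utf-16-le
```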

What are UTF-16 surrogate pairs?

Characters above U+FFFF require two 2-byte UTF-16 code units called a surrogate pair: a high surrogate (D800–DBFF) followed by a low surrogate (DC00–DFFF). Together they encode codepoints in the range U+10000–U+10FFFF. Code that processes UTF-16 strings must account for surrogate pairs to count characters or slice strings correctly; overlooking them is a common source of bugs with emoji.
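
The reverse direction, and the slicing hazard, can be sketched like this. The function name `combine_surrogates` is an assumption made for this example; it recombines a high and low surrogate into the codepoint they represent.

```python
import struct

def combine_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into its codepoint (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

data = "\U0001F600".encode("utf-16-le")   # bytes 3D D8 00 DE
high, low = struct.unpack("<HH", data)    # 0xD83D, 0xDE00
print(f"U+{combine_surrogates(high, low):X}")  # U+1F600

# Cutting the byte string between the two code units leaves a lone
# high surrogate, which Python's decoder rejects
try:
    data[:2].decode("utf-16-le")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)  # False
```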

Does Java use UTF-16?

Yes. Java's char type is a single UTF-16 code unit (2 bytes), and String is exposed as a sequence of those code units. Characters above U+FFFF require two char values (a surrogate pair). Java's String.length() returns the number of UTF-16 code units, not the number of Unicode codepoints, which is a common source of bugs when processing emoji or supplementary characters.
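
The gap between the two counts can be illustrated in Python, where `len()` counts codepoints. The helper `utf16_code_units` is an assumption for this sketch; it computes the figure that a UTF-16 code unit count such as Java's String.length() would report for the same text.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units the string occupies in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

s = "Hi \U0001F600"          # three BMP characters plus one emoji
print(len(s))                # 4 codepoints
print(utf16_code_units(s))   # 5 code units (the emoji takes two)
```

In Java itself, String.codePointCount(0, s.length()) is the standard way to get the codepoint count.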
