Character Encodings

A character encoding maps characters to the byte sequences that represent them. Encodings differ in which characters they cover and in how many bytes they use per character. The encodings covered in this reference are listed below.

Unicode

UTF-8

The dominant encoding for the web. Variable-width (1–4 bytes). Fully backwards-compatible with ASCII. The default encodi...

1–4 bytes/char Since 1993 BOM: EF BB BF
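
As a rough illustration of the variable width and ASCII compatibility described above, the following Python sketch encodes one character from each length class and inspects the BOM. The sample characters are arbitrary picks, and `utf-8-sig` is Python's name for the "UTF-8 with BOM" codec; the sketch assumes Python 3.8 or later for `bytes.hex(" ")`.

```python
# One sample character per UTF-8 length class (arbitrary choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# Pure ASCII text produces identical bytes under both codecs.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")
print(ascii_compatible)  # True

# Python's "utf-8-sig" codec prepends the BOM (EF BB BF) when encoding.
with_bom = "hi".encode("utf-8-sig")
print(with_bom.hex(" "))  # ef bb bf 68 69
```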

UTF-16 LE

UTF-16LE

Little-endian UTF-16. Used internally by Windows, Java, and .NET. Variable-width: 2 bytes for BMP characters, 4 bytes (s...

2 or 4 bytes/char Since 1996 BOM: FF FE
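
To make the 2-byte versus 4-byte split concrete, here is a small Python sketch. It assumes the 4-byte form is a surrogate pair (two 16-bit code units), which is how UTF-16 represents characters outside the BMP; `surrogate_pair` is a hypothetical helper written for this example, not a standard-library function.

```python
def surrogate_pair(code_point):
    """Hypothetical helper: split a code point above U+FFFF into two 16-bit units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

bmp_bytes = "\u20ac".encode("utf-16-le")         # euro sign, inside the BMP
astral_bytes = "\U0001F600".encode("utf-16-le")  # emoji, outside the BMP

print(bmp_bytes.hex(" "))     # ac 20
print(astral_bytes.hex(" "))  # 3d d8 00 de
print([hex(unit) for unit in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

Each 16-bit unit is stored low byte first, which is why the pair D83D DE00 appears as 3d d8 00 de.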

UTF-16 BE

UTF-16BE

Big-endian UTF-16. Network byte-order variant of UTF-16. Used in some network protocols and file formats.

2 or 4 bytes/char Since 1996 BOM: FE FF
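
The only difference between the two UTF-16 variants is byte order, which a short Python sketch can show. `utf16_byte_order` is a hypothetical helper for this example; it only looks at a leading BOM and makes no attempt to guess when the BOM is absent.

```python
import codecs

le = "A".encode("utf-16-le")
be = "A".encode("utf-16-be")
print(le.hex(" "))  # 41 00
print(be.hex(" "))  # 00 41

def utf16_byte_order(data):
    """Hypothetical helper: report byte order from a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    # Caveat: the UTF-32 LE BOM (FF FE 00 00) also starts with FF FE.
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown"

print(utf16_byte_order(b"\xfe\xff\x00A"))  # big-endian
print(utf16_byte_order(b"\xff\xfeA\x00"))  # little-endian
```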

UTF-32 LE

UTF-32LE

Fixed-width encoding using 4 bytes per character. Simple to process but memory-inefficient. Little-endian byte order.

4 bytes/char Since 2003 BOM: FF FE 00 00

UTF-32 BE

UTF-32BE

Fixed-width encoding using 4 bytes per character. Big-endian byte order. Rarely used in practice.

4 bytes/char Since 2003 BOM: 00 00 FE FF
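
The fixed width and its memory cost can be checked with a few lines of Python. This is only a sketch; the sample characters and the word "hello" are arbitrary.

```python
samples = ["A", "\u20ac", "\U0001F600"]  # ASCII letter, euro sign, emoji
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in samples]
print(utf32_lengths)  # [4, 4, 4]

# Same code point (U+0041), opposite byte order.
print("A".encode("utf-32-le").hex(" "))  # 41 00 00 00
print("A".encode("utf-32-be").hex(" "))  # 00 00 00 41

# Memory cost for plain ASCII text compared with UTF-8.
utf8_size = len("hello".encode("utf-8"))
utf32_size = len("hello".encode("utf-32-le"))
print(utf8_size, utf32_size)  # 5 20
```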

Western European

ASCII

US-ASCII

The original 7-bit character encoding standard. Covers 128 characters: English letters, digits, punctuation, and control...

1 byte/char Since 1963
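
A quick Python sketch of the 7-bit limit: every ASCII byte is below 0x80, and anything outside the 128-character set cannot be encoded. The sample strings are arbitrary.

```python
encoded = "Hello".encode("ascii")
all_seven_bit = all(byte < 0x80 for byte in encoded)
print(all_seven_bit)  # True

# A character outside the 128-character set has no ASCII representation.
try:
    "caf\u00e9".encode("ascii")
    accented_ok = True
except UnicodeEncodeError:
    accented_ok = False
print(accented_ok)  # False
```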

Latin-1 (ISO-8859-1)

ISO-8859-1

Extends ASCII to 256 characters, covering most Western European languages. The first 256 Unicode codepoints map 1:1 to L...

1 byte/char Since 1987
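
The 1:1 relationship between byte values and Unicode code points can be verified directly. This is a minimal sketch that relies on Python's `latin-1` codec, which maps each of the 256 byte values to the code point with the same number.

```python
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value equals the code point of the character it decodes to.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))  # True

# An accented letter fits in a single byte.
print("\u00e9".encode("latin-1").hex())  # e9
```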

Windows-1252

windows-1252

Microsoft's extension of Latin-1. Assigns printable characters to the C1 control code range (0x80–0x9F), including the e...

1 byte/char Since 1985
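
To see how the 0x80–0x9F range differs between the two encodings, the sketch below decodes the same bytes both ways. The bytes 0x80, 0x93, and 0x94 are examples chosen for this illustration; `cp1252` is Python's codec name for Windows-1252.

```python
import unicodedata

raw = b"\x80"
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print(unicodedata.name(as_cp1252))      # EURO SIGN
print(unicodedata.category(as_latin1))  # Cc (a control character)

# Curly quotes also live in this range in Windows-1252.
quoted = b"\x93hi\x94".decode("cp1252")
print([hex(ord(ch)) for ch in quoted])  # ['0x201c', '0x68', '0x69', '0x201d']
```

This difference is a common source of stray control characters when Windows-1252 text is mislabeled as Latin-1.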

Central European

ISO-8859-2 (Latin-2)

ISO-8859-2

Covers Central and Eastern European languages using Latin script: Polish, Czech, Slovak, Hungarian, Romanian, Croatian,...

1 byte/char Since 1987
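
Because ISO-8859-2 reuses the same byte values as Latin-1 for different letters, decoding with the wrong one silently produces the wrong text. The sketch below uses the Polish city name Łódź as an arbitrary sample.

```python
word = "\u0141\u00f3d\u017a"  # "Łódź"
latin2_bytes = word.encode("iso-8859-2")
print(len(latin2_bytes))  # 4 (one byte per character)

# Decoding with the matching codec restores the text.
round_trip = latin2_bytes.decode("iso-8859-2")
print(round_trip == word)  # True

# Decoding the same bytes as Latin-1 raises no error but yields other characters.
misread = latin2_bytes.decode("iso-8859-1")
print(misread == word)  # False
```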

Cyrillic

ISO-8859-5 (Cyrillic)

ISO-8859-5

ISO standard for Cyrillic script. Covers Russian, Bulgarian, Serbian, Macedonian, and other languages. Largely replaced...

1 byte/char Since 1988

KOI8-R

Russian character encoding widely used in Unix systems and early internet. Designed so that stripping the high bit gives...

1 byte/char Since 1993
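
The high-bit property can be tried out in Python. This sketch clears bit 7 of every KOI8-R byte and prints what remains; it also shows that ISO-8859-5 assigns different byte values to the same letters. The sample word is arbitrary.

```python
word = "Привет"
koi8_bytes = word.encode("koi8-r")
iso_bytes = word.encode("iso-8859-5")

# Both are single-byte encodings, but the byte values differ.
print(len(koi8_bytes), len(iso_bytes))  # 6 6
print(koi8_bytes == iso_bytes)          # False

# Clear the high bit of each KOI8-R byte and read the result as ASCII.
stripped = bytes(byte & 0x7F for byte in koi8_bytes).decode("ascii")
print(stripped)  # pRIWET
```

The letter case comes out inverted because KOI8-R places lowercase Cyrillic where stripping lands on uppercase Latin, and vice versa.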

East Asian

Shift-JIS

Shift_JIS

Variable-width encoding for Japanese. Single-byte for ASCII and half-width kana, double-byte for kanji and full-width ch...

1–2 bytes/char Since 1982
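
The single-byte and double-byte classes can be checked with Python's `shift_jis` codec. The four sample characters below are arbitrary representatives of each class.

```python
samples = {
    "ASCII letter": "A",
    "half-width katakana": "\uff71",  # ｱ
    "full-width katakana": "\u30a2",  # ア
    "kanji": "\u6f22",                # 漢
}
sjis_lengths = {label: len(ch.encode("shift_jis")) for label, ch in samples.items()}
for label, size in sjis_lengths.items():
    print(f"{label}: {size} byte(s)")
```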

EUC-JP

Extended Unix Code for Japanese. Variable-width encoding common in Unix/Linux Japanese environments and older web pages.

1–3 bytes/char Since 1991
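
A short Python sketch of EUC-JP's byte lengths, using the `euc_jp` codec. It covers only the 1-byte and 2-byte forms; the 3-byte form is not exercised here. The sample characters are arbitrary.

```python
ascii_bytes = "A".encode("euc_jp")
kanji_bytes = "\u6f22".encode("euc_jp")  # 漢
kana_bytes = "\uff71".encode("euc_jp")   # half-width katakana ｱ

print(len(ascii_bytes), len(kanji_bytes), len(kana_bytes))  # 1 2 2

# Half-width katakana are introduced by the single-shift byte 0x8E.
print(kana_bytes.hex(" "))  # 8e b1

# Every byte of a multi-byte EUC-JP character has its high bit set,
# so those bytes can never be mistaken for ASCII.
high_bit_only = all(byte >= 0x80 for byte in kanji_bytes + kana_bytes)
print(high_bit_only)  # True
```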

GBK

Chinese national standard encoding for Simplified Chinese. Superset of GB2312. Variable-width: single-byte for ASCII, do...

1–2 bytes/char Since 1993
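
The superset relationship with GB2312 can be sketched in Python: text that GB2312 can encode yields the same bytes under GBK, while GBK also accepts characters GB2312 lacks. `encodable` is a hypothetical helper for this example, and the traditional character 漢 is just one sample of a GBK-only character.

```python
def encodable(text, encoding):
    """Hypothetical helper: True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "\u6c49\u5b57"  # 汉字
same_bytes = simplified.encode("gbk") == simplified.encode("gb2312")
print(same_bytes)                     # True
print(len(simplified.encode("gbk")))  # 4 (two bytes per character)
print(len("A".encode("gbk")))         # 1

traditional = "\u6f22"  # 漢
print(encodable(traditional, "gbk"))     # True
print(encodable(traditional, "gb2312"))  # False
```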

Big5

Traditional Chinese encoding used in Taiwan, Hong Kong, and Macau. Variable-width: single-byte for ASCII, double-byte fo...

1–2 bytes/char Since 1984
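
Big5 and GBK are both 1–2 byte encodings, but they are not interchangeable. This Python sketch encodes the same arbitrary sample text with each and compares the results.

```python
text = "\u7e41\u9ad4\u4e2d\u6587"  # 繁體中文
big5_bytes = text.encode("big5")
gbk_bytes = text.encode("gbk")

print(len(big5_bytes), len(gbk_bytes))  # 8 8
print(len("A".encode("big5")))          # 1

# Same characters, same length, different byte values.
print(big5_bytes == gbk_bytes)  # False

# Decoding only works with the codec that produced the bytes.
print(big5_bytes.decode("big5") == text)  # True
```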

Encoding FAQ

Which encoding should I use?

Use UTF-8 for almost everything. It covers all Unicode characters, is ASCII-compatible, and is the default for HTML5, JSON, XML, and most modern databases and programming languages. The only common reason to choose a different encoding is compatibility with legacy systems.
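
In practice, "use UTF-8" mostly means naming the encoding explicitly instead of relying on a platform default. The following is a minimal Python sketch of that habit; the file name `note.txt` and the sample text are placeholders.

```python
import os
import tempfile

text = "caf\u00e9 costs \u20ac5"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Always pass encoding= so the result does not depend on the platform default.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

    with open(path, "rb") as handle:
        raw = handle.read()

    with open(path, "r", encoding="utf-8") as handle:
        round_trip = handle.read()

print(raw == text.encode("utf-8"))  # True
print(round_trip == text)           # True
```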

What's the difference between a character set and an encoding?

A character set (like Unicode) is the abstract list of characters and their assigned numbers (codepoints). An encoding (like UTF-8) is the concrete rule for turning those numbers into bytes. Unicode is a character set; UTF-8, UTF-16, and UTF-32 are different encodings of Unicode.
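
The distinction can be seen by taking one character, looking up its codepoint, and then encoding it three ways. This is an illustrative Python sketch; the euro sign is an arbitrary choice, and the big-endian variants are used only so the bytes read in the same order as the codepoint.

```python
char = "\u20ac"  # euro sign

# Character set: the character's abstract number.
code_point = ord(char)
print(f"U+{code_point:04X}")  # U+20AC

# Encodings: different byte sequences for that same number.
encoded = {name: char.encode(name).hex(" ") for name in ("utf-8", "utf-16-be", "utf-32-be")}
print(encoded)
# {'utf-8': 'e2 82 ac', 'utf-16-be': '20 ac', 'utf-32-be': '00 00 20 ac'}
```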

Are UTF-8 and Unicode the same thing?

No. Unicode is a character standard that assigns a unique number (a codepoint) to each character it defines. UTF-8 is one way to encode those numbers as bytes. UTF-8 is by far the most popular encoding of Unicode, which is why the two are often conflated, but they are distinct concepts.