Character Encodings
A character encoding is a rule for mapping characters to the bytes that represent them. Encodings differ in which characters they cover and in how many bytes they use per character. The encodings covered in this reference are listed below, grouped by script or region.
Unicode
UTF-8
The dominant encoding for the web. Variable-width (1–4 bytes). Fully backwards-compatible with ASCII. The default encodi...
UTF-16 LE
Little-endian UTF-16. Used internally by Windows, Java, and .NET. Variable-width: 2 bytes for BMP characters, 4 bytes (s...
UTF-16 BE
Big-endian UTF-16. Network byte-order variant of UTF-16. Used in some network protocols and file formats.
UTF-32 LE
Fixed-width encoding using 4 bytes per character. Simple to process but memory-inefficient. Little-endian byte order.
UTF-32 BE
Fixed-width encoding using 4 bytes per character. Big-endian byte order. Rarely used in practice.
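The width differences among the Unicode encodings above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the sample characters and the helper name `byte_lengths` are illustrative choices, not part of this reference.

```python
# Compare how many bytes each Unicode encoding needs for the same character.
SAMPLES = ["A", "é", "€", "😀"]  # U+0041, U+00E9, U+20AC, U+1F600
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def byte_lengths(ch):
    """Return the encoded length in bytes of one character, per encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))

# Byte order only changes the arrangement of the bytes, not their count.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
```

UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 uses 2 bytes until the character falls outside the BMP, and UTF-32 stays at 4 bytes throughout.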
Western European
ASCII
The original 7-bit character encoding standard. Covers 128 characters: English letters, digits, punctuation, and control...
Latin-1 (ISO-8859-1)
Extends ASCII to 256 characters, covering most Western European languages. The first 256 Unicode codepoints map 1:1 to L...
Windows-1252
Microsoft's extension of Latin-1. Assigns printable characters to the C1 control code range (0x80–0x9F), including the e...
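To see where Latin-1 and Windows-1252 diverge, one can decode the same bytes with both codecs. This is a small sketch assuming Python's standard `latin-1` and `cp1252` codecs; the sample bytes are an arbitrary illustration taken from the 0x80–0x9F range.

```python
# Bytes 0x93 and 0x94 sit in the 0x80-0x9F range where the two encodings differ.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("latin-1")  # C1 control characters U+0093 / U+0094
as_cp1252 = raw.decode("cp1252")   # printable curly quotes U+201C / U+201D

print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_cp1252])

# Every Latin-1 byte value decodes to the Unicode code point of the same number.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1")
print(all(ord(c) == b for c, b in zip(round_trip, all_bytes)))
```

Outside the 0x80–0x9F range the two codecs agree, which is why text mislabelled as Latin-1 often looks correct until a curly quote or similar character appears.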
Central European
Cyrillic
ISO-8859-5 (Cyrillic)
ISO standard for Cyrillic script. Covers Russian, Bulgarian, Serbian, Macedonian, and other languages. Largely replaced...
KOI8-R
Russian character encoding widely used in Unix systems and early internet. Designed so that stripping the high bit gives...
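The high-bit property mentioned for KOI8-R can be tried out with a few lines of Python. This is an illustrative sketch only: the helper `strip_high_bit` and the sample word are assumptions made for the example, and it relies on Python's built-in `koi8_r` codec.

```python
def strip_high_bit(data):
    """Clear bit 7 of every byte and read the result as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")


word = "Привет"
koi8 = word.encode("koi8_r")

print(koi8.hex())            # the KOI8-R bytes, all with the high bit set
print(strip_high_bit(koi8))  # what a 7-bit channel would leave behind
```

The result is a Latin-letter approximation of the word with letter case inverted, which stays roughly legible where the same operation on other Cyrillic encodings generally does not.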
East Asian
Shift-JIS
Variable-width encoding for Japanese. Single-byte for ASCII and half-width kana, double-byte for kanji and full-width ch...
EUC-JP
Extended Unix Code for Japanese. Variable-width encoding common in Unix/Linux Japanese environments and older web pages.
GBK
Chinese national standard encoding for Simplified Chinese. Superset of GB2312. Variable-width: single-byte for ASCII, do...
Big5
Traditional Chinese encoding used in Taiwan, Hong Kong, and Macau. Variable-width: single-byte for ASCII, double-byte fo...
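The variable-width behaviour of the East Asian encodings above can be observed by encoding single characters. The sketch below assumes Python's bundled `shift_jis`, `euc_jp`, `gbk`, and `big5` codecs; the sample characters and the helper `widths` are arbitrary illustrations.

```python
# One ASCII letter plus one or two script-specific characters per encoding.
SAMPLES = {
    "shift_jis": ["A", "ｱ", "日"],  # ASCII, half-width kana, kanji
    "euc_jp": ["A", "日"],
    "gbk": ["A", "中"],
    "big5": ["A", "中"],
}


def widths(encoding, chars):
    """Return the encoded byte length of each character in the given encoding."""
    return [len(ch.encode(encoding)) for ch in chars]


for enc, chars in SAMPLES.items():
    print(enc, dict(zip(chars, widths(enc, chars))))
```

In each case the ASCII letter takes one byte and the ideograph takes two, so byte offsets and character offsets diverge as soon as non-ASCII text appears.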
Encoding FAQ
Which encoding should I use?
Use UTF-8 for almost everything. It covers all Unicode characters, is ASCII-compatible, and is the default for HTML5, JSON, XML, and most modern databases and programming languages. The only common reason to choose a different encoding is compatibility with legacy systems.
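When legacy data does need to be brought into a UTF-8 system, the usual pattern is to decode with the known source encoding and re-encode. The following is a minimal sketch; `to_utf8` is a hypothetical helper, and it assumes the source encoding is already known rather than detected.

```python
def to_utf8(data, source_encoding):
    """Decode legacy bytes with a known encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


legacy = "café".encode("cp1252")       # stand-in for bytes read from a legacy file
converted = to_utf8(legacy, "cp1252")

print(legacy)
print(converted)
```

Decoding with the wrong source encoding will either raise `UnicodeDecodeError` or silently produce wrong characters, so the source encoding should come from metadata or documentation where possible.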
What's the difference between a character set and an encoding?
A character set (like Unicode) is the abstract list of characters and their assigned numbers (codepoints). An encoding (like UTF-8) is the concrete rule for turning those numbers into bytes. Unicode is a character set; UTF-8, UTF-16, and UTF-32 are different encodings of Unicode.
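The distinction can be made concrete with one character: its code point is a single number, while its bytes depend on the encoding chosen. A short sketch in Python, with the euro sign as an arbitrary example:

```python
ch = "€"

# The code point is a property of the character set (Unicode).
code_point = ord(ch)
print(f"U+{code_point:04X}")

# The bytes are a property of the encoding applied to that code point.
encoded = {enc: ch.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, data in encoded.items():
    print(enc, data.hex())
```

All three byte sequences decode back to the same code point, U+20AC.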
Are UTF-8 and Unicode the same thing?
No. Unicode is a character standard that assigns a unique number to every character. UTF-8 is one way to encode those numbers as bytes. UTF-8 is by far the most popular encoding of Unicode, which is why they're often conflated, but they're distinct concepts.