UTF-8

Unicode

The dominant encoding for the web. Variable-width (1–4 bytes). Fully backwards-compatible with ASCII. The default encoding for HTML5, JSON, and most modern protocols.

IANA Name: UTF-8

Width: Variable (1–4 bytes)

BOM: EF BB BF

Introduced: 1993

Byte Structure

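UTF-8 uses a variable-length prefix scheme. The high bits of the first byte give the length of the sequence: a leading 0 marks a single-byte (ASCII) character, while 110, 1110, and 11110 mark two-, three-, and four-byte sequences, so for multi-byte characters the number of leading 1-bits equals the total number of bytes. Continuation bytes always start with 10.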

Codepoint range	Byte 1	Byte 2	Byte 3	Byte 4	Usable bits
U+0000–U+007F	0xxxxxxx	—			7
U+0080–U+07FF	110xxxxx	10xxxxxx	—		11
U+0800–U+FFFF	1110xxxx	10xxxxxx	10xxxxxx	—	16
U+10000–U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx	21
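The bit layout in the table can be sketched as a small hand-written encoder. This is an illustration only; the function name `encode_code_point` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table's bit layout."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if cp <= 0x7F:
        # 0xxxxxxx: 7 usable bits
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: 11 usable bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 usable bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 usable bits
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare the hand-written encoder with Python's built-in codec.
for ch in "A£€😀":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```

Each branch corresponds to one row of the table: the lead byte carries the length prefix plus the highest bits, and every following byte carries six more bits behind a 10 prefix.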

When to Use UTF-8

Use UTF-8 as your default encoding for everything: files, databases, APIs, HTML, JSON, and network protocols. In new systems, choose something else only when you must interface with a legacy system that mandates a specific encoding. If you're not sure which encoding to use, the answer is UTF-8.

Sample Characters in UTF-8

The table below shows how a selection of characters are represented in UTF-8. Bytes are shown in hexadecimal and decimal. Every character in the table is supported, because UTF-8 can encode any Unicode character; no replacement or transliteration is needed when converting from Unicode.

Character	Codepoint	Name	Bytes (Hex)	Bytes (Decimal)	Supported
A	U+0041	LATIN CAPITAL LETTER A	41	65	Yes
a	U+0061	LATIN SMALL LETTER A	61	97	Yes
0	U+0030	DIGIT ZERO	30	48	Yes
$	U+0024	DOLLAR SIGN	24	36	Yes
£	U+00A3	POUND SIGN	C2 A3	194 163	Yes
©	U+00A9	COPYRIGHT SIGN	C2 A9	194 169	Yes
€	U+20AC	EURO SIGN	E2 82 AC	226 130 172	Yes
α	U+03B1	GREEK SMALL LETTER ALPHA	CE B1	206 177	Yes
А	U+0410	CYRILLIC CAPITAL LETTER A	D0 90	208 144	Yes
中	U+4E2D	CJK UNIFIED IDEOGRAPH-4E2D	E4 B8 AD	228 184 173	Yes
あ	U+3042	HIRAGANA LETTER A	E3 81 82	227 129 130	Yes
☺	U+263A	WHITE SMILING FACE	E2 98 BA	226 152 186	Yes
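Rows like these can be reproduced with a few lines of Python. The sketch below uses the standard `unicodedata` module for character names; the helper name `describe` is hypothetical, and the tab-separated output format is just chosen to mirror the table.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return one tab-separated row: character, code point, name, hex bytes, decimal bytes."""
    data = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in data)
    dec_bytes = " ".join(str(b) for b in data)
    name = unicodedata.name(ch, "")  # empty string if the character has no name
    return f"{ch}\tU+{ord(ch):04X}\t{name}\t{hex_bytes}\t{dec_bytes}"


for ch in "A£€α":
    print(describe(ch))
```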

Working with UTF-8 in Code

Every major language has built-in support for encoding conversion. The examples below show how to encode a string to UTF-8 bytes and decode it back to a Unicode string. Always specify the encoding explicitly — never rely on system defaults, which vary by OS and locale.

Python

# Encode a string to utf-8 bytes
text = "Hello, 世界"
encoded = text.encode("UTF-8")

# Decode bytes back to a string
decoded = encoded.decode("UTF-8")
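
The advice about never relying on system defaults applies to file I/O as well. A minimal sketch, assuming a throwaway temporary file (the file name `greeting.txt` is hypothetical): pass `encoding="utf-8"` to `open()` and decide explicitly how malformed bytes should be handled.

```python
import os
import tempfile

text = "Hello, 世界"
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Write and read with an explicit encoding instead of the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# Strict decoding (the default) raises on malformed input;
# errors="replace" substitutes U+FFFD for each invalid byte instead.
malformed = b"Hello, \xff"
replaced = malformed.decode("utf-8", errors="replace")
```

Strict decoding is usually the safer default, since silent replacement can hide data corruption.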

PHP

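// PHP strings are byte sequences; a literal in a UTF-8 source file
// is already UTF-8, so no conversion is needed
$text = "Hello, 世界";

// Check that the bytes are well-formed UTF-8
$isValid = mb_check_encoding($text, "UTF-8");

// Convert legacy input (here ISO-8859-1) to UTF-8
$utf8 = mb_convert_encoding(
    "caf\xE9",
    "UTF-8",
    "ISO-8859-1"
);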

JavaScript

// Encode to UTF-8 bytes
const encoder = new TextEncoder(); // UTF-8
const bytes = encoder.encode("Hello, 世界");

// Decode bytes
const decoder = new TextDecoder("UTF-8");
const text = decoder.decode(bytes);

SQL (PostgreSQL)

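-- Create a database with UTF-8
-- (template0 is needed when the encoding or locale differs from template1)
CREATE DATABASE mydb
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8'
  TEMPLATE template0;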

-- Check database encoding
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();

UTF-8 FAQ

Is UTF-8 backward compatible with ASCII?

Yes. Every ASCII character (U+0000–U+007F) is encoded in UTF-8 as a single byte with the same value, so a file containing only ASCII is also valid UTF-8. This backward compatibility is a major reason UTF-8 became the dominant encoding on the web.
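
A quick way to see this, sketched in Python with variable names chosen only for illustration:

```python
ascii_text = "Hello, world!"

# ASCII-only text produces identical bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# All 128 ASCII byte values decode as UTF-8 to the same code points.
all_ascii = bytes(range(128))
decoded = all_ascii.decode("utf-8")
```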

What is the difference between UTF-8 and UTF-16?

UTF-8 uses 1–4 bytes per character and is efficient for ASCII-heavy text. UTF-16 uses 2 or 4 bytes: BMP characters take 2 bytes, and everything else takes a 4-byte surrogate pair. UTF-8 is self-synchronizing, has no byte-order issues, and is the standard for web, file, and network protocols. UTF-16 is used internally by Windows, Java, and JavaScript engines.
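
The size trade-off is easy to measure. The following sketch compares encoded lengths for a few sample strings; the helper name `sizes` is hypothetical, and `utf-16-le` is used so that no BOM is counted.

```python
def sizes(s: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


ascii_sizes = sizes("Hello")  # ASCII: 1 byte each in UTF-8, 2 in UTF-16
cjk_sizes = sizes("世界")      # BMP CJK: 3 bytes each in UTF-8, 2 in UTF-16
emoji_sizes = sizes("😀")      # outside the BMP: 4 bytes in both
```

ASCII-heavy text is half the size in UTF-8, while text dominated by CJK characters is smaller in UTF-16.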

Can UTF-8 encode every Unicode character?

Yes. UTF-8 can encode every Unicode character: all 1,112,064 scalar values in the range U+0000 through U+10FFFF. The only code points excluded are the 2,048 surrogates (U+D800–U+DFFF), which are reserved for UTF-16 and are not characters. Emoji, historic scripts, mathematical symbols, and private-use characters all have valid UTF-8 representations.
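
This can be probed directly with Python's codec. In the sketch below, the helper name `is_utf8_encodable` is hypothetical; note that Python's strict encoder rejects surrogate code points (U+D800–U+DFFF), which are reserved for UTF-16 rather than assigned to characters.

```python
def is_utf8_encodable(cp: int) -> bool:
    """Return True if Python's strict UTF-8 codec can encode this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


emoji_ok = is_utf8_encodable(0x1F600)        # emoji
private_use_ok = is_utf8_encodable(0xE000)   # private-use character
last_ok = is_utf8_encodable(0x10FFFF)        # highest code point
surrogate_ok = is_utf8_encodable(0xD800)     # surrogate: rejected
```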

How do I detect if a file is UTF-8?

UTF-8 files sometimes start with a BOM (EF BB BF), though this is discouraged. More reliably, a parser validates the byte sequences: valid UTF-8 follows a strict pattern of leading and continuation bytes. Tools like file (Unix), chardet (Python), or uchardet guess encodings heuristically. If a file contains non-ASCII bytes and parses as valid UTF-8 with no replacement characters, that is strong evidence the file is UTF-8; a pure-ASCII file is valid UTF-8 too, but also valid in most legacy encodings.
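
The validation approach can be sketched with Python's strict decoder. The function name `inspect_utf8` and the shape of the returned dictionary are assumptions made for illustration; a heuristic detector such as chardet does considerably more than this.

```python
import codecs


def inspect_utf8(data: bytes) -> dict:
    """Report whether bytes start with a UTF-8 BOM and whether they are valid UTF-8."""
    try:
        data.decode("utf-8")  # strict mode: raises on any malformed sequence
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {
        "has_bom": data.startswith(codecs.BOM_UTF8),
        "valid": valid,
        # ASCII-only input is valid UTF-8 but says little about the real encoding.
        "ascii_only": all(b < 0x80 for b in data),
    }


utf8_result = inspect_utf8("café".encode("utf-8"))
latin1_result = inspect_utf8("café".encode("latin-1"))
```

Here the Latin-1 bytes fail validation because 0xE9 is a lead byte that is not followed by the continuation bytes it requires.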
