UTF-8
The dominant encoding for the web. Variable-width (1–4 bytes). Fully backward-compatible with ASCII. The default encoding for HTML5, JSON, and most modern protocols.
Byte Structure
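UTF-8 uses a variable-length prefix scheme. A first byte that starts with 0 is a complete single-byte (ASCII) character. In a multi-byte sequence, the number of leading 1-bits in the first byte tells you how many bytes the character uses. Continuation bytes always start with 10.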
| Codepoint range | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Usable bits |
|---|---|---|---|---|---|
| U+0000–U+007F | 0xxxxxxx | — | — | — | 7 |
| U+0080–U+07FF | 110xxxxx | 10xxxxxx | — | — | 11 |
| U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | — | 16 |
| U+10000–U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 21 |
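The table above can be turned into code fairly directly. The following is a minimal sketch of a hand-rolled encoder and a lead-byte length check; the function names `encode_utf8` and `sequence_length` are illustrative, and in real code you would normally use the built-in `str.encode("utf-8")` instead.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one code point by following the byte-structure table (illustrative only)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if codepoint <= 0x7F:        # 0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a lead byte's prefix bits.

    Simplified: it does not reject the never-valid lead bytes C0, C1, and F5-F7.
    """
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte or invalid lead byte")


# Cross-check the sketch against Python's built-in encoder.
for ch in "A£€😀":
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```

The cross-check at the end covers one character from each row of the table, so each of the four branches is exercised once.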
When to Use UTF-8
Use UTF-8 as your default encoding for everything: files, databases, APIs, HTML, JSON, and network protocols. In a new system, the only reason to choose something else is interfacing with a legacy system that mandates a specific encoding. If you're not sure which encoding to use, the answer is UTF-8.
Sample Characters in UTF-8
The table below shows how a selection of characters are represented in UTF-8. Bytes are shown in hexadecimal and in decimal. Because UTF-8 can encode every Unicode character, every row is marked as supported and nothing needs to be replaced or transliterated when converting from Unicode.
| Character | Codepoint | Name | Bytes (Hex) | Bytes (Decimal) | Supported |
|---|---|---|---|---|---|
| A | U+0041 | LATIN CAPITAL LETTER A | 41 | 65 | Yes |
| a | U+0061 | LATIN SMALL LETTER A | 61 | 97 | Yes |
| 0 | U+0030 | DIGIT ZERO | 30 | 48 | Yes |
| $ | U+0024 | DOLLAR SIGN | 24 | 36 | Yes |
| £ | U+00A3 | POUND SIGN | C2 A3 | 194 163 | Yes |
| © | U+00A9 | COPYRIGHT SIGN | C2 A9 | 194 169 | Yes |
| € | U+20AC | EURO SIGN | E2 82 AC | 226 130 172 | Yes |
| α | U+03B1 | GREEK SMALL LETTER ALPHA | CE B1 | 206 177 | Yes |
| А | U+0410 | CYRILLIC CAPITAL LETTER A | D0 90 | 208 144 | Yes |
| 中 | U+4E2D | CJK UNIFIED IDEOGRAPH-4E2D | E4 B8 AD | 228 184 173 | Yes |
| あ | U+3042 | HIRAGANA LETTER A | E3 81 82 | 227 129 130 | Yes |
| ☺ | U+263A | WHITE SMILING FACE | E2 98 BA | 226 152 186 | Yes |
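Rows like the ones above can be reproduced with the Python standard library. This is a small sketch, assuming Python 3.8 or later (for `bytes.hex` with a separator); the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(char: str) -> tuple:
    """Return (codepoint, name, hex bytes, decimal bytes) for one character."""
    raw = char.encode("utf-8")
    return (
        f"U+{ord(char):04X}",
        unicodedata.name(char),
        raw.hex(" ").upper(),
        " ".join(str(b) for b in raw),
    )


# Print a few rows in the same column order as the table.
for ch in "A£€中":
    print(ch, *describe(ch), sep=" | ")
```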
Working with UTF-8 in Code
Every major language has built-in support for encoding conversion. The examples below show how to encode a string to UTF-8 bytes and decode it back to a Unicode string. Always specify the encoding explicitly — never rely on system defaults, which vary by OS and locale.
# Python: encode a string to UTF-8 bytes
text = "Hello, 世界"
encoded = text.encode("UTF-8")
# Decode bytes back to a string
decoded = encoded.decode("UTF-8")
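// PHP: strings are byte strings, so a literal in a UTF-8 source file is already UTF-8.
// mb_convert_encoding(string, to, from) with UTF-8 on both sides leaves valid input unchanged.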
$bytes = mb_convert_encoding(
"Hello, 世界",
"UTF-8",
"UTF-8"
);
// PHP has no separate decoded string type, so reading the bytes back as text is the same call
$text = mb_convert_encoding(
$bytes,
"UTF-8",
"UTF-8"
);
// JavaScript: encode to UTF-8 bytes (TextEncoder always produces UTF-8)
const encoder = new TextEncoder(); // UTF-8
const bytes = encoder.encode("Hello, 世界");
// Decode bytes
const decoder = new TextDecoder("UTF-8");
const text = decoder.decode(bytes);
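-- PostgreSQL: create a database with UTF-8 encoding
-- (template0 is needed when the encoding or locale differs from the template database)
CREATE DATABASE mydb
ENCODING 'UTF8'
LC_COLLATE 'en_US.UTF-8'
LC_CTYPE 'en_US.UTF-8'
TEMPLATE template0;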
-- Check database encoding
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();
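The advice to always specify the encoding explicitly applies to file I/O as well as in-memory conversion. The sketch below, in Python, writes and reads a temporary file with `encoding="utf-8"` passed to `open`; the file name `greeting.txt` is arbitrary.

```python
import os
import tempfile

text = "Hello, 世界"
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Pass encoding explicitly; without it, open() may fall back to a locale-dependent default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# 7 ASCII bytes for "Hello, " plus 3 bytes for each of the two CJK characters.
size_on_disk = os.path.getsize(path)
```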
UTF-8 FAQ
Is UTF-8 backward compatible with ASCII?
Yes. Every ASCII character (U+0000–U+007F) is encoded in UTF-8 as a single byte with the same value. A file containing only ASCII is simultaneously valid UTF-8. This backward compatibility is the main reason UTF-8 became the dominant encoding on the web.
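A quick way to see this for yourself, sketched in Python: encode the same ASCII-only string both ways and compare the bytes.

```python
ascii_text = "Hello, world!"

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# The two encodings produce identical bytes for ASCII-only text.
same_bytes = ascii_bytes == utf8_bytes

# Bytes written by an ASCII-only program decode cleanly as UTF-8.
decoded = ascii_bytes.decode("utf-8")

# Every byte is below 0x80, i.e. matches the single-byte pattern 0xxxxxxx.
all_single_byte = all(b < 0x80 for b in utf8_bytes)
```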
What is the difference between UTF-8 and UTF-16?
UTF-8 uses 1–4 bytes per character and is efficient for ASCII-heavy text. UTF-16 uses 2 or 4 bytes: BMP characters take 2 bytes, and all other characters take 4 bytes as a surrogate pair. UTF-8 is self-synchronizing, has no byte-order issues, and is the standard for web, file, and network protocols. UTF-16 is used internally by Windows, Java, and JavaScript engines.
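The size trade-off is easy to measure. The following Python sketch compares encoded lengths for a few sample strings and shows the byte-order difference that UTF-16 has and UTF-8 does not; the helper name `encoded_sizes` and the sample strings are chosen only for illustration.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ["Hello", "Привет", "世界", "😀"]:
    utf8_len, utf16_len = encoded_sizes(sample)
    print(f"{sample}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# UTF-16 comes in two byte orders; UTF-8 has only one serialization.
little_endian = "A".encode("utf-16-le")
big_endian = "A".encode("utf-16-be")
```

ASCII text is half the size in UTF-8, Cyrillic comes out the same in both, and CJK text is smaller in UTF-16, which is why the choice mostly matters for storage-heavy, single-script workloads.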
Can UTF-8 encode every Unicode character?
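Yes. UTF-8 can encode every Unicode scalar value: all 1,114,112 code points from U+0000 through U+10FFFF except the 2,048 surrogate code points (U+D800 through U+DFFF), which are reserved for UTF-16 and are not characters. That leaves 1,112,064 encodable values. No Unicode character is left out: emoji, historic scripts, mathematical symbols, and private-use characters all have valid UTF-8 representations.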
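As a rough check in Python, the highest code point and an emoji both encode without trouble. The one caveat is the surrogate range U+D800 through U+DFFF: these code points are not characters, and Python's strict UTF-8 codec refuses them.

```python
# The highest code point, U+10FFFF, takes four bytes.
highest = chr(0x10FFFF).encode("utf-8")

# An emoji outside the BMP also takes four bytes.
emoji = "😀".encode("utf-8")

# Lone surrogates are rejected by the strict UTF-8 codec.
try:
    chr(0xD800).encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# Code points minus surrogates gives the number of encodable scalar values.
scalar_values = 0x110000 - (0xDFFF - 0xD800 + 1)
```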
How do I detect if a file is UTF-8?
UTF-8 files sometimes start with a BOM (EF BB BF), though this is discouraged. More reliably, a parser validates the byte sequences: valid UTF-8 follows a strict pattern of leading and continuation bytes. Tools like file (Unix), chardet (Python), or uchardet guess the encoding heuristically. A strict UTF-8 parse that succeeds on data containing non-ASCII bytes is strong evidence the file is UTF-8. A pure-ASCII file is also valid UTF-8, but it is equally valid in many other encodings.
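The validation approach described above can be sketched in a few lines of Python. The function name `looks_like_utf8` is hypothetical, and the check is only a heuristic: it confirms the bytes are well-formed UTF-8, not that the author intended UTF-8.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8, with or without a leading BOM."""
    # Skip an optional BOM (EF BB BF) before validating.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        # Strict decoding raises on any malformed sequence instead of substituting U+FFFD.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("世界".encode("utf-8")))    # well-formed multi-byte sequences
print(looks_like_utf8("£100".encode("latin-1")))  # 0xA3 alone is a stray continuation byte
```

For large files, reading and validating in chunks with `codecs.getincrementaldecoder("utf-8")` would avoid loading everything into memory; that variation is left out here to keep the sketch short.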