UTF-8: The Encoding That Took Over the Web
UTF-8 is now the dominant character encoding on the internet, used by over 98% of websites. But this dominance wasn't inevitable — it was the result of clever engineering and the practical needs of a global internet. Understanding how UTF-8 works helps every developer write more reliable, portable software.
How UTF-8 Encodes Characters
UTF-8 is a variable-width encoding: each Unicode code point takes between 1 and 4 bytes. The first 128 code points — the original ASCII characters — are each encoded as a single byte identical to their ASCII representation. This backward compatibility with ASCII was one of the key reasons UTF-8 succeeded where other encodings failed.
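A quick Python sketch makes the ASCII compatibility concrete: any ASCII-only string produces byte-for-byte identical output under both encodings.

```python
# ASCII-only text encodes identically under ASCII and UTF-8.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

assert ascii_bytes == utf8_bytes
print(utf8_bytes)  # b'Hello, world!'
```

This is why a tool that only understands ASCII can still pass UTF-8 text through untouched, as long as the text happens to contain only ASCII characters.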
Characters with code points above U+007F use multi-byte sequences. The high bits of the leading byte signal how many continuation bytes follow, and every continuation byte begins with the bit pattern 10. For example, the Euro sign € (U+20AC) encodes to three bytes: 0xE2 0x82 0xAC. You can explore any character's UTF-8 encoding in our character browser.
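You can verify the Euro sign's byte sequence, and the structure of its bytes, in a few lines of Python:

```python
# The Euro sign (U+20AC) encodes to three UTF-8 bytes.
euro = "\u20ac"
encoded = euro.encode("utf-8")
print(encoded.hex(" "))  # e2 82 ac

# Leading byte 0xE2 = 0b11100010: three high 1-bits mark a 3-byte sequence.
assert encoded[0] >> 4 == 0b1110
# Each continuation byte starts with the bits 10.
for b in encoded[1:]:
    assert b >> 6 == 0b10
```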
Why UTF-8 Won
Before UTF-8 became universal, the web was a patchwork of encodings: Latin-1, Windows-1252, Shift-JIS, and dozens of others. Pages declared their encoding in HTTP headers or <meta> tags, and browsers had to guess when declarations were missing. This led to widespread mojibake — garbled text from encoding mismatches.
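Mojibake is easy to reproduce: decode UTF-8 bytes with the wrong codec and each multi-byte sequence splits into stray single-byte characters. A small Python illustration:

```python
# Decoding UTF-8 bytes as Latin-1 turns each multi-byte
# sequence into two or more wrong single-byte characters.
original = "café"
misread = original.encode("utf-8").decode("latin-1")
print(misread)  # cafÃ©
```

The two bytes of "é" (0xC3 0xA9) are reinterpreted as the two Latin-1 characters "Ã" and "©" — exactly the kind of garbling browsers produced when an encoding declaration was missing or wrong.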
UTF-8 solved the globalization problem: a single encoding that can represent every character in Unicode while remaining efficient for the most common use cases. ASCII-only text takes no extra space, and even multi-byte characters are compact compared to fixed-width alternatives like UTF-32.
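The size advantage is easy to measure. The comparison below uses Python's built-in codecs; the sample strings are arbitrary:

```python
# Compare encoded sizes: UTF-8 vs fixed-width UTF-32 (4 bytes per code point).
for s in ["hello", "héllo", "日本語"]:
    utf8_len = len(s.encode("utf-8"))
    utf32_len = len(s.encode("utf-32-be"))
    print(f"{s!r}: UTF-8 = {utf8_len} bytes, UTF-32 = {utf32_len} bytes")
```

ASCII text costs one byte per character in UTF-8 versus four in UTF-32, and even CJK text (three bytes per character here) stays smaller than the fixed-width alternative.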
UTF-8 in Practice
Modern programming languages and databases default to UTF-8. When in doubt, use it. The only common alternative worth considering is UTF-16, which is used internally by Windows, JavaScript, and Java — but even there, UTF-8 is preferred for storage and transmission.
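The storage trade-off between the two is worth seeing in numbers. A short Python comparison (sample strings are arbitrary):

```python
# UTF-8 vs UTF-16: sizes depend on the script being encoded.
ascii_text = "plain ASCII log line"
cjk_text = "混合テキスト"

for s in (ascii_text, cjk_text):
    print(len(s.encode("utf-8")), len(s.encode("utf-16-le")))
```

UTF-16 halves the size of CJK-heavy text (2 bytes vs 3 per character) but doubles ASCII-heavy text, and since markup, source code, and protocols are dominated by ASCII, UTF-8 usually wins for storage and transmission.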
Use our text encoder to see exactly how any string encodes in UTF-8, or browse the full list of supported encodings on this site.