What is Unicode? A Developer's Introduction
Before Unicode, the computing world was fragmented. Different countries and vendors used different character sets, and software that worked in one locale often broke in another. Unicode solved this by providing a single, universal character repertoire that can represent virtually every writing system on Earth.
Code Points and Characters
Unicode assigns a unique number, called a code point, to every character. Code points are written in the form U+XXXX, where XXXX is four to six hexadecimal digits. For example, the Latin capital letter A is U+0041, the snowman is U+2603, and the pile of poo emoji is U+1F4A9.
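To make the notation concrete, here is a minimal Python sketch that converts characters to their code points and back. The helper name describe is hypothetical and exists only for this illustration; ord, chr, and unicodedata.name come from the Python standard library.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character (illustrative helper)."""
    code_point = ord(ch)  # character -> integer code point
    name = unicodedata.name(ch, "<unnamed>")  # official character name, if it has one
    # Pad to at least four hex digits, matching the U+XXXX convention.
    return f"U+{code_point:04X} {name}"


for ch in ["A", "\u2603", "\U0001F4A9"]:
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+2603 SNOWMAN
# U+1F4A9 PILE OF POO

# chr() goes the other way: integer code point -> character.
assert chr(0x0041) == "A"
```

Other languages expose the same mapping under different names, so treat this as one possible way to inspect code points rather than the only one.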
As of version 15.0, Unicode defines 149,186 characters across 161 scripts. You can browse all of them in our character browser, filtered by block, script, or category.
Unicode vs Encoding
A common source of confusion: Unicode is not itself an encoding. It's an abstract catalog of characters and their properties. The actual bytes used to represent these characters in memory or files are determined by an encoding like UTF-8, UTF-16, or UTF-32. Each encoding is a different way to serialize Unicode code points into bytes.
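The difference is easiest to see by encoding the same character several ways. The following is a small Python sketch; the helper name encoded_bytes is hypothetical, and the big-endian codec variants ("utf-16-be", "utf-32-be") are chosen here only so that no byte order mark is prepended to the output.

```python
def encoded_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex (illustrative helper)."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


snowman = "\u2603"  # one code point: U+2603

for encoding in ["utf-8", "utf-16-be", "utf-32-be"]:
    print(f"{encoding:10} {encoded_bytes(snowman, encoding)}")
# utf-8      E2 98 83
# utf-16-be  26 03
# utf-32-be  00 00 26 03
```

The code point is the same in every case; only the byte sequence changes. Decoding bytes with the wrong encoding is what produces garbled text.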
Unicode Planes
The Unicode code space runs from U+0000 to U+10FFFF and is divided into 17 planes, each containing 65,536 code points, for a total of 1,114,112. The most important is the Basic Multilingual Plane (BMP, U+0000–U+FFFF), which covers most characters in common modern use. The remaining 16 planes (called supplementary planes) contain emoji, historic scripts, and specialized symbols. Characters outside the BMP do not fit in a single 16-bit unit, so UTF-16 represents each of them as a surrogate pair: two 16-bit code units taken from the reserved range U+D800–U+DFFF.
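As a rough illustration of the plane layout and of how UTF-16 splits a supplementary character, consider the Python sketch below. The function names plane_of and surrogate_pair are hypothetical; the arithmetic follows the standard UTF-16 surrogate calculation, and the result is checked against Python's built-in "utf-16-be" codec.

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character."""
    # Each plane holds 0x10000 code points, so the plane is the code point divided by 65,536.
    return ord(ch) >> 16


def surrogate_pair(ch: str) -> tuple[int, int]:
    """Compute the UTF-16 high and low surrogates for a character outside the BMP."""
    code_point = ord(ch)
    if code_point < 0x10000:
        raise ValueError("BMP characters are stored as a single UTF-16 code unit")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


poo = "\U0001F4A9"
print(plane_of("A"), plane_of(poo))               # 0 1
high, low = surrogate_pair(poo)
print(f"{high:04X} {low:04X}")                    # D83D DCA9

# Cross-check against the built-in encoder.
assert poo.encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDC, 0xA9])
```

This is why string length can be surprising in languages whose strings are sequences of UTF-16 code units: a single emoji can count as two units.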
Getting Started
The best way to understand Unicode is to explore it directly. Try our text encoder to see how a string maps to bytes in different encodings, or look up a specific character using the character browser. The ASCII table is also a great starting point: the first 128 Unicode code points (U+0000–U+007F) are identical to ASCII, so it is the foundation Unicode was built on.