How Emoji Are Encoded in Unicode
Emoji may seem simple — colorful pictographs you tap on a phone — but their encoding in Unicode is surprisingly complex. Many emoji are single code points; others are sequences of multiple code points joined together, with variation selectors and zero-width joiners playing crucial roles.
Emoji as Single Code Points
The first emoji to enter Unicode were assigned single code points. The 😀 grinning face is U+1F600; the ❤ red heart is U+2764; the ☺ smiling face is U+263A. Most emoji from the Emoticons block (U+1F600–U+1F64F) and Miscellaneous Symbols and Pictographs (U+1F300–U+1F5FF) are single code points. Browse them in our character browser.
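As a quick check of the single-code-point case, the following minimal Python sketch prints the code point and formal Unicode character name for the three examples above. The helper name `describe` is purely illustrative, and the formal names come from whatever Unicode database ships with your Python build; they can differ from the common emoji names used in this article.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a string holding exactly one code point."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Grinning face, heart, and smiling face, written as escapes for clarity.
for ch in ["\U0001F600", "\u2764", "\u263A"]:
    print(describe(ch))

# Each of these is a single code point, so Python's len() reports 1.
print(len("\U0001F600"))  # 1
```

Python strings are indexed by code point, which is why `len()` reports 1 here even for a character above U+FFFF.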
Variation Selectors
Some code points render as text by default but can be made to render as emoji by appending Variation Selector-16 (U+FE0F). For example, ☎ (U+260E) renders by default as a black-and-white telephone symbol. Appending U+FE0F requests the emoji presentation of the same character: ☎️ (U+260E U+FE0F). It does not turn the character into a different code point such as 📞 (U+1F4DE, telephone receiver). Conversely, Variation Selector-15 (U+FE0E) requests text presentation.
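A variation selector is just one more code point placed directly after the base character. The sketch below, in Python, builds both presentations of the telephone symbol. The function name `with_presentation` is hypothetical, and the sketch assumes the base character is one for which Unicode actually defines emoji and text variation sequences; it does not check that.

```python
VS15 = "\uFE0E"  # Variation Selector-15: request text presentation
VS16 = "\uFE0F"  # Variation Selector-16: request emoji presentation

def with_presentation(base: str, emoji: bool) -> str:
    """Append VS16 or VS15 to base, replacing any selector already at the end."""
    stripped = base.rstrip(VS15 + VS16)
    return stripped + (VS16 if emoji else VS15)

telephone = "\u260E"  # BLACK TELEPHONE
emoji_phone = with_presentation(telephone, emoji=True)
text_phone = with_presentation(telephone, emoji=False)

# The base code point is unchanged; only the trailing selector differs.
print([f"U+{ord(c):04X}" for c in emoji_phone])  # ['U+260E', 'U+FE0F']
print([f"U+{ord(c):04X}" for c in text_phone])   # ['U+260E', 'U+FE0E']
```

Whether the two strings actually look different depends on the fonts and renderer in use.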
ZWJ Sequences
The most complex emoji use the Zero Width Joiner (U+200D) to combine multiple emoji into a single rendered glyph. The rainbow flag 🏳️‍🌈 is such a sequence: white flag (U+1F3F3) + VS16 (U+FE0F) + ZWJ (U+200D) + rainbow (U+1F308). Skin tone modifiers (U+1F3FB–U+1F3FF) are a related but separate mechanism: a modifier directly follows a human emoji to change its skin tone, with no ZWJ in between. Both kinds of sequence render as a single emoji only when the operating system and font support them; otherwise the components are typically displayed one after another.
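To make the structure of these sequences concrete, here is a small Python sketch that assembles the rainbow flag from its parts and, for comparison, a waving hand with a medium skin tone modifier. The helper `code_points` and the variable names are illustrative only, and the waving hand (U+1F44B) is an example chosen for this sketch rather than one taken from the text.

```python
ZWJ = "\u200D"   # Zero Width Joiner
VS16 = "\uFE0F"  # Variation Selector-16

def code_points(s: str) -> list[str]:
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

# ZWJ sequence: white flag + VS16 + ZWJ + rainbow.
rainbow_flag = "\U0001F3F3" + VS16 + ZWJ + "\U0001F308"
print(code_points(rainbow_flag))
# ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308']

# Modifier sequence: the skin tone modifier directly follows the base emoji.
waving_hand_medium = "\U0001F44B" + "\U0001F3FD"
print(code_points(waving_hand_medium))
# ['U+1F44B', 'U+1F3FD']

# Splitting on ZWJ recovers the components a renderer falls back to
# when it does not support the combined glyph.
print(code_points(rainbow_flag.split(ZWJ)[1]))  # ['U+1F308']
```

A string that displays as one flag is therefore four code points long, which matters for any code that truncates or reverses text.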
The Encoding Overhead
Because most emoji are in supplementary Unicode planes (above U+FFFF), they require a surrogate pair in UTF-16 and a 4-byte sequence in UTF-8. The 😀 emoji is 4 bytes in UTF-8 (0xF0 0x9F 0x98 0x80) and 4 bytes (two code units, 0xD83D 0xDE00) in UTF-16. This is why JavaScript's string.length, which counts UTF-16 code units rather than characters, returns 2 for a single emoji such as 😀, a common source of bugs in string-handling code. Emoji built from several code points report even larger lengths.
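The sizes quoted above can be reproduced with a short Python sketch. The function name `encoded_sizes` is hypothetical; the UTF-16 code unit count it computes is the number that a UTF-16-based length, such as JavaScript's string.length, would report.

```python
def encoded_sizes(s: str) -> dict[str, int]:
    """Count code points, UTF-8 bytes, and UTF-16 code units in s."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        # utf-16-le omits the byte order mark; each code unit is 2 bytes.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
    }

grinning = "\U0001F600"
print(grinning.encode("utf-8").hex(" "))  # f0 9f 98 80
print(encoded_sizes(grinning))
# {'code_points': 1, 'utf8_bytes': 4, 'utf16_code_units': 2}

# A ZWJ sequence costs more: white flag + VS16 + ZWJ + rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
print(encoded_sizes(rainbow_flag))
# {'code_points': 4, 'utf8_bytes': 14, 'utf16_code_units': 6}
```

None of the three counts matches what a user perceives as one character; for that, code needs grapheme cluster segmentation, which Python's standard library does not provide.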