Articles — Unicode and Character Encoding Guides

Jul 2026

Han Unification in Unicode: A Technical Compromise with Cultural Impact

One of the most controversial decisions in Unicode's history is Han Unification: the assignment of a single code point to characters from Chinese, Japanese, Kor...

16 Jul 2026

Control Characters in Unicode: The Invisible Operators

Control characters are non-printing code points that were originally designed to control hardware devices like printers and terminals. Many have been repurposed...

09 Jul 2026

JavaScript and Unicode: The UCS-2 Legacy Problem

JavaScript's string handling has a quirk that surprises many developers: strings are sequences of UTF-16 code units, not Unicode code points. This is a legacy o...

02 Jul 2026

Python 3 and Unicode: Strings Are Text, Not Bytes

One of the most significant changes in Python 3 was the strict separation of text (str) and bytes (bytes). In Python 2, strings were bytes with ambiguous encodi...

25 Jun 2026

Character Encoding in Databases: MySQL utf8 vs utf8mb4

Choosing the right character encoding for your database is one of those decisions that's easy to get wrong and painful to fix later. The most notorious example...

18 Jun 2026

Percent-Encoding in URLs: What Those %20s Actually Mean

Percent-encoding (also called URL encoding) is the mechanism for representing characters in URLs that are either not allowed by the URL specification or that ha...

11 Jun 2026

The Unicode Private Use Area: Custom Characters for Special Needs

Unicode reserves certain ranges of code points as the Private Use Area (PUA) — regions where organizations and individuals can assign their own characters witho...

04 Jun 2026

Combining Characters: Building Complex Glyphs from Simple Parts

Combining characters are Unicode code points that attach to the preceding base character to modify its appearance. Rather than encoding every possible letter +...

28 May 2026

Bidirectional Text: How Unicode Handles Arabic and Hebrew

Most writing systems read left-to-right, but Arabic, Hebrew, Persian, and several other scripts read right-to-left. Text that mixes both directions — such as an...

21 May 2026

How Emoji Are Encoded in Unicode

Emoji may seem simple — colorful pictographs you tap on a phone — but their encoding in Unicode is surprisingly complex. Many emoji are single code points; othe...

14 May 2026

Zero-Width Characters: Invisible but Surprisingly Powerful

Zero-width characters are Unicode code points that take up no horizontal space when rendered. They're invisible in most contexts yet can have significant effect...

07 May 2026

Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained

Unicode allows some characters to be represented in multiple equivalent ways. The letter é, for instance, can be encoded as a single precomposed character (U+00...