Understanding Mojibake: When Text Goes Wrong


Mojibake (文字化け) is a Japanese term meaning "character transformation" — the garbled text that appears when a file encoded in one character set is decoded using a different one. If you've ever seen strings like “hello world†or ÃÆ', you've encountered mojibake.

Why Mojibake Happens

Every character stored in a file or transmitted over a network is ultimately a sequence of bytes. Those bytes are meaningless without knowing which encoding was used to create them. When software reads those bytes using the wrong encoding, the bytes map to different characters — or to invalid sequences — producing gibberish.
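As a small illustration in Python, the sketch below decodes one and the same byte under three encodings. The byte value 0xE9 and the variable names are chosen only for this example; they are not taken from any particular file.

```python
# One byte, three interpretations. The byte itself never changes.
data = bytes([0xE9])

as_latin1 = data.decode("latin-1")   # Western European reading
as_cp1251 = data.decode("cp1251")    # Cyrillic reading
print(as_latin1)  # é
print(as_cp1251)  # й

# In UTF-8, 0xE9 announces a three-byte sequence, so on its own it is invalid.
try:
    data.decode("utf-8")
    utf8_valid = True
except UnicodeDecodeError:
    utf8_valid = False
print(utf8_valid)  # False
```

Nothing in the byte says which reading is the intended one; that information has to come from outside the data.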

One of the most common forms of mojibake today is UTF-8 text being read as Windows-1252 (or vice versa). The left curly double quote “ (U+201C) encodes to the three bytes 0xE2 0x80 0x9C in UTF-8. When those bytes are interpreted as Windows-1252, each byte becomes a character of its own and the result is “: three characters instead of one.
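The curly quote case can be reproduced in a few lines of Python. This is only a sketch of the mismatch described above; "cp1252" is Python's name for the Windows-1252 codec, and the variable names are illustrative.

```python
# U+201C, the left curly double quote.
original = "\u201c"

# Encoding it as UTF-8 yields three bytes.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 9c

# Decoding those bytes as Windows-1252 turns each byte into one character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)       # “
print(len(garbled))  # 3
```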

Common Mojibake Patterns

Different encoding mismatches produce recognizable patterns. UTF-8 bytes misread as Latin-1 or Windows-1252 produce the characteristic sequences starting with Ã, â€, or Â. You can look up known mojibake patterns in our mojibake reference, which maps corrupted strings back to their likely originals.
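A rough way to flag these patterns in Python is a regular expression that looks for the tell-tale lead characters. The function name looks_like_mojibake and the exact pattern are assumptions made for this sketch. It is a heuristic only: it will miss some garbled strings and could flag legitimate text that happens to contain the same character pairs.

```python
import re

# Hypothetical heuristic, not an exhaustive detector:
#   U+00C3 (Ã) or U+00C2 (Â) followed by a character in U+00A0..U+00BF,
#   or U+00E2 (â) immediately followed by U+20AC (€).
MOJIBAKE_HINT = re.compile("[\u00c3\u00c2][\u00a0-\u00bf]|\u00e2\u20ac")

def looks_like_mojibake(text: str) -> bool:
    """Return True if text contains a typical UTF-8-read-as-Windows-1252 pattern."""
    return MOJIBAKE_HINT.search(text) is not None

print(looks_like_mojibake("caf\u00c3\u00a9"))  # True  (the garbled form of "café")
print(looks_like_mojibake("caf\u00e9"))        # False (a correct "café")
```

Requiring a second character after Ã or Â keeps ordinary uppercase text such as "SÃO PAULO" from being flagged.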

Fixing Mojibake

The fix depends on where in the chain the mismatch occurred. If you still have the raw bytes, re-decode them with the correct encoding. If the garbled text has already been saved again, so that the wrong characters were themselves re-encoded (double encoding), you may need to reverse several transformations in order. Our mojibake decoder can help you identify and reverse common conversion errors. Preventing mojibake is simpler: always declare and respect the encoding at every layer, including files, databases, HTTP headers, and HTML meta tags.
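For the specific case of UTF-8 text that was decoded as Windows-1252, reversing the mistake can be sketched in Python as follows. The helper names undo_one_layer and repair and the max_rounds limit are assumptions for this example, and the approach only covers this one encoding pair.

```python
def undo_one_layer(text: str) -> str:
    """Reverse one UTF-8-read-as-Windows-1252 step.

    Returns the text unchanged when it cannot be reversed this way.
    """
    try:
        # Recover the original bytes, then decode them as intended.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def repair(text: str, max_rounds: int = 3) -> str:
    """Undo repeated (double or triple) encoding, stopping when nothing changes."""
    for _ in range(max_rounds):
        fixed = undo_one_layer(text)
        if fixed == text:
            break
        text = fixed
    return text

print(repair("\u00c3\u00a9"))              # é  (one layer: "Ã©")
print(repair("\u00c3\u0192\u00c2\u00a9"))  # é  (two layers: "ÃƒÂ©")
```

Correct text is normally left alone, because its Windows-1252 bytes rarely form valid UTF-8. There are two caveats. Python's cp1252 codec leaves five byte values undefined, so text garbled by software that maps those bytes to control characters will not round-trip with this sketch. And a string that was genuinely meant to read "Ã©" would be changed, so it is worth reviewing the results before overwriting any data.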
