Mojibake Patterns
Mojibake (文字化け) is garbled text caused by decoding bytes with the wrong character encoding. The table below covers the most common patterns — what the garbled output looks like, which encoding mismatch caused it, and how to fix it.
Common UTF-8 → Latin-1 / Windows-1252 Patterns
The most frequent mojibake scenario: UTF-8 encoded text (especially curly quotes, dashes, and accented letters) mistakenly read as Latin-1 or Windows-1252. Each multi-byte UTF-8 sequence becomes 2–4 garbage characters.
| Garbled Output | Correct Character | Codepoint | UTF-8 Bytes | Cause |
|---|---|---|---|---|
| â€™ | ’ | U+2019 | E2 80 99 | UTF-8 bytes read as Windows-1252 |
| â€œ | “ | U+201C | E2 80 9C | UTF-8 bytes read as Windows-1252 |
| â€\u{009D} | ” | U+201D | E2 80 9D | UTF-8 bytes read as Windows-1252 |
| â€“ | – | U+2013 | E2 80 93 | UTF-8 bytes read as Windows-1252 |
| â€” | — | U+2014 | E2 80 94 | UTF-8 bytes read as Windows-1252 |
| â€¦ | … | U+2026 | E2 80 A6 | UTF-8 bytes read as Windows-1252 |
| Ã© | é | U+00E9 | C3 A9 | UTF-8 bytes read as Latin-1 |
| Ã (followed by a no-break space, U+00A0) | à | U+00E0 | C3 A0 | UTF-8 bytes read as Latin-1 |
| Ã¨ | è | U+00E8 | C3 A8 | UTF-8 bytes read as Latin-1 |
| Ã¼ | ü | U+00FC | C3 BC | UTF-8 bytes read as Latin-1 |
| Ã¶ | ö | U+00F6 | C3 B6 | UTF-8 bytes read as Latin-1 |
| Ã„ | Ä | U+00C4 | C3 84 | UTF-8 bytes read as Windows-1252 |
| Ã± | ñ | U+00F1 | C3 B1 | UTF-8 bytes read as Latin-1 |
| â‚¬ | € | U+20AC | E2 82 AC | UTF-8 bytes read as Windows-1252 |
| ð\u{009F}\u{0098}\u{0082} | 😂 | U+1F602 | F0 9F 98 82 | UTF-8 4-byte sequence read as Latin-1 (emoji) |
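The rows in this table can be reproduced by encoding a character as UTF-8 and decoding the bytes with the wrong codec. The following is a minimal sketch; the helper name `garble` and the sample list are illustrative, not part of any library. Note that Python's `cp1252` codec raises an error for the five bytes Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so U+201D is left out of the sample.

```python
def garble(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


# Sample characters from the table, written as escapes:
# right single quote, left double quote, em dash, e-acute, n-tilde, euro sign.
samples = ["\u2019", "\u201c", "\u2014", "\u00e9", "\u00f1", "\u20ac"]

# Map each correct character to (UTF-8 bytes in hex, garbled form).
table = {
    ch: (ch.encode("utf-8").hex(" ").upper(), garble(ch))
    for ch in samples
}
```

Passing `"latin-1"` as the second argument gives the same result for sequences such as C3 A9, where no byte falls in the 0x80–0x9F range.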
How to Fix Mojibake
Python
When a UTF-8 string was incorrectly decoded as Latin-1:
# Re-encode as Latin-1, then decode as UTF-8
fixed = garbled.encode('latin-1').decode('utf-8')
# Or using ftfy (Fix Text For You)
import ftfy
fixed = ftfy.fix_text(garbled)
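The one-line re-encode above raises an exception when the text was not actually garbled, or when it was decoded as Windows-1252 rather than Latin-1. A slightly more defensive sketch is shown below, assuming only those two wrong codecs are possible; the function name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(garbled: str) -> str:
    """Try to undo UTF-8-read-as-single-byte mojibake.

    Returns the input unchanged if neither candidate codec explains it.
    """
    for wrong_encoding in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return garbled.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the text; try the next one
    return garbled
```

This is a heuristic: text that is already correct and contains non-ASCII characters normally fails the UTF-8 decode step and is returned as-is, but rare strings could still be altered, so it is worth spot-checking results before a bulk rewrite.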
PHP
Latin-1 bytes that should be UTF-8:
// Preferred: mb_convert_encoding (requires the mbstring extension)
$fixed = mb_convert_encoding($garbled, 'UTF-8', 'ISO-8859-1');
// Older PHP only: utf8_encode() is deprecated as of PHP 8.2
$fixed = utf8_encode($garbled);
MySQL
Table stored as Latin-1 but contains UTF-8 bytes:
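-- Step 1: change the column to BLOB so the bytes are kept without conversion
ALTER TABLE t MODIFY col BLOB;
-- Step 2: relabel the same bytes as utf8mb4 text
ALTER TABLE t MODIFY col TEXT CHARACTER SET utf8mb4;
-- Only if the table really holds Latin-1 text: CONVERT TO transcodes the data,
-- which would double-encode UTF-8 bytes stored in a latin1 column
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;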
JavaScript / Node.js
Reading a buffer with the wrong encoding:
const fs = require('node:fs');

// Specify the encoding when reading files
const text = fs.readFileSync('file.txt', 'utf8');

// Or read the raw bytes and decode them manually
const buffer = fs.readFileSync('file.txt');
const utf8Text = new TextDecoder('utf-8').decode(buffer);

// If the bytes are actually Windows-1252:
const cp1252Text = new TextDecoder('windows-1252').decode(buffer);
Other Common Mismatches
| Actual Encoding | Read As | Symptom | Common Context |
|---|---|---|---|
| UTF-8 | Latin-1 / Windows-1252 | Multi-byte sequences become 2–4 Latin characters (Ã©, â€™, etc.) | Web scrapers, legacy PHP apps, MySQL with wrong charset |
| Windows-1252 | UTF-8 | Replacement characters (�) or UnicodeDecodeError in Python | Old Windows files opened in modern editors without charset detection |
| Shift-JIS | UTF-8 | Japanese text shows as replacement characters or garbage | Japanese Windows files, legacy Japanese web pages |
| GB2312 / GBK | UTF-8 | Chinese text replaced with replacement characters or question marks | Simplified Chinese legacy documents, older Chinese websites |
| ISO-8859-5 (Cyrillic) | Windows-1252 | Cyrillic letters replaced with Latin characters and symbols | Russian email, legacy documents from pre-UTF-8 systems |
| UTF-16 LE (with BOM) | UTF-8 | File begins with ÿþ (the FF FE byte order mark; þÿ for big-endian files) or replacement characters, followed by alternating null bytes | Files saved by Notepad (Windows), some Windows APIs |
| UTF-8 with BOM | UTF-8 (BOM ignored) | File starts with three invisible bytes (EF BB BF), causes issues in PHP, CSV, HTTP headers | Files saved by Notepad or Visual Studio with "UTF-8 with BOM" option |
| Latin-1 | UTF-8 | High-byte characters (0x80–0xFF) trigger UnicodeDecodeError or are replaced | Legacy European text files, email attachments |
Diagnosing Encoding Problems
Look at the bytes
Use a hex editor or xxd / hexdump to see the raw bytes. UTF-8 multi-byte sequences start with a lead byte in 0xC2–0xF4, followed by continuation bytes in 0x80–0xBF. If you see a standalone byte in 0x80–0xFF that is not part of such a sequence, the file is likely in a single-byte encoding.
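When no hex editor is at hand, the same inspection can be sketched in Python. The helper names `describe_bytes` and `is_valid_utf8` below are hypothetical; the byte ranges are the ones given above.

```python
def describe_bytes(data: bytes) -> list[str]:
    """Label each byte by the role it could play in a UTF-8 stream."""
    labels = []
    for b in data:
        if b < 0x80:
            role = "ASCII"
        elif b <= 0xBF:
            role = "continuation"
        elif 0xC2 <= b <= 0xF4:
            role = "lead"
        else:
            role = "invalid in UTF-8"  # 0xC0, 0xC1, 0xF5-0xFF
        labels.append(f"{b:02X} {role}")
    return labels


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, `describe_bytes(b"\xc3\xa9")` labels C3 as a lead byte and A9 as a continuation byte, while `is_valid_utf8(b"caf\xe9")` is false because E9 is a lead byte with nothing following it.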
Identify the pattern
The signature Ã© almost always means UTF-8 read as Latin-1.
The signature â€™ means UTF-8 read as Windows-1252.
Lots of ? or � means bytes that don't fit the encoding at all.
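These signatures can be checked for mechanically. The sketch below is a rough heuristic, not a detector: the function name `guess_mismatch` and its return strings are made up for illustration, and the patterns cover only the signatures named above.

```python
import re

# U+00C3 (A-tilde) followed by a character in U+0080-U+00BF: what a two-byte
# UTF-8 sequence starting with C3 looks like when shown as Latin-1.
ACCENT_SIGNATURE = re.compile("\u00c3[\u0080-\u00bf]")
# U+00E2 U+20AC (a-circumflex, euro sign): bytes E2 80 shown as Windows-1252,
# the prefix shared by garbled curly quotes, dashes, and the ellipsis.
PUNCT_SIGNATURE = re.compile("\u00e2\u20ac")


def guess_mismatch(text: str) -> str:
    """Return a rough diagnosis based on well-known mojibake signatures."""
    if PUNCT_SIGNATURE.search(text):
        return "UTF-8 read as Windows-1252"
    if ACCENT_SIGNATURE.search(text):
        return "UTF-8 read as Latin-1"
    if "\ufffd" in text:
        return "bytes invalid for the chosen encoding"
    return "no known signature"
```

A match is only a hint: legitimate text can contain these character pairs, so confirm by looking at the raw bytes.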
Use detection libraries
chardet (Python) and uchardet (C/CLI) use statistical analysis to guess the encoding from byte frequency patterns. They work best on longer texts; short strings may produce wrong guesses.
Use the Decode Bytes tool on this site to try different decodings interactively.
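The same trial-and-error approach can be sketched offline with the standard library. The candidate list and the function name `try_decodings` are assumptions chosen for illustration; extend the list with whatever encodings are plausible for your data.

```python
CANDIDATES = ["utf-8", "cp1252", "latin-1", "shift_jis", "gbk", "utf-16"]


def try_decodings(data: bytes, encodings=CANDIDATES) -> dict:
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # these bytes are not valid in this encoding
    return results
```

A decoding that succeeds is not necessarily the right one (Latin-1 accepts every byte), so the results still need a human to judge which output reads as real text.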
Frequently Asked Questions
What is mojibake?
Mojibake (文字化け) is Japanese for "character transformation." It refers to garbled text produced when bytes are decoded with the wrong character encoding. The raw bytes are correct — only the interpretation is wrong.
What does â€™ mean?
â€™ is the classic mojibake pattern for a right single quotation mark (U+2019). Its UTF-8 encoding is E2 80 99. When those three bytes are read as Windows-1252 instead, E2 becomes â, 80 becomes €, and 99 becomes ™, yielding â€™.
What does Ã© mean?
Ã© is the mojibake pattern for é (U+00E9 LATIN SMALL LETTER E WITH ACUTE). Its UTF-8 encoding is C3 A9. When read as Latin-1, C3 is Ã and A9 is ©, yielding Ã©.
How do I fix mojibake in MySQL?
The common MySQL mojibake fix: ALTER TABLE t MODIFY col BLOB; then ALTER TABLE t MODIFY col TEXT CHARACTER SET utf8mb4; — this re-interprets the bytes without double-converting them. Also ensure your connection charset is utf8mb4 at connection time.
Why does Python throw UnicodeDecodeError?
In Python 3, bytes.decode() defaults to UTF-8, while open() defaults to the locale's preferred encoding unless you pass encoding=. If your file is Latin-1 or Windows-1252 and Python tries to read it as UTF-8, any byte in the range 0x80–0xFF that does not form a valid UTF-8 sequence raises UnicodeDecodeError. Specify the encoding explicitly: open('file.txt', encoding='latin-1'), or use errors='replace' to substitute replacement characters (U+FFFD) for the invalid bytes.
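A small sketch of the failure and both workarounds is shown below. The file name `legacy.txt` and its contents are made-up sample data written to a temporary directory.

```python
import os
import tempfile

# Write a small Windows-1252 file (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
with open(path, "wb") as f:
    # "caf" + e-acute + " " + euro sign + "5" -> bytes 63 61 66 E9 20 80 35
    f.write("caf\u00e9 \u20ac5".encode("cp1252"))

# Reading it as UTF-8 fails on the first byte that is not valid UTF-8.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# Naming the real encoding recovers the text.
with open(path, encoding="cp1252") as f:
    correct = f.read()

# errors="replace" never raises, but each bad byte becomes U+FFFD.
with open(path, encoding="utf-8", errors="replace") as f:
    lossy = f.read()
```

Here `errors="replace"` loses information permanently, so it suits logging or display better than data that will be written back.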
What is the UTF-8 BOM and why does it cause problems?
The UTF-8 BOM is the byte sequence EF BB BF at the start of a file. It was borrowed from UTF-16 as an encoding signature, but UTF-8 does not need a BOM. Some tools (PHP, CSV parsers, HTTP servers) mishandle it: a PHP file starting with a BOM sends those three bytes as output before any code runs, which can break later header() calls. Save files as "UTF-8 without BOM" to avoid this.
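When you cannot control how a file was saved, the BOM can be removed on read. The sketch below uses Python's built-in `utf-8-sig` codec, which strips a leading BOM if one is present; the sample payload and the helper name `strip_bom` are illustrative.

```python
import codecs

# EF BB BF followed by a CSV header line (sample data).
data = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")

with_bom = data.decode("utf-8")         # BOM survives as U+FEFF at index 0
without_bom = data.decode("utf-8-sig")  # BOM is stripped if present


def strip_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM from raw bytes, if there is one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw
```

The `utf-8-sig` codec also decodes files without a BOM, so it is a reasonable default for input that may come from Windows tools.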