Mojibake (文字化け) is Japanese for 'character transformation.' It refers to garbled text that appears when a file or string is decoded with the wrong encoding. The raw bytes are correct, but the program interpreting them is using a different character map, producing nonsense symbols.

Why does UTF-8 text show as mojibake in MySQL?

MySQL's 'utf8' charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold characters above U+FFFF, such as emoji. Use 'utf8mb4' instead. Mojibake in MySQL often occurs when the connection charset does not match the stored charset; run SET NAMES utf8mb4 at connection time to fix it.
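As a rough illustration of the 3-byte limit, the sketch below checks whether a string would fit in the legacy 'utf8' (utf8mb3) charset. The helper name `fits_utf8mb3` is made up for this example; it is not part of MySQL or any driver.

```python
def fits_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most 3 UTF-8 bytes."""
    # Hypothetical helper: characters above U+FFFF need 4 bytes in UTF-8.
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_utf8mb3("café €"))      # True
print(fits_utf8mb3("lol 😂"))       # False
print(len("😂".encode("utf-8")))    # 4
```

A check like this can flag rows that need utf8mb4 before they are written, rather than after the data has been truncated or rejected.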

Mojibake Patterns

Q: What does â€™ mean?

â€™ is the mojibake pattern for a right single quotation mark (’) that was encoded as UTF-8 but read as Windows-1252. The UTF-8 bytes E2 80 99 for U+2019 RIGHT SINGLE QUOTATION MARK, when misread as Windows-1252, produce the three characters â (E2), € (80), and ™ (99). Strict Latin-1 maps 80 and 99 to invisible control characters, but many programs, including web browsers, treat a Latin-1 label as Windows-1252.

Q: How do I fix mojibake?

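The fix depends on the context. In MySQL, when a latin1 column holds UTF-8 bytes: ALTER TABLE t MODIFY col BLOB, then ALTER TABLE t MODIFY col TEXT CHARACTER SET utf8mb4. In Python: text.encode('cp1252').decode('utf-8'), or 'latin-1' in place of 'cp1252' when the text was misread as strict Latin-1. In PHP: mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1') for Latin-1 input; utf8_encode() does the same but is deprecated as of PHP 8.2. The key is to identify which encoding the bytes are actually in (usually UTF-8) and which encoding they were mistakenly read as (often Latin-1 or Windows-1252).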

Mojibake (文字化け) is garbled text caused by decoding bytes with the wrong character encoding. The table below covers the most common patterns — what the garbled output looks like, which encoding mismatch caused it, and how to fix it.

Common UTF-8 → Latin-1 / Windows-1252 Patterns

The most frequent mojibake scenario: UTF-8 encoded text (especially curly quotes, dashes, and accented letters) mistakenly read as Latin-1 or Windows-1252. Each multi-byte UTF-8 sequence becomes 2–4 garbage characters.

Garbled Output	Correct Character	Codepoint	UTF-8 Bytes	Cause
â€™	’	U+2019	E2 80 99	UTF-8 bytes read as Windows-1252
â€œ	“	U+201C	E2 80 9C	UTF-8 bytes read as Windows-1252
â€\u{009D}	”	U+201D	E2 80 9D	UTF-8 bytes read as Windows-1252
â€“	–	U+2013	E2 80 93	UTF-8 bytes read as Windows-1252
â€”	—	U+2014	E2 80 94	UTF-8 bytes read as Windows-1252
â€¦	…	U+2026	E2 80 A6	UTF-8 bytes read as Windows-1252
Ã©	é	U+00E9	C3 A9	UTF-8 bytes read as Latin-1
Ã\u{00A0}	à	U+00E0	C3 A0	UTF-8 bytes read as Latin-1 (the second character is a no-break space)
Ã¨	è	U+00E8	C3 A8	UTF-8 bytes read as Latin-1
Ã¼	ü	U+00FC	C3 BC	UTF-8 bytes read as Latin-1
Ã¶	ö	U+00F6	C3 B6	UTF-8 bytes read as Latin-1
Ã„	Ä	U+00C4	C3 84	UTF-8 bytes read as Windows-1252
Ã±	ñ	U+00F1	C3 B1	UTF-8 bytes read as Latin-1
â‚¬	€	U+20AC	E2 82 AC	UTF-8 bytes read as Windows-1252
ð\u{009F}\u{0098}\u{0082}	😂	U+1F602	F0 9F 98 82	UTF-8 4-byte sequence read as Latin-1 (emoji)
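The garbled column can be reproduced by encoding a character as UTF-8 and decoding the bytes with the wrong codec. The sketch below does that in Python; `garble` is an illustrative name, and the Latin-1 fallback for bytes that Python's cp1252 codec leaves undefined is an assumption chosen to mimic how browsers treat those bytes.

```python
def garble(text: str) -> str:
    """Simulate UTF-8 bytes being misread as Windows-1252."""
    out = []
    for b in text.encode("utf-8"):
        byte = bytes([b])
        try:
            out.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            # Python's cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined;
            # fall back to Latin-1, which maps them to C1 control characters.
            out.append(byte.decode("latin-1"))
    return "".join(out)


print(garble("\u2019"))  # â€™
print(garble("é"))       # Ã©
print(garble("€"))       # â‚¬
```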

How to Fix Mojibake

Python

When a UTF-8 string was incorrectly decoded as Latin-1:

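# Re-encode with the codec that was wrongly applied, then decode as UTF-8
fixed = garbled.encode('latin-1').decode('utf-8')

# If the garbled text contains €, ™, or curly quotes (e.g. â€™), it was read
# as Windows-1252, which Latin-1 cannot encode; use cp1252 instead
fixed = garbled.encode('cp1252').decode('utf-8')

# Or use ftfy (Fix Text For You), a third-party package (pip install ftfy)
import ftfy
fixed = ftfy.fix_text(garbled)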
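When it is not known whether the text was misread as Windows-1252 or strict Latin-1, one option is to try both and keep the input unchanged if neither round trip succeeds. This is a minimal sketch under that assumption; `fix_mojibake` is a hypothetical helper, not a library function.

```python
def fix_mojibake(garbled: str) -> str:
    """Try to undo a UTF-8-read-as-single-byte mix-up; return input on failure."""
    for codec in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return garbled.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return garbled


print(fix_mojibake("â€™"))   # ’
print(fix_mojibake("Ã©"))    # é
print(fix_mojibake("café"))  # café (already correct, left alone)
```

A string that is already correct usually fails the round trip and is returned as-is, but text that legitimately contains a sequence such as Ã© would be altered, so this is best applied only to data known to be damaged.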

PHP

Latin-1 bytes that need to be converted to UTF-8:

// Recommended: mb_convert_encoding (requires the mbstring extension)
$fixed = mb_convert_encoding($garbled, 'UTF-8', 'ISO-8859-1');

// Legacy: utf8_encode() is deprecated as of PHP 8.2
$fixed = utf8_encode($garbled);

MySQL

Table stored as Latin-1 but contains UTF-8 bytes:

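-- Step 1: switch the column to a binary type so the bytes are kept as-is
ALTER TABLE t MODIFY col BLOB;
-- Step 2: relabel the bytes as utf8mb4 (no transcoding happens from BLOB)
ALTER TABLE t MODIFY col TEXT CHARACTER SET utf8mb4;

-- Only if the table holds genuine Latin-1 text (not UTF-8 bytes):
-- CONVERT TO transcodes the data and would double-encode UTF-8 bytes
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;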

JavaScript / Node.js

Reading a buffer with the wrong encoding:

// Specify encoding when reading files
const fs = require('fs');
const text = fs.readFileSync('file.txt', 'utf8');

// Convert a buffer manually
const buffer = fs.readFileSync('file.txt');
const utf8Text = new TextDecoder('utf-8').decode(buffer);

// If the bytes are actually Windows-1252:
const cp1252Text = new TextDecoder('windows-1252').decode(buffer);

Other Common Mismatches

Actual Encoding	Read As	Symptom	Common Context
UTF-8	Latin-1 / Windows-1252	Multi-byte sequences become 2–4 Latin characters (Ã©, â€™, etc.)	Web scrapers, legacy PHP apps, MySQL with wrong charset
Windows-1252	UTF-8	Replacement characters (�) or `UnicodeDecodeError` in Python	Old Windows files opened in modern editors without charset detection
Shift-JIS	UTF-8	Japanese text shows as replacement characters or garbage	Japanese Windows files, legacy Japanese web pages
GB2312 / GBK	UTF-8	Chinese text replaced with replacement characters or question marks	Simplified Chinese legacy documents, older Chinese websites
ISO-8859-5 (Cyrillic)	Windows-1252	Cyrillic letters replaced with Latin characters and symbols	Russian email, legacy documents from pre-UTF-8 systems
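UTF-16 LE (with BOM)	UTF-8 or Latin-1	File begins with `ÿþ` (FF FE, the little-endian BOM) when viewed as Latin-1, or with replacement characters when viewed as UTF-8; ASCII text shows a null byte after every character	Files saved by Notepad (Windows), some Windows APIs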
UTF-8 with BOM	UTF-8 (BOM ignored)	File starts with three invisible bytes (EF BB BF), causes issues in PHP, CSV, HTTP headers	Files saved by Notepad or Visual Studio with "UTF-8 with BOM" option
Latin-1	UTF-8	High-byte characters (0x80–0xFF) trigger `UnicodeDecodeError` or are replaced	Legacy European text files, email attachments
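The BOM rows in the table above can be checked mechanically before any decoding is attempted. The following sketch uses the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` and the choice of returned codec names are assumptions made for this example.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a Python codec name if the data starts with a known BOM."""
    # UTF-32 is checked first because its little-endian BOM (FF FE 00 00)
    # begins with the UTF-16 little-endian BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None  # no BOM: the encoding has to be determined another way


sample = "hi".encode("utf-16")       # Python prepends a BOM for "utf-16"
print(sniff_bom(sample))             # utf-16
print(sample.decode(sniff_bom(sample)))  # hi
print(sniff_bom(b"plain ascii"))     # None
```

The returned names are codecs that consume the BOM while decoding, so the result can be passed straight to `bytes.decode()`.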

Diagnosing Encoding Problems

Look at the bytes

Use a hex editor or xxd / hexdump to see the raw bytes. In UTF-8, a multi-byte sequence starts with a lead byte in 0xC2–0xF4 followed by one to three continuation bytes in 0x80–0xBF. If you see bytes in 0x80–0xFF that do not follow this pattern, the data is likely in a single-byte encoding.
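The byte ranges described above can be applied in a few lines of Python when no hex editor is at hand. This is an illustrative sketch: `classify_byte` and `describe` are hypothetical names, and the labels only reflect the value of each byte, not whether the whole sequence is valid.

```python
def classify_byte(b: int) -> str:
    """Label a byte by the role it could play in UTF-8."""
    if b < 0x80:
        return "ascii"
    if b <= 0xBF:
        return "continuation"
    if 0xC2 <= b <= 0xF4:
        return "lead"
    return "invalid"  # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8


def describe(data: bytes) -> list:
    """Return (hex, label) pairs for each byte."""
    return [(f"{b:02X}", classify_byte(b)) for b in data]


print(describe("é".encode("utf-8")))    # [('C3', 'lead'), ('A9', 'continuation')]
print(describe("é".encode("latin-1")))  # [('E9', 'lead')]
```

In the second case the lead byte has no continuation byte after it, which is the sign of a single-byte encoding mentioned above.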

Identify the pattern

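The signature Ã© means UTF-8 read as Latin-1 or Windows-1252; the two encodings agree on bytes C3 and A9. The signature â€™ means UTF-8 read as Windows-1252 specifically, because only Windows-1252 maps 80 to € and 99 to ™. Lots of ? or � means the bytes do not fit the assumed encoding at all.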

Use detection libraries

chardet (Python) and uchardet (C/CLI) use statistical analysis to guess the encoding from byte frequency patterns. They work best on longer texts — short strings may produce wrong guesses. Use the Decode Bytes tool on this site to try different decodings interactively.
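Trying several decodings side by side can also be done with the standard library alone. The sketch below is one way to do it; the function name `try_decodings` and the default candidate list are assumptions for illustration, and the output still has to be judged by eye.

```python
def try_decodings(data: bytes,
                  candidates=("utf-8", "cp1252", "latin-1", "shift_jis", "gbk")):
    """Decode the same bytes with each candidate codec; None marks a failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


for codec, text in try_decodings("é".encode("utf-8")).items():
    print(f"{codec:10} {text!r}")
```

Latin-1 never fails because every byte maps to some character, so a successful decode is not proof that the guess was right.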

Frequently Asked Questions

What is mojibake?

Mojibake (文字化け) is Japanese for "character transformation." It refers to garbled text produced when bytes are decoded with the wrong character encoding. The raw bytes are correct — only the interpretation is wrong.

What does â€™ mean?

â€™ is the classic mojibake pattern for a right single quotation mark (U+2019). Its UTF-8 encoding is E2 80 99. When those three bytes are read as Windows-1252 instead, E2 becomes â, 80 becomes €, and 99 becomes ™ — yielding â€™.

What does Ã© mean?

Ã© is the mojibake pattern for é (U+00E9 LATIN SMALL LETTER E WITH ACUTE). Its UTF-8 encoding is C3 A9. When read as Latin-1, C3 is Ã and A9 is ©, yielding Ã©.

How do I fix mojibake in MySQL?

The common MySQL mojibake fix: ALTER TABLE t MODIFY col BLOB; then ALTER TABLE t MODIFY col TEXT CHARACTER SET utf8mb4; — this re-interprets the bytes without double-converting them. Also ensure your connection charset is utf8mb4 at connection time.
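The difference between re-interpreting and double-converting can be simulated outside the database. The Python sketch below is only a model of what happens to the bytes, using Latin-1 as a stand-in for MySQL's latin1 charset; it does not talk to MySQL.

```python
original = "é"
stored = original.encode("utf-8")  # b'\xc3\xa9' sitting in a latin1 column

# Transcoding route: the server believes the bytes are Latin-1 characters
# (Ã and ©) and converts each of them to UTF-8, which double-encodes them.
transcoded = stored.decode("latin-1").encode("utf-8")
print(transcoded)                  # b'\xc3\x83\xc2\xa9'
print(transcoded.decode("utf-8"))  # Ã©

# BLOB route: the bytes are left untouched and only relabelled as UTF-8.
relabelled = stored.decode("utf-8")
print(relabelled)                  # é
```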

Why does Python throw UnicodeDecodeError?

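Python raises UnicodeDecodeError when bytes are not valid in the encoding used to decode them. bytes.decode() defaults to UTF-8, while open() in text mode has historically defaulted to the system locale encoding. If your file is Latin-1 or Windows-1252 and Python tries to read it as UTF-8, any byte in the range 0x80–0xFF that does not form a valid UTF-8 sequence will fail validation. Specify the encoding explicitly: open('file.txt', encoding='latin-1') or use errors='replace' to substitute replacement characters.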
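The failure and both remedies can be reproduced with an in-memory byte string instead of a file. This is a small sketch; the sample text is arbitrary.

```python
data = "café".encode("latin-1")  # b'caf\xe9', as found in a Latin-1 file

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xE9 announces a multi-byte sequence, but no continuation byte follows.
    print("failed:", exc.reason)

print(data.decode("latin-1"))                  # café
print(data.decode("utf-8", errors="replace"))  # caf�
```

With errors='replace' the data is readable but the original character is lost, so naming the correct encoding is preferable whenever it is known.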

What is the UTF-8 BOM and why does it cause problems?

The UTF-8 BOM is the byte sequence EF BB BF at the start of a file. It was borrowed from UTF-16 as an encoding signature, but UTF-8 does not need a BOM. Some tools (PHP, CSV parsers, HTTP servers) mishandle it: a PHP file starting with a BOM sends those three bytes as output before any code runs, which breaks later header() calls. Save files as "UTF-8 without BOM" to avoid this.
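In Python, the standard `utf-8-sig` codec drops a leading BOM if one is present, which is a simple way to read such files safely. The sketch below shows the difference on an in-memory sample; the CSV header is made up.

```python
import codecs

data = codecs.BOM_UTF8 + "id,name".encode("utf-8")  # EF BB BF + text

with_bom = data.decode("utf-8")         # BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # BOM is stripped

print(repr(with_bom))     # '\ufeffid,name'
print(repr(without_bom))  # 'id,name'
```

The stray U+FEFF is why the first column of a CSV file sometimes fails to match its expected header name.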
