The Byte Order Mark (BOM): Helper or Headache?

The Byte Order Mark (BOM) is a special Unicode character — U+FEFF — that appears at the very beginning of a text file or stream. Its original purpose was to signal the byte order (endianness) of UTF-16 or UTF-32 encoded text, but it's also sometimes written at the start of UTF-8 files, where it serves as an encoding signature rather than a true byte order indicator.

BOM in UTF-16 and UTF-32

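UTF-16 stores text as 16-bit code units, and the two bytes of each unit can be written in either order: big-endian (most significant byte first) or little-endian (least significant byte first). The BOM resolves this ambiguity. A file starting with 0xFE 0xFF is UTF-16 BE; one starting with 0xFF 0xFE is UTF-16 LE. Similarly, UTF-32 uses 0x00 0x00 0xFE 0xFF (BE) or 0xFF 0xFE 0x00 0x00 (LE). Note that the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM, so a detector should test the four-byte signatures before the two-byte ones.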
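
The byte signatures above can be checked mechanically. The following is a minimal Python sketch, not a complete encoding detector: the function name `sniff_utf16_utf32_bom` and its return shape are assumptions made for illustration, while the BOM constants come from the standard `codecs` module.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00) starts
# with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_utf16_utf32_bom(data: bytes):
    """Hypothetical helper: return (encoding, bom_length), or (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


encoding, skip = sniff_utf16_utf32_bom(b"\xff\xfeh\x00i\x00")
if encoding is not None:
    # Decode everything after the BOM with the detected byte order.
    text = b"\xff\xfeh\x00i\x00"[skip:].decode(encoding)
```

One caveat with this approach: UTF-16 LE text whose first character after the BOM is U+0000 is byte-for-byte indistinguishable from a UTF-32 LE BOM, so the result is a strong hint rather than proof.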

BOM in UTF-8: The Controversy

UTF-8 has no byte order issue: its code units are single bytes, so the byte sequence is the same regardless of the platform's endianness. The UTF-8 BOM (0xEF 0xBB 0xBF) is not required and is discouraged in most contexts. However, some Windows tools (including older versions of Notepad) add a UTF-8 BOM automatically. This can cause problems. In PHP, a BOM that precedes the opening <?php is sent to the client as output, which typically breaks later header or session calls with a "headers already sent" error. Unix tools that do not expect the BOM may display it as visible characters (ï»¿ when the three bytes are read as Latin-1 or Windows-1252, or a question mark).
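
A reader that has to accept UTF-8 input from such tools can tolerate the BOM instead of failing on it. The sketch below uses Python's standard `utf-8-sig` codec, which drops a leading BOM if one is present and otherwise behaves like plain `utf-8`. The helper names `decode_utf8_tolerant` and `strip_utf8_bom` are illustrative assumptions.

```python
import codecs


def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"<?php echo 1;"

# A plain "utf-8" decode keeps the BOM as an invisible U+FEFF character.
plain = with_bom.decode("utf-8")

# Reading the same three bytes as Latin-1 yields the familiar mojibake.
mojibake = codecs.BOM_UTF8.decode("latin-1")
```

Here `plain` begins with U+FEFF, while `decode_utf8_tolerant(with_bom)` begins directly with `<?php`.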

Best Practice

For UTF-8: omit the BOM unless you're specifically targeting Windows tools that require it. For UTF-16 or UTF-32 files that will be exchanged between systems: always include a BOM so readers can detect byte order automatically. Our encodings reference shows the BOM bytes for each encoding that uses one.
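
On the writing side, the same rule of thumb might look like the following Python sketch. The helper name `encode_for_exchange` is hypothetical, and the choice of little-endian for UTF-16 and UTF-32 is an assumption made only so the output is deterministic; either byte order works as long as the matching BOM is written.

```python
import codecs


def encode_for_exchange(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: no BOM for UTF-8, explicit BOM for UTF-16/32."""
    enc = encoding.lower()
    if enc == "utf-8":
        # Plain "utf-8" never writes a BOM ("utf-8-sig" would add one).
        return text.encode("utf-8")
    if enc == "utf-16":
        # "utf-16-le" writes no BOM by itself, so prepend it explicitly.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if enc == "utf-32":
        return codecs.BOM_UTF32_LE + text.encode("utf-32-le")
    raise ValueError(f"unsupported encoding: {encoding!r}")


payload = encode_for_exchange("A", "utf-16")
# The generic "utf-16" decoder reads the BOM to pick the byte order.
round_tripped = payload.decode("utf-16")
```

If a Windows tool does require a UTF-8 signature, encoding with Python's `utf-8-sig` codec is the standard-library way to add one.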
