Unicode and UTF-8 Explained for Developers
April 1, 2026 · 7 min read
Character encoding bugs are among the most frustrating to debug: garbled text, missing characters, mojibake (文字化け), and crashes when processing user input with unexpected characters. Understanding Unicode and UTF-8 from first principles prevents most of these problems.
The Problem Unicode Solves
Before Unicode, every region had its own character encoding: ASCII for English (128 characters), Latin-1 for Western Europe (256 characters), GB2312 for Chinese, Shift-JIS for Japanese, and hundreds of others. A document written in one encoding turned into garbage when opened with another, and a single document could not reliably mix scripts. Unicode provides one universal character set that aims to cover every writing system in use: over 149,000 characters as of Unicode 15.
Code Points
Unicode assigns every character a unique integer called a code point, written as U+XXXX:
U+0041 → A (LATIN CAPITAL LETTER A)
U+00E9 → é (LATIN SMALL LETTER E WITH ACUTE)
U+4E2D → 中 (CJK UNIFIED IDEOGRAPH, "middle")
U+1F600 → 😀 (GRINNING FACE, an emoji in the supplementary planes)
U+0000 → NUL (null character)
The full Unicode range covers code points from U+0000 to U+10FFFF — over 1.1 million possible characters, though most are not yet assigned.
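The mapping between characters and code points can be inspected directly in Python. This is a minimal sketch using only the standard library; the `describe` helper is a hypothetical name introduced here for illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    # ord() returns the code point; unicodedata.name() looks up the official name.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Escapes are used so the sample does not depend on how this file is saved.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(ch, describe(ch))

# chr() goes the other way, from code point to character.
assert chr(0x4E2D) == "\u4e2d"
assert ord("\U0001F600") == 0x1F600
```

Control characters such as U+0000 have no formal name in the database, which is why the sketch passes a default to `unicodedata.name`.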
Encodings: How Code Points Become Bytes
A code point is an abstract number. An encoding defines how that number is stored as bytes on disk or transmitted over a network. Three encodings are most relevant:
UTF-8
Variable-width encoding using 1–4 bytes per code point. ASCII characters (U+0000–U+007F) use exactly one byte and are identical to ASCII — making UTF-8 backward compatible.
| Code point range | Bytes | Example |
|---|---|---|
| U+0000 – U+007F | 1 | A = 0x41 |
| U+0080 – U+07FF | 2 | é = 0xC3 0xA9 |
| U+0800 – U+FFFF | 3 | 中 = 0xE4 0xB8 0xAD |
| U+10000 – U+10FFFF | 4 | 😀 = 0xF0 0x9F 0x98 0x80 |
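The four rows of the table can be turned into a small hand-written encoder. The sketch below is for illustration only (in real code, `str.encode("utf-8")` does this for you); `utf8_encode` is a hypothetical name, and the loop checks the result against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the four ranges in the table."""
    # Surrogates (U+D800-U+DFFF) are reserved for UTF-16 and are not valid in UTF-8.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point <= 0x7F:        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare against the built-in encoder for one character from each row.
for ch in "A\u00e9\u4e2d\U0001F600":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
```

The leading bits of the first byte tell a decoder how many continuation bytes follow, and every continuation byte starts with the bits `10`.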
UTF-16
Uses 2 bytes for most characters (the Basic Multilingual Plane, U+0000–U+FFFF) and 4 bytes for supplementary characters, which are encoded as surrogate pairs. Used internally by JavaScript, Java, and Windows APIs. Because ASCII characters encoded as UTF-16 contain zero bytes (A is 0x41 0x00 in little-endian order), UTF-16 text is not safe to pass to C-style string functions, which treat a zero byte as the end of the string.
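As a rough sketch of how a surrogate pair is derived, the function below splits a supplementary code point into its two 16-bit halves. `to_surrogate_pair` is a hypothetical helper written for this article; the last lines show the zero bytes that appear when plain ASCII is stored as UTF-16.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# The result matches Python's own big-endian UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == bytes.fromhex("D83DDE00")

# ASCII text stored as UTF-16 is full of zero bytes.
ascii_utf16 = "Hi".encode("utf-16-le")
print(ascii_utf16)
```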
UTF-32
Fixed-width: always 4 bytes per code point. Simple but wasteful: a plain ASCII document quadruples in size. Rarely used for files or network traffic.
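The size trade-off between the three encodings is easy to measure. This is a small sketch; `encoded_sizes` is a hypothetical helper, and the `-le` codec names are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in each of the three encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

print(encoded_sizes("hello"))          # ASCII: 5, 10, and 20 bytes
print(encoded_sizes("\u4e2d\u6587"))   # two CJK characters: 6, 4, and 8 bytes
```

UTF-8 wins for ASCII-heavy text, while UTF-16 is more compact for text that is mostly CJK.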
BOM (Byte Order Mark)
UTF-16 and UTF-32 can be written in big-endian or little-endian byte order. A BOM (U+FEFF at the start of the file) signals which order is used. UTF-8 has no byte order, so a UTF-8 BOM is unnecessary and causes problems in many tools (Python, Linux shell scripts) that see unexpected bytes at the start of a file. Avoid UTF-8 BOMs — use BOM-less UTF-8.
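When you cannot control whether incoming files carry a UTF-8 BOM, Python's `utf-8-sig` codec is one way to cope: it strips a leading BOM if present and otherwise behaves like plain UTF-8. A minimal sketch:

```python
import codecs

# Simulate a file that some editor saved with a UTF-8 BOM.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")   # b"\xef\xbb\xbfhello"

plain = data.decode("utf-8")         # BOM survives as an invisible U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed

assert plain == "\ufeffhello"
assert stripped == "hello"

# utf-8-sig is also safe for input that has no BOM.
assert "hello".encode("utf-8").decode("utf-8-sig") == "hello"
```

The same codec name works with `open(..., encoding="utf-8-sig")` when reading. Keep writing with plain `utf-8` so no BOM is emitted.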
Common Bugs and How to Avoid Them
Mojibake
When a UTF-8 file is read as Latin-1, characters like é display as Ã©. In the other direction, Latin-1 bytes read as UTF-8 show up as replacement characters (�) or raise decoding errors. Always declare and enforce encoding at every boundary: database connections, file I/O, HTTP responses, and HTML meta tags.
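To see where the garbled form comes from, the sketch below reproduces the mistake on purpose: it encodes text as UTF-8 and then decodes the bytes with the wrong codec. The variable names are illustrative only.

```python
original = "caf\u00e9"  # "café" with a precomposed é

# Wrong: the two UTF-8 bytes of é (0xC3 0xA9) are read as two Latin-1 characters.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# If no bytes were lost along the way, reversing the two steps repairs the text.
repaired = garbled.encode("latin-1").decode("utf-8")
assert repaired == original

# The opposite mistake fails loudly instead of producing mojibake.
try:
    original.encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode Latin-1 bytes as UTF-8:", exc.reason)
```

The round-trip repair only works while the garbled text is intact. Once it has been saved through a lossy step, the original bytes may be gone. Declaring the encoding at each boundary avoids the problem in the first place: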
<!-- HTML: declare encoding in the first 1024 bytes -->
<meta charset="UTF-8">
# Python: always specify encoding explicitly
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()
# MySQL: set connection charset
SET NAMES utf8mb4;
The MySQL utf8 vs utf8mb4 trap
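MySQL's utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold emoji or any other supplementary-plane character, all of which need 4 bytes in UTF-8. Always use utf8mb4 in MySQL databases. This is one of the most common causes of emoji being rejected with an error on save or, when strict SQL mode is off, of text being truncated at the first emoji.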
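If you are stuck with a legacy 3-byte column for a while, you can at least detect problem input before it reaches the database. The check below is a sketch; `needs_utf8mb4` is a hypothetical helper name, not part of any MySQL driver.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if text contains any character that takes 4 bytes in UTF-8."""
    # Code points above U+FFFF lie outside the Basic Multilingual Plane.
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("caf\u00e9"))        # False: é fits in 2 bytes
print(needs_utf8mb4("hi \U0001F600"))    # True: the emoji needs 4 bytes
```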
String length vs byte length
# Python 3 — len() counts code points, not bytes
len("café") # 4
len("café".encode("utf-8")) # 5 (é = 2 bytes)
# JavaScript: length counts UTF-16 code units
"😀".length // 2 (one surrogate pair = two code units)
[..."😀"].length // 1 (spread counts code points correctly)
Normalisation
The same visible character can be represented in multiple ways in Unicode. The letter é can be a single code point (U+00E9) or two code points (U+0065 e + U+0301 combining acute). These compare as unequal as raw strings. Always normalise before comparing: use NFC (composed form) for most purposes.
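The two spellings of é described above can be compared directly. This is a minimal sketch; `nfc_equal` is a hypothetical helper name, and the strings are written with escapes so that an editor cannot silently normalise them.

```python
import unicodedata

composed = "\u00e9"       # single code point: é
decomposed = "e\u0301"    # e followed by a combining acute accent

# They render the same but differ as raw strings.
assert composed != decomposed
assert (len(composed), len(decomposed)) == (1, 2)

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

assert nfc_equal(composed, decomposed)
```

Normalising once at the point of input (form fields, file imports, API payloads) tends to be simpler than normalising at every comparison.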
# Python
import unicodedata
text = unicodedata.normalize("NFC", text)  # returns a new string, so reassign it
# JavaScript
text = text.normalize("NFC") // returns a new string, so reassign it
Use the Text Tools on io9.me to encode, decode, and inspect text encoding right in your browser.