
Unicode and UTF-8 Explained for Developers

April 1, 2026 · 7 min read

Character encoding bugs are among the most frustrating to debug: garbled text, missing characters, mojibake (文字化け), and crashes when processing user input with unexpected characters. Understanding Unicode and UTF-8 from first principles prevents most of these problems.

The Problem Unicode Solves

Before Unicode, each region and vendor had its own character encoding: ASCII for English (128 characters), Latin-1 for Western Europe (256 characters), GB2312 for Chinese, Shift-JIS for Japanese, and hundreds of others. The same bytes meant different characters under different encodings, so a document written in one encoding was garbled when read as another. Unicode provides a single, universal character set that aims to cover every writing system in use: over 149,000 characters as of Unicode 15.

Code Points

Unicode assigns every character a unique integer called a code point, written as U+XXXX:

U+0041  →  A   (LATIN CAPITAL LETTER A)
U+00E9  →  é   (LATIN SMALL LETTER E WITH ACUTE)
U+4E2D  →  中  (CJK UNIFIED IDEOGRAPH-4E2D, "middle")
U+1F600 →  😀  (GRINNING FACE — an emoji in the supplementary planes)
U+0000  →  NUL (null character)

The full Unicode range covers code points from U+0000 to U+10FFFF. That is 1,114,112 possible code points, most of which are not yet assigned to a character.
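
To make the mapping between characters and code points concrete, here is a minimal Python sketch. The helper name describe is made up for this illustration; ord, chr, and unicodedata.name are standard library calls.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a one-character string (hypothetical helper)."""
    code_point = ord(ch)                       # character -> integer code point
    name = unicodedata.name(ch, "<unnamed>")   # control characters have no name
    return f"U+{code_point:04X} {name}"

print(describe("A"))            # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00e9"))       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(describe("\U0001F600"))   # U+1F600 GRINNING FACE

# chr() goes the other way: integer code point -> character.
assert chr(0x4E2D) == "\u4e2d"
```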

Encodings: How Code Points Become Bytes

A code point is an abstract number. An encoding defines how that number is stored as bytes on disk or transmitted over a network. Three encodings are most relevant:

UTF-8

Variable-width encoding using 1–4 bytes per code point. ASCII characters (U+0000–U+007F) use exactly one byte and are identical to ASCII — making UTF-8 backward compatible.

Code point range	Bytes	Example
U+0000 – U+007F	1	A = 0x41
U+0080 – U+07FF	2	é = 0xC3 0xA9
U+0800 – U+FFFF	3	中 = 0xE4 0xB8 0xAD
U+10000 – U+10FFFF	4	😀 = 0xF0 0x9F 0x98 0x80
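
The byte patterns in the table can be reproduced by hand. The sketch below is for illustration only (real code should call str.encode); utf8_encode is a hypothetical name, and the loop cross-checks it against Python's built-in encoder for the four examples above.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # matches the built-in encoder
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```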

UTF-16

Uses 2 bytes for characters in the Basic Multilingual Plane (U+0000–U+FFFF) and 4 bytes (a surrogate pair) for supplementary characters. Used internally by JavaScript, Java, and Windows APIs. Because UTF-16 encodes every ASCII character with a zero byte (A is 0x41 0x00 in little-endian order), UTF-16 text is not safe to pass to C-style string functions, which treat a zero byte as the end of the string.
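
As a rough sketch of how a surrogate pair is derived, the hypothetical helper below splits a supplementary code point into its high and low surrogates and checks the result against Python's own UTF-16 encoder.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00

# Cross-check against the built-in big-endian UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])

# ASCII text in UTF-16 contains zero bytes, which C string functions read as terminators.
assert "A".encode("utf-16-le") == b"A\x00"
```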

UTF-32

Fixed-width: always 4 bytes per code point. Simple but wasteful: a plain ASCII document is four times larger than in UTF-8. Rarely used for files or network protocols.
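
The size trade-off between the three encodings is easy to measure. This sketch uses a hypothetical helper, encoded_sizes, and the BOM-less little-endian codec names so that no byte order mark inflates the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding (BOM-less little-endian variants)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# 12 ASCII characters: UTF-32 is four times the size of UTF-8.
print(encoded_sizes("Hello, world"))   # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}

# Two CJK characters: here UTF-16 is the smallest of the three.
print(encoded_sizes("\u4e2d\u6587"))   # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```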

BOM (Byte Order Mark)

UTF-16 and UTF-32 can be written in big-endian or little-endian byte order. A BOM (U+FEFF at the start of the file) signals which order is used. UTF-8 has no byte order, so a UTF-8 BOM (the bytes 0xEF 0xBB 0xBF) is unnecessary and causes problems in tools that do not expect it: a shell script's #! line no longer starts at the first byte of the file, and Python's plain utf-8 codec returns the BOM as a U+FEFF character at the start of the text. Avoid UTF-8 BOMs and write BOM-less UTF-8.
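
When you cannot control whether incoming files carry a UTF-8 BOM, Python's utf-8-sig codec is one way to cope. The sketch below shows the difference from the plain utf-8 codec; strip_utf8_bom is a hypothetical helper for working on raw bytes.

```python
import codecs

data = codecs.BOM_UTF8 + "hello".encode("utf-8")   # EF BB BF 68 65 6C 6C 6F

# Plain "utf-8" keeps the BOM as an invisible U+FEFF at the start of the string.
assert data.decode("utf-8") == "\ufeffhello"

# "utf-8-sig" strips a leading BOM if present and is harmless if there is none.
assert data.decode("utf-8-sig") == "hello"
assert b"hello".decode("utf-8-sig") == "hello"

def strip_utf8_bom(raw: bytes) -> bytes:
    """Return `raw` without a leading UTF-8 BOM (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

print(strip_utf8_bom(data))   # b'hello'
```

The same codec name works with open(path, encoding="utf-8-sig") when reading files.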

Common Bugs and How to Avoid Them

Mojibake

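When UTF-8 bytes are decoded as Latin-1, each multi-byte character turns into two or more unrelated characters: é displays as Ã©. In the opposite direction, Latin-1 bytes decoded as UTF-8 are usually invalid sequences, so they show up as the replacement character � or raise a decoding error. Always declare and enforce encoding at every boundary: database connections, file I/O, HTTP responses, and HTML meta tags.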

<!-- HTML: declare encoding in the first 1024 bytes -->
<meta charset="UTF-8">

# Python: always specify encoding explicitly
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()

# MySQL: set connection charset
SET NAMES utf8mb4;
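
Mojibake is easy to reproduce deliberately, which helps when recognising it in the wild. In this sketch, garble and repair are hypothetical helpers; the repair only works when the wrong decode was Latin-1 and no bytes were lost along the way.

```python
def garble(text: str) -> str:
    """Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def repair(text: str) -> str:
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")

original = "caf\u00e9"                      # "café"
broken = garble(original)
assert broken == "caf\u00c3\u00a9"          # "cafÃ©": one character became two
assert repair(broken) == original

# The reverse mistake fails differently: Latin-1 bytes are not valid UTF-8.
latin1_bytes = original.encode("latin-1")   # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```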

The MySQL utf8 vs utf8mb4 trap

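MySQL's utf8 charset (an alias for utf8mb3) only stores UTF-8 sequences of up to 3 bytes, so it cannot store emoji or any other character above U+FFFF, all of which need 4 bytes. Always use utf8mb4 in MySQL databases. This is one of the most common causes of emoji failing to save: depending on the SQL mode, the insert is rejected with an "Incorrect string value" error or the value is truncated at the first 4-byte character.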
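
While a legacy column is still on the 3-byte charset, an application-side check can at least flag input that would not fit. This is a minimal sketch with a hypothetical helper name, needs_utf8mb4; it is not a substitute for migrating the schema to utf8mb4.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any code point is above U+FFFF, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("caf\u00e9"))          # False: é is a 2-byte sequence
print(needs_utf8mb4("\u4e2d\u6587"))       # False: BMP characters are 3 bytes
print(needs_utf8mb4("hi \U0001F600"))      # True: emoji need 4 bytes
```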

String length vs byte length

# Python 3 — len() counts code points, not bytes
len("café")   # 4
len("café".encode("utf-8"))  # 5 (é = 2 bytes)

# JavaScript — length counts UTF-16 code units
"😀".length    // 2 (one surrogate pair = 2 code units)
[..."😀"].length  // 1 (spread counts code points correctly)
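
The three ways of measuring a string can be compared side by side. In this sketch the hypothetical helper lengths reports the code point count (what Python's len returns), the UTF-8 byte count, and the UTF-16 code unit count (what JavaScript's length returns).

```python
def lengths(text: str) -> dict:
    """Measure `text` in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # 2 bytes per code unit
    }

print(lengths("caf\u00e9"))      # {'code_points': 4, 'utf8_bytes': 5, 'utf16_units': 4}
print(lengths("\U0001F600"))     # {'code_points': 1, 'utf8_bytes': 4, 'utf16_units': 2}
```

Which number matters depends on the limit being enforced: a database column sized in bytes needs the byte count, while a character counter shown to users is closer to the code point count.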

Normalisation

The same visible character can be represented in multiple ways in Unicode. The letter é can be a single code point (U+00E9) or two code points (U+0065 e + U+0301 combining acute). These compare as unequal as raw strings. Always normalise before comparing: use NFC (composed form) for most purposes.

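# Python: normalize() returns a new string, so keep the result
import unicodedata
text = unicodedata.normalize("NFC", text)

# JavaScript: normalize() also returns a new string
text = text.normalize("NFC")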
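
The effect of normalisation on comparison can be checked directly. In this sketch, same_text is a hypothetical helper that compares two strings after converting both to NFC.

```python
import unicodedata

composed = "caf\u00e9"        # é as one code point (U+00E9)
decomposed = "cafe\u0301"     # e followed by U+0301 COMBINING ACUTE ACCENT

# They render identically but differ as raw strings.
assert composed != decomposed
assert (len(composed), len(decomposed)) == (4, 5)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))   # True

# NFD goes the other way and splits é into its two code points.
assert unicodedata.normalize("NFD", composed) == decomposed
```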

Use the Text Tools on io9.me to encode, decode, and inspect text encoding right in your browser.