How to Handle Character Encoding Issues During Migration

Introduction

Encoding is the backbone of how text is stored, transmitted, and displayed. When migrating databases, files, or applications, mishandling character encodings can lead to broken characters, data corruption, and sorting anomalies. Transitioning to a modern, Unicode-capable encoding—usually UTF-8 or UTF-16—improves internationalization, emoji support, interoperability across systems, and long-term maintainability. The goal is to avoid issues such as mojibake, double-encoding, and collation mismatches while ensuring your data remains complete, searchable, and consistently rendered.


Prerequisites / Before You Start

  • Backups and snapshots

    • Take full database backups (logical and physical if possible).
    • Snapshot storage volumes or VM images.
    • Export flat files (CSV, JSON, XML) with recorded source encodings.
    • Verify you can restore in a staging environment.
  • Target decisions

    • Choose a target encoding: typically UTF-8 (for MySQL use utf8mb4; avoid legacy utf8).
    • Choose a target collation per language/locale needs (e.g., utf8mb4_0900_ai_ci for MySQL 8, ICU-based collations for PostgreSQL, SQL Server recommended collations with UTF-8).
    • Decide normalization form if relevant (NFC vs NFD). Most ecosystems default to NFC.
  • Inventory and discovery

    • Identify current encodings across:
      • Databases (server/database/table/column collations and charsets).
      • Files (CSV, TXT, XML, JSON feeds).
      • Message queues and APIs (Content-Type, charset).
      • Application layers (.NET, Java, Python, Node.js) and drivers.
      • ETL tools, middleware, and logging pipelines.
    • Document regional settings (OS locale, LC_ALL, LANG, code pages on Windows).
  • Environment and tooling

    • Staging environment mirroring production.
    • Command-line tools: iconv, recode, file, uchardet/chardet, enca, ICU tools (uconv).
    • Database utilities: mysqldump, pg_dump/pg_restore, bcp/sqlcmd, expdp/impdp (Oracle).
    • Monitoring and logging ready to capture anomalies.
  • Stakeholders and communication

    • Freeze window and rollback plan approved.
    • Test datasets including multilingual samples (accented characters, emoji, CJK, RTL scripts).
    • QA scripts for validation (hashes, row counts, sample queries).
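
The multilingual samples and QA scripts above can be sketched as a small Python fixture; the sample strings and helper name here are illustrative, not prescribed by any particular tool:

```python
# Illustrative QA fixture: multilingual sample strings that must survive
# a lossless round trip through the target encoding.
SAMPLES = {
    "accented": "café naïve déjà vu",
    "currency": "€ £ ¥",
    "cjk": "你好，世界",
    "rtl": "مرحبا بالعالم",
    "emoji": "🙂🎉",
}

def round_trips(text, encoding="utf-8"):
    """True if text encodes and decodes losslessly in the given encoding."""
    try:
        return text.encode(encoding, errors="strict").decode(encoding) == text
    except UnicodeError:
        return False
```

Running every sample through `round_trips` before and after the migration window gives an early warning: UTF-8 passes all of them, while legacy code pages fail on emoji and CJK.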

Step-by-Step Migration Guide

1) Establish the target encoding and collation

  • Prefer UTF-8 for most systems. For MySQL, use utf8mb4 to support 4-byte characters (emoji, some CJK).
  • Select a collation that meets search/sort needs:
    • Case-insensitive vs. case-sensitive.
    • Accent-insensitive vs. sensitive.
    • Language-specific collations if required.

Example choices:

  • MySQL 8: utf8mb4 with utf8mb4_0900_ai_ci for general-purpose accent-insensitive operations.
  • PostgreSQL: server_encoding UTF8; use ICU collations for locale-specific sorting.
  • SQL Server 2019+: UTF-8 supported in char/varchar with collations like Latin1_General_100_CI_AS_SC_UTF8.

2) Inventory current encodings and detect mixed data

  • Verify server, database, table, and column-level settings.
  • Check code pages and file encodings. Beware of mixed encodings in a single data source.

Commands and techniques:

  • Linux file detection:
    • file -i data.csv
    • uchardet data.csv
  • MySQL:
    • SHOW VARIABLES LIKE 'character_set%';
    • SHOW VARIABLES LIKE 'collation%';
    • SHOW FULL COLUMNS FROM table_name;
  • PostgreSQL:
    • SHOW SERVER_ENCODING;
    • \l+ in psql for db encodings; \d+ table_name for column info.
  • SQL Server:
    • SELECT name, collation_name FROM sys.databases;
    • SELECT name, collation_name FROM sys.columns c JOIN sys.tables t ON c.object_id = t.object_id WHERE t.name='YourTable';

Create a manifest documenting all encodings discovered.
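
A stdlib-only sketch of such a manifest builder is below. The candidate list and function names are assumptions for illustration; dedicated detectors like uchardet or chardet are more thorough, since single-byte code pages accept almost any byte sequence and order of testing matters.

```python
from pathlib import Path

# Candidate encodings, most restrictive first: a clean ASCII or UTF-8 decode
# is strong evidence, while cp1252 accepts nearly everything and latin-1
# never fails at all.
CANDIDATES = ["ascii", "utf-8", "cp1252", "latin-1"]

def guess_encoding(raw):
    """Return the first candidate encoding that decodes raw without errors."""
    for enc in CANDIDATES:
        try:
            raw.decode(enc, errors="strict")
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

def build_manifest(paths):
    """Map each file path to a best-guess encoding for the migration manifest."""
    return {str(p): guess_encoding(Path(p).read_bytes()) for p in paths}
```

Treat the result as a starting point to confirm against domain knowledge, not as ground truth.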

3) Prepare a safe migration window

  • Quiesce the system or enable a short write-freeze.
  • Stop ETL or ingestion jobs that might add mixed-encoding rows during the cutover.
  • Record baseline metrics: row counts, key checksums, representative text samples.

4) Extract data with the correct source encoding

  • Read the data using its actual encoding; this prevents mojibake at the source.

Examples:

  • MySQL logical dump:
    • mysqldump --default-character-set=latin1 --skip-set-charset yourdb > dump_latin1.sql
    • If you know the data is truly UTF-8: mysqldump --default-character-set=utf8mb4 yourdb > dump_utf8.sql
  • PostgreSQL:
    • pg_dump --encoding=LATIN1 -d yourdb > dump_lat1.sql
  • CSV export with a specified encoding from ETL tools (ensure they declare the charset in metadata).

Note: When in doubt, export raw bytes and transcode explicitly in a controlled step.

5) Transcode and normalize

  • Convert from legacy code pages (e.g., Windows-1252, ISO-8859-1) to UTF-8.
  • Normalize to NFC unless your application requires NFD (e.g., some macOS filesystems).

Examples:

  • iconv:
    • iconv -f WINDOWS-1252 -t UTF-8 input.csv > output_utf8.csv
  • recode:
    • recode cp1252..utf8 input.txt
  • ICU uconv (also can normalize):
    • uconv -f shift_jis -t utf-8 --callback substitute input.txt > output_utf8.txt
    • uconv -x any-nfc input_utf8.txt > output_nfc.txt

Handle invalid sequences carefully:

  • iconv with //TRANSLIT or //IGNORE as a last resort; better to log and fix source data where possible.
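
The "log and fix, don't silently substitute" approach can be sketched in Python; the file layout and function name below are illustrative assumptions:

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.WARNING)

def transcode_file(src, dst, src_enc="cp1252"):
    """Transcode src to UTF-8 line by line; log and skip undecodable lines.

    Returns the number of lines that failed strict decoding, so the
    pipeline can fail fast (or route them to manual remediation) when
    the count is nonzero -- rather than letting //IGNORE drop data.
    """
    bad = 0
    out_lines = []
    for lineno, raw in enumerate(Path(src).read_bytes().splitlines(), start=1):
        try:
            out_lines.append(raw.decode(src_enc, errors="strict"))
        except UnicodeDecodeError as exc:
            bad += 1
            logging.warning("line %d rejected: %s", lineno, exc)
    Path(dst).write_text("\n".join(out_lines), encoding="utf-8")
    return bad
```

A nonzero return value is the signal to stop and inspect the source rather than ship a lossy conversion.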

6) Configure the destination schema

  • Set the database default encoding and collation before import.
  • Ensure column types are appropriate (avoid fixed-size char for variable-length Unicode; prefer varchar/text with required length constraints).
  • Increase column sizes when moving to UTF-8 if byte-length limits existed.

Examples:

  • MySQL:
    • ALTER DATABASE appdb CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci;
    • ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
    • Ensure connection collation: SET NAMES utf8mb4;
  • PostgreSQL:
    • CREATE DATABASE appdb WITH ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8' TEMPLATE template0;
    • For existing DBs, create a new DB with UTF-8 and move data; changing server_encoding in-place isn’t supported.
  • SQL Server:
    • CREATE DATABASE AppDb COLLATE Latin1_General_100_CI_AS_SC;
    • Use UTF-8 collations for varchar where appropriate, e.g., Latin1_General_100_CI_AS_SC_UTF8.

7) Import data with explicit encoding settings

  • Ensure the client/session encoding matches the dump/transcoded files.
  • Use batch imports cautiously; validate a sample before bulk load.

Examples:

  • MySQL:
    • mysql --default-character-set=utf8mb4 appdb < dump_utf8.sql
  • PostgreSQL:
    • psql "dbname=appdb options='-c client_encoding=UTF8'" -f dump_utf8.sql
  • SQL Server (bcp/sqlcmd with UTF-8 files):
    • Use -f format files or -w/-W switches appropriately; confirm collation and NVARCHAR usage if needed.

8) Update application and driver configurations

  • Ensure the application layer reads/writes in UTF-8 consistently.
  • Set driver parameters:
    • JDBC: useUnicode=true; characterEncoding=UTF-8
    • MySQL Connector/J: useUnicode=true&characterEncoding=utf8&connectionCollation=utf8mb4_0900_ai_ci
    • .NET: ensure System.Text.Encoding.UTF8; set connection string properties.
    • Python:
      • open('file.txt', 'r', encoding='utf-8')
      • psycopg2: set client_encoding='UTF8' or rely on connection defaults.
  • Update HTTP headers and meta tags:
    • Content-Type: text/html; charset=UTF-8

9) Validate data integrity and behavior

  • Compare row counts and key aggregates.
  • Hash or checksum representative text columns before/after.
  • Search tests:
    • Accent-insensitive search if expected by users.
    • Case-insensitive behavior validation.
  • Sorting and collation checks on multilingual samples.
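
The hash/checksum comparison can be sketched as a small Python helper; it is order-independent so differing export order between source and target does not cause false alarms (the function name is illustrative):

```python
import hashlib

def column_checksum(values):
    """Order-independent SHA-256 over a text column's values.

    Each value is hashed via its UTF-8 bytes, the digests are sorted, and
    the sorted list is hashed again. Compute this before extraction and
    after import; a mismatch means rows were altered or lost in transit.
    """
    digests = sorted(hashlib.sha256(v.encode("utf-8")).hexdigest() for v in values)
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()
```

Run it over a representative slice of each text-heavy table rather than the full dataset if volume is a concern.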

10) Rollback and contingency

  • If critical issues appear, restore from backups and apply hot fixes to your pipeline.
  • Keep a log of non-conforming data and remediation steps for the next attempt.

Risks, Common Issues, and How to Avoid Them

  • Mojibake and garbled text

    • Cause: Reading bytes in the wrong encoding (e.g., interpreting Windows-1252 as UTF-8).
    • Avoidance: Always specify source and destination encodings; verify with detection tools and controlled samples.
  • Double-encoding

    • Cause: Data already in UTF-8 gets encoded again, turning a character like é into Ã© (and worse with each additional pass).
    • Avoidance: Normalize pipelines; ensure exactly one transcoding pass. Add unit tests around data ingress.
  • Truncated multibyte sequences

    • Cause: Length-limited columns sized in bytes, not characters.
    • Avoidance: Resize columns; use character-length semantics; validate lengths after conversion.
  • Unsupported characters (emoji, CJK extensions)

    • Cause: Using 3-byte UTF-8 variants (MySQL utf8) that cannot store 4-byte code points.
    • Avoidance: Use utf8mb4 and appropriate collations; test with emoji and rare scripts.
  • Collation changes affecting queries

    • Cause: New collation alters sort order and comparisons.
    • Avoidance: Review index and query plans; pick collations aligned with user expectations; explicitly set collation in critical queries.
  • Normalization inconsistencies (NFC vs NFD)

    • Cause: Different platforms store composed/decomposed forms, leading to duplicates or failed matches.
    • Avoidance: Normalize data during migration; standardize application behavior; apply canonicalization in search.
  • Binary vs text confusion

    • Cause: Converting binary columns or base64 fields as if they were text.
    • Avoidance: Whitelist text columns; do not transcode BLOBs; validate MIME types.
  • ETL and logging pipelines silently altering encodings

    • Cause: Default settings in tools or middleware.
    • Avoidance: Audit every hop; set explicit charset in connectors; strip Byte Order Marks (BOM) if required.
  • Driver and session mismatches

    • Cause: Application connects with default encodings that override database settings.
    • Avoidance: Force client/session encodings; verify with SHOW/SELECT session settings.

Post-Migration Checklist

  • Data validation

    • Row counts match across key tables.
    • Sample records with diacritics, RTL text, CJK, and emoji display correctly.
    • No replacement characters (�) in core datasets.
  • Collation and behavior

    • Sorting order and case/accent sensitivity match requirements.
    • LIKE/ILIKE/REGEXP searches function as expected.
  • Application layer

    • HTTP responses include charset=UTF-8.
    • HTML templates use <meta charset="utf-8">.
    • APIs send/accept UTF-8 (Content-Type and Accept-Charset).
    • Log files and analytics pipelines store UTF-8 without corruption.
  • Database and drivers

    • Database default character set and collation set correctly.
    • Connection/session encodings verified.
    • Key indexes and full-text search rebuilt if needed.
  • Files and ETL

    • Scheduled jobs read/write UTF-8 consistently.
    • CSV/XML/JSON exporters declare charset in headers.
    • Any BOM handling standardized (prefer BOM-less UTF-8 for interoperability).
  • Monitoring and alerts

    • Dashboards for encoding errors (e.g., invalid byte sequences).
    • Ticketing procedure for reported display issues with fast turnaround.

Reference Configurations and Commands

Component vs Setting (examples):

  • Operating System
    • Linux locale: export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8
    • Windows PowerShell: [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
  • Web/HTTP
    • Header: Content-Type: application/json; charset=UTF-8
    • HTML: <meta charset="utf-8">
  • Java
    • JVM default (optional): -Dfile.encoding=UTF-8
    • JDBC URL (MySQL): jdbc:mysql://host/db?useUnicode=true&characterEncoding=utf8&connectionCollation=utf8mb4_0900_ai_ci
  • Python
    • File I/O: open('file.csv', 'r', encoding='utf-8', newline='')
    • Requests: requests.post(url, data=json.dumps(obj).encode('utf-8'), headers={'Content-Type': 'application/json; charset=UTF-8'})
  • Node.js
    • fs.readFile('file.txt', 'utf8', cb)
    • Buffer.from(str, 'utf8')
  • MySQL
    • server: character_set_server=utf8mb4; collation_server=utf8mb4_0900_ai_ci
  • PostgreSQL
    • server: SHOW SERVER_ENCODING; CREATE DATABASE … ENCODING 'UTF8'
  • SQL Server
    • Use _UTF8 collations with varchar for UTF-8; NVARCHAR remains UTF-16.

Examples:

Detect encodings and convert files:

Detect (heuristic)

uchardet sample.csv
file -i sample.csv

Convert CP1252 to UTF-8

iconv -f CP1252 -t UTF-8 sample.csv > sample_utf8.csv

Normalize to NFC (ICU)

uconv -x any-nfc sample_utf8.csv > sample_utf8_nfc.csv

MySQL dump and restore with utf8mb4:

Export with known source encoding

mysqldump --default-character-set=latin1 --skip-set-charset legacydb > legacy_latin1.sql

Replace any explicit SET NAMES in dump if needed, then import as utf8mb4

mysql --default-character-set=utf8mb4 newdb < legacy_latin1.sql

Ensure server and connection are utf8mb4

SET NAMES utf8mb4;
SHOW VARIABLES LIKE 'character_set%';

PostgreSQL client encoding and import:

Ensure client communicates in UTF-8

psql "dbname=newdb options='-c client_encoding=UTF8'"

Alternatively in session

SET client_encoding = 'UTF8';

Python re-encode safely:

from pathlib import Path
import unicodedata

data = Path('in_cp1252.txt').read_text(encoding='cp1252', errors='strict')
data_nfc = unicodedata.normalize('NFC', data)
Path('out_utf8.txt').write_text(data_nfc, encoding='utf-8', errors='strict')


Practical Examples by Platform

MySQL to MySQL (latin1 to utf8mb4)

  1. Inventory:
    • SHOW VARIABLES LIKE 'character_set%';
    • SHOW FULL COLUMNS FROM your_table;
  2. Dump:
    • mysqldump --default-character-set=latin1 --skip-set-charset olddb > dump.sql
  3. Create target:
    • CREATE DATABASE newdb CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
  4. Import:
    • mysql --default-character-set=utf8mb4 newdb < dump.sql
  5. Convert tables if needed:
    • ALTER TABLE your_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
  6. App config:
    • JDBC URL with characterEncoding=UTF-8; SET NAMES utf8mb4 at connection.

PostgreSQL LATIN1 to UTF-8 via staging

  1. Create target DB with UTF-8:
    • CREATE DATABASE newdb WITH ENCODING 'UTF8' TEMPLATE template0;
  2. Export with encoding:
    • pg_dump --encoding=LATIN1 olddb > dump_lat1.sql
  3. Import with client UTF-8:
    • psql "dbname=newdb options='-c client_encoding=UTF8'" -f dump_lat1.sql
  4. Validate:
    • \encoding in psql shows UTF8; run sample selects with diacritics.

SQL Server enable UTF-8

  1. Choose UTF-8 collation for varchar columns:
    • ALTER DATABASE CURRENT COLLATE Latin1_General_100_CI_AS_SC_UTF8;
    • Or set per column/table if full switch is risky.
  2. For bulk import, ensure data file is UTF-8 and bcp format supports it.
  3. Validate with strings containing emoji.

Testing Strategy

  • Unit tests at boundaries (ingress/egress):
    • Feed in strings with é, ñ, €, 你好, مرحبا, 🙂 and verify round-trip.
  • Integration tests:
    • API, database writes, and ETL round-trip with mixed languages.
  • Property-based testing:
    • Random Unicode across categories (letters, marks, symbols) to catch unexpected failures.
  • Performance:
    • Rebuild and test indexes; some collations change index sizes and sort performance.
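
The property-based idea above can be sketched without a framework, using only the standard library; a real setup would typically use a tool such as Hypothesis, and the names here are illustrative:

```python
import random
import unicodedata

def random_unicode_string(rng, length=20):
    """Build a random string of assigned, non-surrogate code points."""
    chars = []
    while len(chars) < length:
        cp = rng.randint(0x20, 0x10FFFF)
        if 0xD800 <= cp <= 0xDFFF:            # surrogates cannot be encoded
            continue
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":  # skip unassigned code points
            continue
        chars.append(ch)
    return "".join(chars)

def utf8_round_trips(trials=200, seed=0):
    """Every generated string must survive a UTF-8 encode/decode cycle."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = random_unicode_string(rng)
        if s.encode("utf-8").decode("utf-8") != s:
            return False
    return True
```

The same generator can feed API or database round-trip tests: write the random string, read it back, and compare for equality.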

Troubleshooting Playbook

  • Symptoms: � or question marks

    • Cause: Lossy transcoding, storage driver fallback.
    • Fix: Re-extract from backup; set explicit UTF-8; use errors='strict' to fail fast in scripts.
  • Symptoms: “Invalid byte sequence for encoding UTF8”

    • Cause: Source not actually UTF-8 or includes binary blobs.
    • Fix: Identify row/column via binary search; transcode with iconv; exclude binary columns.
  • Symptoms: Sort order unexpected

    • Cause: Wrong collation or mixed collations across columns.
    • Fix: Standardize collation at DB and column; specify COLLATE in queries where needed.
  • Symptoms: Question marks in SQL Server

    • Cause: Data went through varchar with non-UTF-8 collation or ANSI code page.
    • Fix: Use UTF-8 collation for varchar or NVARCHAR; ensure client sends UTF-8.

Minimal Decision Matrix

  • If you need global language support, emoji, and compatibility: choose UTF-8 (utf8mb4 on MySQL).
  • If Windows/.NET ecosystem and SQL Server NVARCHAR heavy: UTF-16 can be fine, but UTF-8 is increasingly recommended.
  • If you must interoperate with legacy partners: implement strict boundaries where you transcode at ingress/egress, never in the middle of pipelines.

FAQ

How can I reliably detect the source encoding of legacy files?

Automatic detection is heuristic. Use tools like file, uchardet, or enca to get a guess, then confirm with domain knowledge and sample inspection. If you see characters like “Ã©” where you expect “é,” your interpretation is wrong. When possible, obtain documentation from the data provider or look for byte-order marks (BOM) in UTF encodings.
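
Checking for a BOM is one of the few deterministic clues. A minimal sketch using Python's codecs constants (the function name is illustrative; note the UTF-32 signatures must be tested before UTF-16, since UTF-32-LE begins with the UTF-16-LE bytes):

```python
import codecs

# BOM signatures, longest first so UTF-32 is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def bom_encoding(raw):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None
```

Remember that the absence of a BOM proves nothing: BOM-less UTF-8 is common and, for interoperability, often preferred.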

Should I use utf8 or utf8mb4 in MySQL?

Use utf8mb4. The older utf8 in MySQL stores only up to 3 bytes per character and cannot handle some Unicode characters such as emoji. Pair it with a modern collation like utf8mb4_0900_ai_ci for reasonable case/accent-insensitive behavior.

What’s the difference between encoding and collation?

Encoding defines how characters are represented as bytes (e.g., UTF-8, UTF-16, Windows-1252). Collation defines how strings are compared and sorted (case sensitivity, accent handling, locale rules). You can store text in UTF-8 but still choose a collation that sorts according to specific language rules.

How do I prevent double-encoding during migration?

Centralize transcoding in a single, explicit step. Ensure exports specify their source encoding and imports set their session/client encoding. Add tests that check for common double-encoded patterns and fail the pipeline if found. Avoid ad-hoc conversions in multiple layers.
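
One such test can use the classic cp1252/UTF-8 pattern: if re-reading a string's cp1252 bytes as UTF-8 succeeds and changes it, the text was probably double-encoded. This is a heuristic sketch, not a general-purpose repair tool (libraries like ftfy handle far more cases), and the function names are illustrative:

```python
def looks_double_encoded(text):
    """Heuristic: True if text appears to be UTF-8 bytes mis-read as cp1252."""
    try:
        repaired = text.encode("cp1252", errors="strict").decode("utf-8", errors="strict")
    except UnicodeError:
        return False          # re-reading fails, so text is likely fine
    return repaired != text

def repair_double_encoding(text):
    """Undo one round of cp1252 mis-decoding; call only after the check above."""
    return text.encode("cp1252").decode("utf-8")
```

Wiring `looks_double_encoded` into ingestion as a failing assertion catches the mistake at the boundary instead of after it has spread through the pipeline.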

Do I need to normalize Unicode to NFC?

It’s advisable for consistency unless your platform requires otherwise. NFC is widely used and reduces duplicate-looking strings that differ only by composition. Normalize at the point of ingestion or during migration, and ensure search/indexing treat text consistently.
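
The composed-vs-decomposed distinction is easy to demonstrate with Python's unicodedata module:

```python
import unicodedata

composed = "\u00e9"         # é as a single code point (NFC form)
decomposed = "e\u0301"      # 'e' plus combining acute accent (NFD form)

# The two strings render identically but compare unequal until normalized,
# which is exactly how duplicate-looking keys sneak into migrated data.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Normalizing once at ingestion (or during the migration's transcode step) keeps comparisons, joins, and unique constraints behaving consistently afterward.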

About the author

Aaron Longnion

Hey there! I'm Aaron Longnion — an Internet technologist, web software engineer, and ColdFusion expert with more than 24 years of experience. Over the years, I've had the privilege of working with some of the most exciting and fast-growing companies out there, including lynda.com, HomeAway, landsofamerica.com (CoStar Group), and Adobe.com.

I'm a full-stack developer at heart, but what really drives me is designing and building internet architectures that are highly scalable, cost-effective, and fault-tolerant — solutions built to handle rapid growth and stay ahead of the curve.