048 —Migration
Latin1 to utf8mb4 on legacy Joomla: a no-mojibake migration
A Dutch agency we work with inherited a 2011 Joomla site. The articles looked fine until somebody searched for an accented name. Here is the routine we used to fix it without mojibake.
A Dutch agency we work with sent a Loom at 23:41 on a Tuesday. The screen showed a Joomla 3.10 article preview with the headline "Café Plein 12, Nîmes, ouvert le dimanche" rendered as "Café Plein 12, Nîmes, ouvert le dimanche". The database had been carried forward, byte for byte, since the original 2011 install on Joomla 1.7 and MySQL 5.1. Every accented character was now a Latin-1 garbage pair.
The site needed to move to utf8mb4 before a PHP 8.2 upgrade. Doing it wrong twice in production is how you end up with three layers of mojibake and no rollback. Here is the routine we use, step by step, to convert a latin1 Joomla database to utf8mb4 without losing a single character.
Reading the actual bytes
Joomla tables that look like latin1_swedish_ci in phpMyAdmin almost never contain Swedish. On a 2011 install, the column declared latin1 was usually being written by a PHP layer sending UTF-8 bytes over a connection that did not SET NAMES. MySQL stored the raw bytes without converting them. The column is mislabelled, not miscoded.
Before touching anything, dump one row in hex and look at it:
SELECT HEX(title), title
FROM jos_content
WHERE id = 412;
If "café" stores as 63 61 66 C3 A9, the bytes are valid UTF-8 wearing a latin1 label. That is the easy case. If it stores as 63 61 66 C3 83 C2 A9, somebody already ran a half-broken conversion and the data is double-encoded. Fix the double-encoding first by reversing the extra pass, otherwise everything that follows will compound the damage.
The safe export
The whole trick is to read the bytes out in the same charset they were written under, not in the charset MySQL thinks they are in. mysqldump with --default-character-set=latin1 tells the server "do not transcode anything, just hand me the raw bytes". The output file is then UTF-8 byte-for-byte, even though the dump header still claims latin1.
mysqldump \
--default-character-set=latin1 \
--skip-set-charset \
--single-transaction \
--routines --triggers \
-u root -p \
joomla_prod > joomla_latin1.sql
--skip-set-charset drops the SET NAMES latin1 line from the dump header. --single-transaction keeps you out of trouble on InnoDB without locking the tables. The mysqldump manual covers the remaining flags if you need to slice by database.
Rewriting and reimporting
Now rewrite the dump's character set declarations and collations. The file body is already valid UTF-8; only the metadata needs to change.
sed -i.bak \
-e 's/CHARSET=latin1/CHARSET=utf8mb4/g' \
-e 's/COLLATE=latin1_swedish_ci/COLLATE=utf8mb4_unicode_ci/g' \
-e 's/ DEFAULT CHARACTER SET latin1/ DEFAULT CHARACTER SET utf8mb4/g' \
joomla_latin1.sql
mv joomla_latin1.sql joomla_utf8mb4.sql
Create a fresh empty database with the right defaults, then import into it:
CREATE DATABASE joomla_new
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
mysql \
--default-character-set=utf8mb4 \
-u root -p \
joomla_new < joomla_utf8mb4.sql
Query a known accented row in the new database. It should read clean.
The byte-for-byte logic
You are not asking MySQL to convert anything. You read raw bytes that happened to be valid UTF-8, declared them as such on the way back in, and the labels finally match the contents. Mathias Bynens' write-up remains the clearest reference on why utf8mb4 (not utf8) is the only correct target on MySQL.
The Joomla index length trap
The reimport will fail on at least one table with ERROR 1071: Specified key was too long; max key length is 767 bytes. utf8mb4 is four bytes per character, so a VARCHAR(255) index hits 1020 bytes. The fix depends on your MySQL version.
On MySQL 5.7+ and MariaDB 10.2+ with InnoDB, the limit becomes 3072 bytes if three conditions hold:
innodb_file_per_table = ONinnodb_file_format = Barracuda(5.7 only; 8.0 removed the toggle)- The table is created with
ROW_FORMAT=DYNAMICorCOMPRESSED
The dump from a 2011 Joomla install will be full of ROW_FORMAT=COMPACT or no row format at all. Patch the dump before importing:
sed -i \
-e 's/ENGINE=InnoDB/ENGINE=InnoDB ROW_FORMAT=DYNAMIC/g' \
joomla_utf8mb4.sql
If a stray ALTER still trips on a key, the Joomla-specific offenders are usually #__session.session_id and #__update_sites.location. Both can be shortened to VARCHAR(191) without breaking anything Joomla does at runtime.
Configuration and verification
Point Joomla at the new database in configuration.php:
public $dbtype = 'mysqli';
public $host = 'localhost';
public $user = 'joomla';
public $password = '...';
public $db = 'joomla_new';
public $dbprefix = 'jos_';
Joomla itself did not set a connection charset before 3.5; on older sites, either add an explicit mysqli_set_charset($conn, 'utf8mb4') call in libraries/joomla/database/driver/mysqli.php or upgrade. Modern Joomla does it automatically when the server supports it.
Spot-check three things after the cutover:
- An article with an accent, a curly quote, and an emoji round-trips through the editor and saves identically.
SHOW CREATE TABLE jos_content;showsCHARSET=utf8mb4on every table you migrated.- A row-count comparison between old and new databases matches exactly.
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'joomla_new'
AND table_collation NOT LIKE 'utf8mb4%';
If that query returns rows, those tables were missed by the sed pass. Convert them in place with ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; and re-run the check.
What this saves you on the next site
The most common way this work goes sideways is doing the conversion live, one ALTER TABLE at a time, on the production database. MySQL happily transcodes already-UTF-8 bytes a second time, which is how you end up reading "café" where you wrote "café". The export, rewrite, reimport routine above does the conversion on a copy and lets you verify before cutting over.
When we built Pier we hit this exact problem on a Magento 1.9 audit and a Drupal 6 cleanup in the same week. The way we ended up handling it inside the app was to surface the actual stored bytes alongside the rendered text in the MySQL editor, so you can spot a mislabelled column before writing a single ALTER, and roll back through version history if a conversion goes wrong.
If you have a legacy site sitting on latin1, the smallest useful thing to do today is run that HEX() query on one accented row. The byte pattern will tell you, in thirty seconds, whether you are looking at the easy case or the double-encoded one.
— Questions —
Why utf8mb4 and not utf8?
MySQL's 'utf8' is a three-byte subset that cannot store emoji or some CJK characters. utf8mb4 is real four-byte UTF-8 and is the only correct target on MySQL.
Do I need to take the site offline?
Only for the cutover. The export, rewrite and reimport happen on a copy. When the new database verifies clean, switch configuration.php and flush Joomla's cache.
What if HEX() shows the double-encoded pattern?
Reverse one pass first. Run a transcoded export, then convert latin1 to utf8 once on the offending column before applying the routine. Test on one column first.
Will this work on MariaDB?
Yes. MariaDB 10.2+ defaults to ROW_FORMAT=DYNAMIC and supports the 3072-byte key limit out of the box, so the ROW_FORMAT patch step is usually a no-op there.