016 —Operations
Backups that aren't really backups: a seven-point audit
An agency lead sent a Loom at 23:41: the restore file was 11 MB, the database is 4.7 GB. Here's how that happens, and how to make sure it doesn't happen to you.
A Dutch agency we work with sent a Loom at 23:41 on a Tuesday. Three minutes long, narration cut off after the first sentence: "We just tried to restore last week's backup and the SQL file is 11 megabytes." Production is a Magento 1.9 store. The database is 4.7 GB. The nightly backup had been running for two years, green check every morning, and nobody had ever opened the file.
This is the audit nobody runs until it's the only thing that matters. The seven failures below are the ones we see on legacy-site engagements often enough that we now check for all of them on intake.
Where the lie usually lives
A backup is two things stapled together: a file, and a tested path back to running production. Drop either side and what you have is hopeful storage. Agencies fool themselves about backups for the same reason they fool themselves about test coverage: green dots feel like a contract. They aren't.
1. The hosting-panel snapshot on the same disk
cPanel's "Full Backup" wizard, Plesk's on-server archive, DirectAdmin's daily tarball. They all write to the same volume the site is served from. When the disk fails, the archive fails with it. When the account is suspended for an unpaid invoice, the archive is suspended too. When the host's RAID controller eats itself, there is no Plan B.
The 3-2-1 rule, written up in CISA's data backup guidance, predates the cloud and still holds: three copies, two media, one off-site. A file at /home/USER/backups/ is one copy.
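A quick way to catch the same-disk "backup" is to compare the device behind the docroot with the device behind the backup directory. This is a minimal sketch, assuming bash and GNU coreutils stat; the function name same_device and the example paths are illustrative, not part of any panel's tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when both paths live on the same filesystem
# device, i.e. the "backup" would fail together with the disk it protects.
# Exits 2 when either path cannot be inspected.
same_device() {
  local a b
  a=$(stat -c '%d' "$1") || return 2   # %d = device number (GNU stat)
  b=$(stat -c '%d' "$2") || return 2
  [ "$a" = "$b" ]
}

# Example (placeholder paths):
# same_device /home/USER/public_html /home/USER/backups \
#   && echo "same disk: this is not a second copy"
```

Different device numbers are necessary, not sufficient: two partitions on one physical drive, or two volumes in one account, still share a blast radius.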
2. The dump that ran but didn't finish
A mysqldump job reports success on a truncated dump more often than people think. If max_allowed_packet on the server is 16M and a serialized blob in wp_options is 22M, the dump terminates early, the file looks normal at a glance, and the restore later dies on a single row no one can find. mysqldump does complain on stderr, but when it is piped into gzip without pipefail, the shell reports gzip's exit status, which is 0. The cron sends its success email regardless.
set -o pipefail  # bash: a mysqldump failure now fails the whole pipeline, not just its own side
mysqldump --single-transaction --quick --max-allowed-packet=512M \
  --routines --triggers --events \
  -h db.example.com -u backup_ro -p"$BACKUP_PWD" \
  example_db | gzip > /backups/example_db-$(date +%F).sql.gz
The flags that matter, all documented on the official mysqldump page: --single-transaction to avoid mid-write tearing on InnoDB, --quick to stream row by row instead of buffering in RAM, --max-allowed-packet set bigger than the largest row you might ever have, and --routines and --events to capture the stored routines and scheduled events mysqldump otherwise drops without comment. --triggers is already on by default; spelling it out keeps the intent visible.
Always check the tail of the file before you trust it:
gunzip -c /backups/example_db-2026-05-15.sql.gz | tail -1
# -- Dump completed on 2026-05-15 2:14:38
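The same check is easy to script so it runs after every dump instead of when someone remembers. A minimal sketch, assuming bash and a dump written with mysqldump's comments left on (with --skip-comments there is no footer to find); verify_dump is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 only if the gzip stream is intact AND the
# last line of the SQL is mysqldump's completion footer.
verify_dump() {
  local file=$1
  # gzip -t catches archives that were cut off mid-write
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  # the footer is only written when mysqldump reached the end of the dump
  if gunzip -c "$file" | tail -n 1 | grep -q '^-- Dump completed'; then
    return 0
  fi
  echo "missing footer: $file" >&2
  return 1
}

# Example:
# verify_dump /backups/example_db-2026-05-15.sql.gz || exit 1
```

Checking both matters: a truncated gzip fails the first test, while a dump that died on the SQL side but compressed cleanly only fails the second.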
3. The files no one is backing up
The database is glamorous. The 38 GB of wp-content/uploads is not. We see this monthly: the SQL backup is fine, the file backup hasn't run in seven months because someone wrote an rsync exclusion to skip *.zip and then someone else renamed the daily archive to backup.zip.
Walk the actual paths on disk and compare them to what's in the off-site archive:
ssh prod 'du -sh /var/www/html/wp-content/uploads \
/var/www/html/wp-content/plugins \
/etc/letsencrypt /etc/nginx'
If those numbers are not within a few percent of what your latest archive contains, the backup you have is not the site you run.
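"Within a few percent" is easier to enforce when it is a number in a script. A minimal sketch, assuming bash; within_pct is a hypothetical helper, and the commented usage assumes GNU du and GNU tar, whose verbose listing puts the size in the third column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when two byte counts differ by at most PCT
# percent of the larger one. Integer arithmetic only.
within_pct() {
  local a=$1 b=$2 pct=$3 hi lo
  if [ "$a" -ge "$b" ]; then hi=$a; lo=$b; else hi=$b; lo=$a; fi
  [ $(( (hi - lo) * 100 )) -le $(( hi * pct )) ]
}

# Example (assumed paths and archive name):
# live=$(du -sb /var/www/html/wp-content/uploads | cut -f1)
# arch=$(tar -tvzf uploads.tar.gz | awk '{s += $3} END {print s + 0}')
# within_pct "$live" "$arch" 5 || echo "archive and disk disagree"
```

The two measurements are not byte-identical by construction (du counts directory entries, tar counts file payloads), which is why the comparison is a tolerance and not an equality.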
4. The off-site that isn't
Hetzner production server, Hetzner Storage Box, both in FSN1. Looks off-site on the diagram. Isn't. When Hetzner has a regional incident, the site and its "backup" go dark together. Same blast radius is not redundancy.
The cheap fix is a second provider for the cold copy, not a different region of the same one. Backblaze B2 alongside a Hetzner production box; rsync.net alongside AWS; Wasabi alongside Vercel object storage. Anything that arrives as a separate invoice from a separate company.
5. The restore that's never been tried
A backup that has never been restored to a working URL is folklore. We've watched five-figure agency contracts go sideways over this. The script ran, the file existed, no one ever poured it back into a fresh database and checked that the homepage rendered.
Once a quarter, take last night's archive and restore it into a throwaway container:
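docker run -d --name restore-test \
  -e MYSQL_ROOT_PASSWORD=test mysql:8
# First start takes a while: wait until the server answers over TCP before loading anything.
until docker exec restore-test mysqladmin -h127.0.0.1 -uroot -ptest ping --silent; do sleep 2; done
# Create the scratch database, then stream the dump into it.
docker exec restore-test mysql -uroot -ptest -e "CREATE DATABASE r"
gunzip -c example_db-2026-05-15.sql.gz | \
  docker exec -i restore-test mysql -uroot -ptest r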
If it errors, you have until the next real incident to find out why.
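A restore that exits cleanly can still be short a few tables, so it helps to compare what the dump says it contains with what landed in the container. A minimal sketch, assuming bash and a standard mysqldump file where each table starts with a CREATE TABLE line; count_tables_in_dump is a hypothetical helper, and the commented query assumes the scratch database is named r as in the drill above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of CREATE TABLE statements
# found in a gzipped mysqldump file.
count_tables_in_dump() {
  gunzip -c "$1" | grep -c '^CREATE TABLE'
}

# Example: compare against the restored database.
# expected=$(count_tables_in_dump example_db-2026-05-15.sql.gz)
# restored=$(docker exec restore-test mysql -uroot -ptest -N -e \
#   "SELECT COUNT(*) FROM information_schema.tables
#    WHERE table_schema='r' AND table_type='BASE TABLE'")
# [ "$expected" = "$restored" ] || echo "table count mismatch"
```

A matching count is a floor, not proof; the homepage rendering against the restored data is still the real test.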
6. The secrets missing from the archive
WordPress salts, Drupal's settings.php, .env files, Magento's app/etc/env.php with the crypt key. When these are excluded by .gitignore and also excluded from the backup rsync, restoring the database and uploads buys you a half-broken site that can't decrypt its own user data.
For Magento, losing env.php means losing the crypt key, which means saved customer payment tokens become unreadable garbage. For Drupal, losing hash_salt in settings.php invalidates every logged-in session and breaks one-time login links. Treat per-site config as a first-class backup artifact, encrypted at rest, with a documented recovery path. Pin it next to the password manager, not on the same disk it came from.
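One way to keep this honest is to list the config files a site cannot live without and check each archive for them. A minimal sketch, assuming bash, a gzipped tarball, and paths stored relative to the site root without a leading ./; missing_from_archive and the example archive name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every expected path that is NOT in the tarball.
# Exits 0 when nothing is missing, 1 when something is, 2 if tar cannot
# read the archive.
missing_from_archive() {
  local archive=$1 listing path missing=0
  shift
  listing=$(tar -tzf "$archive") || return 2
  for path in "$@"; do
    # -x whole-line match, -F literal string (no regex surprises in paths)
    if ! grep -qxF -- "$path" <<<"$listing"; then
      echo "$path"
      missing=1
    fi
  done
  return "$missing"
}

# Example (assumed archive name):
# missing_from_archive site-config.tar.gz app/etc/env.php wp-config.php .env
```

The list of required paths is per-platform and belongs in the runbook next to the recovery steps.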
7. The cron that's been failing for nine months
Cron jobs that depend on outbound DNS, a remote mount, or an API key tend to fail quietly the first time the surrounding infrastructure moves. The job is still in the crontab; the mtime on the destination is just always last August.
find /backups -type f -name '*.sql.gz' \
-printf '%T+ %p\n' | sort | tail -5
If the newest file in /backups is from August and today is May, your cron has been writing to a stale path or failing without notification for nine months. Wire a heartbeat that proves success, not just exit code 0:
# crontab entries must stay on one line; cron does not honor backslash continuations
0 3 * * * /usr/local/bin/backup.sh && curl -fsS -m 10 https://hc-ping.com/your-uuid || curl -fsS -m 10 https://hc-ping.com/your-uuid/fail
The absence of a heartbeat is the signal you actually want. Anything that pings home reliably will do; the point is that silence becomes an alert.
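The heartbeat only proves success if the backup script itself refuses to exit 0 on a stale or undersized file. A minimal sketch of a final gate for such a script, assuming bash and GNU find; fresh_and_big_enough is a hypothetical helper, and the age and size thresholds in the example are assumptions to tune per site.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 only if FILE exists, was modified within the
# last MAX_AGE_MIN minutes, and is at least MIN_BYTES bytes.
fresh_and_big_enough() {
  local file=$1 max_age_min=$2 min_bytes=$3 size
  [ -f "$file" ] || return 1
  # find prints the path only when the mtime is newer than the cutoff
  [ -n "$(find "$file" -mmin "-$max_age_min")" ] || return 1
  size=$(( $(wc -c < "$file") ))
  [ "$size" -ge "$min_bytes" ]
}

# Example: last line of a backup script (path and thresholds are placeholders).
# fresh_and_big_enough "/backups/example_db-$(date +%F).sql.gz" 60 1000000000 || exit 1
```

A size floor set well below the real dump size would have flagged an 11 MB file standing in for a 4.7 GB database on the first night.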
Tonight's audit
Two hours, before the next backup window:
- SSH into production. du -sh the four largest directories on disk.
- Open the most recent off-site archive and verify the same paths and sizes are inside it.
- Tail the latest SQL dump. Look for "-- Dump completed on". Look for any "Got error".
- Restore the dump to a throwaway MySQL container. Open the homepage at /tmp.
- Note where the archive lives and what bill it shows up on. If that's the same bill as production, escalate.
- Add a heartbeat URL the cron pings on success, not just on schedule.
- Put the next restore drill on the calendar with a specific date and a named owner.
When we built Pier we kept hitting items 2, 3, and 5 on inherited legacy sites, often in the first hour of an engagement. The way we ended up handling it inside the app is a local version history of every chat-driven edit, kept separately from whatever the host's backup is doing, so the undo path never depends on the same infrastructure that just broke.
The smallest thing you can do today: open a terminal, run the find command above on your backup directory, and read the modification date of your newest file out loud. If you stumble on the year, that's your answer.
— Questions —
How often should an agency actually test restores?
Quarterly at a minimum, with a named owner and a calendar invite. Untested backups fail at first restore often enough to plan for it, and the cost lands on the agency, not the host.
Is a hosting panel backup ever enough on its own?
No. Panel backups commonly land on the same disk and same account as production. Treat them as a fast local copy, then add an independent off-site copy on a different provider's invoice.
What's the single most common silent failure?
A mysqldump that truncates on max_allowed_packet while the job still reports success, typically because the exit code is lost in a pipe to gzip. The file looks normal until the day you try to restore it. Always verify the trailing 'Dump completed on' footer is present.