— Article — № 013

013 —Workflow

Legacy PHP site snapshot: a one-afternoon playbook

You just got handed SFTP creds, a vague Slack thread, and a site nobody has touched in four years. Here is how to know what you are looking at by 5pm.

Antique nautical chart on dark oak table with brass dividers and small red wax seal beside a paper tag.
Hero · staged still№ 013

It is just past one in the afternoon. An agency you have worked with twice before has dropped a Bitwarden share into your inbox: SFTP creds, a database password, the URL of a site nobody at the agency has logged into for four years. Their client wants a quote for "modernising it a bit" and the agency wants a written assessment before five so they can send a number tonight. You have never seen the codebase. You have four hours.

What follows is the playbook we use for that exact situation. It assumes a PHP legacy site on shared or VPS hosting with SFTP and direct MySQL access. By 17:00 you should have a local snapshot, a one-page summary of what is actually running, and enough signal to defend a number.

The brief and the four hours

Set the goal explicitly before opening anything. You are not refactoring. You are not patching. You are producing two artefacts:

  1. A frozen copy of the site (files plus database) that you could restore over the live site in fifteen minutes.
  2. A one-page document that tells the agency what stack, what version, what plugins, what cron, and what known risks.

That framing matters because the temptation, once you are in, is to start fixing the obvious things. Resist it. A snapshot is a measurement. The fix is a separate conversation, and that conversation goes better when you are quoting from a document.

Hour one: inventory the file tree

Open SFTP and resist the urge to download everything to your laptop first. Most legacy WordPress installs are 2 to 8 GB once you include uploads, and you do not need uploads to know what the site is. Start with shape, not bytes.

If you have shell access, four commands pay for themselves immediately:

cd /var/www/site
find . -name "*.php" | wc -l
du -sh . */ 2>/dev/null | sort -h
ls -la wp-config.php sites/default/settings.php configuration.php app/etc/local.xml 2>/dev/null
php -v 2>/dev/null || php-cgi -v 2>/dev/null

The first tells you roughly how much PHP you are dealing with. A plain WordPress is around 2,500 PHP files. If you see 20,000 you are looking at a site stuffed with plugins, or a Drupal 7 with a lot of contrib modules, or a Magento. The second tells you which directory ate the disk (almost always wp-content/uploads, occasionally a forgotten backup-db/ directory that someone dropped in 2019). The third tells you, in one shot, which CMS you are dealing with. The fourth tells you which PHP version is actually executing, which matters because shared hosts will lie to you in the control panel.

If you do not have shell, every decent SFTP client (Cyberduck, Transmit, FileZilla) can show file counts per directory. Drop a one-line PHP file at the webroot to get the runtime version:

<?php echo PHP_VERSION, ' ', PHP_SAPI, ' ', php_uname(); ?>

Visit it once. Delete it. Note the version. PHP's supported-versions page tells you whether that version is still in security support, and on a legacy site the answer is usually no.

Spotting the CMS without guessing

Each CMS leaves a signature you can trust:

  • WordPress: wp-config.php at the root, wp-content/plugins/ and wp-content/themes/.
  • Drupal: sites/default/settings.php, core/CHANGELOG.txt (D8+), or CHANGELOG.txt at the root (D7).
  • Joomla: configuration.php at the root with a JConfig class.
  • Magento 1: app/etc/local.xml. Magento 2: app/etc/env.php.
  • Custom PHP: an index.php that includes a router, no CMS markers. Treat with caution. Custom PHP from 2014 is the genre most likely to have a forgotten admin route.

Hour two: read the database

Before you read a single row, dump it. A read-only audit is still a write event in someone's mind, and a host can fail at any moment. A snapshot you cannot restore is not a snapshot.

mysqldump --single-transaction --quick --routines --triggers \
  -h DBHOST -u DBUSER -p DBNAME | gzip > site-$(date +%F).sql.gz

The --single-transaction flag is the one that lets you take a consistent dump of an InnoDB database without locking the site. MySQL's mysqldump reference spells out the trade-offs against MyISAM. If the dump finishes in under two minutes, the database is small enough that you can pull a copy down and work locally for the rest of the audit.

Now look at shape:

SELECT table_name,
       engine,
       table_rows,
       ROUND(data_length/1024/1024, 1)  AS data_mb,
       ROUND(index_length/1024/1024, 1) AS index_mb
FROM   information_schema.tables
WHERE  table_schema = DATABASE()
ORDER  BY data_length DESC
LIMIT  20;

This one query tells you where the weight is. On WordPress the top of the list is usually wp_options (autoloaded option bloat), wp_postmeta (orphan meta from deleted plugins), and wp_posts (revisions). On Magento 1 the top is almost always core_url_rewrite and log_url_info, both of which mean nothing for users and everything for backups.

For WordPress specifically, two follow-up queries tell you the rest of the story:

SELECT option_name, LENGTH(option_value) AS bytes
FROM   wp_options
WHERE  autoload = 'yes'
ORDER  BY bytes DESC
LIMIT  20;

SELECT post_status, COUNT(*) FROM wp_posts GROUP BY post_status;

The first surfaces autoloaded options larger than a few kilobytes (usually a transient that forgot to expire, or a plugin storing its entire log inside wp_options). The second tells you whether the 80,000 rows in wp_posts are real content or 78,000 revisions.

For Drupal, the equivalents are the cache_* table sizes and the watchdog table; for Magento, log_* and report_event. None of this needs fixing today. You are taking a measurement.

Hour three: capture the runtime

The site does things you cannot see from files alone. Two places hide the rest of the truth.

The .htaccess walk

Read every .htaccess from the document root down. The interesting ones are rarely at the top. A typical legacy WordPress will have a stock block at the root and a hand-rolled redirect map two directories deep that nobody documented:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^old-product-([0-9]+)\.html$ /shop/?p=$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^www\.
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]
</IfModule>

# php_value upload_max_filesize 64M
# php_value memory_limit 512M
SetEnvIf Request_URI "\.(jpg|png|css|js)$" no-log

What you are looking for: redirect chains that double-hop, PHP overrides that conflict with what the host's PHP-FPM pool is doing, and IP allowlists that will lock the agency out the moment they switch office. Apache's mod_rewrite documentation is the reference if a rule looks suspicious.

Cron, queues, and error logs

Run crontab -l if you have shell. If not, look for the CMS-specific signatures:

  • WordPress: DISABLE_WP_CRON in wp-config.php means a real system cron is hitting wp-cron.php somewhere. Find it.
  • Drupal: drush core:cron from a system crontab, or a curl to /cron/{key}.
  • Magento: bin/magento cron:run, or the older cron.sh.

If cron is broken (and on legacy sites it usually is, because someone migrated hosts and never re-added the entry) you will see scheduled posts that never publish, transients that never expire, and queue tables that grow forever. Note it. Do not fix it.

Then tail the last 200 lines of wp-content/debug.log or whatever error_log sits next to the script that crashes most. You will learn more about the site's actual health in two minutes than the agency learned in two years. Repeated stack traces from the same plugin are usually the headline of your handover document.

Hour four: freeze, summarise, hand back

You now have everything you need. Spend the last hour producing the two artefacts.

The freeze:

tar --exclude='wp-content/cache' \
    --exclude='wp-content/uploads' \
    -czf files-$(date +%F).tar.gz .

Excluding uploads is deliberate. You already have the database, you already have the code, and uploads are restorable from the live site if you ever need them. A 200 MB code archive plus a 50 MB database dump is a snapshot you can email. A 12 GB tarball is a snapshot that will sit on a drive and rot.

The summary, in one page, with these headings exactly:

  • Stack: PHP version, MySQL version, web server, CMS and version.
  • Surface: file count, code size, upload size, top three directories by weight.
  • Database: total size, top five tables by weight, charset and collation.
  • Plugins or modules: count, names of the five with the most database rows or files.
  • Cron: present or absent, mechanism, last successful run if you can tell.
  • Risks: PHP EOL status, exposed files, redirect chains, anything in the error log that repeats.
  • Restore: where the snapshot lives, the one command to put it back.

That is it. Send it. The agency now has something they can quote against, and you have a frozen point in time you can return to without re-discovering everything from scratch.

Where the playbook tends to bite you

Two things go wrong often enough to be worth naming.

The first: mysqldump against a database with stored procedures or triggers will silently omit them unless you pass --routines --triggers. The site appears to restore cleanly and then breaks at the first checkout three days later. Always include both flags, even when you are sure the database has no procedures. You are not always sure.

The second: file ownership. If you SFTP files back as your user, the web server (running as www-data or nginx) may not be able to read them. Note the ownership of a few representative files during your audit (ls -la) and document it. Restoring without that note is a recoverable mistake. Restoring without realising it is a Friday-evening mistake.

When we built Pier we ran into this exact afternoon over and over, which is why the app opens an SFTP and a MySQL connection in the same window and keeps a version history of every file and every row from the moment you connect. The first audit we ran with it took two hours instead of four, mostly because we stopped switching tabs between the MySQL editor and the file tree.

The smallest thing to do today

Pick a site you already have credentials for and run just hour one. The four commands at the top of this post, against a site you think you know, will tell you something you did not know. That is the entire promise of a snapshot, and four commands is a small enough start to actually do before the day ends.

— Questions —

Do I really need to dump the database before I read it?

Yes. A read-only audit can still trigger plugin behaviour, and a host can fail at any moment. A dump in your first ten minutes is the cheapest insurance you will ever buy.

What if I don't have shell access, only SFTP?

The same playbook works over SFTP plus a MySQL client. You lose the speed of find and du, but the four artefacts (file list, DB dump, .htaccess, error log) are all reachable from a desktop client.

How long should the handover document be?

One page. Anything longer gets skimmed; anything shorter omits a risk. Bullet form, named headings, no narrative. The agency will paste from it directly into their quote.

What do I do about uploads if I excluded them from the archive?

Note the uploads directory size and last-modified date in the handover. If a restore is ever needed, rsync uploads from the live site at that point. Do not bake them into the snapshot.