009 —AI
AI-assisted PHP review: what it catches, what it misses
An honest look at what AI code review actually changes when you're refactoring 14-year-old PHP: the bugs it finds in seconds, the legacy decisions it can't read.
It's Wednesday, half past three. You're three hours into a refactor of a 14-year-old WordPress theme that a Dutch agency we work with inherited from a long-departed contractor. The file is functions.php. Line 312 reads extract($_REQUEST);. The AI review pass flagged it in under a second, alongside fourteen other findings across wp-content/themes/. None of them are wrong. About half of them matter.
This is the perspective shift nobody really wrote down: AI-assisted code review on legacy PHP doesn't make the work faster in the obvious way. It makes the bottleneck move. The detection problem is mostly solved. The decision problem is now the whole job.
What the model catches in seconds
For pattern-level bugs, current models are very good. Given a 600-file legacy WordPress or Magento 1 codebase, an AI pass will reliably surface:
- Every surviving
mysql_query()call, deprecated in PHP 5.5 and gone in PHP 7. - Unprepared SQL with concatenated
$_GETor$_POSTvalues. - Calls to
create_function(),each(),split(), and the rest of the PHP 7.2 deprecation list. - Type juggling bugs around
==on stringy numbers (see the php.net comparison table). - SSRF-shaped
file_get_contents()calls on user-controlled URLs. - Output written to the page without
esc_html()oresc_attr().
A senior PHP developer doing this audit by hand takes a week per site. A model trained on enough PHP does it in the time it takes to brew coffee. The output is roughly as accurate as a tired human, possibly slightly better on the boring pattern matches, definitely worse on anything that requires reading the project's history.
This part of the change is real and uncontroversial. If a 2024 audit was "find all the broken bits," a 2026 audit starts with a printable list of broken bits and asks "now which ones do we touch."
Where the model gets confidently wrong
The failures cluster around context, not capability. A few that we keep seeing on real legacy sites.
The missing break that isn't a bug
A switch case without a break looks like an obvious oversight. On one Drupal 7 site we audited, the missing break was load-bearing: it intentionally fell through to a legacy redirect that an analytics consultant wired up in 2017. Two thousand pages a month hit that fall-through. The model said "likely bug." It was, in fact, the business logic.
Dead code that's still called
Legacy PHP loves require_once with relative paths two directories up. An AI report will confidently call a function dead because nothing in the file or namespace references it. Then a Joomla module from 2012 includes the file directly and explodes in production the morning after deploy.
The deprecated function the host pinned
Some shared hosts still serve PHP 7.4, sometimes 7.2. The model will recommend replacing each() or create_function() with the modern equivalent, which is correct in isolation and breaks the moment you push to a host that's three major versions behind. The fix requires reading .htaccess:
AddHandler application/x-httpd-php74 .php
# or, on cPanel:
# AddType application/x-httpd-ea-php74 .php .php7 .phtml
The model can't see that line unless you give it to the model. It will give the right answer for the wrong PHP version and call it done.
The hook-registered-twice pattern is a quieter cousin of the same problem. WordPress plugins from the 2014 to 2018 era frequently re-register the same filter inside both plugins_loaded and init because a competitor plugin used to deregister it. The "duplicate hook registration" finding is technically correct and operationally catastrophic.
The review loop, rewritten
The shape of a legacy PHP refactor used to be linear. Read the file. Write notes in a scratchpad. Pattern-match against bugs you've seen before. Decide what to change. Run the change against a staging copy. Wait for a client to email about the thing you broke.
With AI review in the loop, the shape is now:
- Run the model across the whole tree.
- Triage the findings against the codebase's actual history.
- For each finding worth acting on, write the change as a hypothesis.
- Verify against a real database and a real request flow.
- Keep an undo path open until the client has lived with the change for a week.
Step 2 is the work. It's where domain knowledge earns its keep. A finding like "the wp_users query at line 88 isn't prepared" is a hypothesis until you've checked the git blame, the support tickets, and the staging behaviour. The model isn't wrong. It just doesn't know whether the variable comes from a constant in wp-config.php or from a query string a customer can craft. The OWASP write-up on SQL injection is still the right primer for that judgement call.
Triage as the actual skill
The senior developers we work with are getting noticeably better at one specific motion: reading an AI finding and deciding, in about fifteen seconds, whether it deserves a real investigation. The motion goes:
- Is the input attacker-controlled? If yes, real read.
- Does this code path actually execute on the production site? Check the access logs, not the file.
- If we change it, can we put it back in one click?
The third question used to be answered by "I'll zip the file first." Now it's answered by file-level version history on the project itself, which removes the deterrent against trying the change. The cost of being wrong about a triage call drops to roughly zero, which means the bar for "worth trying" drops too.
What this changes about the work
End to end, the day looks less like archaeology and more like editing. You start with a list of suspect lines. You decide on five of them. You make the change. You leave the audit trail intact. You move on. The thing AI didn't do is make the developer redundant. It moved them up a layer, from "find the bug" to "decide which bug is real."
When we built Pier we ran into this exact loop on every legacy site we touched. The way we ended up handling it was to put the AI review, the FTP edit, the MySQL editor, and the per-file version history inside one window, so a triage decision and the undo path live one keystroke apart.
The smallest thing you can do today: take the oldest functions.php in your project tree, run an AI review pass on it, and before fixing anything, read the git blame on the three most alarming findings. Half of them will be real bugs. The other half will teach you something about the site that the model couldn't see.
— Questions —
Can AI reliably catch SQL injection in legacy PHP?
Yes for the obvious string-concat pattern. It will miss the subtler cases where input passes through three function calls and a session variable. Treat every flag as a hypothesis to verify.
Does AI code review replace senior PHP developers?
It replaces the detection part of their work. The triage call about what's safe to change still belongs to whoever knows the codebase's history and the client's quirks.
How accurate is AI on PHP 7 to 8 deprecation audits?
Very accurate on pattern matches like create_function or each. Less reliable when the host runs an older PHP version, because the model can't read your .htaccess unless you hand it over.
What's the biggest risk on legacy sites?
Confidently wrong findings on context-dependent code: missing break statements that fall through on purpose, dead functions called by old plugins, hooks registered twice for compatibility.