A Performance Analysis for a 3 Million Record XML Import
I recently needed to import large XML files into a MySQL database. The largest file was 11.3 GB containing over 3 million records. I thought this was going to be a straightforward streaming import, but it turned into an interesting debugging adventure that led me to discover some surprising limitations in PHP's XML handling*.
*Actually, these limitations likely extend beyond PHP, as the root cause may be in an underlying C library used by a number of other programming languages.
Here's what I learned, the rabbit holes I went down, and the surprisingly simple solutions that gave me massive performance improvements.
Context
- The XML files I was importing contained a small intro/header section of data followed by an array of ~3 million nodes that all shared a known schema.
- When importing, each element could be handled in isolation.
1. The Speed Problem
I initially wrote a simple artisan command that would truncate the database table, take the source XML file as a command argument, use the XMLReader class to iterate through the array's sibling nodes, and bulk insert the records 5,000 at a time.
public function handle()
{
    // truncate, handle file argument...
    $reader = \XMLReader::open($this->file);

    $batch = [];
    $count = 0;
    $batchSize = 5000;

    foreach ($this->iterateRecords($reader, $elementName) as $xmlRecord) {
        $dto = $this->createDto($xmlRecord, $type);

        $batch[] = [
            ...$dto->toModelAttributes(),
            // other static attributes
        ];

        // Bulk insert batch
        if (count($batch) >= $batchSize) {
            $modelClass::query()->insert($batch);
            $count += count($batch);
            $batch = [];
        }
    }

    // Insert remaining batch
    if (! empty($batch)) {
        $modelClass::query()->insert($batch);
        $count += count($batch);
    }

    // clean up...
}
private function iterateRecords(XMLReader $reader, string $elementName): Generator
{
    // This isn't verbatim code but gets the point across...

    // Get to the first instance of our element to import
    do {
        if ($elementName === $reader->localName) {
            break;
        }
    } while ($reader->read());

    // Yield the outer xml for each element
    do {
        yield $reader->readOuterXml();
    } while ($reader->next($elementName));
}
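The DTO layer isn't the interesting part of this story, but for context, here is a minimal, hypothetical sketch of the kind of object createDto() might produce: a small class that parses one record's XML and maps it to model attributes. The class name, fields, and element names are illustrative, not the actual schema.
// A minimal, hypothetical DTO along these lines; the real class, fields,
// and XML element names differ.
final class RecordDto
{
    public function __construct(
        public readonly string $externalId,
        public readonly string $name,
    ) {
    }

    public static function fromXml(string $outerXml): self
    {
        // Each yielded record is a complete element, so it can be parsed on its own.
        $element = simplexml_load_string($outerXml);

        return new self(
            externalId: (string) $element->id,
            name: (string) $element->name,
        );
    }

    public function toModelAttributes(): array
    {
        return [
            'external_id' => $this->externalId,
            'name' => $this->name,
        ];
    }
}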
With this in place, my import process would start strong but degrade in speed over time:
Starting to import records...
Imported 5000 records. Current rate: 3557 records/s
Imported 10000 records. Current rate: 2273 records/s
Imported 15000 records. Current rate: 1931 records/s
...
Imported 125000 records. Current rate: 320 records/s
Imported 130000 records. Current rate: 307 records/s
Starting at ~3,500 records/second and dropping to ~300/s after just 130,000 records. At that rate, importing 3 million records would take forever—assuming PHP didn't run out of memory first.
That wasn't going to cut it.
2. Noticing the Memory Leak
To begin investigating the speed issue, I opened Activity Monitor and noticed two things: 1) the PHP process was consuming an ever-growing amount of memory, and 2) the mysqld process was barely touching the CPU.
So I now had two symptoms to investigate:
- Performance degradation - throughput dropping from 3,500/s to 300/s
- Memory leak - unbounded memory growth of the PHP process
I assumed these were related. Spoiler: they weren't.
3. Investigating the Leak
My first instinct was that the memory leak was related to XML parsing or query logs. We were streaming gigabytes of data, after all. All it would take is some distant reference confusing the garbage collector or holding onto the XML strings and, kaboom, a big leak. So I tried some of the simple and obvious options:
Forced garbage collection:
if (count($batch) >= $batchSize) {
    $modelClass::query()->insert($batch);

    unset($batch);
    gc_collect_cycles();
    gc_mem_caches(); // PHP 7.0+ - reclaims memory used by the Zend Engine memory manager
}
Disabling Laravel's query log:
DB::disableQueryLog();
Neither of these attempts made a significant difference. The performance continued to degrade, and memory continued to grow.
I wrote a series of tinker tests to isolate the issue, testing raw XMLReader, XMLReader with DTO creation, batch building, and various combinations. The tests revealed something surprising: XMLReader itself wasn't leaking memory in isolation, but performance was degrading severely. We were onto something, at least for the speed issue...
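For a flavor of what those isolation tests looked like, here's a rough sketch of the raw XMLReader one, with a placeholder file path and element name: advance with next() in batches and print the rate and memory, without reading any element content.
// A rough sketch of the raw XMLReader traversal test (not verbatim).
// '/path/to/export.xml' and 'record' are placeholders.
$reader = new \XMLReader();
$reader->open('/path/to/export.xml');

// Skip ahead to the first record element.
while ($reader->read() && $reader->localName !== 'record') {
    // keep reading
}

$count = 0;
$batchSize = 5000;
$start = microtime(true);

do {
    $count++;

    if ($count % $batchSize === 0) {
        $rate = (int) round($batchSize / (microtime(true) - $start));
        $memory = round(memory_get_usage(true) / 1048576);
        echo "Batch: {$count} records, {$rate}/s, {$memory} MB\n";
        $start = microtime(true);
    }
} while ($reader->next('record'));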
4. The Real Performance Culprit: XMLReader's Traversal
After systematic testing, I discovered the root cause of the performance degradation. It wasn't memory, garbage collection, or the database—it was XMLReader's fundamental traversal behavior.
I ran performance tests comparing different approaches:
===
Raw fread() baseline
How quickly can we buffer through the file?
===
Duration: 0.05s for 50000 records
Batch 1: 5000 records, 777586/s
Batch 2: 10000 records, 1099885/s
Batch 3: 15000 records, 1182893/s
Batch 4: 20000 records, 988756/s
Batch 5: 25000 records, 1094604/s
Batch 6: 30000 records, 948594/s
Batch 7: 35000 records, 1110367/s
Batch 8: 40000 records, 993863/s
Batch 9: 45000 records, 1118839/s
Batch 10: 50000 records, 951261/s
First batch rate: 777586/s
Last batch rate: 951261/s
===
XMLReader::next()
How quickly can the XMLReader traverse elements with no addt'l processing?
===
Duration: 36.29s for 50000 records
Batch 1: 5000 records, 10303/s
Batch 2: 10000 records, 3627/s
Batch 3: 15000 records, 2683/s
Batch 4: 20000 records, 1848/s
Batch 5: 25000 records, 1578/s
Batch 6: 30000 records, 1320/s
Batch 7: 35000 records, 1000/s
Batch 8: 40000 records, 950/s
Batch 9: 45000 records, 839/s
Batch 10: 50000 records, 746/s
First batch rate: 10303/s
Last batch rate: 746/s
The results were damning:
- Raw fread(): Consistent ~1 million records/s with no degradation; disk reading & buffering was plenty fast.
- XMLReader::next(): Starts at 10,303/s but degrades to 746/s—and this was without even reading any content
The degradation appears to be roughly quadratic overall: just calling next() to advance to the next sibling gets progressively slower the further you are into the document, so the per-batch rate keeps falling and the total traversal time grows roughly with the square of the node count.
Why This Happens: libxml2?
PHP's XMLReader is a thin wrapper around libxml2, so I believe the performance degradation originates in libxml2, not PHP itself. There are known performance issues in libxml2 that may explain the behavior I saw during my imports. Here's a specific issue from someone using an XML library in Ruby who hit a similar performance problem.
If I can find some spare time, I may write a C program to test libxml2 in a few ways to ensure that's truly the cause and not something else related to PHP's wrapper of the library.
My Pragmatic Solution: String-Based Parsing
For predictable XML structures where you know the element names you're parsing, simple string operations massively outperform XMLReader:
public function iterate(): Generator
{
    $handle = fopen($this->xmlPath, 'r');
    $buffer = '';
    $bufferSize = 1024 * 1024; // 1MB chunks

    while (! feof($handle)) {
        $buffer .= fread($handle, $bufferSize);

        while (($startPos = $this->findStartTag($buffer)) !== false) {
            $endPos = strpos($buffer, $this->endTag, $startPos);

            if ($endPos === false) {
                break; // Need more data
            }

            $endPos += strlen($this->endTag);

            yield substr($buffer, $startPos, $endPos - $startPos);

            $buffer = substr($buffer, $endPos);
        }
    }

    fclose($handle);
}
This approach:
- Reads the file in 1MB chunks with fread()
- Uses strpos() to find record boundaries
- Uses substr() to extract each record
- Yields records one at a time for memory efficiency
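Consuming the generator then looks roughly like the sketch below. The wrapper class name, its constructor arguments, and the element name are placeholders for however you wire up the start and end tags.
// Hypothetical wrapper class holding $xmlPath, findStartTag(), and $endTag.
$parser = new StreamingRecordParser($xmlPath, 'record');

foreach ($parser->iterate() as $xmlRecord) {
    // Each yielded string is one complete element, so it can still be
    // parsed in isolation, e.g. with SimpleXML, before batching inserts.
    $element = simplexml_load_string($xmlRecord);

    // ... build the DTO / batch row exactly as in the handle() method above
}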
I ran a performance test like the ones above for this implementation:
=== String-based (fread + strpos) ===
Duration: 0.64s for 50000 records
Batch 1: 5000 records, 80299/s
Batch 2: 10000 records, 76330/s
Batch 3: 15000 records, 79961/s
Batch 4: 20000 records, 75922/s
Batch 5: 25000 records, 76749/s
Batch 6: 30000 records, 76267/s
Batch 7: 35000 records, 79428/s
Batch 8: 40000 records, 79196/s
Batch 9: 45000 records, 81635/s
Batch 10: 50000 records, 80697/s
First batch rate: 80299/s
Last batch rate: 80697/s
Now we're talkin'. Obviously, it's much slower than just reading through the file, but the key is that the performance remains constant and that it's fast enough for my needs.
5. The Memory Leak: Sentry's Query Tracking
After switching to string-based parsing, throughput was excellent and consistent. But the memory leak persisted. The import would run beautifully at ~17,000 records/s with consistent performance, but eventually crash:
Allowed memory size of 2147483648 bytes exhausted
at vendor/laravel/framework/src/Illuminate/Database/MySqlConnection.php:53
I initially suspected something in Laravel's query builder was holding a reference to the inserted data. However, when I tested raw PDO statements (bypassing the query builder), memory stayed stable throughout the import. So something about Laravel's query building and DB connection was causing the issue.
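For reference, the raw-PDO comparison looked something like this; the table name, columns, and batch contents are placeholders rather than the actual test code.
use Illuminate\Support\Facades\DB;

$pdo = DB::connection()->getPdo();

// Hypothetical batch shape: 5,000 rows with two columns each.
$rows = array_fill(0, 5000, ['external_id' => 'abc-123', 'name' => 'Example']);

// Build one multi-row INSERT, mirroring what Laravel's insert() generates.
$placeholders = implode(', ', array_fill(0, count($rows), '(?, ?)'));
$statement = $pdo->prepare("INSERT INTO records (external_id, name) VALUES {$placeholders}");

$bindings = [];
foreach ($rows as $row) {
    $bindings[] = $row['external_id'];
    $bindings[] = $row['name'];
}

$statement->execute($bindings);

// Running batches like this in a loop showed flat memory usage, whereas the
// equivalent Model::query()->insert($rows) calls grew memory on every batch.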
Since the code paths for my raw PDO statements and Laravel's connection were so similar, my immediate next suspicion was the QueryExecuted event that Laravel dispatches when running a statement. Bazinga. I discovered the culprit: Sentry.
Sentry's Laravel integration listens to QueryExecuted events to capture queries for breadcrumbs and performance tracing. Each query's data was being retained in memory. With thousands of batch inserts, this accumulated until PHP ran out of memory.
| Test | Memory Growth | Leaking? |
|---|---|---|
| Raw PDO | 0 MB | No |
| Laravel insert() | 26 MB | Yes |
| Laravel + disableQueryLog() | 26 MB | Yes |
| Laravel + Event::forget(QueryExecuted) | 0 MB | No |
The fix is simple—disable Sentry's breadcrumbs & tracing via config settings, or, as I chose to do, disable the QueryExecuted event listeners during bulk imports:
use Illuminate\Database\Events\QueryExecuted;
use Illuminate\Support\Facades\Event;
// Disable query event listeners during bulk import
Event::forget(QueryExecuted::class);
// Now Laravel's standard insert() works with stable memory
foreach ($batches as $batch) {
    Model::query()->insert($batch);
}
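If you'd rather go the config route, recent versions of the sentry-laravel package expose options along these lines in config/sentry.php; check the config your installed version publishes, as the exact keys may differ.
// config/sentry.php (excerpt) - keys shown here match recent sentry-laravel
// releases, but verify against your version's published config.
'breadcrumbs' => [
    'sql_queries' => false, // stop recording each query as a breadcrumb
],

'tracing' => [
    'sql_queries' => false, // stop creating spans for each query
],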
6. Index Optimization
For even more processing throughput, I removed indexes from the database tables before the import and recreated them afterwards. This is an approach MySQL recommends for bulk data loading.
This reduces how much processing the database has to do per insert, which was fairly significant in my case, as some of the tables I was importing into had multiple unique indexes. The database not only had to update the B-tree structures but also validate uniqueness on every insert.
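In Laravel terms, that can be as simple as the sketch below; the table, column, and index are placeholders for whatever your schema actually uses.
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Drop the expensive indexes before importing (placeholder table/column names).
Schema::table('records', function (Blueprint $table) {
    $table->dropUnique(['external_id']);
});

// ... run the bulk import ...

// Recreate the indexes once all rows are in place.
Schema::table('records', function (Blueprint $table) {
    $table->unique('external_id');
});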
7. Overall Results
After all optimizations:
Starting to import records...
Imported 5000 records. Current rate: 19593 records/s
Imported 10000 records. Current rate: 19549 records/s
Imported 15000 records. Current rate: 19980 records/s
...
Imported 95000 records. Current rate: 20877 records/s
Imported 100000 records. Current rate: 21021 records/s
Consistent ~20,000 records/second throughout the entire import, with stable 80-100MB memory usage.
| Metric | Before | After | Improvement |
|---|---|---|---|
| Initial Rate | 3,500/s | 20,000/s | 5.7x faster |
| Rate at 100k records | ~300/s | 20,000/s | 66x faster |
| Memory Usage | Growing unbounded | Stable 80-100MB | Stable |
| PHP CPU | 100% | 30-40% | Balanced |
| MySQL CPU | ~2% | 60-70% | Properly utilized |
Key Takeaways
- Don't trust XMLReader for very large files. Despite being marketed as a streaming parser, it has fundamental performance issues with multi-gigabyte files. The degradation is in traversal itself: even next() without reading content degrades ~14x over 50,000 records. This appears to be a libxml2 issue. Simple string operations can be 100x faster with zero degradation.
- Separate your symptoms. Performance degradation and memory leaks can have completely different causes.
- Watch your CPU distribution. When PHP is at 100% and MySQL is at 2%, your bottleneck isn't the database.
- APM tools can cause memory leaks in bulk operations. Sentry (and likely other APM tools) listens to query events and retains data for tracing. For bulk imports, disable query tracking with Event::forget(QueryExecuted::class).
- Drop and recreate indexes for bulk imports. MySQL recommends this approach for good reason: the time complexity math works in your favor.
- Profile before optimizing. Systematic isolation testing found XMLReader and Sentry faster than guessing would have.