
aruneshvv#233

Open
aruneshvv wants to merge 3 commits into tempestphp:main from aruneshvv:main

Conversation

@aruneshvv

Summary

  • Multi-process parallel CSV parser using pcntl_fork with 8 workers
  • Key optimizations: integer date keys (YYYYMMDD) for faster hash lookups, zero-copy leftover handling, 2x loop unrolling, reference-based merge, igbinary serialization when available
  • Validated and deterministic output across runs
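
The integer date key optimization mentioned above can be sketched as follows (a minimal illustration, not the PR's actual code; the function name is hypothetical):

```php
<?php

// Sketch: turn "2026-03-04" into the integer 20260304. Integer keys
// hash faster than string keys in PHP arrays, which is the basis of
// the "integer date keys (YYYYMMDD)" optimization described above.
function dateToIntKey(string $date): int
{
    // "YYYY-MM-DD" -> "YYYYMMDD" -> int; assumes well-formed input.
    return (int) (substr($date, 0, 4) . substr($date, 5, 2) . substr($date, 8, 2));
}
```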

Test plan

  • php tempest data:validate passes
  • Deterministic output (3 runs produce byte-identical JSON)
  • Total visit count matches input line count (10M in = 10M out)
  • 268 unique paths correctly aggregated
  • Dates sorted ascending within each path
  • First-appearance key order preserved across parallel merge

/bench

Multi-process architecture with 8 workers using pcntl_fork, each
parsing newline-aligned file chunks via fread with 8 MB buffers. Key
optimizations: integer date keys (YYYYMMDD) for 57% faster hash
lookups during merge, zero-copy leftover handling across buffers,
2x loop unrolling, reference-based merge, and igbinary serialization
when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
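
The fork/merge shape described above can be sketched roughly like this (a simplified illustration under assumed names; `parseChunk` is a hypothetical helper, and the real PR adds buffered reads, leftover handling, and igbinary on top of this skeleton):

```php
<?php

// Sketch: split the file into newline-aligned byte ranges, fork one
// worker per range, let each parse its chunk into per-key counts and
// write them to a temp file, then merge everything in the parent.

const WORKERS = 8;

$size = filesize($csvPath);
$step = intdiv($size, WORKERS);
$fh = fopen($csvPath, 'rb');

// Align each chunk boundary to the next newline so no row is split.
$bounds = [0];
for ($i = 1; $i < WORKERS; $i++) {
    fseek($fh, $i * $step);
    fgets($fh);                 // advance to the end of the current line
    $bounds[] = ftell($fh);
}
$bounds[] = $size;
fclose($fh);

$tmpFiles = [];
for ($i = 0; $i < WORKERS; $i++) {
    $tmpFiles[$i] = tempnam(sys_get_temp_dir(), 'agg');
    if (pcntl_fork() === 0) {   // child: parse [start, end) and exit
        $counts = parseChunk($csvPath, $bounds[$i], $bounds[$i + 1]); // assumed helper
        file_put_contents($tmpFiles[$i], serialize($counts));
        exit(0);
    }
}

// Parent: wait for all children, then merge their partial counts.
for ($i = 0; $i < WORKERS; $i++) {
    pcntl_wait($status);
}
$merged = [];
foreach ($tmpFiles as $file) {
    foreach (unserialize(file_get_contents($file)) as $key => $n) {
        $merged[$key] = ($merged[$key] ?? 0) + $n;
    }
    unlink($file);
}
```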
@brendt
Member

brendt commented Mar 4, 2026

Benchmarking complete! Mean execution time: 7.88241270524s

@brendt brendt removed the verified label Mar 4, 2026
aruneshvv and others added 2 commits March 4, 2026 16:06
Replace nested hash tables with flat integer array for O(1) packed
array access in each worker. Key changes:

- Pre-computed slug->ID and date->ID mappings from Visit::all()
- 8-char date keys (YY-MM-DD) for faster hash lookups
- Comma search with fixed 52-char jump
- Element-wise array addition merge (replaces nested hash merge)
- ~30% faster parsing per worker + simpler merge phase

Benchmarked: 1.4-2.0s on 10M rows (vs 2.5s before).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
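The element-wise merge over flat integer arrays that this commit describes can be sketched as (illustrative, not the commit's actual code):

```php
<?php

// Sketch: with pre-computed slug->ID and date->ID mappings, each
// worker indexes a flat array by $slugId * $dateCount + $dateId.
// Merging partial results is then plain element-wise addition
// instead of walking nested hash tables.
function mergeFlat(array $total, array $partial): array
{
    foreach ($partial as $idx => $count) {
        $total[$idx] += $count;   // packed array: O(1) index access
    }
    return $total;
}
```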
Children now serialize only non-zero count entries (~60K) instead of
full flat array (880K entries), reducing temp file size ~14x and
speeding up serialization, deserialization, and merge phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
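Dropping zero-count entries before serialization, as this commit describes, can be sketched as (illustrative; the function name is hypothetical):

```php
<?php

// Sketch: keep only non-zero counts before serializing, shrinking a
// mostly-zero flat array (~880K slots) down to the ~60K entries that
// carry data, so the temp files and the merge loop stay small.
function sparse(array $flat): array
{
    return array_filter($flat, fn (int $n): bool => $n !== 0);
}

// Merging a sparse partial back in reuses the element-wise addition:
// foreach ($sparse as $idx => $n) { $total[$idx] = ($total[$idx] ?? 0) + $n; }
```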
@aruneshvv
Author

/bench

@brendt
Member

brendt commented Mar 4, 2026

Benchmarking complete! Mean execution time: 4.5404760714s

@brendt
Member

brendt commented Mar 4, 2026

Milliseconds were harmed in the making of this improvement. ⏱️
🏆 leaderboard.csv
