John Zachary Fitch

Ghost in the Codex Machine

How a release-only pre-main change made CUDA/MKL tool calls quietly fall back to slow paths.

Executive Summary

The symptom wasn't a crash. It was worse: "Codex feels slow" in certain real developer environments.

In release builds, a pre-main hardening routine executed before main() and stripped LD_* / DYLD_* environment variables. For a subset of users (CUDA, Conda/MKL, HPC module stacks, custom library layouts), that meant critical dynamic libraries could no longer be discovered inside Codex tool subprocesses. The downstream effect was dramatic but quiet: BLAS fell back to slow implementations; GPU workflows fell back to CPU; and engineers lost time chasing the wrong layer because nothing obviously "failed."

I traced the behavior to the change that introduced it, reduced it to a minimal reproduction with representative measurements, and opened an upstream issue that maintainers could validate quickly. The fix shipped and is called out in the rust-v0.80.0 release notes with attribution.

If You Were Debugging This
  • "It works in my terminal, but it's slow inside tool calls."
  • No obvious error messages; the system just quietly falls back.
  • Only reproduces in release builds and only for certain environment layouts.
  • The smoking gun: environment variables differ inside subprocesses vs the user's shell.
Why Agent Builders Care
  • Tool correctness: Agents rely on subprocesses behaving like the user's environment.
  • Evaluation stability: Silent fallbacks create noisy perf profiles and flaky "it feels slow" reports.
  • Trust: If the substrate silently rewrites execution, higher-level behavior becomes harder to explain and debug.

Timeline

2025-09-30

PR #4521 merges (process hardening executes pre-main in CLI release builds)

2025-10-01

First affected release ships (rust-v0.43.0)

2025-10-31

Earlier "Ghosts in the Codex Machine" investigation published (useful context; this regression still remained)

2026-01-08

I open issue #8945 with root cause + reproduction + benchmarks

2026-01-09

Fix merged in PR #8951

2026-01-09+

Fix shipped in rust-v0.80.0 (release notes call-out)

The Problem

What Users Experienced

When Codex runs tools (Python, Node, build systems, CLIs), it does so by spawning subprocesses. For dev workflows, the correct baseline is simple: subprocesses should inherit the developer's environment unless a specific policy says otherwise.
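
In concrete terms, the baseline contract is easy to state. Here is a Python sketch of the expected default (illustrative, not Codex's actual spawn code):

# By default, subprocess.run(..., env=None) hands the child the parent's
# full environment; that inheritance is the baseline developers expect.
import os
import subprocess
import sys

parent = os.environ.get("LD_LIBRARY_PATH", "<unset>")
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('LD_LIBRARY_PATH', '<unset>'))"],
    capture_output=True, text=True,
).stdout.strip()
print("parent:", parent)
print("child: ", child)  # should match; any deviation should be deliberate policy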

When LD_LIBRARY_PATH / DYLD_LIBRARY_PATH disappears, the failure mode is often not a clean error. Instead, the dynamic linker "finds something else," and performance collapses:

  • BLAS libraries fall back to unaccelerated implementations (see the check sketched after this list).
  • CUDA tooling may fall back to CPU execution.
  • Some enterprise/custom libraries fail to load entirely.
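
One way to confirm which BLAS the dynamic linker actually resolved, as a small Python sketch (it assumes numpy and threadpoolctl are installed; threadpoolctl is my choice for inspection here, not part of the affected stack):

# check_blas.py: report which BLAS shared object actually loaded.
import numpy as np
from threadpoolctl import threadpool_info

_ = np.ones((2, 2)) @ np.ones((2, 2))  # force the BLAS backend to load
for lib in threadpool_info():
    # 'internal_api' is e.g. 'mkl' or 'openblas'; 'filepath' is the shared
    # object the linker resolved, the tell-tale when a fallback occurred.
    print(lib.get("internal_api"), "->", lib.get("filepath"))

Run it once in the shell and once via codex exec: same script, and in the affected builds, a different filepath.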

Root Cause

Why This Was a "Ghost"

The stripping happened before main() and before most instrumentation/logging was initialized:

  • It was implemented as a pre-main constructor in release builds (#[ctor::ctor]-style behavior).
  • It was silent by default (no warning when variables were removed).
  • It only reproduced on specific environment layouts (non-RPATH installs, legacy Conda, HPC module stacks, custom vendor libraries).
  • Users could see LD_LIBRARY_PATH set correctly in their shell, but inside codex exec it was empty.

Evidence and Reproduction

How I Proved It

I approached this like a systems regression: reduce the behavior to a small, testable claim and then measure the consequences.

Minimal Repro (Conceptual)

The core claim to test is simple: tool subprocesses should inherit the user's environment unless explicitly documented otherwise.

# Outside the agent/tool runner (baseline)
python -c "import os; print('LD_LIBRARY_PATH=', os.environ.get('LD_LIBRARY_PATH') or ''); print('DYLD_LIBRARY_PATH=', os.environ.get('DYLD_LIBRARY_PATH') or '')"

# Inside the tool runner (should match baseline)
codex exec -- python -c "import os; print('LD_LIBRARY_PATH=', os.environ.get('LD_LIBRARY_PATH') or ''); print('DYLD_LIBRARY_PATH=', os.environ.get('DYLD_LIBRARY_PATH') or '')"

If the inside value is empty (or different), downstream tooling can silently pick different dynamic libraries and fall back to slower paths.
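
Spot-checking two variables is enough for this bug, but the general form of the "smoking gun" check is a full environment diff. A small helper makes that mechanical (the script and file names are illustrative):

# dump_env.py: print the environment in a stable order for diffing.
import os
import sys

for key in sorted(os.environ):
    sys.stdout.write(f"{key}={os.environ[key]}\n")

# Usage:
#   python dump_env.py > outside.txt
#   codex exec -- python dump_env.py > inside.txt
#   diff outside.txt inside.txt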

Concretely, I:

  • Correlated the introducing change with reports consistent with "environment disappeared" behavior.
  • Validated env var inheritance behavior directly inside subprocess tool calls.
  • Used a small BLAS/CUDA-sensitive workload (sketched below) to quantify the performance impact of a library discovery fallback.
  • Documented a minimal reproduction + representative measurements so maintainers could validate quickly.
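
The workload does not need to be elaborate; a dense matrix multiply is enough to separate an accelerated BLAS from a naive fallback. A sketch of the kind of harness I mean (sizes and iteration counts are illustrative, not the exact setup behind the numbers reported later):

# bench_blas.py: time a BLAS-bound workload; run in the shell and via the
# tool runner, then compare wall-clock times.
import time
import numpy as np

a = np.random.default_rng(0).random((2000, 2000))
t0 = time.perf_counter()
for _ in range(5):
    a @ a
print(f"5 matmuls: {time.perf_counter() - t0:.3f}s")

A large gap between the two runs, together with a changed BLAS filepath from the check above, points at library discovery rather than the model, the network, or the code under test.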

The Fix

What Shipped Upstream

Security hardening is valuable, but in a developer CLI it must not silently rewrite the user's execution environment.

Stripping LD_* variables can reduce certain injection risks, but doing so by default in a dev-facing tool runner breaks correctness for legitimate workflows. I suggested making maximum hardening opt-in; upstream shipped a pragmatic equivalent:

  • Remove pre-main hardening from the Codex CLI (restoring environment inheritance for subprocesses).
  • Keep pre-main hardening in the responses API proxy where it is more appropriate.
Collaboration
  • My contribution: isolate the root cause, produce a minimal repro, attach representative measurements, and write an issue maintainers could validate quickly.
  • Upstream outcome: fix merged and shipped; behavior called out in release notes.
Release Notes

Release notes excerpt:

"Special thanks to @johnzfitch for the detailed investigation and write-up in #8945."

[Screenshot: release notes crediting @johnzfitch, noting restored LD_LIBRARY_PATH/DYLD_LIBRARY_PATH inheritance]

Measured Impact

Performance varies by workload and environment. The key point is the failure mode: stripping env vars can force slow, silent fallbacks.

Representative measurements from my verification:

Workload                                            Before            After      Speedup
MKL/BLAS (repro harness)                            ~2.71s            ~0.239s    11.3x
CUDA workflows (library discovery / GPU fallback)   100x-300x slower  restored   varies

Why This Matters

Beyond One Bug

This is the kind of engineering failure that only shows up in real-world environments:

  • Subprocess correctness is a product feature: tools must behave the same "inside Codex" as they do in the user's terminal.
  • Security controls must be explicit, not surprising.
  • Performance regressions can hide inside "correct" behavior when the system silently falls back.
  • When the substrate is wrong, everything built on top of it pays the tax (tooling, orchestration, higher-level features).

What This Demonstrates

Recruiter-Relevant

  • Deep systems debugging (pre-main execution, env inheritance, dynamic linking)
  • Performance engineering with hard evidence
  • Security tradeoff reasoning grounded in practical threat models
  • High-quality upstream collaboration (clear issue, minimal reproduction, verified fix, shipped release)

Why This Helped Shipping Velocity

Substrate bugs are expensive because they distort everything built on top:

  • Tool calls become slower or flakier for affected environments.
  • Performance investigations get noisy (it looks like "model slowness" or "network issues").
  • Higher-level features that rely on predictable tool execution (orchestration, planning, collaboration) become harder to validate.

What I Would Add Next

Engineering Hygiene

  • Integration tests asserting env var inheritance for subprocess execution (a sketch follows this list)
  • A documented "secure mode" switch with explicit tradeoffs
  • A debug command to dump the effective execution environment (for users and support)
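
The first item is cheap to state as a test. A pytest-style sketch (the sentinel path is arbitrary, codex is assumed to be on PATH, and the invocation may need adjusting to the runner's actual output format):

# test_env_inheritance.py: assert the tool runner passes LD_LIBRARY_PATH through.
import os
import subprocess

def test_subprocess_inherits_ld_library_path():
    env = dict(os.environ, LD_LIBRARY_PATH="/tmp/sentinel-libs")
    out = subprocess.run(
        ["codex", "exec", "--", "python", "-c",
         "import os; print(os.environ.get('LD_LIBRARY_PATH', ''))"],
        env=env, capture_output=True, text=True, check=True,
    )
    assert out.stdout.strip() == "/tmp/sentinel-libs"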