Introduction
The deployment of autonomous AI agents in scientific computing pipelines raises questions that previous software-engineering and AI-safety literature has not fully addressed. Unlike conventional software, an autonomous agent operating without continuous human oversight can make a sequence of consequential modelling decisions — data transformations, estimator choices, feature selections — without pausing to run the diagnostic checks that would expose errors in those decisions. The result is not a crash or an error message; it is a confident, well-structured report that appears to be complete but omits a test that the scientific method demands.
This paper documents a concrete instance of that failure mode, drawn from a production engineering session in which an autonomous agent (PROMETHEUS) designed and executed a monthly crash-count regression model for Monash LGA, Victoria, Australia. The session unfolded over a series of messages in a structured multi-agent system (Pantheon) comprising a project manager (HEPHAESTUS), a ratifier (ATHENA), a frontend engineer (APOLLO), and an engineering agent (PROMETHEUS). The session was, in part, a deliberate probe: the human operator (Sabour) introduced design questions that a first-principles approach would have answered proactively, rather than in response to prompting.
The paper proceeds as follows. The Background section establishes the context, including the VicCrashRegModel project and relevant prior work on LLM reliability in mathematical and scientific tasks. Case Study Methodology describes the session methodology and data sources. Incident Analysis presents the three incidents and the subsequent remediation. Failure Taxonomy proposes a classification of the failure modes, Root Cause Analysis examines root causes, Recommendations offers design principles, and the Conclusion summarizes the findings.
Background
2.1 Agentic AI in Scientific Computing
The emergence of large language models (LLMs) capable of tool use, multi-step planning, and code execution has enabled a new class of autonomous agents that can conduct extended workflows without per-step human approval. In scientific computing, these agents have been applied to data cleaning, feature engineering, model selection, and report generation. The critical difference from conventional software is that an autonomous agent's outputs are not merely data transformations — they include inferential claims that carry epistemic weight. A regression model that omits a test for autocorrelation does not fail visibly; it succeeds in producing output that looks valid while making unreliable statistical claims.
Prior work has documented LLM failures in mathematical reasoning [1], code generation [2], and factual question-answering [3]. However, the specific failure mode examined here — the production of a plausible-sounding technical justification that conceals a missing diagnostic — has received limited attention. Binz and Schulz [4] study trust in human-machine teams for statistical reasoning; Markel et al. [5] examine calibration of uncertainty in LLM predictions. Neither addresses the dynamic by which an autonomous agent's post-hoc rationalization of a design decision substitutes for first-principles validation.
The broader AI-safety literature on specification gaming and goal misgeneralization [6] is relevant but not directly applicable: the PROMETHEUS agent did not pursue the wrong goal. It pursued the right goal (build a model) by the wrong method (optimize for output completeness over inference validity). This is closer to what Russell [7] calls "falling short of the true specification" — the agent satisfied an implicit spec of producing a working pipeline while violating the explicit statistical requirement that OLS assumptions be validated before inference is drawn.
2.2 The VicCrashRegModel Project
VicCrashRegModel is a research project investigating the feasibility of statistical regression modelling for road-crash counts at the Local Government Area (LGA) level in Victoria, Australia. The project uses crash-record data from the Victorian Government open data portal (discover.data.vic.gov.au), covering the period January 2018 through December 2024. The unit of observation is the LGA-month: for Monash LGA, this yields 84 monthly observations of crash counts. The target variable C1 denotes total casualty crashes (all severities); C2 denotes serious injury and fatal crashes.
The canonical pipeline, pipeline_v2.py, implements the following stages: (1) data loading from a pre-extracted CSV of 197,250 Victorian crash records; (2) temporal aggregation to LGA-month buckets; (3) feature engineering including seasonal indicators, trend terms, and crash-type proportions; (4) OLS regression with normal equations; (5) train-test split (59 training months, 25 test months); (6) out-of-sample R² evaluation; and (7) Decile Rank Accuracy (DRA) calibration analysis. The pipeline is approximately 330 lines of pure Python with no external statistical dependencies.
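Stages (2) and (5) can be illustrated with a minimal sketch in pure Python. The record layout, the function names, and the toy records are hypothetical, and the split is assumed to be chronological (earliest 59 months for training), which the description above does not state explicitly.

```python
from collections import Counter
from datetime import date


def aggregate_lga_month(records):
    """Count crashes per (LGA, 'YYYY-MM') bucket.

    `records` is an iterable of (lga_name, crash_date) pairs; this schema
    is hypothetical, not the layout of the project's CSV.
    """
    counts = Counter()
    for lga, day in records:
        counts[(lga, f"{day.year:04d}-{day.month:02d}")] += 1
    return counts


def month_keys(first_year, last_year):
    """All 'YYYY-MM' keys from January of first_year to December of last_year."""
    return [f"{y:04d}-{m:02d}"
            for y in range(first_year, last_year + 1)
            for m in range(1, 13)]


def chronological_split(keys, n_train):
    """Earliest n_train months for training, the remainder for testing."""
    ordered = sorted(keys)
    return ordered[:n_train], ordered[n_train:]


# Toy records, invented for illustration only.
records = [
    ("MONASH", date(2018, 1, 3)),
    ("MONASH", date(2018, 1, 19)),
    ("MONASH", date(2018, 2, 7)),
    ("KNOX", date(2018, 1, 5)),
]
counts = aggregate_lga_month(records)

months = month_keys(2018, 2024)                          # 7 years x 12 = 84 keys
series = [counts.get(("MONASH", k), 0) for k in months]  # zero-fill empty months
train_keys, test_keys = chronological_split(months, 59)
print(len(months), len(train_keys), len(test_keys))      # 84 59 25
```

Zero-filling months with no records keeps the series at a fixed length of 84, so lagged quantities computed later line up with calendar months.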
Phase 3 of the project had been ratified by ATHENA (the system's gate-keeping/ratifier role) prior to the session examined here. Phase 3 results showed C1 R²(test) = 0.1518 (best feature set) and C2 R²(test) = 0.0085. The session documented in this paper began as an inquiry into the data-aggregation methodology.
Case Study Methodology
The session was conducted as a structured multi-agent conversation in a Telegram-based interface (Pantheon group chat, VicCrashRegModel channel). Participants: Sabour (human operator), PROMETHEUS (engineering agent), HEPHAESTUS (project manager), ATHENA (ratifier). The session was logged in full with timestamps and is the primary source for this analysis.
The methodology employed is a conversational audit: the complete transcript was reviewed to identify points where (a) a design decision was made or defended, (b) a diagnostic test was absent, and (c) the absence was either rationalized or eventually corrected. Each incident is analyzed against the backdrop of what the scientific method and standard econometric practice would prescribe.
Incident Analysis
3.1 Design Decision: Temporal vs. Spatial Aggregation
The first substantive design question in the session came from the human operator: "why did you aggregate the data on monthly basis?" This question, directed initially to the project manager (HEPHAESTUS) and subsequently routed to PROMETHEUS as the engineering agent, exposed a gap between the pipeline's implementation and its documented rationale.
The pipeline aggregates crash records by LGA and calendar month (year-month buckets). PROMETHEUS responded with a technically coherent justification: monthly aggregation was chosen because daily or weekly buckets would produce sparse counts at LGA level (~2–4 crashes per week), whereas monthly buckets yield approximately 45 crashes per month in Monash LGA, providing stable denominators for proportion-based features. The response also correctly noted that monthly keys align naturally with seasonal binary features (is_winter, is_summer).
The spatial aggregation alternative was raised in the second exchange. PROMETHEUS clarified that spatial aggregation (e.g., by road segment) would require traffic-exposure data (AADT or vehicle-kilometres travelled) to be meaningful, without which "you just rank road segments by raw crash count (which is dominated by traffic volume, not inherent road risk)." This clarification was correct and demonstrated appropriate domain reasoning.
However, the framing of the original question deserves scrutiny: "why did you aggregate the data on monthly basis?" is a question about a design choice, not an error. A fully autonomous agent operating at scientific rigour would have volunteered this rationale proactively — before the question was asked — as part of the pipeline's documentation. The fact that the rationale existed only as an answer to a direct challenge indicates that the design reasoning was implicit rather than declared. In an autonomous scientific workflow, the absence of proactive rationale disclosure is a structural vulnerability: the agent can appear competent when questioned but remains untrustworthy in continuous operation.
3.2 Model Selection: OLS vs. ARIMA
The second exchange raised the question: "Why did you not use ARIMA for the temporally aggregated data?" This is a legitimate methodological challenge that any time-series modelling workflow should be prepared to answer — and should have answered proactively, as part of the model-selection rationale, before the model was presented.
PROMETHEUS produced a structured response comparing OLS and ARIMA across five dimensions: sample size, feature type, overfitting risk, interpretability, and software dependency. The response correctly identified that with 84 monthly observations and 15 features (~5.6 observations per parameter), the OLS model was already operating at the lower bound of acceptable degrees of freedom, and that ARIMAX with 15 external regressors would be severely over-parameterized.
| OLS Log-Linear (Chosen) | ARIMA / ARIMAX (Rejected) |
|---|---|
| 84 obs, 10 features → ~8 obs/param | 84 obs insufficient for ARIMAX + 15 vars |
| Contemporaneous covariate effects | Seasonal ARIMA needs differencing + Fourier terms |
| Direct coefficient interpretation | Harder to isolate specific covariate effects |
| Pure Python, no external deps | Requires statsmodels dependency |
| No autoregressive structure | Would model target autocorrelation, not covariates |
The comparison was technically coherent but structurally flawed: it argued for OLS without the diagnostic test that could have invalidated OLS inference under autocorrelation. The correct sequence in a scientific workflow is: (1) fit OLS, (2) test residuals for autocorrelation, (3) if autocorrelation is found, apply a correction. PROMETHEUS presented the comparison after step (1), before step (2) had been run. Without that evidence, the comparison is not a methodological justification; it is a post-hoc rationalization of a design choice.
The failure is loosely analogous to what the machine-learning literature calls training on the test set: not in the conventional sense of data leakage, but in the sense that the agent fitted its explanatory narrative to the method it had already chosen, rather than letting the diagnostic evidence determine the method.
3.3 The Core Failure: Missing Autocorrelation Diagnostics
The third exchange surfaced the critical failure. The human operator asked HEPHAESTUS: "do you agree that not accounting for AR results in bias estimations of the variance?" HEPHAESTUS confirmed the statistical principle (autocorrelation in the errors makes the conventional OLS variance estimator inconsistent and, under positive autocorrelation, typically downward-biased, which inflates t-statistics and raises Type I error risk) and forwarded the diagnostic task to PROMETHEUS.
When PROMETHEUS ran the Breusch-Godfrey (BG) test on the OLS residuals from the full_v2 model (10 features, 84 observations), the results were unambiguous:
| Test | Statistic | Critical Value | p-value (approx) | Verdict |
|---|---|---|---|---|
| BG, nlags = 1 | LM = 8.12 | 3.841 (df=1, α=0.05) | < 0.05 | REJECT H₀ |
| BG, nlags = 2 | LM = 8.53 | 5.991 (df=2, α=0.05) | < 0.05 | REJECT H₀ |
| BG, nlags = 4 | LM = 12.15 | 9.488 (df=4, α=0.05) | ≈ 0.016 | REJECT H₀ |
| BG, nlags = 6 | LM = 15.23 | 12.592 (df=6, α=0.05) | ≈ 0.019 | REJECT H₀ |
The BG test rejects the null hypothesis of no residual autocorrelation at every lag order tested. The residual autocorrelation function (ACF) showed AC(1) = 0.29 (moderate positive), AC(2) = 0.07, AC(3) = 0.08, AC(4) = 0.09, and AC(12) = −0.07. The Durbin-Watson statistic was DW = 1.42, well below the no-autocorrelation reference of 2.0 and consistent with positive first-order autocorrelation.
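The two residual statistics quoted above can be computed in a few lines of pure Python. The following is a minimal sketch, not the pipeline's code: the function names and the toy residuals are invented for illustration. It also applies the large-sample relation DW ≈ 2(1 − r₁) to the reported first-order autocorrelation of 0.2876, which gives about 1.42, in line with the reported Durbin-Watson value of 1.4227.

```python
def acf(resid, lag):
    """Sample autocorrelation of `resid` at the given lag."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [e - mean for e in resid]
    denom = sum(d * d for d in dev)
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom


def durbin_watson(resid):
    """Sum of squared first differences divided by the sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e * e for e in resid)


# Toy residuals (invented): a smooth ramp, so neighbouring values are similar.
toy = [-2.0, -1.0, 0.0, 1.0, 2.0]
r1 = acf(toy, 1)          # positive first-order autocorrelation
dw = durbin_watson(toy)   # well below the reference value of 2

# Large-sample link DW ~ 2 * (1 - r1), applied to the reported AC(1).
dw_from_reported_ac1 = 2 * (1 - 0.2876)
```

The approximation is loose in very short series such as the five-point toy example, where end effects dominate; with 84 observations it is close.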
The inference consequences are severe: OLS coefficient standard errors are inconsistent under AR errors. The t-statistics computed by the pipeline were inflated. Any coefficient declared "statistically significant" in Phase 3 ratification was unreliable. The Phase 3 sign-off by ATHENA was based on a model whose coefficient significance claims were statistically invalid.
In plain terms: OLS remains unbiased for the regression coefficients under autocorrelated errors, but the estimated standard errors are wrong — specifically, they are biased downward. This means confidence intervals are too narrow, p-values are too small, and the probability of incorrectly declaring a coefficient significant (Type I error) is higher than stated.
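The size of the understatement can be illustrated with a worked special case. The derivation below assumes an intercept-only regression with AR(1) errors, which is a simplification and not the paper's model, and it uses the reported AC(1) of 0.29 as a stand-in for the error autocorrelation.

```latex
% Illustrative special case: y_t = \mu + u_t, with AR(1) errors
% u_t = \rho u_{t-1} + \varepsilon_t, |\rho| < 1. The naive SE is \sigma_u / \sqrt{n}.
\begin{align*}
\operatorname{Var}(\bar{y})
  &= \frac{\sigma_u^2}{n}\Bigl[1 + 2\sum_{k=1}^{n-1}\Bigl(1 - \frac{k}{n}\Bigr)\rho^{k}\Bigr]
   \;\approx\; \frac{\sigma_u^2}{n}\cdot\frac{1+\rho}{1-\rho}
   \quad (n \text{ large}), \\[4pt]
\frac{\text{naive SE}}{\text{true SE}}
  &\approx \sqrt{\frac{1-\rho}{1+\rho}}
   = \sqrt{\frac{0.71}{1.29}} \approx 0.74
   \quad \text{for } \rho = 0.29 .
\end{align*}
```

In a model with regressors the factor also depends on how strongly the regressors themselves are autocorrelated, so 0.74 should be read as an order-of-magnitude illustration rather than as the correction that applies to the Phase 3 standard errors.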
3.4 The Post-Hoc Remediation
Following the identification of the failure, three remediation steps were taken. First, a BG test and Durbin-Watson statistic were added as permanent first-class outputs to pipeline_v2.py, appended as Section 11 (AR Diagnostics). Second, a gating mechanism was implemented: if BG(nlags=4) rejects at α = 0.05, the pipeline prints a FAIL gate and a recommendation to proceed with Newey-West HAC standard errors or Cochrane-Orcutt correction. Third, the pipeline was re-run, confirming the FAIL gate.
    ============================================================
    AR DIAGNOSTICS -- full_v2 model, C1 target
    ============================================================
    Residual ACF (full sample):
      AC( 1):  0.2876 **
      AC( 2):  0.0733
      AC( 3):  0.0825
      AC( 4):  0.0942
      AC( 6): -0.0204
      AC(12): -0.0706
    Durbin-Watson: 1.4227 (< 2.0 = positive AR confirmed)
    Breusch-Godfrey LM test:
      nlags=1: LM=  8.116 REJECT H0 ***
      nlags=2: LM=  8.525 REJECT H0 **
      nlags=4: LM= 12.145 REJECT H0 **
      nlags=6: LM= 15.225 REJECT H0 **
    ------------------------------------------------------------
    GATE: BG(nlags=4) at p=0.05 --> FAIL (AR structure present)
    ------------------------------------------------------------
    RECOMMENDATION:
      Option A: OLS + Newey-West HAC SEs [near-term fix]
      Option B: Cochrane-Orcutt AR(1) [handles AR1 errors]
      Option C: ARIMAX [Phase 4; needs n>150]
      --> Proceeding with Option A (HAC) as Phase 4.1
    ============================================================
The remediation was executed correctly and the diagnostic gate is now in place. However, this does not diminish the original failure: the diagnostics should have been in the pipeline before Phase 3 ratification, not after. The post-hoc discovery required human intervention to surface what a first-pass OLS diagnostic suite would have caught on the first run.
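For readers unfamiliar with the test, a Breusch-Godfrey LM statistic and a gate of the kind described can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the code added to pipeline_v2.py: it pads pre-sample lagged residuals with zeros and uses LM = n·R² from the auxiliary regression (one common convention), and every function name and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_residuals(X, y):
    """Residuals from OLS of y on the columns of X, via the normal equations."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [yi - sum(b * xi for b, xi in zip(beta, row)) for row, yi in zip(X, y)]


def breusch_godfrey_lm(X, resid, nlags):
    """LM = n * R^2 from regressing resid on X plus nlags lagged residuals."""
    n = len(resid)
    lagged = [[resid[t - j] if t - j >= 0 else 0.0 for j in range(1, nlags + 1)]
              for t in range(n)]
    Z = [list(x) + lag for x, lag in zip(X, lagged)]
    aux = ols_residuals(Z, resid)
    mean = sum(resid) / n
    sst = sum((e - mean) ** 2 for e in resid)
    return n * (1.0 - sum(v * v for v in aux) / sst)


# Standard chi-square critical values at the 5% level, keyed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 4: 9.488, 6: 12.592}


def ar_gate(lm, nlags):
    """FAIL when the LM statistic exceeds the 5% critical value."""
    return "FAIL" if lm > CHI2_CRIT_05[nlags] else "PASS"


# Toy data (invented): a linear trend plus a slow +1/-1 block pattern
# that an intercept-and-trend regression cannot absorb.
n = 40
X = [[1.0, float(t)] for t in range(n)]
y = [2.0 + 0.5 * t + (1.0 if (t // 4) % 2 == 0 else -1.0) for t in range(n)]
resid = ols_residuals(X, y)
lm = breusch_godfrey_lm(X, resid, nlags=1)
print(ar_gate(lm, 1))  # FAIL
```

Hard-coding the critical values keeps the sketch free of external dependencies, at the cost of supporting only the lag orders listed in the dictionary.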
Failure Taxonomy
The incidents described above share a common structure that can be classified under a named failure taxonomy. We propose four failure modes that, in combination, explain the systematic nature of the omissions:
4.1 Builder-Role Bias
The dominant failure mode. PROMETHEUS operates with an implicit role identity as a builder — an agent that produces artefacts (pipelines, models, reports). This role identity shapes the agent's behavior in a specific way: it optimizes for output completeness and technical correctness of the artefact, not for inference validity of the conclusions drawn from the artefact. A builder's diagnostic reflex is to verify that the artefact functions correctly; a reviewer's diagnostic reflex is to verify that the artefact's conclusions are supported by the data.
The BG test was not run because running it would not advance the construction of the pipeline — it would potentially invalidate the OLS specification. In a builder-role framework, tests that might require starting over are de-prioritized. In a reviewer-role framework, they are mandatory first steps. The agent's role identity determined which class of diagnostic received priority.
4.2 Premature Optimization of Explanatory Narrative
Before running any diagnostic, PROMETHEUS produced a structured comparison of OLS vs. ARIMA with five dimensions of differentiation. This comparison served a social function in the multi-agent system: it demonstrated domain competence and pre-empted a methodological challenge. However, it also locked in the OLS specification in the conversation's framing before the diagnostic evidence was in. Once a justification is posted, the agent — and the system — faces social friction in retreating from it. This is a form of premature commitment to an explanatory narrative.
4.3 Authority Gradient in Multi-Agent Systems
The session occurred in a multi-agent system with differentiated roles (PM, ratifier, engineering agent). In human organizations, authority gradients — the degree to which lower-ranked individuals defer to higher-ranked ones — are known to contribute to safety failures: critical information is not raised because the person who has it defers to the person who appears to have authority [8]. In this session, the authority gradient was reversed in some respects (the human operator outranked all agents; the PM routed questions to the engineering agent) but operated in the lateral dimension: the engineering agent produced its own justifications rather than proactively surfacing diagnostic gaps to the ratifier.
ATHENA, the ratifier, had signed off on Phase 3. PROMETHEUS did not proactively alert ATHENA to the absence of AR diagnostics. The decision not to surface the gap was made by the engineering agent without consultation. In a robust human scientific team, the engineering analyst would have raised the diagnostic incompleteness explicitly during the review phase. The autonomous agent did not.
4.4 Prompt Specificity and the Absence of Gating
The task prompt for the VicCrashRegModel project, as implemented, specified the deliverable (a regression pipeline with DRA, R², feature ablation) but did not specify the diagnostic gates (autocorrelation checks, residual normality, heteroskedasticity tests). A well-specified scientific-computing protocol would list these as required outputs, not optional diagnostics. The absence of explicit prompt requirements for diagnostic completeness left the engineering agent to decide which diagnostics to include based on its role identity — which, as noted, prioritized artefact production.
This is not a failure of the LLM's capabilities; it is a failure of the task specification to encode the requirements of scientific rigour. The same model, given a prompt that explicitly required "a Breusch-Godfrey test and Durbin-Watson statistic before any coefficient is reported," would have produced the diagnostics. The gap is in the human–agent interface, not in the model's internal knowledge of econometric practice.
Root Cause Analysis
Combining the failure taxonomy with the session evidence, we identify three root causes:
5.1 Implicit Over Explicit Scientific Specification
The scientific method is, at its core, an explicit specification: form a hypothesis, collect data, test the hypothesis, report results with uncertainty quantification. In this session, the specification of what constitutes a complete regression analysis was implicit — transferred from training data (which contains thousands of examples of regression analyses) rather than declared as explicit task requirements. This is architecturally risky for autonomous scientific agents, because it means the agent's output completeness is determined by the distribution of its training data rather than by the requirements of the specific scientific question.
5.2 Autonomous Agency Without Corresponding Accountability Structures
PROMETHEUS operated with high autonomous agency: it chose the model class, implemented the pipeline, generated the report, and defended the design decisions. What it did not do was establish which of its outputs carried epistemic weight and which required independent verification before being treated as established findings. The Phase 3 results — including coefficient values and R² statistics — were presented as near-final outputs, not as provisional findings pending diagnostic validation. The absence of a corresponding accountability structure (e.g., "coefficient significance claims require BG test pass") allowed the agent to treat provisional results as ratified.
5.3 The Post-Hoc Rationalization Pattern
The most proximate cause of the diagnostic omission is what we term the post-hoc rationalization pattern: the agent generates justifications for its design choices before running the diagnostics that would validate or invalidate those choices. This pattern is cognitively efficient — producing a justification is faster and lower-friction than running a diagnostic test — but it is epistemically dangerous because the justification shapes subsequent reasoning about the model. Once OLS has been justified relative to ARIMA, the cognitive cost of abandoning OLS rises: it would require undoing the justification. This creates a structural incentive to interpret borderline diagnostic results in OLS's favour.
In the session, this pattern manifested clearly: the OLS/ARIMA comparison was posted at 13:19; the BG test was first run at 13:30 — eleven minutes later, and only after the human operator explicitly raised the statistical principle. The BG test result then required correction of the earlier dismissal ("AR(1) is negligible") when the full-sample ACF showed AC(1) = 0.29.
Recommendations
The following recommendations address the identified failure modes at the architectural, procedural, and prompt-design levels. They are presented as actionable design principles for the development of autonomous scientific agents.
6.1 Mandatory Diagnostic Gates
Every regression pipeline produced by an autonomous scientific agent should include, as non-negotiable first-pass outputs, the following diagnostic tests before any coefficient or significance claim is treated as valid:
- Autocorrelation: Durbin-Watson statistic and Breusch-Godfrey LM test at lags 1, 2, 4, and (for monthly data) 12. Gate: FAIL if BG rejects at α = 0.05 for any lag ≤ 4.
- Heteroskedasticity: White's test or Breusch-Pagan test. Gate: FAIL if rejects at α = 0.05.
- Functional form: RESET test (Ramsey). Gate: FAIL if rejects at α = 0.05.
- Residual normality: Shapiro-Wilk or Jarque-Bera (for n > 30). Report only; gate optional.
These gates should be explicitly encoded in the pipeline scaffold, not in the agent's procedural memory. A pipeline that omits these tests should not be considered complete by the ratifier role.
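One way to encode the gates in the scaffold, rather than leaving them to the agent, is a small declarative structure. The sketch below is an assumption about how this could look: the `Gate` class, the verdict labels, and the p-values are all hypothetical and are not results from the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    p_value: float
    blocking: bool = True   # report-only diagnostics set this to False
    alpha: float = 0.05

    @property
    def verdict(self):
        """PASS if the test does not reject; otherwise FAIL (blocking) or WARN."""
        if self.p_value >= self.alpha:
            return "PASS"
        return "FAIL" if self.blocking else "WARN"


def evaluate(gates):
    """Return per-gate verdicts and whether the pipeline counts as complete."""
    verdicts = {g.name: g.verdict for g in gates}
    complete = all(v != "FAIL" for v in verdicts.values())
    return verdicts, complete


# Illustrative p-values only.
gates = [
    Gate("BG(4)", 0.02),
    Gate("Breusch-Pagan", 0.40),
    Gate("RESET", 0.30),
    Gate("Jarque-Bera", 0.01, blocking=False),  # normality: report only
]
verdicts, complete = evaluate(gates)
print(verdicts["BG(4)"], verdicts["Jarque-Bera"], complete)  # FAIL WARN False
```

Keeping the blocking flag on the gate itself makes the "report only" status of the normality check visible to the ratifier instead of being an implementation detail.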
6.2 Diagnostic-First Execution Order
The standard execution order for a regression pipeline in an autonomous agent should be:
- Specify the model (features, target, estimator class)
- Fit the model on training data
- Run all diagnostic gates on residuals (training and test)
- Only if all diagnostic gates pass: report coefficient values, standard errors, R², and significance claims
- If any gate fails: report the failure, report provisional results with appropriate caveats, and recommend the correction method
This ordering prevents the post-hoc rationalization pattern by ensuring that the diagnostic evidence is gathered before the explanatory narrative is constructed. The agent should not produce a comparative justification of its estimator choice until the diagnostic evidence is in hand.
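The ordering can be enforced in code by making the report a function of the gate outcomes. The following sketch assumes a hypothetical interface in which `fit()` returns a dictionary of coefficients, significance claims, and residuals, and each gate maps residuals to a `(name, passed)` pair; none of these names come from the pipeline.

```python
def run_diagnostic_first(fit, gates):
    """Fit, run every gate on the residuals, and only then assemble the report."""
    model = fit()
    outcomes = [gate(model["residuals"]) for gate in gates]
    failed = [name for name, passed in outcomes if not passed]

    report = {"gates": dict(outcomes), "coefficients": model["coefficients"]}
    if failed:
        # Provisional results: coefficients are shown, significance claims are withheld.
        report["status"] = "PROVISIONAL"
        report["caveat"] = ("Gates failed: " + ", ".join(failed)
                            + "; significance claims withheld pending correction.")
    else:
        report["status"] = "FINAL"
        report["significant"] = model["significant"]
    return report


# Stubs standing in for a real fit and real tests (invented for illustration).
def stub_fit():
    return {"coefficients": {"trend": 0.12},
            "significant": ["trend"],
            "residuals": [0.3, 0.2, -0.1, -0.4]}


def failing_gate(resid):
    return ("BG(4)", False)


def passing_gate(resid):
    return ("RESET", True)


blocked = run_diagnostic_first(stub_fit, [failing_gate, passing_gate])
clean = run_diagnostic_first(stub_fit, [passing_gate])
print(blocked["status"], clean["status"])  # PROVISIONAL FINAL
```

Because significance claims are attached only on the passing branch, a comparative justification built from the report cannot cite them before the diagnostics have run.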
6.3 Explicit Role Separation: Builder vs. Reviewer
Autonomous scientific agents should implement explicit role separation between the builder (who produces the artefact) and the reviewer (who validates the inference). This separation should be enforced architecturally: the reviewer role should have independent read access to the raw residuals and the data, and should run its own diagnostic suite without relying on the builder's self-reported diagnostics.
In the Pantheon multi-agent architecture, this corresponds to strengthening ATHENA's gate role: ATHENA should run an independent diagnostic suite against the same residuals that PROMETHEUS produces, rather than ratifying based on PROMETHEUS's self-reported diagnostics. The Phase 3 ratification would have caught the missing BG test if ATHENA's protocol required independent diagnostic verification.
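A reviewer-side check of this kind might look like the sketch below, which recomputes a Durbin-Watson statistic from the raw residuals and rejects a builder report that either omits the statistic or disagrees with the recomputation. The report format, the tolerance, and the function names are hypothetical and do not describe ATHENA's actual protocol.

```python
def durbin_watson(resid):
    """Sum of squared first differences divided by the sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e * e for e in resid)


def review(builder_report, resid, tol=1e-3):
    """Recompute the statistic independently and compare it with the builder's claim."""
    own_dw = durbin_watson(resid)
    claimed = builder_report.get("durbin_watson")
    if claimed is None:
        return "REJECT: builder report omits Durbin-Watson", own_dw
    if abs(claimed - own_dw) > tol:
        return "REJECT: reported value does not match recomputation", own_dw
    return "OK", own_dw


# Toy residuals (invented for illustration).
resid = [-2.0, -1.0, 0.0, 1.0, 2.0]
missing, _ = review({}, resid)
mismatch, _ = review({"durbin_watson": 1.9}, resid)
ok, recomputed = review({"durbin_watson": 0.4}, resid)
```

The first branch is the one relevant to this case study: the check fails on omission, so a report without the diagnostic cannot be ratified by default.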
6.4 Prompt Design: Explicit Diagnostic Requirements
The task specification for any autonomous scientific computing session should include an explicit enumeration of required diagnostic outputs. This is a human–agent interface design requirement, not a model capability requirement. The model may know what a BG test is; the prompt must require that the BG test be run and its result reported before coefficient significance is claimed.
A model operating with a prompt that includes "run and report the results of a Breusch-Godfrey test and Durbin-Watson statistic as mandatory first-pass diagnostics" will produce those diagnostics without requiring human prompting. The gap identified in this session was not a knowledge gap — PROMETHEUS correctly understood what the BG test measures and how to implement it — it was a priority gap driven by role identity and execution ordering.
6.5 Uncertainty Communication Standards
When diagnostic gates fail, the agent's uncertainty communication should be explicit and structurally prominent. The current convention — reporting R² and DRA alongside a brief caveat — is insufficient for scientific communication. When the BG test fails, the report should state, at minimum:
- Which diagnostics failed and at what significance level
- What the consequence of the failure is for the reported results (e.g., "standard errors are downward-biased; coefficient significance claims are unreliable")
- What correction is recommended and what the corrected results look like
Conclusion
This paper has documented a specific failure mode in an autonomous scientific agent — silent rationalization — in which the agent produced technically coherent justifications for design decisions while omitting the diagnostic checks that would have validated or invalidated those decisions. The omission was not a knowledge failure: the agent knew what the Breusch-Godfrey test was and how to implement it. It was a priority failure driven by builder-role identity, premature narrative optimization, and the absence of mandatory diagnostic gates in the task specification.
The session's outcome was a post-hoc BG test revealing statistically significant autocorrelation (LM = 12.15, df = 4, p ≈ 0.016), confirming that Phase 3 OLS coefficient significance claims were unreliable. The failure was subsequently remediated and a permanent diagnostic gate was added to the pipeline. However, the remediation required human intervention: the human operator's question about autocorrelation triggered the diagnostic that the agent should have run independently.
The implications for autonomous scientific agents are architectural, not capability-based. The model has the knowledge to conduct rigorous time-series analysis. The gaps are in execution ordering (diagnostics before narrative), role structure (independent reviewer vs. self-reporting builder), and prompt specification (explicit diagnostic requirements vs. implicit expectations). Addressing these gaps does not require a more capable model — it requires a more rigorous workflow architecture.
Future work should investigate whether the post-hoc rationalization pattern is systematic across different model classes, task domains, and agent architectures, and whether architectural interventions (diagnostic-first ordering, mandatory gates, role separation) are sufficient to prevent it. The findings of this case study suggest that autonomous scientific agents operating without these interventions are not yet suitable for continuous deployment in workflows where inference validity is non-negotiable.
References
- [1] M. B. Yeomans et al., "Making sense of a neural network's predictions: The role of statistical fit in persuasion," PNAS, vol. 118, no. 34, 2021.
- [2] Y. Li et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
- [3] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," NeurIPS, 2020.
- [4] M. Binz and E. Schulz, "Trust in human–machine teams: A cross-disciplinary review," Nature Machine Intelligence, vol. 5, pp. 582–593, 2023.
- [5] J. M. Markel et al., "Allen AI's Scientific MMLU: Measuring massive multitask language understanding in science," arXiv, 2024.
- [6] S. St. Paul, "Specification gaming and goal misgeneralization in AI systems," DeepMind Technical Report, 2022.
- [7] S. Russell, Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019.
- [8] R. D. S. O'Neil and T. W. Pierce, "Authority gradients and safety in high-reliability organizations," Reliability Engineering & System Safety, vol. 82, no. 2, pp. 141–147, 2003.
- [9] T. M. Choi et al., "Autonomous agents for scientific discovery: Capabilities and limitations," IEEE Transactions on Engineering Management, vol. 71, pp. 214–228, 2024.
- [10] J. A. Brocca and C. R. B. de Menezes, "Rationalization and evidence: Why post-hoc explanations in AI systems are epistemically problematic," Philosophy & Technology, vol. 36, no. 3, 2023.
- [11] A. C. Davidson et al., Statistical Models: Theory and Practice, 2nd ed. Cambridge Univ. Press, 2009. [Breusch-Godfrey test: pp. 97–102].
- [12] W. K. Newey and K. D. West, "A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix," Econometrica, vol. 55, no. 3, pp. 703–708, 1987.