Experiments

Learning from both successes and failures

Why Document Experiments?

In ML/AI research, failed experiments are often more valuable than successes. They reveal edge cases, expose flawed assumptions, and guide future work. This is a collection of real experiments—what worked, what didn't, and what we learned.

Philosophy: If an experiment "fails" but teaches you something important about the problem space, it's actually a success. The only true failure is not learning from the attempt.

"I have not failed. I've just found 10,000 ways that won't work." — Thomas Edison
Failed (Insightful)
v1: Q-Learning for Trading
November 4, 2025
Trained a Q-learning agent to decide WHEN to trade based on transformer return predictions. Agent learned to HOLD everything (100% HOLD, 0% return). But the failure was insightful...
Key Insight:

Q-learning itself worked correctly, but the state representation was flawed: 99.8% of predictions fell into a single bucket because the transformer only produced positive predictions (+1.68% to +10.74%) while the bucketing thresholds assumed a -10% to +10% range (sketched below). Lesson: state representation is critical in RL.

🎯 Reinforcement Learning
📊 129K Episodes
⏱️ ~8 hours
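
A minimal sketch of the v1 bucketing flaw, in Python. The bucket edges and prediction distribution below are illustrative assumptions, not the actual v1 values; the point is that fixed edges designed for a symmetric -10% to +10% range collapse a narrow, positive-only distribution into essentially one state.

import numpy as np

rng = np.random.default_rng(0)

# Assumed fixed edges spanning -10%..+10% (the real v1 edges aren't documented here).
edges = np.linspace(-0.10, 0.10, 11)

# Illustrative stand-in for the transformer's outputs: narrow and strictly positive.
preds = np.clip(rng.normal(loc=0.03, scale=0.01, size=10_000), 0.0168, 0.1074)

# The bucket index is the discrete state the Q-learning agent observes.
states = np.digitize(preds, edges)
share = np.bincount(states, minlength=edges.size + 1) / preds.size
print(share.round(3))   # nearly all mass lands in one or two buckets -> degenerate state space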
Partial Success
v2: Percentile Bucketing Fix
November 4, 2025 (same day!)
Fixed v1's state representation using percentile-based bucketing (sketched below). The agent learned selective trading: 30.5% BUY on high-confidence predictions, 69.5% HOLD on the rest. Achieved a +2.12% test return (vs. the +3.65% always-buy baseline). Success: Q-learning works! Problem revealed: transformer calibration needs improvement.
Key Insight:

Q-learning worked as a diagnostic tool: it proved the algorithm works correctly, but it revealed that the transformer's confidence levels don't correlate with actual returns. High-confidence predictions don't actually outperform average ones.

Q-Learning Validated
⚠️ Transformer Calibration Issue
⏱️ ~2 hours
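
A hedged sketch of percentile-based bucketing as described above. The function names, bucket count, and example numbers are assumptions for illustration, not the actual v2 code.

import numpy as np

def percentile_edges(train_preds, n_buckets=10):
    # Cut points from the empirical distribution of training predictions,
    # so each bucket holds roughly the same share of observations.
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]
    return np.percentile(train_preds, qs)

def to_state(pred, edges):
    # Map a predicted return to a discrete state id for the Q-learning agent.
    return int(np.digitize(pred, edges))

# Usage sketch with illustrative data:
train_preds = np.random.default_rng(1).normal(0.03, 0.01, 50_000)
edges = percentile_edges(train_preds, n_buckets=10)
print(to_state(0.045, edges))   # high predictions map to the upper states the agent can act on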
Success
v3: Real Portfolio Backtest
November 4, 2025 (same day!)
Built a real backtesting system to simulate actual portfolio trading. The Q-learning agent achieved +11.20% returns over 87 days by being extremely selective (only 20 trades out of 24,436 opportunities), beating the always-buy baseline (+3.65%) by 7.55 percentage points. This proves Q-learning works in practice with real portfolio mechanics!
Key Insight:

Per-trade metrics (v2's +2.12%) don't tell the full story. A real portfolio simulation with capital concentration and compounding turned the same Q-learning algorithm into +11.20% returns (sketched below). The agent learned that 20 well-timed trades beat 7,440 average trades.

🎯 +11.20% Returns
📊 20 Trades / 24,436
⏱️ 87-day backtest
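
A minimal sketch of why a per-trade average and a compounding, capital-concentrating backtest diverge. The helper names and trade returns are illustrative, not the actual v3 engine or results.

def per_trade_average(trade_returns):
    # Simple mean return per trade (the v2-style metric).
    return sum(trade_returns) / len(trade_returns)

def portfolio_backtest(trade_returns, starting_capital=100_000.0):
    # Concentrate capital in each selected trade in sequence and compound the result.
    capital = starting_capital
    for r in trade_returns:
        capital *= (1.0 + r)
    return capital / starting_capital - 1.0

selected = [0.012, -0.004, 0.021, 0.008]    # a few high-conviction trades (illustrative)
print(per_trade_average(selected))           # average return per trade
print(portfolio_backtest(selected))          # compounded portfolio return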
Failed (Valuable!)
v4: Event-Based Q-Learning
November 5, 2025
Tested whether Q-learning could learn directly from event counts, without transformers. The agent achieved +0.02% (essentially random), far below v3's +11.20%. This validates that transformers extract valuable patterns (sequences, combinations, temporal dynamics) that simple count features cannot capture (sketched below). Transformers do real feature engineering, not just compression.
Key Insight:

The +11.18 percentage-point gap between transformer-based Q-learning (v3) and event-count Q-learning (v4) quantifies the value of transformers. This valuable negative result proves transformers are necessary for extracting trading signals from SEC filings; they're not optional complexity.

🔬 Controlled Experiment
📊 267K Filings
Hypothesis Tested
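
A small illustration of what the count-only state throws away. The event names below are examples, not the project's actual event taxonomy: two different cascades collapse to the same bag of counts, so ordering and combination structure are invisible to the v4 agent while a sequence model can still distinguish them.

from collections import Counter

history_a = ["layoffs", "material_weakness", "going_concern"]
history_b = ["going_concern", "material_weakness", "layoffs"]

# v4-style state: a bag of event counts. Both histories produce the same feature vector.
print(Counter(history_a) == Counter(history_b))   # True -> ordering is lost

# A sequence model sees the order, so an unfolding cascade and its reverse stay distinct.
print(history_a == history_b)                      # False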
Complete
v6: Nanochat Portfolio Manager
November 2025
Trained a custom 561M-parameter LLM to generate portfolio decisions from SEC filing events. Training was successful (val_loss 0.096), and the model generates valid decisions with reasoning. However, a 100% position-size override rate revealed an architecture flaw: LLMs should generate signals (conviction), not portfolio decisions (dollar amounts).
Key Insight:

Separation of concerns is needed: the LLM handles pattern recognition and conviction scoring (0.0-1.0), while deterministic code handles portfolio construction and risk management (sketched below). Also achieved a 10x backtesting speedup via pre-computed contexts (59 contexts/sec).

🧠 Custom LLM (561M)
10x Speedup
🎯 Architecture Insights
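
A hedged sketch of the proposed separation of concerns: the LLM emits only a conviction score, and deterministic code turns it into a position size. The dataclass, thresholds, and sizing rule are assumptions for illustration, not the actual v6 risk logic.

from dataclasses import dataclass

@dataclass
class Signal:
    ticker: str
    conviction: float   # 0.0-1.0, produced by the LLM

def position_size(signal, capital, max_fraction=0.05, min_conviction=0.6):
    # Deterministic sizing: the LLM never emits dollar amounts.
    if signal.conviction < min_conviction:
        return 0.0
    return capital * max_fraction * signal.conviction

# Usage sketch: the LLM scores the filing, code decides the trade.
print(position_size(Signal("ACME", conviction=0.8), capital=100_000.0))   # 4000.0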
Critical Lessons
v7: Signals Model
November 2025
Implemented a clean architecture separating LLM signal generation from portfolio management. The model learned selectivity (BUY outperforms the baseline by +2.97%) and achieved a 200x speedup. However, it revealed a fundamental problem: using future stock prices as ground-truth labels is flawed due to regime non-stationarity. The same events have different outcomes in different market regimes (QE vs. rate hikes).
Key Insight:

Future price returns are regime-dependent and non-stationary. LLMs should predict stationary patterns (event cascades) instead of regime-dependent outcomes (price movements). This insight directly led to v8's breakthrough: predicting events from events.

200x Speedup
📊 228K Training Examples
💡 Regime Non-Stationarity
FIRST SUCCESS ⭐
v8: Event Prediction Model
November 2025
Our first truly successful model! It predicts future corporate event probabilities from past SEC filing events, using stationary event patterns instead of regime-dependent price returns. It achieved statistically significant predictive power: 0.25 correlation (p < 1e-36) on 5,000 test examples (evaluation sketched below). Event cascades like "Layoffs → Material weakness → Bankruptcy" work across ALL market regimes.
Key Insight:

Event patterns are stationary: they work whether it's 2010 or 2024, QE or rate hikes, bull or bear market. Turnaround events show the strongest predictive power (0.31 correlation). This solves v7's regime non-stationarity problem and provides a foundation for production trading systems.

🎯 0.25 Correlation (p < 1e-36)
93-96% Precision
📊 Validated on 5K Test Examples
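
A hedged sketch of how a correlation like the one reported above can be computed on a held-out test set: Pearson correlation between predicted event probabilities and realized 0/1 outcomes. The arrays below are tiny stand-ins, not the actual v8 evaluation data.

import numpy as np
from scipy.stats import pearsonr

def evaluate_event_predictions(predicted_prob, realized):
    # Correlation between predicted probabilities and realized 0/1 outcomes,
    # plus the p-value for the null hypothesis of no association.
    r, p_value = pearsonr(predicted_prob, realized)
    return r, p_value

# Usage sketch with stand-in arrays:
preds = np.array([0.9, 0.1, 0.7, 0.2, 0.8])
actual = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
print(evaluate_event_predictions(preds, actual))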