Experiments

Learning from both successes and failures

Why Document Experiments?

In ML/AI research, failed experiments are often more valuable than successes. They reveal edge cases, expose flawed assumptions, and guide future work. This is a collection of real experiments—what worked, what didn't, and what we learned.

Philosophy: If an experiment "fails" but teaches you something important about the problem space, it's actually a success. The only true failure is not learning from the attempt.

"I have not failed. I've just found 10,000 ways that won't work." — Thomas Edison
Failed (Insightful)
v1: Q-Learning for Trading
November 4, 2025
Trained a Q-learning agent to decide WHEN to trade based on transformer return predictions. Agent learned to HOLD everything (100% HOLD, 0% return). But the failure was insightful...
Key Insight:

Q-learning itself worked correctly, but the state representation was flawed: 99.8% of predictions fell into a single bucket because the transformer only produced positive predictions (+1.68% to +10.74%) while the bucketing thresholds assumed a -10% to +10% range (sketched below). Lesson: state representation is critical in RL.

🎯 Reinforcement Learning
📊 129K Episodes
⏱️ ~8 hours
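
A minimal sketch of the v1 bucketing flaw, in Python. The bucket edges and prediction distribution below are illustrative assumptions, not the actual v1 values; the point is that fixed edges designed for a symmetric -10% to +10% range collapse a narrow, positive-only distribution into essentially one state.

import numpy as np

rng = np.random.default_rng(0)

# Assumed fixed edges spanning -10%..+10% (the real v1 edges aren't documented here).
edges = np.linspace(-0.10, 0.10, 11)

# Illustrative stand-in for the transformer's outputs: narrow and strictly positive.
preds = np.clip(rng.normal(loc=0.03, scale=0.01, size=10_000), 0.0168, 0.1074)

# The bucket index is the discrete state the Q-learning agent observes.
states = np.digitize(preds, edges)
share = np.bincount(states, minlength=edges.size + 1) / preds.size
print(share.round(3))   # nearly all mass lands in one or two buckets -> degenerate state space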
Partial Success
v2: Percentile Bucketing Fix
November 4, 2025 (same day!)
Fixed v1's state representation using percentile-based bucketing (sketched below). The agent learned selective trading: 30.5% BUY on high-confidence predictions, 69.5% HOLD on the rest. Achieved a +2.12% test return (vs. the +3.65% always-buy baseline). Success: Q-learning works! Problem revealed: transformer calibration needs improvement.
Key Insight:

Q-learning worked as a diagnostic tool: it proved the algorithm works correctly, but it revealed that the transformer's confidence levels don't correlate with actual returns. High-confidence predictions don't actually outperform average ones.

Q-Learning Validated
⚠️ Transformer Calibration Issue
⏱️ ~2 hours
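
A hedged sketch of percentile-based bucketing as described above. The function names, bucket count, and example numbers are assumptions for illustration, not the actual v2 code.

import numpy as np

def percentile_edges(train_preds, n_buckets=10):
    # Cut points from the empirical distribution of training predictions,
    # so each bucket holds roughly the same share of observations.
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]
    return np.percentile(train_preds, qs)

def to_state(pred, edges):
    # Map a predicted return to a discrete state id for the Q-learning agent.
    return int(np.digitize(pred, edges))

# Usage sketch with illustrative data:
train_preds = np.random.default_rng(1).normal(0.03, 0.01, 50_000)
edges = percentile_edges(train_preds, n_buckets=10)
print(to_state(0.045, edges))   # high predictions map to the upper states the agent can act on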
Success
v3: Real Portfolio Backtest
November 4, 2025 (same day!)
Built a real backtesting system to simulate actual portfolio trading. The Q-learning agent achieved +11.20% returns over 87 days by being extremely selective (only 20 trades out of 24,436 opportunities), beating the always-buy baseline (+3.65%) by 7.55 percentage points. This proves Q-learning works in practice with real portfolio mechanics!
Key Insight:

Per-trade metrics (v2's +2.12%) don't tell the full story. A real portfolio simulation with capital concentration and compounding turned the same Q-learning algorithm into +11.20% returns (sketched below). The agent learned that 20 well-timed trades beat 7,440 average trades.

🎯 +11.20% Returns
📊 20 Trades / 24,436
⏱️ 87-day backtest
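
A minimal sketch of why a per-trade average and a compounding, capital-concentrating backtest diverge. The helper names and trade returns are illustrative, not the actual v3 engine or results.

def per_trade_average(trade_returns):
    # Simple mean return per trade (the v2-style metric).
    return sum(trade_returns) / len(trade_returns)

def portfolio_backtest(trade_returns, starting_capital=100_000.0):
    # Concentrate capital in each selected trade in sequence and compound the result.
    capital = starting_capital
    for r in trade_returns:
        capital *= (1.0 + r)
    return capital / starting_capital - 1.0

selected = [0.012, -0.004, 0.021, 0.008]    # a few high-conviction trades (illustrative)
print(per_trade_average(selected))           # average return per trade
print(portfolio_backtest(selected))          # compounded portfolio return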
Failed (Valuable!)
v4: Event-Based Q-Learning
November 5, 2025
Tested whether Q-learning could learn directly from event counts, without transformers. The agent achieved +0.02% (essentially random), far below v3's +11.20%. This validates that transformers extract valuable patterns (sequences, combinations, temporal dynamics) that simple count features cannot capture (sketched below). Transformers do real feature engineering, not just compression.
Key Insight:

The +11.18 percentage-point gap between transformer-based Q-learning (v3) and event-count Q-learning (v4) quantifies the value of transformers. This valuable negative result proves transformers are necessary for extracting trading signals from SEC filings; they're not optional complexity.

🔬 Controlled Experiment
📊 267K Filings
Hypothesis Tested
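
A small illustration of what the count-only state throws away. The event names below are examples, not the project's actual event taxonomy: two different cascades collapse to the same bag of counts, so ordering and combination structure are invisible to the v4 agent while a sequence model can still distinguish them.

from collections import Counter

history_a = ["layoffs", "material_weakness", "going_concern"]
history_b = ["going_concern", "material_weakness", "layoffs"]

# v4-style state: a bag of event counts. Both histories produce the same feature vector.
print(Counter(history_a) == Counter(history_b))   # True -> ordering is lost

# A sequence model sees the order, so an unfolding cascade and its reverse stay distinct.
print(history_a == history_b)                      # False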
Complete
v6: Nanochat Portfolio Manager
November 2025
Trained a custom 561M-parameter LLM to generate portfolio decisions from SEC filing events. Training was successful (val_loss 0.096), and the model generates valid decisions with reasoning. However, a 100% position-size override rate revealed an architecture flaw: LLMs should generate signals (conviction), not portfolio decisions (dollar amounts).
Key Insight:

Separation of concerns is needed: the LLM handles pattern recognition and conviction scoring (0.0-1.0), while deterministic code handles portfolio construction and risk management (sketched below). Also achieved a 10x backtesting speedup via pre-computed contexts (59 contexts/sec).

🧠 Custom LLM (561M)
10x Speedup
🎯 Architecture Insights
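
A hedged sketch of the proposed separation of concerns: the LLM emits only a conviction score, and deterministic code turns it into a position size. The dataclass, thresholds, and sizing rule are assumptions for illustration, not the actual v6 risk logic.

from dataclasses import dataclass

@dataclass
class Signal:
    ticker: str
    conviction: float   # 0.0-1.0, produced by the LLM

def position_size(signal, capital, max_fraction=0.05, min_conviction=0.6):
    # Deterministic sizing: the LLM never emits dollar amounts.
    if signal.conviction < min_conviction:
        return 0.0
    return capital * max_fraction * signal.conviction

# Usage sketch: the LLM scores the filing, code decides the trade.
print(position_size(Signal("ACME", conviction=0.8), capital=100_000.0))   # 4000.0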
Critical Lessons
v7: Signals Model
November 2025
Implemented a clean architecture separating LLM signal generation from portfolio management. The model learned selectivity (BUY outperforms the baseline by +2.97%) and achieved a 200x speedup. However, it revealed a fundamental problem: using future stock prices as ground-truth labels is flawed due to regime non-stationarity. The same events have different outcomes in different market regimes (QE vs. rate hikes).
Key Insight:

Future price returns are regime-dependent and non-stationary. LLMs should predict stationary patterns (event cascades) instead of regime-dependent outcomes (price movements). This insight directly led to v8's breakthrough: predicting events from events.

200x Speedup
📊 228K Training Examples
💡 Regime Non-Stationarity
FIRST SUCCESS ⭐
v8: Event Prediction Model
November 2025
Our first truly successful model! It predicts future corporate event probabilities from past SEC filing events, using stationary event patterns instead of regime-dependent price returns. It achieved statistically significant predictive power: 0.25 correlation (p < 1e-36) on 5,000 test examples (evaluation sketched below). Event cascades like "Layoffs → Material weakness → Bankruptcy" work across ALL market regimes.
Key Insight:

Event patterns are stationary: they work whether it's 2010 or 2024, QE or rate hikes, bull or bear market. Turnaround events show the strongest predictive power (0.31 correlation). This solves v7's regime non-stationarity problem and provides a foundation for production trading systems.

🎯 0.25 Correlation (p < 1e-36)
93-96% Precision
📊 Validated on 5K Test Examples
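
A hedged sketch of how a correlation like the one reported above can be computed on a held-out test set: Pearson correlation between predicted event probabilities and realized 0/1 outcomes. The arrays below are tiny stand-ins, not the actual v8 evaluation data.

import numpy as np
from scipy.stats import pearsonr

def evaluate_event_predictions(predicted_prob, realized):
    # Correlation between predicted probabilities and realized 0/1 outcomes,
    # plus the p-value for the null hypothesis of no association.
    r, p_value = pearsonr(predicted_prob, realized)
    return r, p_value

# Usage sketch with stand-in arrays:
preds = np.array([0.9, 0.1, 0.7, 0.2, 0.8])
actual = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
print(evaluate_event_predictions(preds, actual))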