Failed (but Insightful)

Experiment v1: Q-Learning for Trading

November 4, 2025 • ~8 hours • 129K episodes

Hypothesis

Q-learning can learn WHEN to trade based on transformer return predictions by discovering which prediction magnitudes are reliable, when price has already moved (too late), and when to hold positions vs. take profits.

The Setup

Architecture

State Space: 30 states = (prediction_bucket, price_bucket, has_position)

  • Prediction buckets (5): Very negative (<-10%), Negative (-10% to -3%), Neutral (-3% to +3%), Positive (+3% to +10%), Very positive (>+10%)
  • Price buckets (3): Down (<-3%), Flat (-3% to +3%), Up (>+3%)
  • Position (2): No position, Holding position

Actions: BUY, HOLD, SELL

Rewards: Actual 3-month return from filing
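
As a concrete illustration, here is a minimal sketch of the state encoding described above, using the fixed thresholds from the bucket definitions; the function names are illustrative, not the experiment's actual code.

    # Hypothetical sketch of the 5 x 3 x 2 = 30-state encoding described above.
    # Function names and structure are illustrative, not the actual implementation.

    PRED_EDGES = [-0.10, -0.03, 0.03, 0.10]   # very_neg | neg | neutral | pos | very_pos
    PRICE_EDGES = [-0.03, 0.03]               # down | flat | up

    def bucketize(value, edges):
        """Return the index of the bucket that `value` falls into."""
        for i, edge in enumerate(edges):
            if value < edge:
                return i
        return len(edges)

    def encode_state(prediction, price_move, has_position):
        """Map (predicted return, recent price move, position flag) to a state tuple."""
        return (
            bucketize(prediction, PRED_EDGES),   # 0..4
            bucketize(price_move, PRICE_EDGES),  # 0..2
            int(has_position),                   # 0 or 1
        )

    # Example: +5% predicted return, price already up 4%, no open position -> (3, 2, 0)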

Data Pipeline

1. Filing Returns Database

244K filings with 3-month returns (2023-2025) stored in SQLite
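
For context, a hypothetical sketch of what the returns store might look like; the database file, table, and column names are assumptions, not the actual schema.

    # Hypothetical sketch of the filing-returns store; names are assumptions.
    import sqlite3

    conn = sqlite3.connect("filing_returns.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS filing_returns (
            filing_id    TEXT PRIMARY KEY,   -- filing identifier (e.g. accession number)
            ticker       TEXT,
            filing_date  TEXT,               -- ISO date of the filing
            return_3m    REAL                -- realized 3-month return after filing
        )
    """)
    conn.commit()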

2. Generate Inference Dataset

Created 129K tokenized event sequences (54% of training data, sufficient for Q-learning development)

3. Transformer Inference

Ran batch inference on an H200 GPU (vortex); completed in ~3 minutes

4. Import Predictions

Matched 122,180 predictions with returns (97,744 train, 24,436 test)
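
Roughly, this step joins predictions to realized returns on a shared filing identifier and splits 80/20; the file names, column names, and split method below are assumptions, not the actual pipeline code.

    # Hypothetical sketch of step 4: join predictions to realized returns, then split.
    import pandas as pd
    import sqlite3

    predictions = pd.read_parquet("predictions.parquet")   # assumed: filing_id, prediction
    with sqlite3.connect("filing_returns.db") as conn:
        returns = pd.read_sql("SELECT filing_id, return_3m FROM filing_returns", conn)

    episodes = predictions.merge(returns, on="filing_id", how="inner")

    split = int(len(episodes) * 0.8)                        # ~97.7K train / ~24.4K test
    train, test = episodes.iloc[:split], episodes.iloc[split:]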

5. Train Q-Learning

Single-step Q-learning with α=0.1, ε decay from 0.3 to 0.05
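
For concreteness, a minimal sketch of the training loop; the reward mapping (BUY earns the realized return, HOLD earns 0, SELL earns the negated return) and the episode format are assumptions, not the actual implementation.

    import random
    from collections import defaultdict

    ALPHA, EPS_MIN, EPS_DECAY = 0.1, 0.05, 0.995
    ACTIONS = ["BUY", "HOLD", "SELL"]

    # Toy stand-in for the 97,744 training episodes: (state, realized 3-month return)
    training_episodes = [((3, 2, 0), 0.05), ((3, 1, 0), -0.12)]

    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    eps = 0.3

    for state, actual_return in training_episodes:
        # Epsilon-greedy action selection
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(Q[state], key=Q[state].get)

        # Assumed reward mapping: BUY earns the return, HOLD earns 0, SELL shorts it
        reward = {"BUY": actual_return, "HOLD": 0.0, "SELL": -actual_return}[action]

        # Single-step update: no next state, so the target is just the reward (γ unused)
        Q[state][action] += ALPHA * (reward - Q[state][action])
        eps = max(EPS_MIN, eps * EPS_DECAY)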

The Results

What the Agent Learned

Strategy: HOLD everything (100% HOLD actions)

The agent rationally decided that HOLD (guaranteed 0% return) was better than BUY (uncertain return with high variance).

  • Q-learning return: 0%
  • Always-BUY baseline: +3.65%
  • HOLD actions: 100%
  • Data in one state: 99.8%

Q-Table Learned Values

    State: large_positive (99.8% of data)
        BUY:   -1.41%   ← Learned: buying gives a slight loss
        HOLD:   0.00%   ← Safe choice: guaranteed 0%
        SELL:  -11.40%  ← Very bad: shorting loses money

    State: small_positive (0.2% of data)
        BUY:   -1.04%
        HOLD:   0.00%
        SELL:  -2.31%

    State: neutral (0.0% of data)
        BUY:   -0.14%
        HOLD:   0.00%
        SELL:  -0.24%

The agent's logic was sound: if BUY gives -1.41% and HOLD gives 0%, always choose HOLD.

The Problem: State Representation Failure

❌ What Went Wrong

State Distribution:

  • large_negative: 0 (0.0%)
  • small_negative: 0 (0.0%)
  • neutral: 9 (0.0%)
  • small_positive: 192 (0.2%)
  • large_positive: 97,543 (99.8%)

Root Cause: Transformer predictions ranged from +1.68% to +10.74%, but bucketing thresholds were designed for -10% to +10%.

✓ The Fix

Percentile-Based Bucketing:

Instead of fixed thresholds, use percentiles of actual prediction distribution (20th, 40th, 60th, 80th percentiles).

Result: Guaranteed balanced state distribution (20% in each bucket).

Expected Outcome: The agent learns different strategies for different confidence levels: BUY on high-confidence predictions, HOLD on medium-confidence ones, and possibly SELL on low-confidence ones.
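
A small sketch of the proposed bucketing, assuming the training-set predictions are available as an array; np.percentile derives the cut points from the data itself, so the buckets stay balanced even if the prediction range shifts.

    import numpy as np

    # `predictions` would be the 97,744 training-set transformer outputs;
    # a synthetic stand-in is used here for illustration.
    predictions = np.random.default_rng(0).normal(0.05, 0.02, size=10_000)

    edges = np.percentile(predictions, [20, 40, 60, 80])    # 4 cut points -> 5 buckets
    buckets = np.digitize(predictions, edges)               # bucket index 0..4

    # Sanity check: each bucket should now hold ~20% of episodes.
    print(np.bincount(buckets, minlength=5) / len(buckets))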

Key Insight: Q-Learning Worked Perfectly

This wasn't a Q-learning failure; it was a state representation failure. The agent learned correctly from the data it was given. With 99.8% of episodes landing in a single state, the rational strategy is to treat every state the same and choose the safest action (HOLD).

The lesson: in reinforcement learning, state representation is critical. If your states don't capture meaningful distinctions in the environment, the agent can't learn meaningful policies. This is analogous to feature engineering in supervised learning: garbage in, garbage out.

Why BUY Had Negative Q-Value

The transformer predicted +9.67% average return, but actual returns were +4.43%. With high variance (-50% to +50%), many BUY actions during exploration resulted in losses. The agent learned that:

  1. BUY is risky: Even with positive mean return, variance is high
  2. HOLD is safe: Guaranteed 0% with zero variance
  3. No way to discriminate: With all states effectively identical, the agent couldn't tell good BUY opportunities from bad ones, so the risk-averse strategy is HOLD

The agent's reasoning: "I can't tell which predictions are reliable (they all look the same), so why take a risk when I can get a guaranteed 0%?"
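
One mechanism worth making explicit: with a constant learning rate, each Q-value is an exponentially weighted average of recent sampled rewards rather than a true sample mean, so a high-variance reward stream with a positive mean can still leave Q(BUY) negative at the end of training. The toy simulation below illustrates the effect; the reward distribution is made up for illustration, not drawn from the actual return data.

    import random

    random.seed(0)
    ALPHA, MEAN, STD = 0.1, 0.044, 0.20      # illustrative: ~+4.4% mean, high variance
    q_buy, steps_negative = 0.0, 0
    for _ in range(10_000):
        reward = random.gauss(MEAN, STD)
        q_buy += ALPHA * (reward - q_buy)    # exponentially weighted, not a true mean
        steps_negative += q_buy < 0

    print(f"final Q(BUY): {q_buy:+.3f}, steps spent below zero: {steps_negative}")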

What We Learned

  1. State representation is critical in RL - Poor states lead to poor policies, even with correct learning
  2. Always check state distributions - Imbalanced states (99.8% in one bucket) are a red flag
  3. Fixed thresholds are fragile - Percentile-based bucketing is more robust to distribution shifts
  4. Q-learning debugging requires domain understanding - The agent's behavior was rational given the states
  5. Transformer has signal - Test correlation of +0.1214 suggests predictions do contain useful information

Next Steps

Option 1: Fix State Representation (Immediate)

  • Implement percentile-based bucketing (20/40/60/80 splits)
  • Re-import episodes with balanced state distribution
  • Re-train Q-learning agent
  • Expected: Agent learns to selectively trade on highest-confidence predictions

Option 2: Deep Q-Learning (If percentile bucketing insufficient)

  • Use neural network Q-function instead of Q-table
  • Work with continuous predictions directly (no bucketing needed)
  • More flexible representation learning
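
If percentile bucketing proves insufficient, a minimal PyTorch sketch of what the Q-network might look like, taking the continuous prediction and position flag directly as features; layer sizes and feature choices are placeholders, not a settled design.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps continuous features directly to action values, no bucketing needed."""
        def __init__(self, n_features: int = 3, n_actions: int = 3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, n_actions),        # Q(s, BUY), Q(s, HOLD), Q(s, SELL)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    # Features: [predicted_return, recent_price_move, has_position]
    q_net = QNetwork()
    q_values = q_net(torch.tensor([[0.05, 0.04, 0.0]]))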

Option 3: Add Transaction Costs

  • Penalize BUY/SELL with 0.1% transaction cost
  • Agent learns to only trade when expected return exceeds costs
  • More realistic trading scenario
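
Implementing this is essentially a one-line change to the reward function; a minimal sketch follows (the SELL-as-short assumption matches the earlier Q-update sketch).

    # Sketch of Option 3: subtract a 0.1% cost whenever the agent trades.
    TRANSACTION_COST = 0.001

    def reward_with_costs(action: str, actual_return: float) -> float:
        if action == "BUY":
            return actual_return - TRANSACTION_COST
        if action == "SELL":
            return -actual_return - TRANSACTION_COST   # assumes SELL means shorting
        return 0.0                                      # HOLD stays cost-free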

Technical Details

Transformer Model

  • Architecture: 6-layer transformer encoder, 128-dim embeddings, 8 attention heads
  • Vocabulary: 389 hybrid events (Option 6b - winner with 42.8% correlation vs 23.1% baseline)
  • Training: Temporal split (respects time ordering)
  • Performance: Test correlation +42.8%, RMSE 15.78
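
For orientation, a rough PyTorch sketch that matches the listed dimensions (6 layers, 128-dim embeddings, 8 heads, 389-event vocabulary); the pooling strategy and regression head are assumptions, not the actual model code.

    import torch
    import torch.nn as nn

    class ReturnPredictor(nn.Module):
        def __init__(self, vocab_size: int = 389, d_model: int = 128,
                     n_heads: int = 8, n_layers: int = 6):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, 1)            # predicted 3-month return

        def forward(self, event_ids: torch.Tensor) -> torch.Tensor:
            h = self.encoder(self.embed(event_ids))      # (batch, seq, d_model)
            return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool then regress

    # Example: two sequences of 64 event tokens -> two predicted returns
    model = ReturnPredictor()
    preds = model(torch.randint(0, 389, (2, 64)))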

Q-Learning Configuration

    Learning rate      α = 0.1
    Discount factor    γ = 0.95 (not used; single-step updates)
    Epsilon start      ε₀ = 0.3
    Epsilon decay      0.995
    Epsilon min        ε_min = 0.05
    Training epochs    10
    Train/test split   80/20

Experiment Status

Failed to beat baseline, but successfully identified the state representation issue. Ready to iterate with percentile-based bucketing.

"The only real mistake is the one from which we learn nothing." — Henry Ford