Hypothesis
Q-learning can learn WHEN to trade based on transformer return predictions by discovering which prediction magnitudes are reliable, when the price has already moved (making it too late to enter), and when to hold a position versus take profits.
The Setup
Architecture
State Space: 30 states = (prediction_bucket, price_bucket, has_position)
- Prediction buckets (5): Very negative (<-10%), Negative (-10% to -3%), Neutral (-3% to +3%), Positive (+3% to +10%), Very positive (>+10%)
- Price buckets (3): Down (<-3%), Flat (-3% to +3%), Up (>+3%)
- Position (2): No position, Holding position
Actions: BUY, HOLD, SELL
Rewards: Actual 3-month return from filing
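As a rough illustration of this discrete encoding, here is a minimal Python sketch; the bucket edges mirror the fixed thresholds above, but the names (`encode_state`, `PRED_EDGES`, etc.) are illustrative, not the project's actual code.

```python
# Illustrative bucket boundaries matching the fixed thresholds described above.
PRED_EDGES = [-0.10, -0.03, 0.03, 0.10]   # 5 prediction buckets
PRICE_EDGES = [-0.03, 0.03]               # 3 price-move buckets
ACTIONS = ["BUY", "HOLD", "SELL"]

def bucketize(x: float, edges: list) -> int:
    """Return the index of the bucket that x falls into."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

def encode_state(prediction: float, price_move: float, has_position: bool) -> int:
    """Map (prediction, price move, position flag) to one of 5 * 3 * 2 = 30 states."""
    p = bucketize(prediction, PRED_EDGES)    # 0..4
    m = bucketize(price_move, PRICE_EDGES)   # 0..2
    pos = int(has_position)                  # 0..1
    return (p * 3 + m) * 2 + pos
```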
Data Pipeline
1. Filing Returns Database
244K filings with 3-month returns (2023-2025) stored in SQLite
2. Generate Inference Dataset
Created 129K tokenized event sequences (54% of training data, sufficient for Q-learning development)
3. Transformer Inference
Ran batch inference on H200 GPU (vortex) - completed in ~3 minutes
4. Import Predictions
Matched 122,180 predictions with returns (97,744 train, 24,436 test)
5. Train Q-Learning
Single-step Q-learning with α=0.1, ε decay from 0.3 to 0.05
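For reference, the single-step update boils down to something like the sketch below. Only the hyperparameters (α = 0.1, ε decayed from 0.3 to 0.05, 30 states, 3 actions) come from the run above; the episode format and the `reward_for` helper are assumptions for illustration.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 30, 3            # states from the architecture; BUY=0, HOLD=1, SELL=2
ALPHA = 0.1                            # learning rate used in the run
EPS_START, EPS_END = 0.3, 0.05         # epsilon-greedy exploration schedule

def reward_for(action: int, actual_return: float) -> float:
    """Illustrative reward: BUY earns the realized 3-month return,
    HOLD earns a guaranteed 0; SELL is also treated as 0 here for simplicity."""
    return actual_return if action == 0 else 0.0

def train_q_table(episodes):
    """episodes: list of (state_index, actual_return) pairs. Single-step
    Q-learning: there is no next state, so the target is just the reward."""
    q = np.zeros((N_STATES, N_ACTIONS))
    n = len(episodes)
    for i, (state, actual_return) in enumerate(episodes):
        eps = EPS_START + (EPS_END - EPS_START) * i / max(n - 1, 1)
        if random.random() < eps:
            action = random.randrange(N_ACTIONS)      # explore
        else:
            action = int(np.argmax(q[state]))         # exploit
        r = reward_for(action, actual_return)
        q[state, action] += ALPHA * (r - q[state, action])
    return q
```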
The Results
What the Agent Learned
Strategy: HOLD everything (100% HOLD actions)
The agent rationally decided that HOLD (guaranteed 0% return) was better than BUY (uncertain return with high variance).
Q-Table Learned Values
The agent's logic was sound: with a learned value of roughly -1.41% for BUY and a guaranteed 0% for HOLD, the greedy policy is to always choose HOLD.
The Problem: State Representation Failure
❌ What Went Wrong
State Distribution (prediction bucket, over the 97,744 training episodes):
- large_negative: 0 (0.0%)
- small_negative: 0 (0.0%)
- neutral: 9 (0.0%)
- small_positive: 192 (0.2%)
- large_positive: 97,543 (99.8%)
Root Cause: Transformer predictions ranged from +1.68% to +10.74%, but bucketing thresholds were designed for -10% to +10%.
✓ The Fix
Percentile-Based Bucketing:
Instead of fixed thresholds, use percentiles of the actual prediction distribution (the 20th, 40th, 60th, and 80th percentiles) as bucket edges; see the sketch below.
Result: Guaranteed balanced state distribution (20% in each bucket).
Expected Outcome: The agent learns different strategies for different confidence levels: BUY on high-confidence predictions, HOLD on medium-confidence ones, and possibly SELL on low-confidence ones.
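A minimal sketch of the percentile bucketing (NumPy; the uniform sample is stand-in data spanning the observed +1.68% to +10.74% range, not the real predictions):

```python
import numpy as np

def percentile_edges(predictions, pcts=(20, 40, 60, 80)):
    """Compute bucket edges from the empirical prediction distribution,
    so each of the 5 buckets receives roughly 20% of the data."""
    return np.percentile(predictions, pcts)

def bucketize(x, edges):
    """np.searchsorted returns 0..len(edges), i.e. a bucket index 0..4."""
    return int(np.searchsorted(edges, x))

# Stand-in data: even though every prediction is positive, the five
# buckets stay balanced by construction.
preds = np.random.uniform(0.0168, 0.1074, size=10_000)
edges = percentile_edges(preds)
counts = np.bincount([bucketize(p, edges) for p in preds], minlength=5)
print(edges, counts)   # roughly 2,000 per bucket
```

One design note: the edges should be computed on the training predictions and then reused unchanged for test-time episodes, otherwise the state definition shifts between splits.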
Key Insight: Q-Learning Worked Perfectly
This wasn't a Q-learning failure - it was a state representation failure. The agent learned correctly from the data it was given: with 99.8% of episodes mapped to a single prediction bucket, the rational strategy is to treat every state the same and choose the safest action (HOLD).
The lesson: In reinforcement learning, state representation is critical. If your states don't capture meaningful distinctions in the environment, the agent can't learn meaningful policies. This is analogous to feature engineering in supervised learning - garbage in, garbage out.
Why BUY Had Negative Q-Value
The transformer predicted a +9.67% average return, but actual returns averaged +4.43%. With actual returns spread widely (roughly -50% to +50%), many BUY actions taken during exploration resulted in losses. The agent learned that:
- BUY is risky: Even with positive mean return, variance is high
- HOLD is safe: Guaranteed 0% with zero variance
- Given no way to distinguish between good and bad BUY opportunities (all states identical), the risk-averse strategy is HOLD
The agent's reasoning: "I can't tell which predictions are reliable (they all look the same). So why take risk when I can get guaranteed 0%?"
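To see why a negative estimate is plausible even with a positive mean, here is a small simulation of the constant-α update; the ±20% reward spread and the Gaussian noise model are assumptions, while the +4.43% mean comes from the numbers above.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHA = 0.1
MEAN_RETURN, STD_RETURN = 0.0443, 0.20   # mean from the text; spread is an assumption

def final_q(n_updates=5_000):
    """Run the single-step update Q += alpha * (r - Q) on noisy rewards."""
    q = 0.0
    for r in rng.normal(MEAN_RETURN, STD_RETURN, size=n_updates):
        q += ALPHA * (r - q)
    return q

finals = np.array([final_q() for _ in range(1_000)])
# With a constant learning rate the estimate never averages the noise away:
# its steady-state std is sigma * sqrt(alpha / (2 - alpha)) ~ 4.6% here,
# comparable to the +4.4% mean, so a sizeable fraction of runs end negative.
print(f"mean final Q = {finals.mean():+.4f}, std = {finals.std():.4f}, "
      f"P(Q < 0) ~ {(finals < 0).mean():.0%}")
```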
What We Learned
- State representation is critical in RL - Poor states lead to poor policies, even with correct learning
- Always check state distributions - Imbalanced states (99.8% in one bucket) are a red flag
- Fixed thresholds are fragile - Percentile-based bucketing is more robust to distribution shifts
- Q-learning debugging requires domain understanding - The agent's behavior was rational given the states
- Transformer has signal - Test correlation of +0.1214 suggests predictions do contain useful information
Next Steps
Option 1: Fix State Representation (Immediate)
- Implement percentile-based bucketing (20/40/60/80 splits)
- Re-import episodes with balanced state distribution
- Re-train Q-learning agent
- Expected: Agent learns to selectively trade on highest-confidence predictions
Option 2: Deep Q-Learning (If percentile bucketing insufficient)
- Use neural network Q-function instead of Q-table
- Work with continuous predictions directly (no bucketing needed)
- More flexible representation learning
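A minimal sketch of what that neural Q-function could look like (PyTorch assumed; the feature set, hidden sizes, and loss wiring are placeholders, not a committed design):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps continuous features (predicted return, recent price move,
    position flag) directly to Q-values for BUY / HOLD / SELL,
    removing the need for bucketing."""
    def __init__(self, n_features: int = 3, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(states, actions, rewards):
    """Single-step setting: the target is just the observed reward, so this
    is a regression of Q(s, a) onto the realized 3-month return."""
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_taken, rewards)
```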
Option 3: Add Transaction Costs
- Penalize BUY/SELL with 0.1% transaction cost
- Agent learns to only trade when expected return exceeds costs
- More realistic trading scenario
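As a sketch, the reward shaping could look like the following; the exact treatment of SELL and of position state is an assumption here.

```python
TRANSACTION_COST = 0.001  # 0.1% per trade, as proposed above

def shaped_reward(action: str, actual_return: float) -> float:
    """Illustrative reward shaping: entering a trade pays the transaction
    cost, so BUY is only worthwhile when the expected return clears 0.1%."""
    if action == "BUY":
        return actual_return - TRANSACTION_COST
    if action == "SELL":
        return -TRANSACTION_COST   # closing a position also pays the cost (sketch)
    return 0.0                     # HOLD remains free
```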
Technical Details
Transformer Model
- Architecture: 6-layer transformer encoder, 128-dim embeddings, 8 attention heads
- Vocabulary: 389 hybrid events (Option 6b, the winning variant: 42.8% correlation vs. the 23.1% baseline)
- Training: Temporal split (respects time ordering)
- Performance: Test correlation +42.8%, RMSE 15.78
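For orientation, the described architecture roughly corresponds to the PyTorch sketch below; the pooling strategy, positional encoding, and regression head are assumptions not stated above.

```python
import torch
import torch.nn as nn

class ReturnTransformer(nn.Module):
    """Sketch of the described model: 389-event vocabulary, 128-dim embeddings,
    6 encoder layers with 8 attention heads, and a single regression head for
    the 3-month return. Positional encoding is omitted for brevity."""
    def __init__(self, vocab_size: int = 389, d_model: int = 128,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)   # predicted 3-month return

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens))
        return self.head(h.mean(dim=1)).squeeze(-1)   # mean-pool over events
```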
Q-Learning Configuration
- Algorithm: single-step tabular Q-learning (one decision per filing, no bootstrapping)
- Learning rate: α = 0.1
- Exploration: ε-greedy, decayed from 0.3 to 0.05
- State space: 30 discrete states (5 prediction buckets × 3 price buckets × 2 position flags)
- Actions: BUY, HOLD, SELL
- Reward: actual 3-month return from the filing (0% for HOLD)
Experiment Status
Failed to beat the baseline, but successfully identified the state representation issue. Ready to iterate with percentile-based bucketing.
"The only real mistake is the one from which we learn nothing." — Henry Ford