Hypothesis
Q-learning can learn WHEN to trade based on transformer return predictions by discovering which prediction magnitudes are reliable, when the price has already moved (making it too late to enter), and when to hold a position versus take profits.
The Setup
Architecture
State Space: 30 states = (prediction_bucket, price_bucket, has_position)
- Prediction buckets (5): Very negative (<-10%), Negative (-10% to -3%), Neutral (-3% to +3%), Positive (+3% to +10%), Very positive (>+10%)
- Price buckets (3): Down (<-3%), Flat (-3% to +3%), Up (>+3%)
- Position (2): No position, Holding position
Actions: BUY, HOLD, SELL
Rewards: Actual 3-month return from filing
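As a rough illustration of this discrete encoding, here is a minimal Python sketch; the bucket edges mirror the fixed thresholds above, but the names (`encode_state`, `PRED_EDGES`, etc.) are illustrative, not the project's actual code.

```python
# Illustrative bucket boundaries matching the fixed thresholds described above.
PRED_EDGES = [-0.10, -0.03, 0.03, 0.10]   # 5 prediction buckets
PRICE_EDGES = [-0.03, 0.03]               # 3 price-move buckets
ACTIONS = ["BUY", "HOLD", "SELL"]

def bucketize(x: float, edges: list) -> int:
    """Return the index of the bucket that x falls into."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

def encode_state(prediction: float, price_move: float, has_position: bool) -> int:
    """Map (prediction, price move, position flag) to one of 5 * 3 * 2 = 30 states."""
    p = bucketize(prediction, PRED_EDGES)    # 0..4
    m = bucketize(price_move, PRICE_EDGES)   # 0..2
    pos = int(has_position)                  # 0..1
    return (p * 3 + m) * 2 + pos
```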
Data Pipeline
1. Filing Returns Database
244K filings with 3-month returns (2023-2025) stored in SQLite
2. Generate Inference Dataset
Created 129K tokenized event sequences (54% of training data, sufficient for Q-learning development)
3. Transformer Inference
Ran batch inference on H200 GPU (vortex) - completed in ~3 minutes
4. Import Predictions
Matched 122,180 predictions with returns (97,744 train, 24,436 test)
5. Train Q-Learning
Single-step Q-learning with α=0.1, ε decay from 0.3 to 0.05
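For reference, the single-step update boils down to something like the sketch below. Only the hyperparameters (α = 0.1, ε decayed from 0.3 to 0.05, 30 states, 3 actions) come from the run above; the episode format and the `reward_for` helper are assumptions for illustration.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 30, 3            # states from the architecture; BUY=0, HOLD=1, SELL=2
ALPHA = 0.1                            # learning rate used in the run
EPS_START, EPS_END = 0.3, 0.05         # epsilon-greedy exploration schedule

def reward_for(action: int, actual_return: float) -> float:
    """Illustrative reward: BUY earns the realized 3-month return,
    HOLD earns a guaranteed 0; SELL is also treated as 0 here for simplicity."""
    return actual_return if action == 0 else 0.0

def train_q_table(episodes):
    """episodes: list of (state_index, actual_return) pairs. Single-step
    Q-learning: there is no next state, so the target is just the reward."""
    q = np.zeros((N_STATES, N_ACTIONS))
    n = len(episodes)
    for i, (state, actual_return) in enumerate(episodes):
        eps = EPS_START + (EPS_END - EPS_START) * i / max(n - 1, 1)
        if random.random() < eps:
            action = random.randrange(N_ACTIONS)      # explore
        else:
            action = int(np.argmax(q[state]))         # exploit
        r = reward_for(action, actual_return)
        q[state, action] += ALPHA * (r - q[state, action])
    return q
```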
The Results
What the Agent Learned
Strategy: HOLD everything (100% HOLD actions)
The agent rationally decided that HOLD (guaranteed 0% return) was better than BUY (uncertain return with high variance).
Q-Table Learned Values
The agent's logic was sound: with a learned value of roughly -1.41% for BUY and a guaranteed 0% for HOLD, the greedy policy is to always choose HOLD.
The Problem: State Representation Failure
❌ What Went Wrong
State Distribution (prediction bucket, over the 97,744 training episodes):
- large_negative: 0 (0.0%)
- small_negative: 0 (0.0%)
- neutral: 9 (0.0%)
- small_positive: 192 (0.2%)
- large_positive: 97,543 (99.8%)
Root Cause: Transformer predictions ranged from +1.68% to +10.74%, but bucketing thresholds were designed for -10% to +10%.
✓ The Fix
Percentile-Based Bucketing:
Instead of fixed thresholds, use percentiles of the actual prediction distribution (the 20th, 40th, 60th, and 80th percentiles) as bucket edges; see the sketch below.
Result: Guaranteed balanced state distribution (20% in each bucket).
Expected Outcome: The agent learns different strategies for different confidence levels: BUY on high-confidence predictions, HOLD on medium-confidence ones, and possibly SELL on low-confidence ones.
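A minimal sketch of the percentile bucketing (NumPy; the uniform sample is stand-in data spanning the observed +1.68% to +10.74% range, not the real predictions):

```python
import numpy as np

def percentile_edges(predictions, pcts=(20, 40, 60, 80)):
    """Compute bucket edges from the empirical prediction distribution,
    so each of the 5 buckets receives roughly 20% of the data."""
    return np.percentile(predictions, pcts)

def bucketize(x, edges):
    """np.searchsorted returns 0..len(edges), i.e. a bucket index 0..4."""
    return int(np.searchsorted(edges, x))

# Stand-in data: even though every prediction is positive, the five
# buckets stay balanced by construction.
preds = np.random.uniform(0.0168, 0.1074, size=10_000)
edges = percentile_edges(preds)
counts = np.bincount([bucketize(p, edges) for p in preds], minlength=5)
print(edges, counts)   # roughly 2,000 per bucket
```

One design note: the edges should be computed on the training predictions and then reused unchanged for test-time episodes, otherwise the state definition shifts between splits.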
Key Insight: Q-Learning Worked Perfectly
This wasn't a Q-learning failure - it was a state representation failure. The agent learned correctly from the data it was given: with 99.8% of episodes mapped to a single prediction bucket, the rational strategy is to treat every state the same and choose the safest action (HOLD).
The lesson: In reinforcement learning, state representation is critical. If your states don't capture meaningful distinctions in the environment, the agent can't learn meaningful policies. This is analogous to feature engineering in supervised learning - garbage in, garbage out.
Why BUY Had Negative Q-Value
The transformer predicted a +9.67% average return, but actual returns averaged +4.43%. With actual returns spread widely (roughly -50% to +50%), many BUY actions taken during exploration resulted in losses. The agent learned that:
- BUY is risky: Even with positive mean return, variance is high
- HOLD is safe: Guaranteed 0% with zero variance
- Given no way to distinguish between good and bad BUY opportunities (all states identical), the risk-averse strategy is HOLD
The agent's reasoning: "I can't tell which predictions are reliable (they all look the same). So why take risk when I can get guaranteed 0%?"
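To see why a negative estimate is plausible even with a positive mean, here is a small simulation of the constant-α update; the ±20% reward spread and the Gaussian noise model are assumptions, while the +4.43% mean comes from the numbers above.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHA = 0.1
MEAN_RETURN, STD_RETURN = 0.0443, 0.20   # mean from the text; spread is an assumption

def final_q(n_updates=5_000):
    """Run the single-step update Q += alpha * (r - Q) on noisy rewards."""
    q = 0.0
    for r in rng.normal(MEAN_RETURN, STD_RETURN, size=n_updates):
        q += ALPHA * (r - q)
    return q

finals = np.array([final_q() for _ in range(1_000)])
# With a constant learning rate the estimate never averages the noise away:
# its steady-state std is sigma * sqrt(alpha / (2 - alpha)) ~ 4.6% here,
# comparable to the +4.4% mean, so a sizeable fraction of runs end negative.
print(f"mean final Q = {finals.mean():+.4f}, std = {finals.std():.4f}, "
      f"P(Q < 0) ~ {(finals < 0).mean():.0%}")
```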
What We Learned
- State representation is critical in RL - Poor states lead to poor policies, even with correct learning
- Always check state distributions - Imbalanced states (99.8% in one bucket) are a red flag
- Fixed thresholds are fragile - Percentile-based bucketing is more robust to distribution shifts
- Q-learning debugging requires domain understanding - The agent's behavior was rational given the states
- Transformer has signal - Test correlation of +0.1214 suggests predictions do contain useful information
Next Steps
Option 1: Fix State Representation (Immediate)
- Implement percentile-based bucketing (20/40/60/80 splits)
- Re-import episodes with balanced state distribution
- Re-train Q-learning agent
- Expected: Agent learns to selectively trade on highest-confidence predictions
Option 2: Deep Q-Learning (If percentile bucketing insufficient)
- Use neural network Q-function instead of Q-table
- Work with continuous predictions directly (no bucketing needed)
- More flexible representation learning
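A minimal sketch of what that neural Q-function could look like (PyTorch assumed; the feature set, hidden sizes, and loss wiring are placeholders, not a committed design):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps continuous features (predicted return, recent price move,
    position flag) directly to Q-values for BUY / HOLD / SELL,
    removing the need for bucketing."""
    def __init__(self, n_features: int = 3, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(states, actions, rewards):
    """Single-step setting: the target is just the observed reward, so this
    is a regression of Q(s, a) onto the realized 3-month return."""
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_taken, rewards)
```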
Option 3: Add Transaction Costs
- Penalize BUY/SELL with 0.1% transaction cost
- Agent learns to only trade when expected return exceeds costs
- More realistic trading scenario
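As a sketch, the reward shaping could look like the following; the exact treatment of SELL and of position state is an assumption here.

```python
TRANSACTION_COST = 0.001  # 0.1% per trade, as proposed above

def shaped_reward(action: str, actual_return: float) -> float:
    """Illustrative reward shaping: entering a trade pays the transaction
    cost, so BUY is only worthwhile when the expected return clears 0.1%."""
    if action == "BUY":
        return actual_return - TRANSACTION_COST
    if action == "SELL":
        return -TRANSACTION_COST   # closing a position also pays the cost (sketch)
    return 0.0                     # HOLD remains free
```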
Technical Details
Transformer Model
- Architecture: 6-layer transformer encoder, 128-dim embeddings, 8 attention heads
- Vocabulary: 389 hybrid events (Option 6b, the winning variant: 42.8% correlation vs. the 23.1% baseline)
- Training: Temporal split (respects time ordering)
- Performance: Test correlation +42.8%, RMSE 15.78
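For orientation, the described architecture roughly corresponds to the PyTorch sketch below; the pooling strategy, positional encoding, and regression head are assumptions not stated above.

```python
import torch
import torch.nn as nn

class ReturnTransformer(nn.Module):
    """Sketch of the described model: 389-event vocabulary, 128-dim embeddings,
    6 encoder layers with 8 attention heads, and a single regression head for
    the 3-month return. Positional encoding is omitted for brevity."""
    def __init__(self, vocab_size: int = 389, d_model: int = 128,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)   # predicted 3-month return

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens))
        return self.head(h.mean(dim=1)).squeeze(-1)   # mean-pool over events
```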
Q-Learning Configuration
- Algorithm: single-step tabular Q-learning (one decision per filing, no bootstrapping)
- Learning rate: α = 0.1
- Exploration: ε-greedy, decayed from 0.3 to 0.05
- State space: 30 discrete states (5 prediction buckets × 3 price buckets × 2 position flags)
- Actions: BUY, HOLD, SELL
- Reward: actual 3-month return from the filing (0% for HOLD)
Experiment Status
Failed to beat the baseline, but successfully identified the state representation issue. Ready to iterate with percentile-based bucketing.
"The only real mistake is the one from which we learn nothing." — Henry Ford