We built a backtesting system that simulates real portfolio trading with the Q-learning agent. The agent returned +11.20% over 87 days by being extremely selective, beating the always-buy baseline (+3.65%) by 7.55 percentage points. It traded only 20 times out of 24,436 opportunities (a 0.082% trade rate).
Starting Capital: $1,000,000
Ending Value: $1,111,972
Total Return: +11.20% (vs +3.65% for always-buy)
Outperformance: +7.55 percentage points
Strategy: Ultra-selective trading on highest confidence predictions only
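The headline figures follow directly from the capital and trade counts reported here. As a quick check, here is a minimal Python sketch; the variable and function names are illustrative and not taken from the project's code.

```python
def total_return(start_value: float, end_value: float) -> float:
    """Simple return over the period, as a fraction of starting capital."""
    return (end_value - start_value) / start_value


start_capital = 1_000_000
end_value = 1_111_972
baseline_return_pct = 3.65  # always-buy baseline, as reported above

agent_return_pct = total_return(start_capital, end_value) * 100
outperformance_pp = agent_return_pct - baseline_return_pct
trade_rate_pct = 20 / 24_436 * 100  # trades taken / opportunities seen

print(f"Agent return:   {agent_return_pct:+.2f}%")     # +11.20%
print(f"Outperformance: {outperformance_pp:+.2f} pp")  # +7.55 pp
print(f"Trade rate:     {trade_rate_pct:.3f}%")        # 0.082%
```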
After training on 97,744 episodes of historical SEC filings paired with actual stock returns, the Q-learning agent learned three critical strategies:
The agent didn't just learn to "buy more"; it learned when to ignore the model. Out of 24,436 opportunities, it said "no thanks" 24,416 times. That discipline is what generated alpha.
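The training loop itself is not shown in this write-up, so the following is only a minimal sketch of how a tabular Q-learning agent can learn to skip most opportunities. It assumes one-step episodes of `(state, realized_return)`, two actions (`BUY`, `HOLD`), a reward equal to the realized return for `BUY` and zero for `HOLD`, and epsilon-greedy exploration. The function name `train` and the hyperparameters are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = ("BUY", "HOLD")


def train(episodes, alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning over one-step episodes (assumed setup, see lead-in)."""
    rng = random.Random(seed)
    q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    for state, realized_return in episodes:
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])
        # Assumed reward: BUY earns the realized return, HOLD earns nothing
        reward = realized_return if action == "BUY" else 0.0
        # With no next state, the Q-learning target reduces to the reward
        q[state][action] += alpha * (reward - q[state][action])
    return q


# Toy data only: one state that pays off and one that does not
episodes = [("very_bullish", 0.10)] * 200 + [("neutral", -0.02)] * 200
q_table = train(episodes)
for state, values in q_table.items():
    print(state, max(values, key=values.get))
```

In this toy setup the agent ends up preferring `BUY` only in the state whose realized returns are positive, which is the same kind of selectivity described above.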
All 20 trades happened on just 2 days in early March 2024, focusing on 3 major banks with bullish or very bullish signals:
| Date | Ticker | State | Predicted Return | Actual Return | Price |
|---|---|---|---|---|---|
| Mar 4, 2024 | GS | very_bullish | +10.74% | +17.10% | $379.87 |
| Mar 4, 2024 | JPM | very_bullish | +10.66% | +7.67% | $179.59 |
| Mar 4, 2024 | C | bullish | +10.34% | +11.01% | $53.46 |
| Mar 5, 2024 | GS | very_bullish | +10.74% | +17.59% | $378.57 |
| Mar 5, 2024 | JPM | very_bullish | +10.66% | +5.70% | $181.39 |
| Mar 5, 2024 | C | bullish | +10.34% | +7.45% | $53.58 |
Note: Table shows unique ticker/date combinations. Some tickers had multiple SEC filings on the same day, resulting in 20 total trades.
All trades were financial sector stocks with transformer predictions above +10.3%. The agent learned that extreme prediction confidence (top 1%) on large-cap banks in early March was a reliable signal.
87 days after the first trade, the portfolio holds:
| Ticker | Shares | Buy Price | Current Price | Value | Return |
|---|---|---|---|---|---|
| GS | 921 | $379.68 | $438.68 | $403,982 | +15.54% |
| C | 5,598 | $53.54 | $59.48 | $332,984 | +11.11% |
| JPM | 1,944 | $179.84 | $192.88 | $375,006 | +7.25% |
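The Value column can be summed and compared with the reported ending value, and each position's share of the portfolio shows how concentrated the result is. A small sketch using only the figures from the table (the variable names are illustrative):

```python
# (ticker, value in dollars), copied from the holdings table above
holdings = [
    ("GS", 403_982),
    ("C", 332_984),
    ("JPM", 375_006),
]

portfolio_value = sum(value for _, value in holdings)
weights = {ticker: value / portfolio_value for ticker, value in holdings}

for ticker, weight in weights.items():
    print(f"{ticker}: {weight:.1%} of portfolio")
print(f"Total: ${portfolio_value:,}")  # Total: $1,111,972
```

The three positions sum to the reported ending value of $1,111,972, which is consistent with the full $1,000,000 having been deployed across 20 trades of $50,000 each.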
| Experiment | Approach | Return | Status |
|---|---|---|---|
| v1: Fixed Bucketing | Q-learning with fixed thresholds | 0% (HOLD all) | Failed |
| v2: Percentile Bucketing | Fixed state representation | +2.12% | Partial Success |
| v3: Real Backtest | Full portfolio simulation | +11.20% | Success! |
The difference between v2 (+2.12%) and v3 (+11.20%) comes down to real portfolio mechanics:
Per-trade metrics (v2's +2.12%) don't tell the full story. In a real portfolio:
Result: Same Q-learning algorithm, real portfolio simulation, roughly 5x the return (+11.20% vs +2.12%)
This backtest demonstrates the concept, but the current system has important constraints:
Each decision is independent: the agent doesn't consider current holdings or cash level. It could invest $50K even when only $10K of cash is available.
The agent always invests exactly $50,000 per trade, regardless of confidence level or portfolio state. Higher-confidence predictions should get larger positions.
The agent could put 100% of the portfolio in one stock. Real portfolios need diversification constraints.
The agent never sells positions (buy and hold only), so it can't take profits, cut losses, or rebalance.
The agent doesn't plan ahead across multiple filings, so it can't reason about "save cash for a better opportunity tomorrow."
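These constraints are easy to see in a minimal sketch of a fixed-size, buy-and-hold simulator. This is not the project's backtester. The `Portfolio` class and its methods are hypothetical, and the single GS price is reused from the trade table purely for illustration.

```python
from dataclasses import dataclass, field

TRADE_SIZE = 50_000  # fixed dollars per BUY, as described above


@dataclass
class Portfolio:
    cash: float
    shares: dict = field(default_factory=dict)

    def buy(self, ticker: str, price: float) -> None:
        # Deliberately no cash check, no position cap, and no sell path:
        # this mirrors the limitations listed above.
        self.shares[ticker] = self.shares.get(ticker, 0.0) + TRADE_SIZE / price
        self.cash -= TRADE_SIZE

    def value(self, prices: dict) -> float:
        return self.cash + sum(n * prices[t] for t, n in self.shares.items())


portfolio = Portfolio(cash=1_000_000)
for _ in range(20):
    portfolio.buy("GS", 379.87)
cash_after_20 = portfolio.cash  # 20 fixed-size trades use all starting capital

portfolio.buy("GS", 379.87)     # a 21st trade is not blocked
cash_after_21 = portfolio.cash  # cash is now negative
print(cash_after_20, cash_after_21)
```

With twenty $50,000 trades the starting capital is fully spent, and nothing stops a further trade from pushing cash below zero or all capital into a single ticker.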
We're building a more advanced system (v2_portfolio_qlearning) to address the limitations listed above:
Agent considers cash level, number of positions, and concentration when making decisions. State includes: (prediction_bucket, cash_level, num_positions, concentration)
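One way such a state tuple could be represented is sketched below. Only the four field names come from the text. The bucket labels, the cutoffs, and the helper names `bucketize` and `make_state` are assumptions for illustration.

```python
import bisect
from typing import NamedTuple, Sequence


class PortfolioState(NamedTuple):
    prediction_bucket: str
    cash_level: str
    num_positions: str
    concentration: str


def bucketize(value: float, cutoffs: Sequence[float], labels: Sequence[str]) -> str:
    """Map a number to a label; cutoffs are ascending, len(labels) == len(cutoffs) + 1."""
    return labels[bisect.bisect_right(cutoffs, value)]


def make_state(prediction_bucket: str, cash: float,
               position_values: Sequence[float]) -> PortfolioState:
    total = cash + sum(position_values)
    largest_weight = max(position_values, default=0.0) / total
    return PortfolioState(
        prediction_bucket=prediction_bucket,
        # All cutoffs below are hypothetical
        cash_level=bucketize(cash / total, [0.1, 0.5], ["low", "medium", "high"]),
        num_positions=bucketize(len(position_values), [1, 11], ["none", "few", "many"]),
        concentration=bucketize(largest_weight, [0.1, 0.3],
                                ["diversified", "moderate", "concentrated"]),
    )


# The final backtest portfolio: no cash, three positions
state = make_state("very_bullish", cash=0, position_values=[403_982, 332_984, 375_006])
print(state)
```

Keeping every component discrete keeps the Q-table small, at the cost of choosing cutoffs by hand.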
Instead of single filings, the agent sees sequences of 50-100 opportunities. It learns to build portfolios over time and to balance diversification against conviction.
Position sizes vary based on confidence and available cash. Very_bullish gets $100K, bullish gets $50K, neutral gets $25K, etc.
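A minimal sketch of this sizing rule follows. The three dollar amounts come from the text. The text does not give sizes for the remaining states, so this sketch treats any unlisted state as no allocation, which is an assumption, and the function name `position_size` is hypothetical.

```python
SIZE_BY_STATE = {
    "very_bullish": 100_000,
    "bullish": 50_000,
    "neutral": 25_000,
}


def position_size(state: str, available_cash: float) -> float:
    """Target dollars for a new position, capped by the cash actually available."""
    # Assumption: states not listed in the text get no allocation here
    target = SIZE_BY_STATE.get(state, 0)
    return min(target, max(available_cash, 0))


print(position_size("very_bullish", 1_000_000))  # 100000
print(position_size("bullish", 10_000))          # 10000 (capped by cash)
```

Capping by available cash removes the earlier problem of investing $50K when only $10K is on hand.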
Real portfolio constraints: max 10% per stock, max 20 positions, sector concentration limits, cash reserve requirements.
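The two limits that come with numbers (10% per stock, 20 positions) could be enforced with a check like the one below. Sector limits and cash reserves are left out because the text gives no thresholds for them, and the function name `allowed_buy` and its signature are hypothetical.

```python
MAX_WEIGHT = 0.10    # max share of portfolio value in one stock
MAX_POSITIONS = 20   # max number of distinct holdings


def allowed_buy(ticker: str, dollars: float, positions: dict, cash: float) -> float:
    """Return how many of the requested dollars can be bought within the limits.

    `positions` maps ticker -> current market value in dollars.
    """
    total = cash + sum(positions.values())
    # Opening a new position is blocked once the position cap is reached
    if ticker not in positions and len(positions) >= MAX_POSITIONS:
        return 0.0
    # Room left before this ticker exceeds its maximum weight
    headroom = MAX_WEIGHT * total - positions.get(ticker, 0.0)
    return max(0.0, min(dollars, headroom, cash))


print(allowed_buy("GS", 100_000, {}, 1_000_000))
print(allowed_buy("GS", 100_000, {"GS": 50_000}, 950_000))
```

Under these limits, the three-stock portfolio from the backtest, with each position above 29% of total value, would not have been allowed.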
Fair apples-to-apples testing across models: same scenarios, same opportunities, same starting conditions. Clear metrics: Sharpe ratio, max drawdown, win rate.
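The three metrics named here have standard definitions, sketched below with the standard library only. The annualization factor of 252 trading days and the zero risk-free rate are assumptions, and the sample inputs are made up for illustration.

```python
import math
import statistics
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from per-period returns (needs >= 2 varying values)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline of a portfolio value series, as a negative fraction."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = min(worst, v / peak - 1)
    return worst


def win_rate(trade_returns: Sequence[float]) -> float:
    """Fraction of trades with a strictly positive return."""
    return sum(r > 0 for r in trade_returns) / len(trade_returns)


print(max_drawdown([100, 120, 90, 130]))    # -0.25
print(win_rate([0.10, 0.05, -0.02, 0.07]))  # 0.75
print(sharpe_ratio([0.01, 0.02, 0.03]))
```

Running the same three functions over every model's equity curve and trade list keeps the comparison on equal terms.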
Agent learned extreme selectivity beats constant trading
+11.20% over 87 days with only 20 trades
Outperformed always-buy by 7.55 percentage points
The planned v2_portfolio_qlearning system, with portfolio awareness, is expected to improve on this
v1 (Failed): Learned to HOLD everything (state representation bug)
v2 (Partial Success): Fixed states, achieved +2.12% (revealed transformer calibration issue)
v3 (Success!): Real backtest shows +11.20% returns with extreme selectivity
Next: the v2_portfolio_qlearning system, with multi-step planning and risk management