Imagine teaching a robot to navigate a maze. You could:

1. Give the robot 10,000 examples of "when you're in position X, the correct move is Y."
   Problem: you have to know the correct answer for every possible situation. What if the maze changes?
2. Let the robot try different moves, crash into walls, find the goal, and learn from the results.
   Advantage: the robot figures out the best strategy on its own through trial and error.
Let's use a simple 4x4 grid world to understand Q-learning:
- Goal: the agent (🤖) needs to reach the goal (🎯)
- Obstacles: walls (🧱) block the path
- Actions: move Up, Down, Left, Right
State: The current situation the agent is in. In grid world, it's the agent's position (row, column).
Markov Property: The state contains all the information needed to make a decision. You don't need to know the full history of how you got here; you only need to know where you are right now.
Example: If you're at position (2,1), that's your state. You don't need to remember that you came from (2,0) or (1,1); only your current position matters.
Action: The choices available to the agent. In grid world: Up, Down, Left, Right.
Example: When at position (2,1), you can choose any of the 4 directions. Some might hit walls, some might move you closer to the goal.
Reward: Feedback from the environment after taking an action, typically a positive reward for reaching the goal and small penalties for hitting walls or wasting steps.
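To make states, actions, and rewards concrete, here is a minimal sketch of the 4x4 grid world in Python. The wall positions and the reward values (+10 for the goal, -5 for bumping a wall, -1 per step) are assumptions for illustration; the text above only says that walls, a goal, and rewards exist.

```python
# Minimal 4x4 grid-world sketch. Wall positions and reward values are
# illustrative assumptions, not taken from the original example.
GRID_SIZE = 4
START = (0, 0)                # the agent 🤖 starts here
GOAL = (3, 3)                 # the goal 🎯
WALLS = {(1, 1), (2, 3)}      # assumed wall cells 🧱
ACTIONS = {
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (0, 1),
}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    nxt = (row + d_row, col + d_col)
    # Off the grid or into a wall: stay put and take a penalty (assumed value).
    if not (0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE) or nxt in WALLS:
        return state, -5, False
    if nxt == GOAL:
        return nxt, +10, True   # reaching the goal ends the episode
    return nxt, -1, False       # small cost per step encourages short paths
```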
The Q-table stores the agent's learned knowledge about which actions are good in each state.
Q(state, action) = Expected future reward for taking this action in this state
| State (Position) | Up | Down | Left | Right |
|---|---|---|---|---|
| (0, 0) - Start | 0.5 | 2.3 | 0.1 | 4.7 |
| (0, 1) | 1.2 | 5.8 | 0.8 | 3.2 |
| (2, 2) | 3.1 | 2.7 | -0.5 | 7.9 |
| (3, 3) - Goal | 0.0 | 0.0 | 0.0 | 0.0 |
Higher values = better actions. The agent picks the action with the highest Q-value.
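In code, a Q-table for this grid world can be a plain dictionary keyed by (state, action). A minimal sketch follows; the values in the table above are learned over time, so here every entry simply starts at zero.

```python
ACTIONS = ["up", "down", "left", "right"]
GRID_SIZE = 4

# Q[(state, action)] -> expected future reward; starts at 0.0 and is updated as the agent learns.
Q = {
    ((row, col), action): 0.0
    for row in range(GRID_SIZE)
    for col in range(GRID_SIZE)
    for action in ACTIONS
}

def best_action(state):
    """Greedy choice: pick the action with the highest Q-value in this state."""
    return max(ACTIONS, key=lambda action: Q[(state, action)])
```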
After each action, the Q-value is updated based on what actually happened:

Q(state, action) = Q(state, action) + learning_rate × (reward + discount × max Q(next_state, all actions) - Q(state, action))

In plain English: nudge the old estimate toward what you just observed, which is the reward you received plus the best value you currently expect from the next state. Actions that lead to good outcomes get higher Q-values; actions that lead to crashes get lower ones.
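A sketch of that update as a function. The learning rate and discount factor shown here are common default values, not numbers from the text.

```python
ALPHA = 0.1   # learning rate: how far to move toward the new estimate (assumed value)
GAMMA = 0.9   # discount factor: how much future reward counts (assumed value)
ACTIONS = ["up", "down", "left", "right"]

def update_q(Q, state, action, reward, next_state):
    """One Q-learning update after observing (state, action) -> (reward, next_state)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)          # best value from where we landed
    target = reward + GAMMA * best_next                           # what this experience suggests
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])   # nudge the old estimate toward it
```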
A key challenge in Q-learning: Should the agent try new things or stick with what it knows works?
Exploration:
- Strategy: try random actions, even if they don't look good
- Why: you might discover better paths you didn't know about
- Early learning: explore a lot (e.g., 50% random moves)

Exploitation:
- Strategy: always pick the action with the highest Q-value
- Why: use what you've learned to maximize reward
- Later learning: exploit more (e.g., 95% best action)
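A standard way to balance the two is epsilon-greedy with a decaying epsilon: explore with probability epsilon, exploit otherwise, and shrink epsilon as learning progresses. A minimal sketch; the 50% and 95% figures above correspond to epsilon = 0.5 early and epsilon = 0.05 later, and the decay rate here is an assumption.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def choose_action(Q, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                    # explore: try a random move
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit: best known move

# Start exploring heavily (~50% random), end up mostly exploiting (~95% best action).
EPSILON_START = 0.5
EPSILON_MIN = 0.05
EPSILON_DECAY = 0.995   # decay rate per episode is an assumption; tune per problem

def decay_epsilon(epsilon):
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```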
This is the #1 question from Raul and Elise. Here's the clear answer:
"I predict what will happen"
"I decide what to DO about it"
Simple Analogy: a weather forecast tells you it will probably rain; you still have to decide whether to carry an umbrella. The forecast is the transformer, the decision is Q-learning.
Trading Example:
• Transformer prediction: +15% return
• Current portfolio position: Don't own this stock
• Volatility: Low (good)
• Sector trend: Tech sector up 20% this quarter
• Insider activity: 3 executives bought shares
• Market conditions: Bull market
• Risk exposure: Portfolio 40% tech (getting high)
→ Q-Learning Decision: HOLD (don't add more tech exposure)
In our stock trading system, the Q-learning state is much richer than the transformer prediction alone:
Transformer input: event sequences from SEC filings, e.g.
["acquired_company", "issued_debt", "announced_partnership", ...]
Transformer output: predicted return = +15%
State = {
transformer_prediction: +15%,
current_position: 0 shares (don't own),
portfolio_value: $100,000,
sector_exposure: 40% tech (high),
stock_volatility: 0.15 (low),
market_trend: bull_market,
insider_buying: cluster_detected,
days_held: 0,
unrealized_gain: 0%
}
Actions: [BUY, HOLD, SELL]
Q-Learning chooses: HOLD
Reason: even though the prediction is good (+15%), tech exposure is already 40% (risk management)
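Putting that decision into code: a hedged sketch of how a tabular agent could turn this richer state into a BUY/HOLD/SELL choice. The `discretize` helper, its bucket thresholds, and the 40% exposure boundary are illustrative assumptions; only the state keys (transformer_prediction, current_position, sector_exposure, market_trend) come from the example above.

```python
import random

TRADE_ACTIONS = ["BUY", "HOLD", "SELL"]

def discretize(state):
    """Collapse the rich state into a small tuple a Q-table can index (buckets are assumptions)."""
    return (
        "up" if state["transformer_prediction"] > 0.05 else "flat_or_down",  # prediction bucket
        "owns" if state["current_position"] > 0 else "no_position",          # do we hold shares?
        "high" if state["sector_exposure"] >= 0.40 else "ok",                # sector exposure bucket
        "bull" if state["market_trend"] == "bull_market" else "other",
    )

def trading_action(Q, state, epsilon=0.0):
    """Epsilon-greedy choice over BUY / HOLD / SELL using the discretized state."""
    key = discretize(state)
    if random.random() < epsilon:
        return random.choice(TRADE_ACTIONS)
    return max(TRADE_ACTIONS, key=lambda a: Q.get((key, a), 0.0))

# The state from the example above, expressed numerically.
state = {
    "transformer_prediction": 0.15,  # +15% predicted return
    "current_position": 0,           # 0 shares, don't own the stock
    "sector_exposure": 0.40,         # 40% tech (already high)
    "market_trend": "bull_market",
}
# With a learned Q-table, a state in this bucket (good prediction but high sector
# exposure) can map to HOLD instead of BUY.
```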
Q-Learning in One Sentence:
A learning system that figures out the best actions through trial and error, using only the current state (not full history) and learning from real outcomes (not labeled examples).
Why We Need Both Models:
Transformer predicts what will happen. Q-learning decides what to DO about it. One is prediction, the other is action. You need both to build an intelligent trading system.