Imagine teaching a robot to navigate a maze. You could:

1. Give the robot 10,000 examples of "when you're in position X, the correct move is Y."
   Problem: you have to know the correct answer for every possible situation. What if the maze changes?
2. Let the robot try different moves, crash into walls, find the goal, and learn from the results.
   Advantage: the robot figures out the best strategy on its own through trial and error.
Let's use a simple 4x4 grid world to understand Q-learning:
- Goal: the agent (🤖) needs to reach the goal (🎯)
- Obstacles: walls (🧱) block the path
- Actions: move Up, Down, Left, Right
State: The current situation the agent is in. In grid world, it's the agent's position (row, column).
Markov Property: The state contains all the information needed to make a decision. You don't need to know the full history of how you got here; you only need to know where you are right now.
Example: If you're at position (2,1), that's your state. You don't need to remember that you came from (2,0) or (1,1); only your current position matters.
Action: The choices available to the agent. In grid world: Up, Down, Left, Right.
Example: When at position (2,1), you can choose any of the 4 directions. Some might hit walls, some might move you closer to the goal.
Reward: Feedback from the environment after taking an action, typically a positive reward for reaching the goal and small penalties for hitting walls or wasting steps.
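To make states, actions, and rewards concrete, here is a minimal sketch of the 4x4 grid world in Python. The wall positions and the reward values (+10 for the goal, -5 for bumping a wall, -1 per step) are assumptions for illustration; the text above only says that walls, a goal, and rewards exist.

```python
# Minimal 4x4 grid-world sketch. Wall positions and reward values are
# illustrative assumptions, not taken from the original example.
GRID_SIZE = 4
START = (0, 0)                # the agent 🤖 starts here
GOAL = (3, 3)                 # the goal 🎯
WALLS = {(1, 1), (2, 3)}      # assumed wall cells 🧱
ACTIONS = {
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (0, 1),
}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    nxt = (row + d_row, col + d_col)
    # Off the grid or into a wall: stay put and take a penalty (assumed value).
    if not (0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE) or nxt in WALLS:
        return state, -5, False
    if nxt == GOAL:
        return nxt, +10, True   # reaching the goal ends the episode
    return nxt, -1, False       # small cost per step encourages short paths
```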
The Q-table stores the agent's learned knowledge about which actions are good in each state.
Q(state, action) = Expected future reward for taking this action in this state
| State (Position) | Up | Down | Left | Right |
|---|---|---|---|---|
| (0, 0) - Start | 0.5 | 2.3 | 0.1 | 4.7 |
| (0, 1) | 1.2 | 5.8 | 0.8 | 3.2 |
| (2, 2) | 3.1 | 2.7 | -0.5 | 7.9 |
| (3, 3) - Goal | 0.0 | 0.0 | 0.0 | 0.0 |
Higher values = better actions. The agent picks the action with the highest Q-value.
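In code, a Q-table for this grid world can be a plain dictionary keyed by (state, action). A minimal sketch follows; the values in the table above are learned over time, so here every entry simply starts at zero.

```python
ACTIONS = ["up", "down", "left", "right"]
GRID_SIZE = 4

# Q[(state, action)] -> expected future reward; starts at 0.0 and is updated as the agent learns.
Q = {
    ((row, col), action): 0.0
    for row in range(GRID_SIZE)
    for col in range(GRID_SIZE)
    for action in ACTIONS
}

def best_action(state):
    """Greedy choice: pick the action with the highest Q-value in this state."""
    return max(ACTIONS, key=lambda action: Q[(state, action)])
```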
After each action, the Q-value is updated based on what actually happened:

Q(state, action) = Q(state, action) + learning_rate × (reward + discount × max Q(next_state, all actions) - Q(state, action))

In plain English: nudge the old estimate toward what you just observed, which is the reward you received plus the best value you currently expect from the next state. Actions that lead to good outcomes get higher Q-values; actions that lead to crashes get lower ones.
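A sketch of that update as a function. The learning rate and discount factor shown here are common default values, not numbers from the text.

```python
ALPHA = 0.1   # learning rate: how far to move toward the new estimate (assumed value)
GAMMA = 0.9   # discount factor: how much future reward counts (assumed value)
ACTIONS = ["up", "down", "left", "right"]

def update_q(Q, state, action, reward, next_state):
    """One Q-learning update after observing (state, action) -> (reward, next_state)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)          # best value from where we landed
    target = reward + GAMMA * best_next                           # what this experience suggests
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])   # nudge the old estimate toward it
```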
A key challenge in Q-learning: Should the agent try new things or stick with what it knows works?
Exploration:
- Strategy: try random actions, even if they don't look good
- Why: you might discover better paths you didn't know about
- Early learning: explore a lot (e.g., 50% random moves)

Exploitation:
- Strategy: always pick the action with the highest Q-value
- Why: use what you've learned to maximize reward
- Later learning: exploit more (e.g., 95% best action)
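A standard way to balance the two is epsilon-greedy with a decaying epsilon: explore with probability epsilon, exploit otherwise, and shrink epsilon as learning progresses. A minimal sketch; the 50% and 95% figures above correspond to epsilon = 0.5 early and epsilon = 0.05 later, and the decay rate here is an assumption.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def choose_action(Q, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                    # explore: try a random move
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit: best known move

# Start exploring heavily (~50% random), end up mostly exploiting (~95% best action).
EPSILON_START = 0.5
EPSILON_MIN = 0.05
EPSILON_DECAY = 0.995   # decay rate per episode is an assumption; tune per problem

def decay_epsilon(epsilon):
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```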
This is the #1 question from Raul and Elise. Here's the clear answer:
"I predict what will happen"
"I decide what to DO about it"
Simple Analogy: a weather forecast tells you it will probably rain; you still have to decide whether to carry an umbrella. The forecast is the transformer, the decision is Q-learning.
Trading Example:
• Transformer prediction: +15% return
• Current portfolio position: Don't own this stock
• Volatility: Low (good)
• Sector trend: Tech sector up 20% this quarter
• Insider activity: 3 executives bought shares
• Market conditions: Bull market
• Risk exposure: Portfolio 40% tech (getting high)
→ Q-Learning Decision: HOLD (don't add more tech exposure)
In our stock trading system, the Q-learning state is much richer than the transformer prediction alone:
Transformer input: event sequences from SEC filings, e.g.
["acquired_company", "issued_debt", "announced_partnership", ...]
Transformer output: predicted return = +15%
State = {
transformer_prediction: +15%,
current_position: 0 shares (don't own),
portfolio_value: $100,000,
sector_exposure: 40% tech (high),
stock_volatility: 0.15 (low),
market_trend: bull_market,
insider_buying: cluster_detected,
days_held: 0,
unrealized_gain: 0%
}
Actions: [BUY, HOLD, SELL]
Q-Learning chooses: HOLD
Reason: even though the prediction is good (+15%), tech exposure is already 40% (risk management)
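Putting that decision into code: a hedged sketch of how a tabular agent could turn this richer state into a BUY/HOLD/SELL choice. The `discretize` helper, its bucket thresholds, and the 40% exposure boundary are illustrative assumptions; only the state keys (transformer_prediction, current_position, sector_exposure, market_trend) come from the example above.

```python
import random

TRADE_ACTIONS = ["BUY", "HOLD", "SELL"]

def discretize(state):
    """Collapse the rich state into a small tuple a Q-table can index (buckets are assumptions)."""
    return (
        "up" if state["transformer_prediction"] > 0.05 else "flat_or_down",  # prediction bucket
        "owns" if state["current_position"] > 0 else "no_position",          # do we hold shares?
        "high" if state["sector_exposure"] >= 0.40 else "ok",                # sector exposure bucket
        "bull" if state["market_trend"] == "bull_market" else "other",
    )

def trading_action(Q, state, epsilon=0.0):
    """Epsilon-greedy choice over BUY / HOLD / SELL using the discretized state."""
    key = discretize(state)
    if random.random() < epsilon:
        return random.choice(TRADE_ACTIONS)
    return max(TRADE_ACTIONS, key=lambda a: Q.get((key, a), 0.0))

# The state from the example above, expressed numerically.
state = {
    "transformer_prediction": 0.15,  # +15% predicted return
    "current_position": 0,           # 0 shares, don't own the stock
    "sector_exposure": 0.40,         # 40% tech (already high)
    "market_trend": "bull_market",
}
# With a learned Q-table, a state in this bucket (good prediction but high sector
# exposure) can map to HOLD instead of BUY.
```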
Q-Learning in One Sentence:
A learning system that figures out the best actions through trial and error, using only the current state (not full history) and learning from real outcomes (not labeled examples).
Why We Need Both Models:
Transformer predicts what will happen. Q-learning decides what to DO about it. One is prediction, the other is action. You need both to build an intelligent trading system.