💎
November 2025 • Alpha Research
Event Database Analysis & Alpha Generation
10 systematic strategies for extracting alpha from 9.1M events spanning 51,551 unique event types: IDF-weighted importance scoring, event sequence mining, co-occurrence networks, sentiment momentum, cascade detection, company signatures, embeddings, and ML features. Production-ready database with 97% confidence extraction quality. These are not 30 hand-crafted events but 51K emergent event types discovered from the data.
10 Alpha Generation Strategies:
- Inflection point event clustering (retrospective pattern mining + sequences)
- IDF-weighted importance scoring (rare events = high alpha, IDF > 12)
- Company event signature analysis (healthy vs distressed fingerprints)
- Event co-occurrence networks (graph analysis, rare paths)
- Temporal event density (burst detection, velocity analysis)
- Sentiment momentum strategy (decay-weighted, magnitude-scaled)
- Critical event cascade detection (distress, turnaround, growth cascades)
- Industry-specific patterns (biotech, finance, tech event profiles)
- Event type embeddings (Word2Vec on 51K event space)
- Multi-horizon ML features (7d to 90d feature engineering)
💎 Alpha Generation
📊 9.1M Events
🎯 51K Event Types
📖 32 min read
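A minimal sketch of the IDF-weighted importance idea, assuming IDF is the natural log of total events divided by the count of each event type. The source does not state which IDF variant or log base is used, and the event labels below are invented for illustration. Under this assumed definition, IDF > 12 on a 9.1M-event corpus would correspond to types seen roughly 55 times or fewer.

```python
import math
from collections import Counter


def idf_scores(event_types):
    """Return {event_type: ln(N / count)}, where N is the total number of events.

    Assumption: IDF is computed over individual events with a natural log.
    The real pipeline may count per filing or per company instead.
    """
    counts = Counter(event_types)
    total = len(event_types)
    return {etype: math.log(total / n) for etype, n in counts.items()}


# Toy event stream with invented labels: two routine types and one rare type.
events = (
    ["earnings_release"] * 90
    + ["dividend_declared"] * 9
    + ["auditor_resignation"]
)

scores = idf_scores(events)
rarest = max(scores, key=scores.get)  # the rarest type gets the highest IDF
```

Rare types dominate the ranking by construction, which is the "rare events = high alpha" intuition in the list above; whether rarity translates into returns is an empirical question the score itself does not answer.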
📊
November 2025 • V12/V13 Vision • 🎯 Raul's Priority
Company Scoring System: Multi-Dimensional Intelligence
Score companies at inflection points using 10-20 years of historical patterns. Multi-dimensional scores (0-10) across quality, risk, success probability, speed, and sustainability. Not quarterly trading signals - this is strategic intelligence. Learn from breakouts, collapses, sustained trends across 15K-25K inflection points. Different product from V7: 6-12 month horizon vs 90 days, monthly updates vs quarterly, scoring vs trading.
Key Innovations:
- Inflection point detection: scan 10-20 years to find breakouts, collapses, recoveries
- Multi-dimensional scores: quality, risk, success probability, speed, sustainability
- Industry-specific: biotech (FDA trials), mining (resource discoveries), tech
- Richer context: 1-2 year event lookback vs 180 days for V7
- Better labels: risk-adjusted scores vs binary profitable/not
- Use cases: Raul's dashboard, credit risk, portfolio monitoring, investment screening
- 15K-25K training examples from 5,000 companies × 20 years
- Timeline: 6-8 weeks after V7 proven
📊 Company Scoring
🎯 Raul's Priority
⏰ 6-12 Month Horizon
📖 25 min read
🗺️
November 2025 • Strategic Roadmap
Future Work Roadmap: V8-V11+ Development
Modular feature additions from market context to full technical analysis. Each model adds ONE complexity layer: V8 (market context), V9 (fundamentals), V10 (technicals), V11 (everything). Clear upsell path from base event signals (1x) to elite multi-factor alpha engines (5x). Special purpose models: V7a (insider intelligence), V7b (agreement intelligence). Plus strategic opportunity: SEDAR Canadian market ($3M ARR potential, zero competition).
Key Topics Covered:
- Core philosophy: Keep it simple, one layer at a time
- V8: Market context (VIX, sector performance, credit spreads, rates)
- V9: Fundamentals (P/E, leverage, growth, profitability, quality)
- V10: Technicals (RSI, MACD, volume, volatility, price action)
- V11: Elite everything (kitchen sink model for enterprise)
- V7a: Enhanced insider intelligence (clustering, magnitude, timing)
- V7b: Agreement intelligence (licensing, supply, JV, M&A analysis)
- SEDAR opportunity: Canadian market, 95% code reuse, first mover
- Modular pricing: Base + add-ons ($X to $5X)
🗺️ Product Roadmap
📊 Modular Architecture
💰 Pricing Strategy
📖 28 min read
⚡
November 2025 • Future Research
Diffusion Models for Trading Signals: 50-100x Speedup
Single-step diffusion models revolutionizing LLM-based scoring: 3-15ms inference vs 1-2s for current LLMs. Perfect for V12/V13 company scoring (5,000 companies in 1 second vs 100 minutes). Not a replacement for reasoning-based V7 signals, but transformative for high-throughput scoring with native uncertainty quantification. Hybrid architecture planned: diffusion for speed, LLM for reasoning.
Key Insights:
- 50-100× speedup: 1-2s → 3-15ms per prediction
- Perfect for V12/V13: Score 5,000 companies in 1 second
- Native uncertainty: Sample 100× for free (vs expensive LLM sampling)
- Consistency Trajectory Models (CTM): OpenAI's 1-step distillation
- Not for V7: Reasoning and explainability still matter
- Use cases: portfolio rebalancing, universe screening, Monte Carlo, real-time events
- Hybrid vision: Diffusion for all → LLM reasoning for top 50
- Timeline: Prototype after V7 validates, production for V12
⚡ 50-100x Faster
🎯 V12/V13 Perfect Fit
🔬 Future Research
📖 18 min read
🎓
November 4, 2025 • 🏃 Phase 1 Training 79% Complete
Model Distillation: Train Ultra-Fast Event Extraction
Pivoted to nanochat (Karpathy's minimal LLM) instead of Phi-3-Mini. Currently training d20 model (561M params) on 1M examples with 2x RTX 3090. Discovered need for multi-phase architecture: small skip classifier (d8/d12, 100-200M) + larger extractor (d20, 561M). Training at step 17,588/22,222, ~3 hours from completion. Replace $50/day H200 with $0 CPU inference.
Key Updates:
- Training nanochat d20 (561M params) on vortex with 2x RTX 3090
- Step 17,588/22,222 (79% complete), training loss 0.02-0.05
- Multi-phase architecture: Skip classifier + Event extractor
- 100K validation model: 1.3 hours, val loss 0.0426 (excellent)
- Phase 2 next: Train skip classifier (d8/d12) on balanced dataset
- Target: 2 seconds per filing on CPU vs hours with Qwen on an H200
- Cost: $0/month (on-prem CPU) vs $1,500/month (H200)
- Original plan (Phi-3-Mini) kept on the page for reference
🏃 Training In Progress
🎓 nanochat LLM
💰 $1,500/mo Savings
📖 25 min read
📊
November 3, 2025 • Product #4
Company Scoring System: 0-100 Algorithmic Rankings
Comprehensive 0-100 company scoring algorithm combining 6 key categories from SEC events, insider trading signals, and transformer predictions. Operational health + financial strength + strategic momentum + governance quality + growth trajectory + risk indicators. Real-time scoring for 110K companies updated daily, targeting institutional investors, wealth advisors, and risk analysts.
Key Topics Covered:
- 6-category scoring algorithm (±20, ±15, ±15, ±15, ±10, ±10 points)
- Operational Health: expansions, suspensions, facility metrics
- Financial Strength: refinancing, covenant violations, defaults
- Strategic Momentum: partnerships, acquisitions, expansions
- Governance Quality: insider buying/selling, auditor changes
- Growth Trajectory: transformer predictions (42.8% correlation)
- Risk Indicators: investigations, lawsuits, regulatory actions
- Use cases: portfolio screening, risk monitoring, due diligence, sector rotation
- Pricing tiers: $20K-$300K/month, $8-15M ARR potential
- Competitive advantages vs Moody's, S&P, FactSet, Bloomberg
📊 Company Scoring
💡 Algorithm Design
🎯 Institutional Product
📖 23 min read
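A minimal sketch of how six capped category contributions could combine into a 0-100 score. The caps come from the list above; pairing them with categories in listed order, the neutral baseline of 50, and the final clamp are all assumptions, since the card does not say how the ± points map onto the 0-100 range.

```python
# Caps from the card; the category-to-cap pairing (in listed order) is assumed.
CAPS = {
    "operational_health": 20,
    "financial_strength": 15,
    "strategic_momentum": 15,
    "governance_quality": 15,
    "growth_trajectory": 10,
    "risk_indicators": 10,
}
BASELINE = 50  # assumed neutral starting score, not stated in the source


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def company_score(raw_points):
    """Combine per-category points into a single 0-100 score.

    raw_points maps category name -> signed points; missing categories count as 0.
    Each category is clamped to its own cap before the total is clamped to 0-100.
    """
    total = BASELINE
    for category, cap in CAPS.items():
        total += clamp(raw_points.get(category, 0), -cap, cap)
    return clamp(total, 0, 100)


example = company_score({"financial_strength": 6, "risk_indicators": -4})
```

The caps sum to 85, so with a baseline of 50 the outer clamp is what keeps extreme cases inside 0-100.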
🔮
November 3, 2025 • ✅ Pattern Detection Complete
Agreement Pattern Predictions: 30-180 Day Lead Time
Temporal patterns in how companies file legal agreements predict M&A deals, financial distress, IPOs, and strategic moves 30-180 days before public announcement. Example: Stock Purchase Agreement + Voting Agreement + Standstill filed within 60 days → M&A deal announced 45 days later. 301 agreement types, 10 prediction rules, pattern detection complete.
Key Insights:
- 301 agreement types across 8 major categories (pattern detection ✅ complete)
- 10 core prediction rules: M&A (30-60 days), Distress, Pre-IPO (6-12 months), etc.
- Agreement clustering signals strategic events before press releases
- M&A Imminent: Stock Purchase + Voting + Standstill → Deal in 45 days
- Financial Distress: Forbearance + Amendment + Asset Sale → Restructuring
- Pre-IPO Signal: Lock-Up + Registration Rights → S-1 in 4-6 months
- Geographic Expansion: Multiple leases in new regions → Store openings
- Vector search system (semantic similarity) planned for 6-week implementation
- Use cases: Investment research, sales targeting, risk management, competitive intel
- TAM: $100M-500M annual revenue potential (credit analysts, M&A, legal teams)
🔮 Early Warning System
✅ Pattern Detection Live
⏰ 30-180 Day Lead
📖 24 min read
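The "M&A Imminent" rule above can be sketched as a windowed set check. This is an illustration only: the tuple layout, the function name `ma_signal_date`, and the exact agreement-type strings are hypothetical, and the actual detector may match types differently.

```python
from datetime import date, timedelta

# Agreement types named in the rule; exact label strings are assumed.
MA_PATTERN = {"Stock Purchase Agreement", "Voting Agreement", "Standstill"}


def ma_signal_date(filings, window_days=60):
    """Return the date on which all three agreement types have appeared
    within window_days of each other, or None if that never happens.

    filings: iterable of (filing_date, agreement_type) tuples for one company.
    """
    relevant = sorted((d, t) for d, t in filings if t in MA_PATTERN)
    window = timedelta(days=window_days)
    for i, (end_date, _) in enumerate(relevant):
        # Pattern types filed in the window that ends at this filing.
        seen = {t for d, t in relevant[: i + 1] if end_date - d <= window}
        if seen == MA_PATTERN:
            return end_date
    return None


filings = [
    (date(2025, 1, 10), "Stock Purchase Agreement"),
    (date(2025, 1, 25), "Lease"),
    (date(2025, 2, 1), "Voting Agreement"),
    (date(2025, 2, 20), "Standstill"),
]
signal = ma_signal_date(filings)  # all three types fall within 41 days
```

The returned date is when the pattern completes; the 45-day lead time quoted above would be measured from that point.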
💼
November 2, 2025
Commercial Product Strategy - Selling into Equity Markets
Comprehensive analysis of 11 product opportunities for monetizing SEC event extraction, transformer predictions, and Q-learning trading systems. From basic data feeds to premium alpha signals, with detailed pricing, GTM strategies, and revenue projections.
Key Topics Covered:
- 11 product opportunities from data to platform
- Tier 1-4 product portfolio ($5M to $100M ARR path)
- Competitive positioning vs Bloomberg, FactSet, S&P
- 42.8% transformer correlation advantage
- 5-year revenue projections to $100M ARR
- Go-to-market strategy by customer tier
- Risk mitigation and exit strategies
📊 Product Strategy
💰 Revenue Modeling
🎯 GTM Strategy
📖 21 min read
📈
November 2025
Can Markets Be Predicted?
If markets are stochastic (same events → different outcomes), is prediction hopeless? Examining the Random Walk Hypothesis, EMH, and counter-evidence from academic research and Renaissance Technologies. Explaining what 42.8% correlation actually means and why your transformer isn't bound to fail.
Key Topics Covered:
- Random Walk Hypothesis and EMH (weak, semi-strong, strong)
- Academic counter-evidence (momentum, value, drift, events)
- Renaissance Technologies: 66% annual returns
- What 42.8% correlation actually means (r² = 18.3%)
- Stochastic ≠ Unpredictable (weather analogy)
- Three sources of returns (beta, luck, alpha)
- Why your system finds alpha (4 key advantages)
- Reconciling Random Walk with your evidence
📊 Theory vs Evidence
🔬 Academic Research
💡 Fundamental Question
📖 16 min read
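The r² figure in the list above is just the square of the quoted correlation. A short check, assuming the 42.8% is a Pearson correlation and that a linear relationship is the right reading:

```python
r = 0.428                     # correlation quoted in the card
r_squared = r ** 2            # share of variance explained under a linear fit
unexplained = 1 - r_squared   # share left to noise and other factors

print(f"r^2 = {r_squared:.3f}")          # r^2 = 0.183
print(f"unexplained = {unexplained:.3f}")  # unexplained = 0.817
```

So roughly 18% of the variance in outcomes is explained and about 82% is not, which is consistent with the "stochastic ≠ unpredictable" point: a partial edge, not certainty.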
⚔️
October 31, 2025
Competitive Analysis: Event-Based Architecture vs Fintool
Fundamentally different architectural philosophies for processing SEC filings. Fintool's RAG approach (store everything, retrieve on demand) vs our semantic event extraction (compress knowledge upfront, enable prediction). Knowledge compression creates 166x data reduction while preserving 100% of predictive signal.
Key Analysis Points:
- Architecture comparison: RAG vs Semantic Events
- 166x compression ratio (500GB → 3GB)
- Cost advantage: $5K-10K one-time vs $1M+/week
- Event Oracle: Superior Q&A for "what did they do?"
- Unique capabilities: temporal patterns, predictions
- Defensible moat: 30 event types, 11.9M proprietary events
- Multi-model architecture from same foundation
- Positioning: Descriptive vs Predictive
⚔️ Competitive Strategy
💰 Cost Analysis
🛡️ Defensible Moat
📖 22 min read
🎯
November 2025
Insider Trading Features: From Raw Events to Predictive Signals
We already have insider data from Forms 3/4/5/13D/13F, but raw events aren't enough. Feature engineering transforms isolated transactions into powerful predictive signals backed by decades of academic research: cluster buying (+13%), C-suite purchases (+8%), activist stakes (+7-12%). Phased implementation plan from Q-learning to transformer integration.
Key Topics Covered:
- The realization: raw events vs. engineered features
- Forms 3/4/5/13D/13F - what we're already parsing
- Academic evidence: Seyhun, Lakonishok & Lee, Brav et al.
- Top 6 features ranked by predictive power
- Integration strategies: transformer, Q-learning, hybrid
- 3-phase implementation plan (2 weeks to 2 months)
- Python extraction code ready for deployment
- Expected impact: +5-8% over baseline
🎯 Feature Engineering
📊 Academic Research
🚀 Implementation Plan
📖 20 min read
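A minimal sketch of one engineered feature, cluster buying, built from raw insider transactions. The 30-day window, the three-insider threshold, and the tuple layout are assumptions for illustration; the card does not give the definition used in the article, and the cited studies define clusters in various ways.

```python
from datetime import date, timedelta


def cluster_buy_count(transactions, as_of, window_days=30):
    """Count distinct insiders with at least one purchase in the trailing window.

    transactions: iterable of (trade_date, insider_name, side) tuples,
    where side is "buy" or "sell" (hypothetical layout).
    """
    start = as_of - timedelta(days=window_days)
    buyers = {
        insider
        for trade_date, insider, side in transactions
        if side == "buy" and start <= trade_date <= as_of
    }
    return len(buyers)


def is_cluster_buy(transactions, as_of, window_days=30, min_insiders=3):
    """Flag a cluster when enough distinct insiders bought inside the window."""
    return cluster_buy_count(transactions, as_of, window_days) >= min_insiders


trades = [
    (date(2025, 8, 1), "COO", "buy"),          # outside the window
    (date(2025, 10, 1), "CEO", "buy"),
    (date(2025, 10, 8), "CFO", "buy"),
    (date(2025, 10, 8), "CFO", "buy"),         # repeat buyer counts once
    (date(2025, 10, 20), "Director A", "buy"),
    (date(2025, 10, 22), "Director B", "sell"),  # sales are ignored
]
n_buyers = cluster_buy_count(trades, as_of=date(2025, 10, 31))
flag = is_cluster_buy(trades, as_of=date(2025, 10, 31))
```

Counting distinct insiders rather than transactions is the design choice that separates this feature from a raw event count: one executive buying five times is not a cluster.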