Knowledge Compression: The Foundation

How We Compress 500GB of SEC Filings into 3GB of Actionable Intelligence
Last Updated: January 2025

The Secret Sauce

While others are building better search engines for financial documents, we're doing something fundamentally different: multi-stage knowledge compression.

We're not just storing and retrieving text. We're extracting semantic meaning, compressing it into structured events, and compressing those events further into a canonical vocabulary that preserves predictive signal while shrinking the data 166x.

The Core Innovation

Most systems treat financial documents as text to be searched. We treat them as unstructured knowledge to be compressed into structured semantic abstractions. Extract once, use forever. No embedding drift, no retrieval errors, no context windows, no marginal costs.

The Multi-Stage Compression Journey

Stage 0: Raw SEC Filings (500 GB)
10-K, 10-Q, and 8-K filings plus proxy statements. Dense legal text, tables, exhibits. Signal buried in noise. Unstructured, inconsistent, verbose.

Stage 1: Event Extraction (5 GB, 11.9M events)
vLLM running a Qwen 9B model extracts semantic events such as "acquired_company", "entered_strategic_partnership", and "announced_layoffs". Each event carries a subject, verb, object, temporal certainty, strategic importance, and sentiment. 100x compression while preserving predictive signal.

Stage 2: Vocabulary Compression (3 GB, 388-3,558 types)
Hybrid IDF×log(frequency) scoring or semantic grouping compresses 37,927 unique event types down to 388-3,558 canonical types, balancing rare high-signal events (activist stakes, insider buying clusters) against common predictive patterns (product launches, partnerships). 1.66x additional compression.

Total compression ratio: 166× (500GB → 5GB → 3GB)
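
To make Stage 1 concrete, here is a hypothetical example of a single extracted event record; the field names mirror the attributes described above but are illustrative, not the production schema.

```python
# Hypothetical example of one extracted event record.
# Field names are illustrative, not the exact production schema.
event = {
    "cik": "0000320193",                  # SEC Central Index Key
    "company": "Apple Inc.",
    "source_filing": "10-K",
    "filing_date": "2024-11-01",
    "event_type": "entered_strategic_partnership",
    "subject": "Apple Inc.",
    "verb": "entered",
    "object": "strategic partnership with <counterparty>",
    "temporal_certainty": "completed",    # e.g. completed / announced / planned
    "strategic_importance": 0.8,          # 0-1, assigned at extraction time
    "sentiment": 0.6,                     # -1 to 1
}
```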

The Technical Pipeline

Step 1: Filing Ingestion

Download 10-K, 10-Q, and 8-K filings from SEC EDGAR. Parse HTML/XBRL. Extract text sections.

Step 2: Batch Extraction with vLLM

Run the Qwen 9B model with GPU acceleration. Extract 30 event types per section. Emit structured JSON output with metadata.
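
A minimal sketch of this step, assuming vLLM's offline LLM API; the model id is a placeholder (the document names Qwen 9B but not an exact checkpoint), and the prompt template and error handling are our assumptions rather than the production configuration.

```python
from vllm import LLM, SamplingParams
import json

# Placeholder model id; the pipeline uses a Qwen ~9B model.
llm = LLM(model="Qwen/Qwen-9B-placeholder", tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=1024)

PROMPT = (
    "Extract financial events from the filing section below as a JSON list. "
    "Each event needs: event_type, subject, verb, object, temporal_certainty, "
    "strategic_importance, sentiment.\n\nSECTION:\n{section}"
)

def extract_events(sections: list[str]) -> list[list[dict]]:
    """Run one GPU batch over many filing sections and parse the JSON output."""
    prompts = [PROMPT.format(section=s) for s in sections]
    outputs = llm.generate(prompts, params)
    events = []
    for out in outputs:
        try:
            events.append(json.loads(out.outputs[0].text))
        except json.JSONDecodeError:
            events.append([])  # assumed policy: drop malformed generations
    return events
```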

Step 3: Event Storage

Store 11.9M events in PostgreSQL. Each event records: CIK, company, date, type, subject, verb, object, certainty, importance, sentiment, and source filing.
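
A sketch of what the event table could look like, assuming one flat PostgreSQL table with the columns listed above; the actual DDL, indexing, and partitioning are not specified in this document.

```python
import psycopg2

# Assumed flat schema; the production table may be normalized or partitioned.
DDL = """
CREATE TABLE IF NOT EXISTS events (
    id                  BIGSERIAL PRIMARY KEY,
    cik                 TEXT NOT NULL,
    company             TEXT NOT NULL,
    event_date          DATE NOT NULL,
    event_type          TEXT NOT NULL,
    subject             TEXT,
    verb                TEXT,
    object              TEXT,
    temporal_certainty  TEXT,
    importance          REAL,
    sentiment           REAL,
    source_filing       TEXT
);
CREATE INDEX IF NOT EXISTS events_cik_date_idx ON events (cik, event_date);
"""

with psycopg2.connect("dbname=events_db") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```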

Step 4: Vocabulary Compression

Apply hybrid IDF×log(freq) scoring or semantic grouping. Reduce 37,927 types to 388-3,558 types. Map original events onto the compressed vocabulary.

Step 5: Sequence Generation

Create time-ordered event sequences per company. These sequences are the input for Transformer models, the Q-learning state representation, and Event Oracle queries.
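
A sketch of sequence generation as a single SQL pass over the table sketched above, assuming a compressed_type column populated by Step 4; downstream tokenization for the Transformer and Q-learning models is omitted.

```python
import psycopg2

# Pull the time-ordered sequence of compressed event types for one company.
# Assumes a `compressed_type` column written by the vocabulary-compression step.
SEQUENCE_SQL = """
SELECT event_date, compressed_type
FROM events
WHERE cik = %s
ORDER BY event_date, id;
"""

def event_sequence(conn, cik: str) -> list[tuple]:
    with conn.cursor() as cur:
        cur.execute(SEQUENCE_SQL, (cik,))
        return cur.fetchall()
```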

Stack: vLLM, Qwen 9B, PostgreSQL, GPU acceleration, batch processing, structured JSON, temporal sequences.

Why This Approach is Novel

Extract Once, Use Forever

Our Approach: Knowledge Compression

  • One-time cost: $5K-10K for extraction
  • Storage: 3GB (compressed events)
  • Query cost: $0.015 (SQL + formatting)
  • Temporal queries: Native SQL JOINs
  • Marginal cost: Zero after extraction
  • Drift: None (structured data)

RAG Approach: Store & Retrieve

  • Ongoing cost: $1M+/week processing
  • Storage: 500GB (raw) + embeddings
  • Query cost: $5-50 (retrieval + generation)
  • Temporal queries: Impossible
  • Marginal cost: Linear with usage
  • Drift: Embeddings change with models
"RAG systems store everything and retrieve on demand. We extract semantic abstractions once and query them forever. It's the difference between renting and owning."

The Hybrid IDF×log(freq) Innovation

Compressing 37,927 event types is a non-trivial problem. Too aggressive and you lose signal. Too conservative and you explode dimensionality.

The Formula

score = IDF(event) × log(frequency(event))

Why This Works

This is a novel contribution. Standard TF-IDF doesn't work here because it ignores within-company repetition. Pure frequency misses rare signals. Pure IDF misses repeated behaviors. The hybrid score balances the two.
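
A minimal sketch of the selection rule, assuming that "document" means "company" in the IDF term and that the vocabulary is simply the top-k event types by score; both details are our assumptions, not a description of the production implementation.

```python
import math
from collections import Counter, defaultdict

def select_vocabulary(events, k=389):
    """events: iterable of (company_id, event_type) pairs.
    Returns the top-k event types by score = IDF(event) * log(frequency(event))."""
    freq = Counter()                      # total occurrences of each event type
    companies_with = defaultdict(set)     # companies that emit each event type
    companies = set()
    for company_id, event_type in events:
        freq[event_type] += 1
        companies_with[event_type].add(company_id)
        companies.add(company_id)

    n_companies = len(companies)
    scores = {}
    for event_type, f in freq.items():
        idf = math.log(n_companies / len(companies_with[event_type]))
        scores[event_type] = idf * math.log(f)   # the hybrid score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Under these assumptions, one-off noise events score zero (log 1 = 0) and events that appear at every company are heavily discounted by IDF, while rare-but-repeated behaviors and common predictive patterns both survive the cut.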

Patent & Paper Potential

The hybrid IDF×log(frequency) vocabulary selection method for financial event compression appears novel. Combines information-theoretic principles (IDF) with behavioral pattern detection (log frequency) for semantic abstraction of time-series events.

✅ Validated with Real Results

42.8% correlation: a Transformer model predicting stock returns from compressed events, an 85.5% improvement over the 23.1% baseline (trained October 31, 2025).

Experimental Validation

  • 389 event types selected (down from 37,927)
  • 97.1% event coverage with the minimal vocabulary
  • 42.8% test correlation on a temporal split (predicting future returns)
  • +85.5% improvement over the GradientBoosting baseline (23.1%)

Key Findings

  • Hybrid IDF×log(freq) beat semantic grouping: 42.8% vs 37.0% correlation (data-driven selection wins)
  • Smaller vocabulary = better performance: 389 types (1.29M params) outperformed 3,558 types (1.69M params)
  • Compression preserves predictive signal: 166x compression while achieving 42.8% correlation with future returns
  • Production-ready for temporal predictions: Predicting Q2 returns from Q1 events demonstrates real-world applicability
  • Validated the foundation: Multi-stage compression (500GB → 5GB → 3GB) enables high-performance models

This isn't theoretical. We've proven that knowledge compression works. The hybrid vocabulary selection method compresses 166x while preserving enough signal to predict stock returns with 85.5% improvement over baseline.

What This Enables

1. Event Oracle (Working Product)

Natural language → SQL queries on 11.9M events. $0.015/query vs Fintool's $143K/month. Temporal pattern detection impossible with text-based RAG.
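
As an illustration of the temporal queries this unlocks, here is a hypothetical query against the events table sketched earlier, answering "which companies announced layoffs within 180 days of an acquisition?"; the 180-day window and exact column names are our assumptions.

```python
# Hypothetical SQL produced by the Event Oracle for a temporal pattern question.
TEMPORAL_QUERY = """
SELECT a.company,
       a.event_date AS acquisition_date,
       l.event_date AS layoff_date
FROM events a
JOIN events l
  ON l.cik = a.cik
 AND l.event_type = 'announced_layoffs'
 AND l.event_date BETWEEN a.event_date
                      AND a.event_date + INTERVAL '180 days'
WHERE a.event_type = 'acquired_company'
ORDER BY a.company, a.event_date;
"""
```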

2. Transformer Predictions

Event sequences as input. Predict next events. Correlate with future returns. Learn patterns across 4,000+ companies.
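
A sketch of how a compressed event sequence might be encoded for a Transformer, assuming a simple integer-id vocabulary over the canonical event types; the actual tokenization and model architecture are not described in this document.

```python
import torch

def encode_sequence(event_types, vocab, max_len=256, pad_id=0):
    """Map a company's time-ordered compressed event types to a fixed-length
    tensor of integer ids. Unknown types fall back to pad_id."""
    ids = [vocab.get(t, pad_id) for t in event_types][-max_len:]  # keep most recent
    ids = [pad_id] * (max_len - len(ids)) + ids                   # left-pad
    return torch.tensor(ids, dtype=torch.long)

# Usage with a tiny hypothetical vocabulary:
vocab = {"acquired_company": 1, "entered_strategic_partnership": 2, "announced_layoffs": 3}
x = encode_sequence(["entered_strategic_partnership", "announced_layoffs"], vocab)
```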

3. Q-Learning Trading

State representation from events. Actions based on event patterns. Reward based on returns. Adapts to market regimes.
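
A toy tabular Q-learning update illustrating the state/action/reward framing, where the state is assumed to be the most recent compressed event types and the reward the realized return of the chosen position; this is an illustration only, not the trading system itself.

```python
from collections import defaultdict
import random

ACTIONS = ["buy", "hold", "sell"]

class EventQLearner:
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        """Epsilon-greedy action over the event-derived state."""
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update; reward is the next-period return."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

# State example (assumed): a tuple of the last few compressed event types.
```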

4. Insider Feature Engineering

Aggregate raw insider events into features: cluster buying, C-suite purchases, activist stakes. Academic evidence: +7-13% returns.
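
A sketch of the aggregation step with pandas, assuming a dataframe of raw insider events and a 30-day rolling window; the thresholds (e.g. three purchases in 30 days counting as cluster buying) are illustrative assumptions.

```python
import pandas as pd

def insider_features(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (assumed): cik, event_date, role, event_type.
    Returns rolling 30-day insider-buying features per company."""
    buys = df[df["event_type"] == "insider_purchase"].copy()
    buys["event_date"] = pd.to_datetime(buys["event_date"])
    buys["is_csuite"] = buys["role"].isin(["CEO", "CFO", "COO"]).astype(int)
    buys["n_buys"] = 1
    buys = buys.sort_values("event_date").set_index("event_date")

    rolled = (
        buys.groupby("cik")[["n_buys", "is_csuite"]]
            .rolling("30D")
            .sum()
            .rename(columns={"n_buys": "buys_30d", "is_csuite": "csuite_buys_30d"})
            .reset_index()
    )
    # Assumed threshold: three or more purchases in 30 days = cluster buying.
    rolled["cluster_buying"] = rolled["buys_30d"] >= 3
    return rolled
```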

5. Multi-Model Architecture

All models share the same foundation: compressed events. One extraction pipeline, multiple downstream applications. Amortize costs.

"We're not building one product. We're building a semantic abstraction layer that enables an entire product family. The compression is the moat."

The Competitive Moat

Anyone can build a financial document search engine. The hard parts are deciding which events matter, how to compress them without losing signal, and which metadata to preserve.

This is defensible. Not because the code is complex, but because the knowledge is hard-won. These questions have no obvious answers.

Progress & Next Steps

✅ Completed (October 2025)

🔄 In Progress (Current)

📋 Medium-Term (3-6 months)

🚀 Long-Term (6-12 months)