Knowledge Compression: The Foundation

How We Compress 500GB of SEC Filings into 3GB of Actionable Intelligence
Last Updated: January 2025

The Secret Sauce

While others are building better search engines for financial documents, we're doing something fundamentally different: multi-stage knowledge compression.

We're not just storing and retrieving text. We're extracting semantic meaning, compressing it into structured events, and compressing those events further into a canonical vocabulary that preserves predictive signal while shrinking the data 166x.

The Core Innovation

Most systems treat financial documents as text to be searched. We treat them as unstructured knowledge to be compressed into structured semantic abstractions. Extract once, use forever. No embedding drift, no retrieval errors, no context windows, no marginal costs.

The Multi-Stage Compression Journey

Stage 0: Raw SEC Filings (500 GB)
10-K, 10-Q, and 8-K filings plus proxy statements. Dense legal text, tables, exhibits. Signal buried in noise. Unstructured, inconsistent, verbose.

Stage 1: Event Extraction (5 GB, 11.9M events)
vLLM running a Qwen 9B model extracts semantic events such as "acquired_company", "entered_strategic_partnership", and "announced_layoffs". Each event carries a subject, verb, object, temporal certainty, strategic importance, and sentiment. 100x compression while preserving predictive signal.

Stage 2: Vocabulary Compression (3 GB, 388-3,558 types)
Hybrid IDF×log(frequency) scoring or semantic grouping compresses 37,927 unique event types down to 388-3,558 canonical types, balancing rare high-signal events (activist stakes, insider buying clusters) against common predictive patterns (product launches, partnerships). 1.66x additional compression.

Total compression ratio: 166× (500GB → 5GB → 3GB)
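
To make Stage 1 concrete, here is a hypothetical example of a single extracted event record; the field names mirror the attributes described above but are illustrative, not the production schema.

```python
# Hypothetical example of one extracted event record.
# Field names are illustrative, not the exact production schema.
event = {
    "cik": "0000320193",                  # SEC Central Index Key
    "company": "Apple Inc.",
    "source_filing": "10-K",
    "filing_date": "2024-11-01",
    "event_type": "entered_strategic_partnership",
    "subject": "Apple Inc.",
    "verb": "entered",
    "object": "strategic partnership with <counterparty>",
    "temporal_certainty": "completed",    # e.g. completed / announced / planned
    "strategic_importance": 0.8,          # 0-1, assigned at extraction time
    "sentiment": 0.6,                     # -1 to 1
}
```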

The Technical Pipeline

Step 1: Filing Ingestion

Download 10-K, 10-Q, and 8-K filings from SEC EDGAR. Parse HTML/XBRL. Extract text sections.

Step 2: Batch Extraction with vLLM

Run the Qwen 9B model with GPU acceleration. Extract 30 event types per section. Emit structured JSON output with metadata.
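
A minimal sketch of this step, assuming vLLM's offline LLM API; the model id is a placeholder (the document names Qwen 9B but not an exact checkpoint), and the prompt template and error handling are our assumptions rather than the production configuration.

```python
from vllm import LLM, SamplingParams
import json

# Placeholder model id; the pipeline uses a Qwen ~9B model.
llm = LLM(model="Qwen/Qwen-9B-placeholder", tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=1024)

PROMPT = (
    "Extract financial events from the filing section below as a JSON list. "
    "Each event needs: event_type, subject, verb, object, temporal_certainty, "
    "strategic_importance, sentiment.\n\nSECTION:\n{section}"
)

def extract_events(sections: list[str]) -> list[list[dict]]:
    """Run one GPU batch over many filing sections and parse the JSON output."""
    prompts = [PROMPT.format(section=s) for s in sections]
    outputs = llm.generate(prompts, params)
    events = []
    for out in outputs:
        try:
            events.append(json.loads(out.outputs[0].text))
        except json.JSONDecodeError:
            events.append([])  # assumed policy: drop malformed generations
    return events
```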

Step 3: Event Storage

Store 11.9M events in PostgreSQL. Each event records: CIK, company, date, type, subject, verb, object, certainty, importance, sentiment, and source filing.
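
A sketch of what the event table could look like, assuming one flat PostgreSQL table with the columns listed above; the actual DDL, indexing, and partitioning are not specified in this document.

```python
import psycopg2

# Assumed flat schema; the production table may be normalized or partitioned.
DDL = """
CREATE TABLE IF NOT EXISTS events (
    id                  BIGSERIAL PRIMARY KEY,
    cik                 TEXT NOT NULL,
    company             TEXT NOT NULL,
    event_date          DATE NOT NULL,
    event_type          TEXT NOT NULL,
    subject             TEXT,
    verb                TEXT,
    object              TEXT,
    temporal_certainty  TEXT,
    importance          REAL,
    sentiment           REAL,
    source_filing       TEXT
);
CREATE INDEX IF NOT EXISTS events_cik_date_idx ON events (cik, event_date);
"""

with psycopg2.connect("dbname=events_db") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```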

Step 4: Vocabulary Compression

Apply hybrid IDF×log(freq) scoring or semantic grouping. Reduce 37,927 types to 388-3,558 types. Map original events onto the compressed vocabulary.

Step 5: Sequence Generation

Create time-ordered event sequences per company. These sequences are the input for Transformer models, the Q-learning state representation, and Event Oracle queries.
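
A sketch of sequence generation as a single SQL pass over the table sketched above, assuming a compressed_type column populated by Step 4; downstream tokenization for the Transformer and Q-learning models is omitted.

```python
import psycopg2

# Pull the time-ordered sequence of compressed event types for one company.
# Assumes a `compressed_type` column written by the vocabulary-compression step.
SEQUENCE_SQL = """
SELECT event_date, compressed_type
FROM events
WHERE cik = %s
ORDER BY event_date, id;
"""

def event_sequence(conn, cik: str) -> list[tuple]:
    with conn.cursor() as cur:
        cur.execute(SEQUENCE_SQL, (cik,))
        return cur.fetchall()
```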

Stack: vLLM, Qwen 9B, PostgreSQL, GPU acceleration, batch processing, structured JSON, temporal sequences.

Why This Approach is Novel

Extract Once, Use Forever

Our Approach: Knowledge Compression

  • One-time cost: $5K-10K for extraction
  • Storage: 3GB (compressed events)
  • Query cost: $0.015 (SQL + formatting)
  • Temporal queries: Native SQL JOINs
  • Marginal cost: Zero after extraction
  • Drift: None (structured data)

RAG Approach: Store & Retrieve

  • Ongoing cost: $1M+/week processing
  • Storage: 500GB (raw) + embeddings
  • Query cost: $5-50 (retrieval + generation)
  • Temporal queries: Impossible
  • Marginal cost: Linear with usage
  • Drift: Embeddings change with models
"RAG systems store everything and retrieve on demand. We extract semantic abstractions once and query them forever. It's the difference between renting and owning."

The Hybrid IDF×log(freq) Innovation

Compressing 37,927 event types is a non-trivial problem. Too aggressive and you lose signal. Too conservative and you explode dimensionality.

The Formula

score = IDF(event) × log(frequency(event))

Why This Works

This is a novel contribution. Standard TF-IDF doesn't work here because it ignores within-company repetition. Pure frequency misses rare signals. Pure IDF misses repeated behaviors. The hybrid score balances the two.
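
A minimal sketch of the selection rule, assuming that "document" means "company" in the IDF term and that the vocabulary is simply the top-k event types by score; both details are our assumptions, not a description of the production implementation.

```python
import math
from collections import Counter, defaultdict

def select_vocabulary(events, k=389):
    """events: iterable of (company_id, event_type) pairs.
    Returns the top-k event types by score = IDF(event) * log(frequency(event))."""
    freq = Counter()                      # total occurrences of each event type
    companies_with = defaultdict(set)     # companies that emit each event type
    companies = set()
    for company_id, event_type in events:
        freq[event_type] += 1
        companies_with[event_type].add(company_id)
        companies.add(company_id)

    n_companies = len(companies)
    scores = {}
    for event_type, f in freq.items():
        idf = math.log(n_companies / len(companies_with[event_type]))
        scores[event_type] = idf * math.log(f)   # the hybrid score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Under these assumptions, one-off noise events score zero (log 1 = 0) and events that appear at every company are heavily discounted by IDF, while rare-but-repeated behaviors and common predictive patterns both survive the cut.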

Patent & Paper Potential

The hybrid IDF×log(frequency) vocabulary selection method for financial event compression appears novel. Combines information-theoretic principles (IDF) with behavioral pattern detection (log frequency) for semantic abstraction of time-series events.

✅ Validated with Real Results

42.8% correlation: a Transformer model predicting stock returns from compressed events, an 85.5% improvement over the 23.1% baseline (trained October 31, 2025).

Experimental Validation

  • 389 event types selected (down from 37,927)
  • 97.1% event coverage with the minimal vocabulary
  • 42.8% test correlation on a temporal split (predicting future returns)
  • +85.5% improvement over the GradientBoosting baseline (23.1%)

Key Findings

  • Hybrid IDF×log(freq) beat semantic grouping: 42.8% vs 37.0% correlation (data-driven selection wins)
  • Smaller vocabulary = better performance: 389 types (1.29M params) outperformed 3,558 types (1.69M params)
  • Compression preserves predictive signal: 166x compression while achieving 42.8% correlation with future returns
  • Production-ready for temporal predictions: Predicting Q2 returns from Q1 events demonstrates real-world applicability
  • Validated the foundation: Multi-stage compression (500GB → 5GB → 3GB) enables high-performance models

This isn't theoretical. We've proven that knowledge compression works. The hybrid vocabulary selection method compresses 166x while preserving enough signal to predict stock returns with 85.5% improvement over baseline.

What This Enables

1. Event Oracle (Working Product)

Natural language → SQL queries on 11.9M events. $0.015/query vs Fintool's $143K/month. Temporal pattern detection impossible with text-based RAG.
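
As an illustration of the temporal queries this unlocks, here is a hypothetical query against the events table sketched earlier, answering "which companies announced layoffs within 180 days of an acquisition?"; the 180-day window and exact column names are our assumptions.

```python
# Hypothetical SQL produced by the Event Oracle for a temporal pattern question.
TEMPORAL_QUERY = """
SELECT a.company,
       a.event_date AS acquisition_date,
       l.event_date AS layoff_date
FROM events a
JOIN events l
  ON l.cik = a.cik
 AND l.event_type = 'announced_layoffs'
 AND l.event_date BETWEEN a.event_date
                      AND a.event_date + INTERVAL '180 days'
WHERE a.event_type = 'acquired_company'
ORDER BY a.company, a.event_date;
"""
```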

2. Transformer Predictions

Event sequences as input. Predict next events. Correlate with future returns. Learn patterns across 4,000+ companies.
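
A sketch of how a compressed event sequence might be encoded for a Transformer, assuming a simple integer-id vocabulary over the canonical event types; the actual tokenization and model architecture are not described in this document.

```python
import torch

def encode_sequence(event_types, vocab, max_len=256, pad_id=0):
    """Map a company's time-ordered compressed event types to a fixed-length
    tensor of integer ids. Unknown types fall back to pad_id."""
    ids = [vocab.get(t, pad_id) for t in event_types][-max_len:]  # keep most recent
    ids = [pad_id] * (max_len - len(ids)) + ids                   # left-pad
    return torch.tensor(ids, dtype=torch.long)

# Usage with a tiny hypothetical vocabulary:
vocab = {"acquired_company": 1, "entered_strategic_partnership": 2, "announced_layoffs": 3}
x = encode_sequence(["entered_strategic_partnership", "announced_layoffs"], vocab)
```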

3. Q-Learning Trading

State representation from events. Actions based on event patterns. Reward based on returns. Adapts to market regimes.
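
A toy tabular Q-learning update illustrating the state/action/reward framing, where the state is assumed to be the most recent compressed event types and the reward the realized return of the chosen position; this is an illustration only, not the trading system itself.

```python
from collections import defaultdict
import random

ACTIONS = ["buy", "hold", "sell"]

class EventQLearner:
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        """Epsilon-greedy action over the event-derived state."""
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update; reward is the next-period return."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

# State example (assumed): a tuple of the last few compressed event types.
```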

4. Insider Feature Engineering

Aggregate raw insider events into features: cluster buying, C-suite purchases, activist stakes. Academic evidence: +7-13% returns.
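
A sketch of the aggregation step with pandas, assuming a dataframe of raw insider events and a 30-day rolling window; the thresholds (e.g. three purchases in 30 days counting as cluster buying) are illustrative assumptions.

```python
import pandas as pd

def insider_features(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (assumed): cik, event_date, role, event_type.
    Returns rolling 30-day insider-buying features per company."""
    buys = df[df["event_type"] == "insider_purchase"].copy()
    buys["event_date"] = pd.to_datetime(buys["event_date"])
    buys["is_csuite"] = buys["role"].isin(["CEO", "CFO", "COO"]).astype(int)
    buys["n_buys"] = 1
    buys = buys.sort_values("event_date").set_index("event_date")

    rolled = (
        buys.groupby("cik")[["n_buys", "is_csuite"]]
            .rolling("30D")
            .sum()
            .rename(columns={"n_buys": "buys_30d", "is_csuite": "csuite_buys_30d"})
            .reset_index()
    )
    # Assumed threshold: three or more purchases in 30 days = cluster buying.
    rolled["cluster_buying"] = rolled["buys_30d"] >= 3
    return rolled
```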

5. Multi-Model Architecture

All models share the same foundation: compressed events. One extraction pipeline, multiple downstream applications. Amortize costs.

"We're not building one product. We're building a semantic abstraction layer that enables an entire product family. The compression is the moat."

The Competitive Moat

Anyone can build a financial document search engine. The hard parts are deciding which events matter, how to compress them without losing signal, and which metadata to preserve.

This is defensible. Not because the code is complex, but because the knowledge is hard-won. These questions have no obvious answers.

Progress & Next Steps

✅ Completed (October 2025)

🔄 In Progress (Current)

📋 Medium-Term (3-6 months)

🚀 Long-Term (6-12 months)