While others are building better search engines for financial documents, we're doing something fundamentally different: multi-stage knowledge compression.
We're not just storing and retrieving text. We're extracting semantic meaning, compressing it into structured events, and compressing those events further into a vocabulary that preserves predictive signal while reducing dimensionality by 166x.
Most systems treat financial documents as text to be searched. We treat them as unstructured knowledge to be compressed into structured semantic abstractions. Extract once, use forever. No embedding drift, no retrieval errors, no context windows, no marginal costs.
Download 10-K, 10-Q, 8-K filings from SEC EDGAR. Parse HTML/XBRL. Extract text sections.
Run a Qwen 9B model with GPU acceleration. Extract events across 30 event types per section. Emit structured JSON output with metadata.
Store 11.9M events in PostgreSQL. Each event: CIK, company, date, type, subject, verb, object, certainty, importance, sentiment, source filing.
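The event record above can be sketched as a small dataclass. The field names and value ranges here are illustrative assumptions that mirror the metadata listed, not the production PostgreSQL schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class Event:
    # One extracted event row; field names are illustrative,
    # following the metadata listed above (CIK, company, date, type,
    # subject/verb/object triple, certainty, importance, sentiment, source).
    cik: str
    company: str
    date: str            # filing or event date, ISO-8601
    event_type: str
    subject: str
    verb: str
    obj: str
    certainty: float     # assumed 0.0-1.0 as emitted by the extractor
    importance: float    # assumed 0.0-1.0
    sentiment: float     # assumed -1.0 (negative) to 1.0 (positive)
    source_filing: str   # e.g. an EDGAR accession number

# Hypothetical example row.
e = Event("0000320193", "Apple Inc.", "2024-11-01", "guidance_update",
          "management", "raised", "FY25 revenue guidance",
          0.9, 0.8, 0.6, "0000320193-24-000123")
```

A flat record like this maps one-to-one onto a PostgreSQL row, which is what makes the later SQL-based querying cheap.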
Apply hybrid IDF×log(freq) or semantic grouping. Reduce 37,927 types → 388-3,558 types. Map original events to compressed vocabulary.
Create time-ordered event sequences per company. Input for Transformer models, Q-learning state representation, Event Oracle queries.
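Sequence construction is simple once events are structured: group by company, sort by date, keep the compressed tokens. A minimal sketch, assuming events arrive as (company, ISO date, event type) triples:

```python
from collections import defaultdict

def build_sequences(events):
    """Group compressed events by company and sort by date, yielding the
    time-ordered token sequences consumed by the Transformer, Q-learning,
    and Event Oracle stages. `events` is an iterable of
    (company, iso_date, event_type) triples; this input shape is an
    assumption for illustration."""
    by_company = defaultdict(list)
    for company, date, etype in events:
        by_company[company].append((date, etype))
    # ISO-8601 dates sort correctly as strings.
    return {c: [t for _, t in sorted(evts)] for c, evts in by_company.items()}
```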
Compressing 37,927 event types is a non-trivial problem. Too aggressive and you lose signal. Too conservative and you explode dimensionality.
This is a novel contribution. Standard TF-IDF doesn't work (ignores within-company patterns). Pure frequency misses rare signals. Pure IDF misses repeated behaviors. The hybrid approach balances both.
The hybrid IDF×log(frequency) vocabulary selection method for financial event compression appears novel: it combines an information-theoretic signal (IDF across companies) with behavioral pattern detection (log frequency within them) to build a semantic abstraction over time-series events.
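A minimal sketch of the hybrid scoring idea. The source names only the IDF×log(freq) form; the exact smoothing, the catch-all token, and the top-k cutoff below are assumptions:

```python
import math
from collections import Counter, defaultdict

def hybrid_vocab(events, vocab_size):
    """Score each event type by IDF (computed across companies) times
    log(1 + total frequency), then keep the top-scoring types as the
    compressed vocabulary. `events` is an iterable of
    (company, event_type) pairs. Rare, company-specific types get a high
    IDF; repeated behaviors get a high log-frequency term, so neither
    signal is discarded."""
    freq = Counter()
    companies_with = defaultdict(set)
    all_companies = set()
    for company, etype in events:
        freq[etype] += 1
        companies_with[etype].add(company)
        all_companies.add(company)
    n = len(all_companies)
    scores = {
        etype: math.log(n / len(companies_with[etype])) * math.log(1 + f)
        for etype, f in freq.items()
    }
    keep = set(sorted(scores, key=scores.get, reverse=True)[:vocab_size])
    # Map every original type to itself if kept, else to a catch-all token.
    return {etype: (etype if etype in keep else "<OTHER>") for etype in freq}
```

Note how a type seen in every company scores zero on the IDF term, and a one-off type scores low on the frequency term: the product is what balances the two failure modes described above.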
This isn't theoretical. We've proven that knowledge compression works. The hybrid vocabulary selection method compresses 166x while preserving enough signal to predict stock returns with 85.5% improvement over baseline.
Natural language → SQL queries over 11.9M events. $0.015/query vs Fintool's $143K/month. Temporal pattern detection that text-based RAG cannot do.
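To make the query layer concrete, here is a sketch using an in-memory SQLite database as a stand-in for the PostgreSQL events table. The schema, sample rows, and generated SQL are illustrative assumptions, not the production system:

```python
import sqlite3

# In-memory stand-in for the PostgreSQL events table (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    cik TEXT, company TEXT, date TEXT, event_type TEXT, sentiment REAL)""")
conn.executemany("INSERT INTO events VALUES (?,?,?,?,?)", [
    ("0001", "Acme", "2024-01-10", "insider_buy", 0.5),
    ("0001", "Acme", "2024-02-02", "insider_buy", 0.6),
    ("0002", "Globex", "2024-01-15", "lawsuit_filed", -0.7),
])

# The kind of SQL an NL->SQL layer might generate for the question
# "which companies had repeated insider buying in Q1 2024?"
q = """SELECT company, COUNT(*) AS n
       FROM events
       WHERE event_type = 'insider_buy'
         AND date BETWEEN '2024-01-01' AND '2024-03-31'
       GROUP BY company
       HAVING n >= 2"""
print(conn.execute(q).fetchall())  # -> [('Acme', 2)]
```

The point of the example: because events are rows rather than text chunks, a temporal question becomes a GROUP BY with a date filter, with no retrieval or context window involved.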
Event sequences as input. Predict next events. Correlate with future returns. Learn patterns across 4,000+ companies.
State representation from events. Actions based on event patterns. Reward based on returns. Adapts to market regimes.
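The agent loop above can be sketched as tabular Q-learning. The state encoding (a tuple of recent compressed event tokens), the buy/hold/sell action set, and the hyperparameters are all assumptions for illustration; the source does not specify them:

```python
from collections import defaultdict

ACTIONS = ("buy", "hold", "sell")  # assumed action set

def train_q(episodes, alpha=0.1, gamma=0.95):
    """Tabular Q-learning over event-pattern states.
    Each episode is a list of (state, action, reward, next_state) tuples,
    where a state is e.g. a tuple of recent compressed event tokens and
    the reward is the realized return for the period."""
    Q = defaultdict(float)
    for episode in episodes:
        for s, a, r, s2 in episode:
            # Standard Q-learning update toward r + gamma * max_a' Q(s', a').
            best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```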
Aggregate raw insider events into features: cluster buying, C-suite purchases, activist stakes. Academic evidence: +7-13% returns.
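A sketch of that aggregation step. The input shape, role labels, and the three-insider threshold for "cluster buying" are illustrative assumptions:

```python
def insider_features(events):
    """Aggregate raw insider events into the coarse features named above.
    `events` is a list of dicts with 'type' and 'role' keys (an assumed
    shape); thresholds are illustrative, not the production values."""
    buys = [e for e in events if e["type"] == "insider_buy"]
    return {
        # Assumed rule: three or more distinct insiders buying.
        "cluster_buying": len({e["role"] for e in buys}) >= 3,
        "csuite_purchase": any(e["role"] in ("CEO", "CFO") for e in buys),
        # Schedule 13D filings as a proxy for activist stakes.
        "activist_stake": any(e["type"] == "13d_filing" for e in events),
    }
```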
All models share the same foundation: compressed events. One extraction pipeline, multiple downstream applications. Amortize costs.
Anyone can build a financial document search engine. The hard parts are not in the code. This is defensible not because the code is complex, but because the knowledge is hard-won: which events matter? How should they be compressed? What metadata must be preserved? These questions have no obvious answers.