
Event Compression: Taming 37,927 Event Types

The LLM is too creative. We asked for structured events and got a vocabulary explosion. A critical technical decision: domain knowledge vs. data-driven compression. The showdown between semantic grouping and hybrid statistical methods.

The Problem

37,927 unique event types generated by the LLM. Far too many for the transformer to learn effectively: the vocabulary explosion makes pattern learning nearly impossible.

Root cause: the open-ended prompt schema {verb}_{object}_{magnitude} gives the LLM too much creative freedom.
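To make the explosion concrete, here is a minimal sketch of how the vocabulary gets tallied, assuming events are stored one JSON object per line; the file name and the event_type field are illustrative, not the project's actual schema.

# Tally unique event types from the extraction output.
# NOTE: "events.jsonl" and the "event_type" field are assumed names.
import json
from collections import Counter

counts = Counter()
with open("events.jsonl") as f:
    for line in f:
        counts[json.loads(line)["event_type"]] += 1

print(f"{len(counts):,} unique event types")                    # 37,927 in our corpus
print(f"{sum(1 for c in counts.values() if c == 1):,} singletons")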

The Vocabulary Explosion

Top verbs by number of variations; each should yield roughly 10-20 types, not thousands:

announced: 5,563 variations
  announced_acquisition, announced_partnership, announced_clinical_milestone, announced_earnings_results, ...5,559 more

incurred: 7,856 variations
  incurred_transaction_costs, incurred_legal_expenses, incurred_restructuring_charges, incurred_professional_fees, ...7,852 more

entered: 4,249 variations
  entered_agreement, entered_contract, entered_credit_facility, entered_licensing_arrangement, ...4,245 more

developed: 3,311 variations
  developed_product, developed_technology, developed_drug_candidate, developed_software_platform, ...3,307 more

approved: 2,033 variations
  approved_compensation, approved_plan, approved_budget, approved_merger_agreement, ...2,029 more

Why This Is a Problem

  • Transformer can't learn from sparse data (many types appear only once)
  • Semantically similar events treated as completely different (acquired_company ≠ acquired_business)
  • Vocabulary size makes the embedding layer huge and slow to train (see the back-of-the-envelope sketch after this list)
  • No similarity signal between related events
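
A rough sense of the embedding cost, assuming a 256-dimensional embedding (the dimension is an assumed value, not the project's actual setting):

# Back-of-the-envelope embedding table sizes (d_model = 256 is assumed).
d_model = 256
print(f"{37_927 * d_model:,} parameters at full vocabulary")  # 9,709,312
print(f"{500 * d_model:,} parameters at ~500 types")          # 128,000 (~76x smaller)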

Five Compression Approaches Analyzed

We evaluated options ranging from simple rules to sophisticated ML techniques.

1. Controlled Vocabulary (Fix at Source)
Give the LLM a predefined vocabulary of 200-500 canonical event types to choose from. Fix the root cause instead of compressing afterwards.

Pros

  • Cleanest solution (fixes root cause)
  • Consistent across all filings
  • No compression needed
  • Fully interpretable

Cons

  • Requires re-running vLLM on 1.1M filings ($$$)
  • Need to design vocabulary carefully
  • May lose nuance
  • Not feasible right now (this is the long-term fix)
2. Semantic Clustering (Embeddings)
Use sentence embeddings to cluster similar event types, then map to cluster centroids. Semantically similar events automatically grouped together.
# Cluster similar events automatically
Cluster 147 (Acquisitions):
  - acquired_business_major
  - acquired_company_major
  - purchased_business_major
  → Centroid: "acquired_business"

Cluster 238 (Costs):
  - incurred_transaction_costs_major
  - incurred_legal_expenses_major
  - paid_professional_fees_major
  → Centroid: "incurred_costs"
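
A minimal sketch of this clustering step, assuming the sentence-transformers and scikit-learn libraries; the model name and cluster count are illustrative choices, not the project's actual configuration:

# Embed event type strings and cluster them; name each cluster after the
# member nearest its centroid. Model and n_clusters are assumed values.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

event_types = ["acquired_business_major", "acquired_company_major",
               "purchased_business_major", "incurred_transaction_costs_major",
               "incurred_legal_expenses_major", "paid_professional_fees_major"]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(
    event_types, normalize_embeddings=True)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

for cid in range(kmeans.n_clusters):
    mask = kmeans.labels_ == cid
    members = [t for t, keep in zip(event_types, mask) if keep]
    # Canonical name = member whose embedding is closest to the centroid.
    dists = np.linalg.norm(embeddings[mask] - kmeans.cluster_centers_[cid], axis=1)
    print(f"Cluster {cid}: {members} -> '{members[int(np.argmin(dists))]}'")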

Pros

  • Semantically principled
  • No re-extraction needed
  • Can tune cluster count
  • Captures similarity automatically

Cons

  • Adds pipeline complexity
  • Need to save clustering model
  • Centroid names may not be perfect
  • Still discrete (loses continuous similarity)
3. Direct Embeddings (Continuous)
Use pre-computed sentence embeddings directly instead of discrete IDs. No vocabulary limit, captures continuous similarity.

Pros

  • Continuous similarity
  • Infinite vocabulary
  • Handles new types at inference
  • Leverages pre-trained models

Cons

  • Can't learn event-specific patterns
  • Fixed embeddings (not adaptive)
  • Larger input dimension
  • Less interpretable
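
A sketch of what this looks like in practice, assuming PyTorch and sentence-transformers; the 384-to-256 projection dimensions are illustrative:

# Replace the learned ID-embedding lookup with frozen text embeddings
# projected into the model dimension. Dimensions are assumed values.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors
project = nn.Linear(384, 256)                      # map into the transformer's d_model

def embed_events(event_types):
    with torch.no_grad():                          # encoder stays frozen
        vecs = torch.tensor(encoder.encode(event_types))
    return project(vecs)

x = embed_events(["acquired_business", "dismissed_auditor_unexpectedly"])
print(x.shape)  # torch.Size([2, 256]) -- unseen types at inference are no problem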
4. Hierarchical (Verb + Object + Magnitude)
Split event types into separate features: verb ID, object ID, magnitude ID. Model learns compositional verb-object interactions.

Pros

  • Compositional learning
  • Smaller vocabularies (50 verbs + 500 objects)
  • Generalizes to unseen combinations
  • More interpretable

Cons

  • Requires restructuring data
  • More complex architecture
  • Need to parse existing types
  • Significant refactoring
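
A minimal sketch of the compositional idea in PyTorch; the vocabulary sizes echo the estimate above, while d_model and the summation scheme are assumptions:

# Separate verb/object/magnitude vocabularies whose embeddings are summed.
import torch
import torch.nn as nn

class CompositionalEventEmbedding(nn.Module):
    def __init__(self, n_verbs=50, n_objects=500, n_magnitudes=5, d_model=256):
        super().__init__()
        self.verb = nn.Embedding(n_verbs, d_model)
        self.obj = nn.Embedding(n_objects, d_model)
        self.mag = nn.Embedding(n_magnitudes, d_model)

    def forward(self, verb_ids, obj_ids, mag_ids):
        # Summing component embeddings lets the model generalize to
        # verb/object combinations never seen during training.
        return self.verb(verb_ids) + self.obj(obj_ids) + self.mag(mag_ids)

emb = CompositionalEventEmbedding()
x = emb(torch.tensor([3]), torch.tensor([42]), torch.tensor([1]))
print(x.shape)  # torch.Size([1, 256])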
5. Improved Rules + Semantic Grouping
Better rules than the current "verb + first_word" heuristic. Use domain knowledge to group similar objects. Quick to implement; no re-extraction needed.
# Semantic grouping of objects: map each object word to a canonical group.
OBJECT_GROUPS = {
  'business': ['business', 'company', 'subsidiary', 'target'],
  'costs': ['costs', 'expenses', 'fees', 'charges'],
  'equity': ['stock', 'shares', 'equity', 'common'],
  'debt': ['debt', 'notes', 'bonds', 'loan'],
}
# Invert to a word -> group lookup table.
WORD_TO_GROUP = {w: g for g, words in OBJECT_GROUPS.items() for w in words}

def compress(event_type: str) -> str:
    verb, *rest = event_type.split('_')
    group = next((WORD_TO_GROUP[w] for w in rest if w in WORD_TO_GROUP),
                 '_'.join(rest))  # fall back to the original object
    return f"{verb}_{group}"

print(compress('acquired_company'))     # -> 'acquired_business'
print(compress('incurred_legal_fees'))  # -> 'incurred_costs'
print(compress('issued_common_stock'))  # -> 'issued_equity'

Pros

  • Fast to implement
  • No re-extraction needed
  • Domain-driven (finance semantics)
  • Fully interpretable

Cons

  • Still rule-based (not data-driven)
  • Manual grouping required
  • May miss patterns
  • Not as sophisticated

⚔️ The Showdown: Two Finalists

Domain knowledge vs. data-driven compression. Both targeting ~800-1,000 event types.

Option 5: Semantic Grouping (Rule-Based + Domain Knowledge)

  • ✓ Financial domain expertise
  • ✓ Manually defined groupings
  • ✓ Fully interpretable
  • ✓ Fast implementation
  • ✓ No corpus stats needed

Philosophy: Humans know what events matter in finance.

VS

Option 6b: Hybrid IDF + Frequency (Data-Driven Statistics)

  • ✓ IDF (rarity) scoring
  • ✓ Frequency (coverage) weighting
  • ✓ Corpus-adaptive
  • ✓ Objective ranking
  • ✓ Balances rare + common

Philosophy: Let the data tell us what matters.
Feature             | Option 5 (Semantic)           | Option 6b (Hybrid)
Approach            | Rule-based + domain knowledge | Data-driven statistics
Target vocab size   | ~800-1,000 types              | ~800-1,000 types
Interpretability    | High (semantic groups)        | Medium (can inspect scores)
Adaptability        | Low (hardcoded rules)         | High (adapts to any corpus)
Rare events         | Manual selection              | Automatic (IDF scoring)
Common events       | Manual grouping               | Automatic (frequency weighting)
Domain expertise    | Required                      | Not required
Implementation time | 3-4 hours                     | 3-4 hours

The Hybrid Score Formula

score = IDF(event) × log(frequency(event))
Why this works: Balances rare high-signal events (dismissed_auditor: IDF=8.5) with common predictive events (acquired_business: IDF=3.4 but freq=12K). Pure IDF would exclude important common events. Pure frequency would include noise. The hybrid captures both.
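
A minimal sketch of the scoring and vocabulary selection, assuming events are grouped per filing; all names here are illustrative:

# Score every event type by IDF x log(frequency) and keep the top K.
# "filings" is a toy stand-in for the real per-filing event lists.
import math
from collections import Counter

def hybrid_vocab(filings, k=1000):
    n = len(filings)
    doc_freq = Counter(e for f in filings for e in set(f))  # filings containing event
    freq = Counter(e for f in filings for e in f)           # total occurrences
    # Singletons get log(1) = 0, so one-off noise falls out automatically.
    scores = {e: math.log(n / doc_freq[e]) * math.log(freq[e]) for e in freq}
    return sorted(scores, key=scores.get, reverse=True)[:k]

filings = [["acquired_business", "incurred_costs"],
           ["acquired_business", "issued_equity"],
           ["dismissed_auditor", "incurred_costs", "incurred_costs"]]
print(hybrid_vocab(filings, k=2))  # ['incurred_costs', 'acquired_business']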

Experimental Methodology

Rigorous A/B testing to determine the winner

1. Generate Both Vocabularies

Run compression algorithms on full corpus (11.9M events). Option 5: Apply semantic grouping rules. Option 6b: Calculate IDF×freq scores. Target: ~800-1,000 types each.

2. Build Training Sequences

Generate sequences_ml.jsonl (Option 5) and sequences_hybrid_ml.jsonl (Option 6b). Each file ~2.9GB with 236K training sequences (512 events each).
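
A sketch of how such sequences might be assembled, assuming each event carries a company identifier and date; all file and field names here are hypothetical:

# Group events per company, order by date, and emit fixed windows of 512.
# "events.jsonl", "cik", "date", and "event_type" are assumed names.
import json
from collections import defaultdict

SEQ_LEN = 512
by_company = defaultdict(list)
with open("events.jsonl") as f:
    for line in f:
        e = json.loads(line)
        by_company[e["cik"]].append(e)

with open("sequences_ml.jsonl", "w") as out:
    for cik, evs in by_company.items():
        evs.sort(key=lambda e: e["date"])
        for i in range(0, len(evs) - SEQ_LEN + 1, SEQ_LEN):
            window = evs[i:i + SEQ_LEN]
            out.write(json.dumps({"cik": cik,
                                  "filing_date": window[-1]["date"],
                                  "events": [e["event_type"] for e in window]}) + "\n")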

3. Compare Vocabularies

Analyze overlap, unique events in each, verb distribution, interpretability. Manual inspection of top events to validate quality.

4. Train Transformers

Train identical transformer models on both datasets. Same architecture, same hyperparameters, same training data. Only difference: vocabulary compression strategy.

5. Evaluate on Validation Set

Measure the correlation between predicted and actual returns. Winner = highest validation correlation (target: >0.25); a gap of more than 2 percentage points counts as a decisive win.
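
The metric itself is simple: Pearson correlation between predicted and realized returns. A placeholder sketch, assuming SciPy (the arrays are illustrative):

# Pearson correlation between model predictions and realized returns.
import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.02, -0.01, 0.05, 0.00])  # model outputs on validation set
actual    = np.array([0.03, -0.02, 0.04, 0.01])  # realized returns

r, p_value = pearsonr(predicted, actual)
print(f"validation correlation: {r:.3f} (p={p_value:.3f})")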

Expected Outcomes

  • If Option 5 wins: Domain knowledge beats data mining. Finance semantics critical for prediction.
  • If Option 6b wins: Data-driven selection beats human intuition. Use hybrid approach going forward.
  • If they're similar: Use Option 5 for interpretability, Option 6b for automation/scale.
🏆 We Have a Winner!

Option 6b (Hybrid IDF×log(freq)): 42.8% test correlation
85.5% improvement over baseline • 389 event types • Trained October 31, 2025

📊 Experimental Results

We trained and compared three transformer models on 236K SEC filings (2023-2025) using an identical architecture. Option 6b's data-driven vocabulary selection decisively outperformed semantic grouping.

Model                       | Vocab Size | Split Method | Test Correlation | Test RMSE | Improvement over Baseline
GradientBoosting (baseline) | 1,840      | Temporal     | 23.1%            | 17.19     | -
🏆 Option 6b (WINNER)       | 389        | Temporal     | 42.8%            | 15.78     | +85.5%
Option 5 (Semantic)         | 3,558      | Temporal     | 37.0%            | 16.80     | +60.5%
Option 6b (Random Split)    | 389        | Random       | 27.6%            | 14.71     | +19.4%

Why Option 6b Won

📉 Smaller Vocabulary

389 types vs 3,558 types. Fewer parameters (1.29M vs 1.69M) reduce overfitting risk and force the model to learn generalizable patterns.

🎯 Data-Driven Selection

IDF×log(freq) identifies truly discriminative events based on actual occurrence patterns, not human intuition about finance semantics.

🔇 Reduced Noise

Filters out rare, uninformative events while preserving both high-signal rare events AND common predictive patterns.

⚡ Better Signal Extraction

Model focuses on events that actually correlate with returns instead of memorizing semantic categories that may not predict well.

Key Insights from Training

  • Temporal split (42.8%) shows production-ready performance for predicting future returns from past events (see the split sketch after this list)
  • Random split (27.6%) reveals the model needs more data: the current 236K filings span only 2.5 years
  • Both models early-stopped at epochs 2-3, indicating a data limitation rather than an architecture issue
  • Next step: retrain on the full 10-year dataset (700K filings, 2015-2025), targeting stable random-split performance of 30-35%
  • The hybrid IDF×log(freq) method is the clear winner; use it for all future work
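
For reference, the two split strategies in a pandas sketch; the file and column names are hypothetical:

# Temporal vs. random 80/20 split ("filing_date" is an assumed column name).
import pandas as pd

df = pd.read_json("sequences_ml.jsonl", lines=True)
cut = int(len(df) * 0.8)

# Temporal: train strictly on the past, test on the future (production-realistic).
df_sorted = df.sort_values("filing_date")
train_t, test_t = df_sorted.iloc[:cut], df_sorted.iloc[cut:]

# Random: shuffle first, so train and test mix time periods.
df_shuffled = df.sample(frac=1.0, random_state=42)
train_r, test_r = df_shuffled.iloc[:cut], df_shuffled.iloc[cut:]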

The Verdict

Data-driven methods beat human intuition.
Statistical event selection (IDF×frequency) outperformed semantic grouping by 5.8 percentage points (42.8% vs 37.0%). This validates using hybrid vocabulary compression for the transformer and all future models.

Long-Term Solution: Controlled Vocabulary

Now that we know data-driven selection wins, we can use Option 6b's insights to design the long-term solution.

Phase 1: Learn from Winner ✅ Complete

Option 6b's hybrid IDF×log(freq) method selected 389 event types that actually predict returns. These 389 types serve as the foundation for our canonical vocabulary design.

Phase 2: Design Canonical Vocabulary (Next Quarter)

Create 300-500 canonical event types based on winner's insights. Update LLM prompts to use controlled vocabulary instead of open-ended schema.

NEW PROMPT:
"Classify this event into ONE of the following types:
  - acquired_business
  - acquired_asset
  - issued_debt
  - issued_equity
  - announced_earnings
  - incurred_major_costs
  ...
[Full list of 300-500 canonical types]

Choose the MOST SPECIFIC type that applies."
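
A hypothetical guardrail to pair with the new prompt: reject any answer outside the canonical set, then retry or flag it for review (everything here is illustrative, not the project's actual code):

# Validate LLM output against the canonical vocabulary; names are hypothetical.
CANONICAL_TYPES = {
    "acquired_business", "acquired_asset", "issued_debt",
    "issued_equity", "announced_earnings", "incurred_major_costs",
    # ... full list of 300-500 canonical types
}

def validate(llm_answer: str) -> str | None:
    event_type = llm_answer.strip().lower().replace(" ", "_")
    return event_type if event_type in CANONICAL_TYPES else None  # None -> retry or flag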

Phase 3: Re-extract All Events (Future)

Re-run vLLM extraction on all 1.1M filings with controlled vocabulary. Cost: ~$500-1,000. Benefit: Clean, consistent events from the start. No compression needed ever again.

The Bottom Line

We gave the LLM too much freedom and got 37,927 event types instead of ~500. The showdown revealed that data-driven methods (Option 6b: 42.8%) decisively beat domain knowledge (Option 5: 37.0%). The 389 winning event types now inform our long-term controlled vocabulary design.

This isn't just about compression; it's about understanding which events actually predict returns. The data has spoken: statistical IDF×frequency selection beats human intuition about finance semantics. This validates our approach and provides a clear path forward for the transformer model.