
Event Compression: Taming 37,927 Event Types

The LLM is too creative. We asked for structured events and got a vocabulary explosion. A critical technical decision: domain knowledge vs. data-driven compression. The showdown between semantic grouping and hybrid statistical methods.

The Problem

37,927 unique event types generated by the LLM. Far too many for the transformer to learn effectively: the vocabulary explosion makes pattern learning nearly impossible.

Root cause: the open-ended prompt schema {verb}_{object}_{magnitude} gives the LLM too much creative freedom.
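To make the explosion concrete, here is a minimal sketch of how the vocabulary gets tallied, assuming events are stored one JSON object per line; the file name and the event_type field are illustrative, not the project's actual schema.

# Tally unique event types from the extraction output.
# NOTE: "events.jsonl" and the "event_type" field are assumed names.
import json
from collections import Counter

counts = Counter()
with open("events.jsonl") as f:
    for line in f:
        counts[json.loads(line)["event_type"]] += 1

print(f"{len(counts):,} unique event types")                    # 37,927 in our corpus
print(f"{sum(1 for c in counts.values() if c == 1):,} singletons")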

The Vocabulary Explosion

Top verbs by number of variations; each should yield roughly 10-20 types, not thousands:

announced: 5,563 variations
  announced_acquisition, announced_partnership, announced_clinical_milestone, announced_earnings_results, ...5,559 more

incurred: 7,856 variations
  incurred_transaction_costs, incurred_legal_expenses, incurred_restructuring_charges, incurred_professional_fees, ...7,852 more

entered: 4,249 variations
  entered_agreement, entered_contract, entered_credit_facility, entered_licensing_arrangement, ...4,245 more

developed: 3,311 variations
  developed_product, developed_technology, developed_drug_candidate, developed_software_platform, ...3,307 more

approved: 2,033 variations
  approved_compensation, approved_plan, approved_budget, approved_merger_agreement, ...2,029 more

Why This Is a Problem

  • Transformer can't learn from sparse data (many types appear only once)
  • Semantically similar events treated as completely different (acquired_company ≠ acquired_business)
  • Vocabulary size makes the embedding layer huge and slow to train (see the back-of-the-envelope sketch after this list)
  • No similarity signal between related events
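
A rough sense of the embedding cost, assuming a 256-dimensional embedding (the dimension is an assumed value, not the project's actual setting):

# Back-of-the-envelope embedding table sizes (d_model = 256 is assumed).
d_model = 256
print(f"{37_927 * d_model:,} parameters at full vocabulary")  # 9,709,312
print(f"{500 * d_model:,} parameters at ~500 types")          # 128,000 (~76x smaller)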

Five Compression Approaches Analyzed

We evaluated options ranging from simple rules to sophisticated ML techniques.

1. Controlled Vocabulary (Fix at Source)
Give the LLM a predefined vocabulary of 200-500 canonical event types to choose from. Fix the root cause instead of compressing afterwards.

Pros

  • Cleanest solution (fixes root cause)
  • Consistent across all filings
  • No compression needed
  • Fully interpretable

Cons

  • Requires re-running vLLM on 1.1M filings ($$$)
  • Need to design vocabulary carefully
  • May lose nuance
  • Not feasible right now (this is the long-term fix)
2. Semantic Clustering (Embeddings)
Use sentence embeddings to cluster similar event types, then map to cluster centroids. Semantically similar events automatically grouped together.
# Cluster similar events automatically
Cluster 147 (Acquisitions):
  - acquired_business_major
  - acquired_company_major
  - purchased_business_major
  → Centroid: "acquired_business"

Cluster 238 (Costs):
  - incurred_transaction_costs_major
  - incurred_legal_expenses_major
  - paid_professional_fees_major
  → Centroid: "incurred_costs"
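
A minimal sketch of this clustering step, assuming the sentence-transformers and scikit-learn libraries; the model name and cluster count are illustrative choices, not the project's actual configuration:

# Embed event type strings and cluster them; name each cluster after the
# member nearest its centroid. Model and n_clusters are assumed values.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

event_types = ["acquired_business_major", "acquired_company_major",
               "purchased_business_major", "incurred_transaction_costs_major",
               "incurred_legal_expenses_major", "paid_professional_fees_major"]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(
    event_types, normalize_embeddings=True)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

for cid in range(kmeans.n_clusters):
    mask = kmeans.labels_ == cid
    members = [t for t, keep in zip(event_types, mask) if keep]
    # Canonical name = member whose embedding is closest to the centroid.
    dists = np.linalg.norm(embeddings[mask] - kmeans.cluster_centers_[cid], axis=1)
    print(f"Cluster {cid}: {members} -> '{members[int(np.argmin(dists))]}'")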

Pros

  • Semantically principled
  • No re-extraction needed
  • Can tune cluster count
  • Captures similarity automatically

Cons

  • Adds pipeline complexity
  • Need to save clustering model
  • Centroid names may not be perfect
  • Still discrete (loses continuous similarity)
3. Direct Embeddings (Continuous)
Use pre-computed sentence embeddings directly instead of discrete IDs. No vocabulary limit, captures continuous similarity.

Pros

  • Continuous similarity
  • Infinite vocabulary
  • Handles new types at inference
  • Leverages pre-trained models

Cons

  • Can't learn event-specific patterns
  • Fixed embeddings (not adaptive)
  • Larger input dimension
  • Less interpretable
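
A sketch of what this looks like in practice, assuming PyTorch and sentence-transformers; the 384-to-256 projection dimensions are illustrative:

# Replace the learned ID-embedding lookup with frozen text embeddings
# projected into the model dimension. Dimensions are assumed values.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors
project = nn.Linear(384, 256)                      # map into the transformer's d_model

def embed_events(event_types):
    with torch.no_grad():                          # encoder stays frozen
        vecs = torch.tensor(encoder.encode(event_types))
    return project(vecs)

x = embed_events(["acquired_business", "dismissed_auditor_unexpectedly"])
print(x.shape)  # torch.Size([2, 256]) -- unseen types at inference are no problem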
4. Hierarchical (Verb + Object + Magnitude)
Split event types into separate features: verb ID, object ID, magnitude ID. Model learns compositional verb-object interactions.

Pros

  • Compositional learning
  • Smaller vocabularies (50 verbs + 500 objects)
  • Generalizes to unseen combinations
  • More interpretable

Cons

  • Requires restructuring data
  • More complex architecture
  • Need to parse existing types
  • Significant refactoring
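
A minimal sketch of the compositional idea in PyTorch; the vocabulary sizes echo the estimate above, while d_model and the summation scheme are assumptions:

# Separate verb/object/magnitude vocabularies whose embeddings are summed.
import torch
import torch.nn as nn

class CompositionalEventEmbedding(nn.Module):
    def __init__(self, n_verbs=50, n_objects=500, n_magnitudes=5, d_model=256):
        super().__init__()
        self.verb = nn.Embedding(n_verbs, d_model)
        self.obj = nn.Embedding(n_objects, d_model)
        self.mag = nn.Embedding(n_magnitudes, d_model)

    def forward(self, verb_ids, obj_ids, mag_ids):
        # Summing component embeddings lets the model generalize to
        # verb/object combinations never seen during training.
        return self.verb(verb_ids) + self.obj(obj_ids) + self.mag(mag_ids)

emb = CompositionalEventEmbedding()
x = emb(torch.tensor([3]), torch.tensor([42]), torch.tensor([1]))
print(x.shape)  # torch.Size([1, 256])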
5. Improved Rules + Semantic Grouping
Better rules than the current "verb + first_word" heuristic. Use domain knowledge to group similar objects. Quick to implement; no re-extraction needed.
# Semantic grouping of objects: map each object word to a canonical group.
OBJECT_GROUPS = {
  'business': ['business', 'company', 'subsidiary', 'target'],
  'costs': ['costs', 'expenses', 'fees', 'charges'],
  'equity': ['stock', 'shares', 'equity', 'common'],
  'debt': ['debt', 'notes', 'bonds', 'loan'],
}
# Invert to a word -> group lookup table.
WORD_TO_GROUP = {w: g for g, words in OBJECT_GROUPS.items() for w in words}

def compress(event_type: str) -> str:
    verb, *rest = event_type.split('_')
    group = next((WORD_TO_GROUP[w] for w in rest if w in WORD_TO_GROUP),
                 '_'.join(rest))  # fall back to the original object
    return f"{verb}_{group}"

print(compress('acquired_company'))     # -> 'acquired_business'
print(compress('incurred_legal_fees'))  # -> 'incurred_costs'
print(compress('issued_common_stock'))  # -> 'issued_equity'

Pros

  • Fast to implement
  • No re-extraction needed
  • Domain-driven (finance semantics)
  • Fully interpretable

Cons

  • Still rule-based (not data-driven)
  • Manual grouping required
  • May miss patterns
  • Not as sophisticated

⚔️ The Showdown: Two Finalists

Domain knowledge vs. data-driven compression. Both targeting ~800-1,000 event types.

Option 5: Semantic Grouping (Rule-Based + Domain Knowledge)

  • ✓ Financial domain expertise
  • ✓ Manually defined groupings
  • ✓ Fully interpretable
  • ✓ Fast implementation
  • ✓ No corpus stats needed

Philosophy: Humans know what events matter in finance.

VS

Option 6b: Hybrid IDF + Frequency (Data-Driven Statistics)

  • ✓ IDF (rarity) scoring
  • ✓ Frequency (coverage) weighting
  • ✓ Corpus-adaptive
  • ✓ Objective ranking
  • ✓ Balances rare + common

Philosophy: Let the data tell us what matters.
Feature             | Option 5 (Semantic)           | Option 6b (Hybrid)
Approach            | Rule-based + domain knowledge | Data-driven statistics
Target vocab size   | ~800-1,000 types              | ~800-1,000 types
Interpretability    | High (semantic groups)        | Medium (can inspect scores)
Adaptability        | Low (hardcoded rules)         | High (adapts to any corpus)
Rare events         | Manual selection              | Automatic (IDF scoring)
Common events       | Manual grouping               | Automatic (frequency weighting)
Domain expertise    | Required                      | Not required
Implementation time | 3-4 hours                     | 3-4 hours

The Hybrid Score Formula

score = IDF(event) × log(frequency(event))
Why this works: Balances rare high-signal events (dismissed_auditor: IDF=8.5) with common predictive events (acquired_business: IDF=3.4 but freq=12K). Pure IDF would exclude important common events. Pure frequency would include noise. The hybrid captures both.
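
A minimal sketch of the scoring and vocabulary selection, assuming events are grouped per filing; all names here are illustrative:

# Score every event type by IDF x log(frequency) and keep the top K.
# "filings" is a toy stand-in for the real per-filing event lists.
import math
from collections import Counter

def hybrid_vocab(filings, k=1000):
    n = len(filings)
    doc_freq = Counter(e for f in filings for e in set(f))  # filings containing event
    freq = Counter(e for f in filings for e in f)           # total occurrences
    # Singletons get log(1) = 0, so one-off noise falls out automatically.
    scores = {e: math.log(n / doc_freq[e]) * math.log(freq[e]) for e in freq}
    return sorted(scores, key=scores.get, reverse=True)[:k]

filings = [["acquired_business", "incurred_costs"],
           ["acquired_business", "issued_equity"],
           ["dismissed_auditor", "incurred_costs", "incurred_costs"]]
print(hybrid_vocab(filings, k=2))  # ['incurred_costs', 'acquired_business']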

Experimental Methodology

Rigorous A/B testing to determine the winner

1. Generate Both Vocabularies

Run compression algorithms on full corpus (11.9M events). Option 5: Apply semantic grouping rules. Option 6b: Calculate IDF×freq scores. Target: ~800-1,000 types each.

2. Build Training Sequences

Generate sequences_ml.jsonl (Option 5) and sequences_hybrid_ml.jsonl (Option 6b). Each file ~2.9GB with 236K training sequences (512 events each).
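
A sketch of how such sequences might be assembled, assuming each event carries a company identifier and date; all file and field names here are hypothetical:

# Group events per company, order by date, and emit fixed windows of 512.
# "events.jsonl", "cik", "date", and "event_type" are assumed names.
import json
from collections import defaultdict

SEQ_LEN = 512
by_company = defaultdict(list)
with open("events.jsonl") as f:
    for line in f:
        e = json.loads(line)
        by_company[e["cik"]].append(e)

with open("sequences_ml.jsonl", "w") as out:
    for cik, evs in by_company.items():
        evs.sort(key=lambda e: e["date"])
        for i in range(0, len(evs) - SEQ_LEN + 1, SEQ_LEN):
            window = evs[i:i + SEQ_LEN]
            out.write(json.dumps({"cik": cik,
                                  "filing_date": window[-1]["date"],
                                  "events": [e["event_type"] for e in window]}) + "\n")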

3. Compare Vocabularies

Analyze overlap, unique events in each, verb distribution, interpretability. Manual inspection of top events to validate quality.

4. Train Transformers

Train identical transformer models on both datasets. Same architecture, same hyperparameters, same training data. Only difference: vocabulary compression strategy.

5. Evaluate on Validation Set

Measure the correlation between predicted and actual returns. Winner = highest validation correlation (target: >0.25); a gap of more than 2 percentage points counts as a decisive win.
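
The metric itself is simple: Pearson correlation between predicted and realized returns. A placeholder sketch, assuming SciPy (the arrays are illustrative):

# Pearson correlation between model predictions and realized returns.
import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.02, -0.01, 0.05, 0.00])  # model outputs on validation set
actual    = np.array([0.03, -0.02, 0.04, 0.01])  # realized returns

r, p_value = pearsonr(predicted, actual)
print(f"validation correlation: {r:.3f} (p={p_value:.3f})")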

Expected Outcomes

  • If Option 5 wins: Domain knowledge beats data mining. Finance semantics critical for prediction.
  • If Option 6b wins: Data-driven selection beats human intuition. Use hybrid approach going forward.
  • If they're similar: Use Option 5 for interpretability, Option 6b for automation/scale.
🏆 We Have a Winner!

Option 6b (Hybrid IDF×log(freq)): 42.8% test correlation
85.5% improvement over baseline • 389 event types • Trained October 31, 2025

📊 Experimental Results

We trained and compared three transformer models on 236K SEC filings (2023-2025) using an identical architecture. Option 6b's data-driven vocabulary selection decisively outperformed semantic grouping.

Model                       | Vocab Size | Split Method | Test Correlation | Test RMSE | Improvement over Baseline
GradientBoosting (baseline) | 1,840      | Temporal     | 23.1%            | 17.19     | -
🏆 Option 6b (WINNER)       | 389        | Temporal     | 42.8%            | 15.78     | +85.5%
Option 5 (Semantic)         | 3,558      | Temporal     | 37.0%            | 16.80     | +60.5%
Option 6b (Random Split)    | 389        | Random       | 27.6%            | 14.71     | +19.4%

Why Option 6b Won

📉 Smaller Vocabulary

389 types vs 3,558 types. Fewer parameters (1.29M vs 1.69M) reduce overfitting risk and force the model to learn generalizable patterns.

🎯 Data-Driven Selection

IDF×log(freq) identifies truly discriminative events based on actual occurrence patterns, not human intuition about finance semantics.

🔇 Reduced Noise

Filters out rare, uninformative events while preserving both high-signal rare events AND common predictive patterns.

⚡ Better Signal Extraction

Model focuses on events that actually correlate with returns instead of memorizing semantic categories that may not predict well.

Key Insights from Training

  • Temporal split (42.8%) shows production-ready performance for predicting future returns from past events (see the split sketch after this list)
  • Random split (27.6%) reveals the model needs more data: the current 236K filings span only 2.5 years
  • Both models early-stopped at epochs 2-3, indicating a data limitation rather than an architecture issue
  • Next step: retrain on the full 10-year dataset (700K filings, 2015-2025), targeting stable random-split performance of 30-35%
  • The hybrid IDF×log(freq) method is the clear winner; use it for all future work
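
For reference, the two split strategies in a pandas sketch; the file and column names are hypothetical:

# Temporal vs. random 80/20 split ("filing_date" is an assumed column name).
import pandas as pd

df = pd.read_json("sequences_ml.jsonl", lines=True)
cut = int(len(df) * 0.8)

# Temporal: train strictly on the past, test on the future (production-realistic).
df_sorted = df.sort_values("filing_date")
train_t, test_t = df_sorted.iloc[:cut], df_sorted.iloc[cut:]

# Random: shuffle first, so train and test mix time periods.
df_shuffled = df.sample(frac=1.0, random_state=42)
train_r, test_r = df_shuffled.iloc[:cut], df_shuffled.iloc[cut:]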

The Verdict

Data-driven methods beat human intuition.
Statistical event selection (IDF×frequency) outperformed semantic grouping by 5.8 percentage points (42.8% vs 37.0%). This validates using hybrid vocabulary compression for the transformer and all future models.

Long-Term Solution: Controlled Vocabulary

Now that we know data-driven selection wins, we can use Option 6b's insights to design the long-term solution.

Phase 1: Learn from Winner ✅ Complete

Option 6b's hybrid IDF×log(freq) method selected 389 event types that actually predict returns. These 389 types serve as the foundation for our canonical vocabulary design.

Phase 2: Design Canonical Vocabulary (Next Quarter)

Create 300-500 canonical event types based on winner's insights. Update LLM prompts to use controlled vocabulary instead of open-ended schema.

NEW PROMPT:
"Classify this event into ONE of the following types:
  - acquired_business
  - acquired_asset
  - issued_debt
  - issued_equity
  - announced_earnings
  - incurred_major_costs
  ...
[Full list of 300-500 canonical types]

Choose the MOST SPECIFIC type that applies."
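
A hypothetical guardrail to pair with the new prompt: reject any answer outside the canonical set, then retry or flag it for review (everything here is illustrative, not the project's actual code):

# Validate LLM output against the canonical vocabulary; names are hypothetical.
CANONICAL_TYPES = {
    "acquired_business", "acquired_asset", "issued_debt",
    "issued_equity", "announced_earnings", "incurred_major_costs",
    # ... full list of 300-500 canonical types
}

def validate(llm_answer: str) -> str | None:
    event_type = llm_answer.strip().lower().replace(" ", "_")
    return event_type if event_type in CANONICAL_TYPES else None  # None -> retry or flag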

Phase 3: Re-extract All Events (Future)

Re-run vLLM extraction on all 1.1M filings with controlled vocabulary. Cost: ~$500-1,000. Benefit: Clean, consistent events from the start. No compression needed ever again.

The Bottom Line

We gave the LLM too much freedom and got 37,927 event types instead of ~500. The showdown revealed that data-driven methods (Option 6b: 42.8%) decisively beat domain knowledge (Option 5: 37.0%). The 389 winning event types now inform our long-term controlled vocabulary design.

This isn't just about compression; it's about understanding which events actually predict returns. The data has spoken: statistical IDF×frequency selection beats human intuition about finance semantics. This validates our approach and provides a clear path forward for the transformer model.