{verb}_{object}_{magnitude}
Top verbs with the most variations: each should collapse to roughly 10-20 event types, not thousands.
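To illustrate the template, a minimal normalizer that collapses free-form slot values onto a single `{verb}_{object}_{magnitude}` token (the example slot values are hypothetical, not drawn from the corpus):

```python
import re

def canonicalize(verb: str, obj: str, magnitude: str) -> str:
    """Lowercase each slot, replace non-alphanumeric runs with '_',
    and join into one {verb}_{object}_{magnitude} event-type token."""
    def slug(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", "_", s.lower()).strip("_")
    return f"{slug(verb)}_{slug(obj)}_{slug(magnitude)}"

print(canonicalize("Increased", "revenue guidance", "large"))
# increased_revenue_guidance_large
```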
Domain knowledge (Option 5) vs. data-driven compression (Option 6b): both target ~800-1,000 event types.
| Feature | Option 5 (Semantic) | Option 6b (Hybrid) |
|---|---|---|
| Approach | Rule-based + domain knowledge | Data-driven statistics |
| Target Vocab Size | ~800-1,000 types | ~800-1,000 types |
| Interpretability | High (semantic groups) | Medium (can inspect scores) |
| Adaptability | Low (hardcoded rules) | High (adapts to any corpus) |
| Rare Events | Manual selection | Automatic (IDF scoring) |
| Common Events | Manual grouping | Automatic (frequency weighting) |
| Domain Expertise | Required | Not required |
| Implementation Time | 3-4 hours | 3-4 hours |
Rigorous A/B testing to determine the winner
Run compression algorithms on full corpus (11.9M events). Option 5: Apply semantic grouping rules. Option 6b: Calculate IDF×freq scores. Target: ~800-1,000 types each.
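The Option 6b implementation isn't reproduced here; a sketch of the IDF×log(freq) scoring idea, assuming the corpus is represented as one list of event types per filing (function names are illustrative):

```python
import math
from collections import Counter

def hybrid_scores(docs):
    """Score each event type by IDF x log(1 + corpus frequency).
    High IDF rewards events concentrated in few filings; the log-frequency
    term keeps common predictive events from being discarded."""
    n_docs = len(docs)
    freq = Counter()      # total occurrences across the corpus
    doc_freq = Counter()  # number of filings containing each event
    for events in docs:
        freq.update(events)
        doc_freq.update(set(events))
    return {e: math.log(n_docs / doc_freq[e]) * math.log(1 + freq[e])
            for e in freq}

def select_vocab(docs, k=1000):
    """Keep the k highest-scoring event types as the compressed vocabulary."""
    scores = hybrid_scores(docs)
    return [e for e, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

Note that an event appearing in every filing gets IDF 0 and drops out, which matches the goal of keeping discriminative rather than ubiquitous events.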
Generate sequences_ml.jsonl (Option 5) and sequences_hybrid_ml.jsonl (Option 6b). Each file ~2.9GB with 236K training sequences (512 events each).
Analyze overlap, unique events in each, verb distribution, interpretability. Manual inspection of top events to validate quality.
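The comparison step can be sketched as a small report function, assuming event types follow `{verb}_{object}_{magnitude}` so the token before the first `_` is the verb:

```python
from collections import Counter

def compare_vocabs(vocab_a, vocab_b):
    """Summarize overlap, events unique to each vocabulary, and the
    top verbs on each side, for manual inspection of quality."""
    a, b = set(vocab_a), set(vocab_b)
    verbs = lambda v: Counter(e.split("_", 1)[0] for e in v)
    return {
        "overlap": len(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "verbs_a": verbs(a).most_common(5),
        "verbs_b": verbs(b).most_common(5),
    }
```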
Train identical transformer models on both datasets. Same architecture, same hyperparameters, same training data. Only difference: vocabulary compression strategy.
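One way to keep the A/B comparison controlled is a single shared hyperparameter dict, with the dataset as the only varying field; the values below are placeholders, not the actual training configuration:

```python
# Hypothetical shared hyperparameters; the only permitted difference
# between the two runs is which sequence file (i.e. vocabulary) they use.
BASE_CONFIG = dict(
    d_model=256, n_layers=4, n_heads=8, seq_len=512,
    lr=3e-4, batch_size=64, epochs=10, seed=42,
)
run_option5 = {**BASE_CONFIG, "dataset": "sequences_ml.jsonl"}
run_option6b = {**BASE_CONFIG, "dataset": "sequences_hybrid_ml.jsonl"}

# Sanity check: the runs differ only in their dataset.
diff = {k for k in run_option5 if run_option5[k] != run_option6b[k]}
assert diff == {"dataset"}
```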
Measure the correlation between predicted and actual returns. Winner = highest validation correlation (target: >0.25); success threshold: a gap of more than 2 percentage points between the two options.
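The evaluation metric itself needs no dependencies; a plain Pearson correlation between predicted and realized returns:

```python
import math

def pearson(preds, actuals):
    """Pearson correlation; the model with the highest held-out value
    wins (target > 0.25 on the validation split)."""
    n = len(preds)
    mp, ma = sum(preds) / n, sum(actuals) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(preds, actuals))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actuals))
    return cov / (sp * sa)
```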
Trained and compared three transformer models on 236K SEC filings (2023-2025) using identical architecture. Option 6b's data-driven vocabulary selection decisively outperformed semantic grouping.
| Model | Vocab Size | Split Method | Test Correlation | Test RMSE | Baseline Improvement |
|---|---|---|---|---|---|
| GradientBoosting (baseline) | 1,840 | Temporal | 23.1% | 17.19 | - |
| 🏆 Option 6b (WINNER) | 389 | Temporal | 42.8% | 15.78 | +85.5% |
| Option 5 (Semantic) | 3,558 | Temporal | 37.0% | 16.80 | +60.5% |
| Option 6b (Random Split) | 389 | Random | 27.6% | 14.71 | +19.4% |
- 389 types vs 3,558 types: fewer parameters (1.29M vs 1.69M) reduce overfitting risk and force the model to learn generalizable patterns.
- IDF×log(freq) identifies truly discriminative events from actual occurrence patterns, not human intuition about finance semantics.
- It filters out rare, uninformative events while preserving both high-signal rare events and common predictive patterns.
- The model focuses on events that actually correlate with returns instead of memorizing semantic categories that may not predict well.
Data-driven methods beat human intuition.
Statistical event selection (IDF×frequency) outperformed semantic grouping by 5.8 percentage points (42.8% vs 37.0%).
This validates using hybrid vocabulary compression for the transformer and all future models.
Now that we know data-driven selection wins, we can use Option 6b's insights to design the long-term solution: a controlled canonical vocabulary.
Option 6b's hybrid IDF×log(freq) method selected 389 event types that actually predict returns. These 389 types serve as the foundation for our canonical vocabulary design.
Create 300-500 canonical event types based on winner's insights. Update LLM prompts to use controlled vocabulary instead of open-ended schema.
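A sketch of how the extraction prompt could enforce the closed vocabulary; the event names and prompt wording below are hypothetical:

```python
# Hypothetical canonical event types; the real list would start from the
# 389 winning types plus manual curation (target: 300-500 total).
CANONICAL_EVENTS = [
    "increased_revenue_guidance_large",
    "decreased_capex_small",
    "announced_share_buyback_large",
]

def build_extraction_prompt(filing_text: str) -> str:
    """Embed the closed vocabulary in the prompt so the LLM labels each
    event with exactly one allowed type instead of inventing new ones."""
    vocab_block = "\n".join(f"- {e}" for e in CANONICAL_EVENTS)
    return (
        "Extract events from the SEC filing below. Label each event with "
        "exactly one type from this closed list; do not invent new types.\n"
        f"{vocab_block}\n\nFiling:\n{filing_text}"
    )
```

Validating model output against the same list (retrying or mapping violations to `<unk>`) would close the loop and keep the vocabulary clean from the start.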
Re-run vLLM extraction on all 1.1M filings with controlled vocabulary. Cost: ~$500-1,000. Benefit: Clean, consistent events from the start. No compression needed ever again.
We gave the LLM too much freedom and got 37,927 event types instead of ~500. The showdown revealed that data-driven methods (Option 6b: 42.8%) decisively beat domain knowledge (Option 5: 37.0%). The 389 winning event types now inform our long-term controlled vocabulary design.
This isn't just about compression: it's about understanding which events actually predict returns. The data has spoken: statistical IDF×frequency selection beats human intuition about finance semantics. This validates our approach and provides a clear path forward for the transformer model.