November 2025 • Alpha Research

Event Database Analysis & Alpha Generation

10 systematic strategies to extract alpha from 9.1M events across 51,551 unique event types. From IDF-weighted importance scoring to event sequence mining, co-occurrence networks, and ML embeddings. Production-ready database with 97% confidence extraction quality.

Database Overview

Core Statistics

  • Scale: 9.1M events
  • Event Types: 51,551 unique types
  • Companies: 96K
  • Time Range: 5.5 years (2020-2025)
  • Base Verbs: 491 unique action verbs
  • Confidence: 97% average extraction quality
  • Material Events: 70% (high signal-to-noise ratio)

Key Finding:

This is far more than 30 hand-crafted events. The 30 events cited in other analyses are only anchor patterns; the system has discovered 51,551 emergent event types from the data itself.

Core Insight: Event Type Richness

Structure: {verb}_{object}_{magnitude}

Example: dismissed_auditor_accounting_disagreement_critical


Idea 1: Inflection Point Event Clustering

Concept

Identify unusual event combinations that appear before large price movements.

Approach A: Retrospective Pattern Mining

  1. Get stocks with >20% moves over 30 days
  2. Look back 60 days before the move
  3. Extract event vectors for those windows
  4. Cluster to find common patterns
  5. Test if patterns predict future inflections

Key Metrics:

  • Event diversity: Unique event types in window
  • IDF-weighted importance: Rare events matter more
  • Sentiment momentum: Positive → negative shifts
  • Critical event count: magnitude='critical'
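
A minimal sketch of Approach A under the schema above; get_big_moves() and get_events() are hypothetical helpers, and the four features mirror the key metrics listed here:

# Hypothetical sketch: build pre-move feature vectors and cluster them
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def window_features(events):
    n = max(len(events), 1)
    return [
        len(set(e.event_type for e in events)),                # event diversity
        sum(e.idf_score for e in events),                      # IDF-weighted importance
        sum(e.sentiment == 'negative' for e in events) / n,    # negative-sentiment share
        sum(e.magnitude_class == 'critical' for e in events),  # critical event count
    ]

# get_big_moves(): hypothetical helper yielding (cik, move_date) for >20% moves over 30 days
X = np.array([window_features(get_events(cik, before=move_date, days=60))
              for cik, move_date in get_big_moves(threshold=0.20)])

labels = KMeans(n_clusters=10).fit_predict(StandardScaler().fit_transform(X))  # candidate pre-inflection patterns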

Approach B: Event Sequence Mining

Find predictive patterns like:

  • announced_acquisition → delayed_project_delay → impaired_asset_impairment (acquisition gone wrong cascade)
  • material_weakness → dismissed_auditor → covenant_violation (distress cascade)
  • authorized_buyback_program → upgraded_analyst → expanded_market (positive momentum cascade)

SQL Example:

-- Find distress cascades
SELECT a.cik, a.filing_date, a.event_type as first_event,
       b.event_type as second_event,
       c.event_type as third_event,
       julianday(c.filing_date) - julianday(a.filing_date) as cascade_days
FROM events a
JOIN events b ON a.cik = b.cik AND b.filing_date > a.filing_date
JOIN events c ON b.cik = c.cik AND c.filing_date > b.filing_date
WHERE a.event_type LIKE 'material_weakness%'
  AND b.event_type LIKE 'dismissed_auditor%'
  AND c.event_type LIKE 'covenant_violation%'
  AND julianday(c.filing_date) - julianday(a.filing_date) < 180;

Idea 2: IDF-Weighted Event Importance Scoring

Insight

IDF scores (3.02 to 16.02) measure event rarity. Rare events have maximum alpha potential.

Ultra-High-Alpha Events (IDF > 12):

  • dismissed_auditor_fraud_concern_critical (IDF 16.02, 4 events)
  • dismissed_auditor_accounting_disagreement_critical (IDF 13.38, 14 events)
  • covenant_violation_acceleration_critical (IDF 13-16)
  • Revenue restatements (IDF 15+)

Composite Importance Score:

importance = (
    idf_score * 0.4 +           # Rarity
    confidence * 0.2 +           # Extraction quality
    magnitude_weight * 0.3 +     # Impact size (critical=5, massive=4)
    sentiment_strength * 0.1     # Directional signal
)

Application: Build a daily alert system for events with importance > 8.0
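
A minimal sketch of that alert filter, mirroring the composite weights above; get_events_on() is a hypothetical helper and sentiment_strength is an assumed field name:

# Hypothetical daily alert sketch using the composite importance score
MAGNITUDE_WEIGHT = {'critical': 5, 'massive': 4, 'major': 3, 'medium': 2, 'small': 1}

def importance(e):
    # Mirrors the composite formula above
    return (e.idf_score * 0.4
            + e.confidence * 0.2
            + MAGNITUDE_WEIGHT.get(e.magnitude_class, 1) * 0.3
            + e.sentiment_strength * 0.1)

def daily_alerts(filing_date, threshold=8.0):
    # get_events_on(): hypothetical helper returning events filed on a given date
    return [(e.cik, e.event_type, importance(e))
            for e in get_events_on(filing_date) if importance(e) > threshold]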

Idea 3: Company Event Signature Analysis

Concept

Each company has an "event fingerprint" - their distribution across 51K event types.

Healthy vs Distressed Signatures:

| Metric | Healthy Company | Distressed Company |
| --- | --- | --- |
| Event diversity | 100-200 unique types | 50-100 unique types |
| Positive sentiment | 25-30% | <15% |
| Negative sentiment | <20% | >40% |
| Critical events | <1% of total | >5% of total |
| High-IDF events | <5% | >15% |
| Common events | approved_regulatory, completed_acquisition, partnered_joint_venture | material_weakness, delayed_project, impaired_asset, covenant_violation |
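
A minimal sketch of scoring one company against these thresholds, reusing the assumed get_events() helper and event fields; the IDF > 10 cutoff for "high IDF" is an assumption:

# Hypothetical signature snapshot for one company
def event_signature(cik, days=365):
    events = get_events(cik, days=days)
    n = max(len(events), 1)
    return {
        'diversity': len(set(e.event_type for e in events)),
        'pos_pct': sum(e.sentiment == 'positive' for e in events) / n,
        'neg_pct': sum(e.sentiment == 'negative' for e in events) / n,
        'critical_pct': sum(e.magnitude_class == 'critical' for e in events) / n,
        'high_idf_pct': sum(e.idf_score > 10 for e in events) / n,  # "high IDF" threshold assumed
    }

sig = event_signature(cik)
looks_distressed = (sig['neg_pct'] > 0.40
                    and sig['critical_pct'] > 0.05
                    and sig['high_idf_pct'] > 0.15)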

Signature Transition Detection:

-- Detect companies transitioning from healthy to distressed
WITH event_counts AS (
    SELECT cik,
           strftime('%Y-%m', filing_date) AS month,
           COUNT(DISTINCT event_type) AS diversity,
           AVG(CASE WHEN sentiment = 'negative' THEN 1.0 ELSE 0.0 END) AS neg_pct,
           AVG(CASE WHEN magnitude_class = 'critical' THEN 1.0 ELSE 0.0 END) AS crit_pct
    FROM events
    GROUP BY cik, month
),
deltas AS (
    SELECT cik, month,
           diversity - LAG(diversity, 3) OVER (PARTITION BY cik ORDER BY month) AS diversity_change,
           neg_pct - LAG(neg_pct, 3) OVER (PARTITION BY cik ORDER BY month) AS neg_change
    FROM event_counts
)
SELECT cik, month, diversity_change, neg_change
FROM deltas
WHERE diversity_change < -30 OR neg_change > 0.2;

Alpha Signal: When a company's signature suddenly shifts clusters → inflection point ahead

Idea 4: Event Co-Occurrence Networks

Concept

Build a graph where event types are nodes, edges represent co-occurrence within same company+window.

Network Properties:

  • Dense clusters: Event types that always appear together (e.g., M&A events)
  • Bridge events: Events that connect different clusters (transitional states)
  • Rare paths: Unusual event combinations (high alpha potential)

Example Analysis:

# Sketch (companies, get_events, and avg_idf are assumed helpers)
import networkx as nx
from itertools import combinations

G = nx.Graph()

# Add edges for events co-occurring within 30 days; accumulate co-occurrence weight
for cik in companies:
    events = get_events(cik, window=30)
    for e1, e2 in combinations(events, 2):
        w = 1.0 / e1.idf_score
        if G.has_edge(e1.event_type, e2.event_type):
            G[e1.event_type][e2.event_type]['weight'] += w
        else:
            G.add_edge(e1.event_type, e2.event_type, weight=w)

# Find unusual event clusters via Louvain community detection
communities = nx.community.louvain_communities(G)
for community in communities:
    if avg_idf(community) > 10:  # avg_idf: mean IDF of the event types in the cluster
        print(f"High-alpha cluster: {community}")

Application: Detect when a company enters a "rare path" through event space → trade signal

Idea 5: Temporal Event Density Analysis

Concept

Measure event "velocity" - how fast events are accumulating.

Event Burst Detection:

def detect_burst(cik, lookback_days=90):
    recent_events = get_events(cik, days=30)
    # Average number of events per 30-day window over the lookback period
    historical_avg = len(get_events(cik, days=lookback_days)) / (lookback_days / 30)

    burst_ratio = len(recent_events) / historical_avg if historical_avg else 0.0
    rare_event_count = sum(1 for e in recent_events if e.idf_score > 10)

    return {
        'burst_ratio': burst_ratio,  # >2.0 = unusual activity
        'rare_events': rare_event_count,  # >3 = high alpha
        'diversity': len(set(e.event_type for e in recent_events))
    }

Alert Criteria:

  • Burst ratio > 2.0 (2x normal event frequency)
  • Rare events > 3 (multiple high-IDF events)
  • Diversity > 15 (many different event types)
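
Applying those criteria across a watchlist might look like this (universe is an assumed list of CIKs):

# Hypothetical scan applying the burst-alert criteria above
for cik in universe:
    b = detect_burst(cik)
    if b['burst_ratio'] > 2.0 and b['rare_events'] > 3 and b['diversity'] > 15:
        print(f"Event burst alert: {cik} {b}")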

Idea 6: Sentiment Momentum Strategy

Concept

Track sentiment shifts as leading indicators.

Momentum Calculation:

from math import exp
from datetime import date

def sentiment_momentum(cik, window_days=60):
    events = get_events(cik, days=window_days)
    if not events:
        return 0.0

    today = date.today()
    scores = []
    for e in sorted(events, key=lambda x: x.filing_date):
        # Weight recent events more heavily (exponential decay, ~30-day time constant)
        age_days = (today - e.filing_date).days
        decay = exp(-age_days / 30)

        sentiment_value = {'positive': 1, 'neutral': 0, 'negative': -1}[e.sentiment]
        magnitude_weight = {
            'massive': 5, 'critical': 4, 'major': 3,
            'medium': 2, 'small': 1
        }.get(e.magnitude_class, 1)

        score = sentiment_value * magnitude_weight * decay * e.idf_score
        scores.append(score)

    return sum(scores) / len(scores)

Signals:

  • Momentum > 2.0: Strong positive (go long)
  • Momentum < -2.0: Strong negative (go short)
  • Rapid momentum reversal: Inflection point detected

Idea 7: Critical Event Cascade Detection

Insight

Some event types predict future critical events.

Cascade Patterns to Track:

Distress Cascade (bearish):
material_weakness → dismissed_auditor → covenant_violation → defaulted

Turnaround Cascade (bullish):
appointed_new_ceo → reorganized → discontinued_unprofitable → authorized_buyback

Growth Cascade (bullish):
partnered → developed → certified → commercialized → expanded

Implementation:

-- Find companies in early stages of distress cascade
SELECT e1.cik, COUNT(DISTINCT e2.event_type) as cascade_depth
FROM events e1
LEFT JOIN events e2 ON e1.cik = e2.cik
    AND e2.filing_date > e1.filing_date
    AND e2.filing_date < date(e1.filing_date, '+180 days')
    AND (e2.event_type LIKE 'dismissed_auditor_%'
         OR e2.event_type LIKE 'covenant_violation_%'
         OR e2.event_type LIKE 'defaulted_%')
WHERE e1.event_type LIKE 'material_weakness_%'
    AND e1.filing_date > date('now', '-365 days')
GROUP BY e1.cik
HAVING cascade_depth >= 2;

Idea 8: Industry-Specific Event Patterns

Concept

Different event types matter for different industries.

Approach:

  1. Cluster companies by their event type distributions (K-means on TF-IDF vectors; see the sketch after this list)
  2. Identify industry-specific event patterns
  3. Build industry-specific scoring models
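
A minimal sketch of steps 1-2, assuming get_events() and companies as in earlier snippets; each full event type string is kept as a single TF-IDF token:

# Hypothetical clustering of companies by their event-type distributions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [' '.join(e.event_type for e in get_events(cik)) for cik in companies]
tfidf = TfidfVectorizer(token_pattern=r'\S+')   # treat each event_type as one token
X = tfidf.fit_transform(docs)

labels = KMeans(n_clusters=50).fit_predict(X)   # industry-like groupings of companies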

Examples:

  • Biotech: certified_fda_approval_critical is maximum alpha
  • Finance: material_weakness_*, restated_* are critical
  • Tech: partnered_*, developed_*, patented_* are growth signals

Idea 9: Event Type Embedding Space

Concept

Learn embeddings for all 51K event types based on co-occurrence patterns.

Implementation:

from gensim.models import Word2Vec

# Treat each company's events as a "sentence"
sentences = []
for cik in companies:
    events = get_events(cik, sort_by='filing_date')
    sentence = [e.event_type for e in events]
    sentences.append(sentence)

# Train embeddings
model = Word2Vec(sentences, vector_size=100, window=10, min_count=5)

# Find similar events
model.wv.most_similar('dismissed_auditor_accounting_disagreement_critical')
# Returns: covenant_violation_*, material_weakness_*, restated_*

# Cluster the event embedding space (only event types meeting min_count are in the vocabulary)
from sklearn.cluster import KMeans
event_types = model.wv.index_to_key
vectors = model.wv[event_types]
clusters = KMeans(n_clusters=100).fit(vectors)

Application: Detect when a company enters a new cluster → state transition signal
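
A minimal sketch of that transition check, reusing the Word2Vec model and KMeans fit above (get_events() assumed as before):

# Hypothetical state-transition check: compare a company's recent vs. prior event cluster
import numpy as np

def mean_vector(events):
    vecs = [model.wv[e.event_type] for e in events if e.event_type in model.wv]
    return np.mean(vecs, axis=0) if vecs else None

recent = mean_vector(get_events(cik, days=60))    # last two months of events
prior = mean_vector(get_events(cik, days=365))    # trailing-year baseline (includes recent window)

if recent is not None and prior is not None:
    shifted = clusters.predict([recent])[0] != clusters.predict([prior])[0]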

Idea 10: Multi-Horizon Event Features for ML

Concept

Create features at different time horizons for predictive modeling.

Feature Engineering:

from statistics import mean

def create_features(cik, prediction_date):
    features = {}

    for horizon in [7, 14, 30, 60, 90]:
        events = get_events(cik, before=prediction_date, days=horizon)
        n = len(events)

        features[f'count_{horizon}d'] = n
        features[f'diversity_{horizon}d'] = len(set(e.event_type for e in events))
        features[f'avg_idf_{horizon}d'] = mean(e.idf_score for e in events) if n else 0.0
        features[f'neg_pct_{horizon}d'] = sum(e.sentiment == 'negative' for e in events) / n if n else 0.0
        features[f'critical_count_{horizon}d'] = sum(e.magnitude_class == 'critical' for e in events)

        # Top-5 most important events (by IDF * confidence)
        top_events = sorted(events, key=lambda e: e.idf_score * e.confidence, reverse=True)[:5]
        for i, e in enumerate(top_events):
            features[f'top{i}_idf_{horizon}d'] = e.idf_score
            features[f'top{i}_type_{horizon}d'] = e.event_type  # categorical; one-hot encode downstream

    return features

Model:

Train XGBoost/neural network to predict:

  • 5-day forward return
  • 20-day forward volatility
  • Probability of >10% move in next 30 days
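
A minimal sketch of the modeling step, assuming X is a pandas DataFrame of create_features() rows sorted chronologically and y holds the matching 5-day forward returns:

# Hypothetical training loop with walk-forward (time-ordered) splits
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit

model = xgb.XGBRegressor(n_estimators=500, max_depth=4, learning_rate=0.05)
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    print(model.score(X.iloc[test_idx], y.iloc[test_idx]))  # out-of-sample R^2 per fold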

Practical Implementation Roadmap

Phase 1: Quick Wins (1-2 weeks)

  1. High-IDF Alert System: Daily scan for IDF > 12 events (see the sketch after this list)
  2. Distress Cascade Detector: SQL queries for multi-event patterns
  3. Sentiment Momentum Dashboard: Track top/bottom momentum companies
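
A minimal sketch of quick win 1, assuming the events table from the SQL examples also stores idf_score and lives in a local SQLite file (events.db is a hypothetical name):

# Hypothetical daily scan for ultra-rare (IDF > 12) events
import sqlite3

conn = sqlite3.connect('events.db')
rows = conn.execute("""
    SELECT cik, filing_date, event_type, idf_score
    FROM events
    WHERE idf_score > 12 AND filing_date >= date('now', '-1 day')
    ORDER BY idf_score DESC
""").fetchall()
for row in rows:
    print(row)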

Phase 2: Pattern Discovery (2-4 weeks)

  1. Event Co-Occurrence Analysis: Build network graphs
  2. Sequence Mining: Find predictive event sequences (PrefixSpan algorithm)
  3. Company Signature Clustering: K-means on event type distributions

Phase 3: ML Integration (4-8 weeks)

  1. Event Embeddings: Train Word2Vec on event sequences
  2. Multi-Horizon Features: Create feature pipeline for ML models
  3. Inflection Point Predictor: Train model on historical inflections

Phase 4: Production System (8-12 weeks)

  1. Real-Time Event Processing: Stream new filings → event extraction → scoring
  2. Portfolio Construction: Combine signals into systematic strategy
  3. Backtesting Framework: Validate on out-of-sample data

Key Takeaways

  1. 51K event types, not 30 - Massive feature space for discovery
  2. IDF scores are gold - Events with IDF > 10 are rare and high-alpha
  3. Event sequences matter - Cascades predict inflection points
  4. Company signatures shift - Detecting transitions is key
  5. Temporal dynamics - Event velocity and momentum are signals
  6. 97% confidence - Extraction quality is excellent
  7. 70% material events - Good filtering already in place

The database is production-ready for systematic alpha generation. The next step is connecting these event patterns to actual price movements and building predictive models.