How we built the infrastructure to handle enterprise-scale event logs with sub-second query times.
Process mining is computationally expensive. Discovering process models from event logs, checking conformance, computing variant frequencies — these operations touch every event in the dataset. At enterprise scale, that means millions of events.
Here's how we built Sancalana's infrastructure to handle it.
Our system splits into four layers: ingestion, storage, computation, and serving.
System Architecture
==========================================
Event Sources
(CSV, DB, Warehouse, API)
|
v
+------------------+
| Ingestion | Parse, validate, normalize
| Pipeline | Schema detection
+------------------+ Deduplication
|
v
+------------------+
| Event Store | Column-oriented storage
| (Partitioned) | Partitioned by case ID
+------------------+ Compressed, indexed
|
v
+------------------+
| Compute Layer | Process discovery algorithms
| (On-demand) | Conformance checking
+------------------+ Variant clustering
|
v
+------------------+
| Serving Layer | Pre-computed models
| (Cached) | Real-time queries
+------------------+ Dashboard API
Event logs come in messy. Different formats, inconsistent timestamps, duplicate events, missing case IDs. Our ingestion pipeline handles all of this:
Ingestion Pipeline
==========================================
Raw Upload (CSV/XES/Parquet/API)
|
v
[Format Detection]
|
v
[Schema Mapping] case_id, activity, timestamp
| + optional attributes
v
[Validation] Missing fields? Invalid dates?
| Out-of-order events?
v
[Normalization] UTC timestamps, consistent
| activity names, dedup
v
[Partitioning] Partition by case_id hash
| Sort by timestamp within case
v
[Event Store] Ready for computation
This pipeline processes ~50,000 events per second on a single node. For large uploads (100M+ events), we shard across multiple nodes.
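The normalization and dedup steps above can be sketched roughly as follows. This is an illustrative stand-in, not Sancalana's actual ingestion code; the record layout and function name are assumptions.

```python
from datetime import datetime, timezone

def normalize_events(raw_events):
    """Normalize raw event dicts: UTC timestamps, consistent activity
    names, exact-duplicate removal, and per-case timestamp ordering.
    (Illustrative sketch, not the production pipeline.)"""
    seen = set()
    out = []
    for ev in raw_events:
        ts = datetime.fromisoformat(ev["timestamp"])
        # Naive timestamps are assumed UTC; aware ones are converted.
        ts = ts.replace(tzinfo=timezone.utc) if ts.tzinfo is None else ts.astimezone(timezone.utc)
        key = (ev["case_id"], ev["activity"].strip().lower(), ts.isoformat())
        if key in seen:  # dedup on (case, activity, timestamp)
            continue
        seen.add(key)
        out.append({"case_id": key[0], "activity": key[1], "timestamp": ts})
    # Sort by timestamp within each case so downstream discovery
    # algorithms can read each trace in order.
    out.sort(key=lambda e: (e["case_id"], e["timestamp"]))
    return out
```

The dedup key here is the full (case, activity, timestamp) triple; a real pipeline would likely also fold in optional attributes.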
The Alpha Miner, Heuristic Miner, Inductive Miner, and Fuzzy Miner all have different computational profiles. We benchmarked them:
Algorithm Performance (10M events, 50K cases)
==========================================
Alpha Miner #### 3.2s
Heuristic Miner ######## 7.8s
Inductive Miner ############## 14.1s
Fuzzy Miner ###### 5.9s
0s 5s 10s 15s 20s
The Inductive Miner produces the cleanest models but is the slowest. We run it in the background and show Heuristic Miner results first, then upgrade the model when the Inductive Miner finishes.
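The fast-then-upgrade pattern can be sketched like this: serve the quick model immediately, run the slower miner in a background thread, and swap the result in when it lands. The class and the miner callables are illustrative, not our real API.

```python
import threading

class ProgressiveModel:
    """Serve a fast, rough model immediately; swap in a slower,
    cleaner model when background discovery finishes.
    (Sketch; miner callables are stand-ins.)"""

    def __init__(self, fast_miner, slow_miner, log):
        self._lock = threading.Lock()
        self._model = fast_miner(log)   # e.g. Heuristic Miner: seconds
        self._source = "fast"
        self._thread = threading.Thread(
            target=self._refine, args=(slow_miner, log), daemon=True)
        self._thread.start()

    def _refine(self, slow_miner, log):
        model = slow_miner(log)         # e.g. Inductive Miner: slower, cleaner
        with self._lock:
            self._model, self._source = model, "slow"

    def current(self):
        """Return the best model available right now."""
        with self._lock:
            return self._model, self._source
```

The lock keeps the swap atomic so a dashboard polling `current()` never sees a half-replaced model.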
For interactive exploration (filtering, drilling down, variant analysis), sub-second response is non-negotiable. Our approach:
Query Path
==========================================
User interaction (filter, click, drill-down)
|
v
[Query Planner]
|
+----+----+
| |
v v
[Cache] [Compute]
< 50ms < 2s
| |
+----+----+
|
v
[Result Merge + Render]
We pre-compute the most common query patterns (variant frequencies, bottleneck durations, case counts by status) and cache them. For ad-hoc queries, we compute on demand but still target < 2 seconds.
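The cache-first dispatch is roughly the following shape. This is a minimal sketch assuming an in-memory dict cache and a pluggable compute function; the real serving layer handles invalidation when new events arrive.

```python
class QueryServer:
    """Cache-first query dispatch: pre-computed results answer common
    patterns on the fast path; everything else falls through to
    on-demand compute and is cached for next time. (Illustrative.)"""

    def __init__(self, compute_fn):
        self._cache = {}
        self._compute = compute_fn

    def precompute(self, query, result):
        """Warm the cache with a common pattern (variant frequencies,
        bottleneck durations, case counts by status)."""
        self._cache[query] = result

    def run(self, query):
        if query in self._cache:          # fast path: < 50 ms target
            return self._cache[query]
        result = self._compute(query)     # slow path: < 2 s target
        self._cache[query] = result       # ad-hoc results become warm
        return result
```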
Column-oriented storage matters. Process mining queries scan specific columns (activity, timestamp) across all events. Row-oriented storage is 10-20x slower for these patterns.
Partition by case, not time. Most queries filter by case attributes or need complete case histories. Partitioning by case ID hash keeps all events for a case co-located.
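The partitioning rule reduces to a stable hash of the case ID. A sketch, with an assumed partition count; we use `crc32` here because, unlike Python's built-in `hash()`, it is stable across processes and restarts.

```python
import zlib

NUM_PARTITIONS = 64  # illustrative; the real count depends on cluster size

def partition_for(case_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a case ID to a partition. Every event for a case lands in
    the same partition, so complete case histories stay co-located."""
    return zlib.crc32(case_id.encode("utf-8")) % num_partitions
```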
Pre-compute aggressively. The process model itself changes rarely (only when new data arrives). Compute it once, cache it, serve it fast.
Show fast results first. Users prefer seeing a rough model in 3 seconds over waiting 15 seconds for a perfect one. Progressive refinement beats blocking computation.