How we built the infrastructure to handle enterprise-scale event logs with sub-second query times.
Process mining is computationally expensive. Discovering process models from event logs, checking conformance, computing variant frequencies — these operations touch every event in the dataset. At enterprise scale, that means millions of events.
Here's how we built Sancalana's infrastructure to handle it.
Our system splits into four layers: ingestion, storage, computation, and serving.
System Architecture
==========================================
Event Sources
(CSV, DB, Warehouse, API)
|
v
+------------------+
| Ingestion | Parse, validate, normalize
| Pipeline | Schema detection
+------------------+ Deduplication
|
v
+------------------+
| Event Store | Column-oriented storage
| (Partitioned) | Partitioned by case ID
+------------------+ Compressed, indexed
|
v
+------------------+
| Compute Layer | Process discovery algorithms
| (On-demand) | Conformance checking
+------------------+ Variant clustering
|
v
+------------------+
| Serving Layer | Pre-computed models
| (Cached) | Real-time queries
+------------------+ Dashboard API
Event logs come in messy. Different formats, inconsistent timestamps, duplicate events, missing case IDs. Our ingestion pipeline handles all of this:
Ingestion Pipeline
==========================================
Raw Upload (CSV/XES/Parquet/API)
|
v
[Format Detection]
|
v
[Schema Mapping] case_id, activity, timestamp
| + optional attributes
v
[Validation] Missing fields? Invalid dates?
| Out-of-order events?
v
[Normalization] UTC timestamps, consistent
| activity names, dedup
v
[Partitioning] Partition by case_id hash
| Sort by timestamp within case
v
[Event Store] Ready for computation
This pipeline processes ~50,000 events per second on a single node. For large uploads (100M+ events), we shard across multiple nodes.
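The normalization and dedup steps above can be sketched roughly as follows. This is an illustrative stand-in, not Sancalana's actual ingestion code; the record layout and function name are assumptions.

```python
from datetime import datetime, timezone

def normalize_events(raw_events):
    """Normalize raw event dicts: UTC timestamps, consistent activity
    names, exact-duplicate removal, and per-case timestamp ordering.
    (Illustrative sketch, not the production pipeline.)"""
    seen = set()
    out = []
    for ev in raw_events:
        ts = datetime.fromisoformat(ev["timestamp"])
        # Naive timestamps are assumed UTC; aware ones are converted.
        ts = ts.replace(tzinfo=timezone.utc) if ts.tzinfo is None else ts.astimezone(timezone.utc)
        key = (ev["case_id"], ev["activity"].strip().lower(), ts.isoformat())
        if key in seen:  # dedup on (case, activity, timestamp)
            continue
        seen.add(key)
        out.append({"case_id": key[0], "activity": key[1], "timestamp": ts})
    # Sort by timestamp within each case so downstream discovery
    # algorithms can read each trace in order.
    out.sort(key=lambda e: (e["case_id"], e["timestamp"]))
    return out
```

The dedup key here is the full (case, activity, timestamp) triple; a real pipeline would likely also fold in optional attributes.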
The Alpha Miner, Heuristic Miner, Inductive Miner, and Fuzzy Miner all have different computational profiles. We benchmarked them:
Algorithm Performance (10M events, 50K cases)
==========================================
Alpha Miner #### 3.2s
Heuristic Miner ######## 7.8s
Inductive Miner ############## 14.1s
Fuzzy Miner ###### 5.9s
0s 5s 10s 15s 20s
The Inductive Miner produces the cleanest models but is the slowest. We run it in the background and show Heuristic Miner results first, then upgrade the model when the Inductive Miner finishes.
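The fast-then-upgrade pattern can be sketched like this: serve the quick model immediately, run the slower miner in a background thread, and swap the result in when it lands. The class and the miner callables are illustrative, not our real API.

```python
import threading

class ProgressiveModel:
    """Serve a fast, rough model immediately; swap in a slower,
    cleaner model when background discovery finishes.
    (Sketch; miner callables are stand-ins.)"""

    def __init__(self, fast_miner, slow_miner, log):
        self._lock = threading.Lock()
        self._model = fast_miner(log)   # e.g. Heuristic Miner: seconds
        self._source = "fast"
        self._thread = threading.Thread(
            target=self._refine, args=(slow_miner, log), daemon=True)
        self._thread.start()

    def _refine(self, slow_miner, log):
        model = slow_miner(log)         # e.g. Inductive Miner: slower, cleaner
        with self._lock:
            self._model, self._source = model, "slow"

    def current(self):
        """Return the best model available right now."""
        with self._lock:
            return self._model, self._source
```

The lock keeps the swap atomic so a dashboard polling `current()` never sees a half-replaced model.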
For interactive exploration (filtering, drilling down, variant analysis), sub-second response is non-negotiable. Our approach:
Query Path
==========================================
User interaction (filter, click, drill-down)
|
v
[Query Planner]
|
+----+----+
| |
v v
[Cache] [Compute]
< 50ms < 2s
| |
+----+----+
|
v
[Result Merge + Render]
We pre-compute the most common query patterns (variant frequencies, bottleneck durations, case counts by status) and cache them. For ad-hoc queries, we compute on demand but still target < 2 seconds.
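The cache-first dispatch is roughly the following shape. This is a minimal sketch assuming an in-memory dict cache and a pluggable compute function; the real serving layer handles invalidation when new events arrive.

```python
class QueryServer:
    """Cache-first query dispatch: pre-computed results answer common
    patterns on the fast path; everything else falls through to
    on-demand compute and is cached for next time. (Illustrative.)"""

    def __init__(self, compute_fn):
        self._cache = {}
        self._compute = compute_fn

    def precompute(self, query, result):
        """Warm the cache with a common pattern (variant frequencies,
        bottleneck durations, case counts by status)."""
        self._cache[query] = result

    def run(self, query):
        if query in self._cache:          # fast path: < 50 ms target
            return self._cache[query]
        result = self._compute(query)     # slow path: < 2 s target
        self._cache[query] = result       # ad-hoc results become warm
        return result
```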
Column-oriented storage matters. Process mining queries scan specific columns (activity, timestamp) across all events. Row-oriented storage is 10-20x slower for these patterns.
Partition by case, not time. Most queries filter by case attributes or need complete case histories. Partitioning by case ID hash keeps all events for a case co-located.
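The partitioning rule reduces to a stable hash of the case ID. A sketch, with an assumed partition count; we use `crc32` here because, unlike Python's built-in `hash()`, it is stable across processes and restarts.

```python
import zlib

NUM_PARTITIONS = 64  # illustrative; the real count depends on cluster size

def partition_for(case_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a case ID to a partition. Every event for a case lands in
    the same partition, so complete case histories stay co-located."""
    return zlib.crc32(case_id.encode("utf-8")) % num_partitions
```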
Pre-compute aggressively. The process model itself changes rarely (only when new data arrives). Compute it once, cache it, serve it fast.
Show fast results first. Users prefer seeing a rough model in 3 seconds over waiting 15 seconds for a perfect one. Progressive refinement beats blocking computation.