Real-Time Ad Click Aggregation: Flink, Kafka/Kinesis, and Lambda vs Kappa Architecture
Building the ads vertical at Bolt Food — where sponsored listings need real-time performance tracking — made this problem very concrete for me. When designing large-scale advertising systems, one of the most common problems is:
How do we aggregate billions of ad clicks in real time, accurately, and reliably?
This post walks through:
- The system design of a real-time ad click aggregator
- Why Kafka/Kinesis + Flink is a strong choice
- What watermarks are (in simple terms)
- Lambda vs Kappa architecture
- How to discuss this confidently in system design interviews
1. The Problem
We need to:
- Ingest millions of ad click events per second
- Aggregate clicks per ad (per minute, hour, day)
- Handle late and out-of-order events
- Ensure billing accuracy (no double counting)
- Provide real-time dashboards
- Store historical analytics data
Each click event looks like:
```json
{
  "ad_id": "123",
  "user_id": "abc",
  "timestamp": "2026-03-07T12:00:45Z"
}
```
2. High-Level Architecture
```mermaid
flowchart LR
    Users([Users]) --> CC[Click Collector API]
    CC --> Kafka[[Kafka / Kinesis]]
    Kafka --> Flink[Apache Flink]
    Flink --> Redis[(Redis / DynamoDB)]
    Flink --> S3[(S3 Parquet)]
    Redis --> Dashboard[Dashboard / API]
```
Components
1. Click Collector
- Validates incoming click events
- Assigns unique click IDs
- Pushes events to Kafka/Kinesis
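As a sketch, the collector's job is small: validate, stamp a unique ID, and produce keyed by `ad_id`. Here `collect_click` and `InMemoryProducer` are hypothetical names standing in for your API handler and a Kafka/Kinesis client:

```python
import uuid

REQUIRED_FIELDS = {"ad_id", "user_id", "timestamp"}

def collect_click(event, producer):
    """Validate a click event, stamp it with a unique ID, and enqueue it.

    `producer` is any object with a send(partition_key, event) method --
    a toy stand-in for a Kafka/Kinesis client.
    """
    # Reject events missing required fields.
    if not REQUIRED_FIELDS.issubset(event):
        return None
    # Assign a unique click ID for downstream deduplication.
    event = {**event, "click_id": uuid.uuid4().hex}
    # Partition by ad_id so one ad's clicks land on one partition.
    producer.send(partition_key=event["ad_id"], event=event)
    return event["click_id"]

class InMemoryProducer:
    """Toy stand-in for a partitioned event stream."""
    def __init__(self):
        self.partitions = {}

    def send(self, partition_key, event):
        self.partitions.setdefault(partition_key, []).append(event)
```

The `click_id` assigned here is what makes deduplication possible everywhere downstream.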
2. Kafka / Kinesis (Event Stream)
- Durable, scalable message queue
- Partitions by `ad_id`
- Enables replay of historical events
3. Apache Flink (Stream Processor)
- Consumes click events
- Uses event-time windows
- Maintains state per `ad_id`
- Emits aggregated counts
4. Storage
- Redis / DynamoDB — real-time dashboard
- S3 (Parquet) — long-term analytics
3. Why Kafka/Kinesis + Flink?
Horizontal Scalability
- Kafka partitions scale ingestion.
- Flink parallel operators scale processing.
- Partitioning by `ad_id` ensures even distribution.
Exactly-Once Processing
For billing, accuracy matters.
Flink:
- Maintains state (via RocksDB)
- Uses checkpointing
- Recovers without double counting
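To make the exactly-once idea concrete, here is a toy simulation (plain Python, not the Flink API) of why snapshotting the aggregation state and the consumed offset *together* prevents double counting after a crash:

```python
import copy

def run_with_checkpoints(log, checkpoint, crash_at=None):
    """Consume `log` (a list of ad_ids), counting clicks per ad.

    `checkpoint` is a dict holding the last consistent snapshot:
    {"offset": int, "counts": dict}. State and offset are snapshotted
    atomically, which is what makes a restart exactly-once.
    """
    offset = checkpoint["offset"]
    counts = copy.deepcopy(checkpoint["counts"])
    for i in range(offset, len(log)):
        if crash_at is not None and i == crash_at:
            # Crash before this event; work since the last
            # checkpoint is lost and will be replayed exactly once.
            raise RuntimeError("simulated crash")
        counts[log[i]] = counts.get(log[i], 0) + 1
        # Checkpoint every 2 events (Flink does this on a timer).
        if (i + 1) % 2 == 0:
            checkpoint["offset"] = i + 1
            checkpoint["counts"] = copy.deepcopy(counts)
    checkpoint["offset"] = len(log)
    checkpoint["counts"] = copy.deepcopy(counts)
    return counts
```

After a crash, restarting from the checkpoint replays only the uncheckpointed events, so the final counts match a single clean pass.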
Handles Late Events
Mobile clicks often arrive late (offline devices, retries, flaky networks).
Flink uses:
- Event-time processing
- Watermarks
- Allowed lateness
Which brings us to an important concept.
4. What Are Watermarks? (Explained Simply)
Imagine you’re counting clicks for 12:00–12:01.
Some clicks arrive late.
You don’t want to:
- Close too early (miss clicks)
- Wait forever
A watermark is Flink saying:
“I believe I’ve seen all events up to time X.”
Once the watermark passes 12:01:
- Flink closes that window
- Emits the final count
Late events beyond allowed lateness:
- Are dropped
- Or handled separately
This balances accuracy and latency.
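The mechanics can be illustrated without Flink at all. This is a minimal sketch assuming tumbling windows, a watermark that trails the maximum seen timestamp by a fixed out-of-orderness bound, and a fixed allowed lateness; the function and parameter names are made up:

```python
def aggregate_with_watermarks(events, window=60, max_ooo=10,
                              allowed_lateness=5):
    """Count events per [start, start + window) event-time window.

    `events` is a list of (ad_id, event_time_seconds) in *arrival* order.
    The watermark trails the max seen timestamp by `max_ooo` seconds; a
    window is finalized once the watermark passes its end plus
    `allowed_lateness`. Events later than that are dropped.
    """
    open_windows = {}   # window_start -> {ad_id: count}
    finalized = {}      # window_start -> {ad_id: count}
    dropped = []
    watermark = float("-inf")
    for ad_id, ts in events:
        watermark = max(watermark, ts - max_ooo)
        start = (ts // window) * window
        if start + window + allowed_lateness <= watermark:
            dropped.append((ad_id, ts))  # beyond allowed lateness
            continue
        w = open_windows.setdefault(start, {})
        w[ad_id] = w.get(ad_id, 0) + 1
        # Finalize any window the watermark has fully passed.
        for s in [s for s in open_windows
                  if s + window + allowed_lateness <= watermark]:
            finalized[s] = open_windows.pop(s)
    finalized.update(open_windows)  # flush remaining at end of stream
    return finalized, dropped
```

Note how an event at t=50 arriving after one at t=61 still lands in the 0–60 window, while a much later straggler is dropped: that is the accuracy/latency trade the watermark encodes.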
5. Deduplication Strategy
Click systems must prevent fraud and duplicates.
Approaches
- Assign a unique `click_id`
- Use Flink keyed state to track seen IDs
- Use TTL to expire old IDs
- Use idempotent writes in the sink
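A minimal sketch of the keyed-state-plus-TTL idea, in pure Python with an injectable clock for testing (`ClickDeduper` is a made-up name, not a Flink class):

```python
import time

class ClickDeduper:
    """Tracks seen click_ids with a TTL, mimicking Flink keyed state
    with state TTL enabled. `clock` is injectable for testing."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}  # click_id -> first-seen time

    def is_duplicate(self, click_id):
        now = self.clock()
        # Expire old entries so state stays bounded.
        self.seen = {cid: t for cid, t in self.seen.items()
                     if now - t < self.ttl}
        if click_id in self.seen:
            return True
        self.seen[click_id] = now
        return False
```

The TTL is the practical compromise: duplicates typically arrive within seconds to minutes, so you trade unbounded state for a bounded dedup horizon.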
In interviews, say:
“We ensure idempotency at both the stream processing layer and the storage layer.”
6. Lambda vs Kappa Architecture
Now let’s zoom out architecturally.
Lambda Architecture
Two pipelines:
- Batch Layer (Hadoop/Spark)
- Speed Layer (Flink/Storm)
- Serving Layer (merges both)
```mermaid
flowchart TD
    Source([Event Source]) --> Speed[Speed Layer - Flink]
    Source --> Batch[Batch Layer - Spark]
    Speed --> Serving[Serving Layer]
    Batch --> Serving
    Serving --> Query([Queries])
```
Flow:
- Real-time updates from Speed Layer
- Nightly batch recomputation fixes inaccuracies
| Pros | Cons |
|---|---|
| High accuracy | Two codebases |
| Good for massive historical datasets | Operational complexity |
Kappa Architecture
Single stream pipeline. Everything flows through streaming.
Need to recompute? Replay from Kafka.
```mermaid
flowchart LR
    Source([Event Source]) --> Kafka[[Kafka]]
    Kafka --> Flink[Flink]
    Flink --> Store[(Storage)]
    Store --> Query([Queries])
    Kafka -. replay .-> Flink
```
| Pros | Cons |
|---|---|
| Much simpler | Heavy reprocessing can be expensive |
| One processing path | |
| Easier maintenance | |
7. Which One Would I Choose?
For a modern ad click system:
I would choose Kappa architecture.
Because:
- Flink supports event-time + exactly-once
- Kafka allows replay
- Operational complexity is lower
- Real-time is first-class
Lambda is useful when:
- Historical recomputation is massive
- Legacy batch systems already exist
8. Handling Failures
What if Flink crashes?
- State is checkpointed
- Job restarts from last checkpoint
- Kafka offsets are restored
What if Redis slows down?
- Flink applies backpressure
- Autoscale Redis
- Buffer and retry
In interviews:
“We rely on Flink’s checkpointing and Kafka durability to ensure fault tolerance and no data loss.”
9. Multi-Region Design
For global ads:
- Regional Kafka clusters
- Regional Flink jobs
- Global aggregation pipeline
- S3 as central analytics store
This reduces latency and improves resilience. For a deeper look at geo-partitioning and region-based ownership strategies, see the Planet-Scale Auction System post.
Final Thoughts
The real insight in ad click aggregation is that billing accuracy is a streaming problem. You can’t batch your way to real-time correctness when advertisers are paying per click and dashboards need to reflect reality within seconds.
The key architectural decision is Lambda vs Kappa. For most modern systems, Kappa wins — Flink’s event-time processing and exactly-once guarantees mean you no longer need a separate batch layer to “fix” the real-time path. One pipeline, one source of truth, and Kafka replay for reprocessing when you need it.
If you take one thing away: watermarks and exactly-once semantics aren’t academic concepts — they’re the difference between an aggregation system that works and one that silently loses money.