Real-Time Ad Click Aggregation: Flink, Kafka/Kinesis, and Lambda vs Kappa Architecture
Building the ads vertical at Bolt Food — where sponsored listings need real-time performance tracking — made this problem very concrete for me. When designing large-scale advertising systems, one of the most common problems is:
How do we aggregate billions of ad clicks in real time, accurately, and reliably?
This post walks through:
- The system design of a real-time ad click aggregator
- Why Kafka/Kinesis + Flink is a strong choice
- What watermarks are (in simple terms)
- Lambda vs Kappa architecture
- How to discuss this confidently in system design interviews
1. The Problem
We need to:
- Ingest millions of ad click events per second
- Aggregate clicks per ad (per minute, hour, day)
- Handle late and out-of-order events
- Ensure billing accuracy (no double counting)
- Provide real-time dashboards
- Store historical analytics data
Each click event looks like:
```json
{
  "ad_id": "123",
  "user_id": "abc",
  "timestamp": "2026-03-07T12:00:45Z"
}
```
2. High-Level Architecture
```mermaid
flowchart LR
    Users([Users]) --> CC[Click Collector API]
    CC --> Kafka[[Kafka / Kinesis]]
    Kafka --> Flink[Apache Flink]
    Flink --> Redis[(Redis / DynamoDB)]
    Flink --> S3[(S3 Parquet)]
    Redis --> Dashboard[Dashboard / API]
```
Components
1. Click Collector
- Validates incoming click events
- Assigns unique click IDs
- Pushes events to Kafka/Kinesis
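As a sketch, the collector's job is small: validate, stamp a unique ID, and produce keyed by `ad_id`. Here `collect_click` and `InMemoryProducer` are hypothetical names standing in for your API handler and a Kafka/Kinesis client:

```python
import uuid

REQUIRED_FIELDS = {"ad_id", "user_id", "timestamp"}

def collect_click(event, producer):
    """Validate a click event, stamp it with a unique ID, and enqueue it.

    `producer` is any object with a send(partition_key, event) method --
    a toy stand-in for a Kafka/Kinesis client.
    """
    # Reject events missing required fields.
    if not REQUIRED_FIELDS.issubset(event):
        return None
    # Assign a unique click ID for downstream deduplication.
    event = {**event, "click_id": uuid.uuid4().hex}
    # Partition by ad_id so one ad's clicks land on one partition.
    producer.send(partition_key=event["ad_id"], event=event)
    return event["click_id"]

class InMemoryProducer:
    """Toy stand-in for a partitioned event stream."""
    def __init__(self):
        self.partitions = {}

    def send(self, partition_key, event):
        self.partitions.setdefault(partition_key, []).append(event)
```

The `click_id` assigned here is what makes deduplication possible everywhere downstream.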
2. Kafka / Kinesis (Event Stream)
- Durable, scalable message queue
- Partitions by `ad_id`
- Enables replay of historical events
3. Apache Flink (Stream Processor)
- Consumes click events
- Uses event-time windows
- Maintains state per `ad_id`
- Emits aggregated counts
4. Storage
- Redis / DynamoDB — real-time dashboard
- S3 (Parquet) — long-term analytics
3. Why Kafka/Kinesis + Flink?
Horizontal Scalability
- Kafka partitions scale ingestion.
- Flink parallel operators scale processing.
- Partitioning by `ad_id` ensures even distribution.
Exactly-Once Processing
For billing, accuracy matters.
Flink:
- Maintains state (via RocksDB)
- Uses checkpointing
- Recovers without double counting
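To make the exactly-once idea concrete, here is a toy simulation (plain Python, not the Flink API) of why snapshotting the aggregation state and the consumed offset *together* prevents double counting after a crash:

```python
import copy

def run_with_checkpoints(log, checkpoint, crash_at=None):
    """Consume `log` (a list of ad_ids), counting clicks per ad.

    `checkpoint` is a dict holding the last consistent snapshot:
    {"offset": int, "counts": dict}. State and offset are snapshotted
    atomically, which is what makes a restart exactly-once.
    """
    offset = checkpoint["offset"]
    counts = copy.deepcopy(checkpoint["counts"])
    for i in range(offset, len(log)):
        if crash_at is not None and i == crash_at:
            # Crash before this event; work since the last
            # checkpoint is lost and will be replayed exactly once.
            raise RuntimeError("simulated crash")
        counts[log[i]] = counts.get(log[i], 0) + 1
        # Checkpoint every 2 events (Flink does this on a timer).
        if (i + 1) % 2 == 0:
            checkpoint["offset"] = i + 1
            checkpoint["counts"] = copy.deepcopy(counts)
    checkpoint["offset"] = len(log)
    checkpoint["counts"] = copy.deepcopy(counts)
    return counts
```

After a crash, restarting from the checkpoint replays only the uncheckpointed events, so the final counts match a single clean pass.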
Handles Late Events
Mobile clicks often arrive late (offline devices, retries, flaky networks).
Flink uses:
- Event-time processing
- Watermarks
- Allowed lateness
Which brings us to an important concept.
4. What Are Watermarks? (Explained Simply)
Imagine you’re counting clicks for 12:00–12:01.
Some clicks arrive late.
You don’t want to:
- Close too early (miss clicks)
- Wait forever
A watermark is Flink saying:
“I believe I’ve seen all events up to time X.”
Once the watermark passes 12:01:
- Flink closes that window
- Emits the final count
Late events beyond allowed lateness:
- Are dropped
- Or handled separately
This balances accuracy and latency.
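The mechanics can be illustrated without Flink at all. This is a minimal sketch assuming tumbling windows, a watermark that trails the maximum seen timestamp by a fixed out-of-orderness bound, and a fixed allowed lateness; the function and parameter names are made up:

```python
def aggregate_with_watermarks(events, window=60, max_ooo=10,
                              allowed_lateness=5):
    """Count events per [start, start + window) event-time window.

    `events` is a list of (ad_id, event_time_seconds) in *arrival* order.
    The watermark trails the max seen timestamp by `max_ooo` seconds; a
    window is finalized once the watermark passes its end plus
    `allowed_lateness`. Events later than that are dropped.
    """
    open_windows = {}   # window_start -> {ad_id: count}
    finalized = {}      # window_start -> {ad_id: count}
    dropped = []
    watermark = float("-inf")
    for ad_id, ts in events:
        watermark = max(watermark, ts - max_ooo)
        start = (ts // window) * window
        if start + window + allowed_lateness <= watermark:
            dropped.append((ad_id, ts))  # beyond allowed lateness
            continue
        w = open_windows.setdefault(start, {})
        w[ad_id] = w.get(ad_id, 0) + 1
        # Finalize any window the watermark has fully passed.
        for s in [s for s in open_windows
                  if s + window + allowed_lateness <= watermark]:
            finalized[s] = open_windows.pop(s)
    finalized.update(open_windows)  # flush remaining at end of stream
    return finalized, dropped
```

Note how an event at t=50 arriving after one at t=61 still lands in the 0–60 window, while a much later straggler is dropped: that is the accuracy/latency trade the watermark encodes.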
5. Deduplication Strategy
Click systems must prevent fraud and duplicates.
Approaches
- Assign a unique `click_id`
- Use Flink keyed state to track seen IDs
- Use TTL to expire old IDs
- Use idempotent writes in the sink
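A minimal sketch of the keyed-state-plus-TTL idea, in pure Python with an injectable clock for testing (`ClickDeduper` is a made-up name, not a Flink class):

```python
import time

class ClickDeduper:
    """Tracks seen click_ids with a TTL, mimicking Flink keyed state
    with state TTL enabled. `clock` is injectable for testing."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}  # click_id -> first-seen time

    def is_duplicate(self, click_id):
        now = self.clock()
        # Expire old entries so state stays bounded.
        self.seen = {cid: t for cid, t in self.seen.items()
                     if now - t < self.ttl}
        if click_id in self.seen:
            return True
        self.seen[click_id] = now
        return False
```

The TTL is the practical compromise: duplicates typically arrive within seconds to minutes, so you trade unbounded state for a bounded dedup horizon.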
In interviews, say:
“We ensure idempotency at both the stream processing layer and the storage layer.”
6. Lambda vs Kappa Architecture
Now let’s zoom out architecturally.
Lambda Architecture
Two pipelines:
- Batch Layer (Hadoop/Spark)
- Speed Layer (Flink/Storm)
- Serving Layer (merges both)
```mermaid
flowchart TD
    Source([Event Source]) --> Speed[Speed Layer - Flink]
    Source --> Batch[Batch Layer - Spark]
    Speed --> Serving[Serving Layer]
    Batch --> Serving
    Serving --> Query([Queries])
```
Flow:
- Real-time updates from Speed Layer
- Nightly batch recomputation fixes inaccuracies
| Pros | Cons |
|---|---|
| High accuracy | Two codebases |
| Good for massive historical datasets | Operational complexity |
Kappa Architecture
Single stream pipeline. Everything flows through streaming.
Need to recompute? Replay from Kafka.
```mermaid
flowchart LR
    Source([Event Source]) --> Kafka[[Kafka]]
    Kafka --> Flink[Flink]
    Flink --> Store[(Storage)]
    Store --> Query([Queries])
    Kafka -. replay .-> Flink
```
| Pros | Cons |
|---|---|
| Much simpler | Heavy reprocessing can be expensive |
| One processing path | |
| Easier maintenance | |
7. Which One Would I Choose?
For a modern ad click system:
I would choose Kappa architecture.
Because:
- Flink supports event-time + exactly-once
- Kafka allows replay
- Operational complexity is lower
- Real-time is first-class
Lambda is useful when:
- Historical recomputation is massive
- Legacy batch systems already exist
8. Handling Failures
What if Flink crashes?
- State is checkpointed
- Job restarts from last checkpoint
- Kafka offsets are restored
What if Redis slows down?
- Flink applies backpressure
- Autoscale Redis
- Buffer and retry
In interviews:
“We rely on Flink’s checkpointing and Kafka durability to ensure fault tolerance and no data loss.”
9. Multi-Region Design
For global ads:
- Regional Kafka clusters
- Regional Flink jobs
- Global aggregation pipeline
- S3 as central analytics store
This reduces latency and improves resilience. For a deeper look at geo-partitioning and region-based ownership strategies, see the Planet-Scale Auction System post.
Final Thoughts
The real insight in ad click aggregation is that billing accuracy is a streaming problem. You can’t batch your way to real-time correctness when advertisers are paying per click and dashboards need to reflect reality within seconds.
The key architectural decision is Lambda vs Kappa. For most modern systems, Kappa wins — Flink’s event-time processing and exactly-once guarantees mean you no longer need a separate batch layer to “fix” the real-time path. One pipeline, one source of truth, and Kafka replay for reprocessing when you need it.
If you take one thing away: watermarks and exactly-once semantics aren’t academic concepts — they’re the difference between an aggregation system that works and one that silently loses money.