System Design Interview: Event-Driven Task Execution Engine – Q&A
Tech Stack Decisions
Q: What language and framework would you choose to build the engine and why?
A: Java with Spring Boot, due to its strong concurrency support, mature ecosystem, and
enterprise-grade tooling. Spring also integrates well with RabbitMQ (Spring AMQP) and Kafka
(Spring for Apache Kafka).
Q: Which message broker would you choose?
A: Kafka for high-throughput, durable messaging with per-partition ordering. RabbitMQ if lower
latency and flexible routing (exchanges, routing keys) matter more. Kafka also suits event
sourcing and stream processing use cases.
Q: How would you store task state?
A: I'd use a combination of PostgreSQL (for persistent state and history) and Redis (for fast
in-flight task state and locking). Task IDs, statuses, retry counts, and timestamps would be
tracked.
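A minimal sketch of the persisted row, assuming a Spring Data JPA entity; the entity, table, and column names are illustrative, not part of the original design:

```java
import jakarta.persistence.*;
import java.time.Instant;

// Illustrative sketch: one durable row per task in PostgreSQL; the hot in-flight
// status and locks would live in Redis, keyed by taskId.
@Entity
@Table(name = "tasks")
public class TaskRecord {

    @Id
    private String taskId;            // globally unique, doubles as the idempotency key

    @Enumerated(EnumType.STRING)
    private Status status;            // PENDING, RUNNING, SUCCEEDED, FAILED, DEAD_LETTERED

    private int retryCount;           // incremented on each redelivery
    private Instant createdAt;
    private Instant updatedAt;

    public enum Status { PENDING, RUNNING, SUCCEEDED, FAILED, DEAD_LETTERED }

    protected TaskRecord() { }        // required by JPA

    public TaskRecord(String taskId) {
        this.taskId = taskId;
        this.status = Status.PENDING;
        this.createdAt = Instant.now();
        this.updatedAt = this.createdAt;
    }

    // getters and setters omitted for brevity
}
```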
Q: How would you handle idempotency?
A: Every task has a unique ID. All task handlers check if the task ID was already processed
before executing. Idempotency keys can be tracked in Redis with short TTLs.
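A minimal sketch of that check, assuming Spring Data Redis; the key prefix, TTL, and helper class name (IdempotencyGuard) are assumptions:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

// Illustrative sketch: an atomic SET-if-absent on the task ID acts as the idempotency gate.
public class IdempotencyGuard {

    private static final Duration TTL = Duration.ofHours(24);   // assumed; size to the redelivery window

    private final StringRedisTemplate redis;

    public IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /** Returns true if this task ID has not been processed yet and was claimed atomically. */
    public boolean tryClaim(String taskId) {
        Boolean claimed = redis.opsForValue()
                .setIfAbsent("task:processed:" + taskId, "1", TTL);
        return Boolean.TRUE.equals(claimed);
    }
}

// Usage in a handler:
//   if (!guard.tryClaim(task.getId())) return;   // already processed, skip
//   process(task);
```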
Scaling Strategy
Q: How would you scale the task engine?
A: Horizontally — by having stateless worker nodes consuming from partitioned Kafka topics.
Kubernetes can autoscale pods based on consumer lag or CPU.
Q: How would you partition work?
A: Based on task type or customer ID. Kafka guarantees ordering within a partition, and each
partition is assigned to at most one consumer in a group at a time, so per-key ordering is
preserved.
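A sketch of the keying idea with the plain kafka-clients producer: keying by customer ID routes all of that customer's events to the same partition (the broker address and topic name are assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TaskEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42";                        // partition key: same key -> same partition
            String payload = "{\"taskId\":\"t-123\",\"type\":\"EXPORT\"}";
            producer.send(new ProducerRecord<>("task-events", customerId, payload));  // "task-events" is an assumed topic
        }
    }
}
```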
Q: How would you handle load spikes?
A: Let Kafka's log absorb the spike as a buffer and autoscale consumers to drain it. Rate-limit
incoming producers or apply backpressure at the API gateway if needed.
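One way to rate-limit producers at the edge, sketched with Guava's RateLimiter; the class name and the 500 req/s budget are assumptions:

```java
import com.google.common.util.concurrent.RateLimiter;

// Illustrative sketch: shed load at the API layer when the submission rate exceeds the budget.
public class TaskSubmissionThrottle {

    private final RateLimiter limiter = RateLimiter.create(500.0);   // assumed budget: 500 submissions/sec

    /** Returns true if the task may be accepted now; otherwise the caller responds 429 or applies backpressure. */
    public boolean tryAccept() {
        return limiter.tryAcquire();
    }
}
```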
Fault Tolerance
Q: What happens if a worker crashes mid-task?
A: The consumer never commits the offset for that record, so Kafka redelivers it once the
partition is reassigned to another worker. The task handler must be idempotent to make the
retry safe.
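A minimal sketch of that at-least-once consume loop, with auto-commit disabled so the offset is only committed after the handler succeeds (broker address, group ID, and topic name are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TaskWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumed broker address
        props.put("group.id", "task-workers");                   // all stateless workers share this group
        props.put("enable.auto.commit", "false");                // commit only after successful processing
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("task-events"));          // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record.value());                      // must be idempotent: may run more than once
                }
                consumer.commitSync();                           // crash before this line => records are redelivered
            }
        }
    }

    private static void handle(String payload) {
        // task execution goes here
    }
}
```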
Q: How do you handle poisoned messages?
A: Move them to a Dead Letter Queue (DLQ) after N retries. Also, trigger alerts for manual
inspection.
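A sketch of that routing decision, assuming the retry count travels with the task; the retry budget and DLQ topic name are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative sketch: after MAX_RETRIES failures the event is parked on a DLQ topic
// instead of being retried again; an alert would be raised alongside this.
public class PoisonMessageRouter {

    private static final int MAX_RETRIES = 5;                    // assumed retry budget
    private final KafkaProducer<String, String> producer;

    public PoisonMessageRouter(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    /** Returns true if the message was dead-lettered and must not be retried again. */
    public boolean deadLetterIfExhausted(String key, String payload, int retryCount) {
        if (retryCount < MAX_RETRIES) {
            return false;                                        // let the normal retry path handle it
        }
        producer.send(new ProducerRecord<>("task-events.dlq", key, payload));  // assumed DLQ topic name
        return true;
    }
}
```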
Q: What if the broker goes down?
A: Kafka replicates logs across brokers for HA. Clients retry with exponential backoff until
the cluster is back. We monitor ISR (in-sync replicas) to ensure durability.
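Producer-side settings that pair with that replication, as a hedged sketch (the broker address is an assumption; min.insync.replicas is configured on the broker or topic and is not shown here):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                                    // wait for all in-sync replicas
        props.put("enable.idempotence", "true");                     // producer retries don't create duplicates
        props.put("retries", Integer.toString(Integer.MAX_VALUE));   // keep retrying transient failures
        props.put("delivery.timeout.ms", "120000");                  // give up (and surface an error) after 2 minutes
        return new KafkaProducer<>(props);
    }
}
```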
Service-to-Service Communication
Q: Would services communicate via events only, or also via APIs?
A: Events for workflows and state changes, APIs for queries and occasional sync operations.
This hybrid model avoids tight coupling.
Q: What guarantees are required?
A: At-least-once delivery is typical. For sensitive operations, we approximate exactly-once
processing with idempotency keys and deduplication logic on the consumer side.
Q: How do you track which service did what?
A: Add correlation IDs to each event. Use them in logs and traces. Store audit trails of task
execution paths.
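One way to carry the correlation ID is as a Kafka record header, sketched below (the header name and helper class are assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CorrelatedEvents {

    /** Attaches a correlation ID header so downstream consumers can log and trace with the same ID. */
    public static ProducerRecord<String, String> withCorrelationId(String topic, String key, String payload) {
        ProducerRecord<String, String> rec = new ProducerRecord<>(topic, key, payload);
        String correlationId = UUID.randomUUID().toString();     // or propagate the caller's ID if one exists
        rec.headers().add("correlation-id", correlationId.getBytes(StandardCharsets.UTF_8));
        return rec;
    }
}
```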
Observability
Q: How do you trace a task's journey?
A: Use distributed tracing (e.g., OpenTelemetry) and log each stage with a trace ID. You can
follow the trace from task ingestion to execution to completion.
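A minimal tracing sketch with the OpenTelemetry Java API, assuming the SDK is configured elsewhere; the instrumentation name and attribute keys are illustrative:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedTaskHandler {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("task-engine");  // assumed instrumentation name

    public void handle(String taskId) {
        Span span = tracer.spanBuilder("task.execute")
                .setAttribute("task.id", taskId)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // Task execution here; downstream Kafka/HTTP calls made within this scope
            // join the same trace, so the whole journey shares one trace ID.
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```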
Q: What metrics would you monitor?
A: Task throughput, processing latency, success/failure rate, retry count, consumer lag, error
rate per task type.
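A sketch of how a few of these could be recorded with Micrometer (metric and tag names are assumptions):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class TaskMetrics {

    private final Counter succeeded;
    private final Counter failed;
    private final Timer processingTime;

    public TaskMetrics(MeterRegistry registry) {
        this.succeeded = Counter.builder("tasks.completed").tag("outcome", "success").register(registry);
        this.failed = Counter.builder("tasks.completed").tag("outcome", "failure").register(registry);
        this.processingTime = Timer.builder("tasks.processing.time").register(registry);
    }

    /** Times the task and counts its outcome; failure rate = failure / (success + failure). */
    public void record(Runnable task) {
        processingTime.record(() -> {
            try {
                task.run();
                succeeded.increment();
            } catch (RuntimeException e) {
                failed.increment();
                throw e;
            }
        });
    }
}
```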
Q: How would you alert?
A: Set SLOs and thresholds. Alert if latency spikes, retry count is too high, or DLQ usage
increases.
Scenario-Based Questions (Experience)
Q: What did you do when RabbitMQ queues were overloaded at Zarya?
A: Identified high-volume publishers and split traffic across queues. Added backpressure logic.
Also prioritized processing using priority queues. Long term, suggested moving to Kafka.
Q: What would you do differently today?
A: Add monitoring from the start. Design for flow control and retry visibility. Implement
proper DLQ processing and backoff strategies.
Final Questions to Expect
- What trade-offs did you make in your design?
- How would you make this multi-region?
- What’s the most fragile part of your design?
- How would you add support for real-time SLAs?
- What’s the impact of out-of-order delivery?