System Design Interview: Event-Driven Task Execution Engine – Q&A

Tech Stack Decisions

Q: What language and framework would you choose to build the engine and why?
A: Java with Spring Boot, for its strong concurrency support, mature ecosystem, and
enterprise-grade tooling. Spring also integrates well with both RabbitMQ and Kafka.

Q: Which message broker would you choose?
A: Kafka for high-throughput, ordered, and durable messaging. RabbitMQ if lower latency and
routing flexibility are more important. Kafka suits event sourcing and stream processing use
cases.

Q: How would you store task state?
A: I'd use a combination of PostgreSQL (for persistent state and history) and Redis (for fast
in-flight task state and locking). Task IDs, statuses, retry counts, and timestamps would be
tracked.
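
A minimal sketch of what that PostgreSQL row could look like as a JPA entity. The class name, field names, and status values here are illustrative, not a fixed schema:

```java
import java.time.Instant;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.Id;

// Illustrative task-state row; the column layout is an assumption, not a spec.
@Entity
public class TaskRecord {
    public enum Status { PENDING, RUNNING, SUCCEEDED, FAILED, DEAD_LETTERED }

    @Id
    private String taskId;        // globally unique task ID, also the idempotency key

    @Enumerated(EnumType.STRING)
    private Status status;

    private int retryCount;       // bumped on every redelivery
    private Instant createdAt;
    private Instant updatedAt;
}
```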

Q: How would you handle idempotency?
A: Every task has a unique ID. All task handlers check if the task ID was already processed
before executing. Idempotency keys can be tracked in Redis with short TTLs.
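
A sketch of the Redis side of that check, assuming Spring Data Redis; the key prefix and TTL are illustrative. SET with NX is atomic, so only one worker ever wins the claim for a given task ID:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

public class IdempotencyGuard {
    private final StringRedisTemplate redis;

    public IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /** Returns true only for the first caller to claim this task ID. */
    public boolean tryClaim(String taskId) {
        // SET key value NX EX <ttl>: an atomic "insert if absent" in Redis.
        Boolean firstClaim = redis.opsForValue()
                .setIfAbsent("idempotency:" + taskId, "1", Duration.ofHours(1));
        return Boolean.TRUE.equals(firstClaim);
    }
}
```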

Scaling Strategy

Q: How would you scale the task engine?
A: Horizontally: stateless worker nodes consume from partitioned Kafka topics, and
Kubernetes autoscales the pods based on consumer lag or CPU.

Q: How would you partition work?
A: By task type or customer ID. Kafka guarantees ordering within a partition, and each
partition is consumed by at most one consumer in a group at a time.
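
As a sketch, keyed publishing with the plain Kafka producer client; the "tasks" topic name is a placeholder:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TaskPublisher {
    private final KafkaProducer<String, String> producer;

    public TaskPublisher(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Keying by customer ID hashes every task for that customer to the same
    // partition, which is what gives per-customer ordering.
    public void publish(String customerId, String taskJson) {
        producer.send(new ProducerRecord<>("tasks", customerId, taskJson));
    }
}
```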

Q: How would you handle load spikes?
A: Let the Kafka log absorb the spike while consumers scale out. If that is not enough,
rate-limit producers or apply backpressure at the API gateway.
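
A toy sketch of shedding load at the ingress with a fixed in-flight budget; the limit of 1,000 and the "reject so the caller retries" convention are assumptions, not part of the design above:

```java
import java.util.concurrent.Semaphore;

// Hypothetical ingress guard: callers beyond the in-flight budget are
// rejected immediately, pushing backpressure to the producer or API gateway.
public class IngressGuard {
    private final Semaphore inFlight = new Semaphore(1_000); // tune per capacity

    public boolean tryAccept(Runnable task) {
        if (!inFlight.tryAcquire()) {
            return false; // shed load: tell the caller to retry later (e.g. HTTP 429)
        }
        try {
            task.run();   // runs on the caller's thread in this sketch
            return true;
        } finally {
            inFlight.release();
        }
    }
}
```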

Fault Tolerance

Q: What happens if a worker crashes mid-task?
A: The offset is never committed, so Kafka redelivers the message to another consumer
after the rebalance. The task handler must be idempotent for the retry to be safe.
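
A minimal consumer loop illustrating that ordering, assuming enable.auto.commit is set to false; the topic name and handler are placeholders:

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Worker {
    public void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("tasks"));          // hypothetical topic name
        while (true) {
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                handle(rec.value());                   // must be idempotent: may run twice
            }
            // Committed only after processing succeeds. If the worker dies before
            // this line, the uncommitted records are redelivered after rebalance.
            consumer.commitSync();
        }
    }

    private void handle(String taskJson) { /* execute the task */ }
}
```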

Q: How do you handle poisoned messages?
A: Move them to a Dead Letter Queue (DLQ) after N retries. Also, trigger alerts for manual
inspection.
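
A sketch of that routing logic using a retry-count header; the topic names, header name, and retry limit are all assumptions:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class RetryRouter {
    private static final int MAX_RETRIES = 5;
    private final KafkaProducer<String, String> producer;

    public RetryRouter(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void reroute(ConsumerRecord<String, String> failed) {
        int retries = retryCount(failed) + 1;
        // After N attempts, park the message on the DLQ for manual inspection.
        String topic = retries > MAX_RETRIES ? "tasks.dlq" : "tasks.retry";
        ProducerRecord<String, String> out =
                new ProducerRecord<>(topic, failed.key(), failed.value());
        out.headers().add("retry-count",
                String.valueOf(retries).getBytes(StandardCharsets.UTF_8));
        producer.send(out);
    }

    private int retryCount(ConsumerRecord<String, String> rec) {
        Header h = rec.headers().lastHeader("retry-count");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }
}
```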

Q: What if the broker goes down?
A: Kafka replicates logs across brokers for HA. Clients retry with exponential backoff until
the cluster is back. We monitor ISR (in-sync replicas) to ensure durability.
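
A producer configuration sketch along those lines; the backoff and timeout values are illustrative, and exact retry pacing depends on the client version:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurableProducerConfig {
    // Settings sketch: the values are illustrative, not tuned recommendations.
    public static Properties props() {
        Properties p = new Properties();
        p.put(ProducerConfig.ACKS_CONFIG, "all");                    // ack only once replicated
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");     // no duplicates on retry
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);     // keep retrying...
        p.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "500");        // ...with a pause between tries
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");  // ...for up to 2 minutes total
        return p;
    }
}
```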

Service-to-Service Communication

Q: Would services communicate via events only, or also via APIs?
A: Events for workflows and state changes, APIs for queries and occasional sync operations.
This hybrid model avoids tight coupling.

Q: What guarantees are required?
A: 'At least once' is typical. For sensitive operations, we simulate 'exactly once' with
idempotency keys and deduplication logic.

Q: How do you track which service did what?
A: Add correlation IDs to each event. Use them in logs and traces. Store audit trails of task
execution paths.
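
A small sketch of propagating the ID into the logging context with SLF4J's MDC; the header name and MDC key are assumptions:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;
import org.slf4j.MDC;

// Pull the correlation ID off the incoming event and put it in the logging
// context so every log line for this task carries the same ID.
public class CorrelationContext {
    public static void enter(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("correlation-id");
        String id = h == null ? "unknown" : new String(h.value(), StandardCharsets.UTF_8);
        MDC.put("correlationId", id);
    }

    public static void exit() {
        MDC.remove("correlationId"); // always clear in a finally block
    }
}
```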

Observability

Q: How do you trace a task's journey?
A: Use distributed tracing (e.g., OpenTelemetry) and log each stage with a trace ID. You can
follow the trace from task ingestion to execution to completion.
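
A minimal OpenTelemetry wrapper as a sketch; the tracer name, span name, and attribute key are illustrative choices:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedHandler {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("task-engine");

    public void handle(String taskId, Runnable work) {
        Span span = tracer.spanBuilder("execute-task").startSpan();
        span.setAttribute("task.id", taskId);
        try (Scope ignored = span.makeCurrent()) {   // child spans attach here
            work.run();
        } catch (RuntimeException e) {
            span.setStatus(StatusCode.ERROR);
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```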

Q: What metrics would you monitor?
A: Task throughput, processing latency, success/failure rate, retry count, consumer lag, error
rate per task type.
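
A sketch of two of those metrics with Micrometer; the metric names, tag, and percentiles are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class TaskMetrics {
    private final Counter failures;
    private final Timer latency;

    public TaskMetrics(MeterRegistry registry) {
        this.failures = Counter.builder("tasks.failed")
                .tag("type", "email")          // per-task-type error rate
                .register(registry);
        this.latency = Timer.builder("tasks.latency")
                .publishPercentiles(0.5, 0.99) // p50/p99 processing latency
                .register(registry);
    }

    public void record(Runnable work) {
        try {
            latency.record(work);              // times the run even if it throws
        } catch (RuntimeException e) {
            failures.increment();
            throw e;
        }
    }
}
```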

Q: How would you alert?
A: Set SLOs and thresholds. Alert if latency spikes, retry count is too high, or DLQ usage
increases.

Scenario-Based Questions (Experience)

Q: What did you do when RabbitMQ queues were overloaded at Zarya?
A: Identified high-volume publishers and split traffic across queues. Added backpressure logic.
Also prioritized processing using priority queues. Long term, suggested moving to Kafka.

Q: What would you do differently today?
A: Add monitoring from the start. Design for flow control and retry visibility. Implement
proper DLQ processing and backoff strategies.

Final Questions to Expect

- What trade-offs did you make in your design?
- How would you make this multi-region?
- What’s the most fragile part of your design?
- How would you add support for real-time SLAs?
- What’s the impact of out-of-order delivery?
