Building a Serverless Fan-Out-Fan-In Mechanism
Building scalable and reliable distributed systems often involves ensuring that a single event can trigger tasks across multiple downstream services while maintaining consistency and fault tolerance. For example, consider an e-commerce platform processing customer orders. Each order triggers multiple services like inventory management, payment processing and shipping logistics.
What happens if one of these services fails? This article delves into how you can implement a robust serverless fan-out-fan-in mechanism using AWS Step Functions, Amazon SNS, SQS, and DynamoDB, simplifying complex workflows and ensuring every task is tracked to completion. Let’s explore how these AWS and Amazon services comes together to solve real-world challenges.
What are the Fan-Out and Fan-Out-Fan-In Patterns?
The Fan-Out Pattern refers to sending a single message or event to multiple services or consumers at once. Think of it as broadcasting a message to several listeners, where each listener processes the message independently. For example, in our e-commerce scenario, an order event is sent to inventory, payment, and shipping services simultaneously.
The Fan-Out-Fan-In Pattern extends this by collecting the responses from these services. Once all services complete their tasks, the results are aggregated to determine the final outcome. If all services succeed, the task is complete; if any fail, the process is marked as failed. This pattern ensures efficient task distribution and systematic result tracking, making it ideal for distributed systems. For more details, check out this documentation on the Fan-Out pattern.
My Personal Journey with Implementing the Fan-Out-Fan-In Mechanism
Recently, I was tasked with designing and implementing a feature that required a service to send messages to other services and wait for them to complete handling the message. Only after all services completed their processing successfully could I mark the task as complete. In my case, this was for tenant creation and deletion events.
What started as a seemingly simple messaging requirement evolved into an interesting architectural challenge, leading me to explore and implement the fan-out-fan-in pattern using AWS serverless services. Let me give you an overview of how I built this solution and the components involved.
A Real-World Scenario: Customer Order Processing
Imagine an E-commerce Order Management System that processes customer orders. When an order is placed, the system must notify several services to handle different aspects of order fulfillment:
- An inventory service to reserve items.
- A payment service to process payment.
- A shipping service to arrange delivery.
These are the key requirements:
- All services must succeed. The order is marked as processed only if all services have been completed successfully.
- Failure handling. If any service fails, the entire process halts, and the customer is notified.
- Timeout management. A timeout mechanism addresses unresponsive services.
A Closer Look at the AWS Services
Before diving into the detailed architecture, let me quickly describe these components and help you understand why they were chosen:
- AWS Step Functions: A robust service for orchestrating distributed workflows. With built-in error handling, timeout management, and retries, they simplify the management of complex processes like order fulfillment.
- Amazon SNS (Simple Notification Service): SNS enables the Fan-Out pattern by distributing events to multiple services simultaneously. This makes it easy to notify multiple downstream systems and allows the addition of new services without modifying the core workflow.
- Amazon SQS (Simple Queue Service): SQS supports the Fan-In pattern by aggregating responses from services. Using FIFO queues ensures that messages related to a specific order are processed sequentially, avoiding race conditions and maintaining consistency.
- Amazon DynamoDB: A fast, scalable NoSQL database that tracks the status of all services involved in the workflow. It acts as the single source of truth for task statuses and ensures data retention with its built-in TTL (Time-to-Live) feature.
Together, these services create a robust, reliable, and scalable architecture for processing customer orders:
How AWS Services are Used to Build E-commerce Management System
I used the following services integrated together to build the e-commerce management system.
- AWS Step Functions: Orchestrates the order fulfillment workflow, waits for responses from all services using the callback pattern with a wait-for-task-token mechanism, and manages error handling and retries.
- SNS Topic(Fan-Out): Publishes the order event to multiple downstream services.
- FIFO SQS Queue for Service Responses(Fan-In): Collects responses from services and feeds them into a Lambda function.
- DynamoDB for Status Tracking: Tracks which services have responded and their statuses.
The Order Processing Workflow
Here is the order processing workflow:
- Order Received. The Order Service starts the OrchestrateOrders Step Function. The Step Function invokes the SaveOrderTaskTokenAndSendServices.
- Save OrderTaskToken. The Lambda function saves the Step Function Task Token and the
orderId
in the OrderProcessingStatus DynamoDB Table and than publishes the order event to the OrderEventsTopic (SNS). - Fan-Out. The
OrderEventsTopic
distributes the event to subscribed services (inventory
,payment
,shipping, etc
). - Services Process: Each service processes the order and sends its response (success/failure) to the OrderServiceResponsesQueue (FIFO SQS).
- Resume the Task Wait Mechanism: A Lambda function updates the DynamoDB status when each service responds. If all services succeed, the order is marked as processed. If any service fails or times out, the process is marked as failed.
My Messaging Structure
To implement the fan-out and fan-in mechanisms, my architecture uses a combination of SNS and SQS (FIFO).
Fan-Out with SNS
- The
OrderEventsTopic
distributes order events to multiple subscribed services (Inventory, Payment, Shipping). - Each service processes the event independently and sends its response back to the system.
Fan-In with FIFO SQS
Service responses are sent to the OrderServiceResponsesQueue.fifo
. To ensure responses for each order are processed in order, the MessageGroupId
is set to the order_id
. This allows the FIFO queue to handle responses sequentially for each order, preventing concurrency issues when updating DynamoDB.
For example, when both the Inventory and Payment services send their responses for order_id = 12345
, both messages will have the same MessageGroupId
of 12345
, ensuring they are processed in the correct sequence.
Message Grouping by orderId
Using the MessageGroupId
in SQS ensures that all messages for a specific order are processed sequentially by the same consumer. This avoids race conditions while allowing messages for different orders to be processed concurrently.
Example of a message sent to the FIFO queue:
{
"MessageBody": {
"order_id": "12345",
"service": "inventory",
"status": "SUCCESS"
},
"MessageGroupId": "12345"
}
Tracking Service Responses for Each Order
The DynamoDB table that I used for this solution tracks service responses for each order. This table ensures that no response is missed and allows efficient updates.
Attributes:
orderId
— Partition key representing the unique order identifier.taskToken
The Step Function task token for resumption.servicesStatus
A map of service statuses{ 'inventory': 'success', 'payment': 'fail' }
.TTL
Time-to-live to delete old records automatically.
Defining a Basic AWS Step Function
Currently, the Step Function only defines a basic task to send order events and wait for the service responses.
{
"StartAt": "sendOrderTaskAndWait",
"States": {
"sendOrderTaskAndWait": {
"Type": "Task",
"TimeoutSeconds": 3600,
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"Parameters": {
"FunctionName": "arn:aws:lambda:REGION:ACCOUNT_ID:function:SaveOrderTaskTokenAndSendServicesEvent",
"Payload": {
"taskToken.$": "$$.Task.Token",
"orderId.$": "$.orderId"
}
},
"End": true
}
}
}
Publishing the Order Event
A lambda function called SaveOrderTaskTokenAndSendServicesEvent, saves the Step Function Task Token in DynamoDB and publishes the order event to SNS for downstream services to process.
import boto3
import json
import time
dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')
TABLE_NAME = 'OrderProcessingStatusTable'
TOPIC_ARN = 'arn:aws:sns:region:account-id:OrderEventsTopic'
services = ['inventory', 'payment', 'shipping']
def save_task_token_and_send_services_event(event, context):
order_id = event['orderId']
task_token = event['taskToken']
table = dynamodb.Table(TABLE_NAME)
table.put_item(
Item={
'orderId': order_id,
'taskToken': task_token,
'servicesStatus': {service: 'PENDING' for service in services},
'TTL': int(time.time()) + 24 * 60 * 60 # 24 hours TTL
}
)
# Publish to SNS for fan-out
sns.publish(
TopicArn=TOPIC_ARN,
Message=json.dumps({'orderId': order_id})
)
Completing the Task Using the Lambda Function
A Lambda named ServiceTaskCompleted triggered for each service completed the message. The lambda updates the TaskId and signals(send task success/fail) the Step Function to continue or fail.
import json
import logging
import boto3
dynamodb = boto3.resource('dynamodb')
stepfunctions = boto3.client('stepfunctions')
TABLE_NAME = 'OrderProcessingStatusTable'
table = dynamodb.Table(TABLE_NAME)
def service_task_completed(event, context):
for record in event['Records']:
message_body = json.loads(record['body'])
order_id = message_body['orderId']
service_name = message_body['serviceName']
status = message_body['status']
response = table.get_item(Key={'orderId': order_id})
if 'Item' not in response:
raise Exception(f"Order with ID {order_id} not found")
# Update the service status in the DynamoDB table
services_status = response['Item']['servicesStatus']
task_token = response['Item']['taskToken']
services_status[service_name] = status
table.update_item(
Key={'orderId': order_id},
UpdateExpression="SET servicesStatus.#service = :status",
ExpressionAttributeNames={
"#service": service_name
},
ExpressionAttributeValues={
":status": status
},
)
# Check if all services succeeded or any failed
if all(s == 'SUCCESS' for s in services_status.values()):
stepfunctions.send_task_success(
taskToken=task_token,
output=json.dumps({'allServicesSucceeded': True})
)
elif any(s == 'FAILURE' for s in services_status.values()):
stepfunctions.send_task_failure(
taskToken=task_token,
error='ServiceFailure',
cause='One or more services failed.'
)
else:
logging.info(f"Order {order_id}: Waiting for remaining services to respond")
Why I Chose This Design
This design offers a number of benefits, including:
- Scalable Fan-Out. The SNS simplifies notifying multiple services and supports adding new services seamlessly.
- Centralized Status Tracking. DynamoDB ensures every service’s response is logged and monitored.
- Resilient Orchestration. Step Functions handle retries and timeouts, reducing the risk of workflow failures.
- Concurrency Control. FIFO queues ensure order-specific tasks are processed in sequence, avoiding race conditions.
My Ideas for a Future Fan-Out-Fan-In Mechanism
By combining AWS Step Functions, SNS, SQS, and DynamoDB, you can create a robust and scalable workflow for orchestrating tasks across multiple services. This architecture offers fault tolerance, clear status tracking, and flexibility to adapt to future needs.
Here are my suggestions for how to make the design even more efficient:
- Granular Error Handling: Add detailed error reporting for each service, including specific error codes and retry policies.
- Cross-Account Communication: The design can be easily extended to support cross-account communication for secure integration of services across multiple accounts.
- Tenant Isolation: Tenant isolation can be added if needed for enhanced security in multi-tenant environments.
- SNS Message Filtering: Add a task_type attribute to SNS messages, enabling services to process only relevant tasks efficiently.
- Flexible Fan-Out and Fan-In: Enable critical services (e.g., Payment) to require responses (Fan-In), while optional services (e.g., Logging, Notifications) can process events independently without requiring responses (Fan-Out).
What are your thoughts on this design? Share your thoughts in the comments below.