UNIT 5
Chapter 9
1. Explain the architecture of a streaming system with technology choices.
Architecture of a streaming system with technology choices
Technologies Used:
Java 1.8 for implementation
Apache Maven 3.3.9 for building the project
JavaScript for the web-based frontend
All components use open-source tools, mostly Apache projects
Architecture Components:
1. Collection Tier (Netty):
This tier uses Netty to connect to the Meetup streaming RSVP API and collect real-time
RSVP data from the source.
2. Message Queuing Tier (Apache Kafka):
After data is collected, it is passed to Apache Kafka, which acts as the message broker to
queue and stream data to the next stages.
3. Analysis Tier (Apache Storm):
The queued data is consumed by Apache Storm, which performs real-time analysis such
as filtering, aggregating, or transforming the data.
4. In-Memory Data Store (Apache Kafka):
The results from the analysis tier are written back to Kafka, which is reused here in place of a
dedicated in-memory data store so the results can be served quickly.
5. Data Access Tier (Netty):
This tier, also using Netty, serves the processed and stored data to the frontend
application.
6. Web Browser (JavaScript):
The final output is displayed to users via a web browser interface built using JavaScript,
providing a real-time data visualization experience.
2. Explain the collection service data flow with a neat diagram.
When building our collection service, we want to take into consideration the following
capabilities:
Managing the connection to the Meetup API
Ensuring that we don’t lose data
Integrating with the message queuing tier
Behind the scenes, the collection service collects, logs, and delivers live RSVP messages from
Meetup to Apache Kafka, a message queue system that helps us handle large-scale data safely.
Step by step, the flow looks like this:
1. Connect: Our service opens a WebSocket connection to Meetup’s RSVP API. This lets
us listen to RSVP events in real-time, like tuning into a live radio channel.
2. Create Client Handler: Once connected, we spin up a client handler — a component
that knows how to process each incoming message.
3. Initialize Logging and Kafka Producer: The handler prepares the message logger (so
nothing gets lost) and sets up the Kafka producer, which will be responsible for
forwarding the data.
4. Receive Messages: Now, as people RSVP, messages flow in through the WebSocket.
We catch them in real-time.
5. Record Message: Before doing anything else, we log the message using a
HybridMessageLogger. This acts like a safety net, making sure we don't lose data
even if Kafka fails.
6. Send Message to Producer: The client handler sends the message to the RSVP producer,
which prepares and forwards it to Kafka.
7. Produce Message in Kafka: Kafka receives the message and stores it in a queue so that
downstream services can pick it up.
8. Acknowledge and Clean Up:
Once Kafka confirms successful delivery:
• We remove the message from our logs (it’s safely stored now).
• If Kafka fails to confirm, we move the message to a "failed" list for future retries.
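Putting steps 3 to 8 together, the handler logic can be sketched in Java roughly as follows. This is
only a sketch: the HybridMessageLogger interface and its method names (addMessage, markSuccess,
markFailed) are simplified stand-ins for the real logger, and the producer settings assume a local
Kafka broker on localhost:9092.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

// Simplified stand-in for the hybrid message logger (illustrative method names).
interface HybridMessageLogger {
    void addMessage(String id, String message);   // record before sending (step 5)
    void markSuccess(String id);                  // remove once Kafka acknowledges (step 8)
    void markFailed(String id);                   // move to the "failed" list for retries (step 8)
}

public class RsvpClientHandler {
    private static final String TOPIC = "meetup-raw-rsvps";
    private final KafkaProducer<String, String> producer;
    private final HybridMessageLogger logger;

    public RsvpClientHandler(HybridMessageLogger logger) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);   // step 3: set up the Kafka producer
        this.logger = logger;                         // step 3: prepare the message logger
    }

    // Step 4: called for every RSVP message that arrives over the WebSocket.
    public void onMessage(String messageId, String rsvpJson) {
        logger.addMessage(messageId, rsvpJson);       // step 5: record the message first
        ProducerRecord<String, String> record = new ProducerRecord<>(TOPIC, messageId, rsvpJson);
        producer.send(record, (metadata, exception) -> {   // steps 6-7: forward to Kafka
            if (exception == null) {
                logger.markSuccess(messageId);        // step 8: delivery confirmed, clean up
            } else {
                logger.markFailed(messageId);         // step 8: keep it for a future retry
            }
        });
    }
}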
3. Explain the step-by-step procedure for installing and configuring Kafka
Downloading and installing Apache Kafka:
Download Kafka version 0.10.0.1 from the official Apache Kafka site, then extract the archive:
$> wget http://www-us.apache.org/dist/kafka/0.10.0.1/kafka_2.11-0.10.0.1.tgz
$> tar -xvf kafka_2.11-0.10.0.1.tgz
Four key Kafka concepts:
Producer – Sends messages to Kafka.
Consumer – Receives and processes messages from Kafka.
Broker – A Kafka server that manages message storage and delivery.
Topic – A logical channel where messages are published and from where consumers read.
Topics help in organizing the data stream.
Starting Kafka, Apache ZooKeeper, and Creating a Topic
Change directory to the Kafka installation folder.
Start Apache ZooKeeper (required by Kafka for metadata storage).
Start the Kafka server (broker).
Create a topic called meetup-raw-rsvps with 1 partition and replication factor of 1.
$> cd kafka_2.11-0.10.0.1 1
$> bin/zookeeper-server-start.sh -daemon config/zookeeper.properties 2
$> bin/kafka-server-start.sh -daemon config/server.properties 3
$> bin/kafka-topics.sh --zookeeper localhost:2181 --create \ 4
--topic meetup-raw-rsvps --partitions 1 --replication-factor 1
Created topic "meetup-raw-rsvps". 5
Explanation of steps:
1. Navigate to the Kafka directory.
2. Start ZooKeeper.
3. Start the Kafka broker.
4. Create the topic.
5. Confirmation message indicating successful topic creation.
Verifying the Created Topic:
Use the --list command to confirm that the topic was successfully created.
$> bin/kafka-topics.sh --zookeeper localhost:2181 --list 1
meetup-raw-rsvps 2
Explanation:
1. Command to list all Kafka topics.
2. Expected output showing the meetup-raw-rsvps topic.
Kafka is now fully set up and ready to receive messages from your collection service.
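As an optional sanity check (not one of the original steps), Kafka's console producer can be used to
publish a test message to the new topic; anything typed at its prompt is written to meetup-raw-rsvps
and will appear in any consumer of that topic:
$> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic meetup-raw-rsvps
This assumes the broker is running with its default listener on localhost:9092.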
4. Explain how to integrate the collection service and Kafka (OR)
Describe how to build and run the collection service in a streaming system.
After setting up Kafka and the collection service, the next step is to integrate them and
verify that data is flowing correctly.
To test the integration, a Kafka console consumer is started in one console window to listen for
messages on the topic.
In a second console window, the collection service is built and run; it connects to the
Meetup API and begins sending live data to Kafka. Successful integration is confirmed when
messages appear in the consumer.
Running the Kafka console consumer
$> bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
--topic meetup-raw-rsvps
When the Kafka consumer is started, no output appears initially because the topic has not
received any messages yet — this is expected behavior.
The next step is to start the collection service, which will begin sending data to the Kafka
topic, allowing the consumer to display incoming messages.
Building and running the collection service
$> cd $EXAMPLE_CODE_HOME/Chapter9/collection-service 1
$> mvn clean package 2
$> java -jar target/collection-service-0.0.1.jar 3
WebSocket Client connected and ready to consume RSVPs!
1. Navigate to the directory where the source code for the collection service is located.
2. Use Maven to build the project and generate the necessary artifacts.
3. Run the generated JAR file to start the collection service.
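The same verification can also be done from Java instead of the console consumer. A minimal sketch,
assuming the Kafka 0.10 Java client is on the classpath and the broker is on the default
localhost:9092 (the group id rsvp-check is an arbitrary placeholder):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Collections;
import java.util.Properties;

public class RsvpConsoleCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "rsvp-check");          // any unused consumer group id will do
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("meetup-raw-rsvps"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());   // each value is one raw RSVP JSON message
                }
            }
        }
    }
}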
5. Explain the step-by-step procedure for installing Storm and preparing Kafka
Apache Kafka is a message queue — it stores messages and passes them along for further
processing. Here, we’re setting up a topic where RSVP messages will be stored.
Installing Apache Storm:
Step 1: Download Apache Storm
This step fetches the Apache Storm software package from its official website. You need
this to run and build real-time data processing applications.
Step 2: Decompress the Archive
Running the tar -xvzf command unpacks the downloaded archive. It extracts all the
required files and folders needed to configure and run Apache Storm.
Setting Up Kafka for the Analysis Topic:
Step 1: Navigate to Kafka Installation Directory
This step places you inside Kafka’s installation folder so you can run Kafka-related
commands properly from the terminal.
Step 2: Create a Kafka Topic
This command creates a topic named meetup-topn-rsvps. Topics in Kafka are used to store and
organize streams of messages.
• It connects to ZooKeeper (Kafka’s coordination service) on port 2181.
• It sets up the topic with 1 partition and 1 replica, which is enough for testing or local
setups.
Step 3: Verify the Topic is Created
After topic creation, Kafka returns a confirmation message. This ensures the topic is
ready to receive data (like RSVP messages from Meetup).
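Put together, these steps correspond to commands like the following, mirroring the Kafka setup from
question 3. The Storm release number and download URL shown here are only placeholders; use whichever
version you actually downloaded from the Apache Storm site.
$> wget http://www-us.apache.org/dist/storm/apache-storm-1.0.2/apache-storm-1.0.2.tar.gz
$> tar -xvzf apache-storm-1.0.2.tar.gz
$> cd kafka_2.11-0.10.0.1
$> bin/kafka-topics.sh --zookeeper localhost:2181 --create \
--topic meetup-topn-rsvps --partitions 1 --replication-factor 1
Created topic "meetup-topn-rsvps".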
6. Explain how to build the top N Storm topology
In Apache Storm, a topology is a directed acyclic graph (DAG) made up of spouts and bolts.
Spouts pull data from sources and emit it as tuples, while bolts process this data through filtering,
aggregation, or storage.
For analyzing top N Meetup RSVPs, a multi-bolt topology can be used to structure the logic
efficiently.
Multi-bolt approach to top N
1. Apache Kafka: Source of raw data.
2. Kafka Spout: Reads from the raw topic.
3. Rolling Count Bolt:
Performs local counting on its share of the stream (tuples are routed to it by fields grouping).
Can be parallelized for performance.
4. Intermediate Ranking Bolt:
Maintains partial top-N rankings from each count bolt.
Works in parallel.
5. Total Ranking Bolt:
Receives data via global grouping from all intermediate bolts.
Merges partial rankings to produce the final Top-N result.
6. Kafka Bolt: Publishes final Top-N list to Kafka.
Use case: Best for high-volume streaming data needing scalable, distributed ranking. Handles
Top-N computation efficiently in stages.
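A rough wiring sketch of this multi-bolt topology in Java is shown below. It assumes Storm 1.x package
names and that a Kafka spout, the counting/ranking bolts, and a Kafka bolt already exist (the bolt
names follow storm-starter conventions and are illustrative, not the book's exact classes); the stream
field names and parallelism hints are placeholders.

import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TopNTopologyWiring {
    // Wires the components together; the actual spout/bolt implementations are passed in.
    public static TopologyBuilder wire(IRichSpout kafkaSpout,
                                       IRichBolt rollingCountBolt,
                                       IRichBolt intermediateRankingsBolt,
                                       IRichBolt totalRankingsBolt,
                                       IRichBolt kafkaBolt) {
        TopologyBuilder builder = new TopologyBuilder();

        // 2. Kafka spout reads raw RSVPs from the meetup-raw-rsvps topic.
        builder.setSpout("rsvp-spout", kafkaSpout, 1);

        // 3. Rolling count bolts count locally; fields grouping routes all tuples
        //    with the same group field to the same bolt task.
        builder.setBolt("rolling-count", rollingCountBolt, 4)
               .fieldsGrouping("rsvp-spout", new Fields("group"));

        // 4. Intermediate ranking bolts keep partial top-N lists in parallel.
        builder.setBolt("intermediate-rankings", intermediateRankingsBolt, 4)
               .fieldsGrouping("rolling-count", new Fields("obj"));

        // 5. A single total ranking bolt merges every partial ranking (global grouping).
        builder.setBolt("total-rankings", totalRankingsBolt, 1)
               .globalGrouping("intermediate-rankings");

        // 6. Kafka bolt publishes the final top-N list to the meetup-topn-rsvps topic.
        builder.setBolt("kafka-out", kafkaBolt, 1)
               .globalGrouping("total-rankings");

        return builder;
    }
}

The counting and intermediate ranking stages can be parallelized freely because the single
total-ranking bolt only ever receives small partial rankings, not the full stream.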
Topology using a streaming summary
1. Apache Kafka: Produces raw data to a topic.
2. Kafka Spout: Reads data from Kafka topic.
3. RSVP Summarizer Bolt:
Performs the summary computation (e.g., counting, filtering).
Needs a global grouping to consolidate data from all spout partitions.
4. Kafka Bolt: Writes the final Top-N summary back to Kafka.
Use case: Best for simple global summarization tasks with moderate data volumes. Everything is
centralized in one summarizer bolt.
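By contrast, the streaming-summary variant needs only one bolt between the spout and the Kafka bolt.
A minimal wiring sketch under the same assumptions as the previous example:

import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.TopologyBuilder;

public class SummaryTopologyWiring {
    public static TopologyBuilder wire(IRichSpout kafkaSpout,
                                       IRichBolt summarizerBolt,
                                       IRichBolt kafkaBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("rsvp-spout", kafkaSpout, 1);
        // Global grouping sends every tuple to the single summarizer task,
        // which keeps the whole top-N summary in one place.
        builder.setBolt("rsvp-summarizer", summarizerBolt, 1)
               .globalGrouping("rsvp-spout");
        // The finished summary is written back to Kafka (meetup-topn-rsvps).
        builder.setBolt("kafka-out", kafkaBolt, 1)
               .globalGrouping("rsvp-summarizer");
        return builder;
    }
}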