Proof of Concept on News Aggregator using Big Data Technologies
Last Updated: 02 Feb, 2021
Big Data refers to datasets characterized by high volume, high velocity, and a wide variety of data. For example, the searches that a very large number of users run on Google at the same time add up to a very large dataset. This article discusses a proof of concept (POC) on a news aggregator using Big Data technologies (Hadoop, Hive, Pig). The operations are written in HiveQL (Hive Query Language), an SQL-like query language for processing structured data, and Hive carries them out on Hadoop as MapReduce operations. Hive is a data warehouse tool built on top of Hadoop that makes querying and analysis easier.
This article shows one way to implement a POC on a news aggregator using Big Data technologies such as Hadoop, Hive, and Pig. The queries covered are: the number of news items in each category, the number of distinct titles in the table, the publisher name and title of business news, the most recently published news item, five titles from a given publisher, and the alphanumeric ID of each cluster of news about the same story. Let's discuss them one by one.
Proof of Concept on News Aggregator:
- This POC is based on the News Aggregator dataset.
- The public dataset is available at the link below.
https://archive.ics.uci.edu/ml/datasets/News+Aggregator
Industry: Social Media
Data:
The publicly available dataset has the following attributes.
- ID - Numeric ID of type integer.
- TITLE - News title of type String.
- URL - URL of type String.
- PUBLISHER - Publisher name of type String.
- CATEGORY - News category of type String.
- STORY - Alphanumeric ID of the cluster that includes news about the same story.
- HOSTNAME - URL hostname of type String.
- TIME - Approximate time the news was published.
Problem Statement:
- Find the number of news items in each category.
- Count the number of distinct titles in the table.
- Find the publisher name and title of news in the business category.
- Find the news item with the latest approximate publication time.
- Find 5 titles from the table that were published by the Los Angeles Times.
- Find the alphanumeric ID of each cluster that includes news about the same story.
Shell Script:
The purpose of this shell script is to create a table and run the Hive commands that store the result.
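The step that stores a result is not shown in this article. One possible approach, sketched here as an assumption rather than the article's own script, is Hive's INSERT OVERWRITE LOCAL DIRECTORY statement, run once the news table described below has been created and loaded. The output path /tmp/news_category_counts is hypothetical.

```sql
-- Sketch only: write a query result to a local directory as tab-separated text.
-- The path is a hypothetical example; existing files in that directory are replaced.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/news_category_counts'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT category, COUNT(*) AS news_count
FROM news
GROUP BY category;
```

Hive writes one or more plain-text files into that directory, which a shell script can then read or copy elsewhere.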
Creating Table: Create the table using the following query.
hive> create table news
> (
> id bigint,
> title String,
> url String,
> publishername String,
> category String,
> story String,
> hostname String,
> time bigint
> )
> row format delimited
> fields terminated by '\t'
> lines terminated by '\n'
> stored as textfile;
Loading Tables: Load the data file into the table using the following query.
hive> load data local inpath '/home/training/Desktop/news.txt'
> overwrite into table news;
Output: To display the loaded rows, use the following query.
hive> select * from news;
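Because SELECT * prints every row, a lighter check after loading can be useful. The queries below are a minimal sketch. They assume the file has no header row, and they rely on Hive returning NULL when a field cannot be parsed as the declared column type.

```sql
-- Preview a few rows instead of the whole table.
SELECT * FROM news LIMIT 10;

-- Total number of rows loaded.
SELECT COUNT(*) FROM news;

-- Rows whose first field did not parse as a bigint (for example a header line,
-- or a file that is not tab-separated) show up with a NULL id.
SELECT COUNT(*) FROM news WHERE id IS NULL;
```

A non-zero count from the last query suggests that the delimiter or the column order does not match the table definition.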
Hive Commands
1. Find the number of news items in each category.
hive> SELECT category, COUNT(*) FROM news GROUP BY category;
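The category column holds single-letter codes. In the UCI description of this dataset, b stands for business, t for science and technology, e for entertainment, and m for health. The following is a sketch of the same count with readable labels, assuming those codes. The aliases c, category_name, and news_count are introduced here for illustration.

```sql
-- Count per category first, then attach a readable label to each code.
SELECT
  CASE c.category
    WHEN 'b' THEN 'business'
    WHEN 't' THEN 'science and technology'
    WHEN 'e' THEN 'entertainment'
    WHEN 'm' THEN 'health'
    ELSE 'unknown'
  END AS category_name,
  c.news_count AS news_count
FROM (
  SELECT category, COUNT(*) AS news_count
  FROM news
  GROUP BY category
) c
ORDER BY news_count DESC;
```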

2. Count the number of distinct titles in the table.
hive> SELECT COUNT(DISTINCT title) FROM news;
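COUNT(DISTINCT title) returns how many different titles exist. If the goal is instead to see how often each title occurs, a per-title count is another way to read the task. This is a sketch of that reading, with the limit of 10 chosen arbitrarily.

```sql
-- Most frequently repeated titles, highest count first.
SELECT title, COUNT(*) AS occurrences
FROM news
GROUP BY title
ORDER BY occurrences DESC
LIMIT 10;
```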

3. Find the publisher name and title of news in the business category (category code 'b').
hive> SELECT title, publishername FROM news WHERE category = 'b';


4. Find the news item with the latest approximate publication time.
hive> SELECT * FROM news ORDER BY time DESC LIMIT 1;
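The UCI description gives the TIME attribute as milliseconds since the Unix epoch (1 January 1970, 00:00:00 GMT). Assuming that unit, the following sketch shows the latest item with a readable date. from_unixtime expects seconds, so the value is divided by 1000 first. The column name is quoted with backticks in case the Hive version in use treats time as a keyword, and published_at is an alias introduced here.

```sql
-- Latest news item with its timestamp converted from milliseconds to a date string.
SELECT
  title,
  `time`,
  from_unixtime(CAST(`time` / 1000 AS BIGINT)) AS published_at
FROM news
ORDER BY `time` DESC
LIMIT 1;
```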

5. Find 5 titles from the table that were published by the Los Angeles Times.
hive> SELECT title FROM news where publishername='Los Angeles Times' LIMIT 5;

6. Find the alphanumeric ID of each cluster that includes news about the same story, with the number of news items in it.
hive> SELECT story, COUNT(*) FROM news GROUP BY story;
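Every story ID appears in this result, including clusters that contain a single news item. To focus on stories covered by several items, a HAVING filter can be added. This is a sketch, with the threshold and limit chosen only for illustration.

```sql
-- Story clusters with more than one news item, largest clusters first.
SELECT story, COUNT(*) AS news_count
FROM news
GROUP BY story
HAVING COUNT(*) > 1
ORDER BY news_count DESC
LIMIT 10;
```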
