Cse Big Data 702 Notes

The document provides an overview of Big Data and its significance in data science, detailing the processes involved in data analysis and the types of data including structured, semi-structured, and unstructured data. It outlines the characteristics of Big Data through the 5 V's: Volume, Variety, Veracity, Value, and Velocity, and compares traditional data with Big Data. Additionally, it discusses the evolution of Big Data technologies and highlights key technologies used for data storage, mining, analytics, and visualization.


UNIT-1

BIG DATA

Data science is the study of data using advanced technologies (Machine Learning, Artificial Intelligence, Big Data). It processes huge amounts of structured, semi-structured, and unstructured data to extract meaningful insights, from which patterns can be derived that help in making decisions about new business opportunities, improving products and services, and ultimately growing the business. The data science process is what makes sense of Big Data, the huge amount of data used in business. The workflow of data science is as follows:
 Determining the business objective and issue – What is the organization's objective, what level does the organization want to reach, and what issue is the company facing? These are the factors under consideration, and based on them the relevant types of data are identified.
 Collection of relevant data – relevant data are collected from various sources.
 Cleaning and filtering the collected data – non-relevant data are removed.
 Exploring the filtered, cleaned data – finding hidden patterns and relationships in the data and plotting them as graphs, charts, etc., in a form understandable to a non-technical person.
 Creating a model by analyzing the data – building a model and validating it.
 Visualizing the findings by interpreting the data or the model for business people.
 Helping business people make decisions and take steps for the sake of business growth.

Types of Big Data

Structured Data

 Structured data can be crudely defined as the data that resides in a fixed field within a record.
 It is the type of data most familiar from everyday life, for example: birthdays, addresses.
 A certain schema binds it, so all the data has the same set of properties. Structured data is also called relational data. It is split into multiple tables to enhance the integrity of the data by creating a single record to depict an entity. Relationships are enforced by the application of table constraints.
 The business value of structured data lies in how well an organization can utilize its existing systems and processes for analysis purposes.
Structured Query Language (SQL) is used to bring the data together. Structured data is easy to enter, query, and analyze, since all of the data follows the same format. However, forcing a consistent structure also means that any alteration of the data is difficult, as each record has to be updated to adhere to the new structure. Examples of structured data include numbers, dates, strings, etc. The business data of an e-commerce website can be considered structured data.
Name Class Section Roll No Grade

Geek1 11 A 1 A

Geek2 11 A 2 B

Geek3 11 A 3 A
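To illustrate the SQL point above, here is a minimal sketch of a query over this table, assuming it has been loaded into a relational table named students with columns name, class, section, roll_no, and grade (the table and column names are assumptions):

 SQL

SELECT name, grade
FROM students
WHERE section = 'A' AND grade = 'A'
ORDER BY roll_no;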

Cons of Structured Data


1. Structured data can only be leveraged in cases of predefined functionalities. This means
that structured data has limited flexibility and is suitable for certain specific use cases only.
2. Structured data is stored in a data warehouse with rigid constraints and a definite schema.
Any change in requirements would mean updating all of that structured data to meet the
new needs. This is a massive drawback in terms of resource and time management.

Semi-Structured Data

 Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet. However, there are some features like key-value pairs that help in
discerning the different entities from each other.
 Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
 A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
 Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
 This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware
with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files so that it can be transmitted, stored, and parsed. The sender and the receiver don't need to know anything about the other system; as long as the same serialization language is used, the data can be understood comfortably by both systems. There are three predominantly used serialization languages.

1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.

 XML

<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>

<!-- The 2ndFav and 3rdFav coding platforms are imaginative because GeeksforGeeks is the best! -->

XML expresses the data using tags (text within angle brackets) to shape the data (for example, FirstName) and attributes (for example, Type) to describe it further. However, because it is a verbose and voluminous language, other formats have gained more popularity.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for
data interchange. JSON is easy to use and uses human/machine-readable text to store and
transmit data objects.

 JSON

{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "GeeksforGeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}
This format isn't as formal as XML. It's more like a key/value pair model than a formal data depiction. JavaScript has inbuilt support for JSON. Although JSON is very popular amongst web developers, non-technical personnel find it tedious to work with JSON due to its heavy dependence on JavaScript and structural characters (braces, commas, etc.).
3. YAML– YAML is a user-friendly data serialization language. It is a recursive acronym for YAML Ain't Markup Language. It is adopted by technical and non-technical users all across the globe owing to its simplicity. The data structure is defined by line separation and indentation, which reduces the dependency on structural characters. YAML is highly readable by both humans and machines, and this readability is a large part of its popularity.

YAML example
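Below is a minimal sketch of the same programmer details in YAML, carrying over the field names used in the XML and JSON examples above:

 YAML

firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    value: GeeksforGeeks
  - type: 2ndFav
    value: Code4Eva!
  - type: 3rdFav
    value: CodeisLife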
A product catalog organized by tags is an example of semi-structured data.

Unstructured Data

 Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.
 Photos, videos, text documents, and log files can be generally considered unstructured data.
Even though the metadata accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
 Additionally, Unstructured data is also known as “dark data” because it cannot be analyzed
without the proper software tools.

Big Data Characteristics-


Big Data refers to volumes of data so large that they cannot be stored or processed by traditional data storage and processing systems. It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.

There are five V's of Big Data that explain its characteristics.

5 V's of Big Data

o Volume
o Veracity
o Variety
o Value
o Velocity

Volume
The name Big Data itself is related to enormous size. Big Data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.

Facebook, for example, generates approximately a billion messages per day, records around 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.

Variety

Big Data can be structured, unstructured, or semi-structured, and it is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below:

1. Structured data: data with a defined schema and all the required columns, stored in tabular form. Structured data is kept in a relational database management system, and OLTP (Online Transaction Processing) systems are built to work with it, storing it in relations, i.e., tables.
2. Semi-structured data: the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email.
3. Unstructured data: all unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it since the data is raw.
4. Quasi-structured data: textual data with inconsistent formats that can be given structure with some effort, time, and tools.

Example: web server logs, i.e., a log file created and maintained by a server that contains a list of activities.

Veracity

Veracity refers to how reliable the data is. Because data arrives from many sources, it has to be filtered, cleaned, and translated in many ways, and veracity is about being able to handle and manage such data efficiently, which is essential for business development.

For example, Facebook posts with hashtags are noisy, user-generated data whose reliability must be assessed.

Value
Value is an essential characteristic of big data. It is not the raw data that we process or store that matters; it is the valuable and reliable insight extracted from the data we store, process, and analyze.

Velocity

Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created, often in real time. It covers the rate at which incoming data sets arrive, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide the demanded data rapidly.

Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

The main differences between traditional data and big data are as follows:

Traditional Data: It is usually a small amount of data that can be collected and analyzed easily using traditional methods.
Big Data: It is usually a large amount of data that cannot be processed and analyzed easily using traditional methods.

Traditional Data: It is usually structured data and can be stored in spreadsheets, databases, etc.
Big Data: It includes structured, semi-structured, and unstructured data.

Traditional Data: It often collects data manually.
Big Data: It collects information automatically with the use of automated systems.

Traditional Data: It usually comes from internal systems.
Big Data: It comes from various sources such as mobile devices, social media, etc.

Traditional Data: It consists of data such as customer information, financial transactions, etc.
Big Data: It consists of data such as images, videos, etc.

Traditional Data: Analysis can be done with basic statistical methods.
Big Data: Analysis needs advanced analytics methods such as machine learning, data mining, etc.

Traditional Data: Methods to analyze the data are slow and gradual.
Big Data: Methods to analyze the data are fast and instant.

Traditional Data: It generates data after an event has happened.
Big Data: It generates data every second.

Traditional Data: It is typically processed in batches.
Big Data: It is generated and processed in real time.

Traditional Data: It is limited in its value and insights.
Big Data: It provides valuable insights and patterns for good decision-making.

Traditional Data: It contains reliable and accurate data.
Big Data: It may contain unreliable, inconsistent, or inaccurate data because of its size and complexity.

Traditional Data: It is used for simple and small business processes.
Big Data: It is used for complex and big business processes.

Traditional Data: It does not provide in-depth insights.
Big Data: It provides in-depth insights.

Traditional Data: It is easier to secure and protect because of its small size and simplicity.
Big Data: It is harder to secure and protect because of its size and complexity.

Traditional Data: It requires less time and money to store.
Big Data: It requires more time and money to store.

Traditional Data: It can be stored on a single computer or server.
Big Data: It requires distributed storage across numerous systems.

Traditional Data: It is less efficient than big data.
Big Data: It is more efficient than traditional data.

Traditional Data: It can be managed in a centralized structure easily.
Big Data: It requires a decentralized infrastructure to manage the data.

Evolution of Big Data:-

Looking back over the last few decades, Big Data technology has grown enormously. The major milestones in the evolution of Big Data are described below:

1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
Around 2009, NoSQL databases came into widespread use, providing a flexible way to store and retrieve unstructured data.
4. Cloud Computing:
Cloud computing technology helps companies store their important data in remote data centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine learning algorithms work on large data sets; analysis is done on huge amounts of data to get meaningful insights from it. This has led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data streaming technology has emerged as a solution to process large volumes of data in real time.
7. Edge Computing:
Edge computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.

Top Big Data Technologies

We can categorize the leading big data technologies into the following four sections:

o Data Storage
o Data Mining
o Data Analytics
o Data Visualization

Data Storage

Let us first discuss leading Big Data Technologies that come under Data Storage:

o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based on the MapReduce architecture and is mainly used for batch processing of information. The Hadoop framework was introduced to store and process data in a distributed processing environment, in parallel, on commodity hardware, with a simple programming execution model.
Apart from this, Hadoop is also well suited for storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in December 2011. Hadoop is written in the Java programming language.
o MongoDB: MongoDB is another important big data technology in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. It is not the same as traditional RDBMS databases that use structured query languages; instead, MongoDB stores schema-flexible documents.
The structure of data storage in MongoDB is also different from that of traditional RDBMS databases, which enables MongoDB to hold massive amounts of data. It is based on a simple cross-platform, document-oriented design. MongoDB stores data in JSON-like documents with flexible schemas. This ultimately helps the operational data storage options seen in most financial organizations. As a result, MongoDB is replacing traditional mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed architectures.
MongoDB Inc. introduced MongoDB in February 2009. It is written in a combination of C++, Python, JavaScript, and Go.
o RainStor: RainStor is a popular database management system designed to manage and
analyze organizations' Big Data requirements. It uses deduplication strategies that help
manage storing and handling vast amounts of data for reference.
RainStor was designed in 2004 by the RainStor software company. It operates much like SQL. Companies such as Barclays and Credit Suisse use RainStor for their big data needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters using virtual indexes. It lets us use the Splunk Search Processing Language to analyze the data. Also, Hunk allows us to report on and visualize vast amounts of data from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming language.
o Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open-source, distributed, and has extensive column-storage options. It is freely available and provides high availability with no single point of failure. This ultimately helps in handling data efficiently on large commodity clusters. Cassandra's essential features include fault tolerance, scalability, MapReduce support, a distributed nature, eventual consistency, a query language (CQL), tunable consistency, multi-datacenter replication, etc.
Cassandra was originally developed at Facebook in 2008 for its inbox search feature and later became an Apache Software Foundation project. It is based on the Java programming language.

Data Mining

Let us now discuss leading Big Data Technologies that come under Data Mining:
o Presto: Presto is an open-source and a distributed SQL query engine developed to run
interactive analytical queries against huge-sized data sources. The size of data sources
can vary from gigabytes to petabytes. Presto helps in querying the data in Cassandra,
Hive, relational databases and proprietary data storage systems.
Presto is a Java-based query engine that was open-sourced by Facebook in 2013. Companies like Repro, Netflix, Airbnb, Facebook, and Checkr use this big data technology and make good use of it.
o RapidMiner: RapidMiner is defined as the data science software that offers us a very
robust and powerful graphical user interface to create, deliver, manage, and maintain
predictive analytics. Using RapidMiner, we can create advanced workflows and scripting
support in a variety of programming languages.
RapidMiner is a Java-based centralized solution developed in 2001 by Ralf
Klinkenberg, Ingo Mierswa, and Simon Fischer at the Technical University of
Dortmund's AI unit. It was initially named YALE (Yet Another Learning Environment).
A few sets of companies that are making good use of the RapidMiner tool are Boston
Consulting Group, InFocus, Domino's, Slalom, and Vivint.SmartHome.
o ElasticSearch: When it comes to finding information, Elasticsearch is known as an essential tool. Together with Logstash and Kibana, it forms the core of the ELK stack. In simple words, Elasticsearch is a search engine based on the Lucene library and works similarly to Solr. It provides a distributed, multi-tenant-capable, full-text search engine that stores schema-free JSON documents and exposes an HTTP web interface.
ElasticSearch is primarily written in a Java programming language and was developed in
2010 by Shay Banon. Now, it has been handled by Elastic NV since 2012. ElasticSearch
is used by many top companies, such as LinkedIn, Netflix, Facebook, Google, Accenture,
StackOverflow, etc.

Data Analytics

Now, let us discuss leading Big Data Technologies that come under Data Analytics:

o Apache Kafka: Apache Kafka is a popular streaming platform. It is primarily known for three core capabilities: publishing and subscribing to streams of records, storing them durably, and processing them as they occur. It is referred to as a distributed streaming platform. It is also described as an asynchronous messaging broker system that can ingest and process real-time streaming data. The platform is similar in spirit to an enterprise messaging system or message queue.
Besides, Kafka also provides a retention period, and data is transmitted through a producer-consumer mechanism. Kafka has received many enhancements to date and now includes additional layers and properties such as the schema registry, KTables, and KSQL. It is written in Java and Scala, was developed at LinkedIn, and was open-sourced through the Apache Software Foundation in 2011. Some top companies using the Apache Kafka platform include Twitter, Spotify, Netflix, Yahoo, and LinkedIn.
o Splunk: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and dashboards, etc.,
using related data. It is mainly beneficial for generating business insights and web
analytics. Besides, Splunk is also used for security purposes, compliance, application
management and control.
Splunk Inc. introduced Splunk, which is written in a combination of AJAX, Python, C++, and XML. Companies such as Trustwave, QRadar, and 1Labs make good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and analyze
the obtained models, results, and interactive views. It also allows us to execute all the
analysis steps altogether. It consists of an extension mechanism that can add more
plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was
developed in 2008 by KNIME Company. A list of companies that are making use of
KNIME includes Harnham, Tyler, and Paloalto.
o Spark: Apache Spark is one of the core technologies in the list of big data technologies. It
is one of those essential technologies which are widely used by top companies. Spark is
known for offering In-memory computing capabilities that help enhance the overall speed
of the operational process. It also provides a generalized execution model to support more
applications. Besides, it includes top-level APIs (e.g., Java, Scala, and Python) to ease the
development process.
Also, Spark allows users to process and handle real-time streaming data using batching and windowing techniques, and it provides Datasets and DataFrames built on top of RDDs, which form the integral abstractions of Spark Core. Libraries such as Spark MLlib, GraphX, and SparkR help analyze and process machine learning and data science workloads. Spark is written in Scala and also provides Java, Python, and R APIs. It was originally developed at UC Berkeley's AMPLab in 2009 and later became an Apache Software Foundation project. Companies like Amazon, Oracle, Cisco, Verizon Wireless, and Hortonworks use this big data technology and make good use of it.
o R Language: R is a programming language mainly used for statistical computing and graphics. It is a free software environment used by leading data miners, practitioners, and statisticians. The language is primarily useful for developing statistical software and for data analytics.
R 1.0.0 was released in February 2000 by the R Foundation. R is written primarily in C, Fortran, and R itself. Companies like Barclays, American Express, and Bank of America use R for their data analytics needs.
o Blockchain: Blockchain is a technology that can be used in several applications related to
different industries, such as finance, supply chain, manufacturing, etc. It is primarily used
in processing operations like payments and escrow, which helps in reducing the risk of fraud. Besides, it enhances the overall processing speed of transactions, increases financial privacy, and internationalizes markets. Additionally, it is also used to fulfill the needs of shared ledgers, smart contracts, privacy, and consensus in any Business Network Environment.
Blockchain technology was first described in 1991 by two researchers, Stuart Haber and W. Scott Stornetta. However, blockchain had its first real-world application in January 2009, when Bitcoin was launched. It is a specific type of distributed database, with implementations written in languages such as Python, C++, and JavaScript. Oracle, Facebook, and MetLife are a few of the top companies using blockchain technology.

Data Visualization

Let us discuss leading Big Data Technologies that come under Data Visualization:

o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by leading business intelligence industries. It helps in analyzing data at a very fast speed. Tableau helps in creating visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software (now part of Salesforce). It is written using multiple languages, such as Python, C, C++, and Java. Tools such as Cognos, Qlik, and Oracle Hyperion compete with Tableau in the same business intelligence space.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and
relevant components at a faster speed in an efficient way. It consists of several rich
libraries and APIs, such as MATLAB, Python, Julia, REST API, Arduino, R, Node.js,
etc. This helps in creating interactive, styled graphs in environments such as Jupyter Notebook and PyCharm.
Plotly was introduced in 2012 by Plotly company. It is based on JavaScript. Paladins and
Bitbank are some of those companies that are making good use of Plotly.
Expected Properties of a Big Data System
A big data system has several expected properties, most of which are about managing complexity as the system scales. Judged by these properties, a big data system should perform well and be efficient and reasonable to operate. Let's explore these properties step by step.
1. Robustness and error tolerance –
Given the obstacles encountered in distributed systems, it is quite arduous to build a system that "does the right thing". Systems are required to behave correctly despite machines going down randomly, the complex semantics of consistency in distributed databases, redundancy, concurrency, and much more. These obstacles make it complicated to reason about the functioning of the system, and robustness of the big data system is the way to overcome them.
It is imperative for the system to tolerate human fault. This is an often-disregarded property of the system which cannot be overlooked. In a production system, it is inevitable that the operator of the system will sometimes make mistakes, such as deploying an incorrect program that corrupts or interrupts the database. If re-computation and immutability are built into the core of a big data system, the system will be distinctly robust against human fault by providing a relevant and quite simple mechanism for recovery.
2. Debuggability –
A big data system must provide the information needed to debug it when things go wrong. The key is being able to trace, for each value in the system, what produced that value. Debuggability is achieved in the Lambda Architecture through the functional nature of the batch layer and by re-computation when needed.
3. Scalability –
Scalability is the ability to maintain performance in the face of growing data and load by adding resources to the system. The Lambda Architecture is horizontally scalable across all layers of the system stack: scaling is achieved by adding more machines.
4. Generalization –
A general system can support a wide range of applications. Because the Lambda Architecture is based on functions of all the data, a large number of applications can run on a generalized system; the Lambda Architecture can generalize to social networking applications and more.
5. Ad hoc queries –
The ability to perform ad hoc queries on the data is significant. Nearly every large dataset contains unanticipated value in it. Being able to mine the data at will constantly provides opportunities for new applications and business optimization.
6. Extensibility –
An extensible system allows functionality to be added at minimal cost. Sometimes a new feature, or a change to an existing feature, requires migrating pre-existing data into a new data format. Large-scale migrations of data become easy, as supporting them is part of building an extensible system.
7. Low latency reads and updates –
Numerous applications require reads with low latency, between a few milliseconds and a few hundred milliseconds. In contrast, the required update latency varies between applications: some need updates to be propagated with low latency, while others can function with a few hours of latency. A big data system needs to deliver low-latency reads and, where the application demands it, low-latency updates.
8. Minimal Maintenance –
Maintenance is like a penalty for developers. It is the work required to keep the system functioning smoothly. This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production. Selecting components with as little implementation complexity as possible plays a significant role in minimal maintenance. A developer should rely on components with simple and well-understood mechanisms; notably, distributed databases tend to have very complicated internals.
UNIT-2

What is Hadoop:-

Hadoop is an open-source framework from Apache and is used to store, process, and analyze data which is very huge in volume. Hadoop is written in Java and is not used for OLAP (online analytical processing); it is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files are broken into blocks and stored across nodes in a distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
3. Map Reduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data are replicated three times, but the replication factor is configurable.
o Core Components of Hadoop Architecture
o 1. Hadoop Distributed File System (HDFS)
o One of the most critical components of Hadoop architecture is the Hadoop Distributed
File System (HDFS). HDFS is the primary storage system used by Hadoop applications.
It’s designed to scale to petabytes of data and runs on commodity hardware. What sets
HDFS apart is its ability to maintain large data sets across multiple nodes in a distributed
computing environment.
o HDFS operates on the basic principle of storing large files across multiple machines. It
achieves high throughput by dividing large data into smaller blocks, which are managed
by different nodes in the network. This nature of HDFS makes it an ideal choice for
applications with large data sets.
o 2. Yet Another Resource Negotiator (YARN)
o Yet Another Resource Negotiator (YARN) is responsible for managing resources in the
cluster and scheduling tasks for users. It is a key element in Hadoop architecture as it
allows multiple data processing engines such as interactive processing, graph processing,
and batch processing to handle data stored in HDFS.
o YARN separates the functionalities of resource management and job scheduling into
separate daemons. This design ensures a more scalable and flexible Hadoop architecture,
accommodating a broader array of processing approaches and a wider array of
applications.
o 3. MapReduce Programming Model
o MapReduce is a programming model integral to Hadoop architecture. It is designed to
process large volumes of data in parallel by dividing the work into a set of independent
tasks. The MapReduce model simplifies the processing of vast data sets, making it an
indispensable part of Hadoop.
o MapReduce is characterized by two primary tasks, Map and Reduce. The Map task takes
a set of data and converts it into another set of data, where individual elements are broken
down into tuples. On the other hand, the Reduce task takes the output from the Map as
input and combines those tuples into a smaller set of tuples.
o 4. Hadoop Common
o Hadoop Common, often referred to as the ‘glue’ that holds Hadoop architecture together,
contains libraries and utilities needed by other Hadoop modules. It provides the necessary
Java files and scripts required to start Hadoop. This component plays a crucial role in
ensuring that the hardware failures are managed by the Hadoop framework itself, offering
a high degree of resilience and reliability.

Hadoop Eco system-

Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed in an efficient manner with the help of traditional methodology such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large, sensitive data sets which need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to
solve the big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and
Hadoop Common Utilities. Most of the tools or solutions are used to supplement or support
these major elements. All these tools work collectively to provide services such as absorption,
analysis, storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
Hive Architecture-

The following architecture explains the flow of submission of query into Hive.
Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-

o Thrift Server - It is a cross-language service provider platform that serves the request
from all those programming languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver (a minimal usage sketch follows this list).
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.
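As an illustration of the JDBC client path, here is a minimal, hedged Java sketch that connects to a HiveServer2 instance and runs a query. The host, port, database, and table name (localhost:10000, default, employees) are assumptions for a local setup, and it uses the HiveServer2 driver class org.apache.hive.jdbc.HiveDriver:

 Java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (assumes hive-jdbc is on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connection URL: host, port, and database are placeholders for a local setup.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // The employees table is hypothetical and used only for illustration.
             ResultSet rs = stmt.executeQuery("SELECT name, grade FROM employees LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}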

Hive Services

The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.

Differences between Hadoop and RDBMS

Data Structure: Hadoop handles structured and unstructured data; an RDBMS handles primarily structured data.
Scalability: Hadoop scales horizontally by adding commodity hardware; an RDBMS scales vertically by enhancing a single server.
Processing Paradigm: Hadoop uses batch processing with MapReduce or Spark; an RDBMS uses interactive querying with SQL.
Use Cases: Hadoop suits big data analytics, log processing, and data lakes; an RDBMS suits transactional applications and relational data.
Cost: Hadoop uses open-source software and commodity hardware; an RDBMS involves licensing fees and hardware upgrades.
What is HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failures and high availability for parallel applications.

It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.

Where to use HDFS

o Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.


o Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

o Low Latency data access: Applications that require very low latency to access the first record should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the first record.
o Lots Of Small Files: The name node holds the metadata of files in memory, and if the files are small in size, this metadata takes up a lot of the name node's memory, which is not feasible.
o Multiple Writes: It should not be used when we have to write multiple times.

HDFS Concepts

1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata consists of file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations like opening, closing, renaming, etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion, and replication as directed by the name node.
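To make the block and replication concepts above concrete, here is a minimal, hedged Java sketch using the Hadoop FileSystem API to write and then read a small file in HDFS; the name node URI hdfs://localhost:9000 and the path /user/demo/sample.txt are assumptions for a local single-node setup:

 Java

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed name node address for a local single-node cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/user/demo/sample.txt");

        // Write a small file; HDFS splits larger files into blocks behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print its contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}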

Hadoop YARN Architecture

YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.

YARN basically separates the resource management layer from the processing layer. The responsibilities of the Hadoop 1.0 Job Tracker are now split between the Resource Manager and the per-application Application Master.
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite necessary to manage
the available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-

 Scalability: The scheduler in Resource manager of YARN architecture allows Hadoop to


extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports the existing map-reduce applications without disruptions
thus making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of
multi-tenancy.

The main components of YARN architecture include:

 Client: It submits map-reduce jobs.


 Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
o Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
o Application manager: It is responsible for accepting the application and
negotiating the first container from the resource manager. It also restarts the
Application Master container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to a framework. The
application master is responsible for negotiating resources with the resource manager,
tracking the status and monitoring progress of a single application. The application master
requests the container from the node manager by sending a Container Launch
Context(CLC) which includes everything an application needs to run. Once the application
is started, it sends the health report to the resource manager from time-to-time.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a
single node. The containers are invoked by Container Launch Context(CLC) which is a
record that contains information such as environment variables, security tokens,
dependencies etc.

Application workflow in Hadoop YARN:

1. Client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master un-registers with the Resource Manager

Advantages :

 Flexibility: YARN offers flexibility to run various types of distributed processing systems
such as Apache Spark, Apache Flink, Apache Storm, and others. It allows multiple
processing engines to run simultaneously on a single Hadoop cluster.
 Resource Management: YARN provides an efficient way of managing resources in the
Hadoop cluster. It allows administrators to allocate and monitor the resources required by
each application in a cluster, such as CPU, memory, and disk space.
 Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in
a cluster. It can scale up or down based on the requirements of the applications running on
the cluster.
 Improved Performance: YARN offers better performance by providing a centralized
resource management system. It ensures that the resources are optimally utilized, and
applications are efficiently scheduled on the available resources.
 Security: YARN provides robust security features such as Kerberos authentication, Secure
Shell (SSH) access, and secure data transmission. It ensures that the data stored and
processed on the Hadoop cluster is secure.

Disadvantages :

 Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional


configurations and settings, which can be difficult for users who are not familiar with
YARN.
 Overhead: YARN introduces additional overhead, which can slow down the performance
of the Hadoop cluster. This overhead is required for managing resources and scheduling
applications.
 Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can
be caused by resource allocation, application scheduling, and communication between
components.
 Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If
YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need
to set up a backup YARN instance for high availability.
 Limited Support: YARN has limited support for non-Java programming languages.
Although it supports multiple processing engines, some engines have limited language
support, which can limit the usability of YARN in certain environments.
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing of large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, lowering overhead on the cluster network and reducing the processing power required. The MapReduce task is mainly divided into two phases, the Map phase and the Reduce phase.

MapReduce Architecture:
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do which is
comprised of so many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client will submit the job of a particular size to the
Hadoop MapReduce Master. Now, the MapReduce master will divide this job into further
equivalent job-parts. These job-parts are then made available for the Map and Reduce Task.
This Map and Reduce task will contain the program as per the requirement of the use-case that
the particular company is solving. The developer writes their logic to fulfill the requirement
that the industry requires. The input data which we are using is then fed to the Map Task and
the Map will generate intermediate key-value pair as its output. The output of Map i.e. these
key-value pairs are then fed to the Reducer and the final output is stored on the HDFS. There
can be n number of Map and Reduce tasks made available for processing the data as per the
requirement. The algorithms for Map and Reduce are written in a very optimized way so that the time complexity and space complexity are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests its main use is to map the input data in key-value pairs. The
input to the map may be a key-value pair where the key can be the id of some kind of
address and value is the actual value that it keeps. The Map() function will be executed in
its memory repository on each of these input key-value pairs and generates the intermediate
key-value pair which works as input for the Reducer or Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
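As a concrete illustration of the two phases, here is a hedged sketch of the classic word-count job written against the Hadoop MapReduce Java API; the class names (WordCount, TokenizerMapper, SumReducer) are illustrative, and the input and output paths are taken from the command line:

 Java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in a line of input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, sum the counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}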
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all the jobs across
the cluster and also to schedule each map on the Task Tracker running on the same data
node since there can be hundreds of data nodes available in the cluster.

2. Task Tracker: The Task Tracker can be considered as the actual slaves that are working
on the instruction given by the Job Tracker. This Task Tracker is deployed on each of the
nodes available in the cluster that executes the Map and Reduce task as instructed by Job
Tracker.
There is also one important component of MapReduce Architecture known as Job History
Server. The Job History Server is a daemon process that saves and stores historical
information about the task or application, like the logs which are generated during or after the
job execution are stored on Job History Server.
UNIT-3
Introduction to Hadoop, Apache Hive
The major components of Hive and its interaction with the Hadoop is demonstrated in the
figure below and all the components are described further:
 User Interface (UI) –
As the name describes, the user interface provides an interface between the user and Hive. It enables the user to submit queries and other operations to the system. The Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server) are supported by the user interface.

 Hive Server – It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.
 Driver –
The user's queries are received from the interface by the driver within Hive. The driver implements the concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.

 Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.

 Metastore –
All the structural information of the different tables and partitions in the warehouse, including attributes and attribute-level information, is stored in the metastore, along with the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored. Hive selects the corresponding database server to store the schema or metadata of databases, tables, attributes in a table, data types of databases, and HDFS mappings.

 Execution Engine –
The execution engine carries out the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the suitable system components.
Diagram – Architecture of Hive that is built on the top of Hadoop
The diagram above shows the architecture along with the step-by-step job execution flow in Hive with Hadoop.
 Step-1: Execute Query –
A Hive interface such as the command line or web UI delivers the query to the driver for execution. Here, the UI calls the execute interface of the driver over ODBC or JDBC.

 Step-2: Get Plan –
The driver creates a session handle for the query and passes the query to the compiler to make an execution plan. In other words, the driver interacts with the compiler.

 Step-3: Get Metadata –
The compiler sends a metadata request to the metastore and receives the necessary metadata from it.

 Step-4: Send Metadata –
The metastore sends the metadata back to the compiler as an acknowledgment.

 Step-5: Send Plan –
The compiler sends the execution plan it has created back to the driver so the query can be executed.

 Step-6: Execute Plan –
The execution plan is sent to the execution engine by the driver.
o Execute Job
o Job Done
o DFS operation (Metadata Operation)

 Step-7: Fetch Results –
The user interface (UI) fetches the results from the driver.

 Step-8: Send Results –
The driver sends the fetch request to the execution engine. When the result is retrieved from the data nodes, the execution engine returns it to the driver, which passes it on to the user interface (UI).

Advantages of Hive Architecture:

Scalability: Hive is a distributed system that can easily scale to handle large volumes of data
by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the need for
complex programming skills. SQL-like language is used for queries and HiveQL is based on
SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the Hadoop
ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports various data
formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and encryption
to ensure data privacy.

Disadvantages of Hive Architecture:

High Latency: Hive’s performance is slower compared to traditional databases because of the
overhead of running queries in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it is designed
for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in Hadoop, SQL,
and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations, such as transactions
and indexes, which may limit the usefulness of the tool for certain applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are executed
across a distributed system, and errors may occur in different nodes.

HIVE Data Types


Hive data types are categorized in numeric types, string types, misc types, and complex types. A
list of Hive data types is given below.

Integer Types

Type       Size                      Range

TINYINT    1-byte signed integer     -128 to 127

SMALLINT   2-byte signed integer     -32,768 to 32,767

INT        4-byte signed integer     -2,147,483,648 to 2,147,483,647

BIGINT     8-byte signed integer     -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Type

Type      Size      Description

FLOAT     4-byte    Single-precision floating-point number

DOUBLE    8-byte    Double-precision floating-point number
Date/Time Types

TIMESTAMP

o It supports traditional UNIX timestamp with optional nanosecond precision.


o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with
decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff"
(9 decimal place precision)
DATES

The DATE value is used to specify a particular year, month, and day in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the DATE type lies between 0000-01-01 and 9999-12-31.

String Types

STRING

The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").

Varchar

The varchar is a variable-length type. Its length specifier ranges between 1 and 65535 and gives the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.

Complex Type

Type     Description                                                    Example

Struct   Similar to a C struct or an object; fields are accessed       struct('James','Roy')
         using the "dot" notation.

Map      A collection of key-value tuples; fields are accessed         map('first','James','last','Roy')
         using array ([]) notation.

Array    A collection of values of the same type, indexable            array('James','Roy')
         using zero-based integers.
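To make the complex types above concrete, the following sketch keeps a sample CREATE TABLE statement in a Python string; the table name, columns, and delimiters are hypothetical and only illustrate how struct, map, and array typically appear in Hive DDL.

# A minimal sketch (the "employee_profile" table, its columns, and the
# delimiters are hypothetical) showing complex types in a CREATE TABLE.
create_stmt = """
CREATE TABLE employee_profile (
  name     STRING,
  address  STRUCT<city:STRING, zip:INT>,   -- accessed as address.city
  phone    MAP<STRING, STRING>,            -- accessed as phone['home']
  skills   ARRAY<STRING>                   -- accessed as skills[0]
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
"""
print(create_stmt)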

Hive Query Language-

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement with
WHERE clause.

The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression that fulfils the condition.

Syntax

Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...


FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example

Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

On successful execution of the query, you get to see the following response:

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+

JDBC Program

The JDBC program to apply where clause for the given example is as follows.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {


private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

public static void main(String[] args) throws SQLException, ClassNotFoundException {

// Register driver and create driver instance


Class.forName(driverName);

// get connection
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "",
"");

// create statement
Statement stmt = con.createStatement();

// execute statement
ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");

System.out.println("Result:");
System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

while (res.next()) {
System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " +
res.getString(4) + " " + res.getString(5));
}
con.close();
}
}

Save the program in a file named HiveQLWhere.java. Use the following commands to compile
and execute this program.

$ javac HiveQLWhere.java
$ java HiveQLWhere

Output:

ID Name Salary Designation Dept


1201 Gopal 45000 Technical manager TP
1202 Manisha 45000 Proofreader PR
1203 Masthanvali 40000 Technical writer TP
1204 Krian 40000 Hr Admin HR
Introduction to Pig

Pig tutorial provides basic and advanced concepts of Pig. Our Pig tutorial is designed for
beginners and professionals.
Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. It was developed by Yahoo. The language for Pig is Pig Latin.

Our Pig tutorial includes all topics of Apache Pig with Pig usage, Pig Installation, Pig Run
Modes, Pig Latin concepts, Pig Data Types, Pig example, Pig user defined functions etc.

What is Apache Pig

Apache Pig is a high-level data flow platform for executing MapReduce programs of Hadoop.
The language used for Pig is Pig Latin.

The Pig scripts get internally converted to Map Reduce jobs and get executed on data stored in
HDFS. Apart from that, Pig can also execute its job in Apache Tez or Apache Spark.

Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java in MapReduce.

Features of Apache Pig

Let's see the various uses of Pig technology.

1) Ease of programming

Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce jobs internally.

2) Optimization opportunities

The way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

3) Extensibility

Users can write user-defined functions (UDFs) containing their own logic to execute over the data set.

4) Flexible

It can easily handle structured as well as unstructured data.

5) In-built operators
It contains various types of operators such as sort, filter, and join.

Differences between Apache MapReduce and PIG

 Apache MapReduce is a low-level data processing tool, whereas Apache Pig is a high-level data flow tool.
 In MapReduce, it is required to develop complex programs using Java or Python, whereas in Pig it is not required to develop complex programs.
 It is difficult to perform data operations in MapReduce, whereas Pig provides built-in operators to perform data operations like union, sorting, and ordering.
 MapReduce doesn't allow nested data types, whereas Pig provides nested data types like tuple, bag, and map.

Advantages of Apache Pig

o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - The Pig provides a useful concept of nested data types like tuple, bag,
and map.

ETL Processing-

INTRODUCTION:

1. ETL stands for Extract, Transform, Load and it is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a data
warehouse, and then load it into the warehouse. The process of ETL can be broken down
into the following three stages:
2. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
3. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.
5. The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data warehouse
is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format
required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as
Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms it
in the staging area, and then finally, loads it into the Data Warehouse system.

Let us understand each step of the ETL process in-depth:


1. Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in various formats like relational databases, NoSQL stores, XML, and flat files, into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, and not directly in the data warehouse, because the extracted data comes in various formats and can also be corrupted. Loading it directly into the data warehouse may therefore damage it, and rollback would be much more difficult. Hence, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions are applied on the extracted data to convert it into a single standard format. It
may involve following processes/tasks:
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States, and America into USA, etc.
 Joining – joining multiple attributes into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is updated by loading into the
data warehouse very frequently and sometimes it is done after longer but regular intervals.
The rate and period of loading solely depends on the requirements and varies from system
to system.
The ETL process can also use the pipelining concept, i.e., as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. And while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
The block diagram of the pipelining of ETL process is shown below:

ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder,
CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift,
BigQuery, and Firebolt.
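As a minimal illustration of the three ETL stages described above, here is a hedged pandas sketch; the file names, column names, and the SQLite target are hypothetical stand-ins for real source systems and a real warehouse.

# Minimal ETL sketch with pandas (assumptions: a CSV source "sales_raw.csv"
# with "country", "amount", "order_date" columns, and a SQLite file acting
# as the "warehouse" -- all hypothetical).
import pandas as pd
import sqlite3

# Extract: read raw data from a source system into a staging DataFrame.
staging = pd.read_csv("sales_raw.csv")

# Transform: clean, standardize, and derive new fields.
staging["country"] = staging["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})
staging["amount"] = staging["amount"].fillna(0)            # cleaning NULLs
staging["order_date"] = pd.to_datetime(staging["order_date"])
staging["order_year"] = staging["order_date"].dt.year      # new data field

# Load: write the transformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    staging.to_sql("sales_fact", conn, if_exists="append", index=False)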
Advantages and Disadvantages:

Advantages of ETL process in data warehousing:

1. Improved data quality: ETL process ensures that the data in the data warehouse is
accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from multiple sources and
systems, making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security by controlling
access to the data warehouse and ensuring that only authorized users can access the data.
4. Improved scalability: ETL process can help to improve scalability by providing a way to
manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and simplify the ETL
process, reducing the time and effort required to load and update data in the warehouse.

Disadvantages of ETL process in data warehousing:

1. High cost: ETL process can be expensive to implement and maintain, especially for
organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement, especially for
organizations that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be
able to handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it may not be
able to handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as large
amounts of data are collected, stored, and analyzed.

Pig Data Types


Apache Pig supports many data types. A list of Apache Pig Data Types with description and
examples are given below.

Type        Description                      Example

int         Signed 32-bit integer            2

long        Signed 64-bit integer            15L or 15l

float       32-bit floating point            2.5f or 2.5F

double      64-bit floating point            1.5 or 1.5e2 or 1.5E2

chararray   Character array (string)         hello javatpoint

bytearray   BLOB (byte array)

tuple       Ordered set of fields            (12,43)

bag         Collection of tuples             {(12,43),(54,28)}

map         A set of key-value pairs         [open#apache]

Execution model of Pig-

Apache Pig Run Modes


Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode

o It executes in a single JVM and is used for development, experimentation, and prototyping.
o Here, files are installed and run using localhost.
o Local mode works on the local file system; the input and output data are stored in the local file system.
The command for local mode grunt shell:

1. $ pig -x local

MapReduce Mode

o The MapReduce mode is also known as Hadoop Mode.


o It is the default mode.
o In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
o It can be executed against semi-distributed or fully distributed Hadoop installation.
o Here, the input and output data are present on HDFS.
The command for Map reduce mode:

1. $ pig
Or,

1. $ pig -x mapreduce

Ways to execute Pig Program

These are the following ways of executing a Pig program on local and MapReduce mode: -

o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter Pig Latin statements and commands interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension. These files
contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions. These functions can
be called as UDF (User Defined Functions). Here, we use programming languages like
Java and Python.
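As a hedged sketch of batch mode, the snippet below writes a tiny Pig Latin script to a file and launches it in local mode with the pig command shown earlier; the script name, input file, and schema are hypothetical, and the pig binary is assumed to be on the PATH.

# Minimal batch-mode sketch (assumptions: "pig" is on PATH, and an input
# file "student.txt" with tab-separated id/name fields exists).
import subprocess

pig_script = """
students = LOAD 'student.txt' USING PigStorage('\\t') AS (id:int, name:chararray);
DUMP students;
"""

with open("sample.pig", "w") as f:
    f.write(pig_script)

# Equivalent to running "$ pig -x local sample.pig" from the shell.
subprocess.run(["pig", "-x", "local", "sample.pig"], check=True)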
Apache Pig Operators with Syntax and Examples
There is a huge set of Apache Pig Operators available in Apache Pig. In this article,
“Introduction to Apache Pig Operators” we will discuss all types of Apache Pig Operators in
detail.
Such as Diagnostic Operators, Grouping & Joining, Combining & Splitting and many
more. They also have their subtypes.
So, here we will discuss each Apache Pig Operators in depth along with syntax and their
examples.
What are Apache Pig Operators?

We have a huge set of Apache Pig Operators, for performing several types of Operations. Let’s
discuss types of Apache Pig Operators:
1. Diagnostic Operators
2. Grouping & Joining
3. Combining & Splitting
4. Filtering
5. Sorting

FUNCTIONS

Apache Pig provides various built-in functions namely eval, load, store, math, string,
bag and tuple functions.

Eval Functions

Given below is the list of eval functions provided by Apache Pig.

S.N. Function & Description

AVG()
1
To compute the average of the numerical values within a bag.

2 BagToString()
To concatenate the elements of a bag into a string. While concatenating, we can
place a delimiter between these values (optional).

CONCAT()
3
To concatenate two or more expressions of same type.

COUNT()
4 To get the number of elements in a bag, while counting the number of tuples in a
bag.

COUNT_STAR()
5 It is similar to the COUNT() function. It is used to get the number of elements in a
bag.

DIFF()
6
To compare two bags (fields) in a tuple.

IsEmpty()
7
To check if a bag or map is empty.

MAX()
8 To calculate the highest value for a column (numeric values or chararrays) in a
single-column bag.

MIN()
9 To get the minimum (lowest) value (numeric or chararray) for a certain column in a
single-column bag.

PluckTuple()
10 Using the Pig Latin PluckTuple() function, we can define a string Prefix and filter
the columns in a relation that begin with the given prefix.

SIZE()
11
To compute the number of elements based on any Pig data type.

SUBTRACT()
12 To subtract two bags. It takes two bags as inputs and returns a bag which contains
the tuples of the first bag that are not in the second bag.

SUM()
13
To get the total of the numeric values of a column in a single-column bag.

14 TOKENIZE()
To split a string (which contains a group of words) in a single tuple and return a bag
which contains the output of the split operation.
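To show a few of these eval functions in use, here is a hedged Pig Latin fragment kept inside a Python string; the relation names, the employee.csv input file, and its schema are hypothetical and only illustrate COUNT(), AVG(), and MAX().

# Hypothetical Pig Latin fragment demonstrating COUNT(), AVG(), and MAX()
# (the employee.csv file and its schema are assumptions for illustration).
pig_eval_demo = """
emp = LOAD 'employee.csv' USING PigStorage(',')
      AS (id:int, name:chararray, salary:double);
grp = GROUP emp ALL;
stats = FOREACH grp GENERATE COUNT(emp), AVG(emp.salary), MAX(emp.salary);
DUMP stats;
"""
print(pig_eval_demo)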
UNIT-4

NoSQL Databases-
We know that MongoDB is a NoSQL database, so it is necessary to know about NoSQL databases to understand MongoDB thoroughly.

What is NoSQL Database

Databases can be divided into 3 types:

1. RDBMS (Relational Database Management System)


2. OLAP (Online Analytical Processing)
3. NoSQL (recently developed database)

NoSQL Database

The term NoSQL Database refers to a non-SQL or non-relational database.

It provides a mechanism for the storage and retrieval of data other than the tabular relations model used in relational databases. A NoSQL database doesn't use tables for storing data. It is generally used to store big data and data for real-time web applications.

History behind the creation of NoSQL Databases

In the early 1970s, flat file systems were used. Data was stored in flat files, and the biggest problem with flat files was that each company implemented its own format; there were no standards. It was very difficult to store data in and retrieve data from these files because there was no standard way to do so.

Then the relational database was created by E. F. Codd, and these databases answered the question of having no standard way to store data. But later the relational database also ran into a problem: it could not handle big data. Due to this problem, there was a need for a database that could handle every type of problem, and so the NoSQL database was developed.

Advantages of NoSQL

o It supports query language.


o It provides fast performance.
o It provides horizontal scalability.
NoSQL Data architectural patterns

Architecture Pattern is a logical way of categorizing data that will be stored on the
Database. NoSQL is a type of database which helps to perform operations on big data and store
it in a valid format. It is widely used because of its flexibility and a wide variety of services.

Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data is stored in the form of key-value pairs. The key is usually a sequence of strings, integers, or characters but can also be a more advanced data type. The value is typically linked or co-related to the key. Key-value pair storage databases generally store data as a hash table where each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object), strings, etc.). This type of pattern is usually used in shopping websites or e-commerce applications.

Advantages:
 Can handle large amounts of data and heavy load,
 Easy retrieval of data by keys.
Limitations:
 Complex queries may attempt to involve multiple key-value pairs which may delay
performance.
 Data can be involving many-to-many relationships which may collide.
Examples:
 DynamoDB
 Berkeley DB
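The key-value pattern can be sketched with a plain Python dictionary acting as the hash table described above; this is only an illustration of the access pattern, not how DynamoDB or Berkeley DB are actually implemented, and the key and value names are invented.

# Toy key-value store: a hash table mapping unique keys to arbitrary values
# (strings, JSON-like dicts, binary blobs, ...). Illustrative only.
class KeyValueStore:
    def __init__(self):
        self._table = {}                    # each key is unique

    def put(self, key, value):
        self._table[key] = value            # value can be of any type

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)

# Typical e-commerce style usage: look items up directly by key.
store = KeyValueStore()
store.put("cart:user42", {"items": ["book", "pen"], "total": 350})
print(store.get("cart:user42"))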
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work only on columns. They store large amounts of data in columns together. The format and titles of the columns can diverge from one row to another. Every column is treated separately, but each individual column may still contain multiple other columns like traditional databases. Basically, columns are the mode of storage in this type.
Advantages:
 Data is readily available
 Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
 HBase
 Bigtable by Google
 Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be described as a complex data structure. A document can be in the form of text, arrays, strings, JSON, XML, or any such format. The use of nested documents is also very common. It is very effective, as most of the data created is usually in the form of JSON and is unstructured.
Advantages:
 This type of format is very useful and apt for semi-structured data.
 Storage retrieval and managing of documents is easy.
Limitations:
 Handling multiple documents is challenging
 Aggregation operations may not work accurately.
Examples:
 MongoDB
 CouchDB
Figure – Document Store Model in form of JSON documents
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph. This pattern is very commonly used in social networks where there are a large number of entities and each entity has one or many characteristics which are connected by edges. The relational database pattern has tables that are loosely connected, whereas graphs are often very strong and rigid in nature.
Advantages:
 Fastest traversal because of connections.
 Spatial data can be easily handled.
Limitations:
 Wrong connections may lead to infinite loops.
Examples:
 Neo4J
 FlockDB( Used by Twitter)
Figure – Graph model format of NoSQL Databases

What is MongoDB?

 MongoDB, the most popular NoSQL database, is an open-source document-oriented database. The term 'NoSQL' means 'non-relational'.
 It means that MongoDB isn’t based on the table-like relational database structure but
provides an altogether different mechanism for the storage and retrieval of data. This
format of storage is called BSON ( similar to JSON format).
A simple MongoDB document Structure:
{
title: 'Geeksforgeeks',
by: 'Harshit Gupta',
url: 'https://www.geeksforgeeks.org',
type: 'NoSQL'
}
 SQL databases store data in tabular format. This data is stored in a predefined data model
which is not very much flexible for today’s real-world highly growing applications.
 Modern applications are more networked, social and interactive than ever.
Applications are storing more and more data and are accessing it at higher rates.
 Relational Database Management System(RDBMS) is not the correct choice when it
comes to handling big data by the virtue of their design since they are not horizontally
scalable. If the database runs on a single server, then it will reach a scaling limit.
 NoSQL databases are more scalable and provide superior performance. MongoDB is such a
NoSQL database that scales by adding more and more servers and increases productivity
with its flexible document model.
RDBMS vs MongoDB
 RDBMS has a typical schema design that shows number of tables and the relationship
between these tables whereas MongoDB is document-oriented. There is no concept of
schema or relationship.
 Complex transactions are not supported in MongoDB because complex join operations are
not available.
 MongoDB allows a highly flexible and scalable document structure. For example, one data
document of a collection in MongoDB can have two fields whereas the other document in
the same collection can have four.
 MongoDB is faster as compared to RDBMS due to efficient indexing and storage
techniques.
 There are a few terms that are related in both databases. What’s called Table in RDBMS is
called a Collection in MongoDB. Similarly, a Row is called a Document and a Column is
called a Field. MongoDB provides a default ‘_id’ (if not provided explicitly) which is a 12-
byte hexadecimal number that assures the uniqueness of every document. It is similar to the
Primary key in RDBMS.
MongoDB database features
 Document Oriented: MongoDB stores the main subject in the minimal number of
documents and not by breaking it up into multiple relational structures like RDBMS. For
example, it stores all the information of a computer in a single document called Computer
and not in distinct relational structures like CPU, RAM, Hard disk etc.
 Indexing: Without indexing, a database would have to scan every document of a collection
to select those that match the query which would be inefficient. So, for efficient searching
Indexing is a must and MongoDB uses it to process huge volumes of data in very less time.
 Scalability: MongoDB scales horizontally using sharding (partitioning data across
various servers). Data is partitioned into data chunks using the shard key and these data
chunks are evenly distributed across shards that reside across many physical servers. Also,
new machines can be added to a running database.
 Replication and High Availability: MongoDB increases the data availability with
multiple copies of data on different servers. By providing redundancy, it protects the
database from hardware failures. If one server goes down, the data can be retrieved easily
from other active servers which also had the data stored on them.
 Aggregation: Aggregation operations process data records and return the computed results. It is similar to the GROUP BY clause in SQL. A few aggregation expressions are sum, avg, min, max, etc.
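A brief hedged sketch with the official PyMongo driver shows the schema-less inserts, queries, and aggregation described above; it assumes a MongoDB server on localhost:27017, and the database, collection, and field names are hypothetical.

# Minimal PyMongo sketch (assumptions: MongoDB at localhost:27017; the
# "blogdb"/"articles" names and fields are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["blogdb"]["articles"]

# Documents are schema-less: the two inserts below have different fields.
collection.insert_one({"title": "Geeksforgeeks", "type": "NoSQL", "likes": 10})
collection.insert_one({"title": "Hive notes", "tags": ["hadoop", "sql"]})

# Query by field, similar to a WHERE clause.
for doc in collection.find({"type": "NoSQL"}):
    print(doc["title"])

# Aggregation: group and sum, similar to GROUP BY in SQL.
pipeline = [{"$group": {"_id": "$type", "total_likes": {"$sum": "$likes"}}}]
print(list(collection.aggregate(pipeline)))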

Where do we use MongoDB?

MongoDB is preferred over RDBMS in the following scenarios:


 Big Data: If we have huge amount of data to be stored in tables, think of MongoDB before
RDBMS databases. MongoDB has built-in solution for partitioning and sharding
our database.
 Unstable Schema: Adding a new column in an RDBMS is hard, whereas MongoDB is schema-less. Adding a new field does not affect old documents and is very easy.
 Distributed Data: Since multiple copies of data are stored across different servers, recovery of data is instant and safe even if there is a hardware failure.
Language Support by MongoDB
MongoDB currently provides official driver support for all popular programming languages like C, C++, Rust, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala, Go, and Erlang.
UNIT-5

What is Social Media Mining?

Social media has been around for 30 years, but the rise in its user base is recent. The data from social media platforms can be used to boost business, and this is where Social Media Mining takes over. The amount of data these companies have to deal with is huge and scattered. Social Media Mining helps in extracting meaning from this data.
It is a form of knowledge discovery from databases. It is not wrong to say that social media is the biggest contributor to Big Data. The records themselves are not new; they have been around for a long time. However, the capability to process these statistics has developed. Social media mining can help us get insights into customer behavior and interests, and using this information, you can serve customers better and compound your earnings.
Benefits of Social Media Mining
We post a lot of information on social media. In this Era where algorithms are everywhere, they
can generate information about the future trends and habits of the users, as this plays a major
role in today’s world. Social Media mining has become a must-have technique in every business.
Here are some of the benefits you can derive from using Social Media Mining :
1. Spot Trends Before They Become Trends
The data available from social media platforms can give important insights regarding society and user behavior that were not possible earlier; finding them used to be like finding a needle in a haystack. In today's world, the data is corroborative and evolving with time, and there are multiple needles to look for. Social media data mining is a technique that is capable of finding them all.
It is a process that starts with identifying the target audience and ends with digging into what
they are passionate about. Businesses may analyze the keywords, search results, comments,
and mentions to identify the current trend, and a deeper study of behavior change can also help
in predicting future trends. This data is very useful for businesses to make informed decisions
when the stakes are high.
2. Sentiment Analysis
Sentiment Analysis is the process of identifying positive or negative sentiments portrayed in
information posted on social media platforms. Businesses use Social Media Mining to identify
the same sentiments associated with their brand and product lines.
Sentiment Analysis has a vast application, and its use cannot be limited to self-evaluation only.
Negative sentiment about competitors can be an opportunity to win their customers. The Nestlé Maggi ban is a perfect example of this; competitors in the noodles market used strategies to market their products as made from healthier alternatives. Patanjali saw this opportunity and launched its noodles, claiming they were made from atta (whole wheat flour), while the noodles market was full of refined wheat flour (maida) noodles.
When combined with social media monitoring, sentiment analysis can help you analyze
your brand image and bring negative aspects of the business to your attention. With this
information, you can address the negative sentiments and prioritize them so that they can be
addressed properly to improve the customer experience.
3. Keyword Identification
In a world where more than 90% of businesses function online, the importance of using the right words cannot be emphasized enough. The business has to stand out to compete in a world where your sales team cannot charm customers with their looks and cheesy talk. Keywords can give your business an edge over its competitors.
Keywords are those words that reveal the behavior of users and highlight the frequently used and
popular terms related to their products. Social Media Data Mining can be highly effective in
finding these keywords. The process is as basic as scanning the list of the most frequent words
or phrases used by customers to search for or define your product.
Using these keywords to define your product in digital media and implementing SEO can yield pretty good results. Your product will rank higher, and by implementing frequent and popular terms, you can make your product listings better.
4. Create a Better Product
Before the use of Big Data, businesses used to conduct individual surveys to know the public’s
opinion about their product. They faced many challenges; people didn’t entertain them, and even
if someone participated, it was very likely that their answers were not credible. With the
implementation of Social Media Data Mining, the public is responding and participating in
surveys without even realizing it, which provides companies with candid data.
Using the processed data, you can identify the things that bother customers and gain insights about how you can improve your product to make it even better. In other words, you are seeking advice and opinions from millions of users. By using so much data, you are essentially tweaking your product in such a way that the probability of its success is very high. By analyzing the user-base information, you can target the social media platform with the highest number of users.
5. Competitor Analysis
You are not wrong to assume that your competitors are already using data mining techniques to monitor the market, and to compete with them it becomes essential for you to do the same. Improving yourself by analyzing others' mistakes is often less painful than learning from your own.
There’s nothing wrong with following the footprints of a good competitor. You might not make a
fortune, but it will still help you survive tough times. Analyzing competitor behavior on social
media during the launch of a product will help you define a trend and use it to your
advantage.
Posts by competitor employees and management regarding hiring may give you an idea of the
expansion of business or even a subtle change in operations will help you to be proactive.
Having an idea of when to stay on your toes is advantageous in highly competitive industries.
6. Event Identification
Also known as Social Heat Mapping, this technique is a part of social media mining that helps researchers and agencies to be prepared for unexpected outbursts.
An excellent example of implementing heat mapping on social media was seen during the Farmer Protests, when huge crowds were approaching the venue of the Republic Day celebration.
7. Manage Real-time Events
This approach is mainly used for events, incidents, or any issues that occur on social media. Researchers and government departments identify big issues by using heat mapping or other techniques to access social media sources. They detect events and figure out information faster than traditional sensor approaches. Many users publish information from their cell phones, so event identification is real-time and up to date. Organizations can respond faster as people share information during disasters or social events.
8. Provide Useful Content (and Stop Spamming)
Social media is very close to modern life. Computer algorithms are used to help companies share information in the way users prefer and to avoid spam. This can help organizations identify small patterns and recognize customers who might be interested in their products. Even social media platforms can use such techniques to handle problematic content that gets reported. As a result, social media mining helps deliver relevant content and keeps users safe.
9. Recognize Behavior
Social media mining analyzes our real behavior even when we are not present and helps us learn about humans. Organizations use these techniques to understand customers, governments and companies use them to identify the right people, and scientists use them to explain events. Therefore, social media mining helps us understand how events link together in ways we may not have noticed earlier.

Social Networks as a Graph

A social network graph is a graph where the nodes represent people and the lines between
nodes, called edges, represent social connections between them, such as friendship or working
together on a project. These graphs can be either undirected or directed. For instance, Facebook
can be described with an undirected graph since the friendship is bidirectional, Alice and Bob
being friends is the same as Bob and Alice being friends. On the other hand, Twitter can be
described with a directed graph: Alice can follow Bob without Bob following Alice.

Social networks tend to have characteristic network properties. For instance, there tends to be a
short distance between any two nodes (as in the famous six degrees of separation study where
everyone in the world is at most six degrees away from any other), and a tendency to form
"triangles" (if Alice is friends with Bob and Carol, Bob and Carol are more likely to be friends
with each other.)
Social networks are important to social scientists interested in how people interact as well as
companies trying to target consumers for advertising. For instance if advertisers connect up three
people as friends, co-workers, or family members, and two of them buy the advertiser's product,
then they may choose to spend more in advertising to the third hold-out, on the belief that this
target has a high propensity to buy their product.
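A small hedged sketch with the NetworkX library illustrates the two properties mentioned above (short average path length and a tendency to form triangles) on a toy friendship graph; the names and edges are invented for the example.

# Toy undirected friendship graph (names and edges are invented).
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),   # a triangle
    ("Carol", "Dave"), ("Dave", "Eve"),
])

# Short distances between nodes ("small world" behaviour).
print(nx.average_shortest_path_length(G))

# Tendency to form triangles, measured by the clustering coefficient.
print(nx.average_clustering(G))

# A directed version (e.g., Twitter-style "follows") would use nx.DiGraph().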

Types of Social Networks

Social networks are the networks that depict the relations between people in the form of a
graph for different kinds of analysis. The graph to store the relationships of people is known as
Sociogram. All the graph points and lines are stored in the matrix data structure called
Sociomatrix. The relationships can be of any kind, such as kinship, friendship, enmity, acquaintance, colleagues, neighbors, disease transmission, etc.
Social Network Analysis (SNA) is the process of exploring or examining the social structure
by using graph theory. It is used for measuring and analyzing the structural properties of the
network. It helps to measure relationships and flows between groups, organizations, and other
connected entities. We need specialized tools to study and analyze social networks.
Basically, there are two types of social networks:
 Ego network Analysis
 Complete network Analysis
1. Ego Network Analysis
Ego network Analysis is the one that finds the relationship among people. The analysis is done
for a particular sample of people chosen from the whole population. This sampling is done
randomly to analyze the relationship. The attributes involved in this ego network analysis are a
person’s size, diversity, etc.
This analysis is done through traditional surveys. In the surveys, people are asked with whom they interact and what the nature of their relationship is. It is not focused on finding the relationship between everyone in the sample; it is an effort to find the density of the network in those samples. This hypothesis is tested using some statistical
hypothesis testing techniques.
The following functions are served by Ego Networks:
 Propagation of information efficiently.
 Sensemaking from links, For example, Social links, relationships.
 Access to resources, efficient connection path generation.
 Community detection, identification of the formation of groups.
 Analysis of the ties among individuals for social support.
2. Complete Network Analysis
Complete network analysis is the analysis used in most network analyses. It analyses the relationships among all members of the sample chosen from the large population. Subgroup analysis, centrality measures, and equivalence analysis are based on complete network analysis. This analysis helps an organization or company make decisions with the help of these relationships. Testing the sample will show the relationships in the whole network, since the sample is taken from a single set of domains.

Difference between Ego network analysis and Complete network analysis:

The difference between ego and complete network analysis is that the ego network focuses on collecting the relationships of the people in the sample with the outside world, whereas complete network analysis focuses on finding the relationships among the samples themselves.
The majority of network analyses are done only for a particular domain or one organization; they are not focused on relationships between organizations. That is why most social network analysis measures use only complete network analysis.

Clustering of Social Graphs and Direct Discovery of Communities in a Social Graph

Social Network:
When we think of a social network, we think of Facebook, Twitter, Google+, or another website
that is called a “social network,” and indeed this kind of network is representative of the broader
class of networks called “social.” The essential characteristics of a social network are:
1. There is a collection of entities that participate in the network. Typically, these entities
are people, but they could be something else entirely
2. There is at least one relationship between entities of the network. On Facebook or its ilk,
this relationship is called friends. Sometimes the relationship is all-or-nothing; two
people are either friends or they are not.
3. There is an assumption of non-randomness or locality. This condition is the hardest to
formalize, but the intuition is that relationships tend to cluster. That is, if entity A is
related to both B and C, then there is a higher probability than average that B and C are
related.
Social network as Graphs: Social networks are naturally modeled as graphs, which we
sometimes refer to as a social graph. The entities are the nodes, and an edge connects two nodes
if the nodes are related by the relationship that characterizes the network. If there is a degree
associated with the relationship, this degree is represented by labeling the edges. Often, social
graphs are undirected, as for the Facebook friends graph. But they can be directed graphs, as for
example the graphs of followers on Twitter or Google+.
Above figure is an example of a tiny social network. The entities are the nodes A through G. The
relationship, which we might think of as “friends,” is represented by the edges. For instance, B is
friends with A, C, and D.
Clustering of Social-Network Graphs:
Clustering of the graph is considered as a way to identify communities. Clustering of graphs
involves following steps:
1. Distance Measures for Social-Network Graphs
If we were to apply standard clustering techniques to a social-network graph, our first step would
be to define a distance measure. When the edges of the graph have labels, these labels might be
usable as a distance measure, depending on what they represented. But when the edges are
unlabeled, as in a “friends” graph, there is not much we can do to define a suitable distance.
Our first instinct is to assume that nodes are close if they have an edge between them and distant
if not. Thus, we could say that the distance d(x, y) is 0 if there is an edge (x, y) and 1 if there is
no such edge. We could use any other two values, such as 1 and ∞, as long as the distance is
closer when there is an edge.
2. Applying Standard Clustering Methods
There are two general approaches to clustering: hierarchical (agglomerative) and point-
assignment. Let us consider how each of these would work on a social-network graph.
Hierarchical clustering of a social-network graph starts by combining some two nodes that are
connected by an edge. Successively, edges that are not between two nodes of the same cluster
would be chosen randomly to combine the clusters to which their two nodes belong. The choices
would be random, because all distances represented by an edge are the same.
Now, consider a point-assignment approach to clustering social networks. Again, the fact that all
edges are at the same distance will introduce a number of random factors that will lead to some
nodes being assigned to the wrong cluster.
3. Betweenness:
Since there are problems with standard clustering methods, several specialized clustering
techniques have been developed to find communities in social networks. The simplest one is
based on finding the edges that are least likely to be inside the community.
Define the betweenness of an edge (a, b) to be the number of pairs of nodes x and y such that the
edge (a, b) lies on the shortest path between x and y. To be more precise, since there can be
several shortest paths between x and y, edge (a, b) is credited with the fraction of those shortest
paths that include the edge (a, b). As in golf, a high score is bad. It suggests that the edge (a, b)
runs between two different communities; that is, a and b do not belong to the same community
4. The Girvan-Newman Algorithm:
In order to exploit the betweenness of edges, we need to calculate the number of shortest paths
going through each edge. We shall describe a method called the Girvan-Newman (GN)
Algorithm, which visits each node X once and computes the number of shortest paths from X to
each of the other nodes that go through each of the edges. The algorithm begins by performing a
breadth-first search (BFS) of the graph, starting at the node X. Note that the level of each node in
the BFS presentation is the length of the shortest path from X to that node. Thus, the edges that
go between nodes at the same level can never be part of a shortest path from X.
Edges between levels are called DAG edges (“DAG” stands for directed, acyclic graph). Each
DAG edge will be part of at least one shortest path from root X. If there is a DAG edge (Y, Z),
where Y is at the level above Z (i.e., closer to the root), then we shall call Y a parent of Z and Z a
child of Y, although parents are not necessarily unique in a DAG as they would be in a tree.
5. Using betweenness to find communities:
The betweenness scores for the edges of a graph behave something like a distance measure on
the nodes of the graph. It is not exactly a distance measure, because it is not defined for pairs of
nodes that are unconnected by an edge, and might not satisfy the triangle inequality even when
defined. However, we can cluster by taking the edges in order of increasing betweenness and add
them to the graph one at a time. At each step, the connected components of the graph form some
clusters.
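A hedged NetworkX sketch of the Girvan-Newman idea on a small graph is shown below; the edge list is assumed for illustration, since the exact seven-node figure referenced earlier is not reproduced here.

# Girvan-Newman on a small graph (the edge list is an assumption; it is not
# the exact graph from the figure referenced in the text).
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("C", "A"),   # one tightly knit group
    ("D", "E"), ("E", "F"), ("F", "D"),   # another group
    ("C", "D"),                           # bridge edge with high betweenness
])

# Edge betweenness: the bridge ("C", "D") should score highest.
print(nx.edge_betweenness_centrality(G))

# Girvan-Newman repeatedly removes the highest-betweenness edge; the first
# split should separate the two triangles into two communities.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])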

Recommendation System - Machine Learning


A machine learning algorithm known as a recommendation system combines information
about users and products to forecast a user's potential interests. These systems are used in a
wide range of applications, such as e-commerce, social media, and entertainment, to provide
personalized recommendations to users.
There are several types of recommendation systems, including:

1. Content-based filtering: This type of system uses the characteristics of items that a user
has liked in the past to recommend similar items.
2. Collaborative filtering: This type of system uses the past behaviour of users to
recommend items that similar users have liked.
3. Hybrid: To generate suggestions, this kind of system combines content-based filtering
and collaborative filtering techniques.
4. Matrix Factorization: Using this method, the user-item matrix is divided into two
lower-dimension matrices that are then utilized to generate predictions.
5. Deep Learning: To train the user and item representations that are subsequently utilized
to generate recommendations, these models make use of neural networks.

The choice of which type of recommendation system to use depends on the specific application
and the type of data available.
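As a quick hedged illustration of the first type above (content-based filtering), the sketch below recommends items whose text descriptions are most similar to something the user liked, using TF-IDF and cosine similarity; the tiny catalogue is invented and unrelated to the Deskdrop dataset used later in this article.

# Minimal content-based filtering sketch (the item catalogue is invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "article_1": "hadoop hive big data warehouse queries",
    "article_2": "mongodb nosql document database",
    "article_3": "hive sql queries on hadoop clusters",
    "article_4": "cooking recipes for pasta",
}

titles = list(items.keys())
tfidf = TfidfVectorizer().fit_transform(items.values())
similarity = cosine_similarity(tfidf)

# Suppose the user liked "article_1": rank the other items by similarity.
liked = titles.index("article_1")
scores = sorted(
    ((titles[i], similarity[liked, i]) for i in range(len(titles)) if i != liked),
    key=lambda pair: pair[1], reverse=True,
)
print(scores)   # "article_3" should rank highest (shared hadoop/hive terms)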

It's worth noting that recommendation systems are widely used and can have a significant impact
on businesses and users. However, it's important to consider ethical considerations and biases
that may be introduced to the system.

In this article, we utilize a dataset from Kaggle Datasets: Articles Sharing and Reading from CI&T DeskDrop.

For the purpose of giving customers individualized suggestions, we will demonstrate how to
develop Collaborative Filtering, Content-Based Filtering, and Hybrid techniques in Python.

Details About Dataset

The Deskdrop dataset comes from CI&T's internal communication platform (DeskDrop) and is an actual sample of 12 months' worth of logs (from March 2016 to February 2017). It contains around 73k logged user interactions on more than 3k publicly shared articles. Two CSV files make up the dataset:

o shared_articles.csv
o Users_interactions.csv

Now, we will try to implement it in the code.

Importing Libraries

1. import sklearn
2. import scipy
3. import numpy as np
4. import random
5. import pandas as pd
6. from nltk.corpus import stopwords
7. from scipy.sparse import csr_matrix
8. from sklearn.model_selection import train_test_split
9. from sklearn.metrics.pairwise import cosine_similarity
10. from scipy.sparse.linalg import svds
11. from sklearn.feature_extraction.text import TfidfVectorizer
12. from sklearn.preprocessing import MinMaxScaler
13. import matplotlib.pyplot as plt
14. import math

Loading the Dataset

Here, we have to load our dataset to perform the machine learning operations.

As we already know that we have two CSV files as the dataset.

1. shared_articles.csv

It includes data on the articles posted on the platform. Each article contains a timestamp for
when it was shared, the original url, the title, plain text content, the language it was shared in
(Portuguese: pt or English: en), and information about the individual who shared it (author).

o CONTENT SHARED: The article was shared on the platform and is available to users.
o CONTENT REMOVED: The article has been taken down from the platform and is no longer available for recommendations.
We will just analyze the "CONTENT SHARED" event type here for the purpose of simplicity, making the (admittedly erroneous) assumption that all articles were accessible for the whole one-year period. Only publications that were available at a specific time should be recommended for a more accurate evaluation, but we'll proceed with this exercise anyhow.

1. dataframe_articles = pd.read_csv('shared_articles.csv')
2. dataframe_articles = dataframe_articles[dataframe_articles['eventType'] == 'CONTENT SHARED']
3. dataframe_articles.head(5)
2. users_interactions.csv

It includes user interaction records for shared content. Using the contentId field, it can be joined to shared_articles.csv.

The values for eventType are:

o VIEW: The user has read the article.
o LIKE: The user liked the article.
o COMMENT CREATED: The user added a comment to the article.
o FOLLOW: The user chose to be notified by email whenever a new comment is made on the article.
o BOOKMARK: The user has bookmarked the article so they may easily access it later.
1. dataframe_interactions = pd.read_csv('users_interactions.csv')
2. dataframe_interactions.head(10)

Data Manipulation

Here, we assign a weight or strength to each sort of interaction since there are many kinds. For
instance, we believe that a remark in an article indicates a user's interest in the item is more
significant than a like or a simple view.

1. strength_of_event_type = {
2. 'VIEW': 1.0,
3. 'LIKE': 2.0,
4. 'BOOKMARK': 2.5,
5. 'FOLLOW': 3.0,
6. 'COMMENT CREATED': 4.0,
7. }
8.
9. dataframe_interactions['eventStrength'] = dataframe_interactions['eventType'].apply(lambda x: st
rength_of_event_type[x])
Note: User cold-start is a problem with recommender systems that makes it difficult to give
consumers with little or no consumption history individualized recommendations since there isn't
enough data to model their preferences.
Due to this, we are only retaining users in the dataset who had at least five interactions.

1. dataframe_user_interaction_count = dataframe_interactions.groupby(['personId', 'contentId']).siz


e().groupby('personId').size()
2. print(' Total Number of users: %d' % len(dataframe_user_interaction_count))
3. dataframe_user_with_enough_interaction = dataframe_user_interaction_count[dataframe_user_i
nteraction_count >= 5].reset_index()[['personId']]
4. print('Total Number of users with minimum 5 interactions: %d' % len(dataframe_user_with_eno
ugh_interaction))
Output:

1. print('Total Number of interactions: %d' % len(dataframe_interactions))


2. dataframe_interaction_from_selected_users = dataframe_interactions.merge(dataframe_user_wit
h_enough_interaction,
3. how = 'right',
4. left_on = 'personId',
5. right_on = 'personId')
6. print('Total Number of interactions from users with at least 5 interactions: %d' % len(dataframe_i
nteraction_from_selected_users))
Deskdrop allows users to view articles several times and interact with them in various ways (e.g., like or comment). Thus, we combine all of the interactions a user had with an item into a weighted sum of interaction-type strengths and then apply a log transformation to smooth the distribution. This information is used to model the user's interest in a particular article.

1. def preference_of_smooth_users(x):
2. return math.log(1+x, 2)
3.
4. dataframe_interaction_full = dataframe_interaction_from_selected_users \
5. .groupby(['personId', 'contentId'])['eventStrength'].sum() \
6. .apply(preference_of_smooth_users).reset_index()
7. print('Total Number of unique user/item interactions: %d' % len(dataframe_interaction_full))
8. dataframe_interaction_full.head(10)
Output:

Evaluation

Evaluation is crucial for machine learning projects because it enables objective comparison of
various methods and model hyperparameter selections.

Making sure the trained model generalizes for data it was not trained on utilizing cross-validation
procedures is a crucial component of assessment. Here, we employ a straightforward cross-
validation technique known as a holdout, in which a random data sample?in this case, 20%?is
set aside throughout training and utilized just for assessment. This article's assessment metrics
were all calculated using the test set.

A more reliable assessment strategy would involve dividing the train and test sets according to a
reference date, with the train set being made up of all interactions occurring before that date and
the test set consisting of interactions occurring after that day. For the sake of simplicity, we
decided to utilize the first random strategy for this notebook, but you might want to try the
second way to more accurately replicate how the recsys would behave in production when
anticipating interactions from "future" users.

1. dataframe_interaction_train, dataframe_interaction_text = train_test_split(dataframe_interaction_


full,
2. stratify=dataframe_interaction_full['personId'],
3. test_size=0.20,
4. random_state=42)
5.
6. print('Total Number interactions on Train set: %d' % len(dataframe_interaction_train))
7. print('Total Number interactions on Test set: %d' % len(dataframe_interaction_text))
Output:

There are a number of metrics that are frequently used for assessment in recommender systems.
We decided to employ Top-N accuracy measures, which assess the precision of the top
suggestions made to a user in comparison to the test set items with which the user has actually
interacted.

This assessment process operates as follows:

o For every user:
o For each item the user interacted with in the test set:
o Sample 100 other items the user has never interacted with. Here, we naively assume that the user does not care about the non-interacted items, but that assumption may not be accurate because the user may simply be unaware of them. Let's stick with this premise, nonetheless.
o Ask the recommender model to produce a ranked list of recommended items from the set consisting of the one interacted item and the 100 non-interacted ("non-relevant") items.
o Compute the Top-N accuracy metrics for this user and interacted item from the ranked list of recommendations.
o Aggregate the global Top-N accuracy metrics.
We chose Recall@N as the Top-N accuracy metric; it checks whether the interacted item appears among the top N items (a hit) in the ranked list of 101 candidates for a user.

NDCG@N and MAP@N are two other popular ranking metrics whose score also takes into account the position of the relevant item in the ranked list (maximum value when the relevant item is in the first position). A small worked example of these ideas follows.
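
To make Recall@N and NDCG@N concrete, here is a small self-contained example; the ranked list and item names are invented purely for illustration:

import math

def recall_at_n(relevant_item, ranked_items, n):
    # Hit if the relevant item appears in the top-N of the ranked list.
    return int(relevant_item in ranked_items[:n])

def ndcg_at_n(relevant_item, ranked_items, n):
    # With a single relevant item, the ideal DCG is 1, so NDCG reduces to
    # 1 / log2(position + 1) when the item is inside the top-N, else 0.
    if relevant_item in ranked_items[:n]:
        position = ranked_items.index(relevant_item) + 1
        return 1.0 / math.log2(position + 1)
    return 0.0

ranked = ['item_42', 'item_7', 'item_13', 'item_99', 'item_3']  # toy ranking
print(recall_at_n('item_13', ranked, 5))          # 1 (hit within the top 5)
print(round(ndcg_at_n('item_13', ranked, 5), 3))  # 0.5 = 1 / log2(3 + 1)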

1. # Indexing by personId to speed up evaluation lookups
2. dataframe_interaction_full_indexed = dataframe_interaction_full.set_index('personId')
3. dataframe_interaction_train_indexed = dataframe_interaction_train.set_index('personId')
4. dataframe_interaction_text_indexed = dataframe_interaction_text.set_index('personId')
5.
6. def getting_items_interacted(person_id, interaction_dataframe):
7. # Returns the set of content ids the user has interacted with
8. items_interacted = interaction_dataframe.loc[person_id]['contentId']
9. return set(items_interacted if type(items_interacted) == pd.Series else [items_interacted])
Now we will create a class named "ModelEvaluator", which will be used to evaluate the recommendation models we build.

1. # Top-N accuracy metrics consts
2. EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100
3.
4. class ModelEvaluator:
5.
6.
7. def getting_not_interacted_samples(self, person_id, sample_size, seed=42):
8. items_interacted = getting_items_interacted(person_id, dataframe_interaction_full_indexed)

9. items_all = set(dataframe_articles['contentId'])
10. items_not_interacted = items_all - items_interacted
11.
12. random.seed(seed)
13. sample_non_interacted_items = random.sample(list(items_not_interacted), sample_size)
14. return set(sample_non_interacted_items)
15.
16. def _to_verify_hit_top_n(self, item_id, items_recommended, topn):
17. try:
18. index = next(i for i, c in enumerate(items_recommended) if c == item_id)
19. except:
20. index = -1
21. hit = int(index in range(0, topn))
22. return hit, index
23.
24. def model_evaluation_for_users(self, model, person_id):
25. # Adding the test set's items.
26. interacted_testset_values = dataframe_interaction_text_indexed.loc[person_id]
27. if type(interacted_testset_values['contentId']) == pd.Series:
28. person_interacted_testset_items = set(interacted_testset_values['contentId'])
29. else:
30. person_interacted_testset_items = set([int(interacted_testset_values['contentId'])])
31. interated_testset_items_count = len(person_interacted_testset_items)
32.
33. # Obtaining a model's rated suggestion list for a certain user.
34. dataframe_person_recs = model.recommending_items(person_id,
35. items_to_ignore=getting_items_interacted(person_id,
36. dataframe_interaction_train_indexed
37. ),
38. topn=10000000000)
39.
40. hits_at_5_count = 0
41. hits_at_10_count = 0
42. # For each item with which the user engaged in the test set
43. for item_id in person_interacted_testset_items:
44. # Selecting 100 random items with which the user hasn't interacted
45. # (to represent items that are assumed to be not relevant to the user)
46. sample_non_interacted_items = self.getting_not_interacted_samples(person_id,
47. sample_size=EVAL_RANDOM_SAMPLE_NON
_INTERACTED_ITEMS,
48. seed=item_id%(2**32))
49.
50. # Combining the 100 random objects with the currently interacted item
51. items_to_filter_recs = sample_non_interacted_items.union(set([item_id]))
52.
53. # Filtering the recommendations down to the interacted item plus the random sample of 100 non-interacted items
54. dataframe_valid_recs = dataframe_person_recs[dataframe_person_recs['contentId'].isin(it
ems_to_filter_recs)]
55. valid_recs_ = dataframe_valid_recs['contentId'].values
56. # Checking whether the currently interacted-with item is one of the Top-
N suggested things.
57. hit_at_5, index_at_5 = self._to_verify_hit_top_n(item_id, valid_recs_, 5)
58. hits_at_5_count += hit_at_5
59. hit_at_10, index_at_10 = self._to_verify_hit_top_n(item_id, valid_recs_, 10)
60. hits_at_10_count += hit_at_10
61.
62. # Recall is the fraction of interacted items that appear among the Top-N recommended items
63. # when ranked together with the sample of non-interacted items
64. recall_at_5 = hits_at_5_count / float(interated_testset_items_count)
65. recall_at_10 = hits_at_10_count / float(interated_testset_items_count)
66.
67. person_metrics = {'hits@5_count':hits_at_5_count,
68. 'hits@10_count':hits_at_10_count,
69. 'interacted_count': interated_testset_items_count,
70. 'recall@5': recall_at_5,
71. 'recall@10': recall_at_10}
72. return person_metrics
73.
74. def model_evaluation(self, model):
75. #print('Running evaluation for users')
76. people_metrics = []
77. for idx, person_id in enumerate(list(dataframe_interaction_text_indexed.index.unique().valu
es)):
78. #if idx % 100 == 0 and idx > 0:
79. # print('%d users processed' % idx)
80. person_metrics = self.model_evaluation_for_users(model, person_id)
81. person_metrics['_person_id'] = person_id
82. people_metrics.append(person_metrics)
83. print('%d users processed' % idx)
84.
85. detailed_results_df = pd.DataFrame(people_metrics) \
86. .sort_values('interacted_count', ascending=False)
87.
88. global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['i
nteracted_count'].sum())
89. global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df
['interacted_count'].sum())
90.
91. global_metrics = {'modelName': model.getting_model_name(),
92. 'recall@5': global_recall_at_5,
93. 'recall@10': global_recall_at_10}
94. return global_metrics, detailed_results_df
95.
96. model_evaluator = ModelEvaluator()

Popularity Model

The Popularity model is a typical baseline strategy that is usually hard to beat. It is not personalized: it simply recommends to a user the most popular items that the user has not yet consumed. Because popularity captures the "wisdom of the crowd", it usually provides reasonable recommendations that are broadly appealing to most people.

A recommender system's main goal, which goes far beyond this simple method, is to surface long-tail items to users with very specific interests.

1. # Computes the most popular items
2. dataframe_item_popularity = dataframe_interaction_full.groupby('contentId')['eventStrength'].sum().sort_values(ascending=False).reset_index()
3. dataframe_item_popularity.head(10)
Output:

1. class PopularityRecommender:
2.
3. MODEL_NAME = 'Popularity'
4.
5. def __init__(self, dataframe_popularity, dataframe_items=None):
6. self.dataframe_popularity = dataframe_popularity
7. self.dataframe_items = dataframe_items
8.
9. def getting_model_name(self):
10. return self.MODEL_NAME
11.
12. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
13. # Suggest the most well-liked products that the consumer hasn't yet viewed.
14. dataframe_recommendation = self.dataframe_popularity[~self.dataframe_popularity['conten
tId'].isin(items_to_ignore)] \
15. .sort_values('eventStrength', ascending = False) \
16. .head(topn)
17.
18. if verbose:
19. if self.dataframe_items is None:
20. raise Exception('"dataframe_items" is required in verbose mode')
21.
22. dataframe_recommendation = dataframe_recommendation.merge(self.dataframe_items, how = 'left',
23. left_on = 'contentId',
24. right_on = 'contentId')[['eventStrength', 'contentId', 'title', 'url', 'lang']]
25.
26.
27. return dataframe_recommendation
28.
29. popularity_model = PopularityRecommender(dataframe_item_popularity, dataframe_articles)
Here, using the above-described methodology, we evaluate the Popularity model.

It had a Recall@5 of 0.2417, which means the Popularity model placed around 24% of the test set's interacted items among its top 5 recommendations (from lists with 100 random items). As expected, Recall@10 was considerably higher (about 37%).

It may be surprising that popularity models can typically perform this well.

1. print('Evaluating Popularity recommendation model...')
2. metrics_pop_global, dataframe_pop_detailed_results = model_evaluator.model_evaluation(popularity_model)
3. print('\nGlobal metrics:\n%s' % metrics_pop_global)
4. dataframe_pop_detailed_results.head(10)

Content-Based Filtering model

Content-based filtering techniques use the descriptions or attributes of the items a user has interacted with to recommend similar items. Because it relies only on the user's previous choices, this approach is robust against the cold-start problem. For text items such as books, articles, and news stories, it is straightforward to build item profiles and user profiles from the raw text.

In this case, we're employing TF-IDF, a very popular information retrieval (search engine) technique.

With this method, unstructured text is transformed into a vector representation in which each word corresponds to a position in the vector and the value measures how relevant that word is for the article. Since all items are represented in the same Vector Space Model, articles can be compared directly.
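
As a quick illustration of this idea on a toy corpus (invented here, not the Deskdrop articles), TF-IDF vectors of related documents end up closer in the vector space than unrelated ones:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["machine learning on google cloud",
        "deep learning for text",
        "cooking recipes for pasta"]

vect = TfidfVectorizer()
matrix = vect.fit_transform(docs)               # one TF-IDF vector per document
print(matrix.shape)                             # (3, number_of_distinct_terms)
print(cosine_similarity(matrix[0], matrix[1]))  # related documents -> higher score
print(cosine_similarity(matrix[0], matrix[2]))  # unrelated document -> lower score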

1. # Removing stopwords (words that carry little meaning) in English and Portuguese (the corpus mixes both languages)
2. stopwords_list = stopwords.words('english') + stopwords.words('portuguese')
3.
4. # Trains a vectorizer with a vocabulary of the 5000 most frequent unigrams and bigrams in the corpus, ignoring stopwords
5. vectorizer = TfidfVectorizer(analyzer='word',
6. ngram_range=(1, 2),
7. min_df=0.003,
8. max_df=0.5,
9. max_features=5000,
10. stop_words=stopwords_list)
11.
12. item_ids = dataframe_articles['contentId'].tolist()
13. tfidf_matrix = vectorizer.fit_transform(dataframe_articles['title'] + " " + dataframe_articles['text'])
14. tfidf_feature_names = vectorizer.get_feature_names()
15. tfidf_matrix
Output:

To model the user profile, we take a weighted average of all the item profiles the user has interacted with, weighting by interaction strength. Articles the user engaged with most strongly (e.g., liked or commented on) therefore contribute more to the final profile. A small worked example of this weighted average follows.
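
A minimal numeric sketch of that weighted average, with invented item profiles and interaction strengths:

import numpy as np
from sklearn.preprocessing import normalize

# Toy example: two item TF-IDF profiles and the user's interaction strengths with them.
item_profiles = np.array([[0.8, 0.0, 0.6],    # article about "machine learning"
                          [0.0, 1.0, 0.0]])   # article about "cooking"
strengths = np.array([[3.0], [1.0]])          # the user engaged 3x more with the first

weighted_avg = (item_profiles * strengths).sum(axis=0) / strengths.sum()
user_profile = normalize(weighted_avg.reshape(1, -1))
print(user_profile.round(3))  # pulled strongly toward the machine-learning article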

1. def getting_item_profiles(item_id):
2. idx = item_ids.index(item_id)
3. profile_item = tfidf_matrix[idx:idx+1]
4. return profile_item
5.
6. def getting_item_profiless(ids):
7. list_profiles_item = [getting_item_profiles(x) for x in ids]
8. profile_items = scipy.sparse.vstack(list_profiles_item)
9. return profile_items
10.
11. def building_user_profiles(person_id, dataframe_interaction_indexed):
12. dataframe_interactions_person = dataframe_interaction_indexed.loc[person_id]
13. profiles_user_items = getting_item_profiless(dataframe_interactions_person['contentId'])
14.
15. user_item_strengths = np.array(dataframe_interactions_person['eventStrength']).reshape(-
1,1)
16. # Weighted average of the item profiles by the intensity of the interactions
17. user_item_strengths_weighted_avg = np.sum(profiles_user_items.multiply(user_item_strengt
hs), axis=0) / np.sum(user_item_strengths)
18. user_profile_norm = sklearn.preprocessing.normalize(user_item_strengths_weighted_avg)
19. return user_profile_norm
20.
21. def build_users_profiles():
22. dataframe_interaction_indexed = dataframe_interaction_train[dataframe_interaction_train['co
ntentId'] \
23. .isin(dataframe_articles['contentId'])].set_index('personId')
24. profiles_user = {}
25. for person_id in dataframe_interaction_indexed.index.unique():
26. profiles_user[person_id] = building_user_profiles(person_id, dataframe_interaction_indexe
d)
27. return profiles_user
28.
29. profiles_users = build_users_profiles()
30. len(profiles_users)

Let's look at the profile first. It is a unit vector of length 5000, where the value at each position indicates how relevant a token (a unigram or bigram) is to this user.

Looking at the profile below, the most relevant tokens really do reflect professional interests in machine learning, deep learning, artificial intelligence, and the Google Cloud Platform, so we can expect some solid recommendations here!

1. my_profile = profiles_users[-1479311724257856983]
2. print(my_profile.shape)
3. pd.DataFrame(sorted(zip(tfidf_feature_names,
4. profiles_users[-1479311724257856983].flatten().tolist()), key=lambda x: -
x[1])[:20],
5. columns=['token', 'relevance'])
Output:
1. class ContentBasedRecommender:
2.
3. MODEL_NAME = 'Content-Based'
4.
5. def __init__(self, items_df=None):
6. self.item_ids = item_ids
7. self.items_df = items_df
8.
9. def getting_model_name(self):
10. return self.MODEL_NAME
11.
12. def _getting_similar_items_to_the_users(self, person_id, topn=1000):
13. # The user profile and all object profiles are compared using the cosine similarity formula.
14. cosine_similarities = cosine_similarity(profiles_users[person_id], tfidf_matrix)
15. # Gets the most comparable products.
16. similar_indices = cosine_similarities.argsort().flatten()[-topn:]
17. # Sort comparable objects according to similarity.
18. similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=l
ambda x: -x[1])
19. return similar_items
20.
21. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
22. similar_items = self._getting_similar_items_to_the_users(user_id)
23. # Ignores items with which the user has already interacted
24. similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
25.
26. dataframe_recommendations = pd.DataFrame(similar_items_filtered, columns=['contentId',
'recStrength']) \
27. .head(topn)
28.
29. if verbose:
30. if self.items_df is None:
31. raise Exception('"items_df" is required in verbose mode')
32.
33. dataframe_recommendations = dataframe_recommendations.merge(self.items_df, how = 'left',
34. left_on = 'contentId',
35. right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]
36.
37.
38. return dataframe_recommendations
39.
40. content_based_recommender_model = ContentBasedRecommender(dataframe_articles)
With the personalized recommendations of the content-based filtering model, we have a Recall@5 of 0.162, which means that around 16% of the test set's interacted items were placed by this model among its top 5 items (from lists with 100 random items). Recall@10 was 0.261 (about 26%). The fact that the Content-Based model performed worse than the Popularity model suggests that users may not be as committed to reading content that is very similar to what they have already read.

1. print('Evaluating The Content-Based Filtering model...')
2. metrics_cb_global, dataframe_cb_result_detailed = model_evaluator.model_evaluation(content_based_recommender_model)
3. print('\nGlobal metrics:\n%s' % metrics_cb_global)
4. dataframe_cb_result_detailed.head(10)
Output:

Collaborative Filtering model

Collaborative Filtering has two main implementation approaches:

o Memory-based: This method computes user similarities based on the items with which
they have engaged (user-based approach) or computes item similarities based on the users
who have interacted with the things (item-based approach).
A common example of this strategy is User Neighbourhood-based CF, in which the top-N most similar users to a given user (typically determined using Pearson correlation) are selected and used to recommend items those similar users liked but with which the current user has not yet interacted. Although this strategy is relatively easy to implement, it usually does not scale well to large numbers of users. Crab offers a good Python implementation of this strategy; a minimal sketch of the underlying idea appears after this list.

o Model-based: In this method, models are created by utilizing various machine learning
algorithms to make product recommendations to customers. Numerous model-based CF
techniques exist, including probabilistic latent semantic analysis, neural networks,
bayesian networks, clustering models, and latent component models like Singular Value
Decomposition (SVD).
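
Below is a minimal sketch of the user-neighbourhood idea using plain pandas (not the Crab API); the ratings matrix is invented purely for illustration:

import pandas as pd

# Rows are users, columns are items, values are interaction strengths (toy data).
ratings = pd.DataFrame({'item_a': [5, 4, 1, 0],
                        'item_b': [4, 5, 0, 1],
                        'item_c': [1, 0, 5, 4]},
                       index=['u1', 'u2', 'u3', 'u4'])

# Pearson correlation between users: transpose so users become columns,
# then let pandas compute pairwise column correlations.
user_similarity = ratings.T.corr(method='pearson')
print(user_similarity.round(2))

# Neighbours of u1, most similar first, excluding u1 itself; items those
# neighbours liked (but u1 has not seen) would then be recommended.
neighbours = user_similarity['u1'].drop('u1').sort_values(ascending=False)
print(neighbours)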

Matrix Factorisation
Latent factor models compress the user-item matrix into a low-dimensional representation. The benefit of this approach is that it works with a much smaller matrix in a lower-dimensional space instead of a high-dimensional matrix containing a huge number of missing values.

Both the user-based and item-based neighbourhood algorithms described in the previous section could also be applied to this reduced representation. The approach has several benefits: it handles the sparsity of the original matrix better than memory-based methods, and comparing similarities on the reduced matrix is much cheaper, especially for large, sparse datasets.

Here, we employ a well-known latent factor model, Singular Value Decomposition (SVD). You could also use other matrix factorization frameworks that are more specific to CF, such as surprise, mrec, or python-recsys. We chose a SciPy implementation of SVD because Kaggle kernels support it.

An important decision is how many factors to use when factorizing the user-item matrix. The more factors, the more exactly the factorization reconstructs the original matrix. If the model is allowed to memorize too many details of the original matrix, it may not generalize well to data it was not trained on. Reducing the number of factors increases the model's ability to generalize. The small sketch below illustrates this trade-off.
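
A small illustration of that trade-off on a random matrix (purely illustrative, not the Deskdrop data):

import numpy as np
from scipy.sparse.linalg import svds

# A small random "user-item" matrix to show how the number of factors k
# controls how closely the factorization reproduces the original.
rng = np.random.default_rng(42)
matrix = rng.random((20, 30))

for k in (2, 5, 10):
    U, sigma, Vt = svds(matrix, k=k)
    reconstruction = U @ np.diag(sigma) @ Vt
    error = np.linalg.norm(matrix - reconstruction)
    print(f'k={k:2d}  reconstruction error={error:.2f}')
# The error shrinks as k grows; a smaller k gives a coarser but more general model.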

1. # Make a sparse pivot table with columns for the products and rows for the users
2. dataframe_users_items_pivot_matrix = dataframe_interaction_train.pivot(index='personId',
3. columns='contentId',
4. values='eventStrength').fillna(0)
5.
6. dataframe_users_items_pivot_matrix.head(10)
Output:
1. pivot_matrix_users_items = dataframe_users_items_pivot_matrix.to_numpy()
2. pivot_matrix_users_items[:10]
Output:

1. users_ids = list(dataframe_users_items_pivot_matrix.index)
2. users_ids[:10]
Output:

1. pivot_sparse_matrix_users_items = csr_matrix(pivot_matrix_users_items)
2. pivot_sparse_matrix_users_items
Output:
1. # The number of factors to be applied to the user-item matrix
2. Number_of_factor = 15
3. # matrix factorization of the initial user-item matrix is carried out
4. # U, sigma, Vt = svds(users_items_pivot_matrix, k = Number_of_factor)
5. U, sigma, Vt = svds(pivot_sparse_matrix_users_items, k = Number_of_factor)
6.
7. U.shape
Output:

1. Vt.shape
Output:

1. sigma = np.diag(sigma)
2. sigma.shape

After factorization, we attempt to rebuild the original matrix by multiplying its factors back together. The resulting matrix is no longer sparse. We will use the predictions for items with which the user has not yet interacted to produce recommendations.

1. predicted_ratings_all_users = np.dot(np.dot(U, sigma), Vt)
2. predicted_ratings_all_users

1. predicted_ratings_norm_all_users = (predicted_ratings_all_users -
predicted_ratings_all_users.min()) / (predicted_ratings_all_users.max() -
predicted_ratings_all_users.min())
2.
3. # the process of returning the rebuilt matrix to a Pandas dataframe.
4. dataframe_cf_preds = pd.DataFrame(predicted_ratings_norm_all_users, columns = dataframe_us
ers_items_pivot_matrix.columns, index=users_ids).transpose()
5. dataframe_cf_preds.head(10)
Output:

1. len(dataframe_cf_preds.columns)
Output:

1. class CFRecommender:
2.
3. MODEL_NAME = 'Collaborative Filtering'
4.
5. def __init__(self, dataframe_cf_predictions, items_df=None):
6. self.dataframe_cf_predictions = dataframe_cf_predictions
7. self.items_df = items_df
8.
9. def getting_model_name(self):
10. return self.MODEL_NAME
11.
12. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
13. # Obtain and arrange user predictions
14. predictions_sorted_users = self.dataframe_cf_predictions[user_id].sort_values(ascending=F
alse) \
15. .reset_index().rename(columns={user_id: 'recStrength'})
16.
17. # Recommend the items with the highest predicted strength that the user hasn't yet interacted with.
18. dataframe_recommendations = predictions_sorted_users[~predictions_sorted_users['content
Id'].isin(items_to_ignore)] \
19. .sort_values('recStrength', ascending = False) \
20. .head(topn)
21.
22. if verbose:
23. if self.items_df is None:
24. raise Exception('"items_df" is required in verbose mode')
25.
26. dataframe_recommendations = dataframe_recommendations.merge(self.items_df, how = 'left',
27. left_on = 'contentId',
28. right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]
29.
30.
31. return dataframe_recommendations
32.
33. cf_recommender_model = CFRecommender(dataframe_cf_preds, dataframe_articles)
Evaluating the Collaborative Filtering model (SVD matrix factorization) gives Recall@5 of about 33% and Recall@10 of about 46%, which is considerably higher than both the Popularity model and the Content-Based model.

1. print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
2. metrics_cf_global, dataframe_cf_detailed_results = model_evaluator.model_evaluation(cf_recommender_model)
3. print('\nGlobal metrics:\n%s' % metrics_cf_global)
4. dataframe_cf_detailed_results.head(10)
Output:

Hybrid Recommender

It is a combination of the Collaborative Filtering and Content-Based Filtering methods. In practice, several studies have shown that hybrid methods outperform the individual approaches, and both academics and practitioners adopt them frequently.

Let's create a straightforward hybridization technique that ranks items by the weighted average of the normalized CF and Content-Based scores. Because the CF model is significantly more accurate than the CB model, the weights for the CF and CB models are set to 100.0 and 1.0, respectively. A toy illustration of this weighted combination follows.
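
A toy illustration of the weighted combination (the scores below are invented for the example):

import pandas as pd

# Toy normalized scores for three articles.
scores = pd.DataFrame({'contentId': ['a1', 'a2', 'a3'],
                       'recStrengthCB': [0.90, 0.10, 0.40],   # content-based
                       'recStrengthCF': [0.20, 0.80, 0.60]})  # collaborative

weight_cb, weight_cf = 1.0, 100.0   # same weighting used in the class below
scores['recStrengthHybrid'] = (scores['recStrengthCB'] * weight_cb
                               + scores['recStrengthCF'] * weight_cf)
print(scores.sort_values('recStrengthHybrid', ascending=False))
# With such a large CF weight, the ranking is driven almost entirely by the CF score,
# while the CB score acts as a tie-breaker.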

1. class HybridRecommender:
2.
3. MODEL_NAME = 'Hybrid'
4.
5. def __init__(self, model_cb_rec, model_cf_rec, dataframe_items, weight_cb_ensemble=1.0, w
eight_cf_ensemble=1.0):
6. self.model_cb_rec = model_cb_rec
7. self.model_cf_rec = model_cf_rec
8. self.weight_cb_ensemble = weight_cb_ensemble
9. self.weight_cf_ensemble = weight_cf_ensemble
10. self.dataframe_items = dataframe_items
11.
12. def getting_model_name(self):
13. return self.MODEL_NAME
14.
15. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
16. # Obtaining the top 1000 suggestions for content-based filtering
17. dataframe_cb_recs = self.model_cb_rec.recommending_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose,
18. topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
19.
20. # Obtaining the top 1000 suggestions via collaborative filtering
21. dataframe_cf_recs = self.model_cf_rec.recommending_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose,
22. topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
23.
24. # putting the outcomes together by contentId
25. dataframe_recs = dataframe_cb_recs.merge(dataframe_cf_recs,
26. how = 'outer',
27. left_on = 'contentId',
28. right_on = 'contentId').fillna(0.0)
29.
30. # Using the CF and CB scores to create a hybrid recommendation score
31. # dataframe_recs['recStrengthHybrid'] = dataframe_recs['recStrengthCB'] * dataframe_recs[
'recStrengthCF']
32. dataframe_recs['recStrengthHybrid'] = (dataframe_recs['recStrengthCB'] * self.weight_cb_e
nsemble) \
33. + (dataframe_recs['recStrengthCF'] * self.weight_cf_ensemble)
34.
35. # Sorting advice based on hybrid score
36. recommendations_df = dataframe_recs.sort_values('recStrengthHybrid', ascending=False).h
ead(topn)
37.
38. if verbose:
39. if self.dataframe_items is None:
40. raise Exception('"dataframe_items" is required in verbose mode')
41.
42. recommendations_df = recommendations_df.merge(self.dataframe_items, how = 'left',
43. left_on = 'contentId',
44. right_on = 'contentId')[['recStrengthHybrid', 'contentId', 'title', 'url', 'lang']]
45.
46.
47. return recommendations_df
48.
49. hybrid_recommender_model = HybridRecommender(content_based_recommender_model, cf_re
commender_model, dataframe_articles,
50. weight_cb_ensemble=1.0, weight_cf_ensemble=100.0)
51.
52. print('Evaluating Hybrid model...')
53. metrics_hybrid_global, dataframe_hybrid_detailed_results = model_evaluator.model_evaluation(
hybrid_recommender_model)
54. print('\nGlobal metrics:\n%s' % metrics_hybrid_global)
55. dataframe_hybrid_detailed_results.head(10)
Output:

Comparing the Methods

Now, we will compare the methods for recall@5 and recall@10.

1. dataframe_global_metrics = pd.DataFrame([metrics_cb_global, metrics_pop_global, metrics_cf_global, metrics_hybrid_global]) \
2. .set_index('modelName')
3. dataframe_global_metrics
Output:
A new champion has emerged!

By combining Collaborative Filtering and Content-Based Filtering, our straightforward hybrid technique outperforms Collaborative Filtering on its own. Recall@5 is now 34.2%, while Recall@10 is 47.9%.

Now for better understanding, we can also plot the graph for the comparison of the models.

1. %matplotlib inline
2. ax = dataframe_global_metrics.transpose().plot(kind='bar', figsize=(15,8))
3. for p in ax.patches:
4. ax.annotate("%.3f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='cente
r', va='center', xytext=(0, 10), textcoords='offset points')
Output:
TESTING

Now, we will test the best-performing model, the hybrid recommender, for a specific user.

1. def inspection_interactions(person_id, test_set=True):
2. if test_set:
3. dataframe_interactions = dataframe_interaction_text_indexed
4. else:
5. dataframe_interactions = dataframe_interaction_train_indexed
6. return dataframe_interactions.loc[person_id].merge(dataframe_articles, how = 'left',
7. left_on = 'contentId',
8. right_on = 'contentId') \
9. .sort_values('eventStrength', ascending = False)[['eventStrength',
10. 'contentId',
11. 'title', 'url', 'lang']]
Some of the articles we engaged with in Deskdrop (from the train set) are shown below. It is clear that machine learning, deep learning, artificial intelligence, and the Google Cloud Platform are among the key areas of interest.
1. inspection_interactions(-1479311724257856983, test_set=False).head(20)

1. hybrid_recommender_model.recommending_items(-1479311724257856983, topn=20, verbose=True)
Comparing the recommendations from the hybrid model with the actual interests above, we find that they match quite closely.
