Cse Big Data 702 Notes
BIG DATA
Structured Data
Structured data can be crudely defined as the data that resides in a fixed field within a
record.
It is the type of data most familiar from our everyday lives, for example a birthday or an address.
A certain schema binds it, so all the data has the same set of properties. Structured data is
also called relational data. It is split into multiple tables to enhance the integrity of the data
by creating a single record to depict an entity. Relationships are enforced by the application
of table constraints.
The business value of structured data lies within how well an organization can utilize its
existing systems and processes for analysis purposes.
A Structured Query Language (SQL) is needed to bring the data together. Structured data is
easy to enter, query, and analyze because all of the data follows the same format. However, forcing a
consistent structure also means that altering the data is hard, as each record has to be
updated to adhere to the new structure. Examples of structured data include numbers, dates,
strings, etc. The business data of an e-commerce website can be considered structured
data.
+-------+-------+---------+---------+-------+
| Name  | Class | Section | Roll No | Grade |
+-------+-------+---------+---------+-------+
| Geek1 | 11    | A       | 1       | A     |
| Geek2 | 11    | A       | 2       | B     |
| Geek3 | 11    | A       | 3       | A     |
+-------+-------+---------+---------+-------+
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet. However, there are some features like key-value pairs that help in
discerning the different entities from each other.
Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware
with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files so it can be
transmitted, stored, and parsed. The sender and the receiver don’t need to know about each
other's systems: as long as the same serialization language is used, the data can be understood by both systems
comfortably. There are three predominantly used serialization languages.
1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.
XML
<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
XML expresses the data using tags (text within angular brackets) to shape the data (for ex:
FirstName) and attributes (For ex: Type) to feature the data. However, being a verbose and
voluminous language, other formats have gained more popularity.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for
data interchange. JSON is easy to use and uses human/machine-readable text to store and
transmit data objects.
Javascript
"firstName": "Jane",
"lastName": "Doe",
"codingPlatforms": [
This format isn’t as formal as XML. It’s more like a key/value pair model than a formal data
depiction. Javascript has inbuilt support for JSON. Although JSON is very popular amongst
web developers, non-technical personnel find it tedious to work with JSON due to its heavy
dependence on JavaScript and structural characters (braces, commas, etc.)
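To make the key/value-pair model concrete, here is a minimal sketch of reading the JSON snippet above from Java. It assumes the Jackson library (jackson-databind) is available on the classpath; any JSON parser would work similarly.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Minimal sketch: parse the programmer JSON and pull out individual fields.
public class JsonReadSketch {
    public static void main(String[] args) throws Exception {
        String json = "{ \"firstName\": \"Jane\", \"lastName\": \"Doe\", "
                    + "\"codingPlatforms\": [\"GeeksforGeeks\", \"Code4Eva!\", \"CodeisLife\"] }";

        ObjectMapper mapper = new ObjectMapper();   // Jackson's entry point for JSON parsing
        JsonNode root = mapper.readTree(json);      // builds a tree of key/value nodes

        System.out.println(root.get("firstName").asText());              // Jane
        System.out.println(root.get("codingPlatforms").get(0).asText()); // GeeksforGeeks
    }
}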
3. YAML– YAML is a user-friendly data serialization language. Figuratively, it stands
for YAML Ain’t Markup Language. It is adopted by technical and non-technical handlers all
across the globe owing to its simplicity. The data structure is defined by line separation and
indentation and reduces the dependency on structural characters. YAML is extremely
comprehensive and its popularity is a result of its human-machine readability.
YAML example:

firstName: Jane
lastName: Doe
codingPlatforms:
  - GeeksforGeeks
  - Code4Eva!
  - CodeisLife
A product catalog organized by tags is an example of semi-structured data.
Unstructured Data
Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered unstructured data.
Even though the metadata accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
Additionally, unstructured data is also known as “dark data” because it cannot be analyzed
without the proper software tools.
There are five V's of Big Data that explain its characteristics.
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook, for example, generates approximately a billion messages, records about 4.5 billion
"Like" button clicks, and receives more than 350 million new posts each day. Big data
technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and it is collected from
different sources. In the past, data was collected only from databases and spreadsheets, but these
days it arrives in an array of forms: PDFs, emails, audio, social media posts, photos,
videos, etc.
The data is categorized as below:
1. Structured data: Data with a defined schema and all the required columns. It is in a
tabular form and is stored in a relational database management system.
2. Semi-structured data: The schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and
email. It is not stored in relational tables, but it still carries markers such as tags or
key-value pairs that give it some structure.
3. Unstructured data: All the unstructured files, such as log files, audio files, and image files, are
included in unstructured data. Some organizations have a lot of data available, but they
do not know how to derive value from it because the data is raw.
4. Quasi-structured data: Textual data with inconsistent formats that can be structured
only with time, effort, and specialized tools.
Example: Web server logs, i.e., the log file is created and maintained by some server that
contains a list of activities.
Veracity
Veracity refers to how reliable and trustworthy the data is. Because big data comes from many
sources of varying quality, it must be filtered, cleaned, and translated so that it can be handled
and managed efficiently. Reliable data is also essential for business development.
Value
Value is an essential characteristic of big data. What matters is not merely the data that we
process or store, but the valuable and reliable insights drawn from the data we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created, often in real time. It covers the rate at which data sets arrive,
their rate of change, and bursts of activity. A primary aspect of Big Data is providing the
demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
The main differences between traditional data and big data are as follows:
o Traditional data is usually structured and can be stored in spreadsheets, databases, etc.; big
data includes structured, semi-structured, and unstructured data.
o Analysis of traditional data can be done with primary statistical methods; analysis of big data
needs advanced analytics methods such as machine learning, data mining, etc.
o Traditional data is generated after an event happens; big data is generated every second.
o Traditional data is easier to secure and protect because of its small size and simplicity; big
data is harder to secure and protect because of its size and complexity.
o Traditional data processing is less efficient than big data processing.
If we see the last few decades, we can analyze that Big Data technology has gained so much
growth. There are a lot of milestones in the evolution of Big Data which are described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large volumes
of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud Computing technology helps companies store their important data in remote data
centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine Learning algorithms work on large data sets, and analysis is done on huge
amounts of data to get meaningful insights from them. This has led to the
development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in
real time.
7. Edge Computing:
Edge Computing is a kind of distributed computing paradigm that allows data processing
to be done at the edge of the network, closer to the source of the data.
We can categorize the leading big data technologies into the following four sections:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage
Let us first discuss leading Big Data Technologies that come under Data Storage:
o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies
that come into play. This technology is based on the MapReduce architecture and is
mainly used to process data in batches. The Hadoop framework was introduced to store and
process data in a distributed processing environment on commodity hardware, using a
simple programming model.
Apart from this, Hadoop is also well suited to storing and analyzing data from
various machines at high speed and low cost. That is why Hadoop is known as one
of the core components of big data technologies. Hadoop was created by Doug Cutting and
Mike Cafarella in 2006, and the Apache Software Foundation released Hadoop 1.0 in Dec 2011.
Hadoop is written in the Java programming language.
o MongoDB: MongoDB is another important component of big data technologies in terms
of storage. Relational and RDBMS properties do not apply to MongoDB because it
is a NoSQL database. This is not the same as traditional RDBMS databases that use
structured query languages. Instead, MongoDB stores schema-flexible documents.
The structure of data storage in MongoDB is also different from traditional RDBMS
databases. This enables MongoDB to hold massive amounts of data. It is based on a
simple cross-platform document-oriented design. A MongoDB database stores
documents similar to JSON with flexible schemas. This ultimately helps the operational data
storage options that can be seen in most financial organizations. As a result,
MongoDB is replacing traditional mainframes and offering the flexibility to handle a
wide range of high-volume data types in distributed architectures.
MongoDB Inc. introduced MongoDB in Feb 2009. It is written with a combination of
C++, Python, JavaScript, and Go language.
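As a small illustration of the document model described above, here is a minimal sketch that inserts and reads one document using the MongoDB Java sync driver. It assumes the driver dependency and a local mongod on the default port; the database and collection names ("testdb", "students") are hypothetical.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

// Minimal sketch: insert and read one JSON-like document in MongoDB.
public class MongoSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("testdb");           // hypothetical database
            MongoCollection<Document> students = db.getCollection("students");

            // Documents are schema-flexible: fields can differ between documents.
            students.insertOne(new Document("name", "Geek1")
                    .append("rollNo", 1)
                    .append("grade", "A"));

            Document first = students.find().first();   // read one document back
            System.out.println(first.toJson());
        }
    }
}

Because documents are schema-flexible, a second insertOne with a different set of fields would be accepted without any schema change.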
o RainStor: RainStor is a popular database management system designed to manage and
analyze organizations' Big Data requirements. It uses deduplication strategies that help
manage storing and handling vast amounts of data for reference.
RainStor was designed in 2004 by the RainStor software company. It operates much like
SQL. Companies such as Barclays and Credit Suisse use RainStor for their big data
needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. It lets us use the Splunk Search Processing Language (SPL) to
analyze data. Also, Hunk allows us to report and visualize vast amounts of data from
Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.
o Cassandra: Cassandra is one of the leading big data technologies among the list of top
NoSQL databases. It is open-source, distributed and has extensive column storage
options. It is freely available and provides high availability without fail. This ultimately
helps in the process of handling data efficiently on large commodity groups. Cassandra's
essential features include fault-tolerant mechanisms, scalability, MapReduce support,
distributed nature, eventual consistency, query language property, tunable consistency,
and multi-datacenter replication, etc.
Cassandra was originally developed at Facebook in 2008 for its inbox search feature and is
now maintained by the Apache Software Foundation. It is based on the Java programming language.
Data Mining
Let us now discuss leading Big Data Technologies that come under Data Mining:
o Presto: Presto is an open-source and a distributed SQL query engine developed to run
interactive analytical queries against huge-sized data sources. The size of data sources
can vary from gigabytes to petabytes. Presto helps in querying the data in Cassandra,
Hive, relational databases and proprietary data storage systems.
Presto is a Java-based query engine that was originally developed at Facebook and open-
sourced in 2013. Companies like Repro, Netflix, Airbnb, Facebook and Checkr are using this
big data technology and making good use of it.
o RapidMiner: RapidMiner is defined as the data science software that offers us a very
robust and powerful graphical user interface to create, deliver, manage, and maintain
predictive analytics. Using RapidMiner, we can create advanced workflows and scripting
support in a variety of programming languages.
RapidMiner is a Java-based centralized solution developed in 2001 by Ralf
Klinkenberg, Ingo Mierswa, and Simon Fischer at the Technical University of
Dortmund's AI unit. It was initially named YALE (Yet Another Learning Environment).
A few sets of companies that are making good use of the RapidMiner tool are Boston
Consulting Group, InFocus, Domino's, Slalom, and Vivint.SmartHome.
o ElasticSearch: When it comes to finding information, Elasticsearch is known as an
essential tool. It is the core component of the ELK stack, alongside Logstash
and Kibana. In simple words, ElasticSearch is a search engine based on the Lucene
library and works similarly to Solr. It provides a distributed, multi-tenant-capable,
full-text search engine that stores schema-free JSON documents behind an HTTP web interface.
ElasticSearch is primarily written in a Java programming language and was developed in
2010 by Shay Banon. Now, it has been handled by Elastic NV since 2012. ElasticSearch
is used by many top companies, such as LinkedIn, Netflix, Facebook, Google, Accenture,
StackOverflow, etc.
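Since Elasticsearch exposes an HTTP/JSON interface, a minimal sketch of indexing one schema-free document can be written with the JDK 11 HttpClient alone. It assumes a local, unauthenticated Elasticsearch node on port 9200; the index name "articles" is hypothetical.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: index one JSON document over Elasticsearch's HTTP interface.
public class EsIndexSketch {
    public static void main(String[] args) throws Exception {
        String doc = "{ \"title\": \"Big Data Notes\", \"views\": 42 }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/articles/_doc/1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(doc))   // create/replace document with id 1
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}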
Data Analytics
Now, let us discuss leading Big Data Technologies that come under Data Analytics:
o Apache Kafka: Apache Kafka is a popular streaming platform primarily known for three
core capabilities: publishing, subscribing to, and processing streams of records. It is
referred to as a distributed streaming platform. It can also act as an
asynchronous message broker that ingests and processes
real-time streaming data. In this respect, it is similar to an enterprise messaging
system or message queue.
Besides, Kafka also provides a retention period, and data can be transmitted through a
producer-consumer mechanism. Kafka has received many enhancements to date and
includes additional layers and properties such as the Schema Registry, KTables, and KSQL.
It is written in Java and Scala, was originally developed at LinkedIn, and was open-sourced
through the Apache software community in 2011. Some top companies using the Apache Kafka platform include
Twitter, Spotify, Netflix, Yahoo, LinkedIn etc.
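The producer side of the producer-consumer mechanism mentioned above can be sketched as follows. This is a minimal example assuming the kafka-clients library and a broker on localhost:9092; the topic name "events" is hypothetical.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: publish one record to a Kafka topic.
public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record is appended to the topic and kept for the configured retention period.
            producer.send(new ProducerRecord<>("events", "user42", "clicked-checkout"));
        }
    }
}

A consumer would subscribe to the same topic and poll for records; Kafka retains the records for the configured retention period regardless of whether they have already been consumed.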
o Splunk: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and dashboards, etc.,
using related data. It is mainly beneficial for generating business insights and web
analytics. Besides, Splunk is also used for security purposes, compliance, application
management and control.
Splunk Inc. introduced Splunk in the year 2014. It is written in a combination of AJAX,
Python, C++ and XML. Companies such as Trustwave, QRadar, and 1Labs are making
good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and analyze
the obtained models, results, and interactive views. It also allows us to execute all the
analysis steps altogether. It consists of an extension mechanism that can add more
plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was
developed in 2008 by KNIME Company. A list of companies that are making use of
KNIME includes Harnham, Tyler, and Paloalto.
o Spark: Apache Spark is one of the core technologies in the list of big data technologies. It
is one of those essential technologies which are widely used by top companies. Spark is
known for offering In-memory computing capabilities that help enhance the overall speed
of the operational process. It also provides a generalized execution model to support more
applications. Besides, it includes top-level APIs (e.g., Java, Scala, and Python) to ease the
development process.
Also, Spark allows users to process and handle real-time streaming data using batching
and windowing techniques. This ultimately helps to generate datasets and data
frames on top of RDDs, which are integral components of Spark Core.
Components like Spark MLlib, GraphX, and SparkR help analyze and process machine
learning and data science workloads. Spark is written using Java, Scala, Python and R.
Spark was originally developed at UC Berkeley in 2009 and later became an Apache Software
Foundation project. Companies like Amazon, ORACLE, CISCO, VerizonWireless, and
Hortonworks are using this big data technology and making good use of it.
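As a minimal illustration of Spark's in-memory RDD model, here is a small word-count sketch in Java that runs in local mode. It assumes only the spark-core dependency; no cluster is needed because the master is set to "local[*]".

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Minimal sketch: in-memory word count on an RDD built from two sample lines.
public class SparkWordCountSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(
                    Arrays.asList("big data is big", "spark keeps data in memory"));

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                   // (word, 1) pairs
                    .reduceByKey((a, b) -> a + b);                              // sum counts per word

            counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}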
o R-Language: R is a programming language mainly used for statistical
computing and graphics. It is a free software environment used by leading data miners,
practitioners and statisticians. The language is primarily beneficial in the development of
statistical software and data analytics.
R 1.0.0 was released in Feb 2000 by the R Foundation. R is written in C, Fortran, and R itself.
Companies like Barclays, American Express, and Bank of America use R-Language for
their data analytics needs.
o Blockchain: Blockchain is a technology that can be used in several applications related to
different industries, such as finance, supply chain, manufacturing, etc. It is primarily used
in processing operations like payments and escrow. This helps in reducing the risks of
fraud. Besides, it enhances the overall transaction processing speed, increases financial
privacy, and internationalizes markets. Additionally, it is also used to fulfill the needs
of shared ledger, smart contract, privacy, and consensus in any Business Network
Environment.
Blockchain technology was first introduced in 1991 by two researchers, Stuart
Haber and W. Scott Stornetta. However, blockchain has its first real-world application
in Jan 2009 when Bitcoin was launched. It is a specific type of database based on Python,
C++, and JavaScript. ORACLE, Facebook, and MetLife are a few of those top companies
using Blockchain technology.
Data Visualization
Let us discuss leading Big Data Technologies that come under Data Visualization:
o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by
leading business intelligence industries. It helps in analyzing data at very high
speed. Tableau helps create visualizations and insights in the form of dashboards
and worksheets.
Tableau is developed and maintained by Tableau Software. It was introduced
in May 2013. It is written using multiple languages, such as Python, C, C++, and Java.
Some of the top companies using this tool are Cognos, QlikQ, and ORACLE Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and
relevant components at a faster speed in an efficient way. It consists of several rich
libraries and APIs, such as MATLAB, Python, Julia, REST API, Arduino, R, Node.js,
etc. This helps in styling interactive graphs with Jupyter Notebook and PyCharm.
Plotly was introduced in 2012 by the Plotly company. It is based on JavaScript. Paladins and
Bitbank are some of the companies that are making good use of Plotly.
Expected Properties of a Big Data System
A big data system is expected to satisfy several properties, most of which relate to managing
complexity as the system scales. According to these properties, a big data system should perform
well, be efficient, and be reasonable to operate. Let’s explore these properties step by step.
1. Robustness and error tolerance –
Given the obstacles encountered in distributed systems, it is quite arduous to build a system
that “does the right thing”. Systems are required to behave correctly despite
machines going down randomly, the complex semantics of consistency in distributed
databases, redundancy, concurrency, and many more. These obstacles make it complicated
to reason about the behaviour of the system. Robustness of a big data system is the ability
to overcome these obstacles.
It is also imperative for the system to tolerate human fault. This is an often-disregarded
property of the system which cannot be overlooked. In a production system, it is inevitable
that the operator of the system will make mistakes, such as deploying an incorrect program
that corrupts the database. If re-computation and immutability are
built into the core of a big data system, the system will be distinctly robust against human
fault by delivering a relevant and quite simple mechanism for recovery.
2. Debuggability –
A big data system must provide the information required to debug it when things go wrong.
The key is to be able to trace, for each value in the system, how that value was produced.
Debuggability is achieved in the Lambda Architecture via the functional behaviour of the
batch layer and with the help of re-computation when needed.
3. Scalability –
Scalability is the ability to maintain performance in the face of growing data and load by
adding resources to the system. The Lambda Architecture is horizontally scalable across
all layers of the system stack: scaling is achieved by adding more machines.
4. Generalization –
A general system can support a wide range of applications. Because the Lambda Architecture
is based on functions of all the data, many kinds of applications can run on a generalized
system, including social networking applications, analytics applications, etc.
5. Ad hoc queries –
The ability to perform ad hoc queries on the data is significant. Every large dataset contains
unanticipated value in it. Being able to mine the data constantly provides
opportunities for new applications and business optimization.
6. Extensibility –
An extensible system allows functionality to be added cost-effectively. Sometimes, a new
feature or a change to an existing feature requires migrating pre-existing
data into a new format. Large-scale data migration becomes easy when it is part of
building an extensible system.
7. Low latency reads and updates –
Numerous applications need reads with low latency, between a few milliseconds and a few
hundred milliseconds. In contrast, update latency varies between applications. Some
applications need updates to be propagated with low latency, while others can function with
a few hours of latency. A big data system needs to support low-latency reads and
updates that propagate quickly where required.
8. Minimal Maintenance –
Maintenance is like a tax on developers. It is the work needed to keep the
system functioning smoothly. This includes anticipating when to increase the number of
machines to scale and keeping processes running well, along with debugging them.
Selecting components with as little complexity as possible plays a significant role in minimal
maintenance. A developer should prefer to rely on components with simple, well-understood
mechanisms. Notably, distributed databases tend to have complicated
internals.
UNIT-2
What is Hadoop:-
Hadoop is an open-source framework from Apache and is used to store, process, and analyze data
which are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google,
Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks and
stored in nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework which helps Java programs to do parallel
computation on data using key-value pairs. The Map task takes input data and converts it
into a data set which can be computed as key-value pairs. The output of the Map task is
consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing
the processing time. It is able to process terabytes of data in minutes and petabytes in
hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it
is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the
network, so if one node is down or some other network failure happens, then Hadoop
takes another copy of the data and uses it. Normally, data is replicated three times, but the
replication factor is configurable.
Core Components of Hadoop Architecture
1. Hadoop Distributed File System (HDFS)
One of the most critical components of Hadoop architecture is the Hadoop Distributed
File System (HDFS). HDFS is the primary storage system used by Hadoop applications.
It’s designed to scale to petabytes of data and runs on commodity hardware. What sets
HDFS apart is its ability to maintain large data sets across multiple nodes in a distributed
computing environment.
HDFS operates on the basic principle of storing large files across multiple machines. It
achieves high throughput by dividing large data into smaller blocks, which are managed
by different nodes in the network. This nature of HDFS makes it an ideal choice for
applications with large data sets.
2. Yet Another Resource Negotiator (YARN)
Yet Another Resource Negotiator (YARN) is responsible for managing resources in the
cluster and scheduling tasks for users. It is a key element in Hadoop architecture as it
allows multiple data processing engines such as interactive processing, graph processing,
and batch processing to handle data stored in HDFS.
YARN separates the functionalities of resource management and job scheduling into
separate daemons. This design ensures a more scalable and flexible Hadoop architecture,
accommodating a broader array of processing approaches and a wider array of
applications.
3. MapReduce Programming Model
MapReduce is a programming model integral to Hadoop architecture. It is designed to
process large volumes of data in parallel by dividing the work into a set of independent
tasks. The MapReduce model simplifies the processing of vast data sets, making it an
indispensable part of Hadoop.
MapReduce is characterized by two primary tasks, Map and Reduce. The Map task takes
a set of data and converts it into another set of data, where individual elements are broken
down into tuples. On the other hand, the Reduce task takes the output from the Map as
input and combines those tuples into a smaller set of tuples.
4. Hadoop Common
Hadoop Common, often referred to as the ‘glue’ that holds Hadoop architecture together,
contains libraries and utilities needed by other Hadoop modules. It provides the necessary
Java files and scripts required to start Hadoop. This component plays a crucial role in
ensuring that the hardware failures are managed by the Hadoop framework itself, offering
a high degree of resilience and reliability.
Apache Hadoop is an open source framework intended to make interaction with big
data easier. However, for those who are not acquainted with this technology, one question
arises: what is big data? Big data is a term given to data sets which can’t be processed
in an efficient manner with the help of traditional methodology such as RDBMS. Hadoop has
made its place in the industries and companies that need to work on large data sets which are
sensitive and need efficient handling. Hadoop is a framework that enables processing of large
data sets which reside in the form of clusters. Being a framework, Hadoop is made up of
several modules that are supported by a large ecosystem of technologies.
Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to
solve the big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and
Hadoop Common Utilities. Most of the tools or solutions are used to supplement or support
these major elements. All these tools work collectively to provide services such as absorption,
analysis, storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Hive Architecture-
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-
o Thrift Server - It is a cross-language service provider platform that serves requests
from all those programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between hive and Java applications.
The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.
Hive Services
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of the
various tables and partitions in the warehouse. It also includes metadata about columns and their
type information, the serializers and deserializers which are used to read and write data, and
the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over
several machines and replicated to ensure durability against failure and high availability for
parallel applications.
It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, data nodes
and the name node.
HDFS is not a good fit in the following cases:
o Low latency data access: Applications that require very low latency to access the first record
should not use HDFS, as it gives importance to throughput over the whole data set rather than
the time to fetch the first record.
o Lots of small files: The name node holds the metadata of files in memory, and if there are many
small files, this consumes a lot of the name node's memory, which is not
feasible.
o Multiple writes: HDFS should not be used when files need to be written by multiple writers or
modified at arbitrary positions.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS
is smaller than the block size, it does not occupy the full block's size; i.e., a 5 MB file
stored in HDFS with a block size of 128 MB takes 5 MB of space only. The HDFS block size is
large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the
master. The name node is the controller and manager of HDFS, as it knows the status and the
metadata of all the files in HDFS; the metadata includes file permissions, names,
and the location of each block. The metadata is small, so it is stored in the memory of the name
node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple
clients concurrently, so all this information is handled by a single machine. The file system
operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the
name node. They report back to the name node periodically with the list of blocks that they are
storing. The data node, being commodity hardware, also does the work of block creation,
deletion and replication as instructed by the name node (a minimal Java sketch follows this list).
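As a minimal sketch of how a client works with the name node and data nodes through the HDFS Java API, the example below writes a small file. It assumes the hadoop-client dependency and a cluster reachable at hdfs://localhost:9000; the file path is hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write one small file into HDFS and print the default block size.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // address of the name node

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/notes/hello.txt");
            // The client asks the name node where to place blocks; the bytes themselves
            // are streamed to data nodes and replicated per the replication factor.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Block size used: " + fs.getDefaultBlockSize(file));
        }
    }
}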
YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to
remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was described
as a “Redesigned Resource Manager” at the time of its launching, but it has now evolved to be
known as large-scale distributed operating system used for Big Data processing.
YARN architecture basically separates resource management layer from the processing layer.
In Hadoop 2.0, the responsibility of the Hadoop 1.0 Job Tracker was split between the Resource
Manager and the Application Master.
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite necessary to manage
the available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features:
Advantages:
Flexibility: YARN offers flexibility to run various types of distributed processing systems
such as Apache Spark, Apache Flink, Apache Storm, and others. It allows multiple
processing engines to run simultaneously on a single Hadoop cluster.
Resource Management: YARN provides an efficient way of managing resources in the
Hadoop cluster. It allows administrators to allocate and monitor the resources required by
each application in a cluster, such as CPU, memory, and disk space.
Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in
a cluster. It can scale up or down based on the requirements of the applications running on
the cluster.
Improved Performance: YARN offers better performance by providing a centralized
resource management system. It ensures that the resources are optimally utilized, and
applications are efficiently scheduled on the available resources.
Security: YARN provides robust security features such as Kerberos authentication, Secure
Shell (SSH) access, and secure data transmission. It ensures that the data stored and
processed on the Hadoop cluster is secure.
Disadvantages:
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do which is
comprised of so many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The task or sub-jobs that are obtained after dividing the main job. The result of
all the job-parts combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
In MapReduce, we have a client. The client will submit the job of a particular size to the
Hadoop MapReduce Master. Now, the MapReduce master will divide this job into further
equivalent job-parts. These job-parts are then made available for the Map and Reduce Task.
This Map and Reduce task will contain the program as per the requirement of the use-case that
the particular company is solving. The developer writes their logic to fulfill the requirement
that the industry requires. The input data which we are using is then fed to the Map Task and
the Map will generate intermediate key-value pair as its output. The output of Map i.e. these
key-value pairs are then fed to the Reducer and the final output is stored on the HDFS. There
can be n number of Map and Reduce tasks made available for processing the data as per the
requirement. The Map and Reduce algorithms are written in an optimized way so that
time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests its main use is to map the input data in key-value pairs. The
input to the map may be a key-value pair where the key can be the id of some kind of
address and value is the actual value that it keeps. The Map() function will be executed in
its memory repository on each of these input key-value pairs and generates the intermediate
key-value pair which works as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled,
sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on
its key as per the reducer algorithm written by the developer (a minimal word-count sketch of
both phases follows this list).
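Here is the classic word-count sketch of the two phases just described, written against the Hadoop MapReduce Java API. It assumes the hadoop-mapreduce-client libraries are on the classpath; a driver class configuring a Job would still be needed to actually submit it.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: emit an intermediate (word, 1) pair for every word in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: values for the same word arrive grouped; sum them up.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}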
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all the jobs across
the cluster and also to schedule each map on the Task Tracker running on the same data
node since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Tracker can be considered as the actual slaves that are working
on the instruction given by the Job Tracker. This Task Tracker is deployed on each of the
nodes available in the cluster that executes the Map and Reduce task as instructed by Job
Tracker.
There is also one important component of MapReduce architecture known as the Job History
Server. The Job History Server is a daemon process that saves and stores historical
information about the task or application; the logs generated during or after the
job execution are stored on the Job History Server.
UNIT-3
– Introduction to Hadoop, Apache Hive
The major components of Hive and its interaction with the Hadoop is demonstrated in the
figure below and all the components are described further:
User Interface (UI) –
As the name describes, the user interface provides an interface between the user and Hive. It
enables the user to submit queries and other operations to the system. The Hive Web UI, the Hive
command line, and Hive HD Insight (on Windows Server) are supported by the user interface.
Hive Server – It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.
Driver –
Queries submitted by the user through the interface are received by the driver within Hive. The
concept of session handles is implemented by the driver. Execute and fetch APIs modelled on
JDBC/ODBC interfaces are provided by the driver.
Compiler –
Queries are parsed, and semantic analysis on the different query blocks and query expressions is
done by the compiler. The execution plan is eventually generated by the compiler with the help
of the table and partition metadata fetched from the metastore.
Metastore –
All the structural information of the different tables and partitions in the warehouse, including
attributes and attribute-level information, is stored in the metastore, along with the serializers
and deserializers necessary to read and write data and the corresponding HDFS files where
the data is stored. Hive selects corresponding database servers to store the schema or
metadata of databases, tables, attributes in a table, data types of databases, and HDFS
mapping.
Execution Engine –
Execution of the execution plan made by the compiler is performed by the execution
engine. The plan is a DAG of stages. The dependencies between the various stages of the
plan are managed by the execution engine, and it executes these stages on the suitable
system components.
Diagram – Architecture of Hive built on top of Hadoop
In the above diagram along with architecture, job execution flow in Hive with Hadoop is
demonstrated step by step.
Step-1: Execute Query –
An interface of Hive such as the command line or the web user interface delivers the query to
the driver to execute. Here, the UI calls the execute interface of the driver, such as ODBC or
JDBC.
Step-2: Get Plan –
The driver creates a session handle for the query and transfers the query to the compiler to
make the execution plan. In other words, the driver interacts with the compiler.
Step-3: Get Metadata –
In this step, the compiler sends a metadata request to the metastore for the databases and tables
involved in the query.
Step-4: Send Metadata –
The metastore transfers the metadata as an acknowledgment to the compiler.
Step-5: Send Plan –
The compiler communicates with the driver, sending the execution plan it has prepared to
execute the query.
Step-6: Execute Plan –
The execution plan is sent to the execution engine by the driver. Within this step, the following
sub-steps occur:
o Execute Job
o Job Done
o DFS operation (Metadata Operation)
Step-7: Fetch Results –
The execution engine fetches the results from the data nodes.
Step-8: Send Results –
The results are sent from the execution engine to the driver, which forwards them to the
user interface (UI).
Advantages of Hive:
Scalability: Hive is a distributed system that can easily scale to handle large volumes of data
by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the need for
complex programming skills. SQL-like language is used for queries and HiveQL is based on
SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the Hadoop
ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports various data
formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and encryption
to ensure data privacy.
Limitations of Hive:
High Latency: Hive’s performance is slower compared to traditional databases because of the
overhead of running queries in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it is designed
for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in Hadoop, SQL,
and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations, such as transactions
and indexes, which may limit the usefulness of the tool for certain applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are executed
across a distributed system, and errors may occur in different nodes.
Integer Types
BIGINT – 8-byte signed integer, ranging from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807.
Decimal Types
FLOAT (4-byte single precision), DOUBLE (8-byte double precision), and DECIMAL.
Date/Time Types
TIMESTAMP – stores both date and time-of-day values.
DATE – used to specify a particular year, month and day, in the form YYYY-MM-DD. However,
it does not provide the time of the day. The range of the Date type lies between 0000-01-01
and 9999-12-31.
String Types
STRING – a sequence of characters. Its values can be enclosed within single quotes (') or double
quotes (").
VARCHAR – a variable-length type whose length lies between 1 and 65535, which specifies the
maximum number of characters allowed in the character string.
CHAR – a fixed-length character string.
Complex Types
Hive also supports complex types such as ARRAY, MAP, STRUCT, and UNIONTYPE.
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement with
WHERE clause.
SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a
condition. It filters the data using the condition and gives you a finite result. The built-in
operators and functions generate an expression, which fulfils the condition.
Syntax
The general form of the SELECT statement with a WHERE clause is:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition];
Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary > 30000;

On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
JDBC Program
The JDBC program to apply where clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000;");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " +
            res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands to compile
and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:
The program prints the employee records with a salary greater than 30000, matching the result
table shown above.
Apache Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. It
was developed by Yahoo, and the language used for Pig is Pig Latin. This section covers Pig
usage, Pig installation, Pig run modes, Pig Latin concepts, Pig data types, Pig examples, and Pig
user-defined functions.
Pig scripts get internally converted to MapReduce jobs and are executed on data stored in
HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the
corresponding results in the Hadoop Distributed File System. Every task which can be achieved
using Pig can also be achieved by writing Java MapReduce code.
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes
this process easy. In Pig, the queries are converted to MapReduce internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions containing their own logic to execute over the data
set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators such as sort, filter and joins.
Unlike MapReduce, which doesn't allow nested data types, Pig provides nested data types like
tuple, bag, and map.
o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - The Pig code is flexible enough to be reused again.
o Nested data types - The Pig provides a useful concept of nested data types like tuple, bag,
and map.
ETL Processing-
INTRODUCTION:
ETL stands for Extract, Transform, Load. It is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a data
warehouse, and then load it into the warehouse. The ETL process can be broken down into the
following three stages (a small self-contained sketch of these stages follows this list):
1. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.
The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data warehouse
is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format
required for data mining and reporting.
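The following toy, self-contained Java sketch mirrors the three stages above. Real pipelines use dedicated ETL tools; here the "source" is just a CSV string and the "warehouse" is an in-memory list, so every name in it is illustrative only.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of Extract, Transform, Load on toy in-memory data.
public class EtlSketch {
    public static void main(String[] args) {
        // Extract: read raw records from the source system (here, CSV lines).
        List<String> raw = Arrays.asList("1, gopal ,45000", "2, manisha ,45000", "3, kiran ,40000");

        // Transform: clean and validate fields, convert types, standardize values.
        List<String[]> cleaned = new ArrayList<>();
        for (String line : raw) {
            String[] parts = line.split(",");
            String name = parts[1].trim();
            name = name.substring(0, 1).toUpperCase() + name.substring(1); // standardize case
            int salary = Integer.parseInt(parts[2].trim());                // convert data type
            if (salary > 0) {                                              // simple validation
                cleaned.add(new String[] {parts[0].trim(), name, String.valueOf(salary)});
            }
        }

        // Load: write the transformed records into the target "warehouse".
        List<String[]> warehouse = new ArrayList<>(cleaned);
        warehouse.forEach(r -> System.out.println(String.join(" | ", r)));
    }
}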
Additionally, there are many different ETL tools and technologies available, such as
Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms it
in the staging area, and then finally, loads it into the Data Warehouse system.
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder,
CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift,
BigQuery, and Firebolt.
Advantages of ETL:
1. Improved data quality: ETL process ensures that the data in the data warehouse is
accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from multiple sources and
systems, making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security by controlling
access to the data warehouse and ensuring that only authorized users can access the data.
4. Improved scalability: ETL process can help to improve scalability by providing a way to
manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and simplify the ETL
process, reducing the time and effort required to load and update data in the warehouse.
Disadvantages of ETL:
1. High cost: ETL process can be expensive to implement and maintain, especially for
organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement, especially for
organizations that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be
able to handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it may not be
able to handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as large
amounts of data are collected, stored, and analyzed.
Local Mode
o It executes in a single JVM and is used for development, experimenting, and prototyping.
o Here, files are installed and run using localhost.
o The local mode works on the local file system. The input and output data are stored in the
local file system.
The command for the local mode Grunt shell:

$ pig -x local
MapReduce Mode
The command for the MapReduce mode Grunt shell:

$ pig

Or,

$ pig -x mapreduce
These are the following ways of executing a Pig program on local and MapReduce mode: -
o Interactive Mode - In this mode, the Pig is executed in the Grunt shell. To invoke Grunt
shell, run the pig command. Once the Grunt mode executes, we can provide Pig Latin
statements and command interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension. These files
contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions, called UDFs
(User-Defined Functions), using programming languages like Java and Python (a minimal
Java UDF sketch follows this list).
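As a minimal sketch of the Embedded mode just described, here is a Java UDF that a Pig Latin script could REGISTER and then call like a built-in function. It assumes the pig library is on the classpath; the class name is hypothetical.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal sketch of a Pig eval UDF: upper-cases the first field of each tuple.
public class UpperCaseUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                      // pass nulls through unchanged
        }
        return input.get(0).toString().toUpperCase();
    }
}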
Apache Pig Operators with Syntax and Examples
There is a huge set of Apache Pig operators available in Apache Pig. In this section we will
discuss the main types of Apache Pig operators, such as diagnostic operators, grouping &
joining, combining & splitting, and many more, along with their subtypes, syntax, and
examples.
What are Apache Pig Operators?
We have a huge set of Apache Pig Operators, for performing several types of Operations. Let’s
discuss types of Apache Pig Operators:
1. Diagnostic Operators
2. Grouping & Joining
3. Combining & Splitting
4. Filtering
5. Sorting
FUNCTIONS
Apache Pig provides various built-in functions namely eval, load, store, math, string,
bag and tuple functions.
Eval Functions
1. AVG(): To compute the average of the numerical values within a bag.
2. BagToString(): To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).
3. CONCAT(): To concatenate two or more expressions of the same type.
4. COUNT(): To get the number of elements in a bag by counting the number of tuples in it.
5. COUNT_STAR(): Similar to COUNT(); it gets the number of elements in a bag, but unlike COUNT() it also includes null values.
6. DIFF(): To compare two bags (fields) in a tuple.
7. IsEmpty(): To check if a bag or map is empty.
8. MAX(): To calculate the highest value for a column (numeric values or chararrays) in a single-column bag.
9. MIN(): To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.
10. PluckTuple(): To define a string prefix and filter the columns in a relation that begin with the given prefix.
11. SIZE(): To compute the number of elements based on any Pig data type.
12. SUBTRACT(): To subtract two bags. It takes two bags as inputs and returns a bag containing the tuples of the first bag that are not in the second bag.
13. SUM(): To get the total of the numeric values of a column in a single-column bag.
14. TOKENIZE(): To split a string (which contains a group of words) in a single tuple and return a bag containing the output of the split operation.
UNIT-4
NoSQL Databases
MongoDB is a NoSQL database, so it is necessary to understand NoSQL databases in order to understand MongoDB thoroughly.
NoSQL Database
A NoSQL database provides a mechanism for the storage and retrieval of data other than the tabular relations model used in relational databases. A NoSQL database does not use tables for storing data. It is generally used to store big data and data for real-time web applications.
In the early 1970s, flat file systems were used. Data were stored in flat files, and the biggest problem with flat files was that each company implemented its own format; there were no standards. It was very difficult to store data in and retrieve data from such files because there was no standard way to do so.
Then the relational database was created by E.F. Codd, and these databases answered the question of having no standard way to store data. But later the relational database also ran into a problem: it could not handle big data. Due to this, there was a need for a database that could handle every type of problem, and so the NoSQL database was developed.
Architecture Patterns of NoSQL
An architecture pattern is a logical way of categorizing the data that will be stored in the database. NoSQL is a type of database that helps to perform operations on big data and store it in a valid format. It is widely used because of its flexibility and the wide variety of services it supports.
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the
data is stored in form of Key-Value Pairs. The key is usually a sequence of strings, integers or
characters but can also be a more advanced data type. The value is typically linked or co-
related to the key. The key-value pair storage databases generally store data as a hash table
where each key is unique. The value can be of any type (JSON, BLOB(Binary Large Object),
strings, etc). This type of pattern is usually used in shopping websites or e-commerce
applications.
Advantages:
Can handle large amounts of data and heavy load,
Easy retrieval of data by keys.
Limitations:
Complex queries may involve multiple key-value pairs, which can degrade performance.
Data involving many-to-many relationships is difficult to model in this pattern.
Examples:
DynamoDB
Berkeley DB
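To make the access pattern concrete, here is a minimal Python sketch of a key-value store; the store, put, and get names are purely illustrative (real systems such as DynamoDB add persistence, partitioning, and replication on top of the same idea).

# Toy in-memory key-value store: each unique key maps to an opaque value.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    # Retrieval is a direct lookup by key, with no joins and no fixed schema.
    return store.get(key, default)

put("cart:user42", {"items": ["sku-101", "sku-205"], "total": 1499.0})
print(get("cart:user42"))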
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells that are further grouped into columns. Column-oriented databases work on columns: they store large amounts of data for a column together. The format and titles of the columns can diverge from one row to another, and every column is treated separately. Still, each column family may contain multiple columns, much as tables do in traditional databases. Basically, columns are the unit of storage in this type.
Advantages:
Data is readily available
Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
HBase
Bigtable by Google
Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be regarded as a complex data structure: it can be text, arrays, strings, JSON, XML, or any such format. The use of nested documents is also very common. This pattern is very effective, as most of the data created today is JSON-like and semi-structured or unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage retrieval and managing of documents is easy.
Limitations:
Handling multiple documents is challenging
Aggregation operations may not work accurately.
Examples:
MongoDB
CouchDB
Figure – Document Store Model in form of JSON documents
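As a rough illustration of the document model, the sketch below stores two differently shaped documents in MongoDB using the pymongo driver; it assumes a mongod instance is reachable on localhost:27017, and the database and collection names are made up for the example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["demo_store"]           # hypothetical database name

# Documents in the same collection do not need to share a schema.
db.products.insert_one({"name": "Laptop", "price": 55000, "tags": ["electronics"]})
db.products.insert_one({"name": "Novel", "author": "Jane Doe", "price": 350})

print(db.products.find_one({"name": "Laptop"}))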
4. Graph Database:
Clearly, this architecture pattern deals with the storage and management of data in graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier, and each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics that are connected by edges. The relational database pattern has tables that are loosely connected, whereas graphs are often very strong and rigid in nature.
Advantages:
Fastest traversal because of connections.
Spatial data can be easily handled.
Limitations:
Wrong connections may lead to infinite loops.
Examples:
Neo4J
FlockDB (used by Twitter)
Figure – Graph model format of NoSQL Databases
Social Media Mining
Social media has been around for about 30 years, but the rise in the user base is recent. The data from social media platforms can be used to boost business, and this is where social media mining comes in. The amount of data these companies have to deal with is huge and scattered; social media mining helps in extracting meaning from that data.
It is a process of knowledge discovery from databases. It is not wrong to say that social media is the biggest contributor to Big Data. The records themselves are not new; they have been around for a long time. However, the ability to process these records has developed. Social media mining can help us get insights into customer behavior and interests, and using this information, you can serve customers better and compound your earnings.
Benefits of Social Media Mining
We post a lot of information on social media. In this era where algorithms are everywhere, they can generate information about the future trends and habits of users, which plays a major role in today's world. Social media mining has become a must-have technique in every business. Here are some of the benefits you can derive from using social media mining:
1. Spot Trends Before They Become Trends
The data available from social media platforms can give important insights regarding society and user behavior that were not possible earlier; finding them used to be like finding a needle in a haystack. In today's world, the data keeps growing and evolving with time, and there are multiple needles to look for. Social media data mining is a technique that is capable of finding them all.
It is a process that starts with identifying the target audience and ends with digging into what
they are passionate about. Businesses may analyze the keywords, search results, comments,
and mentions to identify the current trend, and a deeper study of behavior change can also help
in predicting future trends. This data is very useful for businesses to make informed decisions
when the stakes are high.
2. Sentiment Analysis
Sentiment Analysis is the process of identifying positive or negative sentiments portrayed in
information posted on social media platforms. Businesses use Social Media Mining to identify
the same sentiments associated with their brand and product lines.
Sentiment Analysis has a vast application, and its use cannot be limited to self-evaluation only.
Negative sentiment about competitors can be an opportunity to win over their customers. The Nestle Maggi ban is a perfect example of this: competitors in the noodles market used strategies to market their products as being made from healthier alternatives. Patanjali saw this opportunity and launched its noodles, claiming they were made from atta (whole wheat flour), while the market was full of noodles made from refined wheat flour (maida).
When combined with social media monitoring, sentiment analysis can help you analyze
your brand image and bring negative aspects of the business to your attention. With this
information, you can address the negative sentiments and prioritize them so that they can be
addressed properly to improve the customer experience.
3. Keyword Identification
In a world where more than 90% of businesses function online, the importance of using the right words cannot be emphasized enough. A business has to stand out to compete in a world where your sales team cannot charm customers with their looks and cheesy talk. Keywords can give your business an edge over its competitors.
Keywords are the words that reveal the behavior of users and highlight the frequently used and popular terms related to your products. Social media data mining can be highly effective in finding these keywords. The process is as basic as scanning the list of the most frequent words or phrases used by customers to search for or describe your product.
Using these keywords to describe your product in digital media and implementing SEO can yield pretty good results. Your product will rank higher, and by using frequent and popular terms, you can make your product listings better.
4. Create a Better Product
Before the use of Big Data, businesses used to conduct individual surveys to know the public’s
opinion about their product. They faced many challenges; people didn’t entertain them, and even
if someone participated, it was very likely that their answers were not credible. With the
implementation of Social Media Data Mining, the public is responding and participating in
surveys without even realizing it, which provides companies with candid data.
Using the processed data, you can identify the things that bother customers, which might give insights into how you can improve your product to make it even better. In other words, you are seeking advice and opinions from millions of users. By using so much data, you are essentially tweaking your product in such a way that the probability of its success is very high. By analyzing the user-base information, you can also target the social media platform with the highest number of users.
5. Competitor Analysis
You are not wrong to assume that your competitors are already using data mining techniques to monitor the market, and to compete with them, it becomes essential that you do too. Improving yourself by analyzing others' mistakes is often less painful than learning from your own.
There's nothing wrong with following in the footsteps of a good competitor. You might not make a fortune, but it will still help you survive tough times. Analyzing competitor behavior on social media during the launch of a product will help you identify a trend and use it to your advantage.
Posts by competitor employees and management regarding hiring may give you an idea of business expansion, and even a subtle change in operations can help you be proactive. Having an idea of when to stay on your toes is advantageous in highly competitive industries.
6. Event Identification
Also known as social heat mapping, this is a part of social media mining that helps researchers and agencies to be prepared for unexpected outbursts.
An example of implementing heat mapping on social media was seen during the Farmer Protests, when huge crowds were approaching the venue of the Republic Day celebration.
7. Manage Real-time Events
This approach is mainly used for events, incidents, or any issues that occur on social media. Researchers and government departments identify big issues by applying heat mapping or other techniques to social media sources. They detect events and figure out information faster than traditional sensor approaches. Many users publish information using their cell phones, so event identification is real-time and up to date. Organizations can respond faster as people share information during disasters or social events.
8. Provide Useful Content (and Stop Spamming)
Social media is very close to modern life. This application of social media mining uses computer algorithms to help companies share information in a way that users prefer and to avoid spam. It can help organizations identify small patterns and recognize customers who might be interested in their products. Social media platforms themselves can use these techniques to remove objectionable content. As a result, social media mining helps deliver useful content to users instead of spam.
9. Recognize Behavior
Social media mining analyzes our real behavior and helps us learn about people. Organizations use these techniques to understand customers, governments use them to identify the right stakeholders, and scientists use them to explain events. Therefore, social media mining helps us understand how events link together in ways that we may not have noticed earlier.
A social network graph is a graph where the nodes represent people and the lines between
nodes, called edges, represent social connections between them, such as friendship or working
together on a project. These graphs can be either undirected or directed. For instance, Facebook
can be described with an undirected graph since the friendship is bidirectional, Alice and Bob
being friends is the same as Bob and Alice being friends. On the other hand, Twitter can be
described with a directed graph: Alice can follow Bob without Bob following Alice.
Social networks tend to have characteristic network properties. For instance, there tends to be a
short distance between any two nodes (as in the famous six degrees of separation study where
everyone in the world is at most six degrees away from any other), and a tendency to form
"triangles" (if Alice is friends with Bob and Carol, Bob and Carol are more likely to be friends
with each other.)
Social networks are important to social scientists interested in how people interact as well as
companies trying to target consumers for advertising. For instance if advertisers connect up three
people as friends, co-workers, or family members, and two of them buy the advertiser's product,
then they may choose to spend more in advertising to the third hold-out, on the belief that this
target has a high propensity to buy their product.
Social networks are networks that depict the relations between people in the form of a graph for different kinds of analysis. The graph used to store the relationships between people is known as a sociogram. All the graph points and lines are stored in a matrix data structure called a sociomatrix. The relationships can be of any kind: kinship, friendship, enmity, acquaintance, colleagues, neighbors, disease transmission, etc.
Social Network Analysis (SNA) is the process of exploring or examining the social structure
by using graph theory. It is used for measuring and analyzing the structural properties of the
network. It helps to measure relationships and flows between groups, organizations, and other
connected entities. We need specialized tools to study and analyze social networks.
Basically, there are two types of social network analysis:
Ego Network Analysis
Complete Network Analysis
1. Ego Network Analysis
Ego network analysis finds the relationships around individual people. The analysis is done for a particular sample of people chosen from the whole population, and this sampling is done randomly. The attributes involved in ego network analysis are the size and diversity of a person's network, etc.
This analysis is done through traditional surveys: people are asked with whom they interact and what their relationship with each of those contacts is. It is not focused on finding the relationship between everyone in the sample; it is an effort to find the density of the network in those samples. Hypotheses about the network are then tested using statistical hypothesis testing techniques.
The following functions are served by Ego Networks:
Propagation of information efficiently.
Sensemaking from links, for example, social links and relationships.
Access to resources, efficient connection path generation.
Community detection, identification of the formation of groups.
Analysis of the ties among individuals for social support.
2. Complete Network Analysis
Complete network analysis underlies most network analyses: subgroup analysis, centrality measures, and equivalence analysis are all based on the complete network. It analyzes the relationships among all members of a sample chosen from the larger population. This analysis helps an organization or company make decisions based on the relationships it reveals. Testing the sample shows the relationships in the whole network, since the sample is taken from a single domain.
The difference between ego and complete network analysis is that ego network analysis focuses on collecting the relationships of the people in the sample with the outside world, whereas complete network analysis focuses on finding the relationships among the members of the sample themselves.
The majority of network analyses are done only for a particular domain or a single organization, not for relationships between organizations, so most social network analysis measures use complete network analysis.
Social Network:
When we think of a social network, we think of Facebook, Twitter, Google+, or another website
that is called a “social network,” and indeed this kind of network is representative of the broader
class of networks called “social.” The essential characteristics of a social network are:
1. There is a collection of entities that participate in the network. Typically, these entities
are people, but they could be something else entirely
2. There is at least one relationship between entities of the network. On Facebook or its ilk,
this relationship is called friends. Sometimes the relationship is all-or-nothing; two
people are either friends or they are not.
3. There is an assumption of non-randomness or locality. This condition is the hardest to
formalize, but the intuition is that relationships tend to cluster. That is, if entity A is
related to both B and C, then there is a higher probability than average that B and C are
related.
Social network as Graphs: Social networks are naturally modeled as graphs, which we
sometimes refer to as a social graph. The entities are the nodes, and an edge connects two nodes
if the nodes are related by the relationship that characterizes the network. If there is a degree
associated with the relationship, this degree is represented by labeling the edges. Often, social
graphs are undirected, as for the Facebook friends graph. But they can be directed graphs, as for
example the graphs of followers on Twitter or Google+.
The above figure is an example of a tiny social network. The entities are the nodes A through G. The
relationship, which we might think of as “friends,” is represented by the edges. For instance, B is
friends with A, C, and D.
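As a small sketch, the friendships stated in the text can be rebuilt with the networkx library; since the full figure is not reproduced here, only B's edges are added and the remaining nodes are left unconnected.

import networkx as nx

G = nx.Graph()                                            # undirected, like a "friends" graph
G.add_nodes_from(["A", "B", "C", "D", "E", "F", "G"])
G.add_edges_from([("B", "A"), ("B", "C"), ("B", "D")])    # B is friends with A, C and D

print(sorted(G.neighbors("B")))   # ['A', 'C', 'D']
print(G.degree("B"))              # 3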
Clustering of Social-Network Graphs:
Clustering of the graph is considered a way to identify communities. Clustering of graphs involves the following steps:
1. Distance Measures for Social-Network Graphs
If we were to apply standard clustering techniques to a social-network graph, our first step would
be to define a distance measure. When the edges of the graph have labels, these labels might be
usable as a distance measure, depending on what they represented. But when the edges are
unlabeled, as in a “friends” graph, there is not much we can do to define a suitable distance.
Our first instinct is to assume that nodes are close if they have an edge between them and distant
if not. Thus, we could say that the distance d(x, y) is 0 if there is an edge (x, y) and 1 if there is
no such edge. We could use any other two values, such as 1 and ∞, as long as the distance is
closer when there is an edge.
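A minimal sketch of this distance, assuming the undirected friendship graph G built in the earlier sketch:

def distance(G, x, y):
    # d(x, y) = 0 if the edge (x, y) exists, 1 otherwise; any two values work
    # as long as connected nodes are "closer" than unconnected ones.
    return 0 if G.has_edge(x, y) else 1

print(distance(G, "A", "B"))   # 0 -- they are friends
print(distance(G, "A", "C"))   # 1 -- no direct edge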
2. Applying Standard Clustering Methods
There are two general approaches to clustering: hierarchical (agglomerative) and point-
assignment. Let us consider how each of these would work on a social-network graph.
Hierarchical clustering of a social-network graph starts by combining some two nodes that are
connected by an edge. Successively, edges that are not between two nodes of the same cluster
would be chosen randomly to combine the clusters to which their two nodes belong. The choices
would be random, because all distances represented by an edge are the same.
Now, consider a point-assignment approach to clustering social networks. Again, the fact that all
edges are at the same distance will introduce a number of random factors that will lead to some
nodes being assigned to the wrong cluster.
3. Betweenness:
Since there are problems with standard clustering methods, several specialized clustering
techniques have been developed to find communities in social networks. The simplest one is
based on finding the edges that are least likely to be inside the community.
Define the betweenness of an edge (a, b) to be the number of pairs of nodes x and y such that the
edge (a, b) lies on the shortest path between x and y. To be more precise, since there can be
several shortest paths between x and y, edge (a, b) is credited with the fraction of those shortest
paths that include the edge (a, b). As in golf, a high score is bad. It suggests that the edge (a, b)
runs between two different communities; that is, a and b do not belong to the same community.
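As a small illustration with networkx: in a graph made of two triangles joined by a single bridge edge, the bridge receives the highest edge betweenness, matching the intuition that it runs between two communities.

import networkx as nx

G2 = nx.Graph()
G2.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),    # first triangle (community 1)
                   ("d", "e"), ("e", "f"), ("d", "f"),    # second triangle (community 2)
                   ("c", "d")])                           # bridge between the communities

betweenness = nx.edge_betweenness_centrality(G2, normalized=False)
for edge, score in sorted(betweenness.items(), key=lambda kv: -kv[1]):
    print(edge, score)        # the bridge ('c', 'd') gets the highest score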
4. The Girvan-Newman Algorithm:
In order to exploit the betweenness of edges, we need to calculate the number of shortest paths
going through each edge. We shall describe a method called the Girvan-Newman (GN)
Algorithm, which visits each node X once and computes the number of shortest paths from X to
each of the other nodes that go through each of the edges. The algorithm begins by performing a
breadth-first search (BFS) of the graph, starting at the node X. Note that the level of each node in
the BFS presentation is the length of the shortest path from X to that node. Thus, the edges that
go between nodes at the same level can never be part of a shortest path from X.
Edges between levels are called DAG edges (“DAG” stands for directed, acyclic graph). Each
DAG edge will be part of at least one shortest path from root X. If there is a DAG edge (Y, Z),
where Y is at the level above Z (i.e., closer to the root), then we shall call Y a parent of Z and Z a
child of Y, although parents are not necessarily unique in a DAG as they would be in a tree.
5. Using betweenness to find communities:
The betweenness scores for the edges of a graph behave something like a distance measure on
the nodes of the graph. It is not exactly a distance measure, because it is not defined for pairs of
nodes that are unconnected by an edge, and might not satisfy the triangle inequality even when
defined. However, we can cluster by taking the edges in order of increasing betweenness and add
them to the graph one at a time. At each step, the connected components of the graph form some
clusters.
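For illustration, networkx ships a Girvan-Newman implementation; running it on the two-triangle graph G2 from the earlier sketch removes the bridge first and yields the two expected communities.

from networkx.algorithms.community import girvan_newman

communities = next(girvan_newman(G2))        # communities after the first split
print([sorted(c) for c in communities])      # the two triangles: {a, b, c} and {d, e, f}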
Recommendation Systems
A recommendation system suggests items to users based on data about the items, the users, and their interactions. The main types are:
1. Content-based filtering: This type of system uses the characteristics of items that a user has liked in the past to recommend similar items.
2. Collaborative filtering: This type of system uses the past behaviour of users to
recommend items that similar users have liked.
3. Hybrid: To generate suggestions, this kind of system combines content-based filtering
and collaborative filtering techniques.
4. Matrix Factorization: Using this method, the user-item matrix is divided into two
lower-dimension matrices that are then utilized to generate predictions.
5. Deep Learning: To train the user and item representations that are subsequently utilized
to generate recommendations, these models make use of neural networks.
The choice of which type of recommendation system to use depends on the specific application
and the type of data available.
It's worth noting that recommendation systems are widely used and can have a significant impact
on businesses and users. However, it's important to consider ethical considerations and biases
that may be introduced to the system.
In this section, we utilize a dataset from Kaggle: Articles Sharing and Reading from CI&T DeskDrop.
For the purpose of giving customers individualized suggestions, we will demonstrate how to
develop Collaborative Filtering, Content-Based Filtering, and Hybrid techniques in Python.
The DeskDrop dataset comes from CI&T's internal communication platform and is an actual sample of 12 months' worth of logs (from March 2016 to February 2017). It contains around 73k logged user interactions on more than 3k publicly shared articles. Two CSV files make up the dataset:
o shared_articles.csv
o users_interactions.csv
Importing Libraries
1. import sklearn
2. import scipy
3. import numpy as np
4. import random
5. import pandas as pd
6. from nltk.corpus import stopwords
7. from scipy.sparse import csr_matrix
8. from sklearn.model_selection import train_test_split
9. from sklearn.metrics.pairwise import cosine_similarity
10. from scipy.sparse.linalg import svds
11. from sklearn.feature_extraction.text import TfidfVectorizer
12. from sklearn.preprocessing import MinMaxScaler
13. import matplotlib.pyplot as plt
14. import math
Here, we have to load our dataset to perform the machine learning operations.
1. shared_articles.csv
It includes data on the articles posted on the platform. Each article contains a timestamp for
when it was shared, the original url, the title, plain text content, the language it was shared in
(Portuguese: pt or English: en), and information about the individual who shared it (author).
o CONTENT SHARED: The article was shared on the platform and is available to users.
o CONTENT REMOVED: The article has been taken down from the site and is no longer
accessible for recommendations.
We will analyze only the "CONTENT SHARED" event type here for simplicity, making the (imprecise) assumption that all articles were available for the whole one-year period. For a more accurate evaluation, only the publications that were available at a given time should be recommended, but we will keep this simplification for this exercise.
1. dataframe_articles = pd.read_csv('shared_articles.csv')
2. dataframe_articles = dataframe_articles[dataframe_articles['eventType'] == 'CONTENT SHARED']
3. dataframe_articles.head(5)
2. users_interactions.csv
It includes user interaction records for shared content. By using the contentId field, it can be joined to shared_articles.csv.
Data Manipulation
Here, we assign a weight or strength to each type of interaction, since there are many kinds. For instance, we believe that a comment on an article indicates a stronger interest in the item than a like or a simple view.
1. strength_of_event_type = {
2. 'VIEW': 1.0,
3. 'LIKE': 2.0,
4. 'BOOKMARK': 2.5,
5. 'FOLLOW': 3.0,
6. 'COMMENT CREATED': 4.0,
7. }
8.
9. dataframe_interactions['eventStrength'] = dataframe_interactions['eventType'].apply(lambda x: strength_of_event_type[x])
Note: User cold-start is a problem with recommender systems that makes it difficult to give
consumers with little or no consumption history individualized recommendations since there isn't
enough data to model their preferences.
Due to this, we are only retaining users in the dataset who had at least five interactions.
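The snippet below assumes that this filtering has already produced a dataframe_interaction_from_selected_users table; a plausible sketch of that step (the column and variable names are taken from the surrounding snippets, and dataframe_interactions is assumed to hold the raw interactions) is:

# Count how many distinct items each person interacted with.
users_interactions_count = dataframe_interactions \
    .groupby(['personId', 'contentId']).size() \
    .groupby('personId').size()

# Keep only users with at least 5 interactions.
users_with_enough_interactions = users_interactions_count[users_interactions_count >= 5] \
    .reset_index()[['personId']]

dataframe_interaction_from_selected_users = dataframe_interactions.merge(
    users_with_enough_interactions, how='right', on='personId')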
1. def preference_of_smooth_users(x):
2. return math.log(1+x, 2)
3.
4. dataframe_interaction_full = dataframe_interaction_from_selected_users \
5. .groupby(['personId', 'contentId'])['eventStrength'].sum() \
6. .apply(preference_of_smooth_users).reset_index()
7. print('Total Number of unique user/item interactions: %d' % len(dataframe_interaction_full))
8. dataframe_interaction_full.head(10)
Output:
Evaluation
Evaluation is crucial for machine learning projects because it enables objective comparison of
various methods and model hyperparameter selections.
Making sure the trained model generalizes to data it was not trained on, using cross-validation procedures, is a crucial component of assessment. Here, we employ a straightforward cross-validation technique known as a holdout, in which a random data sample (in this case, 20%) is set aside during training and used only for assessment. All of the assessment metrics here were calculated using this test set.
A more reliable assessment strategy would involve dividing the train and test sets according to a reference date, with the train set made up of all interactions occurring before that date and the test set consisting of interactions occurring after it. For the sake of simplicity, we use the first (random) strategy here, but you might want to try the second approach to more accurately replicate how the recsys would behave in production when predicting interactions from "future" users.
There are a number of metrics that are frequently used for assessment in recommender systems.
We decided to employ Top-N accuracy measures, which assess the precision of the top
suggestions made to a user in comparison to the test set items with which the user has actually
interacted.
NDCG@N and MAP@N are two more well-liked ranking metrics whose computation of the
score takes into consideration the position of the pertinent item in the ranked list (max. value if
the relevant item is in the first position).
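The evaluation code below picks up partway through a ModelEvaluator class, so some of its context is missing. A plausible reconstruction of the missing opening is sketched here; the constant, the helper, and the dataframe names are inferred from the fragment that follows and may not match the original exactly.

EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100   # size of the random negative sample

def getting_items_interacted(person_id, interactions_df):
    # Set of contentIds the person has interacted with, in an interactions dataframe indexed by personId.
    interacted_items = interactions_df.loc[person_id]['contentId']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

class ModelEvaluator:

    def getting_not_interacted_samples(self, person_id, sample_size, seed=42):
        # 'dataframe_interaction_full_indexed' is an assumed name for the full
        # interaction set indexed by personId; the fragment below continues from here.
        items_interacted = getting_items_interacted(person_id, dataframe_interaction_full_indexed)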
9. items_all = set(dataframe_articles['contentId'])
10. items_not_interacted = items_all - items_interacted
11.
12. random.seed(seed)
13. sample_non_interacted_items = random.sample(items_not_interacted, sample_size)
14. return set(sample_non_interacted_items)
15.
16. def _to_verify_hit_top_n(self, item_id, items_recommended, topn):
17. try:
18. index = next(i for i, c in enumerate(items_recommended) if c == item_id)
19. except:
20. index = -1
21. hit = int(index in range(0, topn))
22. return hit, index
23.
24. def model_evaluation_for_users(self, model, person_id):
25. # Adding the test set's items.
26. interacted_testset_values = dataframe_interaction_text_indexed.loc[person_id]
27. if type(interacted_testset_values['contentId']) == pd.Series:
28. person_interacted_testset_items = set(interacted_testset_values['contentId'])
29. else:
30. person_interacted_testset_items = set([int(interacted_testset_values['contentId'])])
31. interated_testset_items_count = len(person_interacted_testset_items)
32.
33. # Obtaining a model's rated suggestion list for a certain user.
34. dataframe_person_recs = model.recommending_items(person_id,
35. items_to_ignore=getting_items_interacted(person_id,
36. dataframe_interaction_train_indexed
37. ),
38. topn=10000000000)
39.
40. hits_at_5_count = 0
41. hits_at_10_count = 0
42. # For each item with which the user engaged in the test set
43. for item_id in person_interacted_testset_items:
44. # Selecting 100 random things with which the user hasn't interacted
45. # (to represent items that are assumed to be not relevant to the user)
46. sample_non_interacted_items = self.getting_not_interacted_samples(person_id,
47. sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS,
48. seed=item_id%(2**32))
49.
50. # Combining the 100 random objects with the currently interacted item
51. items_to_filter_recs = sample_non_interacted_items.union(set([item_id]))
52.
53. # Recommendations are kept only if they are the interacted item or one of the 100 random non-interacted items.
54. dataframe_valid_recs = dataframe_person_recs[dataframe_person_recs['contentId'].isin(items_to_filter_recs)]
55. valid_recs_ = dataframe_valid_recs['contentId'].values
56. # Checking whether the currently interacted-with item is one of the Top-N suggested items.
57. hit_at_5, index_at_5 = self._to_verify_hit_top_n(item_id, valid_recs_, 5)
58. hits_at_5_count += hit_at_5
59. hit_at_10, index_at_10 = self._to_verify_hit_top_n(item_id, valid_recs_, 10)
60. hits_at_10_count += hit_at_10
61.
62. # Recall is the percentage of things that have been engaged with and are included among the Top-N suggested items
63. # when combined with a group of unrelated objects
64. recall_at_5 = hits_at_5_count / float(interated_testset_items_count)
65. recall_at_10 = hits_at_10_count / float(interated_testset_items_count)
66.
67. person_metrics = {'hits@5_count':hits_at_5_count,
68. 'hits@10_count':hits_at_10_count,
69. 'interacted_count': interated_testset_items_count,
70. 'recall@5': recall_at_5,
71. 'recall@10': recall_at_10}
72. return person_metrics
73.
74. def model_evaluation(self, model):
75. #print('Running evaluation for users')
76. people_metrics = []
77. for idx, person_id in enumerate(list(dataframe_interaction_text_indexed.index.unique().values)):
78. #if idx % 100 == 0 and idx > 0:
79. # print('%d users processed' % idx)
80. person_metrics = self.model_evaluation_for_users(model, person_id)
81. person_metrics['_person_id'] = person_id
82. people_metrics.append(person_metrics)
83. print('%d users processed' % idx)
84.
85. detailed_results_df = pd.DataFrame(people_metrics) \
86. .sort_values('interacted_count', ascending=False)
87.
88. global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
89. global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
90.
91. global_metrics = {'modelName': model.getting_model_name(),
92. 'recall@5': global_recall_at_5,
93. 'recall@10': global_recall_at_10}
94. return global_metrics, detailed_results_df
95.
96. model_evaluator = ModelEvaluator()
Popularity Model
The Popularity model is a typical baseline strategy that is usually hard to beat. This strategy is not personalized: it merely suggests to a user the most popular items that the user has not yet consumed. Because popularity accounts for the "wisdom of the crowd," it typically offers sound recommendations that are generally interesting to most people.
The main goal of a recommender system, which goes far beyond this straightforward method, is to surface long-tail items for users with very particular interests.
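The PopularityRecommender below expects a dataframe_item_popularity table that is not built in the snippets shown here; a minimal sketch of that step (assuming the dataframe_interaction_full table created earlier) is:

# Total interaction strength per item, most popular items first.
dataframe_item_popularity = dataframe_interaction_full \
    .groupby('contentId')['eventStrength'].sum() \
    .sort_values(ascending=False).reset_index()
dataframe_item_popularity.head(10)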
1. class PopularityRecommender:
2.
3. MODEL_NAME = 'Popularity'
4.
5. def __init__(self, dataframe_popularity, dataframe_items=None):
6. self.dataframe_popularity = dataframe_popularity
7. self.dataframe_items = dataframe_items
8.
9. def getting_model_name(self):
10. return self.MODEL_NAME
11.
12. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
13. # Suggest the most well-liked products that the consumer hasn't yet viewed.
14. dataframe_recommendation = self.dataframe_popularity[~self.dataframe_popularity['contentId'].isin(items_to_ignore)] \
15. .sort_values('eventStrength', ascending = False) \
16. .head(topn)
17.
18. if verbose:
19. if self.dataframe_items is None:
20. raise Exception('"dataframe_items" is required in verbose mode')
21.
22. dataframe_recommendation = dataframe_recommendation.merge(self.dataframe_items, how = 'left',
23. left_on = 'contentId',
24. right_on = 'contentId')[['eventStrength', 'contentId', 'title', 'url', 'lang']]
25.
26.
27. return dataframe_recommendation
28.
29. popularity_model = PopularityRecommender(dataframe_item_popularity, dataframe_articles)
Here, using the above-described methodology, we evaluate the Popularity model.
It had a Recall@5 of 0.2417, which means that the Popularity model placed around 24% of the test set's interacted items among its top 5 recommendations (from lists with 100 random items). As expected, Recall@10 was significantly higher (37%). It may be surprising how well popularity models can typically perform.
Content-Based Filtering Model
Content-based filtering techniques use the descriptions or attributes of the items with which the user has interacted to suggest related items. This approach helps avoid the cold-start problem, since it relies only on the user's own prior choices. For text-based items such as books, articles, and news stories, it is straightforward to build item profiles and user profiles from the raw text.
In this case, we are employing TF-IDF, a very popular technique from information retrieval (search engines).
Using this method, unstructured text is transformed into a vector, where each word is represented by a position in the vector and the value measures how relevant that word is for the article. Since all items are represented in the same vector space model, it is possible to compare articles directly.
1. # Avoiding stopwords (words without meaning) in Portuguese and English (as we have a corpus with mixed languages)
2. stopwords_list = stopwords.words('english') + stopwords.words('portuguese')
3.
4. # Trains a vectorizer whose vocabulary is made up of the 5000 most common unigrams and bigrams in the corpus, excluding stopwords.
5. vectorizer = TfidfVectorizer(analyzer='word',
6. ngram_range=(1, 2),
7. min_df=0.003,
8. max_df=0.5,
9. max_features=5000,
10. stop_words=stopwords_list)
11.
12. item_ids = dataframe_articles['contentId'].tolist()
13. tfidf_matrix = vectorizer.fit_transform(dataframe_articles['title'] + "" + dataframe_articles['text'])
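The profile-inspection code further below also uses a tfidf_feature_names list that is not defined in the snippets shown here; one way to obtain it (assuming scikit-learn 1.0 or later) is:

# Vocabulary learned by the vectorizer, in the same order as the vector positions.
tfidf_feature_names = vectorizer.get_feature_names_out()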
We average all the item profiles with which the user has engaged in order to model the user
profile. The final user profile will give more weight to the articles on which the user has
interacted the most (e.g., liked or commented), with the average being weighted by the strength
of the interactions.
1. def getting_item_profiles(item_id):
2. idx = item_ids.index(item_id)
3. profile_item = tfidf_matrix[idx:idx+1]
4. return profile_item
5.
6. def getting_item_profiless(ids):
7. list_profiles_item = [getting_item_profiles(x) for x in ids]
8. profile_items = scipy.sparse.vstack(list_profiles_item)
9. return profile_items
10.
11. def building_user_profiles(person_id, dataframe_interaction_indexed):
12. dataframe_interactions_person = dataframe_interaction_indexed.loc[person_id]
13. profiles_user_items = getting_item_profiless(dataframe_interactions_person['contentId'])
14.
15. user_item_strengths = np.array(dataframe_interactions_person['eventStrength']).reshape(-1,1)
16. # Weighted average of the item profiles by the intensity of the interactions
17. user_item_strengths_weighted_avg = np.sum(profiles_user_items.multiply(user_item_strengths), axis=0) / np.sum(user_item_strengths)
18. user_profile_norm = sklearn.preprocessing.normalize(user_item_strengths_weighted_avg)
19. return user_profile_norm
20.
21. def build_users_profiles():
22. dataframe_interaction_indexed = dataframe_interaction_train[dataframe_interaction_train['contentId'] \
23. .isin(dataframe_articles['contentId'])].set_index('personId')
24. profiles_user = {}
25. for person_id in dataframe_interaction_indexed.index.unique():
26. profiles_user[person_id] = building_user_profiles(person_id, dataframe_interaction_indexed)
27. return profiles_user
28.
29. profiles_users = build_users_profiles()
30. len(profiles_users)
Let's look at one profile first. It is a unit vector of length 5000, where the value at each position indicates how relevant a token (a unigram or bigram) is for that user.
Looking at the profile below, the most relevant tokens reflect professional interests in machine learning, deep learning, artificial intelligence, and the Google Cloud Platform, so we may anticipate some solid recommendations here.
1. my_profile = profiles_users[-1479311724257856983]
2. print(my_profile.shape)
3. pd.DataFrame(sorted(zip(tfidf_feature_names,
4. profiles_users[-1479311724257856983].flatten().tolist()), key=lambda x: -x[1])[:20],
5. columns=['token', 'relevance'])
Output:
1. class ContentBasedRecommender:
2.
3. MODEL_NAME = 'Content-Based'
4.
5. def __init__(self, items_df=None):
6. self.item_ids = item_ids
7. self.items_df = items_df
8.
9. def getting_model_name(self):
10. return self.MODEL_NAME
11.
12. def _getting_similar_items_to_the_users(self, person_id, topn=1000):
13. # The user profile and all object profiles are compared using the cosine similarity formula.
14. cosine_similarities = cosine_similarity(profiles_users[person_id], tfidf_matrix)
15. # Gets the most comparable products.
16. similar_indices = cosine_similarities.argsort().flatten()[-topn:]
17. # Sort comparable objects according to similarity.
18. similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
19. return similar_items
20.
21. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
22. similar_items = self._getting_similar_items_to_the_users(user_id)
23. # Ignores items with which the user has previously interacted
24. similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
25.
26. dataframe_recommendations = pd.DataFrame(similar_items_filtered, columns=['contentId', 'recStrength']) \
27. .head(topn)
28.
29. if verbose:
30. if self.items_df is None:
31. raise Exception('"items_df" is required in verbose mode')
32.
33. dataframe_recommendations = dataframe_recommendations.merge(self.items_df, how = 'left',
34. left_on = 'contentId',
35. right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]
36.
37.
38. return dataframe_recommendations
39.
40. content_based_recommender_model = ContentBasedRecommender(dataframe_articles)
With the personalized recommendations of the content-based filtering model, we have a Recall@5 of about 0.162, which indicates that around 16% of the test set's interacted items were listed by this model among its top 5 items (from lists with 100 random items). Recall@10 was about 0.261 (26%). The fact that the content-based model performed worse than the Popularity model suggests that users may not be as committed to reading content that is highly similar to what they have already read.
Collaborative Filtering Model
Collaborative filtering leverages the behaviour of many users instead of item content, and it has two main implementation strategies:
o Memory-based: This method computes user similarities based on the items with which they have engaged (user-based approach) or computes item similarities based on the users who have interacted with the items (item-based approach).
User Neighbourhood-based CF is a common example of this strategy, in which the top N most similar users (typically determined using Pearson correlation) are selected for a user and used to suggest items that those similar users liked but with which the current user has not yet interacted. Although this strategy is relatively easy to implement, it often does not scale well to many users. Crab offers a good Python implementation of this strategy; a from-scratch sketch of the user-based idea appears after the next bullet.
o Model-based: In this method, models are created by utilizing various machine learning
algorithms to make product recommendations to customers. Numerous model-based CF
techniques exist, including probabilistic latent semantic analysis, neural networks,
bayesian networks, clustering models, and latent component models like Singular Value
Decomposition (SVD).
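As mentioned in the memory-based bullet above, here is a from-scratch sketch of the user-based neighbourhood idea (it is not the Crab library, and it uses cosine similarity rather than Pearson correlation for brevity); it assumes the dataframe_interaction_full table built earlier.

from sklearn.metrics.pairwise import cosine_similarity

# User-item matrix: one row per user, one column per item, interaction strength as values.
user_item = dataframe_interaction_full.pivot_table(index='personId', columns='contentId',
                                                   values='eventStrength', fill_value=0)

# User-user similarity matrix.
user_similarity = pd.DataFrame(cosine_similarity(user_item),
                               index=user_item.index, columns=user_item.index)

some_user = user_item.index[0]
print(user_similarity[some_user].drop(some_user).nlargest(5))   # top-5 most similar users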
Matrix Factorisation
User-item matrices are condensed into a low-dimensional form using latent component models.
This method has the benefit of working with a much smaller matrix in a lower-dimensional space
rather than a high-dimensional matrix with a large number of missing values.
Both the user-based and item-based neighbourhood algorithms described in the preceding section could be used with this reduced representation. This paradigm has a number of benefits: compared to memory-based approaches, it handles the sparsity of the original matrix better, and it is much easier to compare similarities in the reduced matrix, especially when working with sizable sparse datasets.
Here, we employ Singular Value Decomposition (SVD), a well-known latent factor model. You might also use other, more CF-specific matrix factorization frameworks such as surprise, mrec, or python-recsys. We chose a SciPy implementation of SVD because Kaggle kernels support it.
The choice of how many factors to use in the user-item matrix factorization is crucial. The more factors there are, the more exact the reconstruction of the original matrix is. As a result, if the model is permitted to retain too many details of the original matrix, it may struggle to generalize to data that was not used for training. Reducing the number of factors increases the generality of the model.
1. # Make a sparse pivot table with columns for the products and rows for the users
2. dataframe_users_items_pivot_matrix = dataframe_interaction_train.pivot(index='personId',
3. columns='contentId',
4. values='eventStrength').fillna(0)
5.
6. dataframe_users_items_pivot_matrix.head(10)
Output:
1. pivot_matrix_users_items = dataframe_users_items_pivot_matrix.to_numpy()
2. pivot_matrix_users_items[:10]
Output:
1. users_ids = list(dataframe_users_items_pivot_matrix.index)
2. users_ids[:10]
Output:
1. pivot_sparse_matrix_users_items = csr_matrix(pivot_matrix_users_items)
2. pivot_sparse_matrix_users_items
Output:
1. # The number of factors to be applied to the user-item matrix
2. Number_of_factor = 15
3. # matrix factorization of the initial user-item matrix is carried out
4. # U, sigma, Vt = svds(users_items_pivot_matrix, k = Number_of_factor)
5. U, sigma, Vt = svds(pivot_sparse_matrix_users_items, k = Number_of_factor)
6.
7. U.shape
Output:
1. Vt.shape
Output:
1. sigma = np.diag(sigma)
2. sigma.shape
After factorization, we attempt to rebuild the original matrix by multiplying its factors back together. The resulting matrix is no longer sparse. We will use the predictions for items with which the user has not yet interacted to produce recommendations, as sketched below.
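The normalization step below uses a predicted_ratings_all_users matrix; reconstructing it from the SVD factors is a single multiplication (the variable name is chosen to match the snippet that follows):

# Rebuild the (now dense) user-item matrix from the factors U, sigma and Vt.
predicted_ratings_all_users = np.dot(np.dot(U, sigma), Vt)
predicted_ratings_all_users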
1. predicted_ratings_norm_all_users = (predicted_ratings_all_users -
predicted_ratings_all_users.min()) / (predicted_ratings_all_users.max() -
predicted_ratings_all_users.min())
2.
3. # the process of returning the rebuilt matrix to a Pandas dataframe.
4. dataframe_cf_preds = pd.DataFrame(predicted_ratings_norm_all_users, columns = dataframe_users_items_pivot_matrix.columns, index=users_ids).transpose()
5. dataframe_cf_preds.head(10)
Output:
1. len(dataframe_cf_preds.columns)
Output:
1. class CFRecommender:
2.
3. MODEL_NAME = 'Collaborative Filtering'
4.
5. def __init__(self, dataframe_cf_predictions, items_df=None):
6. self.dataframe_cf_predictions = dataframe_cf_predictions
7. self.items_df = items_df
8.
9. def getting_model_name(self):
10. return self.MODEL_NAME
11.
12. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
13. # Obtain and arrange user predictions
14. predictions_sorted_users = self.dataframe_cf_predictions[user_id].sort_values(ascending=False) \
15. .reset_index().rename(columns={user_id: 'recStrength'})
16.
17. # Recommend the items with the highest predicted strength that the user hasn't yet interacted with.
18. dataframe_recommendations = predictions_sorted_users[~predictions_sorted_users['contentId'].isin(items_to_ignore)] \
19. .sort_values('recStrength', ascending = False) \
20. .head(topn)
21.
22. if verbose:
23. if self.items_df is None:
24. raise Exception('"items_df" is required in verbose mode')
25.
26. dataframe_recommendations = dataframe_recommendations.merge(self.items_df, how = 'left',
27. left_on = 'contentId',
28. right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]
29.
30.
31. return dataframe_recommendations
32.
33. cf_recommender_model = CFRecommender(dataframe_cf_preds, dataframe_articles)
Recall@5 (33%) and Recall@10 (46%) values were obtained while evaluating the Collaborative Filtering model (SVD matrix factorization), which are much higher than those of the Popularity model and the Content-Based model.
Hybrid Recommender
Let's create a straightforward hybridization technique that ranks items based on the weighted
average of the normalized CF and Content-Based scores. The weights for the CF and CB models
in this instance are 100.0 and 1.0, respectively, because the CF model is significantly more
accurate than the CB model.
1. class HybridRecommender:
2.
3. MODEL_NAME = 'Hybrid'
4.
5. def __init__(self, model_cb_rec, model_cf_rec, dataframe_items, weight_cb_ensemble=1.0, weight_cf_ensemble=1.0):
6. self.model_cb_rec = model_cb_rec
7. self.model_cf_rec = model_cf_rec
8. self.weight_cb_ensemble = weight_cb_ensemble
9. self.weight_cf_ensemble = weight_cf_ensemble
10. self.dataframe_items = dataframe_items
11.
12. def getting_model_name(self):
13. return self.MODEL_NAME
14.
15. def recommending_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
16. # Obtaining the top 1000 suggestions for content-based filtering
17. dataframe_cb_recs = self.model_cb_rec.recommending_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose,
18. topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
19.
20. # Obtaining the top 1000 suggestions via collaborative filtering
21. dataframe_cf_recs = self.model_cf_rec.recommending_items(user_id, items_to_ignore=items_to_ignore, verbose=verbose,
22. topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
23.
24. # putting the outcomes together by contentId
25. dataframe_recs = dataframe_cb_recs.merge(dataframe_cf_recs,
26. how = 'outer',
27. left_on = 'contentId',
28. right_on = 'contentId').fillna(0.0)
29.
30. # Using the CF and CB scores to create a hybrid recommendation score
31. # dataframe_recs['recStrengthHybrid'] = dataframe_recs['recStrengthCB'] * dataframe_recs['recStrengthCF']
32. dataframe_recs['recStrengthHybrid'] = (dataframe_recs['recStrengthCB'] * self.weight_cb_ensemble) \
33. + (dataframe_recs['recStrengthCF'] * self.weight_cf_ensemble)
34.
35. # Sorting advice based on hybrid score
36. recommendations_df = dataframe_recs.sort_values('recStrengthHybrid', ascending=False).head(topn)
37.
38. if verbose:
39. if self.dataframe_items is None:
40. raise Exception('"dataframe_items" is required in verbose mode')
41.
42. recommendations_df = recommendations_df.merge(self.dataframe_items, how = 'left',
43. left_on = 'contentId',
44. right_on = 'contentId')[['recStrengthHybrid', 'contentId', 'title', 'url', 'lang']]
45.
46.
47. return recommendations_df
48.
49. hybrid_recommender_model = HybridRecommender(content_based_recommender_model, cf_recommender_model, dataframe_articles,
50. weight_cb_ensemble=1.0, weight_cf_ensemble=100.0)
51.
52. print('Evaluating Hybrid model...')
53. metrics_hybrid_global, dataframe_hybrid_detailed_results = model_evaluator.model_evaluation(hybrid_recommender_model)
54. print('\nGlobal metrics:\n%s' % metrics_hybrid_global)
55. dataframe_hybrid_detailed_results.head(10)
Output:
Now for better understanding, we can also plot the graph for the comparison of the models.
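The plot below assumes a dataframe_global_metrics table collecting the global metrics of every model; a sketch of how it can be assembled (the names metrics_pop_global, metrics_cb_global, and metrics_cf_global are assumed to come from evaluating the earlier models the same way the hybrid model was evaluated) is:

dataframe_global_metrics = pd.DataFrame([metrics_pop_global, metrics_cb_global,
                                         metrics_cf_global, metrics_hybrid_global]) \
    .set_index('modelName')
dataframe_global_metrics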
1. %matplotlib inline
2. ax = dataframe_global_metrics.transpose().plot(kind='bar', figsize=(15,8))
3. for p in ax.patches:
4. ax.annotate("%.3f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
Output:
TESTING
Now, we will test the best model, the hybrid recommender, for a particular user.
1. hybrid_recommender_model.recommending_items(-1479311724257856983, topn=20, verbose=True)
Comparing the recommendations from the hybrid model with the user's actual interests, we find that they are quite similar.