BIGDATA ANALYTICS USING R UNIT - I
Introduction to Big Data: Data, Classification of Big Data - Structured Data, Un-structured
Data, Semi-structured Data, Characteristics of Big Data, Evolution of Big Data, Definition and
Challenges of Big Data, What is Big Data and Why to use Big Data?, Business Intelligence Vs
Big Data.
Data: Data is a group of characters or symbols on which operations are performed by a computer. It
is stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
Big Data is a term used to describe a collection of data that is huge in size and is growing
exponentially within a short period of time. Big data is so large and complex that traditional
data management tools cannot store or process it efficiently. Common examples of
Big Data are as follows:
1. The New York Stock Exchange generates about one terabyte of new trade data per
day.
2. The social media site Facebook adds 500+ terabytes of new data every day in the form of photo and
video uploads, message exchanges, comments, etc.
3. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousands of flights per day, the data generated reaches many petabytes.
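The jet-engine figure above can be turned into a rough back-of-the-envelope estimate. The sketch below uses decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes). The flight count, and the simplification that each flight contributes a single 30-minute window from one engine, are illustrative assumptions and not figures from these notes.

```python
# Back-of-the-envelope estimate using decimal (SI) units.
TB = 10**12  # bytes in one terabyte
PB = 10**15  # bytes in one petabyte

# "10+ terabytes in 30 minutes" from the text, taken as exactly 10 TB.
BYTES_PER_ENGINE_WINDOW = 10 * TB


def daily_volume_pb(flights_per_day, windows_per_flight=1):
    """Return the estimated daily data volume in petabytes.

    Both arguments are hypothetical inputs: one "window" means
    30 minutes of data from a single engine.
    """
    total_bytes = flights_per_day * windows_per_flight * BYTES_PER_ENGINE_WINDOW
    return total_bytes / PB


if __name__ == "__main__":
    # Assumed example: 5,000 flights per day, one window each.
    print(daily_volume_pb(5000), "PB per day")
```

Under these assumptions, 100 such windows already add up to one petabyte, which is why fleet-wide volumes are quoted in petabytes.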
Classification of Big Data: Big data is found in three forms. They are
1. Structured data
2. Un-structured data
3. Semi-structured data
Structured data: Any data that can be stored, accessed, and processed in a fixed format is termed
structured data. An Employee table in a database is an example of structured data. This type of
data is typically stored in an SQL database in tabular format. Much of the data handled by conventional
business applications is structured, and it is the simplest kind of data to manage.
Employee_ID  Employee_Name   Gender  Department  Salary
101          Venkataramudu   Male    Finance     650000
102          Pushpavathamma  Female  Admin       650000
103          Balaswamy       Male    Admin       500000
104          Rangaswamy      Male    Finance     500000
105          Ushenamma       Female  Finance     550000
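Because structured data has a fixed format, it can be loaded into an SQL table and queried directly. The following is a minimal sketch using Python's built-in `sqlite3` module with the Employee rows from the table above. The lower-case column names and the choice of query (average salary per department) are illustrative assumptions.

```python
import sqlite3

# Rows copied from the Employee table in the notes.
EMPLOYEES = [
    (101, "Venkataramudu", "Male", "Finance", 650000),
    (102, "Pushpavathamma", "Female", "Admin", 650000),
    (103, "Balaswamy", "Male", "Admin", 500000),
    (104, "Rangaswamy", "Male", "Finance", 500000),
    (105, "Ushenamma", "Female", "Finance", 550000),
]


def average_salary_by_department(rows):
    """Load the rows into an in-memory SQL table and aggregate per department."""
    conn = sqlite3.connect(":memory:")
    try:
        # Column names are assumptions chosen for this sketch.
        conn.execute(
            "CREATE TABLE employee ("
            "employee_id INTEGER PRIMARY KEY, "
            "employee_name TEXT, gender TEXT, department TEXT, salary INTEGER)"
        )
        conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT department, AVG(salary) FROM employee "
            "GROUP BY department ORDER BY department"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for department, avg_salary in average_salary_by_department(EMPLOYEES):
        print(department, round(avg_salary, 2))
```

The fixed schema is what makes a one-line aggregate query possible. The same question is much harder to ask of raw text or images.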
Un-structured data: Any data whose form or structure is unknown is termed unstructured data. The
results returned by a Google search are an example of unstructured data. Other examples of unstructured
data are Twitter messages, Facebook posts, log files, and emails.
A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data
available to them but, unfortunately, they do not know how to derive value from it because the data is
in its raw, unstructured form.
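One common way to start deriving value from raw text such as log files is to extract a few fields with a pattern. The sketch below assumes a hypothetical log layout (`date time LEVEL message`) and uses made-up sample lines. Real logs vary widely and would need their own pattern.

```python
import re
from collections import Counter

# Hypothetical log layout: "<date> <time> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

# Made-up sample data for illustration only.
SAMPLE_LOG = """\
2024-01-05 10:15:02 INFO user login succeeded
2024-01-05 10:15:09 ERROR payment gateway timeout
free text that does not follow the layout
2024-01-05 10:16:41 ERROR payment gateway timeout
"""


def parse_log(text):
    """Return one dict per line that matches the assumed layout; skip the rest."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records


def count_levels(records):
    """Count how many records were logged at each level."""
    return Counter(record["level"] for record in records)


if __name__ == "__main__":
    parsed = parse_log(SAMPLE_LOG)
    print(len(parsed), "lines parsed")
    print(dict(count_levels(parsed)))
```

Lines that do not match are skipped rather than guessed at, which is typical of a first pass over unstructured input.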
Semi-structured data: Semi-structured data can contain both forms of data. It appears structured,
but its structure is not defined by a table definition as in a relational DBMS. An example
of semi-structured data is the personal data of employees stored in an XML file.
Example of semi-structured data: personal data stored in an XML file
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J</name><sex>Male</sex><age>35</age></rec>
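Semi-structured records like these carry their own field names in tags, so a program can read them without a table definition. The sketch below parses the five records with Python's standard `xml.etree.ElementTree` module. It uses matching `<sex>...</sex>` tags, since an XML parser rejects mismatched tags, and it wraps the records in a `<records>` root element. The root element is an assumption added here because an XML document needs exactly one top-level element.

```python
import xml.etree.ElementTree as ET

# Records from the notes, wrapped in an assumed <records> root element.
XML_TEXT = """
<records>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
  <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
  <rec><name>Jeremiah J</name><sex>Male</sex><age>35</age></rec>
</records>
"""


def load_records(xml_text):
    """Parse the XML text and return one dict per <rec> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "name": rec.findtext("name"),
            "sex": rec.findtext("sex"),
            "age": int(rec.findtext("age")),  # ages are stored as text in XML
        }
        for rec in root.findall("rec")
    ]


if __name__ == "__main__":
    people = load_records(XML_TEXT)
    average_age = sum(person["age"] for person in people) / len(people)
    print(len(people), "records, average age", average_age)
```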
Characteristics of Big Data (OR) Four V’s in Big Data: The four V’s are the characteristics that
differentiate big data from ordinary data. They are:
Volume: Whether a particular data set can be considered Big Data depends on its volume.
Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big
Data'.
Variety: Data is called Big Data when it is collected from heterogeneous sources and its nature
can be both structured and unstructured. Hence, 'Variety' is another characteristic which
needs to be considered while dealing with 'Big Data'.
Velocity: Data is called Big Data when it is generated at high speed. Hence, 'Velocity'
is another characteristic which needs to be considered while dealing with 'Big Data'.
Variability: Variability refers to inconsistencies in either the usage or the flow of big data. A coffee
shop may offer six different blends of coffee, but if you get the same blend every day and it tastes
different every day, that is variability.
Evolution of big data: The 1970s and earlier were the era of mainframes, when data was essentially
primitive. Relational databases evolved in the 1980s and 1990s, the era of data-intensive
applications. Since then, the World Wide Web (WWW) and the Internet of Things (IoT) have created a flood of
different kinds of data such as structured, unstructured, and multimedia data.
Krishnaveni Degree College :: NarasaraopetPage No. : 2
Definition of Big data: Big data refers to data that is so large and complex that traditional methods
of collection and analysis cannot handle it. The amount and variety of big data have increased
exponentially over the past decade. Normally we work with data measured in megabytes (Word
documents, Excel sheets) or at most gigabytes (movies, code), whereas data measured in petabytes
(1 petabyte = 10^15 bytes) is called Big Data. It is often stated that almost 90% of today's data has been
generated in the past 3 years.
Sources of Big Data: Big data comes from many sources, such as:
Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on
a day-to-day basis because they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and
processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly, and for this they store the data of their millions of users.
Share market: Stock exchanges across the world generate huge amounts of data through their daily
transactions.
Challenges of big data: The challenges in Big Data are the real implementation hurdles. Some of
the Big Data challenges are:
Sharing and Accessing Data: Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources. Sharing data can cause substantial challenges,
including the need for inter- and intra-institutional legal documents. Accessing data from public
repositories leads to multiple difficulties. Data must be available in an accurate,
complete, and timely manner, because only then can the data in a company’s information system be
used to make accurate decisions in time.
Privacy and Security: This is another very important challenge with Big Data. It has
sensitive, conceptual, technical, as well as legal significance. Most organizations are unable to
maintain regular checks because of the large amounts of data generated. However, security checks
and monitoring should be performed in real time, because that is when they are most beneficial. Some
information about a person, when combined with large external data sets, may reveal facts that the
person considers private and would not want the data owner to know.
Some organizations collect information about people in order to add value to their
business. They do this by deriving insights into people's lives that those people are unaware of.
Analytical Challenges: Big data poses some huge analytical challenges, which raise questions such
as: How should a problem be handled if the data volume gets too large? How can the important data
points be found? How can the data be used to best advantage? The large amounts of data on
which this type of analysis is to be done can be structured (organized data), semi-structured
(semi-organized data), or unstructured (unorganized data). There are two techniques through which decision
making can be done:
1. Incorporate massive data volumes in the analysis.
2. Decide at an early stage which big data is relevant.
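The two techniques above can be contrasted with a small sketch. The first function folds every value into the result one at a time, so the full data set never has to fit in memory. The second discards irrelevant values early and analyses only the rest. The function names and the relevance rule are hypothetical, chosen only to illustrate the idea.

```python
def mean_over_all(values):
    """Technique 1: include every value, processed as a stream."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    # An empty stream has no mean.
    return total / count if count else None


def mean_over_relevant(values, is_relevant):
    """Technique 2: filter out irrelevant values early, then analyse the rest."""
    return mean_over_all(value for value in values if is_relevant(value))


if __name__ == "__main__":
    # Made-up readings; the 50-unit cut-off is an arbitrary relevance rule.
    readings = [12, 15, 11, 900, 14]
    print(mean_over_all(readings))
    print(mean_over_relevant(readings, lambda reading: reading < 50))
```

Both functions accept any iterable, including a generator that reads records from disk, which is what keeps memory use constant as the volume grows.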
Technical Challenges:
Quality of data:
1. When a large amount of data is collected, storing it becomes more costly.
Big companies, business leaders, and IT leaders always want large data storage.
2. For better results and conclusions, big data should focus on storing quality data rather than
irrelevant data.
3. This raises further questions: how can it be ensured that the data is relevant, how much data
is enough for decision making, and is the stored data accurate?
Fault Tolerance:
1. Fault tolerance is the ability of a system to continue operating properly in the event of a
failure of some of its components. Fault tolerance is a technical challenge and involves
complex algorithms.
2. Nowadays, technologies like cloud computing and big data are designed to minimize damage
when failures happen, so tasks don't have to restart from the beginning.
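One simple way to avoid restarting from the beginning is checkpointing: progress is recorded after each completed step, and a rerun resumes from the last recorded position. The sketch below is a single-machine illustration of that idea, not how any particular big data framework implements it. The file name, function names, and the simulated failure are all assumptions, and a production version would also need to write the checkpoint atomically.

```python
import json
import os
import tempfile


def process_with_checkpoint(items, checkpoint_path, handle):
    """Process items in order, saving progress so a rerun resumes after a crash.

    Returns the index the run started from (0 on a fresh start).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            start = json.load(fh)["next_index"]
    for index in range(start, len(items)):
        handle(items[index])  # may raise; earlier progress is already saved
        with open(checkpoint_path, "w") as fh:
            json.dump({"next_index": index + 1}, fh)
    return start


def run_demo():
    """Simulate one failure on item 3, then rerun and resume from the checkpoint."""
    checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    processed = []
    state = {"already_failed": False}

    def flaky_handler(item):
        if item == 3 and not state["already_failed"]:
            state["already_failed"] = True
            raise RuntimeError("simulated component failure")
        processed.append(item)

    items = [1, 2, 3, 4]
    try:
        process_with_checkpoint(items, checkpoint_path, flaky_handler)
    except RuntimeError:
        pass  # the first run stops at item 3
    resumed_from = process_with_checkpoint(items, checkpoint_path, flaky_handler)
    return processed, resumed_from


if __name__ == "__main__":
    print(run_demo())
```

In the demo, items 1 and 2 are handled only once: the second run starts at index 2 instead of index 0.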
Scalability:
1. Big data projects can grow rapidly, and cloud computing helps to handle the resulting scalability issues.
2. Scalability raises challenges such as how to run and execute various jobs so that the goal of each
workload is achieved cost-effectively.
3. It also requires dealing with system failures efficiently, which again raises the
question of what kinds of storage devices should be used.
Why use big data: Big data analytics helps organizations work with their data efficiently and use
that data to identify new opportunities. Different techniques and algorithms can be applied to make
predictions from data.
Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring significant
cost advantages when it comes to storing large amounts of data.
Faster, better decision making: With the speed of Hadoop and in-memory analytics, combined with
the ability to analyze new sources of data, businesses are able to analyze information immediately
and make decisions based on what they’ve learned.
New products and services: With the ability to know customer needs and satisfaction through
analytics comes the power to give customers what they want.
Real-time Benefits of Big Data Analytics: Big Data analytics is very flexible and can be
applied to many other fields as well. The wide use of big data has brought enormous growth in
multiple industries. Some of them are:
• Banking
• Technology
• Consumer
• Manufacturing
In the banking sector especially, big data tools have been integrated with existing systems. Multiple
operations can be performed on transactional data; moreover, tools like Apache Hive let users
query their data and get results in a very short period of time.
The use of big data has also increased in the educational sector. Data analytics opens up new options
for research and analysis. The insights provided by big data analytics tools help in
understanding the needs of customers better.
Business Intelligence Vs Big Data.

Basis of         Business Intelligence                     Big Data
comparison

Purpose          Helps business people make better         Captures, processes, and analyses both
                 decisions.                                structured and unstructured data to
                                                           improve customer outcomes.

Characteristics  Executive dashboards, “what if”           Described by characteristics such as
(properties)     analysis, interactive reports, a          volume, variety, variability, and
                 metadata layer, and ranking reports.      velocity.

Tools            Let business people collate, analyse,     Store large amounts of data and process
                 and visualize data. Examples: OLAP,       it to extract insights for good business
                 data warehousing, Power BI, Google        decisions. Examples: Hadoop, Spark,
                 Analytics, etc.                           Hive, Cloudera, etc.

Benefits         1. Better business decisions              1. Better decision making
                 2. Faster and more accurate reporting     2. Fraud detection
                    and analysis                           3. Storage, mining, and analysis of data
                 3. Improved data quality                  4. Market prediction and forecasting
                 4. Increased revenue                      5. Cost savings
                 5. Improved operational efficiency

Applications     Social media, healthcare, gaming          Banking, entertainment and social media,
                 industry, food industry, etc.             health care, retail and wholesale, etc.