Big Data Analytics

What is big data?

Big data is a combination of unstructured, semi-structured or structured data collected by organizations. These data sets can be mined to gain insights and used in machine learning projects, predictive modeling and other advanced analytics applications.
Big data can be used to improve operations, provide better customer service
and create personalized marketing campaigns -- all of which can increase value
for an organization. As an example, big data analytics can provide companies
with valuable insights into their customers that can then be used to refine
marketing techniques to increase customer engagement and conversion rates.
1. Structured Data: This is data which is in an organized form, for example in rows and columns. The number of rows is called the Cardinality and the number of columns is called the Degree of a relation. Sources: databases, spreadsheets, OLTP systems.
Working with structured data:
- Storage: Data types, both built-in and user-defined, help with the storage of structured data.
- Update/delete: Updating, deleting, etc. are easy due to the structured form.
- Security: Can be provided easily in an RDBMS.
- Indexing/Searching: Data can be indexed not only on a text string but on other attributes as well. This enables streamlined search.
- Scalability (horizontal/vertical): Scalability is generally not an issue as data grows, since resources can be increased easily.
- Transaction processing: ACID properties (Atomicity, Consistency, Isolation, Durability) are supported.
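To make cardinality and degree concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and its values are invented for illustration.

```python
import sqlite3

# In-memory relational database: structured data in rows and columns
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Degree of this relation = 3 (number of columns)
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks REAL)")
cur.executemany(
    "INSERT INTO students (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 82.5), (2, "Ravi", 74.0), (3, "Meena", 91.0)],
)

# Cardinality = number of rows currently in the relation
cur.execute("SELECT COUNT(*) FROM students")
print("Cardinality:", cur.fetchone()[0])  # -> 3

# Indexing on a non-key attribute enables streamlined search
cur.execute("CREATE INDEX idx_marks ON students (marks)")
cur.execute("SELECT name FROM students WHERE marks > 80")
print(cur.fetchall())
conn.close()
```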
2. Semi-Structured Data: This is data which doesn't conform to a data model but has some structure. Metadata for this data is available but is not sufficient. Sources: XML, JSON, E-mail.
Characteristics:
- Inconsistent structure.
- Self-describing (label/value pairs).
- Schema information is blended with the data values.
- Data objects may have different attributes that are not known beforehand.
Challenges:
- Storage cost: Storing data together with their schemas increases cost.
- RDBMS: Semi-structured data cannot be stored in existing RDBMSs, as the data cannot be mapped into tables directly.
- Irregular and partial structure: Some data elements may have extra information while others have none at all.
- Implicit structure: In many cases the structure is implicit, so interpreting relationships and correlations is very difficult.
- Flat files: Semi-structured data is usually stored in flat files, which are difficult to index and search.
- Heterogeneous sources: Data comes from varied sources, which is difficult to tag and search.
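To see the "self-describing, varying attributes" idea concretely, here is a minimal sketch using Python's standard json module; the two records are invented for illustration.

```python
import json

# Two semi-structured records: label/value pairs, but the second record
# carries an attribute ("phone") the first one never declares.
raw = '''
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "email": "ravi@example.com", "phone": "+91-98xxxxxx"}
]
'''
records = json.loads(raw)

# Schema information is blended with the data values themselves,
# so we discover each record's attributes only at read time.
for rec in records:
    print(sorted(rec.keys()))
# ['email', 'name']
# ['email', 'name', 'phone']
```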
3. Unstructured Data: This is data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80-90% of an organization's data is in this format.
Sources: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.
Characteristics:
 Does not conform to any data model
 Can't be stored in the form of rows and columns
 Not in any particular format or sequence
 Not easily usable by a program
 Doesn't follow any rules or semantics
Challenges:
 Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. consume huge amounts of storage space.
 Scalability: Scalability becomes an issue as unstructured data grows.
 Retrieving information: Retrieving and recovering unstructured data is cumbersome.
 Security: Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages).
 Update/delete: Updating, deleting, etc. are not easy due to the unstructured form.
 Indexing and searching: Indexing becomes difficult as the data grows, and searching is difficult for non-text data.
 Interpretation: Unstructured data is not easily interpreted by conventional search algorithms.
 Tagging: As the data grows, it is not possible to tag it manually.
 Indexing algorithms: Designing algorithms that understand the meaning of a document and then tag or index it accordingly is difficult.
Dealing with unstructured data:
 Data Mining: Knowledge discovery in databases. Popular mining algorithms are association rule mining, regression analysis, and collaborative filtering.
 Natural Language Processing: Related to HCI, NLP is about enabling computers to understand human (natural) language input.
 Text Analytics: Text mining is the process of gleaning high-quality, meaningful information from text. It includes tasks such as text categorization, text clustering, sentiment analysis and concept/entity extraction (a small word-frequency sketch appears after this list).
 Noisy text analytics: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails, which contain spelling mistakes, abbreviations, fillers ("uh", "hm") and non-standard words.
 Manual tagging with metadata: Tagging the data manually with adequate metadata to provide the requisite semantics needed to understand unstructured data.
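As referenced above under Text Analytics, here is a minimal word-frequency sketch using only the Python standard library; the sample text and stop-word list are invented for illustration.

```python
import re
from collections import Counter

# Invented sample of noisy, unstructured feedback text
docs = [
    "Delivery was late, uh, really late... but support was helpful!",
    "Great product. Support team is helpful and fast.",
]

STOP_WORDS = {"was", "but", "is", "and", "the", "a", "uh"}

counts = Counter()
for doc in docs:
    # Lowercase and keep word characters only (a crude normalization step)
    tokens = re.findall(r"[a-z']+", doc.lower())
    counts.update(t for t in tokens if t not in STOP_WORDS)

# The most frequent content words act as rough "keywords" for the corpus
print(counts.most_common(3))  # e.g. [('late', 2), ('support', 2), ('helpful', 2)]
```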

Parts-of-Speech Tagging: POS tagging (POST) is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech, such as noun, verb or adjective.
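A minimal POS-tagging sketch using the NLTK library (an assumption, since the notes do not name a tool; NLTK must be installed, and its resource names can differ across versions):

```python
import nltk

# One-time model downloads (resource names can vary by NLTK version)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Big data enables better decisions."
tokens = nltk.word_tokenize(sentence)

# Each token gets a Penn Treebank part-of-speech tag
print(nltk.pos_tag(tokens))
# e.g. [('Big', 'JJ'), ('data', 'NNS'), ('enables', 'VBZ'),
#       ('better', 'JJR'), ('decisions', 'NNS'), ('.', '.')]
```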

Unstructured Information Management Architecture (UIMA): An open-source platform from IBM used for real-time content analytics.

Define Big Data. What are the characteristics of Big Data?


Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Characteristics (V's):
1. Volume: It refers to the amount of data. The size of data being handled has grown from bits to yottabytes:
Bits -> Bytes -> KBs -> MBs -> GBs -> TBs -> PBs -> Exabytes -> Zettabytes -> Yottabytes

There are different sources of data such as DOC and PDF files, YouTube, a chat conversation on an internet messenger, a customer feedback form on an online retail website, CCTV coverage and weather forecasts.
The sources of big data:
1. Typical internal data sources: data present within an organization's firewall. Data storage: file systems, SQL RDBMSs (Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL stores (MongoDB, Cassandra, etc.) and so on. Archives: archives of scanned documents, paper archives, customer correspondence records, patients' health records, students' admission records, students' assessment records, and so on.
2. External data sources: data residing outside an organization's firewall. Public web: Wikipedia, regulatory, compliance, weather, census data, etc.
3. Both (internal + external sources): sensor data, machine log data, social media, business apps, media and documents.

2. Variety: Variety deals with the wide range of data types and sources of data: structured, semi-structured and unstructured. Structured data: from traditional transaction processing systems, RDBMSs, etc. Semi-structured data: for example Hypertext Markup Language (HTML) and eXtensible Markup Language (XML). Unstructured data: for example unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.

3. Velocity: It refers to the speed of data processing. We have moved from the days of batch processing to real-time processing.

4. Veracity: Veracity refers to the biases, noise and abnormality in data. The key question is: "Is all the data that is being stored, mined and analysed meaningful and pertinent to the problem under consideration?"

5. Value: This refers to the value that big data can provide, and it relates directly to what organizations can do with that collected data. It is often quantified as the potential social or economic value that the data might create.
Further V's are sometimes added:
Volatility: It deals with "How long is the data valid?"
Validity: Validity refers to the accuracy and correctness of data. Any data picked up for analysis needs to be accurate.
Variability: Data flows can be highly inconsistent, with periodic peaks.

What are the 5 V's?


The 5 V's are defined as follows:
1. Velocity is the speed at which the data is created and how fast
it moves.
2. Volume is the amount of data qualifying as big data.
3. Value is the value the data provides.
4. Variety is the diversity that exists in the types of data.
5. Veracity is the data's quality and accuracy.

Velocity
Velocity refers to how quickly data is generated and how fast it
moves. This is an important aspect for organizations that need their
data to flow quickly, so it's available at the right times to make the
best business decisions possible.

An organization that uses big data will have a large and continuous
flow of data that's being created and sent to its end destination. Data
could flow from sources such as machines, networks, smartphones
or social media. Velocity applies to the speed at which this
information arrives -- for example, how many social media posts per
day are ingested -- as well as the speed at which it needs to be
digested and analyzed -- often quickly and sometimes in near real
time.
As an example, in healthcare, many medical devices today are
designed to monitor patients and collect data. From in-hospital
medical equipment to wearable devices, collected data needs to be
sent to its destination and analyzed quickly.

In some cases, however, it might be better to have a limited set of collected data than to collect more data than an organization can handle, because the latter can lead to slower data velocities.

Volume
Volume refers to the amount of data that exists. Volume is like the
base of big data, as it's the initial size and amount of data that's
collected. If the volume of data is large enough, it can be considered
big data. However, what's considered to be big data is relative and
will change depending on the available computing power that's on
the market.

Value

Value refers to the benefits that big data can provide, and it relates
directly to what organizations can do with that collected data. Being
able to pull value from big data is a requirement, as the value of big
data increases significantly depending on the insights that can be
gained from it.

Variety

Variety refers to the diversity of data types. An organization might obtain data from several data sources, which might vary in value. Data can come from sources both inside and outside an enterprise. The challenge in variety concerns the standardization and distribution of all data being collected.

Unstructured data is data that's unorganized and comes in different files or formats. Typically, unstructured data isn't a good fit for a mainstream relational database because it doesn't fit into conventional data models. Semi-structured data is data that hasn't been organized into a specialized repository but has associated information, such as metadata. This makes it easier to process than unstructured data. Structured data, meanwhile, is data that has been organized into a formatted repository. This means the data is made more addressable for effective data processing and analysis.

Raw data also qualifies as a data type. While raw data can fall into
other categories -- structured, semi-structured or unstructured -- it's
considered raw if it has received no processing at all. Most often, raw
applies to data imported from other organizations or submitted or
entered by users. Social media data often falls into this category.

A more specific example can be found in a company that gathers a variety of data about its customers. This can include structured data culled from transactions or unstructured social media posts and call center text. Much of this might arrive in the form of raw data, requiring cleaning before processing.

Veracity

Veracity refers to the quality, accuracy, integrity and credibility of data. Gathered data could have missing pieces, might be inaccurate or might not be able to provide real, valuable insight. Veracity, overall, refers to the level of trust there is in the collected data.

Data can sometimes become messy and difficult to use. A large amount of data can cause more confusion than insight if it's incomplete. For example, in the medical field, if data about which drugs a patient is taking is incomplete, the patient's life could be endangered.

The challenges with big data:


1. Data today is growing at an exponential rate; most of the data that we have today has been generated in the last two years. The key question is: will all this data be useful for analysis, and how will we separate knowledge from noise?
2. How to host big data solutions outside an organization's own infrastructure (for example, in the cloud).
3. Deciding the period of retention of big data.
4. A dearth of skilled professionals who possess the high level of proficiency in data science that is vital for implementing big data solutions.
5. Challenges with respect to capture, curation, storage, search, sharing, transfer, analysis, privacy violations and visualization.
6. A shortage of data visualization experts.
7. Scale: The storage of data is becoming a challenge for everyone.
8. Security: The production of more and more data increases security and privacy concerns.
9. Schema: There is no place for rigid schemas; dynamic schemas are needed.
10. Continuous availability: How to provide 24x7 support.
11. Consistency: Should one opt for consistency or eventual consistency?
12. Partition tolerance: How to build partition-tolerant systems that can take care of both hardware and software failures.
13. Data quality: Inconsistent data, duplicates, logic conflicts and missing data all result in data quality challenges.

What is big data analytics?


Big data analytics examines and analyzes large and complex data sets known as
“big data.”
Through this analysis, you can uncover valuable insights, patterns, and trends
to make more informed decisions. It uses several techniques, tools, and
technologies to process, manage, and examine meaningful information from
massive datasets.
We typically apply big data analytics when data is too large or complicated for
traditional data processing methods to handle efficiently. The more
information there is, the greater the need for diverse analytical approaches,
quicker handling times, and a more extensive data capacity.

How does big data analytics work?


Big data analytics combines several stages and processes to extract insights.
Here’s a quick overview of what this could look like:

1. Data collection: Gather data from various sources, such as surveys, social media, websites, databases, and transaction records. This data can be structured, unstructured, or semi-structured.
2. Data storage: Store data in distributed systems or cloud-based solutions. These types of storage can handle a large volume of data and provide fault tolerance.
3. Data preprocessing: It's best to clean and preprocess the raw data before performing analysis. This process could involve handling missing values, standardizing formats, addressing outliers, and structuring the data into a more suitable format (a minimal sketch follows this list).
4. Data integration: Data usually comes from various sources in different formats. Data integration combines the data into a unified format.
5. Data processing: Most organizations benefit from using distributed frameworks to process big data. These break down the tasks into smaller chunks and distribute them across multiple machines for parallel processing.
6. Data analysis techniques: Depending on the goal of the analysis, you'll likely apply several data analysis techniques. These could include descriptive, predictive, and prescriptive analytics using machine learning, text mining, exploratory analysis, and other methods.
7. Data visualization: After analysis, communicate the results visually with charts, graphs, dashboards, or other visual tools. Visualization helps you communicate complex insights in an understandable and accessible way.
8. Interpretation and decision making: Interpret the insights gained from your analysis to draw conclusions and make data-backed decisions. These decisions impact business strategies, processes, and operations.
9. Feedback and scale: One of the main advantages of big data analytics frameworks is their ability to scale horizontally. This scalability enables you to handle increasing data volumes and maintain performance, so you have a sustainable method for analyzing large datasets.
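As referenced in step 3, here is a minimal preprocessing sketch using pandas 2.x (an illustrative assumption; the column names and values are invented):

```python
import pandas as pd

# Invented raw records: mixed date formats, a missing value, an outlier
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "January 6, 2024", "2024-01-07", "2024-01-08"],
    "amount": [250.0, 310.0, None, 999999.0],
})

# Standardize formats: parse the mixed date strings into one datetime dtype
raw["order_date"] = pd.to_datetime(raw["order_date"], format="mixed")

# Handle missing values: fill the gap with the column median
raw["amount"] = raw["amount"].fillna(raw["amount"].median())

# Address outliers: cap values above the 95th percentile
raw["amount"] = raw["amount"].clip(upper=raw["amount"].quantile(0.95))

print(raw)
```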
It’s important to remember that big data analytics isn’t a linear process, but a
cycle.
You’ll continually gather new data, analyze it, and refine business strategies
based on the results. The whole process is iterative, which means adapting to
changes and making adjustments is key.
The importance of big data analytics
Big data analytics has the potential to transform the way you operate, make
decisions, and innovate. It’s an ideal solution if you’re dealing with massive
datasets and are having difficulty choosing a suitable analytical approach.
By tapping into the finer details of your information, using techniques and
specific tools, you can use your data as a strategic asset.
Big data analytics enables you to benefit from:
 Informed decision-making: You can make informed decisions based on
actual data, which reduces uncertainty and improves outcomes.
 Business insights: Analyzing large datasets uncovers hidden patterns and
trends, providing a deeper understanding of customer behavior and
market dynamics.
 Customer understanding: Get insight into customer preferences and
needs so you can personalize experiences and create more impactful
marketing strategies.
 Operational efficiency: By analyzing operational data, you can optimize
processes, identify bottlenecks, and streamline operations to reduce
costs and improve productivity.
 Innovation: Big data analytics can help you uncover new opportunities
and niches within industries. You can identify unmet needs and
emerging trends to develop more innovative products and services to
stay ahead of the competition.

Types of big data analytics


There are four main types of big data analytics: descriptive, diagnostic, predictive, and prescriptive.
Collectively, they enable businesses to comprehensively understand their big
data and make decisions to drive improved performance.

Descriptive analytics
This type focuses on summarizing historical data to tell you what's happened in the past. It uses aggregation, data mining, and visualization techniques to understand trends, patterns, and key performance indicators (KPIs).
Descriptive analytics helps you understand your current situation and make informed decisions based on historical information.
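A minimal descriptive-analytics sketch using pandas; the sales records are invented for illustration. Aggregating historical data per group is the classic way to surface trends and KPIs:

```python
import pandas as pd

# Invented historical sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "revenue": [1200, 800, 1500, 950, 1100],
})

# Aggregate history per region: a classic descriptive summary (a KPI table)
print(sales.groupby("region")["revenue"].agg(["count", "sum", "mean"]))
```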

Diagnostic analytics
Diagnostic analytics goes beyond describing past events and aims to understand why they occurred. It drills down into the data to identify the root causes of specific outcomes or issues.
By analyzing relationships and correlations within the data, diagnostic analytics
helps you gain insights into factors influencing your results.

Predictive analytics
This type of analytics uses historical data and statistical algorithms to predict
future events. It spots patterns and trends and forecasts what might happen
next.
You can use predictive analytics to anticipate customer behavior, product
demand, market trends, and more to plan and make strategic decisions
proactively.
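A minimal predictive-analytics sketch using scikit-learn's LinearRegression (an illustrative choice of statistical algorithm; the demand history is invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented monthly demand history: month index -> units sold
months = np.array([[1], [2], [3], [4], [5]])
demand = np.array([100, 120, 138, 160, 179])

# Fit a simple trend model on the historical data
model = LinearRegression().fit(months, demand)

# Forecast demand for the next (sixth) month
print(model.predict(np.array([[6]])))  # roughly [198.8]
```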

Prescriptive analytics
Prescriptive analytics builds on predictive analytics by recommending actions
to optimize future outcomes. It considers various possible actions and their
potential impact on the predicted event or outcome.
Prescriptive analytics helps you make data-driven decisions by suggesting the best course of action based on your desired goals and any constraints.

Q) What is NoSQL? What is the need for NoSQL? Explain different types of NoSQL databases.

NoSQL stands for "Not Only SQL". These are non-relational, open-source, distributed databases. Features of NoSQL:
1. NoSQL databases are non-relational: They do not adhere to the relational data model. In fact they are either key-value, document-oriented, column-oriented or graph-based databases.
2. Distributed: The data is distributed across several nodes in a cluster built from low-cost commodity hardware.
3. No support for ACID properties: They do not offer support for the ACID properties of transactions. Instead, they adhere to the CAP theorem.
4. No fixed table schema: NoSQL databases are becoming increasingly popular owing to their schema flexibility. They do not mandate that the data strictly adhere to any schema structure at the time of storage.
Need for NoSQL:
1. It has a scale-out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured and unstructured data.
3. Dynamic schema: It allows insertion of data without a predefined schema.
4. Auto-sharding: It automatically spreads data across an arbitrary number of servers or nodes in a cluster.
5. Replication: It offers good support for replication, which in turn guarantees high availability, fault tolerance and disaster recovery.
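As a concrete example of the document-oriented type, here is a minimal sketch using the pymongo driver; it assumes a MongoDB server is running locally, and the database, collection and field names are invented:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]       # database and collection are created lazily
orders = db["orders"]

# Dynamic schema: documents in one collection may carry different fields
orders.insert_one({"order_id": 1, "item": "laptop", "amount": 55000})
orders.insert_one({"order_id": 2, "item": "mouse", "gift_wrap": True})

# Query by field value; no predefined table schema was required
print(orders.find_one({"order_id": 2}))
client.close()
```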

What are the advantages and disadvantages of NoSQL?

Advantages:
 Big data capability
 No single point of failure
 Easy replication
 It provides fast performance and horizontal scalability.
 Can handle structured, semi-structured, and unstructured data with equal effect
 NoSQL databases don't need a dedicated high-performance server
 It can serve as the primary data source for online applications.
 Excels at distributed database and multi-data centre operations
 Eliminates the need for a specific caching layer to store data
 Offers a flexible schema design which can easily be altered without downtime or service disruption

Disadvantages:
1. Limited query capabilities.
2. RDBMS databases and tools are comparatively more mature.
3. It does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously.
4. When the volume of data increases, it becomes difficult to maintain unique values as keys.
5. Doesn't work as well with relational data.
6. Being open-source options, they are not yet so popular with enterprises.
7. No support for join and group-by operations.
