big data analytics
Parts of Speech Tagging: POS tagging is the process of reading text and tagging each word in the sentence as belonging to a particular part of speech, such as noun, verb, or adjective.
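As a rough illustration of the idea, here is a minimal dictionary-based tagger sketch in pure Python. The lexicon and tag names are invented for the example; real taggers (such as NLTK's) use trained statistical models rather than a lookup table.

```python
# Minimal dictionary-based POS tagger sketch (illustrative only).
# Lexicon and tags are made up; real taggers use trained models.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
    "quick": "ADJ", "lazy": "ADJ",
}

def pos_tag(sentence):
    """Tag each word; unknown words default to NOUN."""
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in sentence.split()]

print(pos_tag("The quick dog runs"))
# [('The', 'DET'), ('quick', 'ADJ'), ('dog', 'NOUN'), ('runs', 'VERB')]
```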
Unstructured Information Management Architecture (UIMA): an open-source framework, originally developed by IBM, used for real-time content analytics.
There are many different sources of data: documents, PDFs, YouTube videos, a chat conversation on an internet messenger, a customer feedback form on an online retail website, CCTV footage, weather forecasts, and so on.
The sources of big data:
1. Typical internal data sources: data present within an organization's firewall.
   Data storage: file systems, SQL (RDBMSs such as Oracle, MS SQL Server, DB2, MySQL, PostgreSQL), NoSQL (MongoDB, Cassandra), and so on.
   Archives: archives of scanned documents, paper archives, customer correspondence records, patients' health records, students' admission records, students' assessment records, and so on.
2. External data sources: data residing outside an organization's firewall.
   Public web: Wikipedia, regulatory, compliance, weather, census data, etc.
3. Both (internal + external sources): sensor data, machine log data, social media, business apps, media, and documents.
Variety: Variety deals with the wide range of data types and sources of data: structured, semi-structured, and unstructured.
Structured data: from traditional transaction processing systems, RDBMSs, etc.
Semi-structured data: for example, Hypertext Markup Language (HTML) and eXtensible Markup Language (XML).
Unstructured data: for example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.
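To make the "semi-structured" idea concrete, the sketch below parses a small XML snippet with Python's standard library. The XML carries its own structure in tags, but individual records need not share the same fields (the second customer has no email). The data values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Semi-structured data: tags describe the structure, but fields can
# vary from record to record. Values are invented for illustration.
doc = """
<customers>
  <customer id="1"><name>Asha</name><email>asha@example.com</email></customer>
  <customer id="2"><name>Ravi</name></customer>
</customers>
"""

root = ET.fromstring(doc)
for c in root.findall("customer"):
    name = c.findtext("name")
    email = c.findtext("email", default="(none)")  # tolerate missing field
    print(c.get("id"), name, email)
```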
Velocity: It refers to the speed of data processing. We have moved from the days of batch processing to real-time processing.
Veracity: Veracity refers to biases, noise, and abnormality in data. The key question is: "Is all the data that is being stored, mined, and analysed meaningful and pertinent to the problem under consideration?"
Value: This refers to the value that big data can provide, and it relates directly
to what organizations can do with that collected data. It is often quantified as the
potential social or economic value that the data might create.
Volatility: It deals with the question "How long is the data valid?"
Validity: Validity refers to the accuracy and correctness of data. Any data picked up for analysis needs to be accurate.
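A simple way to act on validity in practice is to screen records against basic rules before analysis, setting invalid rows aside. The field names and rules below are hypothetical, chosen just to illustrate the pattern.

```python
# Sketch of validity checks before analysis. Field names and rules
# are hypothetical; real pipelines would use domain-specific rules.
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": -5, "email": "b@example.com"},  # invalid age
    {"id": 3, "age": 28, "email": "not-an-email"},   # invalid email
]

def is_valid(rec):
    return 0 <= rec["age"] <= 120 and "@" in rec["email"]

valid = [r for r in records if is_valid(r)]
invalid = [r for r in records if not is_valid(r)]
print(len(valid), len(invalid))  # 1 2
```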
Velocity
Velocity refers to how quickly data is generated and how fast it
moves. This is an important aspect for organizations that need their
data to flow quickly, so it's available at the right times to make the
best business decisions possible.
An organization that uses big data will have a large and continuous
flow of data that's being created and sent to its end destination. Data
could flow from sources such as machines, networks, smartphones
or social media. Velocity applies to the speed at which this
information arrives -- for example, how many social media posts per
day are ingested -- as well as the speed at which it needs to be
digested and analyzed -- often quickly and sometimes in near real
time.
As an example, in healthcare, many medical devices today are
designed to monitor patients and collect data. From in-hospital
medical equipment to wearable devices, collected data needs to be
sent to its destination and analyzed quickly.
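The healthcare example above can be sketched as stream processing over a sliding window: each new reading updates a running statistic immediately, rather than waiting for a batch. The readings and window size below are invented for illustration.

```python
from collections import deque

# Sketch: near-real-time processing of a stream of device readings
# using a fixed-size sliding window. Values are invented.
class SlidingAverage:
    def __init__(self, size):
        # deque with maxlen drops the oldest reading automatically
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

monitor = SlidingAverage(size=3)
for reading in [72, 75, 78, 90]:  # e.g. heart-rate samples
    avg = monitor.add(reading)
print(avg)  # average of the last 3 readings: (75 + 78 + 90) / 3 = 81.0
```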
Volume
Volume refers to the amount of data that exists. Volume is like the
base of big data, as it's the initial size and amount of data that's
collected. If the volume of data is large enough, it can be considered
big data. However, what's considered to be big data is relative and
will change depending on the available computing power that's on
the market.
Value
Value refers to the benefits that big data can provide, and it relates
directly to what organizations can do with that collected data. Being
able to pull value from big data is a requirement, as the value of big
data increases significantly depending on the insights that can be
gained from it.
Variety
Variety refers to the diversity of data types an organization encounters. Raw data also qualifies as a data type. While raw data can fall into other categories -- structured, semi-structured or unstructured -- it's considered raw if it has received no processing at all. Most often, raw applies to data imported from other organizations or submitted or entered by users. Social media data often falls into this category.
Veracity
Veracity refers to the quality and trustworthiness of data: data with heavy bias, noise, or abnormality is of limited analytical value, so organizations must ask whether what they store and mine is meaningful and pertinent to the problem at hand.
Types of data analytics
Descriptive analytics
This type focuses on summarizing historical data to tell you what happened in the past. It uses aggregation, data mining, and visualization techniques to understand trends, patterns, and key performance indicators (KPIs).
Descriptive analytics helps you understand your current situation and make
informed decisions based on historical information.
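The aggregation step of descriptive analytics can be sketched with the standard library: summarizing historical figures into a handful of summary statistics. The sales numbers below are invented for the example.

```python
import statistics

# Sketch: descriptive analytics as aggregation over historical data.
# Monthly sales figures are invented for illustration.
monthly_sales = [120, 135, 128, 150, 160, 155]

summary = {
    "total": sum(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "max": max(monthly_sales),
    "min": min(monthly_sales),
}
print(summary)
```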
Diagnostic analytics
Diagnostic analytics goes beyond describing past events and aims to understand why they occurred. It drills into the data to identify the root causes of specific outcomes or issues.
By analyzing relationships and correlations within the data, diagnostic analytics
helps you gain insights into factors influencing your results.
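One common diagnostic technique is checking how strongly two quantities move together, e.g. with a Pearson correlation coefficient. The sketch below computes it from scratch; the ad-spend and sales figures are invented and deliberately perfectly linear.

```python
# Sketch: diagnostic analytics via correlation -- does ad spend move
# with sales? Data invented; sales is exactly linear in ad_spend here.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40, 50]
sales    = [25, 45, 65, 85, 105]
print(round(pearson(ad_spend, sales), 3))  # 1.0 (perfect correlation)
```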
Predictive analytics
This type of analytics uses historical data and statistical algorithms to predict
future events. It spots patterns and trends and forecasts what might happen
next.
You can use predictive analytics to anticipate customer behavior, product
demand, market trends, and more to plan and make strategic decisions
proactively.
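A minimal predictive-analytics sketch: fit a least-squares trend line to historical demand, then extrapolate one period ahead. The demand figures are invented and follow a steady trend so the forecast is easy to check by hand.

```python
# Sketch: predictive analytics as least-squares trend fitting on
# historical demand, then forecasting the next period. Data invented.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # slope, intercept

months = [1, 2, 3, 4, 5]
demand = [100, 110, 120, 130, 140]  # steady +10 per month
m, b = fit_line(months, demand)
print(m * 6 + b)  # forecast for month 6: 150.0
```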
Prescriptive analytics
Prescriptive analytics builds on predictive analytics by recommending actions
to optimize future outcomes. It considers various possible actions and their
potential impact on the predicted event or outcome.
Prescriptive analytics help you make data-driven decisions by suggesting the
best course of action based on your desired goals and any constraints.
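At its simplest, prescriptive analytics means ranking candidate actions by predicted payoff subject to a constraint. The sketch below picks the action with the best expected net gain within a budget; all action names and numbers are invented.

```python
# Sketch: prescriptive analytics as choosing the action with the best
# expected net outcome subject to a budget constraint. Numbers invented.
actions = [
    {"name": "discount",        "cost": 5000,  "expected_gain": 12000},
    {"name": "ad_campaign",     "cost": 8000,  "expected_gain": 20000},
    {"name": "loyalty_program", "cost": 15000, "expected_gain": 25000},
]
budget = 10000

# Keep only actions we can afford, then maximize expected net gain.
feasible = [a for a in actions if a["cost"] <= budget]
best = max(feasible, key=lambda a: a["expected_gain"] - a["cost"])
print(best["name"])  # ad_campaign (net 12000 vs discount's 7000)
```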
NoSQL
Advantages:
1. Big data capability
2. No single point of failure
3. Easy replication
4. Provides fast performance and horizontal scalability
5. Can handle structured, semi-structured, and unstructured data with equal effect
6. NoSQL databases don't need a dedicated high-performance server
7. Can serve as the primary data source for online applications
8. Excels at distributed database and multi-data-centre operations
9. Eliminates the need for a specific caching layer to store data
10. Offers a flexible schema design which can easily be altered without downtime or service disruption
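The flexible-schema advantage can be mimicked with plain Python dicts acting as JSON-like documents: records in the same "collection" need not share fields, and queries simply tolerate missing ones. This is only an analogy; real document stores such as MongoDB add indexing, replication, and persistence on top. Names and values below are invented.

```python
# Sketch: a document store's flexible schema, mimicked with a dict of
# JSON-like documents. Illustrative only -- not a real database.
collection = {}

def insert(doc_id, doc):
    collection[doc_id] = doc  # no fixed schema to validate against

insert("u1", {"name": "Asha", "email": "asha@example.com"})
insert("u2", {"name": "Ravi", "phone": "555-0101", "tags": ["vip"]})  # different fields

# Queries tolerate missing fields instead of failing:
emails = [d.get("email", "(none)") for d in collection.values()]
print(emails)  # ['asha@example.com', '(none)']
```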
Disadvantages:
1. Limited query capabilities
2. RDBMS databases and tools are comparatively more mature
3. Does not offer some traditional database capabilities, like consistency when multiple transactions are performed simultaneously
4. When the volume of data increases, it is difficult to maintain unique values, as managing keys becomes difficult
5. Doesn't work as well with relational data
6. Open-source options are not so popular with enterprises
7. No support for join and group-by operations