What Is Data
Big Data is data, but of enormous size. The term is used for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Following are some examples of Big Data:
New York Stock Exchange:
The New York Stock Exchange generates about one terabyte of new trade
data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
Jet engine
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the volume of data generated reaches many petabytes.
Sources of Big Data
These data come from many sources, such as:
o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs, from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Characteristics of Big Data
Big Data has certain characteristics and hence is defined using the 3 Vs, namely:
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of the data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Other Characteristics
Various individuals and organizations have suggested expanding the original three
Vs, though these proposals have tended to describe challenges rather than qualities
of big data. Some common additions are:
Veracity: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and, consequently, the quality of the resulting analysis).
Variability: Variation in the data leads to wide variation in quality.
Additional resources may be needed to identify, process, or filter low quality
data to make it more useful.
Value: The ultimate challenge of big data is delivering value. Sometimes, the
systems and processes in place are complex enough that using the data and
extracting actual value can become difficult.
Benefits of Big Data Processing
Businesses can utilize outside intelligence when making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
Improved customer service
Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.
Early identification of risk to the product/services, if any.
Better operational efficiency
Big Data technologies can be used to create a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies with a data warehouse helps an organization offload infrequently accessed data.
1. Structured:
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it. However, nowadays we foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes (10^21 bytes, i.e. one billion terabytes, equal one zettabyte). Data stored in a relational database management system is one example of 'structured' data.
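A minimal sketch of what 'structured' means in practice is shown below: a few rows stored in an in-memory SQLite table with a fixed, known schema. The table and column names are invented for illustration only.

```python
import sqlite3

# In-memory relational database; the fixed schema is what makes this data 'structured'.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT    NOT NULL,
        amount_usd REAL    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount_usd) VALUES (?, ?, ?)",
    [(1, "Alice", 19.99), (2, "Bob", 5.50), (3, "Alice", 42.00)],
)

# Because the format is known in advance, querying and aggregating is straightforward.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount_usd) FROM orders GROUP BY customer"
):
    print(customer, total)
conn.close()
```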
Sampling Distributions
A sampling distribution is a probability distribution of a statistic obtained
through a large number of samples drawn from a specific population. The sampling
distribution of a given population is the distribution of frequencies of a range of
different outcomes that could possibly occur for a statistic of a population. A
sample is a subset of a population. For example, if many samples are drawn from a population and the average weight is computed for each sample, the distribution of those sample averages is the sampling distribution of the mean. Not just the mean can be calculated
from a sample. Other statistics, such as the standard deviation, variance, proportion,
and range can be calculated from sample data. The standard deviation and variance
measure the variability of the sampling distribution. The number of observations in
a population, the number of observations in a sample and the procedure used to
draw the sample sets determine the variability of a sampling distribution. The
standard deviation of a sampling distribution is called the standard error. While the
mean of a sampling distribution is equal to the mean of the population, the standard
error depends on the standard deviation of the population, the size of the population
and the size of the sample.
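The sketch below illustrates the idea with a simulated, purely hypothetical population of weights: drawing many samples of size n and recording each sample mean approximates the sampling distribution of the mean, whose mean is close to the population mean and whose spread matches the theoretical standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 100,000 body weights (kg); the numbers are illustrative only.
population = rng.normal(loc=80, scale=12, size=100_000)

n_samples = 10_000   # number of repeated samples
n = 25               # size of each sample

# Draw many samples and record the mean of each one.
sample_means = np.array([
    rng.choice(population, size=n, replace=True).mean()
    for _ in range(n_samples)
])

print("population mean:          ", population.mean())
print("mean of sample means:     ", sample_means.mean())         # close to the population mean
print("standard error (observed):", sample_means.std(ddof=1))    # spread of the sampling distribution
print("standard error (theory):  ", population.std(ddof=0) / np.sqrt(n))
```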
In many cases we would like to learn something about a big population,
without actually inspecting every unit in that population. In that case we would like
to draw a sample that permits us to draw conclusions about a population of interest.
We may for example draw a sample from the population of Dutch men of 18 years
and older to learn something about the joint distribution of height and weight in this
population. Because we cannot draw conclusions about the population from a
sample without error, it is important to know how large these errors may be, and
how often incorrect conclusions may occur. An objective assessment of these errors
is only possible for a probability sample.
For a probability sample, the probability of inclusion in the sample is known
and positive for each unit in the population. Drawing a probability sample of size n
from a population consisting of N units, may be a quite complex random
experiment. The experiment is simplified considerably by subdividing it into n
experiments, consisting of drawing the n consecutive units. In a simple random
sample, the n consecutive units are drawn with equal probabilities from the units
concerned. In random sampling with replacement the sub experiments (drawing of
one unit) are all identical and independent: n times a unit is randomly selected from
the entire population. We will see that this property simplifies the ensuing analysis
considerably. For units in the sample we observe one or more population variables.
For probability samples, each draw is a random experiment. Every observation may
therefore be viewed as a random variable. The observation of a population variable X from the unit drawn in the ith trial yields a random variable Xi. Observation of the complete sample yields n random variables X1, ..., Xn. Likewise, if we observe for each unit the pair of population variables (X, Y), we obtain pairs of random variables (Xi, Yi) with outcomes (xi, yi). Consider the population of size N = 6, displayed in Table 1.
Table 1. Population of N = 6 units
Unit  1  2  3  4  5  6
X     1  1  2  2  2  3

Table 2. Probability distribution of X1 and X2
x          1    2    3
P(X = x)   1/3  1/2  1/6
A random sample of size n = 2 is drawn with replacement from this population. For
each unit drawn we observe the value of X. This yields two random variables X1
and X2, with identical probability distribution as displayed in Table 2. Furthermore, X1 and X2 are independent, so their joint distribution equals the product of their individual distributions, i.e. P(X1 = x1, X2 = x2) = P(X1 = x1) · P(X2 = x2).
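The following sketch is only a worked version of this calculation: it derives Table 2 from the Table 1 population and multiplies the marginal probabilities to obtain the joint distribution of (X1, X2) under sampling with replacement.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Population of N = 6 units with variable X as in Table 1.
x_values = [1, 1, 2, 2, 2, 3]
N = len(x_values)

# Probability distribution of a single draw (Table 2): P(X = x) = count / N.
dist = {x: Fraction(count, N) for x, count in Counter(x_values).items()}
print("P(X = x):", dist)                     # {1: 1/3, 2: 1/2, 3: 1/6}

# Joint distribution of (X1, X2) for n = 2 draws with replacement:
# the draws are independent, so the probabilities multiply.
joint = {(x1, x2): dist[x1] * dist[x2] for x1, x2 in product(dist, repeat=2)}
print("P(X1 = 1, X2 = 2):", joint[(1, 2)])   # 1/3 * 1/2 = 1/6
```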
Usually we are not really interested in the individual outcomes of the sample, but
rather in some sample statistic. A statistic is a function of the sample observations
X1, ..., Xn, and therefore is itself also a random variable. Some important sample statistics are the sample mean X̄ = (1/n) Σ_{i=1}^n Xi, the sample variance S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)², and the sample fraction Fr = (1/n) Σ_{i=1}^n Xi (for a 0-1 variable X).
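As a small sketch, these statistics can be computed directly from an arbitrary, randomly generated sample (the data below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.integers(low=1, high=4, size=10)   # illustrative sample of X values in {1, 2, 3}

x_bar = sample.mean()            # sample mean  X̄ = (1/n) Σ Xi
s2    = sample.var(ddof=1)       # sample variance S² with the 1/(n-1) divisor
fr    = (sample == 3).mean()     # sample fraction for the 0-1 indicator "X equals 3"

print("sample:", sample)
print("mean:", x_bar, "variance:", s2, "fraction:", fr)
```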
Frequentist Inference
According to frequentists, inference procedures should be interpreted and evaluated
in terms of their behavior in hypothetical repetitions under the same conditions. The
sampling distribution of a statistic is of crucial importance. The two basic types of
frequentist inference are estimation and testing.
Point Estimation
In point estimation one provides a single number, the point estimate, as a guess for an unknown population parameter θ. If G denotes the estimator of θ, an important quality measure is its bias, Eθ(G) − θ, where expectation is taken with respect to repeated samples from the population. If Eθ(G) = θ, i.e. the expected value of the estimator is equal to the value of the population parameter, then the estimator G is called unbiased.
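A quick simulation can make unbiasedness concrete: averaging the sample mean over many repeated samples from a simulated (hypothetical) population comes out close to the population mean, i.e. Eθ(X̄) ≈ θ.

```python
import numpy as np

rng = np.random.default_rng(4)
population = rng.exponential(scale=5.0, size=50_000)   # hypothetical skewed population
theta = population.mean()                              # the parameter: the population mean

# Approximate E(G) by averaging the estimator X̄ over many repeated samples.
n, reps = 20, 10_000
means = np.array([rng.choice(population, size=n).mean() for _ in range(reps)])

print("theta:          ", theta)
print("average of X̄'s: ", means.mean())   # close to theta: the sample mean is unbiased
```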
Interval Estimation
An interval estimator for population parameter θ is an interval of type (GL, GU ).
Two important quality measures for interval estimates are:
Eθ(GU − GL),
i.e. the expected width of the interval, and
Pθ(GL < θ < GU ),
i.e. the probability that the interval contains the true parameter value. Clearly there
is a trade-off between these quality measures. If we require a high probability that
the interval contains the true parameter value, the interval itself has to become
wider. It is customary to choose a confidence level (1 − α) and use an interval
estimator such that
Pθ(GL < θ < GU ) ≥ 1 − α
for all possible values of θ. A realisation (gL, gU ) of such an interval estimator is
called a 100(1 − α)% confidence interval.
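As a sketch, and assuming SciPy is available, the snippet below computes a t-based 95% confidence interval for a population mean from one simulated sample. The data and parameter values are invented for illustration, and the t interval is just one common choice of interval estimator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=175, scale=7, size=40)   # hypothetical heights (cm)

alpha = 0.05
n = len(sample)
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)             # estimated standard error of the mean

# Realisation (g_L, g_U) of a t-based interval estimator for mu with confidence level 1 - alpha.
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
g_l, g_u = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% confidence interval for mu: ({g_l:.2f}, {g_u:.2f})")
```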
Hypothesis Testing
A test is a statistical procedure to make a choice between two hypotheses
concerning the value of a population parameter θ. One of these, called the null
hypothesis and denoted by H0, gets the “benefit of the doubt”. The two possible
conclusions are to reject or not to reject H0. H0 is only rejected if the sample data
contains strong evidence that it is not true. The null hypothesis is rejected if and only if the realisation g of test statistic G is in the critical region, denoted by C. In doing so we can make two kinds of errors. Type I error: reject H0 when it is true. Type II error: accept H0 when it is false. Type I errors are considered to be more serious than Type II errors. Test statistic G is usually a point estimator for θ; e.g., if we test a hypothesis concerning the value of the population mean µ, then X̄ is an obvious choice of test statistic.
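A minimal sketch of such a test: a one-sample t-test of H0: µ = 100 against a two-sided alternative, on simulated data (all numbers are illustrative, and SciPy is assumed to be available).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=102, scale=10, size=30)   # hypothetical measurements

# H0: mu = 100 versus H1: mu != 100, using X̄ (via the t statistic) as test statistic.
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

alpha = 0.05
print("t statistic:", t_stat, "p-value:", p_value)
if p_value < alpha:
    print("Reject H0: the sample contains strong evidence against mu = 100.")
else:
    print("Do not reject H0.")
```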
Prediction Error
A prediction error is the failure of some expected event to occur. When
predictions fail, humans can use metacognitive functions, examining prior
predictions and failures and deciding, for example, whether there
are correlations and trends, such as consistently being unable to foresee outcomes
accurately in particular situations. Applying that type of knowledge can inform
decisions and improve the quality of future predictions. Predictive analytics
software processes new and historical data to forecast activity, behavior and trends.
The programs apply statistical analysis techniques, analytical queries and machine
learning algorithms to data sets to create predictive models that quantify the
likelihood of a particular event happening.
Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be. Analysis of
prediction errors from similar or previous models can help determine confidence
intervals.
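As a rough sketch of how prediction errors might be quantified, the snippet below computes common error measures and an approximate error interval from a handful of invented predictions; the interval assumes the errors are roughly normally distributed, which is an assumption rather than a given.

```python
import numpy as np

# Hypothetical observed values and model predictions; the numbers are illustrative only.
y_true = np.array([10.0, 12.5, 9.8, 14.2, 11.1, 13.4])
y_pred = np.array([ 9.5, 13.0, 10.4, 13.6, 11.9, 12.8])

errors = y_true - y_pred                 # prediction errors (residuals)
mae  = np.abs(errors).mean()             # mean absolute error
rmse = np.sqrt((errors ** 2).mean())     # root mean squared error

# Approximate 95% interval for future prediction errors, assuming rough normality.
lo = errors.mean() - 1.96 * errors.std(ddof=1)
hi = errors.mean() + 1.96 * errors.std(ddof=1)
print(f"MAE = {mae:.2f}  RMSE = {rmse:.2f}  approx. 95% error interval: ({lo:.2f}, {hi:.2f})")
```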
In artificial intelligence (AI), the analysis of prediction errors can help
guide machine learning (ML), similarly to the way it does for human learning.
In reinforcement learning, for example, an agent might use the goal of minimizing
error feedback as a way to improve. Prediction errors, in that case, might be
assigned a negative value and predicted outcomes a positive value, in which case
the AI would be programmed to attempt to maximize its score. That approach to
ML, sometimes known as error-driven learning, seeks to stimulate learning by
approximating the human drive for mastery.