DA Interview Questions
Skills: Machine Learning, Excel, SQL, Python, R, Statistical Modeling, Tableau, Power BI, Docker, Software Engineering, Database Management, Data Collection, Web Scraping, Data Cleaning, Data Visualization, Exploratory Data Analysis, Reports Development and Presentations.
Work for: Predictive and prescriptive analysis, Machine Learning model building and deployment, Task automation, Business improvement tasks and processes.
4. What are the different tools mainly used for data analysis?
There are many different tools used for data analysis, each with its own strengths and
weaknesses. Some of the most commonly used tools for data analysis are as
follows:
Spreadsheet Software: Spreadsheet software is used for a variety of
data analysis tasks, such as sorting, filtering, and summarizing data. It
also has several built-in functions for performing statistical analysis.
The three most widely used spreadsheet programs are:
o Microsoft Excel
o Google Sheets
o LibreOffice Calc
Database Management Systems (DBMS): DBMSs are crucial resources
for data analysis. They offer a secure and efficient way to manage,
store, and organize massive amounts of data. Some of the most popular
DBMSs are:
o MySQL
o PostgreSQL
o Microsoft SQL Server
o Oracle Database
Statistical Software: There are many statistical software packages used for data
analysis, each with its own strengths and weaknesses. Some of the most
popular packages are as follows:
o SAS: Widely used in various industries for statistical
analysis and data management.
o SPSS: A software suite used for statistical analysis in
social science research.
o Stata: A tool commonly used for managing, analyzing,
and graphing data in various fields.
Programming Language: In data analysis, programming languages are
used for deep and customized analysis according to mathematical and
statistical concepts. For Data analysis, two programming languages are
highly popular:
o R: R is a free and open-source programming language
that is widely popular for data analysis. It provides strong
visualization capabilities and an environment designed mainly for
statistical analysis and data visualization, along with a wide
variety of packages for performing different data
analysis tasks.
o Python: Python is also a free and open-source
programming language used for data analysis.
Nowadays, it is becoming widely popular among
researchers. Along with data analysis, it is used for
Machine Learning, Artificial Intelligence, and web
development.
5. What is Data Wrangling?
Data wrangling is closely related to data preprocessing and is also
known as data munging. It involves cleaning, transforming, and
organizing raw, messy, or unstructured data into a usable format. The main
goal of data wrangling is to improve the quality and structure of the dataset so
that it can be used for analysis, model building, and other data-driven tasks.
Data wrangling can be a complicated and time-consuming process, but it is
critical for businesses that want to make data-driven decisions. By taking the
time to wrangle their data, businesses can obtain significant insights about their
products, services, and bottom line.
Some of the most common tasks involved in data wrangling are as follows:
Data Cleaning: Identify and remove errors, inconsistencies, and
missing values from the dataset.
Data Transformation: Transform the structure, format, or values of the
data as required by the analysis. This may include scaling and
normalization or encoding categorical values.
Data Integration: Combine two or more datasets when the data is
scattered across multiple sources and a consolidated analysis is needed.
Data Restructuring: Reorganize the data to make it more suitable for
analysis. Here, data is reshaped into different formats or new
variables are created by aggregating features at different levels.
Data Enrichment: Enrich the data by adding additional relevant
information, which may be external data or an aggregation of two
or more existing features.
Quality Assurance: Ensure that the data meets certain
quality standards and is fit for analysis.
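As a rough illustration of a few of these steps, here is a minimal pandas sketch; the file name sales.csv and the price and category columns are hypothetical placeholders, not part of the original question.
import pandas as pd

# Load raw data (hypothetical file and columns)
df = pd.read_csv("sales.csv")

# Data cleaning: drop duplicate rows and fill missing prices with the median
df = df.drop_duplicates()
df["price"] = df["price"].fillna(df["price"].median())

# Data transformation: min-max scale the price column to [0, 1]
df["price_scaled"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

# Data restructuring: aggregate prices per category
summary = df.groupby("category")["price"].agg(["mean", "count"])
print(summary)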
6. What is the difference between descriptive and predictive analysis?
Descriptive and predictive analysis are the two different ways to analyze the data.
Descriptive Analysis: Descriptive analysis is used to answer
questions like “What has happened in the past?” and “What are the key
characteristics of the data?”. Its main goal is to identify the patterns,
trends, and relationships within the data. It uses statistical measures,
visualizations, and exploratory data analysis techniques to gain insight
into the dataset.
The key characteristics of descriptive analysis are as follows:
o Historical Perspective: Descriptive analysis is
concerned with understanding past data and events.
o Summary Statistics: It often involves calculating
basic statistical measures like mean, median, mode,
standard deviation, and percentiles.
o Visualizations: Graphs, charts, histograms, and other
visual representations are used to illustrate data
patterns.
o Patterns and Trends: Descriptive analysis helps
identify recurring patterns and trends within the data.
o Exploration: It’s used for initial data exploration and
hypothesis generation.
Predictive Analysis: Predictive Analysis, on the other hand, uses past
data and applies statistical and machine learning models to identify
patterns and relationships and make predictions about future events. Its
primary goal is to predict or forecast what is likely to happen in the future.
The key characteristics of predictive analysis are as follows:
o Future Projection: Predictive analysis is used to
forecast and predict future events.
o Model Building: It involves developing and training
models using historical data to predict outcomes.
o Validation and Testing: Predictive models are
validated and tested using unseen data to assess their
accuracy.
o Feature Selection: Identifying relevant features
(variables) that influence the predicted outcome is
crucial.
o Decision Making: Predictive analysis supports
decision-making by providing insights into potential
outcomes.
7. What is univariate, bivariate, and multivariate analysis?
Univariate, Bivariate and multivariate are the three different levels of data
analysis that are used to understand the data.
1. Univariate analysis: Univariate analysis examines one variable at a
time. Its main purpose is to understand the variable's distribution, measures of
central tendency (mean, median, and mode), and measures of dispersion
(range, variance, and standard deviation), often using graphical methods such
as histograms and box plots. It does not deal with causes or
relationships involving other variables in the dataset.
Common techniques used in univariate analysis include histograms, bar
charts, pie charts, box plots, and summary statistics.
2. Bivariate analysis: Bivariate analysis involves analyzing the
relationship between two variables. Its primary goal is to understand
how one variable is related to the other. It reveals whether there is a
correlation between the two variables and, if so, how strong that
correlation is. It can also be used to predict the value of one variable
from the value of another variable based on the relationship found
between the two.
Common techniques used in bivariate analysis include scatter plots,
correlation analysis, contingency tables, and cross-tabulations.
3. Multivariate analysis: Multivariate analysis is used to analyze the
relationship between three or more variables simultaneously. Its
primary goal is to understand the relationship among the multiple
variables. It is used to identify the patterns, clusters, and dependencies
among the several variables.
Common techniques used in multivariate analysis include principal
component analysis (PCA), factor analysis, cluster analysis, and
regression analysis involving multiple predictor variables.
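For instance, the three levels could be explored with pandas along the following lines; the small DataFrame and its age, income, and spend columns are made-up examples.
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 29, 52],
                   "income": [30, 55, 62, 44, 80],
                   "spend": [5, 12, 15, 9, 20]})

# Univariate: summary statistics of a single variable
print(df["age"].describe())

# Bivariate: correlation between two variables
print(df["age"].corr(df["income"]))

# Multivariate: correlation matrix across all variables
print(df.corr())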
8. Name some of the most popular data analysis and visualization tools
used for data analysis.
Some of the most popular data analysis and visualization tools are as follows:
Tableau: Tableau is a powerful data visualization application that
enables users to generate interactive dashboards and visualizations from
a wide range of data sources. It is a popular choice for businesses of all
sizes since it is simple to use and can be adjusted to match any
organization’s demands.
Power BI: Microsoft’s Power BI is another well-known data
visualization tool. Power BI’s versatility and connectivity with other
Microsoft products make it a popular data analysis and visualization
tool in both individual and enterprise contexts.
Qlik Sense: Qlik Sense is a data visualization tool that is well-known
for its speed and performance. It enables users to generate interactive
dashboards and visualizations from several data sources, and it can be
used to examine enormous datasets.
SAS: A software suite used for advanced analytics, multivariate
analysis, and business intelligence.
IBM SPSS: A statistical software for data analysis and reporting.
Google Data Studio: Google Data Studio is a free web-based data
visualization application that allows users to create customized
dashboards and simple reports. It aggregates data from up to 12
different sources, including Google Analytics, into an easy-to-modify,
easy-to-share, and easy-to-read report.
9. What are the steps you would take to analyze a dataset?
Data analysis involves a series of steps that transform raw data into relevant
insights, conclusions, and actionable suggestions. While the specific approach
will vary based on the context and aims of the study, here is an approximate
outline of the processes commonly followed in data analysis:
Problem Definition or Objective: Make sure that the problem or
question you’re attempting to answer is stated clearly. Understand the
analysis’s aims and objectives to direct your strategy.
Data Collection: Collate relevant data from various sources. This
might include surveys, tests, databases, web scraping, and other
techniques. Make sure the data is representative and accurate.
Data Preprocessing or Data Cleaning: Raw data often has errors,
missing values, and inconsistencies. In Data Preprocessing and
Cleaning, we redefine the column’s names or values, standardize the
formats, and deal with the missing values.
Exploratory Data Analysis (EDA): EDA is a crucial step in Data
analysis. In EDA, we apply various graphical and statistical approaches
to systematically analyze and summarize the main characteristics,
patterns, and relationships within a dataset. The primary objective
behind the EDA is to get a better knowledge of the data’s structure,
identify probable abnormalities or outliers, and offer initial insights that
can guide further analysis.
Data Visualizations: Data visualizations play a very important role in
data analysis. They provide visual representations of complicated
information and patterns in the data, which enhances the understanding
of the data and helps in identifying trends or patterns within it. They
enable effective communication of insights to various stakeholders.
10. What is data cleaning?
Data cleaning is the process of identifying and removing misleading or inaccurate
records from a dataset. The primary objective of data cleaning is to improve
the quality of the data so that it can be used for analysis and predictive model-
building tasks. It is the step that follows data collection and loading.
In Data cleaning, we fix a range of issues that are as follows:
1. Inconsistencies: Stored data is sometimes inconsistent due to
variations in formats, column names, data types, or value naming
conventions, which creates difficulties when aggregating and
comparing records. Before any further analysis, we correct these
inconsistencies and formatting issues.
2. Duplicate entries: Duplicate records can bias analysis results,
producing exaggerated counts or incorrect statistical summaries, so
we remove them as well.
3. Missing Values: Some data points may be missing. Before going
further either we remove the entire rows or columns or we fill the
missing values with probable items.
4. Outliers: Outliers are data points that differ drastically from the rest of the
data and may result from errors when collecting the dataset. If they are
not handled properly, they can bias results, even though they can sometimes
offer useful insights. So we first detect outliers and then treat or remove them.
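A minimal pandas sketch of these four fixes might look like the following; the toy city/sales data is hypothetical.
import pandas as pd

df = pd.DataFrame({"city": ["NY", "ny", "LA", "LA", None, "SF", "SF", "NY"],
                   "sales": [100, 100, 250, 250, 9999, 120, 130, 160]})

# 1. Inconsistencies: standardize the text format
df["city"] = df["city"].str.upper()

# 2. Duplicate entries: remove them
df = df.drop_duplicates()

# 3. Missing values: impute with a placeholder (or drop the rows)
df["city"] = df["city"].fillna("UNKNOWN")

# 4. Outliers: keep only values within 1.5 * IQR of the quartiles
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["sales"] >= q1 - 1.5 * iqr) & (df["sales"] <= q3 + 1.5 * iqr)]
print(df)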
11. What is the importance of exploratory data analysis (EDA) in data
analysis?
Exploratory data analysis (EDA) is the process of investigating and
understanding the data through graphical and statistical techniques. It is one of
the crucial parts of data analysis that helps to identify the patterns and trends in
the data, as well as helping to understand the relationships between variables.
EDA is a non-parametric approach to data analysis, which means it does not make any
assumptions about the dataset. EDA is important for a number of reasons:
1. With EDA we can get a deep understanding of the patterns, distributions,
and nature of the data, and of each variable's relationship with the other
variables in the dataset.
2. With EDA we can assess the quality of the dataset through univariate
analysis, such as the mean, median, mode, interquartile range, and
distribution plots, and identify the patterns and trends of individual
variables in the dataset.
3. With EDA we can also examine the relationship between two or more
variables through bivariate or multivariate analysis, such as regression,
correlation, covariance, scatter plots, and line plots.
4. With EDA we can find the most influential features of the dataset
using correlations, covariance, and various bivariate or multivariate
plots.
5. With EDA we can also identify outliers using box plots and then remove
or treat them using a statistical approach.
EDA provides the groundwork for the entire data analysis process. It enables
analysts to make more informed judgments about data processing, hypothesis
testing, modelling, and interpretation, resulting in more accurate and relevant
insights.
12. What is Time Series analysis?
Time Series analysis is a statistical technique used to analyze and interpret data
points collected at specific time intervals. Time series data is the data points
recorded sequentially over time. The data points can be numerical, categorical, or
both. The objective of time series analysis is to understand the underlying
patterns, trends and behaviours in the data as well as to make forecasts about
future values.
The key components of Time Series analysis are as follows:
Trend: The data’s long-term movement or direction over time. Trends
can be upward, downward, or flat.
Seasonality: Patterns that repeat at regular intervals, such as daily,
monthly, or yearly cycles.
Cyclical Patterns: Longer-term trends that are not as regular as
seasonality, and are frequently associated with economic or business
cycles.
Irregular Fluctuations: Unpredictable and random data fluctuations
that cannot be explained by trends, seasonality, or cycles.
Auto-correlations: The link between a data point and its prior values.
It quantifies the degree of dependence between observations at different
time points.
Time series analysis includes a variety of techniques: descriptive
analysis to identify trends, patterns, and irregularities; smoothing
techniques such as moving averages or exponential smoothing to reduce noise and
highlight underlying trends; decomposition to separate the time series into
its individual components; and forecasting techniques such as ARIMA, SARIMA,
and regression to predict future values based on the trends.
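As an example, moving-average smoothing and an additive decomposition could be sketched with pandas and statsmodels as follows; the synthetic daily series is made up, and a weekly period of 7 is assumed.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series: upward trend plus noise
dates = pd.date_range("2023-01-01", periods=90, freq="D")
ts = pd.Series(np.arange(90) * 0.5 + np.random.normal(0, 3, size=90), index=dates)

# Smoothing: 7-day moving average to highlight the underlying trend
trend = ts.rolling(window=7).mean()

# Decomposition: split the series into trend, seasonal, and residual parts
result = seasonal_decompose(ts, model="additive", period=7)
print(result.trend.dropna().head())
print(result.seasonal.head())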
13. What is Feature Engineering?
Feature engineering is the process of selecting, transforming, and creating
features from raw data in order to build more effective and accurate machine
learning models. The primary goal of feature engineering is to identify the most
relevant features, or to create new features by combining two or more existing
features using mathematical operations, so that the data can be used
effectively by a machine learning model for predictive analysis.
The following are the key elements of feature engineering:
Feature Selection: We identify the most relevant features
from the dataset based on their correlation with the target variable.
Creating new features: We generate new features by
aggregating or transforming existing features in such a way that they
help capture patterns or trends that are not revealed by
the original features.
Transformation: We modify or scale features so that they
are more useful for building the machine learning model. Some of the
common transformation methods are min-max scaling, Z-score
normalization, and log transformation.
Feature encoding: Most ML algorithms can only process numerical
data, so we need to encode categorical features into numerical
vectors. Popular encoding techniques are one-hot encoding and
ordinal/label encoding.
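A small pandas sketch of these ideas, using a made-up height/weight/city dataset:
import numpy as np
import pandas as pd

df = pd.DataFrame({"height_cm": [160, 175, 182],
                   "weight_kg": [55, 70, 90],
                   "city": ["NY", "LA", "NY"]})

# Create a new feature by combining existing ones (body mass index)
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Transformation: log transform a skewed numeric feature
df["log_weight"] = np.log(df["weight_kg"])

# Feature encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])
print(df)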
14. What is data normalization, and why is it important?
Data normalization is the process of transforming numerical data into a
standardized range. The objective of data normalization is to scale the different
features (variables) of a dataset onto a common scale, which makes it easier to
compare, analyze, and model the data. This is particularly important when
features have different units, scales, or ranges, because without normalization
features with larger ranges dominate the others, which can affect the performance
of various machine learning algorithms and statistical analyses.
Common normalization techniques are as follows:
Min-Max Scaling: Scales the data to a range between 0 and 1 using the
formula:
(x – min) / (max – min)
Z-Score Normalization (Standardization): Scales data to have a mean
of 0 and a standard deviation of 1 using the formula:
(x – mean) / standard_deviation
Robust Scaling: Scales data by removing the median and scaling to the
interquartile range (IQR) to handle outliers, using the formula:
(X – Median) / IQR
Unit Vector Scaling: Scales each data point to have a Euclidean norm
(length) (||X||) of 1 using the formula:
X / ||X||
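These techniques map directly onto scikit-learn's preprocessing scalers; a brief sketch follows (the single-column array with an outlier is an arbitrary example).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # one feature with an outlier

print(MinMaxScaler().fit_transform(X))    # min-max scaling to [0, 1]
print(StandardScaler().fit_transform(X))  # Z-score normalization (mean 0, std 1)
print(RobustScaler().fit_transform(X))    # robust scaling using median and IQR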
15. What are the main libraries you would use for data analysis in
Python?
For data analysis in Python, many great libraries are used due to their versatility,
functionality, and ease of use. Some of the most common libraries are as follows:
NumPy: A core Python library for numerical computations. It supports
arrays, matrices, and a variety of mathematical functions, making it a
building block for many other data analysis libraries.
Pandas: A well-known data manipulation and analysis library. It
provides data structures (such as DataFrames) that make it easy to
manipulate, filter, aggregate, and transform data. Pandas is essential
when working with structured data.
SciPy: SciPy is a scientific computing library. It offers a wide range of
statistical, mathematical, and scientific computing functions.
Matplotlib: Matplotlib is a library for plotting and visualization. It
provides a wide range of plotting functions, making it easy to create
beautiful and informative visualizations.
Seaborn: Seaborn is a library for statistical data visualization. It builds
on top of Matplotlib and provides a more user-friendly interface for
creating statistical plots.
Scikit-learn: A powerful machine learning library. It includes
classification, regression, clustering, dimensionality reduction, and
model evaluation tools. Scikit-learn is well-known for its consistent
API and simplicity of use.
Statsmodels: A statistical model estimation and interpretation library.
It covers a wide range of statistical models, such as linear models and
time series analysis.
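A short sketch showing how a few of these libraries typically work together; the tiny synthetic dataset is only for illustration.
import numpy as np
import pandas as pd
from scipy import stats

# NumPy builds the raw arrays, pandas wraps them in a DataFrame
x = np.arange(10)
df = pd.DataFrame({"x": x, "y": 2 * x + np.random.normal(0, 1, size=10)})

# pandas: quick summary statistics
print(df.describe())

# SciPy: Pearson correlation coefficient and its p-value
print(stats.pearsonr(df["x"], df["y"]))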
16. What’s the difference between structured and unstructured data?
Whether data is structured or unstructured depends on the format in which the data is stored.
Structured data is information that has been structured in a certain format, such as
a table or spreadsheet. This facilitates searching, sorting, and analyzing.
Unstructured data is information that is not arranged in a certain format. This
makes searching, sorting, and analyzing more complex.
The differences between the structured and unstructured data are as follows:
Structure of data: Structured data has a schema that is often rigid and organized into rows and columns, whereas unstructured data has no predefined relationships between data elements.
Examples: Structured data includes customer records, product inventories, and financial data; unstructured data includes text documents, images, audio, and video.
pandas Series vs. pandas DataFrame: a Series is similar to a single vector or column in a spreadsheet and is best suited for working with single-column data, while a DataFrame is similar to a full spreadsheet that can have multiple vectors or columns, offering more versatility for handling tabular data.
Boxplot: A boxplot is used to detect outliers in a dataset by visualizing the
distribution of the data.
Statistics Interview Questions and Answers for Data
Analyst
21. What is the difference between descriptive and inferential statistics?
Descriptive statistics and inferential statistics are the two main branches of
statistics.
Descriptive Statistics: Descriptive statistics is the branch of statistics,
which is used to summarize and describe the main characteristics of a
dataset. It provides a clear and concise summary of the data’s central
tendency, variability, and distribution. Descriptive statistics help to
understand the basic properties of data, identifying patterns and
structure of the dataset without making any generalizations beyond the
observed data. Descriptive statistics compute measures of central
tendency and dispersion and also create graphical representations of
data, such as histograms, bar charts, and pie charts to gain insight into a
dataset.
Descriptive statistics is used to answer the following questions:
o What is the mean salary of a data analyst?
o What is the range of income of data analysts?
o What is the distribution of monthly incomes of data
analysts?
Inferential Statistics: Inferential statistics is the branch of statistics
that is used to draw conclusions, make predictions, and generalize findings from
a sample to a larger population. It makes inferences and forms hypotheses
about the entire population based on the information gained from a
representative sample. Inferential statistics use hypothesis testing,
confidence intervals, and regression analysis to make inferences about a
population.
Inferential statistics is used to answer the following questions:
o Is there any difference in the monthly income of the
Data analyst and the Data Scientist?
o Is there any relationship between income and education
level?
o Can we predict someone’s salary based on their
experience?
22. What are measures of central tendency?
Measures of central tendency are statistical measures that represent the centre
of a data set. They reveal where the majority of the data points generally cluster.
The three most common measures of central tendency are:
Mean: The mean, also known as the average, is calculated by adding up
all the values in a dataset and then dividing by the total number of
values. It is sensitive to outliers since a single extreme number can have
a large impact on the mean.
Mean = (Sum of all values) / (Total number of values)
Median: The median is the middle value in a data set when it is
arranged in ascending or descending order. If there is an even number
of values, the median is the average of the two middle values.
Mode: The mode is the value that appears most frequently in a dataset.
A dataset can have no mode (if all values are unique) or multiple modes
(if multiple values have the same highest frequency). The mode is
useful for categorical data and discrete distributions.
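For example, the three measures can be computed with Python's built-in statistics module; the small data list is arbitrary, and note how the outlier pulls the mean but not the median.
import statistics

data = [2, 3, 3, 5, 7, 9, 100]  # contains an outlier (100)

print(statistics.mean(data))    # mean, pulled upward by the outlier
print(statistics.median(data))  # median = 5, robust to the outlier
print(statistics.mode(data))    # mode = 3, the most frequent value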
23. What are the Measures of dispersion?
Measures of dispersion , also known as measures of variability or spread, indicate
how much the values in a dataset deviate from the central tendency. They help in
quantifying how far the data points vary from the average value.
Some of the common Measures of dispersion are as follows:
Range: The range is the difference between the highest and lowest
values in a data set. It gives an idea of how much the data spreads from
the minimum to the maximum.
Variance: The variance is the average of the squared deviations of each
data point from the mean. It is a measure of how spread out the data is
around the mean.
Variance (σ²) = Σ(X − μ)² / N
Standard Deviation: The standard deviation is the square root of the
variance. It is a measure of how spread out the data is around the mean,
but it is expressed in the same units as the data itself.
Mean Absolute Deviation (MAD): MAD is the average of the absolute
differences between each data point and the mean. Unlike variance, it
doesn't involve squaring the differences, which makes it less sensitive to
extreme values and outliers than the variance or standard deviation.
Percentiles: Percentiles are statistical values that measure the relative
positions of values within a dataset. They are computed by arranging
the dataset in ascending order, from smallest to largest, and then
dividing it into 100 equal parts. In other words, a percentile tells you
what percentage of data points are below or equal to a specific value.
Percentiles are often used to understand the distribution of data and to
identify values that are above or below a certain threshold within a
dataset.
Interquartile Range (IQR): The interquartile range (IQR) is the range
of values ranging from the 25th percentile (first quartile) to the 75th
percentile (third quartile). It measures the spread of the middle 50% of
the data and is less affected by outliers.
Coefficient of Variation (CV): The coefficient of variation (CV) is a
measure of relative variability. It is the ratio of the standard deviation to
the mean, expressed as a percentage, and is used to compare the relative
variability between datasets with different units or scales.
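A quick NumPy sketch of several of these measures on an arbitrary sample:
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(data.max() - data.min())             # range
print(np.var(data))                        # population variance (4.0 here)
print(np.std(data))                        # population standard deviation (2.0 here)
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)                             # interquartile range (IQR)
print(np.std(data) / np.mean(data) * 100)  # coefficient of variation, in percent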
24. What is a probability distribution?
A probability distribution is a mathematical function that estimates the
probability of different possible outcomes or events occurring in a random
experiment or process. It is a mathematical representation of random phenomena
in terms of sample space and event probability, which helps us understand the
relative possibility of each outcome occurring.
There are two main types of probability distributions:
1. Discrete Probability Distribution: In a discrete probability
distribution, the random variable can only take on distinct, separate
values. Each value is associated with a probability. Examples of
discrete probability distributions include the binomial distribution, the
Poisson distribution, and the hypergeometric distribution.
2. Continuous Probability Distribution: In a continuous probability
distribution, the random variable can take any value within a certain
range. These distributions are described by probability density
functions (PDFs). Examples of continuous probability distributions
include the normal distribution, the exponential distribution, and the
uniform distribution.
25. What are normal distributions?
A normal distribution, also known as a Gaussian distribution, is a specific type of
probability distribution with a symmetric, bell-shaped curve. The data in a normal
distribution cluster around a central value, i.e. the mean, and the majority of the data
falls within one standard deviation of the mean. The curve gradually tapers off
towards both tails, showing that extreme values become increasingly rare. A normal
distribution with a mean equal to 0 and a standard deviation equal to 1 is known
as the standard normal distribution, and Z-scores are used to measure how many
standard deviations a particular data point is from the mean in the standard
normal distribution.
Normal distributions are a fundamental concept that supports many statistical
approaches and helps researchers understand the behaviour of data and variables
in a variety of scenarios.
26. What is the central limit theorem?
The Central Limit Theorem (CLT) is a fundamental concept in statistics. It
states that, under certain conditions, the distribution of sample means approaches
a normal distribution as the sample size rises, regardless of the original
population distribution. In other words, even if the population distribution is not
normal, when the sample size is large enough, the distribution of sample means
will tend to be normal.
The Central Limit Theorem has three main assumptions:
1. The samples must be independent. This means that the outcome of one
sample cannot affect the outcome of another sample.
2. The samples must be random. This means that each sample must be
drawn from the population in a way that gives all members of the
population an equal chance of being selected.
3. The sample size must be large enough. The CLT typically applies when
the sample size is greater than 30.
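A small NumPy simulation illustrates the idea: even though the population below is exponential (skewed), the means of repeated samples of size 50 are approximately normally distributed around the population mean. The scale value of 2.0 is arbitrary.
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal (skewed) population
population = rng.exponential(scale=2.0, size=100_000)

# Draw 1,000 random samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]

# The sample means cluster around the population mean (about 2.0)
print(np.mean(sample_means), np.std(sample_means))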
27. What are the null hypothesis and alternative hypotheses?
In statistics, the null and alternate hypotheses are two mutually exclusive
statements regarding a population parameter. A hypothesis test analyzes sample
data to determine whether to reject or fail to reject the null hypothesis. The null
and alternate hypotheses represent opposing statements or claims about a
population or a phenomenon under investigation.
Null Hypothesis (H0): The null hypothesis is a statement of
the status quo, asserting that there is no difference or effect,
unless there is strong evidence to the contrary.
Alternate Hypothesis (Ha or H1): The alternate hypothesis is
a statement that contradicts the status quo and asserts that there is a
difference or effect. It is the hypothesis the researcher tries to support.
28. What is a p-value, and what does it mean?
A p-value, which stands for “probability value,” is a statistical metric used in
hypothesis testing to measure the strength of evidence against a null hypothesis.
It measures the probability of obtaining the observed results (or more
extreme results) when the null hypothesis is assumed to be true. In layman's terms, the p-
value indicates whether the findings of a study or experiment are statistically
significant or whether they might have happened by chance.
The p-value is a number between 0 and 1, which is frequently stated as a decimal
or percentage. If the null hypothesis is true, it indicates the probability of
observing the data (or more extreme data).
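For example, a two-sample t-test in SciPy returns a p-value directly; the two simulated groups below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# H0: the two group means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)  # a small p-value is strong evidence against H0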
29. What is the significance level?
The significance level, often denoted as α (alpha), is a critical parameter in
hypothesis testing and statistical analysis. It defines the threshold for determining
whether the results of a statistical test are statistically significant. In other words,
it sets the standard for deciding when to reject the null hypothesis (H0) in favor
of the alternative hypothesis (Ha).
If the p-value is less than the significance level, we reject the null hypothesis and
conclude that there is a statistically significant difference between the groups.
If p-value ≤ α: Reject the null hypothesis. This indicates that the results
are statistically significant, and there is evidence to support the
alternative hypothesis.
If p-value > α: Fail to reject the null hypothesis. This means that the
results are not statistically significant, and there is insufficient evidence
to support the alternative hypothesis.
The choice of a significance level involves a trade-off between Type I and Type
II errors. A lower significance level (e.g., α = 0.01) decreases the risk of Type I
errors while increasing the chance of Type II errors (failing to identify a real
effect). A higher significance level (e.g., α = 0.10), on the other hand, increases
the probability of Type I errors while decreasing the chance of Type II errors.
30. Describe Type I and Type II errors in hypothesis testing.
In hypothesis testing, when deciding between the null hypothesis (H0) and the
alternative hypothesis (Ha), two types of errors may occur. These errors are
known as Type I and Type II errors, and they are important considerations in
statistical analysis.
Type I error (False Positive, α): A Type I error occurs when the null
hypothesis is rejected when it is actually true. This is also referred to as a false
positive. The probability of committing a Type I error is denoted by α
(alpha) and is also known as the significance level. A lower
significance level (e.g., α = 0.05) reduces the chance of Type I errors
while increasing the risk of Type II errors.
For example, a Type I error would occur if we estimated that a new
medicine was successful when it was not.
o Type I Error (False Positive, α): Rejecting a true
null hypothesis.
Type II Error (False Negative, β): A Type II error occurs when a
researcher fails to reject the null hypothesis when it is actually false.
This is also referred to as a false negative. The probability of committing a
Type II error is denoted by β (beta).
For example, a Type II error would occur if we estimated that a new
medicine was not effective when it is actually effective.
o Type II Error (False Negative, β): Failing to reject a
false null hypothesis.
31. What is a confidence interval, and how is it related to point
estimates?
A confidence interval is a statistical concept used to express the uncertainty
associated with estimating a population parameter (such as a population mean or
proportion) from a sample. It is a range of values that is likely to contain the true
value of a population parameter, along with a level of confidence in that
statement.
Point estimate: A point estimate is a single value that is used to estimate a
population parameter based on a sample. For example, the sample mean
(x̄) is a point estimate of the population mean (μ). The point estimate is
typically the sample mean or the sample proportion.
Confidence interval: A confidence interval, on the other hand, is a
range of values built around a point estimate to account for the
uncertainty in the estimate. It is typically expressed as an interval with
an associated confidence level (e.g., 95% confidence interval). The
degree of confidence or confidence level shows the probability that the
interval contains the true population parameter.
The relationship between point estimates and confidence intervals can be
summarized as follows:
A point estimate provides a single value as the best guess for a
population parameter based on sample data.
A confidence interval provides a range of values around the point
estimate, indicating the range of likely values for the population
parameter.
The confidence level associated with the interval reflects the level of
confidence that the true parameter value falls within the interval.
For example, a 95% confidence interval indicates that you are 95% confident
that the real population parameter falls inside the interval. A 95% confidence
interval for the population mean (μ) can be expressed as:
(x̄ – margin of error, x̄ + margin of error)
where x̄ is the point estimate (sample mean), and the margin of error is calculated
using the standard deviation of the sample and the confidence level.
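As a sketch, a 95% confidence interval for a mean can be computed with SciPy's t-distribution helpers; the sample values are made up.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])

mean = sample.mean()                 # point estimate of the population mean
sem = stats.sem(sample)              # standard error of the mean

# 95% confidence interval around the point estimate
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(mean, (ci_low, ci_high))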
32. What is ANOVA in Statistics?
ANOVA, or Analysis of Variance, is a statistical technique used for analyzing
and comparing the means of two or more groups or populations to determine
whether there are statistically significant differences between them or not. It is a
parametric statistical test, which means it assumes the data is normally
distributed and the variances of the groups are equal. It helps researchers in
determining the impact of one or more categorical independent variables (factors)
on a continuous dependent variable.
ANOVA works by partitioning the total variance in the data into two
components:
Between-group variance: It analyzes the difference in means between
the different groups or treatment levels being compared.
Within-group variance: It analyzes the variance within each
individual group or treatment level.
Depending on the investigation’s design and the number of independent
variables, ANOVA has numerous varieties:
One-Way ANOVA: Compares the means of three or more independent
groups or levels of a single categorical variable. For Example: One-way
ANOVA can be used to compare the average age of employees among
the three different teams in a company.
Two-Way ANOVA: Compares the means of two or more independent
groups while taking into account the impact of two independent
categorical variables (factors). For example, two-way ANOVA can be used
to compare the average age of employees among the three different
teams in a company, while also taking into account the gender of the
employees.
Multivariate Analysis of Variance (MANOVA): Compare the means
of multiple dependent variables. For example, MANOVA can be used
to compare the average age, average salary, and average experience of
employees among the three different teams in a company.
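A one-way ANOVA on the employee-age example could be sketched with SciPy as follows; the three lists of ages are invented.
from scipy import stats

# Hypothetical ages of employees in three teams
team_a = [25, 28, 30, 32, 27]
team_b = [31, 35, 29, 33, 36]
team_c = [40, 38, 42, 39, 41]

# H0: the mean age is the same in all three teams
f_stat, p_value = stats.f_oneway(team_a, team_b, team_c)
print(f_stat, p_value)  # a small p-value suggests at least one mean differs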
33. What is a correlation?
Correlation is a statistical term that analyzes the degree of a linear relationship
between two or more variables. It estimates how effectively changes in one
variable predict or explain changes in another. Correlation is often used to assess
the strength and direction of associations between variables in various fields,
including statistics and economics.
The correlation between two variables is represented by the correlation coefficient,
denoted as “r”. The value of “r” can range between -1 and +1, reflecting the
strength of the relationship:
Positive correlation (r > 0): As one variable increases, the other tends
to increase. The greater the positive correlation, the closer “r” is to +1.
Negative correlation (r < 0): As one variable rises, the other tends to
fall. The closer “r” is to -1, the greater the negative correlation.
No correlation (r = 0): There is little or no linear relationship between
the variables.
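For instance, the Pearson correlation coefficient r can be computed with NumPy; the hours/score values are illustrative only.
import numpy as np

hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]

# Pearson correlation coefficient between the two variables
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(r)  # close to +1, indicating a strong positive linear relationship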
34. What are the differences between Z-test, T-test and F-test?
The Z-test, t-test, and F-test are statistical hypothesis tests that are employed in a
variety of contexts and for a variety of objectives.
Z-test: The Z-test is performed when the population standard deviation
is known. It is a parametric test, which means that it makes certain
assumptions about the data, such as that the data is normally
distributed. The Z-test is most accurate when the sample size is large.
T-test: The T-test is performed when the population standard deviation
is unknown. It is also a parametric test, but unlike the Z-test, it is less
sensitive to violations of the normality assumption. The T-test is typically
used when the sample size is small (N < 30).
F-test: The F-test is performed to compare the variances of two or more
groups. It assumes that the populations being compared follow a normal
distribution. When the sample sizes of the groups are equal, the F-test
is most accurate.
The key differences between the Z-test, T-test, and F-test are as follows:
Z-Test: used when the sample size is large (N > 30) and the population standard deviation is known.
T-Test: used when the sample size is small (N < 30) or the population standard deviation is unknown.
F-Test: used to test and compare the variances of two or more groups.
57. What are the dashboard, worksheet, Story, and Workbook in Tableau?
Tableau is a robust data visualization and business intelligence solution that
includes a variety of components for producing, organizing, and sharing data-
driven insights. Here’s a rundown of some of Tableau’s primary components:
Dashboard: A dashboard is a collection of views (worksheets) arranged
on a single page, designed to provide an interactive and holistic view of
data. They include charts, maps, tables, and other web content.
Dashboards combine different visualizations into a single interface to
allow users to comprehensively display and understand data. They are
employed in the production of interactive reports and the provision of
quick insights.
Dashboards support actions and interactivity, enabling users to
filter and highlight data dynamically. Dashboard behaviour can be
modified with parameters and quick filters.
Worksheet: A worksheet serves as the fundamental building element
for creating data visualizations. To build tables, graphs, and charts, drag
and drop fields onto the sheet or canvas. They are used to design
individual visualizations and we can create various types of charts,
apply filters, and customize formatting within a worksheet.
Worksheets offer a wide range of visualization options, including bar
charts, line charts, scatter plots, etc. It also allows you to use reference
lines, blend data and create calculated fields.
Story: A story is a sequence or narrative created by combining sheets
into a logical flow. Each story point represents a step in the narrative.
Stories are used to systematically lead viewers through a set of
visualizations or insights. They are useful for telling data-driven stories
or presenting data-driven narratives.
Stories allow you to add text descriptions, annotations, and captions to
every story point. Users can navigate through the story interactively.
Workbook: It is the highest-level container in Tableau. It is a file that
has the capacity to hold a number of worksheets, dashboards, and
stories. The whole tableau project, including data connections and
visuals, is stored in workbooks. They are the primary files used for
creating, saving and sharing tableau projects. They store all the
components required for data analysis and visualization.
Multiple worksheets, dashboards, and stories can be organized in
workbooks. At the workbook level, you can set up data source
connections, define parameters, and build calculated fields.
58. Name the different products of Tableau with their significance.
The different products of Tableau are as follows :
Tableau Desktop: It is the primary authoring and publishing tool. It
allows data professionals to connect to various data sources, create
interactive and shareable visualizations, and develop dashboards and
reports for data analysis. Users can use the drag-and-drop interface to
generate insights and explore data.
Tableau Server: This is an enterprise-level platform that
enables secure internal collaboration and sharing of Tableau content. It
manages access, centralizes data sources, and maintains data security. It
is appropriate for bigger businesses with numerous users who require
access to Tableau content.
Tableau Online: It is an online, cloud-hosted version of Tableau. In a scalable and
adaptable cloud environment, it enables users to publish, share, and
collaborate on Tableau content. It is suited for businesses searching for cloud-
based analytics solutions without managing their own infrastructure.
Tableau Public: It is a free version of Tableau that enables users to
create, publish, and share dashboards and visualizations publicly on the
web. It is perfect for data enthusiasts and educators who want to share
their data stories with a larger audience.
Tableau Prep: It is a tool for data preparation that makes it easier and
faster to clean, shape, and combine data from diverse sources. Data
specialists can save time and effort because it makes sure that the data
is well-structured and ready for analysis.
Tableau Mobile: A mobile application that extends tableau’s
capabilities to smartphones and tablets. By allowing users to access and
interact with tableau content while on the go, it ensures data
accessibility and decision-making flexibility.
Tableau Reader: It is a free desktop application that enables users to
view and interact with tableau workbooks and dashboards shared by the
tableau desktop users. This tool is useful for those who require access to
and exploration of tableau material without a tableau desktop license.
Tableau Prep Builder: It is an advanced data preparation tool designed
for data professionals. In order to simplify complicated data preparation
operations, it provides more comprehensive data cleaning,
transformation, and automation tools.
59. What is the difference between joining and blending in Tableau?
In Tableau, joining and blending are ways of combining data from various tables
or data sources. However, they are used in different contexts and have some
major differences:
Joining: a combined table is created by combining the two tables.
Blending: data from several sources is combined to produce a momentary, in-memory blend for visualization needs.