DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
ACADEMIC YEAR 2025-2026 (ODD SEMESTER)
Question Bank
AD3301- DATA EXPLORATION AND VISUALIZATION
Regulation 2021
Prepared By, Verified By,
Ms.T.Thenmozhi Dr.R.Deepalakshmi
Assistant Professor, AD Professor&Head-AD
UNIT I EXPLORATORY DATA ANALYSIS
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data – Comparing EDA
with classical and Bayesian analysis – Software tools for EDA - Visual Aids for EDA- Data transformation
techniques-merging database, reshaping and pivoting, Transformation techniques - Grouping Datasets - data
aggregation – Pivot tables and cross-tabulations.
1. What is data?
Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even
descriptions of things.
2. What is a dataset? Give example.
A dataset contains many observations about a particular object.
For instance, a dataset about patients in a hospital can contain many observations.
A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and
gender. Each of these features that describe a patient is a variable. Each observation can have a specific value for each
of these variables.
For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = [email protected]
Weight = 10
Gender = Female
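As an illustrative sketch, the patient observation above can be held as one row of a pandas DataFrame, with each variable as a column (column names here are illustrative; the redacted email is omitted):

```python
import pandas as pd

# One observation (a patient) described by its variables.
patients = pd.DataFrame([{
    "PATIENT_ID": 1001,
    "Name": "Yoshmi Mukhiya",
    "Address": "Mannsverk 61, 5094, Bergen, Norway",
    "Date_of_birth": "10th July 2018",
    "Weight": 10,
    "Gender": "Female",
}])

print(patients.shape)  # one observation, six variables
```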
3. What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover patterns, spot anomalies,
test hypotheses, and check assumptions using statistical measures.
4. What does EDA mean in data? (Nov/Dec 2023)
EDA stands for Exploratory Data Analysis. It is the process of analyzing data sets by summarizing their
key characteristics using visual tools like charts and graphs to identify patterns, trends, and anomalies.
5. State the purpose of data aggregation. (Nov/Dec 2023)
Data aggregation is used to combine data from multiple records into a summary form (like sum, average,
count) to enable easier and more efficient data analysis and reporting.
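A minimal sketch of aggregation in pandas (the department/salary data below is invented purely for illustration):

```python
import pandas as pd

# Combine many records into per-group summaries (sum, average, count).
df = pd.DataFrame({
    "dept":   ["AD", "AD", "CSE", "CSE"],
    "salary": [100,  200,  150,   250],
})

summary = df.groupby("dept")["salary"].agg(["sum", "mean", "count"])
print(summary)
```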
6. What is meant by EDA? (Nov/Dec 2022)
EDA (Exploratory Data Analysis) refers to the process of analyzing datasets to summarize their main
characteristics using statistics and visualizations such as histograms, boxplots, and scatter plots.
7. How do you get cross tabulation? (Nov/Dec 2022)
Cross tabulation (or contingency table) is created using two or more categorical variables to show the
frequency distribution. In Python, it can be done using pd.crosstab() from the Pandas library.
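A short sketch of pd.crosstab() on two invented categorical variables:

```python
import pandas as pd

# Frequency distribution of one categorical variable against another.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "smoker": ["yes", "no", "no", "no", "yes"],
})

ct = pd.crosstab(df["gender"], df["smoker"])
print(ct)
```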
8. Mention the key responsibilities of a data analyst. (April/May 2024)
A data analyst is responsible for gathering, organizing, and interpreting data to uncover insights that support
business decisions. They clean data, perform statistical analyses, generate dashboards, and report trends to
stakeholders.
9. Name some of the best tools used for data analysis and data visualization. ( April/May 2024)
Popular tools for data analysis include Python (with Pandas, NumPy), R, and Excel. For data visualization,
commonly used tools are Tableau, Power BI, Matplotlib, and Seaborn, which help in presenting insights in a
visual format.
10. Write short notes on the significance of EDA.
It is practically impossible to make sense of datasets containing more than a handful of data points without the help of
computer programs.
Exploratory data analysis is key, and usually the first exercise in data mining.
It allows us to visualize data to understand it as well as to create hypotheses for further analysis.
The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining
project.
EDA actually reveals the ground truth about the content without making any underlying assumptions.
11. List the expert tools for exploratory analysis and mention their purpose.
Python provides expert tools for exploratory analysis:
pandas for summarization
scipy for statistical analysis
matplotlib and plotly for visualizations
12. List the common tasks in the data processing stage.
The common tasks in the data processing stage include
exporting the dataset
placing them under the right tables
structuring them, and
exporting them in the correct format
13. Write short notes on nominal scales.
These are used for labeling variables without any quantitative value. They are generally
referred to as labels.
These scales are mutually exclusive and do not carry any numerical importance.
Nominal scales are considered qualitative scales, and the measurements that are taken using qualitative scales are considered
qualitative data. No form of arithmetic calculation can be made on nominal measures.
Examples:
The languages that are spoken in a particular country
Biological species
14. Give an example of an ordinal scale using the Likert scale.
Consider a question: “WordPress is making content managers' lives easier. How do you feel about this statement?” The
answer to the question is scaled down to five different ordinal values, Strongly Agree, Agree, Neutral, Disagree, and
Strongly Disagree.
15. Write short notes on interval scales.
In interval scales, both the order and exact differences between the values are significant.
Interval scales are widely used in statistics, for example, in the measure of central tendencies such as mean, median,
mode, and standard deviations.
Examples include location in Cartesian coordinates and direction measured in degrees from magnetic north.
16. Give short notes on ratio scales.
Ratio scales contain order, exact values, and absolute zero. They
are used in descriptive and inferential statistics.
These scales provide numerous possibilities for statistical analysis.
Mathematical operations, the measure of central tendencies, and the measure of dispersion and coefficient of variation
can also be computed from such scales.
Examples: the measure of energy, mass, length, duration, electrical energy, plan angle, and volume.
17. Compare EDA with classical data analysis.
Classical data analysis: The problem definition and data collection step are followed by model development, which is
followed by analysis and result communication.
Exploratory data analysis approach: It follows the same approach as classical data analysis except for the model
imposition, and the data analysis steps are swapped. The main focus is on the data, its structure, outliers, models, and
visualizations.
18. List the software tools available for EDA.
Python - widely used in data analysis, data mining, and data science
R programming language - widely utilized in statistical computation and graphical data analysis
Weka - involves several EDA tools and algorithms
KNIME - an open-source tool for data analysis and is based on Eclipse
19. Write short notes on the pivot table.
The pandas.pivot_table() function creates a spreadsheet-style pivot table as a dataframe.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of
the resulting dataframe.
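A small sketch of pandas.pivot_table() on invented sales data, summarizing the mean amount per region and quarter:

```python
import pandas as pd

# Spreadsheet-style pivot table: rows = region, columns = quarter.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 90, 110],
})

pt = pd.pivot_table(sales, values="amount",
                    index="region", columns="quarter", aggfunc="mean")
print(pt)
```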
20. Write briefly about types of joins.
The inner join takes the intersection from two or more dataframes.
The outer join takes the union from two or more dataframes.
The left join uses the keys from the left-hand dataframe only.
The right join uses the keys from the right-hand dataframe only.
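The four join types can be sketched with pd.merge() on two invented frames sharing a key column (only keys 2 and 3 appear in both):

```python
import pandas as pd

left  = pd.DataFrame({"key": [1, 2, 3], "lval": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "rval": ["x", "y", "z"]})

inner = pd.merge(left, right, on="key", how="inner")  # intersection: keys 2, 3
outer = pd.merge(left, right, on="key", how="outer")  # union: keys 1-4
lft   = pd.merge(left, right, on="key", how="left")   # left keys: 1, 2, 3
rgt   = pd.merge(left, right, on="key", how="right")  # right keys: 2, 3, 4

print(len(inner), len(outer), len(lft), len(rgt))  # 2 4 3 3
```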
PART B & C
1. Provide an explanation of the various EDA tools that are used for data analysis. (Nov/Dec 2023)
2. What is cross-tabulation and PivotTable? How to Build Pivot Table and Cross Tab Reports? (Nov/Dec
2023)
3. What is the primary purpose of EDA? What are the differences between EDA with classical and
Bayesian analysis? Discuss it in detail. (Nov/Dec 2022)
4. Explain various transformation techniques in EDA. (Nov/Dec 2022)
5. What are the tools used for EDA? Give a case study on applying this in a real business scenario.
(Nov/Dec 2022) Part C
6. (i) Discuss about Descriptive Statistics in exploratory analysis. (7) (April/May 2024)
(ii) Explain in detail about data transformation Techniques. (6)
7. (i) Explain in detail about Comparative Statistics in Exploratory analysis. (6) (April/May 2024)
(ii) Discuss in detail about the practical use of Pivot Table in data science with suitable example. (7)
8. Illustrate the Application of Pivot Tables
9. Compare EDA with Bayesian Analysis:
10. Discuss Data transformation in EDA
UNIT II VISUALIZING USING MATPLOTLIB
Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density and contour plots – Histograms
– legends – colors – subplots – text and annotation – customization –three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn.
1. Write short notes on plt.show() command.
The plt.show() command starts an event loop, looks for all currently active figure objects, and opens one or more
interactive windows that display the figures.
This command should be used only once per Python session, and is most often seen at the very end of the script. Multiple
show() commands can lead to unpredictable backend-dependent behavior, and should be avoided.
2. Give an example to create line plots.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--')
plt.show()
3. What is Matplotlib used for? (Nov/Dec 2023)
Matplotlib is a data visualization library in Python used to create various static, animated, and interactive
plots such as line graphs, bar charts, and scatter plots.
4. Differentiate plot vs subplot. (Nov/Dec 2023)
A plot displays a single graph or chart, whereas a subplot is a grid of multiple plots shown within one
figure window, allowing for comparison of multiple graphs.
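The difference can be sketched with plt.subplots(), which puts a grid of axes in one figure (the Agg backend is used here only so no display is required):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# One figure containing a 1x2 grid of subplots for side-by-side comparison.
fig, axes = plt.subplots(1, 2)
axes[0].plot(x, np.sin(x))
axes[1].plot(x, np.cos(x))
fig.savefig("subplots.png")
print(len(fig.axes))  # 2
```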
5. What is the difference between MATLAB and Matplotlib? (Nov/Dec 2022)
MATLAB is a commercial programming platform used for numerical computing and simulations, while
Matplotlib is a free, open-source Python library used specifically for creating static, animated, and
interactive visualizations.
6. Is a histogram always a bar chart? Justify with your answer. (Nov/Dec 2022)
No, a histogram is not the same as a bar chart. A histogram is used for continuous data, with no gaps
between bars, while a bar chart is used for categorical data, and bars are usually separated.
7. List the software and hardware components required for data visualization. (April/May 2024)
Software: Visualization tools like Tableau, Power BI, Python (Matplotlib, Plotly), and Excel. Hardware: A
computer with a multi-core processor, minimum 8–16 GB RAM, high-resolution monitor, SSD for fast
access, and a dedicated GPU (optional) for rendering complex visuals.
8. Draw and label a rough contour plot of the joint probability density function. When P = -0.4, ρ =
-0.4. (April/May 2024)
A contour plot for negatively correlated variables (ρ = -0.4) would show elliptical contours sloping from the top
left to bottom right. The plot should be labeled with X and Y axes and curved lines representing equal
probability density levels, indicating moderate negative correlation
9. Write code to do the following:
Plot the following data on a line chart:
Runs in Overs 10 20
MI 110 224
RCB 85 210
import matplotlib.pyplot as plt
overs = [10, 20]
mi = [110, 224]
plt.plot(overs, mi, 'blue')
rcb = [85, 210]
plt.plot(overs, rcb, 'red')
plt.xlabel('Overs')
plt.ylabel('Runs')
plt.title('Match Summary')
plt.show()
10. Mention the use of plt.axis() method. How to set the axis limits with plt.axis() method?
The plt.axis() method is used to set the x and y limits with a single call, by passing a list that
specifies [xmin, xmax, ymin, ymax]. Example:
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5])
11. What are sharex and sharey?
By specifying sharex and sharey, we will automatically remove inner labels on the grid to make the plot cleaner. The
resulting grid of axes instances is returned within a NumPy array, allowing for convenient specification of the desired
axes using standard array indexing notation.
Example:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
12. What do you mean by data visualization technique?
The data visualization technique refers to the graphical or pictorial or visual representation of data. This can be
achieved by charts, graphs, diagrams, or maps.
13. How to plot two subplots using a MATLAB-style interface?
plt.figure()  # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1)  # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))
14. What are the pre-defined transforms when considering the placement of text on a figure?
There are three pre-defined transforms:
ax.transData: Transform associated with data coordinates
ax.transAxes: Transform associated with the axes (in units of axes dimensions)
fig.transFigure: Transform associated with the figure (in units of figure dimensions)
15. What are the ways of importing matplotlib?
Matplotlib can be imported in the following two ways:
Using alias name: import matplotlib.pyplot as plt
Without alias name: import matplotlib.pyplot
16. What are transData and transAxes coordinates?
The transData coordinates give the usual data coordinates associated with the x and y-axis labels.
The transAxes coordinates give the location from the bottom-left corner of the axes as a fraction of the axes' size.
17. How do you create a contour plot?
A contour plot can be created with the plt.contour function.
It takes three arguments:
a grid of x values
a grid of y values
a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the contour levels.
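A minimal sketch of the three-argument call, with the x and y grids built by np.meshgrid and an invented z function:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)    # grids of x and y positions
Z = np.sin(X) * np.cos(Y)   # grid of z values, shown as contour levels

cs = plt.contour(X, Y, Z)
plt.savefig("contour.png")
print(Z.shape)  # (40, 50)
```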
18. Write the function for listing five plot styles randomly.
plt.style.available[:5]
Output:
['fivethirtyeight', 'seaborn-pastel', 'seaborn-whitegrid', 'ggplot', 'grayscale']
19. Write a code to use a gray background and draw solid white grid lines.
(i) use a gray background
ax = plt.axes(facecolor='#E6E6E6')
ax.set_axisbelow(True)
(ii) draw solid white grid lines
plt.grid(color='w', linestyle='solid')
20. Write short notes on ggplot.
The ggplot package in the R language is a very popular visualization tool. Matplotlib includes a ggplot style that mimics this look, enabled with plt.style.use('ggplot').
Part B & C
1. Why matplotlib is used for data visualization? Which module of matplotlib is used for data
visualization? (Nov/Dec 2023)
2. How do you Visualize a Three-Dimensional Function in python? Illustrate with a code. (Nov/Dec
2023)
3. How to over plot a line on a scatter plot in Python? Illustrate with code. (Nov/Dec 2022)
4. Discuss with how Seaborn helps to visualize the statistical relationships. Illustrate with code and
example(Nov/Dec 2022)
5. Describe the various distributions module of Seaborn for visualization. Consider a sample application
to illustrate. (Nov/Dec 2023) part c
6. (i) Define line plot. With an example, explain how to create a line plot to visualize the trend.
(ii) The following table gives the lifetime of 400 neon lamps. Draw the histogram for the below data. (7)
(April/May 2024)
Lifetime (in hours) Number of lamps
300–400 14
400–500 56
500–600 60
600–700 86
700–800 74
800–900 62
900–1000 48
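One way to sketch question 6(ii): since the data is already grouped, pass the class midpoints to plt.hist with the frequencies as weights and the class edges as bins, so adjacent bars touch as a histogram requires:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window needed
import matplotlib.pyplot as plt

edges = [300, 400, 500, 600, 700, 800, 900, 1000]  # class boundaries
freqs = [14, 56, 60, 86, 74, 62, 48]               # number of lamps
mids  = [(a + b) / 2 for a, b in zip(edges[:-1], edges[1:])]

counts, _, _ = plt.hist(mids, bins=edges, weights=freqs, edgecolor="black")
plt.xlabel("Lifetime (in hours)")
plt.ylabel("Number of lamps")
plt.savefig("lamps.png")
print(sum(counts))  # 400 lamps in total
```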
7. (i)Explain in detail about 3D Data Visualization, its components and its working flow with suitable
example. (6)
(ii) Discuss in detail about text and annotation. (7) ( April/May 2024)
8. Discuss in detail about data cleaning ( Nov/Dec 2022) part c
9. Explain the Process of Creating a Simple Line Plot in Matplotlib?
10. Illustrate the Use of Scatter Plots in Visualizing Relationships?
UNIT III UNIVARIATE ANALYSIS
Introduction to Single variable: Distributions and Variables - Numerical Summaries of Level and Spread - Scaling and
Standardizing – Inequality - Smoothing Time Series.
1. What is sampling?
Sampling is a method that allows us to get information about the population based on the statistics from a subset of the
population (sample or case), without having to investigate every individual.
2. What are the two basic units of data analysis?
Cases and variables are the two organizing concepts that are considered the basis of data analysis. The
cases are the samples about which information is collected.
The information is collected on certain features of all the cases. These
features are the variables that vary across different cases.
Example: In a survey of individuals, their income, sex, and age are some of the variables that might be recorded.
3. List the three main types of univariate analyses. (Nov/Dec 2023)
The three main types are:
a) Central Tendency (mean, median, mode),
b) Dispersion (range, variance, standard deviation),
c) Distribution Shape (skewness, kurtosis, histograms).
4. What is the purpose of smoothing a time series data? (Nov/Dec 2023)
Smoothing helps to remove short-term fluctuations or noise from time series data, making the overall
trend more visible and easier to analyze for forecasting.
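A small sketch of smoothing with a centred 3-point moving average in pandas (the series is invented; note how the spike of 30 is damped):

```python
import pandas as pd

# Noisy series with one short-term spike at position 4.
s = pd.Series([10, 12, 9, 11, 30, 10, 12, 11, 9, 10])

# Centred rolling mean: each point becomes the average of itself
# and its two neighbours; the endpoints are left as NaN.
smooth = s.rolling(window=3, center=True).mean()
print(smooth.round(2).tolist())
```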
5. What is the main purpose of univariate analysis? (Nov/Dec 2022)
Univariate analysis aims to describe and summarize a single variable by understanding its distribution,
central tendency, and dispersion.
6. What is the mathematical mean of the following numbers? 10, 6, 4, 4, 6, 4 (Nov/Dec 2022)
Mean = (10 + 6 + 4 + 4 + 6 + 4) / 6 = 34 / 6 = 5.67
7. Difference between normalized scaling and standardized scaling. (April/May 2024)
Normalized Scaling converts data values into a range of [0, 1] using min-max normalization.
Standardized Scaling transforms data to have a mean of 0 and standard deviation of 1 using the z-score
method. Normalization preserves the shape, while standardization adjusts data to a common scale for
comparison.
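The two scalings can be sketched on a toy array (population standard deviation is used here for the z-score):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: rescale into [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1.
standardized = (x - x.mean()) / x.std()

print(normalized)  # [0.   0.25 0.5  0.75 1.  ]
print(round(standardized.mean(), 10), round(standardized.std(), 10))
```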
8. Illustrate important steps to be followed in preparing a base map. (April/May 2024)
Steps include:
● Defining the area of interest
● Selecting an appropriate map projection
● Collecting geospatial data (like roads, rivers, and boundaries)
● Creating layers for different map features
● Adding legends, scale bars, orientation (north arrow), and title
● Ensuring map accuracy and readability
9. When to prefer pie charts?
Pie charts are to be preferred when there are only a few categories and when the sizes of the categories are very
different.
10. What are the two types of distributions in histograms?
The two types of distribution in histograms are unimodal and bimodal, depending on the frequency of the occurring
values.
11. What are the four important aspects of any distribution inspected by histograms?
Level: What are typical values in the distribution?
Spread: How widely dispersed are the values? Do they differ very much from one another? Shape: Is
the distribution flat or peaked? Symmetrical or skewed?
Outliers: Are there any particularly unusual values?
12. What is SPSS?
SPSS is an acronym for Statistical Package for the Social Sciences.
SPSS is a very useful computer package that includes hundreds of different procedures for displaying and analyzing
data.
13. What is unimodal distribution?
Unimodal is a single-peaked distribution in which one value occurs with greater frequency than any other value. It is a
distribution with a single clearly visible peak or a single most frequent value.
The distribution's shape in the unimodal distribution has only one main high point.
14. What is bimodal distribution?
Bimodal distribution is a distribution where two values occur with the greatest frequency which means two frequent
values are separated by a gap in between.
This type of distribution has two fairly equal high points (or the modes).
The two modes are usually separated by a big gap in between, and each mode contains more of the data than the values around it.
15. What are histograms?
Histograms are charts that are similar to bar charts that can be used to display interval-level variables grouped into
categories.
They are constructed in exactly the same way as bar charts except that the ordering of the categories is fixed.
16. What are the three main windows of SPSS?
SPSS has three main windows:
o The Data Editor window
o The Output window
o The Syntax window
17. What is data in terms of summarization?
Any data value (such as a measurement of hours worked or income earned) is composed of two components: a fitted
part and a residual part.
This can be expressed as an equation:
Data = Fit + Residual
18. What are the measures of central tendency?
Mean, median, and mode are the measures of central tendencies. Mean :
the sum of all values divided by the total number of values.
Median : the middle number in an ordered dataset.
Mode : the most frequent value.
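The three measures can be sketched with Python's standard library on the sample from question 6 of this unit:

```python
import statistics

data = [4, 4, 4, 6, 6, 10]

print(statistics.mean(data))    # sum of values / count = 34 / 6 ≈ 5.67
print(statistics.median(data))  # middle of the ordered data -> 5.0
print(statistics.mode(data))    # most frequent value -> 4
```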
19. What is the general principle in comparing different measures?
The general principle in comparing different measures is: one measure is more resistant than another if it tends to be
less influenced by a change in any small part of the data.
20. How do we decide between the median and mean to summarize a typical value, or between the range,
the midspread, and the standard deviation to summarize the spread?
Locational statistics such as the range, median, and midspread generally fare better than the more abstract means and
standard deviations.
Means and standard deviations are more influenced by unusual data values than medians and midspreads.
Means and standard deviations are usually more influenced by a change in any individual data point than the medians
and midspreads.
Part B & C
1. What is scaling and standardization? When and why to standardize a variable? Illustrate with suitable
example. (Nov/Dec 2023)
2. Explain the Smoothing Techniques for time series data with suitable example. (Nov/Dec 2023)
3. Explain the 10 Essential Numerical Summaries in Statistics with example. (Nov/Dec 2022)
4. How, When, and Why Should You Normalize / Standardize / Rescale Your Data? (Nov/Dec 2022)
5. (i) Does universe frequency distribution have variable? Justify in detail. (7) (April /May 2024)
(ii) Explain in detail about scaling and standardizing. (6)
6. Imagine you are working on a time series dataset. Your manager has asked you to build a highly
accurate model. You started to build two types of models which are given below. (13) (April/May 2024)
Model 1 : Decision Tree model
Model 2 : Time series regression model
At the end of evaluation of these two models, you found that model 2 is better than model 1. What could be the
possible reason for your inference?
7. Explain the Purpose of Time Series Smoothing Techniques?
8. Provide a step-by-step illustration of calculating the Coefficient of Variation (CV)?
9. Explain how it offers a relative measure of variability and its interpretation in different scenarios?
10. Describe the Importance of Outlier Detection in Data Analysis?
UNIT IV BIVARIATE ANALYSIS
Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables -Handling Several
Batches - Scatterplots and Resistant Lines – Transformations.
1. Write briefly about the contingency table.
A contingency table shows the distribution of each variable conditional upon each category of the other.
The categories of one of the variables form the rows, and the categories of the other variable form the columns. Each
individual case is then tallied in the appropriate cell depending on its value on both variables.
The number of cases in each cell is called the cell frequency.
2. Name the two main types of statistical testing in bivariate analysis. (Nov/Dec 2023)
The two main types are:
a) Correlation analysis – measures strength and direction of relationship,
b) Hypothesis testing – tests significance (e.g., t-test, ANOVA, chi-square).
3. Is bivariate qualitative or quantitative? (Nov/Dec 2023)
Bivariate analysis can be qualitative, quantitative, or mixed, depending on whether the two variables
involved are categorical, numerical, or both.
4. What are the three common methods for performing bivariate analysis? (Nov/Dec 2022)
Three common methods are:
a) Scatter plot,
b) Correlation analysis,
c) Chi-square test (for categorical data).
5. Outline the difference between univariate and bivariate data. (Nov/Dec 2022)
Univariate data involves analysis of a single variable, while bivariate data examines the relationship
between two variables, often to identify correlation or dependence.
6. The diagram represents the sales of Superclene toothpaste over the last few years. Give a
reason why it is misleading. (April/May 2024)
The graph is misleading because the Y-axis does not start from zero, which artificially inflates the visual
difference in sales. This exaggerates small changes and may lead viewers to interpret a dramatic growth that
isn’t significant.
7. How do you find the correlation of a scatter plot? (April/May 2024)
To determine correlation from a scatter plot, observe the pattern of points:
● A positive linear trend indicates positive correlation.
● A downward trend indicates negative correlation.
● A random spread suggests no correlation.
Quantitatively, Pearson's correlation coefficient (r) is calculated to measure the strength and direction of the
relationship.
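A quantitative sketch: np.corrcoef returns Pearson's r for two invented arrays with a clearly positive linear trend:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 8, 10])

# Off-diagonal entry of the correlation matrix is Pearson's r.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1 -> strong positive correlation
```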
8. What are three different ways of representing contingency table in percentage form?
The three different ways of representing a contingency table in percentage form are:
Total percentage table: The table that is constructed by dividing each cell frequency by the grand total.
Outflow table: The table that is constructed by dividing each cell frequency by its appropriate row total.
Inflow table: The table that is constructed by dividing each cell frequency by its appropriate column total.
9. What are marginals?
Each row and column in a contingency table can have a total presented at the right-hand end and at the bottom
respectively; these are called the marginals.
The univariate distributions can be obtained from the marginal distributions.
10. Write a brief note on labeling a table.
The title of a table should be clear and concise, summarising the contents.
It should be as short as possible, while at the same time making clear when the data were collected, the geographical
unit covered, and the unit of analysis.
It helps in numbering figures and can refer to them more succinctly in the text.
Other parts of a table also need clear, informative labels.
The variables included in the rows and columns must be clearly identified.
11. What is the importance of using a layout in a table?
The effective use of space and grid lines can make the difference between a table that is easy to read and one which is
not.
Grid lines can help indicate how far a heading or subheading extends in a complex table.
12. What are the considerations to make a decision about which variable to put in the rows and which in
the columns?
Closer figures are easier to compare.
Comparisons are more easily made down a column.
A variable with more than three categories is best put in the rows so that there is plenty of room for category labels.
13. What is the difference in proportions?
The difference in proportions, d, is used to summarize the effect of being in a category of one
variable upon the chances of being in a category of another.
14. Write the properties of difference in proportions?
Symmetric measures of association have the same value regardless of which way the causal effect is assumed to run.
Asymmetric measures have varying values depending on which variable is presumed to be the cause of the other.
15. What are inferential statistics?
The analysis of data from samples of individuals to infer information about the population as a whole is called
inferential statistics.
16. Write the equation for the chi-square statistic.
The equation for chi-square is given by
chi-square = sum over all cells of (O - E)^2 / E
where
O - Observed frequency
E - Expected frequency
The difference between the observed and expected frequencies for each cell of the table is calculated. Then, this value
is squared before dividing it by the expected frequency for that cell.
Finally, these values are summed over all the cells of the table.
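The steps above can be sketched directly with NumPy for a 2x2 contingency table of invented counts, building the expected frequencies from the marginals:

```python
import numpy as np

# Observed frequencies (invented 2x2 contingency table).
O = np.array([[30, 10],
              [20, 40]])

row = O.sum(axis=1, keepdims=True)  # row marginals
col = O.sum(axis=0, keepdims=True)  # column marginals
E = row * col / O.sum()             # expected frequency per cell

# (O - E)^2 / E, summed over all cells.
chi2 = ((O - E) ** 2 / E).sum()
print(round(chi2, 3))  # 16.667
```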
17. What is a null hypothesis?
The null hypothesis is that the two variables under analysis are not associated with the population as a whole and the
relationship that is observed between variables in the sample is small enough to have occurred due to random error.
(i.e.) the null hypothesis states that, in the population of interest, changes in the explanatory variable have no impact
on the outcome of the response variable.
18. Provide two reasons why some data points are outliers.
Outliers occur when the whole distribution is skewed.
The particular data points do not really belong substantively to the same data batch.
19. What is GNI?
GNI is the sum of values of both final goods and services and investment goods in a country. If one focuses on the
production that is undertaken by the residents of that country, the income earned by nationals from abroad has to be
added to the gross domestic product, to arrive at the gross national income.
20. What are the two types of hypotheses used by statistical tests?
Whenever researchers use a statistical test, two hypotheses are involved:
Null hypothesis
Alternative hypothesis
Part B & C
1. How do you analyze contingency tables? Give examples. (Nov/Dec 2023)
2. Discuss the best Practices for Designing Scatter Plots. (Nov/Dec 2023)
3. What is a table of frequency values for a bivariate distribution? Explain What graph is used in the
analysis of bivariate data? (Nov/Dec 2022)
4. How do you analyze a contingency table? Discuss. (Nov/Dec 2022)
5. (i) Discuss in detail about contingency table with example. (7) (April/May 2024)
(ii) Explain in detail about percentage table with suitable example. (6)
6. Draw a scatter plot for the given data that shows the number of games played and scores obtained in
each instance. With this plot explain Scatter plot correlation and its types and justify which type of
correlation it belong to with neat illustration. (13) (April/May 2024)
No. of games : 3 5 2 6 7 1 2 7 1 7
Scores 80 90 75 80 90 50 65 85 40 100
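A sketch of question 6 using the data above: draw the scatter plot and compute Pearson's r to justify the type of correlation:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window needed
import matplotlib.pyplot as plt
import numpy as np

games  = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

plt.scatter(games, scores)
plt.xlabel("No. of games")
plt.ylabel("Scores")
plt.savefig("games_scores.png")

r = np.corrcoef(games, scores)[0, 1]
print(round(r, 3))  # strongly positive -> positive correlation
```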
7. Compute the average seasonal movement for the following series and justify your answer:
Year    Quarterly Production (April/May 2024)
        I     II    III   IV
2002    3.5   3.8   3.7   3.5
2003    3.6   4.0   3.7   3.5
2004    3.4   3.9   3.7   4.1
2005    4.2   4.5   3.8   4.4
2006    3.9   4.4   4.2   4.6
8. Describe the characteristics and applications of resistant lines in regression?
9. Explain the concept of transformations in data analysis and provide examples?
10. Illustrate the application of batch handling in experimental design?
UNIT V MULTIVARIATE AND TIME SERIES ANALYSIS
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and Beyond - Longitudinal Data –
Fundamentals of TSA – Characteristics of time series data – Data Cleaning – Time- based indexing – Visualizing –
Grouping – Resampling.
1. Define cause.
A cause is defined as an object followed by another, where all objects similar to the first are followed by objects
similar to the second; in other words, if the first object had not been, the second never would have existed.
2. Define causality.
Causality can be defined in terms of constant conjunction or statistical association. However, it is clearly not
sufficient for one event invariably to precede another for us to be convinced that the first event causes the second.
3. What is multiple causality?
Multiple causality is a process where many different component causes can combine to produce a specific outcome.
4. What are direct and indirect causal effects?
Direct causal effects are effects that go directly from one variable to another. Indirect effects occur when the relationship
between two variables is mediated by one or more variables.
5. Show the characteristics of multivariate analysis. (Nov/Dec 2022)
Characteristics include:
● Involves more than two variables,
● Identifies interactions among variables,
● Common techniques include PCA, MANOVA, and multiple regression.
6. What is TSA in Statsmodel? (Nov/Dec 2022)
TSA stands for Time Series Analysis in Statsmodels, a Python module used to analyze time-based data
using models like ARIMA, SARIMA, etc.
7. List the different causal relationships between variables.
The different causal relationships between variables are prior, intervening, and ensuing.
8. What is Simpson's paradox?
Simpson's paradox is the phenomenon whereby a statistical relationship between two variables may be reversed by
including additional factors in the analysis.
9. What is multivariate analysis? (Nov/Dec 2023)
Multivariate analysis refers to the statistical analysis of more than two variables at the same time to
examine relationships and effects among them (e.g., multiple regression).
10. What are the two common techniques used to perform dimension reduction? (Nov/Dec 2023)
Two widely used dimension reduction techniques are:
a) Principal Component Analysis (PCA) – reduces dimensionality while preserving variance,
b) Linear Discriminant Analysis (LDA) – reduces dimensions while preserving class separability.
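As a hedged illustration of PCA, a minimal NumPy sketch that reduces 3-D data to 2-D via eigendecomposition of the covariance matrix (the data here is randomly generated purely for illustration; in practice sklearn.decomposition.PCA is the usual tool):

```python
# Minimal PCA sketch via eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 samples, 3 variables
X[:, 2] = X[:, 0] + 0.1 * X[:, 1]       # make the third variable nearly redundant

Xc = X - X.mean(axis=0)                 # centre the data
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]       # largest variance first
components = eigvecs[:, order[:2]]      # keep the top 2 principal components
reduced = Xc @ components               # project 3-D data down to 2-D
```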
11. Define least square method in time series. (April/May 2024)
The least square method fits a trend line to time series data by minimizing the sum of the squares of the vertical
distances (errors) between actual data points and the estimated trend line. It helps in forecasting future values
based on the linear trend.
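A small sketch of the least-squares trend fit described above, with hypothetical yearly values and time coded as t = 1..n:

```python
# Least-squares trend line for a short annual series.
y = [4.0, 5.0, 7.0, 6.0, 8.0]          # hypothetical annual observations
t = list(range(1, len(y) + 1))

n = len(y)
t_bar, y_bar = sum(t) / n, sum(y) / n

# slope b = S_ty / S_tt, intercept a = y_bar - b * t_bar
b = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
     / sum((ti - t_bar) ** 2 for ti in t))
a = y_bar - b * t_bar

forecast = a + b * (n + 1)             # extrapolate one step ahead
```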
DEPARTMENT OF ARTIFICAL INTELLIGENCE AND DATA SCIENCE
Academic Year: 2025-2026 Odd
Semester
15
12. List the techniques used in smoothing time series. (April/May 2024)
Common smoothing techniques include:
● Moving Average (Simple or Weighted): averages recent values to smooth out short-term fluctuations.
● Exponential Smoothing: gives more weight to recent observations.
● Loess/LOWESS: local regression used for smoothing non-linear trends.
These techniques help reveal underlying trends by reducing random variation.
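The moving-average and exponential-smoothing techniques listed above can be sketched in plain Python (the sample values are illustrative only):

```python
# A sketch of two smoothing techniques: simple moving average and
# simple exponential smoothing.
series = [974, 766, 727, 849, 693, 655, 854, 742, 717, 852]

def moving_average(data, k):
    """k-period simple moving average: one value per complete window."""
    return [sum(data[i:i + k]) / k for i in range(len(data) - k + 1)]

def exp_smooth(data, alpha):
    """Simple exponential smoothing: s_t = alpha*x_t + (1 - alpha)*s_{t-1}."""
    smoothed = [data[0]]               # initialise with the first observation
    for x in data[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

ma3 = moving_average(series, 3)        # 8 values for a 10-point series
smoothed = exp_smooth(series, alpha=0.2)
```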
13. What is a panel study?
The participants in a research study are contacted by researchers and asked to provide information about themselves
and their circumstances on a number of different occasions. This is referred to as a panel study.
14. What are transition tables?
Transition tables have a longitudinal dimension in that the two variables being cross-tabulated can be
understood as a single categorical variable that has been measured at two time points.
15. Define cohort. Give example.
A cohort has been defined as an 'aggregate of individuals who experienced the same event within the same time
interval'. The most obvious type of cohort used in longitudinal quantitative research is the birth cohort, i.e., a sample
of individuals born within a relatively short time period.
16. What is a cross-sectional survey?
Change over time is determined by conducting two or more surveys asking the same questions at different points in
historical time. This is known as a repeated cross-sectional survey.
17. What is the major issue in longitudinal studies?
A major methodological issue in longitudinal studies is the problem of attrition, i.e., the dropout of participants
through successive waves of a prospective study.
18. What is univariate time series? Give example.
The series that captures a sequence of observations for the same variable over a particular duration of time is referred
to as univariate time series.
In general, the observations are taken over regular time periods, such as the change in temperature over time
throughout a day.
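A brief sketch of a univariate time series with a time-based index, assuming pandas is available (the hourly temperature values are hypothetical); it also illustrates the time-based indexing and resampling topics from this unit's syllabus:

```python
# Univariate time series with a DatetimeIndex: slicing and resampling.
import pandas as pd

idx = pd.date_range("2024-01-01", periods=24, freq="h")   # one day, hourly
temps = pd.Series(range(10, 34), index=idx, name="temperature")

morning = temps.loc["2024-01-01 06:00":"2024-01-01 11:00"]  # time-based slice
six_hourly = temps.resample("6h").mean()                    # 6-hour means
```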
19. What is event history analysis?
Event history analysis focuses on the timing of events or the duration until a particular event occurs, rather than
changes in attributes over time.
20. What is repeated measures analysis?
Repeated measures analysis focuses on the changes in an individual attribute over time. For example, weight,
performance score, attitude, voting behavior, reaction time, depression, etc.
Part B & C
1. What are the characteristics of multivariate analysis? How do you explain multivariate analysis?
(Nov/Dec 2023)
2. What is TSA analysis? Explain ARIMA, smooth-based and moving average. (Nov/Dec 2023)
3. What is meant by time series data? Describe its four components. (Nov/Dec 2022)
4. What is the best way to visualize time series data? What patterns might appear when you plot the time
series data? (Nov/Dec 2022)
5. Give a case study on univariate and multivariate analysis with example. (Nov/Dec 2023) (Part C)
6. (i) Explain the main components of time series data. Which of these would be most prevalent in data
relating to unemployment? (6) (April/May 2024)
(ii) Suppose you are a data scientist at Times of India and you observe that views on articles increase
during January–March, whereas views during November–December decrease. Does the above pattern
represent seasonality? Justify your answer. (7)
7. Suppose the following data represent total revenues (in millions of constant 1995 dollars) by a car rental
agency over the 11-year period 1990 to 2000: (13) (April/May 2024)
4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 5.0, 2.0, 3.5, 6.5, 6.5
Compute the 5-year moving averages for this annual time series.
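A sketch of the Q7 computation: the 5-year simple moving averages of the revenue series, each average centred on the middle year of its window.

```python
# 5-year simple moving averages for the Q7 revenue series.
revenues = [4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 5.0, 2.0, 3.5, 6.5, 6.5]

window = 5
ma5 = [round(sum(revenues[i:i + window]) / window, 1)
       for i in range(len(revenues) - window + 1)]
# ma5 -> [6.0, 7.0, 7.0, 6.0, 5.5, 5.2, 4.7], centred on 1992..1998
```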
8. Using the following data, calculate the following forecasts: naive, 3-period moving average, 4-period
moving average, 3-2-1 weighted moving average, 1-4-5 weighted moving average, α = 0.1 exponential
smoothing, and α = 0.8 exponential smoothing. Round all forecasts to whole numbers. (April/May 2024) (Part C)
Period : 1   2   3   4   5   6   7   8   9   10
Actual : 974 766 727 849 693 655 854 742 717 852
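A partial sketch for Q8 covering the naive, 3-period moving average, and 3-2-1 weighted moving average forecasts of period 11 (weights 3-2-1 place the largest weight on the most recent period; the remaining forecasts follow the same pattern):

```python
# Forecasts of period 11 for the Q8 series: naive, 3-period simple
# moving average, and 3-2-1 weighted moving average.
actual = [974, 766, 727, 849, 693, 655, 854, 742, 717, 852]

naive = actual[-1]                      # last observed value: 852
sma3 = round(sum(actual[-3:]) / 3)      # mean of periods 8-10: 770
weights = [1, 2, 3]                     # oldest -> newest
wma321 = round(sum(w * x for w, x in zip(weights, actual[-3:]))
               / sum(weights))          # 789
```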
9. Describe the characteristics of time series data.
10. Explain the cleaning procedures for time series data.