Data Science Viva Notes
Q: What is Exploratory Data Analysis (EDA)?
A: EDA is the process of exploring and visualizing data to understand its structure, patterns, and relationships
before applying any model.
Q: Summary statistics means?
A: Summary statistics are basic values that describe a dataset, such as the mean, median, mode, minimum,
maximum, and standard deviation.
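A minimal sketch of these statistics using Python's standard `statistics` module. The `values` list is made-up illustrative data, not something from the notes.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # illustrative data (assumption)

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "min": min(values),
    "max": max(values),
    "std": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(name, value)
```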
Q: Histogram displays what?
A: A histogram shows the frequency distribution of a numeric variable.
Q: Box plot?
A: A box plot shows the spread of data using median, quartiles, and outliers.
Q: How to conclude after seeing boxplot?
A: You can see whether the data is symmetric or skewed, and whether there are outliers.
Q: Whiskers means?
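A: Whiskers in a boxplot extend from the box to the smallest and largest values that lie within 1.5 x IQR of the
lower quartile (Q1) and upper quartile (Q3). Points beyond the whiskers are drawn as outliers.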
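The whisker rule can be sketched with the standard library as follows. The data is illustrative, and the quartile method (`"inclusive"`) is one of several conventions, so plotting libraries may give slightly different quartiles.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # illustrative data (assumption)

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences.
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is plotted as an outlier.
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```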
Q: Scatter plot?
A: A scatter plot shows the relationship between two numeric variables.
Q: Data cleaning?
A: Data cleaning means fixing or removing wrong, incomplete, or inconsistent data.
Q: Handling inconsistencies means?
A: It means correcting values that are wrongly formatted or mismatched in the dataset.
Q: How to apply imputation?
A: Imputation is filling missing values using mean, median, mode, or predictive models.
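A small sketch of median imputation in plain Python. The `ages` list and the use of `None` to mark missing values are assumptions made for illustration; mean or mode imputation would follow the same pattern.

```python
import statistics

ages = [25, None, 31, 40, None, 28]  # None marks a missing value (assumption)

# Compute the fill value from the observed (non-missing) entries only.
observed = [a for a in ages if a is not None]
fill_value = statistics.median(observed)

# Replace each missing entry with the fill value.
imputed = [fill_value if a is None else a for a in ages]
print(imputed)
```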
Q: How to remove duplicates?
A: Use tools or code (like `.drop_duplicates()` in Python) to delete repeated rows.
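To show what duplicate removal does, here is a plain-Python sketch that keeps the first occurrence of each row and preserves order, which is also the default behaviour of pandas `DataFrame.drop_duplicates()`. The helper name `drop_duplicate_rows` and the sample rows are hypothetical.

```python
def drop_duplicate_rows(rows):
    """Return rows with repeats removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:  # rows must be hashable, e.g. tuples
            seen.add(row)
            unique.append(row)
    return unique


rows = [("alice", 30), ("bob", 25), ("alice", 30), ("carol", 41), ("bob", 25)]
unique_rows = drop_duplicate_rows(rows)
print(unique_rows)
```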
Q: Data transformation and feature engineering means?
A: Data transformation changes the data format or scale. Feature engineering creates new useful features for
the model.
Q: Normalization means?
A: Scaling numeric features to a common range (like 0 to 1) so that no feature dominates just because it has
larger values.
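One common way to do this is min-max scaling, x' = (x - min) / (max - min). A minimal sketch, with a hypothetical helper `min_max_scale` and made-up income values:

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero (a design choice here).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20000, 35000, 50000, 80000]  # illustrative data (assumption)
scaled = min_max_scale(incomes)
print(scaled)
```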
Q: Data transformation: converting categorical variables?
A: Convert them into numbers using encoding like One-Hot Encoding (one 0/1 column per category) or Label
Encoding (one integer per category).
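Both encodings can be sketched in a few lines of plain Python. The `colors` column is illustrative, and sorting the categories alphabetically is an assumption made so that the integer codes are reproducible.

```python
colors = ["red", "green", "blue", "green"]  # illustrative categorical column

# Fixed category order: ['blue', 'green', 'red']
categories = sorted(set(colors))

# Label encoding: each category becomes one integer.
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[c] for c in colors]

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = [[1 if c == category else 0 for category in categories] for c in colors]

print(label_encoded)
print(one_hot)
```

Label encoding implies an order between categories, so one-hot encoding is usually the safer choice for categories with no natural order.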
Q: Binning?
A: Binning means converting continuous data into fixed intervals or categories.
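A minimal sketch of binning ages into labelled intervals with the standard `bisect` module. The bin edges and labels are hypothetical; a value equal to an edge falls into the upper bin here.

```python
import bisect

edges = [18, 35, 60]  # hypothetical bin boundaries
labels = ["child", "young adult", "adult", "senior"]  # one more label than edges


def to_bin(age):
    """Map a continuous age to the label of the interval it falls in."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return labels[bisect.bisect_right(edges, age)]


ages = [5, 18, 34, 35, 59, 72]
binned = [to_bin(a) for a in ages]
print(binned)
```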
Q: Polynomial feature creation?
A: Creating new features by raising existing numeric features to powers (like x², x³).
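For a single numeric feature, this can be sketched as below. The helper name `polynomial_features` is hypothetical, and interaction terms between different features are left out to keep the example short.

```python
def polynomial_features(x, degree):
    """Return [x, x^2, ..., x^degree] for one numeric value."""
    return [x ** power for power in range(1, degree + 1)]


# Expand each value of a feature into x, x^2, x^3.
rows = [polynomial_features(x, 3) for x in [1, 2, 3]]
print(rows)
```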
Q: Statistical analysis?
A: It involves using mathematical techniques to summarize, understand, and draw conclusions from data.
Q: Hypothesis testing means?
A: It uses sample data to decide whether there is enough evidence to reject a statement (the null hypothesis)
about a population.
Q: Regression analysis?
A: A technique to study relationships between variables and predict one based on others.
Q: Linear regression model means?
A: A model that predicts an output using a straight-line relationship with input(s).
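For one input, the straight line y = slope * x + intercept can be fitted by least squares. The sketch below uses the closed-form formulas; the helper `fit_line` and the data (which lie exactly on y = 2x + 1) are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # illustrative data on the line y = 2x + 1

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict the output for a new input x = 6
print(slope, intercept, prediction)
```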
Q: T-test means?
A: A test to compare the means of two groups to see if they are significantly different.
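The core of a two-sample t-test is the t statistic: the difference in means divided by its standard error. The sketch below computes the Welch (unequal-variance) version by hand; the helper name and data are hypothetical. Turning t into a p-value needs the t-distribution, which libraries such as SciPy provide (for example `scipy.stats.ttest_ind`).

```python
import math
import statistics


def welch_t_statistic(a, b):
    """t statistic for the difference in means of two independent samples."""
    var_a = statistics.variance(a)  # sample variance (n - 1)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


group_a = [5, 6, 7, 8, 9]  # illustrative data (assumption)
group_b = [1, 2, 3, 4, 5]

t = welch_t_statistic(group_a, group_b)
print(t)
```

A larger absolute t value means the group means are further apart relative to the variation within the groups.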
Q: Chi-square test?
A: A test to check the association between two categorical variables.
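The test compares the observed counts in a contingency table with the counts expected if the two variables were independent. A sketch of the statistic, with a hypothetical 2x2 table and no continuity correction:

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: group A / group B; columns: yes / no (illustrative counts).
table = [[30, 10], [20, 40]]
chi2 = chi_square_statistic(table)
print(chi2)
```

The statistic is then compared with a chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom to obtain a p-value.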
Q: P-value means?
A: It is the probability of getting a result at least as extreme as the one observed, assuming the null
hypothesis is true. A small p-value (commonly < 0.05) means the result is statistically significant.
Q: Logistic regression?
A: A model used for classification problems (like yes/no) by predicting probability.
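The prediction step can be sketched as follows: a linear score is passed through the sigmoid function to get a probability, which is then thresholded. The `weight` and `bias` values are hypothetical, not fitted to any data, and the 0.5 threshold is an assumption.

```python
import math


def predict_probability(x, weight, bias):
    """Sigmoid of the linear score: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


weight, bias = 1.5, -3.0  # hypothetical coefficients (not learned here)

probabilities = [predict_probability(x, weight, bias) for x in [0, 2, 4]]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]  # yes/no decision

print(probabilities)
print(predicted)
```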
Q: Accuracy means?
A: Accuracy is the percentage of correct predictions made by a model.
Q: Accuracy, Precision, and Recall?
A: Accuracy: the fraction of all predictions that are correct. Precision: of the predicted positives, the fraction
that are actually positive. Recall: of the actual positives, the fraction that were correctly predicted.
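All three metrics come from the counts of true/false positives and negatives. A short sketch with made-up labels:

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]  # model predictions (illustrative)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)
```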
Q: ROC AUC curve?
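A: The ROC curve plots the true positive rate against the false positive rate at different thresholds. AUC is the
area under this curve: a score near 1 is best, and 0.5 is no better than random guessing.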
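The AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so it can be sketched by comparing all positive/negative pairs. The helper name `roc_auc` and the data are hypothetical; this pairwise version is fine for small examples but slow for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities (illustrative)

auc = roc_auc(y_true, scores)
print(auc)
```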
Q: Clustering means?
A: Grouping similar data points together based on features.
Q: Segmentation?
A: Dividing data into meaningful groups (like customer segments).
Q: K-means clustering?
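A: A method that divides data into 'k' clusters by repeatedly assigning each point to its nearest cluster center
(centroid) and then moving each centroid to the mean of its points.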
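A minimal one-dimensional sketch of the k-means loop. The starting centroids are chosen by hand here (an assumption); real implementations usually pick them randomly and stop when the assignments no longer change.

```python
def k_means_1d(points, centroids, iterations=10):
    """Simple k-means on 1-D data; k is the number of starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep centroid if cluster is empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]  # illustrative data
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```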
Q: Difference between clustering and segmentation?
A: Clustering is the technique, segmentation is the result or goal.
Q: Churn prediction model?
A: A model that predicts which customers are likely to leave (churn).
Q: Time series analysis?
A: Analyzing data collected over time to find patterns and trends.
Q: Trend?
A: Long-term movement in data (upward or downward).
Q: Seasonality?
A: Repeating patterns at regular intervals (like monthly or yearly).
Q: Noise components?
A: Random or irregular variations in data that cannot be explained.
Q: Outliers mean?
A: Unusual values far from most of the data.
Q: ARIMA?
A: A forecasting model using past values and errors. It stands for AutoRegressive Integrated Moving
Average.
Q: Forecasting means?
A: Predicting future values based on past data.
Q: Exponential smoothing?
A: A method to forecast data by giving more weight to recent observations.
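Simple exponential smoothing can be sketched as s_t = alpha * x_t + (1 - alpha) * s_(t-1), where a larger alpha gives more weight to recent observations. The `sales` data, the value of `alpha`, and starting from the first observation are assumptions.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha in (0, 1] controls the recency weight."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


sales = [10, 12, 14, 13]  # illustrative data
smoothed = exponential_smoothing(sales, alpha=0.5)

# The last smoothed value serves as the one-step-ahead forecast.
forecast = smoothed[-1]
print(smoothed, forecast)
```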
Q: Anomalies?
A: Unusual or unexpected data points that don't fit the pattern.
Q: Z-score?
A: A value that shows how far a data point is from the mean, in standard deviations.
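The z-score is z = (x - mean) / standard deviation. A short sketch that also flags possible anomalies; the data and the cutoff of 2 are assumptions (3 is another common choice for larger datasets).

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 50]  # illustrative data

mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation

z_scores = [(v - mean) / std for v in values]

# Flag points whose z-score is more than 2 standard deviations from the mean.
threshold = 2  # assumed cutoff
flagged = [v for v, z in zip(values, z_scores) if abs(z) > threshold]
print(flagged)
```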
Q: Isolation Forest model?
A: A tree-based model that detects anomalies by randomly splitting the data. Anomalies are isolated in fewer
splits than normal points.
Q: Profiling means?
A: Creating a summary of data to understand its structure, quality, and patterns.
Q: Correlation matrix?
A: A table showing the correlation coefficient for every pair of variables (with values between -1 and 1).
Q: Correlation coefficient?
A: A number that shows the strength and direction of the relationship between two variables.
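The most common version is the Pearson correlation coefficient, which measures linear relationships. A sketch computing it by hand, with a hypothetical helper `pearson_r` and made-up data:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


hours = [1, 2, 3, 4, 5]      # illustrative data
score = [2, 4, 6, 8, 10]     # rises with hours
fatigue = [10, 8, 6, 4, 2]   # falls with hours

r_positive = pearson_r(hours, score)
r_negative = pearson_r(hours, fatigue)
print(r_positive, r_negative)
```

Values near +1 or -1 indicate a strong relationship (positive or negative), and values near 0 indicate little or no linear relationship.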
Q: ML, AI, and Deep Learning?
A: AI is the broad field. ML is a part of AI that learns from data. Deep Learning is a type of ML using neural
networks.
Q: Supervised and Unsupervised learning?
A: Supervised: learns with labeled data (has answers). Unsupervised: finds patterns from unlabeled data.