AMAZON DATA SCIENCE INTERVIEW
𝗪𝗵𝗮𝘁 𝗶𝘀 𝘃𝗮𝗿𝗶𝗮𝗻𝗰𝗲 𝗶𝗻 𝗮 𝗺𝗼𝗱𝗲𝗹?
Variance in a model refers to how much the model's
predictions change when trained on different
subsets of the data. It captures the sensitivity of the
model to variations in the training data.
High variance means the model is very sensitive to
the specific data it was trained on. This results in
large fluctuations in predictions when exposed to
different datasets, even if they are similar. High
variance is typically associated with overfitting.
Low variance means the model's predictions are
stable, even when trained on different datasets.
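One way to see this concretely: retrain the same model on bootstrap resamples of the data and measure how much its predictions move. Below is a minimal sketch, assuming scikit-learn; the synthetic dataset and tree settings are purely illustrative:

```python
# Measure model variance: train the same model on different bootstrap
# samples and look at how much its predictions change.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_test = X[:50]  # fixed points to evaluate predictions on

preds = []
for seed in range(20):
    X_boot, y_boot = resample(X, y, random_state=seed)  # new training subset
    model = DecisionTreeRegressor(max_depth=None, random_state=0)  # deep tree -> high variance
    preds.append(model.fit(X_boot, y_boot).predict(X_test))

# Variance of predictions across retrainings, averaged over the test points.
print("prediction variance:", np.var(np.array(preds), axis=0).mean())
```

A deep, unpruned tree will typically show much higher prediction variance here than a shallow tree (e.g. max_depth=2) trained the same way.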
𝗜𝘀 𝗮 𝗱𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝘁𝗿𝗲𝗲 𝗺𝗼𝗱𝗲𝗹 𝗯𝗲𝘀𝘁 𝗳𝗼𝗿 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗻𝗴 𝗶𝗳 𝗮
𝗯𝗼𝗿𝗿𝗼𝘄𝗲𝗿 𝘄𝗶𝗹𝗹 𝗽𝗮𝘆 𝗯𝗮𝗰𝗸 𝗮 𝗽𝗲𝗿𝘀𝗼𝗻𝗮𝗹 𝗹𝗼𝗮𝗻? 𝗛𝗼𝘄 𝘄𝗼𝘂𝗹𝗱
𝘆𝗼𝘂 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹?
This would depend on the data and on the model's actual performance.
Decision trees can be a good starting point: they are interpretable, handle non-linear relationships, and require minimal pre-processing.
However, financial datasets tend to be imbalanced (defaults are rare relative to repayments), so assessing performance on a wide variety of classification metrics like precision, recall, and F1-score matters more than accuracy alone.
From here, the interviewer might have follow-ups, so make sure you understand the metrics very well, specifically precision, recall, F1, AUC-ROC, and AUC-PR. A sketch of computing them is below.
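A minimal sketch with scikit-learn; the dataset is a synthetic stand-in for loan data, and the 10% positive (default) rate and model settings are illustrative assumptions:

```python
# Evaluate a loan-repayment classifier on an imbalanced dataset
# with several metrics, not just accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

# Synthetic stand-in for loan data: ~10% positive (default) class.
X, y = make_classification(n_samples=5000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # scores for threshold-free metrics

print("precision:", precision_score(y_te, y_pred))
print("recall:   ", recall_score(y_te, y_pred))
print("F1:       ", f1_score(y_te, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_te, y_prob))
print("AUC-PR:   ", average_precision_score(y_te, y_prob))  # PR-curve summary
```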
𝗪𝗵𝗮𝘁 𝘄𝗼𝘂𝗹𝗱 𝘆𝗼𝘂 𝗱𝗼 𝗶𝗳 20% 𝗼𝗳 𝘁𝗵𝗲 100,000 𝘀𝗼𝗹𝗱 𝗹𝗶𝘀𝘁𝗶𝗻𝗴𝘀 𝗮𝗿𝗲 𝗺𝗶𝘀𝘀𝗶𝗻𝗴 𝘀𝗾𝘂𝗮𝗿𝗲 𝗳𝗼𝗼𝘁𝗮𝗴𝗲 𝗱𝗮𝘁𝗮? 𝗬𝗼𝘂 𝘄𝗮𝗻𝘁 𝘁𝗼 𝗽𝗿𝗲𝗱𝗶𝗰𝘁 𝗽𝗿𝗶𝗰𝗲.
The right technique depends on the cause of the missing values: are they missing at random, or does a missing value mean the listing is 'pending' or 'not ready for sale'? Once we understand the reason, we can handle it in different ways:
(a) drop the feature if it's not predictive, or if other features are good proxies;
(b) impute - mean/median or a KNN imputer;
(c) use a model that handles missing values natively, such as XGBoost or scikit-learn's HistGradientBoosting (standard Random Forest implementations usually require imputation first).
Get feedback from the interviewer and be ready to dive into each approach! A sketch of options (b) and (c) follows.
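A short sketch assuming pandas and scikit-learn; the column names and toy listing data are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy listings table; 'sqft' has missing values, 'price' is the target.
df = pd.DataFrame({
    "sqft":  [900, np.nan, 1500, np.nan, 2100],
    "beds":  [2, 3, 3, 1, 4],
    "price": [200_000, 310_000, 330_000, 150_000, 450_000],
})

# Option (b): impute with the median, or with KNN using the other features.
df["sqft_median"] = SimpleImputer(strategy="median").fit_transform(df[["sqft"]]).ravel()
df["sqft_knn"] = KNNImputer(n_neighbors=2).fit_transform(df[["sqft", "beds"]])[:, 0]

# Option (c): pass the NaNs straight to a model that handles them natively.
model = HistGradientBoostingRegressor().fit(df[["sqft", "beds"]], df["price"])
print(df)
print(model.predict(df[["sqft", "beds"]]))
```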
𝗪𝗵𝗮𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗯𝗲𝘁𝘄𝗲𝗲𝗻 𝗫𝗚𝗕𝗼𝗼𝘀𝘁 𝗮𝗻𝗱
𝗿𝗮𝗻𝗱𝗼𝗺 𝗳𝗼𝗿𝗲𝘀𝘁?
Some differences are:
XGBoost: an implementation of gradient boosting. XGBoost builds trees 𝘀𝗲𝗾𝘂𝗲𝗻𝘁𝗶𝗮𝗹𝗹𝘆, where each new tree tries to correct the errors made by the previous trees. It has lower bias due to the sequential learning process, but potentially higher variance if overfitting occurs; regularization techniques are applied to mitigate this.
Random Forest: an example of bagging (bootstrap aggregating), where multiple trees are built 𝗶𝗻𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝘁𝗹𝘆 𝗮𝗻𝗱 𝗶𝗻 𝗽𝗮𝗿𝗮𝗹𝗹𝗲𝗹. It has higher bias because each tree is grown independently, but variance is reduced by averaging across many trees. It has no explicit regularization, yet it naturally limits overfitting by averaging predictions from multiple trees and using random feature selection.
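An illustrative side-by-side, assuming scikit-learn and the xgboost package are installed; the hyperparameters are arbitrary choices, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Bagging: independent deep trees, variance reduced by averaging.
rf = RandomForestClassifier(n_estimators=300, random_state=0)

# Boosting: shallow trees fit sequentially to the previous trees' errors,
# with explicit regularization (learning_rate, reg_lambda) to curb overfitting.
xgb = XGBClassifier(n_estimators=300, max_depth=3, learning_rate=0.1,
                    reg_lambda=1.0, random_state=0)

for name, model in [("Random Forest", rf), ("XGBoost", xgb)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```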
WAS THIS HELPFUL?
Be sure to save it so you
can come back to it later!
@karunt