Data Science Interview Question
1. Normalisation - used when we want to bound our values between two numbers, typically [0, 1] or [-1, 1].
2. Standardization - transforms the data to have a mean of zero and a variance of 1.
3. Min Max Scaler - transforms features by scaling each feature to a given range, e.g. [0, 1], [0, 5] or [-1, 1]. This scaler responds well if the standard deviation is small and the distribution is not Gaussian. It is sensitive to outliers.
4. Standard Scaler - assumes the data is normally distributed within each feature and scales it so that the distribution is centered around 0 with a standard deviation of 1. If the data is normally distributed, this is the best scaling.
5. Max Abs Scaler - scales each feature by its maximum absolute value (the point where the feature attains its greatest possible magnitude), so the maximum absolute value of each feature in the training data becomes 1. It does not destroy sparsity (number of zeros in the matrix / total number of elements in the matrix, i.e. very few non-zero values) because it does not shift or center the data; it is meant for data that is already centered at zero or for sparse data.
6. Robust Scaler - robust to outliers. This is used when we have large outliers in the data. This scaler removes the median and scales the data according to the quantile range.
7. Quantile Transformer Scaler - transforms the features to follow a uniform or a normal distribution. This transform is non-linear.
8. Power Transformer Scaler - makes the data more Gaussian-like. Used when there are modeling issues related to variability that is unequal across the range. Minimizes skewness.
9. Unit vector scaling - divides each observation by the Euclidean length (L2 norm) of its vector, so each row has unit norm and the values lie in [0, 1] for non-negative data. (A usage sketch with scikit-learn follows this list.)
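A minimal sketch (not from the original notes), assuming scikit-learn is available; the array X is made-up toy data with an outlier:
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler, MaxAbsScaler,
                                   RobustScaler, QuantileTransformer,
                                   PowerTransformer, Normalizer)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 4000.0]])   # toy data with an outlier

print(MinMaxScaler().fit_transform(X))                      # each feature scaled to [0, 1]
print(StandardScaler().fit_transform(X))                    # zero mean, unit variance per feature
print(MaxAbsScaler().fit_transform(X))                      # divide each feature by its max |value|
print(RobustScaler().fit_transform(X))                      # remove the median, scale by the IQR
print(QuantileTransformer(n_quantiles=3).fit_transform(X))  # map to a uniform distribution
print(PowerTransformer().fit_transform(X))                  # make the data more Gaussian-like
print(Normalizer().fit_transform(X))                        # scale each row to unit Euclidean norm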
5. Histogram VS Boxplots
Histograms are bar charts that show the frequency of a numerical variable’s values
and are used to approximate the probability distribution of the given variable. It
allows you to quickly understand the shape of the distribution, the variation, and
potential outliers.
Boxplots communicate different aspects of the distribution of data. While you can’t
see the shape of the distribution through a box plot, you can gather other information
like the quartiles, the range, and outliers. Boxplots are especially useful when you
want to compare multiple distributions at the same time because they take up less
space than histograms.
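A minimal sketch (not from the original notes), assuming matplotlib is available, showing both plot types side by side on synthetic data:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=500)   # synthetic sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=30)          # shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data)                # median, quartiles, range, outliers
ax2.set_title("Boxplot")
plt.show()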
Regularization Term
The regularization term Ω is defined as the squared Euclidean norm (L2 norm) of the
weight matrices, i.e. the sum of all squared weight values of a weight matrix.
The regularization term is weighted by the scalar alpha divided by two and added to
the regular loss function that is chosen for the current task. This leads to a new
expression for the loss function:
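A standard form of the L2-regularized loss, consistent with the description above (with L the task loss, w the weights and α the regularization strength), is
L_reg(w) = L(w) + (α / 2) · Σ_j w_j²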
L1 Regularization:
In the case of L1 regularization (also known as Lasso regression), we simply use
another regularization term Ω. This term is the sum of the absolute values of the
weight parameters in a weight matrix:
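A standard form consistent with this description is
Ω(w) = Σ_j |w_j|, giving L_reg(w) = L(w) + α · Σ_j |w_j|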
Alpha Value:
If your alpha value is too high, your model will be simple, but you run the risk of
underfitting your data. Your model won’t learn enough about the training data to
make useful predictions.
If your alpha value is too low, your model will be more complex, and you run the
risk of overfitting your data. Your model will learn too much about the particularities
of the training data, and won’t be able to generalize to new data.
DROPOUT:
Dropout means that during training, each neuron of the neural network is turned off
(set to zero) with some probability p.
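A minimal sketch (not from the original notes) of applying a dropout mask to a layer's activations with NumPy ("inverted dropout"); the helper dropout() and the array are made up for illustration:
import numpy as np

def dropout(activations, p=0.5):
    # keep each neuron with probability (1 - p) and rescale so the expected value is unchanged
    mask = (np.random.rand(*activations.shape) > p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

a = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout(a, p=0.5))        # roughly half of the activations are zeroed out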
7. Overfitting and Underfitting
Overfitting refers to a model that models the training data too well. Overfitting
happens when a model learns the detail and noise in the training data to the extent
that it negatively impacts the performance of the model on new data. This means that
the noise or random fluctuations in the training data is picked up and learned as
concepts by the model. The problem is that these concepts do not apply to new data
and negatively impact the model’s ability to generalize. Overfitting is the case where
the overall cost is really small, but the generalization of the model is unreliable. This
is due to the model learning “too much” from the training data set. These models
have low bias and high variance. Very complex models, such as decision trees, are
prone to overfitting.
Underfitting is the case where the model has “ not learned enough” from the
training data, resulting in low generalization and unreliable predictions. These
models usually have high bias and low variance. Underfitting happens when we have
too little data to build an accurate model, or when we try to fit a linear model to
nonlinear data. These kinds of models are too simple to capture the complex patterns
in the data, such as linear and logistic regression.
Variance is the variability of model prediction for a given data point or a value that
tells us the spread of our data. The model with high variance pays a lot of attention to
training data and does not generalize on the data which it hasn’t seen before. As a
result, such models perform very well on training data but have high error rates on
test data.
Trade-off
If our model is too simple and has very few parameters then it may have high bias
and low variance. On the other hand, if our model has a large number of parameters
then it’s going to have high variance and low bias. So we need to find the right/good
balance without overfitting and underfitting the data. This tradeoff in complexity is
why there is a tradeoff between bias and variance. An algorithm can’t be more
complex and less complex at the same time. This trade-off is the most integral aspect
of Machine Learning model training. As we discussed, Machine Learning models
fulfill their purpose when they generalize well. Generalization is bound by the two
undesirable outcomes — high bias and high variance. Detecting whether the model
suffers from either one is the sole responsibility of the model developer.
9. Cross-Validation
Cross-validation is a validation technique for evaluating how the outcomes of a
statistical analysis will generalize to an independent dataset. It is used in settings
where the objective is prediction and one needs to estimate how accurately a model
will perform in practice. The simplest example of cross-validation is when you split
your data into two groups, training data and testing data, where you use the training
data to build the model and the testing data to test the model.
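A minimal sketch (not from the original notes), assuming scikit-learn is available, of 5-fold cross-validation on a made-up synthetic dataset:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # made-up data
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)   # accuracy on each of the 5 held-out folds
print(scores.mean(), scores.std())            # average score and its spread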
10.DS vs DA vs ML vs AI
Data science is a concept used to tackle big data and includes data cleansing,
preparation, and analysis. A data scientist gathers data from multiple sources and
applies machine learning, predictive analytics, and sentiment analysis to extract
critical information from the collected data sets. They understand data from a
business point of view and can provide accurate predictions and insights that can be
used to power critical business decisions.
A data analyst is usually the person who can do basic descriptive statistics,
visualize data, and communicate data points for conclusions. They must have a basic
understanding of statistics, a good sense of databases, the ability to create new
views, and the ability to visualize the data. Data analytics can be thought of as a
foundational level of data science.
11. PCA
PCA is a dimensionality-reduction method often used to reduce the dimensionality of
large data sets by transforming a large set of variables into a smaller one that still
contains most of the information in the large set.
The idea of PCA is simple: reduce the number of variables of a data set while
preserving as much information as possible.
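A minimal sketch (not from the original notes), assuming scikit-learn is available; the data X is made up for illustration:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)              # 100 samples, 5 features (made-up data)
pca = PCA(n_components=2)               # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept by each component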
12. Clustering
Apply k-means clustering to the dataset and use the typical within-cluster distances
from each point to its assigned cluster's centroid to decide on the number of clusters
to keep.
K-means (K-means clustering is a simple unsupervised learning algorithm that is
used to solve clustering problems. It follows a simple procedure of classifying a given
data set into a number of clusters, defined by the letter "k," which is fixed
beforehand. The distance from each observation to every centroid is compared, and
the observation is assigned to the closest centroid; a short sketch with scikit-learn
follows.)
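A minimal sketch (not from the original notes), assuming scikit-learn is available, of k-means on made-up 2-D points:
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(150, 2)              # synthetic 2-D points (made-up data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])              # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)          # the 3 centroids
print(kmeans.inertia_)                  # sum of squared within-cluster distances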
→ This output is then fed back as data to train on for future use, thereby improving
the model's efficiency with experience.
-->Regression If you want to predict continuous values, such as trying to predict
the cost of a house or the weather outside in degrees, you would use regression. This
type of problem doesn’t have a specific value constraint because the value could be
any number with no limits.
-->Classification If you want to predict discrete values, such as classifying
something into categories, you would use classification. A problem like, "Will he
make this purchase" will have an answer that falls into two specific categories: yes or
no. This is also called a binary classification problem.
Unsupervised learning: The machine learns from unlabeled data. Meaning, there is no
"right" answer given to the machine to learn from, but the machine must hopefully
find patterns in the data to come up with an answer. Our main goal is not to produce
the output but to discover patterns. → In this case the machine has not learned
anything beforehand (i.e. no training). It has no knowledge about the output class.
Data is "unlabelled" or of "unknown value". No supervisor is required, and therefore
these are "self-guided algorithms".
-->Clustering Problem Unsupervised learning tries to solve this problem by
looking for similarities in the data. If there is a common cluster or group, the
algorithm would then categorize them in a certain form. An example of this could be
trying to group customers based on past buying behavior.
-->Association Problem Unsupervised learning tries to solve this problem by
trying to understand the rules and meaning behind different groups. Finding a
relationship between customer purchases is a common example of an association
problem. Stores may want to know what type of products were purchased together
and could possibly use this information to organize the placement of these products
for easier access. One store found out that there was a strong association between
customers buying beer and diapers. They deduced from this statement that males
who had gone out to buy diapers for their babies also tend to buy beer as well.
Reinforcement Learning:
It learns to control the behavior of a system by using observations gathered from the
interaction between an agent and the environment, in order to perform actions that
result in a change of state which in turn maximizes the reward or minimizes the risk.
Here, the agent learns continuously from the environment in a recursive manner. The
feedback given to the agent in the form of a reward is used as a learning experience
by the algorithm. This is known as a reinforcement signal.
There are three types of Naive Bayes model: Gaussian, Multinomial, and Bernoulli.
Regression loss
A. Mean Square Error (MSE), Quadratic Loss, L2 Loss
MSE is the most commonly used regression loss function. It is the mean of the
squared differences between our target variable and the predicted values.
● Magnifies larger errors; shrinks smaller errors
● Differentiable
● Positive errors and negative errors are treated equivalently
Mean Absolute Error (MAE) is another loss function used for regression models.
MAE is the sum of absolute differences between our target and predicted variables.
So it measures the average magnitude of errors in a set of predictions, without
considering their directions. (If we consider directions also, that would be called
Mean Bias Error (MBE), which is a sum of residuals/errors). The range is also 0 to ∞.
L1 loss is more robust to outliers, but its derivatives are not continuous, making it
inefficient to find the solution. L2 loss is sensitive to outliers, but gives a more stable
and closed form solution (by setting its derivative to 0.)
Huber loss is less sensitive to outliers in data than the squared error loss. It’s also
differentiable at 0. It’s basically absolute error, which becomes quadratic when error
is small. How small that error has to be to make it quadratic depends on a
hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MSE when 𝛿
~ 0 and MAE when 𝛿 ~ ∞ (large numbers.)
D. Log-Cosh Loss
Log-cosh is another function used in regression tasks that’s smoother than L2.
Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.
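A minimal sketch (not from the original notes) of the regression losses described above, implemented with NumPy; the helper functions and toy arrays are made up for illustration:
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)          # mean of squared errors

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))         # mean of absolute errors

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                      # used where |error| <= delta
    linear = delta * (np.abs(err) - 0.5 * delta)    # used where |error| >  delta
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

def log_cosh(y_true, y_pred):
    return np.mean(np.log(np.cosh(y_pred - y_true)))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.0])             # the last error is large (outlier-like)
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred), log_cosh(y_true, y_pred))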
A. Hinge Loss
Hinge loss is primarily used with Support Vector Machine (SVM) classifiers, with
class labels -1 and 1. Hinge loss penalizes not only the wrong predictions but also the
right predictions that are not confident.
B. Cross-Entropy
Cross-entropy loss, or log loss, measures the performance of a classification model
whose output is a probability value between 0 and 1. Cross-entropy loss increases as
the predicted probability diverges from the actual label.
For two classes, i.e. y(x) ∈ {ω1, ω2} with the classes coded as 0 and 1, f̂(x^(l); θ) is
the output of the classifier.
Note that the cross-entropy between two distributions p and q can be computed as
shown below, i.e. it is the message length when a wrong distribution q is assumed
while the data actually comes from some other distribution p. The binary
cross-entropy expression then follows if we also use that P(ω1) = 1 − P(ω2).
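A standard form of the cross-entropy between distributions p and q, consistent with the description above, is
H(p, q) = − Σ_x p(x) · log q(x)
and for the binary case, with true label y ∈ {0, 1} and predicted probability ŷ = f̂(x; θ),
L = − [ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]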
C. Kullback-Leibler(KL) Divergence
The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much
one probability distribution differs from another probability distribution.
The KL divergence between two distributions Q and P is often stated using the
following notation:
● KL(P || Q)
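A standard form of the score, consistent with the notation above, is
KL(P || Q) = Σ_x P(x) · log( P(x) / Q(x) )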
NumPy aims to provide an array object that is up to 50x faster than traditional
Python lists.
The array object in NumPy is called ndarray, it provides a lot of supporting functions
that make working with ndarray very easy.
NumPy arrays are stored at one continuous place in memory unlike lists, so
processes can access and manipulate them very efficiently.
This is the main reason why NumPy is faster than lists. It is also optimized to work
with the latest CPU architectures.
alias: In Python, an alias is an alternate name for referring to the same thing.
type(): This built-in Python function tells us the type of the object passed to it.
The array object in NumPy is called ndarray.We can create a NumPy ndarray object
by using the array() function.
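A minimal sketch (not from the original notes) of creating an ndarray and checking its type:
import numpy as np              # "np" is the conventional alias for NumPy

arr = np.array([1, 2, 3, 4, 5])
print(arr)                      # [1 2 3 4 5]
print(type(arr))                # <class 'numpy.ndarray'>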
0-D arrays, or scalars, are the elements in an array. An array that has 0-D arrays as
its elements is called a uni-dimensional or 1-D array.
An array that has 1-D arrays as its elements is called a 2-D array. These are often
used to represent matrices or 2nd-order tensors.
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
NumPy arrays provide the ndim attribute, which returns an integer that tells us how
many dimensions the array has.
An array can have any number of dimensions. When the array is created, you can
define the number of dimensions by using the ndmin argument.
arr = np.array([1, 2, 3, 4], ndmin=5)
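A minimal sketch (not from the original notes) showing ndim for arrays of different dimensions, including the ndmin example above:
import numpy as np

a = np.array(42)                        # 0-D (a scalar)
b = np.array([1, 2, 3, 4, 5])           # 1-D
c = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D
d = np.array([1, 2, 3, 4], ndmin=5)     # forced to 5 dimensions
print(a.ndim, b.ndim, c.ndim, d.ndim)   # 0 1 2 5
print(d)                                # [[[[[1 2 3 4]]]]]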
Array indexing is the same as accessing an array element. You can access an array
element by referring to its index number. The indexes in NumPy arrays start with 0,
meaning that the first element has index 0 and the second has index 1.
Slicing in python means taking elements from one given index to another given
index.
We pass slice instead of index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it's considered 0.
If we don't pass end, it's considered the length of the array in that dimension.
If we don't pass step, it's considered 1.
The result includes the start index but excludes the end index. The same goes for
negative indexing.
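A minimal sketch (not from the original notes) of indexing and slicing a 1-D array:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[0])        # 1, the first element (index 0)
print(arr[1:5])      # [2 3 4 5], start included, end excluded
print(arr[4:])       # [5 6 7], missing end defaults to the array length
print(arr[:4])       # [1 2 3 4], missing start defaults to 0
print(arr[-3:-1])    # [5 6], negative indexing
print(arr[1:6:2])    # [2 4 6], step of 2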
● strings - used to represent text data, the text is given under quote marks. e.g.
"ABCD"
● integer - used to represent integer numbers. e.g. -1, -2, -3
● float - used to represent real numbers. e.g. 1.2, 42.42
● boolean - used to represent True or False.
● complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j
NumPy has some extra data types and refers to data types with a single character,
like i for integers, u for unsigned integers, etc.
Below is a list of all data types in NumPy and the characters used to represent them.
● i - integer
● b - boolean
● u - unsigned integer-- no signs associated +/-
● f - float
● c - complex float
● m - timedelta-- difference between 2 times
● M - datetime
● O - object
● S - string
● U - unicode string
● V - fixed chunk of memory for other type ( void )
The NumPy array object has a property called dtype that returns the data type of the
array
The best way to change the data type of an existing array is to make a copy of the
array with the astype() method.
The astype() function creates a copy of the array and allows you to specify the data
type as a parameter.
import numpy as np

arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')    # 'i' is the character code for integer; gives [1 2 3]
newarr = arr.astype(int)    # the built-in int type works as well
The main difference between a copy and a view of an array is that the copy is a new
array, and the view is just a view of the original array.
The copy owns the data and any changes made to the copy will not affect original
array, and any changes made to the original array will not affect the copy.
The view does not own the data and any changes made to the view will affect the
original array, and any changes made to the original array will affect the view.
The copy SHOULD NOT be affected by the changes made to the original array
The view SHOULD be affected by the changes made to the original array.
NumPy arrays have an attribute called shape that returns a tuple with each index
having the number of corresponding elements.
flatten() returns a copy, while ravel() returns a view.
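A minimal sketch (not from the original notes) contrasting copy() and view(), plus shape, flatten() and ravel():
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
c = arr.copy()         # owns its own data
v = arr.view()         # shares data with the original
arr[0] = 42
print(c)               # [1 2 3 4 5]   -> the copy is NOT affected
print(v)               # [42 2 3 4 5]  -> the view IS affected
print(arr.shape)       # (5,)

m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.flatten())     # [1 2 3 4 5 6], a copy of the flattened data
print(m.ravel())       # [1 2 3 4 5 6], a view of the flattened data (when possible)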
The nditer() function is a helper that can be used for everything from very basic to
very advanced iteration. It solves some basic issues we face in iteration; let's go
through it with examples.
We can use op_dtypes argument and pass it the expected datatype to change the
datatype of elements while iterating.
NumPy does not change the data type of the element in place (where the element is
in the array), so it needs some other space to perform this action. That extra space is
called a buffer, and to enable it in nditer() we pass flags=['buffered'].
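A minimal sketch (not from the original notes) of iterating with nditer() while changing the element data type, which requires the buffered flag:
import numpy as np

arr = np.array([1, 2, 3])
for x in np.nditer(arr, flags=['buffered'], op_dtypes=['S']):
    print(x)           # b'1', b'2', b'3' -- each element viewed as a byte string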
For enumerated iteration, np.ndenumerate() yields the index together with the value;
for the 1-D array [1, 2, 3] it prints:
(0,) 1
(1,) 2
(2,) 3
Stacking is the same as concatenation; the only difference is that stacking is done
along a new axis.
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)
print(arr)    # [[1 4]
              #  [2 5]
              #  [3 6]]
We also have the split() method available, but it will not adjust the elements when
there are fewer elements in the source array than requested splits; in a case like the
example above, array_split() works properly but split() would fail.
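A minimal sketch (not from the original notes) showing that array_split() handles uneven splits while split() does not:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
print(np.array_split(arr, 4))   # [array([1, 2]), array([3, 4]), array([5]), array([6])]
# np.split(arr, 4) would raise an error: 6 elements cannot be split into 4 equal parts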
You can search an array for a certain value, and return the indexes that get a
match.
To search an array, use the where() method.
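A minimal sketch (not from the original notes) of where():
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])
print(np.where(arr == 4))       # (array([3, 5, 6]),) -- the indexes where the value is 4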
NLP
DBSCAN