Data Science Interview Questions

This document discusses techniques for data preparation, modeling, and evaluation, including: data cleaning, profiling, visualization, and scaling (handling null values, standardization); reasons for scaling data and scaling techniques such as normalization, standardization, min-max scaling, and robust scaling; and issues with unbalanced classification along with techniques to address them, such as over/undersampling and adjusting evaluation metrics.

1. Data wrangling and data cleaning:

Data Profiling— understanding the data using its shape and summary description
Data Visualisation— using histograms and boxplots, and identifying relationships
Syntax errors— removing stray white space, checking unique values
Standardization or Normalisation— scaling our data
Handling null values
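A minimal pandas sketch of these profiling and cleaning steps (the file name, median imputation, and column cleanup are illustrative assumptions, not a prescribed workflow):

import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical input file

# Data profiling: shape and summary description
print(df.shape)
print(df.describe(include="all"))

# Handling null values: inspect, then impute (median used here as an example)
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# Syntax cleanup: strip stray white space, check unique values
df.columns = df.columns.str.strip()
print(df.nunique())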

2. Techniques to scale our data


First, why do we need to scale our data? If features have vastly different ranges (say
thousands vs. tens), many models implicitly assume that the larger-valued features
are more important. Scaling also puts features measured in different units (e.g. price
vs. kg) on a comparable footing. In addition, gradient descent in neural networks
converges much faster with scaled data, and scaling helps avoid saturation (values
reaching their peak).

1. Normalisation— used when we want to bound our values between two numbers,
typically [0,1] or [-1,1].
2. Standardization— transforms the data to have zero mean and a variance of 1.
3. Min Max Scaler— transforms features by scaling each feature to a given range,
e.g. [0,1], [0,5], or [-1,1]. This scaler responds well if the standard deviation is small
and the distribution is not Gaussian. It is sensitive to outliers.
4. Standard Scaler— assumes the data is normally distributed within each feature
and scales it so that the distribution is centered around 0 with a standard deviation
of 1. If the data is normally distributed, this is the best scaling.
5. Max Abs Scaler— scales each feature by its maximum absolute value (the point
where the function obtains its greatest possible value), so the maximum absolute
value of each feature in the training data is 1. It does not destroy sparsity (sparsity =
number of zeros in the matrix / total number of elements, i.e. very few non-zero
values). On positive-only data it behaves like Min-Max scaling.
6. Robust Scaler— robust to outliers; used when we have large outliers in the data.
This scaler removes the median and scales the data according to the quantile range
(IQR).
7. Quantile Transform Scaler— transforms the features to follow a uniform or a
normal distribution. This transform is non-linear.
8. Power Transform Scaler— makes the data more Gaussian. Used when modeling
issues related to variability that is unequal across the range; minimizes skewness.
9. Unit vector scaling— divides the whole data by the Euclidean length of the
vector; range [0,1].
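A short scikit-learn sketch comparing a few of these scalers; the toy array (with one outlier) is only for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 100000.0]])   # second column has an outlier

for scaler in (MinMaxScaler(), StandardScaler(), MaxAbsScaler(), RobustScaler()):
    # fit_transform learns the scaling parameters from X and applies them
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))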

3. Unbalanced binary classification/ data imbalance


1. Reconsider the evaluation metric (e.g. precision, recall, or F1 instead of accuracy)
2. Increase the cost of misclassifying the minority class
3. Oversampling the minority class or under-sampling the majority class

4. Oversampling and undersampling


Under-sampling balances the dataset by reducing the size of the abundant class.
This method is used when the quantity of data is sufficient. By keeping all samples in
the rare class and randomly selecting an equal number of samples in the abundant
class, a balanced new dataset can be retrieved for further modeling.
Oversampling is used when the quantity of data is insufficient. It tries to balance
the dataset by increasing the number of rare samples. Rather than getting rid of
abundant samples, new rare samples are generated using e.g. repetition,
bootstrapping or SMOTE (Synthetic Minority Over-Sampling Technique).
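A minimal sketch of random over- and under-sampling using sklearn.utils.resample (the class sizes are made up; libraries such as imbalanced-learn also provide SMOTE directly):

import numpy as np
from sklearn.utils import resample

X = np.random.randn(1000, 3)
y = np.array([0] * 950 + [1] * 50)               # imbalanced labels: 950 majority, 50 minority

X_minority, X_majority = X[y == 1], X[y == 0]

# Oversampling: draw minority samples with replacement up to the majority size
X_minority_up = resample(X_minority, replace=True, n_samples=len(X_majority), random_state=42)

# Undersampling: randomly keep only as many majority samples as there are minority samples
X_majority_down = resample(X_majority, replace=False, n_samples=len(X_minority), random_state=42)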

5. Histogram VS Boxplots
Histograms are bar charts that show the frequency of a numerical variable’s values
and are used to approximate the probability distribution of the given variable. It
allows you to quickly understand the shape of the distribution, the variation, and
potential outliers.
Boxplots communicate different aspects of the distribution of data. While you can’t
see the shape of the distribution through a box plot, you can gather other information
like the quartiles, the range, and outliers. Boxplots are especially useful when you
want to compare multiple distributions at the same time because they take up less
space than histograms.
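A quick matplotlib sketch contrasting the two plots (the sampled data is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=500)   # arbitrary sample

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=30)        # shows the shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data)              # shows quartiles, range, and outliers
ax2.set_title("Boxplot")
plt.show()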

6. Regularization? (L1, L2, Dropout)


Regularization refers to a set of different techniques that lower the complexity of a
neural network model during training, and thus prevent overfitting. There are three
very popular and efficient regularization techniques called L1, L2, and dropout.
L2 Regularization:
The L2 regularization is the most common type of all regularization techniques and is
also commonly known as weight decay or Ridge Regression.
During the L2 regularization, the loss function of the neural network is extended by a
so-called regularization term, which is called here Ω.

Regularization Term

The regularization term Ω is defined as the Euclidean Norm (or L2 norm) of the
weight matrices, which is the sum of all squared weight values of a weight matrix.
The regularization term is weighted by the scalar alpha divided by two and added to
the regular loss function that is chosen for the current task. This leads to a new
expression for the loss function:

L_new(W) = L(W) + (α/2) · Σ w²   (regularized loss during L2 regularization)


Alpha is sometimes called the regularization rate and is an additional
hyperparameter we introduce into the neural network. Simply speaking alpha
determines how much we regularize our model. In the next step we can compute the
gradient of the new loss function and put the gradient into the update rule for the
weights:
w ← w − η · (∂L/∂w + α · w)   (gradient descent update during L2 regularization)
L2 is less robust but has a stable solution and always one solution.

L1 Regularization:
In the case of L1 regularization (also known as Lasso regression), we simply use
another regularization term Ω. This term is the sum of the absolute values of the
weight parameters in a weight matrix:

Ω = Σ |w|   (regularization term for L1 regularization)

L_new(W) = L(W) + α · Σ |w|   (loss function during L1 regularization)

∂L_new/∂w = ∂L/∂w + α · sign(w)   (gradient of the loss during L1 regularization)


L1 is more robust but has an unstable solution and can possibly have multiple
solutions.

Performing L2 regularization encourages the weight values towards zero (but not
exactly zero).
Performing L1 regularization encourages the weight values to be exactly zero
(i.e. it produces sparse weights).

Alpha Value:
If your alpha value is too high, your model will be simple, but you run the risk of
underfitting your data. Your model won’t learn enough about the training data to
make useful predictions.
If your alpha value is too low, your model will be more complex, and you run the
risk of overfitting your data. Your model will learn too much about the particularities
of the training data, and won’t be able to generalize to new data.

DROPOUT:
Dropout means that during training, each neuron of the neural network gets turned
off with some probability P.
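A hedged PyTorch sketch of both ideas for a simple feed-forward network (layer sizes, dropout rate, and regularization rate are illustrative): the optimizer's weight_decay argument adds the L2 penalty, and nn.Dropout turns activations off with probability P during training.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # each activation is zeroed with probability 0.5 during training
    nn.Linear(64, 1),
)

# weight_decay adds the alpha * w term to each weight's gradient, i.e. L2 regularization
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)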
7. Overfitting and Underfitting
Overfitting refers to a model that models the training data too well. Overfitting
happens when a model learns the detail and noise in the training data to the extent
that it negatively impacts the performance of the model on new data. This means that
the noise or random fluctuations in the training data is picked up and learned as
concepts by the model. The problem is that these concepts do not apply to new data
and negatively impact the model’s ability to generalize. Overfitting is the case where
the overall cost is really small, but the generalization of the model is unreliable. This
is due to the model learning “too much” from the training data set. These models
have low bias and high variance, and they are often very complex models, like
decision trees, which are prone to overfitting.

Underfitting is the case where the model has “not learned enough” from the
training data, resulting in low generalization and unreliable predictions. These
models usually have high bias and low variance. It happens when we have too little
data to build an accurate model, or when we try to fit a linear model to nonlinear
data. Such models are also too simple to capture complex patterns in the data, like
linear and logistic regression.

8. Bias Variance Trade-off


Bias is the difference between the average prediction of our model and the correct
value which we are trying to predict. The model with high bias pays very little
attention to the training data and oversimplifies the model. It always leads to high
errors on training and test data.

Variance is the variability of model prediction for a given data point or a value that
tells us the spread of our data. The model with high variance pays a lot of attention to
training data and does not generalize on the data which it hasn’t seen before. As a
result, such models perform very well on training data but have high error rates on
test data.

Trade-off
If our model is too simple and has very few parameters then it may have high bias
and low variance. On the other hand, if our model has a large number of parameters
then it’s going to have high variance and low bias. So we need to find the right/good
balance without overfitting and underfitting the data. This tradeoff in complexity is
why there is a tradeoff between bias and variance. An algorithm can’t be more
complex and less complex at the same time. This trade-off is the most integral aspect
of Machine Learning model training. As we discussed, Machine Learning models
fulfill their purpose when they generalize well. Generalization is bound by the two
undesirable outcomes — high bias and high variance. Detecting whether the model
suffers from either one is the sole responsibility of the model developer.
9. Cross-Validation
Cross-validation is a validation technique for evaluating how the outcomes of a
statistical analysis will generalize to an independent dataset. It is used in settings
where the objective is prediction and one needs to estimate how accurately a model
will perform in practice. The simplest example of cross-validation is when you split
your data into two groups, training data and testing data, where you use the training
data to build the model and the testing data to test it.
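A short scikit-learn sketch of k-fold cross-validation (the dataset and model are just for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())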

10. DS vs DA vs ML vs AI
Data science is a concept used to tackle big data and includes data cleansing,
preparation, and analysis. A data scientist gathers data from multiple sources and
applies machine learning, predictive analytics, and sentiment analysis to extract
critical information from the collected data sets. They understand data from a
business point of view and can provide accurate predictions and insights that can be
used to power critical business decisions.

A data analyst is usually the person who can do basic descriptive statistics,
visualize data, and communicate data points for conclusions. They must have a basic
understanding of statistics, a perfect sense of databases, the ability to create new
views, and the perception to visualize the data. Data analytics can be referred to as
the necessary level of data science.

Machine learning can be defined as the practice of using algorithms to extract data,
learn from it, and then forecast future trends for that topic. Traditional machine
learning software comprises statistical analysis and predictive analysis used to spot
patterns and surface hidden insights from observed data.

Artificial Intelligence is a field where algorithms are used to perform automatic
actions. Its models are based on the natural intelligence of humans and animals.
Similar patterns from the past are recognized, and related operations are performed
automatically when the patterns are repeated. It utilizes the principles of software
engineering and computational algorithms to develop solutions to a problem. Using
Artificial Intelligence, people can develop automatic systems that provide cost savings
and several other benefits to companies. Large organizations are heavily dependent
on Artificial Intelligence, including tech giants like Facebook, Amazon, and Google.

11. PCA
PCA is a dimensionality-reduction method often used to reduce the dimensionality of
large data sets by transforming a large set of variables into a smaller one that still
contains most of the information in the large set. The idea of PCA is simple: reduce
the number of variables of a data set while preserving as much information as
possible.

→ Standardization: z = (value − mean) / standard deviation
→ Covariance matrix computation
→ Eigenvalues (scalars that stretch the eigenvectors) and eigenvectors (which
determine the principal components)
(We can transform and change matrices into new vectors by multiplying a matrix
with a vector. The multiplication of the matrix by a vector computes a new vector.
This is the transformed vector. If the new transformed vector is just a scaled form of
the original vector then the original vector is known to be an eigenvector of the
original matrix. Vectors that have this characteristic are special and they are known
as eigenvectors.)
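A brief scikit-learn sketch of these steps (standardize, then fit PCA; keeping 2 components is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)      # z = (value - mean) / standard deviation

pca = PCA(n_components=2)                      # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)           # share of variance captured by each component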

12. Clustering
Apply k-means clustering to the dataset and use the typical within-cluster distances
from each point to their assigned cluster’s centroid to make a decision on the number
of clusters to keep.
K-means: K-means clustering is a simple unsupervised learning algorithm that is
used to solve clustering problems. It follows a simple procedure of partitioning a given
data set into a number of clusters, defined by the letter "k," which is fixed beforehand.
Each observation is compared against every centroid and assigned to the centroid
with the smallest distance.
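A minimal scikit-learn sketch of the idea above: fit k-means for several values of k and inspect the within-cluster distances (inertia) to help choose k. The data here is synthetic.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)                    # synthetic data

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its assigned centroid
    print(k, km.inertia_)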

13. Supervised VS Unsupervised VS Reinforcement


Supervised learning: The machine learns from labeled data. Normally, the data is
labeled by humans. Supervised learning works only with labelled data (i.e. it only
accepts labelled data as its input; the machine is trained with this labelled data and a
known output is given).

→ This ‘Output’ is now given as a direct feedback as data to be trained for future use.
Thereby, improving the model’s efficiency with experience.
→ Regression: If you want to predict continuous values, such as trying to predict
the cost of a house or the weather outside in degrees, you would use regression. This
type of problem doesn’t have a specific value constraint because the value could be
any number with no limits.
→ Classification: If you want to predict discrete values, such as classifying
something into categories, you would use classification. A problem like, "Will he
make this purchase" will have an answer that falls into two specific categories: yes or
no. This is also called a binary classification problem.

Unsupervised learning: The machine learns from unlabeled data, meaning there
is no “right” answer given to the machine to learn from; the machine must find
patterns in the data to come up with an answer. The main goal is not to produce a
known output but to discover patterns. → In this case the machine has not learned
anything beforehand (i.e. no training on labels). It has no knowledge about the output
class; the data is “unlabelled”, with unknown values. No supervisor is required, and
these are therefore “self-guided algorithms”.
→ Clustering Problem: Unsupervised learning tries to solve this problem by
looking for similarities in the data. If there is a common cluster or group, the
algorithm would then categorize them in a certain form. An example of this could be
trying to group customers based on past buying behavior.
→ Association Problem: Unsupervised learning tries to solve this problem by
trying to understand the rules and meaning behind different groups. Finding a
relationship between customer purchases is a common example of an association
problem. Stores may want to know what type of products were purchased together
and could possibly use this information to organize the placement of these products
for easier access. One store found out that there was a strong association between
customers buying beer and diapers. They deduced from this statement that males
who had gone out to buy diapers for their babies also tend to buy beer as well.

Reinforcement Learning:
It learns to control the behavior of a system by using observations gathered from the
interaction between an agent and the environment: the agent performs actions that
change the state, with the aim of maximizing the reward or minimizing the risk. Here,
the agent learns continuously from the environment in a recursive manner. The
feedback given to the agent in the form of a reward is used as a learning signal for the
algorithm; this is known as reinforcement signalling.

14. Various Algorithms

1. Linear Regression - This is the go-to method for regression problems. The
linear regression algorithm is used to model the relationship between the
predictor (explanatory) variables and the response variable. This relationship
is a positive, negative, or neutral change between the variables. In its simplest
form, it attempts to fit a straight line to your training data. This line can then
be used as a reference to predict future data.
Linear regression attempts to draw a line that comes closest to the data by
finding the slope and intercept that define the line and minimize regression
errors. A straight line approximates the relationship between the dependent
variable and the independent variable:
y = mx + b, where “y” is your dependent variable (ice cream sales) and “x” is
your independent variable (temperature).
Strengths: Linear regression is very fast to implement, easy to understand,
and is less prone to overfitting. It’s a great go-to algorithm to use as your first
model, and it works really well on linear relationships.
Weaknesses: Linear regression performs poorly when there are non-linear
relationships. It is hard to use on complex data sets.
(A combined scikit-learn sketch of several of the algorithms in this list follows
the list.)

2. Logistic Regression - This is the go-to method for classification and is
commonly used when interpretability matters. It is used to predict the
probability of an event occurring. Logistic regression is an algorithm borrowed
from statistics and uses a logistic/sigmoid function to squash its output into a
value between 0 and 1. Logistic Regression fits an “S-shaped logistic function”
called the “Sigmoid Function”. This S-shaped curve lies in the range 0–1, which
means it gives us the probability of the outcome (Y) given the independent
variables (X).
Strengths: Similar to its sister, linear regression, logistic regression is easy to
interpret and is less prone to overfitting. It’s fast to implement as well and has
surprisingly good accuracy for its simple design.
Weaknesses: This algorithm performs poorly where there are multiple or
non-linear decision boundaries or when capturing complex relationships.
3. K-Nearest Neighbors -- The K-Nearest Neighbors algorithm is one of the
simplest classification techniques. It classifies objects based on their closest
training examples in the feature space. The K in K-Nearest Neighbors refers to
the number of nearest neighbors the model will use for its prediction.
How it works: 1. Assign K a value (preferably a small odd number). 2. Find the
K closest points. 3. Assign the new point to the majority class among those
neighbors.
Strengths: This algorithm is good for large data, it learns well on complex
patterns, it can detect linear or non-linear distributed data, and is robust to
noisy training data.
Weaknesses: It’s hard to find the right K-value, bad for higher dimensional
data, and requires a lot of computation when fitting larger amounts of
features. It’s expensive and slow.

4. Support Vector Machine -- If you want to separate classes using the extreme
values in your dataset, the SVM algorithm is the way to go. It draws a
decision boundary, also known as a hyperplane, that best segregates the two
classes from one another. You can think of it as an algorithm that looks for
some pattern in the data points and finds the best line that can separate the
pattern(s).
S — Support refers to the extreme values/points in your dataset.
V — Vector refers to the values/points in dataset / feature space.
M — Machine refers to the machine learning algorithm that focuses on the
support vectors to classify groups of data. This algorithm literally only focuses
on the extreme points and ignores the rest of the data
Strengths: This model is good for classification problems and can model
very complex dimensions. It’s also good to model non-linear relationships.
Weaknesses: It’s hard to interpret and requires a lot of memory and
processing power. It also does not provide probability estimations and is
sensitive to outliers

5. Decision Tree - A decision tree is made up of nodes. Each node represents a
question about the data, and the branches from each node represent the
possible answers. Visually, this algorithm is very easy to understand. Every
decision tree has a root node, which represents the topmost question. The
order of importance of the features is represented top-down in the nodes: the
higher the node, the more important its property/feature.
Strengths: A decision tree is very easy to understand and visualize. It’s
fast to learn, robust to outliers, and can work on non-linear relationships. This
model is commonly used when you need to understand which features drive a
decision, such as in medical diagnosis and credit risk analysis. It also has
built-in feature selection.
Weaknesses: The biggest drawback of a single decision tree is that it loses
predictive power by not combining overlapping features. Another downside is
the possibility of building an overly complex tree that does not generalize well
to future data; deep trees are hard to interpret, and duplicated subtrees are
possible. Decision trees can also be unstable, because small variations in the
data can produce a very different tree (high variance).
6. Random Forest-- One of the most used and powerful supervised machine
learning algorithms for prediction accuracy. Think of this algorithm as a
bunch of decision trees, instead of one single tree like the Decision Tree
algorithm. This grouping of algorithms, in this case decision trees, is called an
Ensemble Method. Its accurate performance comes from averaging the
predictions of many decision trees. Random Forest is naturally hard to interpret
because of the various combinations of decision trees it uses.
Strengths: Random Forest is known for its great accuracy. It has automatic
feature selection, which identifies what features are most important. It can
handle missing data and imbalanced classes and generalizes well.
Weaknesses: A drawback to random forest is that you have very little
control on what goes inside this algorithm. It’s hard to interpret and won’t
perform well if given a set of bad features.

7. Naïve Bayes algorithm -- This is a supervised learning algorithm, based on
Bayes' theorem and used for solving classification problems. It is mainly used
in text classification with high-dimensional training datasets. The Naïve Bayes
classifier is one of the simplest and most effective classification algorithms and
helps in building fast machine learning models that can make quick
predictions. It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object. Some popular applications of the Naïve Bayes
algorithm are spam filtration, sentiment analysis, and classifying articles.
The name Naïve Bayes comprises two words, Naïve and Bayes, which can be
described as:
● Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red, spherical,
and sweet fruit is recognized as an apple; each feature individually contributes
to identifying it as an apple without depending on the others.
● Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Advantages of Naïve Bayes Classifier:
● Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
● It can be used for Binary as well as Multi-class Classifications.
● It performs well in Multi-class predictions as compared to the other
Algorithms.
● It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
● It is used for Credit Scoring.
● It is used in medical data classification.
● It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
● It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

● Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values instead of
discrete ones, the model assumes these values are sampled from a Gaussian
distribution.
● Multinomial: The Multinomial Naïve Bayes classifier is used when the data
is multinomially distributed. It is primarily used for document classification
problems, i.e. determining which category (such as Sports, Politics, Education,
etc.) a particular document belongs to.
The classifier uses the frequency of words as the predictors.
● Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean variables, such
as whether a particular word is present or not in a document. This model is
also popular for document classification tasks.
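A combined scikit-learn sketch (referenced in item 1) fitting several of the algorithms above on one toy dataset; the dataset and default hyperparameters are illustrative only, not tuned recommendations:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))     # mean accuracy on the held-out data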

15. Loss Functions

A loss function is a measure of how well a prediction model does in terms of being
able to predict the expected outcome. The most commonly used method of finding
the minimum point of a function is gradient descent. Loss functions can be broadly
categorized into two types: classification loss and regression loss.

Regression loss

A. Mean Square Error, Quadratic Loss, L2 Loss
MSE is the most commonly used regression loss function. It is the sum of squared
distances between our target variable and the predicted values.
Magnifies larger errors; reduces smaller errors
Differentiable
Positive errors and negative errors are treated equivalently

B. Mean Absolute Error, L1 Loss

Mean Absolute Error (MAE) is another loss function used for regression models.
MAE is the sum of absolute differences between our target and predicted variables.
So it measures the average magnitude of errors in a set of predictions, without
considering their directions. (If we consider directions also, that would be called
Mean Bias Error (MBE), which is a sum of residuals/errors). The range is also 0 to ∞.

L1 loss is more robust to outliers, but its derivatives are not continuous, making it
inefficient to find the solution. L2 loss is sensitive to outliers, but gives a more stable
and closed form solution (by setting its derivative to 0.)

C. Huber Loss, Smooth Mean Absolute Error

Huber loss is less sensitive to outliers in data than the squared error loss. It’s also
differentiable at 0. It’s basically absolute error, which becomes quadratic when error
is small. How small that error has to be to make it quadratic depends on a
hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MSE when 𝛿
~ 0 and MAE when 𝛿 ~ ∞ (large numbers.)

D. Log-Cosh Loss

Log-cosh is another function used in regression tasks that’s smoother than L2.
Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.

Advantage: log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to
abs(x) - log(2) for large x. This means that 'logcosh' works mostly like the mean
squared error, but will not be so strongly affected by the occasional wildly incorrect
prediction. It has all the advantages of Huber loss, and it’s twice differentiable
everywhere, unlike Huber loss.

E. Quantile loss functions

Quantile loss turns out to be useful when we are interested in predicting an interval
instead of only point predictions. The prediction interval from least-squares
regression is based on the assumption that the residuals (y − y_hat) have constant
variance across values of the independent variables. We cannot trust linear
regression models that violate this assumption. Nor can we simply throw away the
idea of fitting a linear regression model as the baseline by saying that such situations
would always be better modeled using non-linear functions or tree-based models.
This is where quantile loss and quantile regression come to the rescue, as regression
based on quantile loss provides sensible prediction intervals even for residuals with
non-constant variance or a non-normal distribution.
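A small NumPy sketch of these regression losses on toy values (the arrays and the Huber delta are arbitrary):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.5, 4.0])       # the last prediction is badly off
err = y_true - y_pred

mse = np.mean(err ** 2)                       # magnifies the large error
mae = np.mean(np.abs(err))                    # more robust to the outlier

delta = 1.0                                   # Huber: quadratic for small errors, linear for large ones
huber = np.mean(np.where(np.abs(err) <= delta,
                         0.5 * err ** 2,
                         delta * (np.abs(err) - 0.5 * delta)))

logcosh = np.mean(np.log(np.cosh(err)))       # smooth; behaves like MSE for small errors
print(mse, mae, huber, logcosh)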

Classification loss

A. Hinge Loss
Hinge loss is primarily used with Support Vector Machine (SVM) classifiers with
class labels -1 and 1. Hinge loss not only penalizes wrong predictions but also right
predictions that are not confident.

B. Cross-Entropy
Cross-entropy loss, or log loss, measures the performance of a classification model
whose output is a probability value between 0 and 1. Cross-entropy loss increases as
the predicted probability diverges from the actual label.

For two classes, i.e. y(x) ∈ {ω1, ω2} with the classes coded as 0 and 1, and with
f̂(x^(l); θ) the output of the classifier for example x^(l), the binary cross-entropy loss is

L(θ) = − Σ_l [ y^(l) · log f̂(x^(l); θ) + (1 − y^(l)) · log(1 − f̂(x^(l); θ)) ]

Note that the cross entropy between two distributions p and q can be computed as

H(p, q) = − Σ_x p(x) · log q(x),

i.e. the message length when a wrong distribution q is assumed while the data is
actually from some other distribution p. The binary expression above follows if we
also use that P(ω1) = 1 − P(ω2).

C. Kullback-Leibler(KL) Divergence
The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much
one probability distribution differs from another probability distribution.

The KL divergence between two distributions Q and P is often stated using the
following notation:

● KL(P || Q)

Where the “||” operator indicates “divergence”, or P's divergence from Q.

KL divergence can be calculated as the negative sum of the probability of each event
in P multiplied by the log of the probability of that event in Q over the probability of
the event in P:

● KL(P || Q) = – sum x in X P(x) * log(Q(x) / P(x))
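A tiny NumPy sketch of cross-entropy and KL divergence for two discrete distributions (the probability vectors are made up):

import numpy as np

p = np.array([0.10, 0.40, 0.50])      # "true" distribution P (made up)
q = np.array([0.80, 0.15, 0.05])      # assumed/predicted distribution Q (made up)

cross_entropy = -np.sum(p * np.log(q))    # H(p, q): message length when q is assumed
entropy = -np.sum(p * np.log(p))          # H(p)
kl_pq = -np.sum(p * np.log(q / p))        # KL(P || Q) = H(p, q) - H(p)

print(cross_entropy, entropy, kl_pq)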


NumPy (Numerical Python)
NumPy is a Python library used for working with arrays.
It also has functions for working in domain of linear algebra, fourier transform, and
matrices.
Uses--
In Python we have lists that serve the purpose of arrays, but they are slow to process.

NumPy aims to provide an array object that is up to 50x faster than traditional
Python lists.

The array object in NumPy is called ndarray, it provides a lot of supporting functions
that make working with ndarray very easy.

NumPy arrays are stored at one continuous place in memory unlike lists, so
processes can access and manipulate them very efficiently.

This behavior is called locality of reference in computer science.

This is the main reason why NumPy is faster than lists. It is also optimized to
work with the latest CPU architectures.

alias: In Python, an alias is an alternate name for referring to the same thing, e.g.
importing NumPy as np.

import numpy as np
arr = np.array([1, 2, 3, 4, 5])

print(np.__version__)

type(): This built-in Python function tells us the type of the object passed to it.
The array object in NumPy is called ndarray. We can create a NumPy ndarray object
by using the array() function.

nested array: arrays that have arrays as their elements.

0-D arrays are scalars. An array that has 0-D arrays as its elements is called a
uni-dimensional or 1-D array.
An array that has 1-D arrays as its elements is called a 2-D array. These are often used
to represent matrices or 2nd-order tensors.
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
NumPy arrays provide the ndim attribute, which returns an integer telling us how
many dimensions the array has.
An array can have any number of dimensions. When the array is created, you can
define the number of dimensions by using the ndmin argument.
arr = np.array([1, 2, 3, 4], ndmin=5)

The shape attribute gives the number of rows and columns.

Array indexing is the same as accessing an array element. You can access an array
element by referring to its index number. The indexes in NumPy arrays start with 0,
meaning that the first element has index 0 and the second has index 1.

For a 3-D array, indexing uses one index per dimension, outermost dimension first,
e.g. arr[0, 1, 1].
Use negative indexing to access an array from the end.

Slicing in python means taking elements from one given index to another given
index.
We pass slice instead of index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it's considered 0.
If we don't pass end, it's considered the length of the array in that dimension.
If we don't pass step, it's considered 1.
The result includes the start index but excludes the end index. The same goes for
negative indexing.

Data Types in Python

By default Python have these data types:

● strings - used to represent text data, the text is given under quote marks. e.g.
"ABCD"
● integer - used to represent integer numbers. e.g. -1, -2, -3
● float - used to represent real numbers. e.g. 1.2, 42.42
● boolean - used to represent True or False.
● complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j

NumPy has some extra data types, and refer to data types with one character, like i
for integers, u for unsigned integers etc.

Below is a list of all data types in NumPy and the characters used to represent them.

● i - integer
● b - boolean
● u - unsigned integer-- no signs associated +/-
● f - float
● c - complex float
● m - timedelta-- difference between 2 times
● M - datetime
● O - object
● S - string
● U - unicode string
● V - fixed chunk of memory for other type ( void )

The NumPy array object has a property called dtype that returns the data type of the
array
The best way to change the data type of an existing array, is to make a copy of the
array with the astype() method.
The astype() function creates a copy of the array, and allows you to specify the data
type as a parameter.
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
newarr = arr.astype(int)

The main difference between a copy and a view of an array is that the copy is a new
array, and the view is just a view of the original array.

The copy owns the data and any changes made to the copy will not affect original
array, and any changes made to the original array will not affect the copy.

The view does not own the data and any changes made to the view will affect the
original array, and any changes made to the original array will affect the view.

The copy SHOULD NOT be affected by the changes made to the original array
The view SHOULD be affected by the changes made to the original array.

NumPy arrays have an attribute called shape that returns a tuple with each index
having the number of corresponding elements.

By reshaping we can add or remove dimensions or change the number of elements in
each dimension.

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
(2 blocks, 3 rows, 2 columns)

We can reshape an 8-element 1-D array into a 2-D array with 2 rows of 4 elements,
but we cannot reshape it into a 2-D array with 3 rows of 3 elements, as that would
require 3x3 = 9 elements.

You are allowed to have one "unknown" dimension, meaning that you do not have to
specify an exact number for one of the dimensions in the reshape method. Pass -1 as
the value, and NumPy will calculate this number for you. We cannot pass -1 for more
than one dimension.

Flattening an array means converting a multidimensional array into a 1-D array. We
can use reshape(-1) to do this.
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
[1 2 3 4 5 6]

flatten() returns a copy; ravel() returns a view.

The function nditer() is a helper function that can be used for everything from very
basic to very advanced iteration. It solves some basic issues which we face in
iteration; let's go through it with examples.

arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(arr):
    print(x)

The output is the numbers 1 to 8, one per line.

We can use op_dtypes argument and pass it the expected datatype to change the
datatype of elements while iterating.

NumPy does not change the data type of the element in-place (where the element is
in array) so it needs some other space to perform this action, that extra space is
called buffer, and in order to enable it in nditer() we pass flags=['buffered'].

arr = np.array([1, 2, 3])
for x in np.nditer(arr, flags=['buffered'], op_dtypes=['S']):
    print(x)

Enumeration means mentioning the sequence number of items one by one.

Sometimes we need the corresponding index of the element while iterating; the
ndenumerate() method can be used for those use cases.

arr = np.array([1, 2, 3])
for idx, x in np.ndenumerate(arr):
    print(idx, x)

(0,) 1
(1,) 2
(2,) 3

We pass a sequence of arrays that we want to join to the concatenate() function.

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))

Stacking is the same as concatenation; the only difference is that stacking is done
along a new axis.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)

[[1 4]
 [2 5]
 [3 6]]

hstack() to stack along rows.
vstack() to stack along columns.
dstack() to stack along height, which is the same as depth.

Splitting is the reverse operation of joining. Joining merges multiple arrays into one,
and splitting breaks one array into multiple. We use array_split() for splitting arrays;
we pass it the array we want to split and the number of splits.

arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)

[array([1, 2]), array([3, 4]), array([5]), array([6])]

We also have the method split() available, but it will not adjust the elements when
the source array has fewer elements than required for an even split. In the example
above, array_split() worked properly, but split() would fail.

You can search an array for a certain value, and return the indexes that get a
match.
To search an array, use the where() method.

There is a method called searchsorted() which performs a binary search in the array
and returns the index where the specified value would be inserted to maintain the
sort order.
The searchsorted() method is assumed to be used on sorted arrays.
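A short example of both search methods (the arrays are toy values):

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])
idx = np.where(arr == 4)               # indexes of every match
print(idx)                             # (array([3, 5, 6]),)

sorted_arr = np.array([6, 7, 8, 9])
pos = np.searchsorted(sorted_arr, 7)   # insertion point that keeps the array sorted
print(pos)                             # 1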

Sorting means putting elements in an ordered sequence. An ordered sequence is any
sequence that has an order corresponding to its elements, like numeric or
alphabetical, ascending or descending. The NumPy ndarray object has a function
called sort() that will sort a specified array.

Getting some elements out of an existing array and creating a new array out of them
is called filtering. In NumPy, you filter an array using a boolean index list. A boolean
index list is a list of booleans corresponding to indexes in the array. If the value at an
index is True, that element is contained in the filtered array; if the value at that index
is False, that element is excluded from the filtered array.
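A quick example of sorting and boolean-index filtering (the values are arbitrary):

import numpy as np

arr = np.array([41, 40, 43, 42])
print(np.sort(arr))               # [40 41 42 43]

mask = arr > 41                   # boolean index list: [False False  True  True]
print(arr[mask])                  # [43 42] - only elements where the mask is True

Note that np.sort() returns a sorted copy, while arr.sort() sorts the array in place.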

NLP

1. What is Natural Language Processing (NLP)?


Natural language processing is a field at the intersection of
• computer science
• artificial intelligence
• and linguistics (the study of language)

• Goal: for computers to process or "understand" natural language in order to
perform useful tasks, e.g.
• performing tasks such as making appointments or buying things
• question answering: Siri, Google Assistant, Facebook M, Cortana
• Fully understanding and representing the meaning of language (or even defining it)
is a difficult goal.
• Perfect language understanding is AI-complete.
Other topics to review:
Gradient descent
Bagging and boosting
Confusion matrix, recall, precision, F1 score, p-value, hypothesis testing, adjusted R²
Train/test/validation split
Activation functions
DBSCAN
Semi-supervised, zero-shot, and few-shot learning
Pruning
Adam, XGBoost, lbss
