Model_Qp_Scheme-2
Module-1
1. Explain Data Science profile. Explain the work of Data Scientist in Academia and
Industry.
This model so far seems to suggest this will all magically happen without human intervention. By
“human” here, we mean “data scientist.” Someone has to make the decisions about what data to collect,
and why.
That person needs to be formulating questions and hypotheses and making a plan for how the
problem will be attacked. And that someone is the data scientist or our beloved data science
team.
In Academia:
The reality is that currently, no one calls themselves a data scientist in academia, except to
take on a secondary title for the sake of being a part of a “data science institute” at a
university, or for applying for a grant that supplies money for data science research.
An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data and must grapple with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.
In Industry:
What do data scientists look like in industry? It depends on the level of seniority and whether you're talking about the Internet/online industry in particular. A chief data scientist should be setting the data strategy of the company, which involves a variety of things, starting with setting up the engineering and infrastructure for collecting and logging data.
Fitting a model:
Fitting a model means that you estimate the parameters of the model using the observed
data. You are using your data as evidence to help approximate the real-world
mathematical process that generated the data. Fitting the model often involves
optimization methods and algorithms, such as maximum likelihood estimation, to help
get the parameters.
Fitting the model is when you start actually coding: your code will read in the data, and
you’ll specify the functional form that you wrote down on the piece of paper. Then R or
Python will use built-in optimization methods to give you the most likely values of the
parameters given the data.
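As a minimal sketch of what "fitting" looks like in code, the example below simulates data from a known linear process and uses SciPy's curve_fit optimizer to recover the parameters; the functional form and data are made up purely for illustration.

# A minimal sketch of fitting a model in Python (not from the original notes).
# The data here is simulated purely for illustration.
import numpy as np
from scipy import optimize

# Simulated observations from a process we pretend not to know: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)

# Functional form written down "on paper": y = beta0 + beta1 * x
def model(x, beta0, beta1):
    return beta0 + beta1 * x

# curve_fit uses a built-in optimization routine (least squares) to find the
# most likely parameter values given the data.
params, _ = optimize.curve_fit(model, x, y)
print("Estimated beta0, beta1:", params)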
Overfitting:
Overfitting is the term used to mean that you used a dataset to estimate the parameters of
your model, but your model isn’t that good at capturing reality beyond your sampled
data.
Statistical inference:
After separating the process from the data collection, we can see clearly that there are two sources of randomness and uncertainty: the randomness and uncertainty underlying the process itself, and the uncertainty associated with the data collection methods.
This overall process of going from the world to the data, and then from the data back to
the world, is the field of statistical inference.
Statistical inference is the discipline that concerns itself with the development of
procedures, methods, and theorems that allow us to extract meaning and information
from data that has been generated by stochastic processes.
Types of data:
A strong data scientist needs to be versatile and comfortable with dealing with a variety of types of data,
including:
Traditional: numerical, categorical, or binary
Text: emails, tweets, New York Times articles (see Chapter 4 or Chapter 7)
Records: user-level data, time stamped event data, json formatted log files
Geo-based location data: briefly touched on in this chapter with NYC housing data
Network
Sensor data
Images
Populations and Samples:
Sampling solves some engineering challenges
In the current popular discussion of Big Data, the focus on enterprise solutions such as Hadoop to
handle engineering and computational challenges caused by too much data overlooks sampling as
a legitimate solution.
Even if we have access to all of Facebook’s or Google’s or Twitter’s data corpus, any
inferences we make from that data should not be extended to draw conclusions about
humans beyond those sets of users, or even those users for any particular day.
Probability distributions are the foundation of statistical models. Back in the day, before
computers, scientists observed real-world phenomena, took measurements, and noticed that
certain mathematical shapes kept reappearing. The classical example is the height of humans,
following a normal distribution—a bell-shaped curve, also called a Gaussian distribution, named
after Gauss. Other common shapes have been named after their observers as well (e.g., the
Poisson distribution and the Weibull distribution), while other shapes such as Gamma
distributions or exponential distributions are named after associated mathematical objects.
Natural processes tend to generate measurements whose empirical shape could be approximated
by mathematical functions with a few parameters that could be estimated from the data.
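As an illustration of estimating a distribution's parameters from data, the sketch below fits a Gaussian to simulated height measurements with SciPy; the numbers are invented, not real measurements.

# A sketch of estimating the parameters of a distribution from data
# (the height data below is simulated, not real measurements).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
heights = rng.normal(loc=170, scale=8, size=1000)   # pretend these are observed heights in cm

# Fit a Gaussian: maximum likelihood estimates of the mean and standard deviation.
mu, sigma = stats.norm.fit(heights)
print(f"Estimated mean = {mu:.1f} cm, std dev = {sigma:.1f} cm")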
Module-2
Disadvantages:
Choosing K:
Number of Clusters: One significant challenge is determining the optimal number of clusters (k). The algorithm requires k to be specified in advance, and selecting the right value of k is often difficult without prior knowledge of the data.
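The notes do not name the algorithm here, but assuming it is k-means, the sketch below shows the common "elbow" heuristic for picking k, using scikit-learn on synthetic data.

# A sketch of the "elbow" heuristic for choosing k, assuming the algorithm in
# question is k-means and that scikit-learn is available. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for several candidate values of k and record the within-cluster
# sum of squares (inertia); the "elbow" in this curve suggests a reasonable k.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))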
3. Explain Linear Regression with graphs to represent linear relationship, fitting the
Model and Cross Validation
Linear Regression
One of the most common statistical methods is linear regression. It’s used to express the
mathematical relationship between two variables or attributes.
The assumption is that there is a linear relationship between an outcome variable and a
predictor, between one variable and several other variables, in which case you’re modeling
the relationship as having a linear structure.
It makes sense that changes in one variable correlate linearly with changes in another variable. For example, it makes sense that the more umbrellas you sell, the more money you make. In those cases you can feel good about the linearity assumption.
y = f(x) = β0 + β1x
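A minimal sketch of fitting this model and checking it with k-fold cross-validation, assuming scikit-learn is available; the data is simulated for illustration.

# A sketch of fitting a linear regression and evaluating it with k-fold
# cross-validation, assuming scikit-learn; the data is simulated for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
x = rng.uniform(0, 50, 200).reshape(-1, 1)          # predictor
y = 5.0 + 1.2 * x.ravel() + rng.normal(0, 4, 200)   # outcome = beta0 + beta1*x + noise

# Fit the model: estimates beta0 (intercept) and beta1 (slope) by least squares.
reg = LinearRegression().fit(x, y)
print("beta0 =", reg.intercept_, "beta1 =", reg.coef_[0])

# 5-fold cross-validation: hold out one fold at a time and score on it, which
# guards against judging the fit only on the data used to estimate it.
scores = cross_val_score(reg, x, y, cv=5, scoring="r2")
print("Cross-validated R^2 scores:", scores)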
4. Discuss the data science process. Write about a Data Scientist's role in this process.
We want to process this to make it clean for analysis. So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you want to call it. To do this we use tools such as Python, shell
scripts, R, or SQL, or all of the above.
Now the key here that makes data science special and distinct from statistics is that this data product then
gets incorporated back into the real world, and users interact with that product, and that generates more
data, which creates a feedback loop.
A Data Scientist’s Role in This Process-
This model so far seems to suggest this will all magically happen without human
intervention. By “human” here, we mean “data scientist.” Someone has to make the
decisions about what data to collect, and why.
That person needs to be formulating questions and hypotheses and making a plan for how
the problem will be attacked. And that someone is the data scientist or our beloved data
science team.
Module-3
1. Discuss the difference between filter and wrapper methods in feature selection.
2. Discuss feature selection and feature extraction in the context of machine learning
The idea of feature selection is identifying the subset of data or transformed data that
you want to put into your model.
It’s an important part of building statistical models and algorithms in general. Just
because you have data doesn’t mean it all has to go into the model.
For example, it’s possible you have many redundancies or correlated variables in your
raw data, and so you don’t want to include all those variables in your model.
Similarly, you might want to construct new variables by transforming the variables
with a logarithm, say, or turning a continuous variable into a binary variable, before
feeding them into the model.
If the number of features is larger than the number of observations, or if we have a sparsity
problem, then large isn’t necessarily good. And if the huge data just makes it hard to
manipulate because of computational reasons (e.g., it can’t all fit on one computer, so the data
needs to be shared across multiple machines) without improving our signal, then that’s a net
negative.
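As a rough illustration of these ideas, the sketch below drops a redundant, highly correlated column and constructs transformed features with pandas; the column names and the 0.95 correlation threshold are hypothetical.

# A sketch of simple feature selection and construction with pandas.
# Column names and thresholds here are hypothetical, for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "income": rng.lognormal(10, 1, 500),
    "spend": rng.lognormal(9, 1, 500),
    "age": rng.integers(18, 80, 500),
})
df["spend_dup"] = df["spend"] * 1.001          # a redundant, almost perfectly correlated column

# Drop one of each pair of highly correlated (redundant) features.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

# Construct new variables: a log transform and a binary indicator.
df["log_income"] = np.log(df["income"])
df["is_senior"] = (df["age"] >= 60).astype(int)
print(df.head())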
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are both
techniques used for dimensionality reduction in data analysis, but they differ in their
approaches and applications.
• SVD decomposes a matrix X (say, an m×n matrix of users by items) as X = USVᵀ, where U is m×k, S is k×k, and Vᵀ is k×n; the columns of U and V are pairwise orthogonal, and S is diagonal. Note that the standard statement of SVD is slightly more involved and has U and V both square unitary matrices, with the middle "diagonal" matrix rectangular. We'll be using this form, because we're going to be taking approximations to X of increasingly smaller rank.
• Each row of U corresponds to a user, whereas V has a row for each item. The values
along the diagonal of the square matrix S are called the “singular values.”
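A small sketch of this decomposition and a rank-k approximation with NumPy, using a made-up user-item ratings matrix:

# A sketch of SVD on a small (made-up) user-item ratings matrix using NumPy.
import numpy as np

X = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)   # rows = users, columns = items

# full_matrices=False gives the compact form discussed above.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", np.round(s, 2))

# Rank-k approximation: keep only the k largest singular values (latent features).
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_approx, 1))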
Dimensionality reduction-
• Let’s think about how we reduce dimensions and create “latent features” internally
every day.
• For example, people invent concepts like “coolness,” but we can’t directly measure
how cool someone is. Other people exhibit different patterns of behavior, which we
internally map or reduce to our one dimension of “coolness.”
• Two things are happening here: the dimensionality is reduced from many observed behaviors into a single feature, and that feature is latent, meaning it cannot be measured directly.
• "Important" in this context means that the latent features explain the variance in the answers to the various questions; in other words, they model the answers efficiently.
• Our goal is to build a model that has a representation in a low dimensional subspace
that gathers “taste information” to generate recommendations. So we’re saying here
that taste is latent but can be approximated by putting together all the observed
information we do have about the user.
Also consider that most of the time, the rating questions are binary (yes/no). To deal with
this, Hunch created a separate variable for every question. They also found that comparison
questions may be better at revealing preferences.
Module-4
Data representation refers to the form in which you can store, process, and transmit data. Representations are a useful apparatus to derive insights from the data. Thus, representations transform data into useful information. Data visualization reveals patterns and correlations, such as the positive relationship between body mass and maximum longevity in animals, that are not easily discernible in raw data.
2. Explain Data Wrangling. Draw the data wrangling process to measure employee
engagement
Data wrangling is the process of transforming raw data into a suitable representation for various
tasks. It is the discipline of augmenting, cleaning, filtering, standardizing, and enriching data in a
way that allows it to be used in a downstream task, which in our case is data visualization.
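A minimal sketch of such a pipeline with pandas, using a hypothetical employee engagement survey table (the column names and values are invented):

# A minimal sketch of a data wrangling pipeline with pandas. The employee
# engagement survey data and column names are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "department": ["Sales", "sales", "sales", "HR", None],
    "engagement_score": [7.5, 6.0, 6.0, None, 9.0],
})

wrangled = (
    raw
    .drop_duplicates()                                        # cleaning: remove duplicate rows
    .dropna(subset=["department"])                            # filtering: drop unusable records
    .assign(department=lambda d: d["department"].str.title()) # standardizing: consistent labels
    .assign(engagement_score=lambda d:
            d["engagement_score"].fillna(d["engagement_score"].mean()))  # enriching/imputing
)
print(wrangled)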
3. Discuss Radar Chart with faceting for multiple variables. What are the design practices to
be followed for Radar Charts?
Radar charts (also known as spider or web charts) visualize multiple variables with each variable
plotted on its own axis, resulting in a polygon. All axes are arranged radially, starting at the
center with equal distances between one another, and have the same scale.
Design Practices
• Try to display 10 factors or fewer on a single radar chart to make it easier to read.
• Use faceting (displaying each variable or group in a separate plot) for multiple variables/groups in order to maintain clarity; a sketch of this follows below.
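Here is such a sketch of a faceted radar chart in Matplotlib; the attributes, groups, and scores are made up for illustration.

# A minimal sketch of a radar chart with faceting in Matplotlib.
# The variables, groups, and values below are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

attributes = ["Speed", "Quality", "Cost", "Support", "Usability"]
groups = {
    "Product A": [4, 3, 2, 5, 4],
    "Product B": [3, 5, 4, 2, 3],
}

# Angles for each axis; repeat the first angle to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(attributes), endpoint=False).tolist()
angles += angles[:1]

# Faceting: one polar subplot per group keeps each chart easy to read.
fig, axes = plt.subplots(1, len(groups), subplot_kw={"projection": "polar"},
                         figsize=(10, 5))
for ax, (name, values) in zip(axes, groups.items()):
    vals = values + values[:1]          # close the polygon
    ax.plot(angles, vals, linewidth=2)
    ax.fill(angles, vals, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(attributes)
    ax.set_title(name)

plt.tight_layout()
plt.show()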
4. Explain the differences between a pie chart and a donut chart with respect to data visualization
Pie Chart
• Pie charts illustrate numerical proportions by dividing a circle into slices.
• Each arc length represents a proportion of a category.
• The full circle equates to 100%.
Design Practices
• Arrange the slices according to their size in increasing/decreasing order, either in a
clockwise or counterclockwise manner.
• Make sure that every slice has a different color
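A minimal sketch of a pie chart with Matplotlib, using invented category shares ordered by decreasing size:

# A sketch of a pie chart with Matplotlib; the category data is made up.
import matplotlib.pyplot as plt

labels = ["Product A", "Product B", "Product C", "Other"]
sizes = [45, 30, 15, 10]                      # proportions that sum to 100%

# Slices ordered by decreasing size, each with its own color.
plt.pie(sizes, labels=labels, autopct="%1.0f%%", startangle=90)
plt.axis("equal")                              # keep the pie circular
plt.title("Share of sales by product")
plt.show()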
Donut charts
Donut charts are also more space-efficient because the center is cut out, so the center can be used to display information or to further divide groups into subgroups.
Design Practices
• Use the same color that’s used for the category for the subcategories. Use varying
brightness levels for the different subcategories.
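A minimal sketch of a donut chart, reusing the same invented data and cutting out the center via the wedge width:

# A sketch of a donut chart: a pie chart with the center cut out via the
# wedgeprops width. Data is made up for illustration.
import matplotlib.pyplot as plt

labels = ["Product A", "Product B", "Product C", "Other"]
sizes = [45, 30, 15, 10]

plt.pie(sizes, labels=labels, autopct="%1.0f%%",
        wedgeprops={"width": 0.4})            # width < 1 turns the pie into a ring
plt.text(0, 0, "Sales\n2023", ha="center", va="center")   # use the free center for information
plt.show()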
Scatter plot
Scatter plots show data points for two numerical variables, displaying a variable on both axes.
Design Practices
• Start both axes at zero to represent data accurately.
• Use contrasting colors for data points and avoid using symbols for scatter plots with
multiple groups or categories
Uses
• You can detect whether a correlation (relationship) exists between two variables.
• They allow you to plot the relationship between multiple groups or categories using
different colors.
Bubble plot
• A bubble plot extends a scatter plot by introducing a third numerical variable.
• The value of the variable is represented by the size of the dots.
• The area of the dots is proportional to the value.
• A legend is used to link the size of the dot to an actual numerical value.
Design Practices
• The design practices for the scatter plot are also applicable to the bubble plot.
• Don't use bubble plots for very large amounts of data, since too many bubbles make the chart difficult to read.
Uses
Bubble plots help to show a correlation between three variables
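A sketch of a bubble plot with Matplotlib, where the marker area encodes a third, invented variable:

# A sketch of a bubble plot: a scatter plot whose marker area encodes a third
# numerical variable. All values below are made up.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 20)        # first variable
y = rng.uniform(0, 10, 20)        # second variable
z = rng.uniform(10, 100, 20)      # third variable, shown as bubble size

# s is in points squared, so scale z to keep the bubbles readable.
scatter = plt.scatter(x, y, s=z * 5, alpha=0.5)

# Legend linking marker sizes back to values of the third variable
# (func undoes the scaling applied above).
handles, legend_labels = scatter.legend_elements(prop="sizes", num=4,
                                                 func=lambda s: s / 5)
plt.legend(handles, legend_labels, title="z")
plt.show()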
Module-5
1. Explain Box plot and Scatter plot in Matplotlib using plt.boxplot() and plt.scatter()
commands
Box Plot-
o The box plot shows multiple statistical measurements. The box extends from the lower
to the upper quartile values of the data, thereby allowing us to visualize the interquartile
range.
o The plt.boxplot(x) function creates a box plot.
o Important parameters:
• x: Specifies the input data, either a 1D array for a single box or a sequence of arrays for multiple boxes.
• notch: (optional) If true, notches will be added to the plot to indicate the confidence
interval around the median.
• labels: (optional) Specifies the labels as a sequence.
• showfliers: (optional) By default, it is true, and outliers are plotted beyond the caps.
• showmeans: (optional) If true, arithmetic means are shown
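A minimal sketch of plt.boxplot() using the parameters listed above, on randomly generated data:

# A sketch of plt.boxplot() with the parameters described above; the data is random.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
data = [rng.normal(0, std, 100) for std in (1, 2, 3)]   # sequence of arrays -> multiple boxes

plt.boxplot(data,
            notch=True,                       # confidence interval around the median
            labels=["std 1", "std 2", "std 3"],
            showmeans=True,                   # mark the arithmetic means
            showfliers=True)                  # plot outliers beyond the caps (default)
plt.show()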
Scatter Plot-
o Scatter plots show data points for two numerical variables, displaying a variable on both
axes.
o plt.scatter(x, y) creates a scatter plot of y versus x, with optionally varying marker size
and/or color.
Important parameters:
• x, y: Specifies the data positions.
• s: (optional) Specifies the marker size in points squared.
• c: (optional) Specifies the marker color. If a sequence of numbers is specified, the
numbers will be mapped to the colors of the color map.
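A minimal sketch of plt.scatter() with varying marker sizes and colors, on randomly generated data:

# A sketch of plt.scatter() with varying marker size and color; random data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 50)
y = rng.uniform(0, 1, 50)
sizes = rng.uniform(20, 200, 50)    # s: marker size in points squared
colors = rng.uniform(0, 1, 50)      # c: numbers mapped to the colormap

plt.scatter(x, y, s=sizes, c=colors, cmap="viridis")
plt.colorbar(label="mapped value")
plt.show()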
Ticks-
• Tick locations and labels can be set manually if Matplotlib's default isn't sufficient.
• Considering the previous plot, it might be preferable to only have ticks at multiples of one on the x-axis. One way to accomplish this is to use plt.xticks() and plt.yticks() to either get or set the ticks manually.
• plt.xticks(ticks, [labels], [**kwargs]) sets the current tick locations and labels of the x-
axis.
Parameters:
• ticks: List of tick locations; if an empty list is passed, ticks will be disabled.
• labels (optional): You can optionally pass a list of labels for the specified locations.
• **kwargs (optional): matplotlib.text.Text() properties can be used to customize the
appearance of the tick labels. A quite useful property is rotation; this allows you to rotate
the tick labels to use space more efficiently.
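A minimal sketch of setting ticks manually with plt.xticks(); the tick positions and rotation are chosen only for illustration:

# A sketch of setting tick locations and labels manually with plt.xticks().
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 100)
plt.plot(x, x ** 2)

# Ticks only at whole numbers on the x-axis, with rotated labels.
plt.xticks(ticks=[0, 1, 2, 3, 4, 5],
           labels=["0", "1", "2", "3", "4", "5"],
           rotation=45)
plt.show()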
plt.close()
Figures that are no longer used should be closed by explicitly calling plt.close(), which also
cleans up memory efficiently.
• If nothing is specified, the plt.close() command will close the current Figure.
• To close a specific Figure, you can either provide a reference to a Figure instance or
provide the Figure number. To find the number of a Figure object, we can make use of
the number attribute, as follows:
plt.gcf().number
• The plt.close('all') command is used to close all active Figures. The following example
shows how a Figure can be created and closed:
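A minimal sketch of such an example (the original code is not included in the notes):

# Creating and closing Figures, as described above.
import matplotlib.pyplot as plt

fig1 = plt.figure()          # create a Figure
fig2 = plt.figure()          # create a second Figure

print(plt.gcf().number)      # number of the current Figure (here, 2)

plt.close(fig1)              # close a specific Figure by reference
plt.close(2)                 # or by its Figure number
plt.close('all')             # close all remaining Figures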
i) Labels-
• Matplotlib provides a few label functions that we can use for setting labels to the x-
and y-axes.
• The plt.xlabel() and plt.ylabel() functions are used to set the label for the current
axes. The set_xlabel() and set_ylabel() functions are used to set the label for specified
axes.
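A minimal sketch of both label-setting approaches:

# Setting axis labels for the current Axes and for a specific Axes.
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])
plt.xlabel("x value")                 # label for the current Axes
plt.ylabel("y value")

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_xlabel("x value")              # label for a specified Axes
ax.set_ylabel("y value")
plt.show()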
ii) Titles-
• A title describes a particular chart/graph.
• The titles are placed above the axes in the center, left edge, or right edge. There are
two options for titles – you can either set the Figure title or the title of an Axes.
• The plt.suptitle() function sets the title for the current Figure (use Figure.suptitle() for a specific Figure). The plt.title() function sets the title for the current Axes (use Axes.set_title() for a specific Axes).
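A minimal sketch of setting a Figure title and an Axes title:

# Setting a Figure title and an Axes title.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

fig.suptitle("Figure title")          # title of the whole Figure
ax.set_title("Axes title")            # title of this Axes (plt.title() for the current Axes)
plt.show()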
iii) Annotations-
• Compared to text that is placed at an arbitrary position on the Axes, annotations are used
to annotate some features of the plot.
• In annotations, there are two locations to consider: the annotated location, xy, and the location of the annotation text, xytext. It is useful to specify the arrowprops parameter, which results in an arrow pointing to the annotated location.
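A minimal sketch of plt.annotate() with xy, xytext, and arrowprops; the curve and annotation text are made up:

# Annotating a feature of a plot with plt.annotate(), using xy, xytext, and arrowprops.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))

plt.annotate("local maximum",
             xy=(np.pi / 2, 1),                  # the annotated location
             xytext=(3, 1.2),                    # where the annotation text is placed
             arrowprops={"arrowstyle": "->"})    # arrow pointing to the annotated location
plt.ylim(-1.5, 1.5)
plt.show()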
iv) Legends-
• A legend describes the content of the plot. To add a legend to your Axes, we have to
specify the label parameter at the time of plot creation.
• Calling plt.legend() for the current Axes or Axes.legend() for a specific Axes will add the
legend. The loc parameter specifies the location of the legend.
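A minimal sketch of adding a legend via label= and plt.legend():

# Adding a legend: specify label= at plot time, then call plt.legend().
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9], label="squared")
plt.plot([1, 2, 3], [1, 8, 27], label="cubed")
plt.legend(loc="upper left")     # loc controls where the legend is placed
plt.show()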
5. Discuss the basic image operations that can be performed in Matplotlib with example
Basic Image Operations- The following are basic operations for loading and working with images.
Loading Images-
In Matplotlib, loading images is part of the image submodule. We use the alias mpimg for the
submodule, as follows:
import matplotlib.image as mpimg
The mpimg.imread(fname) function reads an image and returns it as a numpy.array object. For
grayscale images, the returned array has a shape (height, width), for RGB images (height, width,
3), and for RGBA images (height, width, 4).
The array values range from 0 to 255.
We can also load the image in the following manner:
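The original example is missing here; one common alternative (shown below as an assumption) is loading the file with Pillow and converting it to a NumPy array, alongside mpimg.imread(). The file name "example.png" is a placeholder.

# Reading an image with mpimg.imread() and displaying it; the file name
# "example.png" is a placeholder. The Pillow route is one common alternative,
# shown here as an assumption since the original example is missing.
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

img = mpimg.imread("example.png")     # numpy array: (height, width) or (height, width, 3/4)
print(img.shape)

img_pil = np.asarray(Image.open("example.png"))   # alternative: load via Pillow

plt.imshow(img)
plt.axis("off")
plt.show()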