
CHILDREN’S EDUCATION SOCIETY (REGD)

THE OXFORD COLLEGE OF ENGINEERING


Bommanahalli, Hosur Road, Bangalore – 68
080-30219601, Fax: 080-25730551, 30219629,
Website: http://www.theoxford.edu/engineering/
(Approved by AICTE, NBA, New Delhi & Affiliated to VTU, Belgaum)
DATA SCIENCE AND VISUALIZATION (21CS644)

MODEL QUESTION PAPER SOLUTION -2

Module-1

1. Explain the Data Science profile. Explain the work of a Data Scientist in Academia and
Industry.
This model so far seems to suggest this will all magically happen without human intervention. By
“human” here, we mean “data scientist.” Someone has to make the decisions about what data to collect,
and why.
That person needs to be formulating questions and hypotheses and making a plan for how the
problem will be attacked. And that someone is the data scientist or our beloved data science
team.
In Academia:

 The reality is that currently, no one calls themselves a data scientist in academia, except to
take on a secondary title for the sake of being a part of a “data science institute” at a
university, or for applying for a grant that supplies money for data science research.
 An academic data scientist is a scientist, trained in anything from social science to biology,
who works with large amounts of data, and must grapple with computational problems
posed by the structure, size, messiness, and the complexity and nature of the data, while
simultaneously solving a real-world problem.

In Industry:

 What do data scientists look like in industry? It depends on the level of seniority and
whether you’re talking about the Internet/online industry in particular. A chief data
scientist should be setting the data strategy of the company, which involves a variety of
things: setting everything up from the engineering and infrastructure for collecting data and
logging.

2. Explain Fitting and Overfitting the Model

Fitting a model:
 Fitting a model means that you estimate the parameters of the model using the observed
data. You are using your data as evidence to help approximate the real-world
mathematical process that generated the data. Fitting the model often involves
optimization methods and algorithms, such as maximum likelihood estimation, to help
get the parameters.
 Fitting the model is when you start actually coding: your code will read in the data, and
you’ll specify the functional form that you wrote down on the piece of paper. Then R or
Python will use built-in optimization methods to give you the most likely values of the
parameters given the data.
Overfitting:
 Overfitting is the term used to mean that you used a dataset to estimate the parameters of
your model, but your model isn’t that good at capturing reality beyond your sampled
data.
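As a minimal sketch (assuming scikit-learn and NumPy are available; the data here is synthetic and illustrative), fitting a linear model and checking for overfitting by comparing training error with held-out error might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data from a noisy linear process (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 2.5 * x.ravel() + 1.0 + rng.normal(0, 2, size=200)

# Hold out part of the data so we can detect overfitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Fitting the model: estimate the parameters from the observed data
model = LinearRegression().fit(x_train, y_train)
print("Estimated parameters:", model.intercept_, model.coef_)

# A large gap between training error and test error suggests overfitting
print("Train MSE:", mean_squared_error(y_train, model.predict(x_train)))
print("Test MSE:", mean_squared_error(y_test, model.predict(x_test)))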


3. What is Statistical Thinking? Discuss in detail Statistical inferences.


 The world we live in is complex, random, and uncertain. At the same time, it’s one big
data-generating machine: as we commute to work on subways and in cars, as our blood
moves through our bodies, as we’re shopping, emailing, procrastinating at work by
browsing the Internet and watching the stock market, as we’re building things, eating
things, and talking to our friends and family about things, and while factories are producing
products, all of this at least potentially produces data.
 Data represents the traces of the real-world processes, and exactly which traces we gather
are decided by our data collection or sampling method. You, the data scientist, the
observer, are turning the world into data, and this is an utterly subjective, not objective,
process.

Statistical inferences:
 After separating the process from the data collection, we can see clearly that there are two
sources of randomness and uncertainty: the randomness and uncertainty underlying the
process itself, and the uncertainty associated with the underlying data collection methods.
 This overall process of going from the world to the data, and then from the data back to
the world, is the field of statistical inference.
 Statistical inference is the discipline that concerns itself with the development of
procedures, methods, and theorems that allow us to extract meaning and information
from data that has been generated by stochastic processes.

4. Explain the following


i) Types of Data
ii) Populations and sample

Types of data:
A strong data scientist needs to be versatile and comfortable dealing with a variety of types of data,
including:
 Traditional: numerical, categorical, or binary
 Text: emails, tweets, New York Times articles (see Chapter 4 or Chapter 7)
 Records: user-level data, time stamped event data, json formatted log files
 Geo-based location data: briefly touched on in this chapter with NYC housing data
 Network data
 Sensor data
 Images
Populations and Samples:
 Sampling solves some engineering challenges
In the current popular discussion of Big Data, the focus on enterprise solutions such as Hadoop to
handle engineering and computational challenges caused by too much data overlooks sampling as
a legitimate solution.
 Even if we have access to all of Facebook’s or Google’s or Twitter’s data corpus, any
inferences we make from that data should not be extended to draw conclusions about
humans beyond those sets of users, or even those users for any particular day.

5. Explain probability distributions with different probabilistic graphs

Probability distributions are the foundation of statistical models. Back in the day, before
computers, scientists observed real-world phenomena, took measurements, and noticed that
certain mathematical shapes kept reappearing. The classical example is the height of humans,
following a normal distribution—a bell-shaped curve, also called a Gaussian distribution, named
after Gauss. Other common shapes have been named after their observers as well (e.g., the
Poisson distribution and the Weibull distribution), while other shapes such as Gamma
distributions or exponential distributions are named after associated mathematical objects.
Natural processes tend to generate measurements whose empirical shape could be approximated
by mathematical functions with a few parameters that could be estimated from the data.
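A minimal sketch (assuming NumPy and Matplotlib are available; the parameter values are illustrative) of sampling from some of these distributions and plotting their empirical shapes:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Draw samples from a few common distributions (parameters are illustrative)
normal = rng.normal(loc=170, scale=10, size=10000)        # e.g., human heights
poisson = rng.poisson(lam=4, size=10000)                  # counts per interval
exponential = rng.exponential(scale=2.0, size=10000)      # waiting times

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(normal, bins=50); axes[0].set_title("Normal (Gaussian)")
axes[1].hist(poisson, bins=range(0, 15)); axes[1].set_title("Poisson")
axes[2].hist(exponential, bins=50); axes[2].set_title("Exponential")
plt.tight_layout()
plt.show()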
Module-2

1. Explain the overview of the KNN process with an example.


Advantages:

 Simplicity and Speed:


K-means is straightforward and easy to understand, making it accessible for beginners. It is
relatively fast compared to other clustering algorithms, especially with small to medium-sized
datasets.
 Scalability:
K-means can handle large datasets efficiently. Its complexity is linear with respect to the
number of data points, making it scalable to large datasets.
 Convergence:
K-means is guaranteed to converge, usually in a modest number of iterations, although
possibly only to a local optimum.
 Versatility:
K-means is versatile and can be applied to various types of data and problems, including
image segmentation, document clustering, and customer segmentation

Disadvantages:

 Choosing K:
Number of Clusters: One significant challenge is determining the optimal number of
clusters (k). The algorithm requires k to be specified in advance, and selecting the right
value is often difficult.

 Outliers and Noise:


Susceptibility to Outliers: K-means is sensitive to outliers and noisy data. Outliers can
disproportionately affect the position of centroids and, consequently, the resulting clusters.
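Since the question asks for an overview of the KNN process with an example, here is a minimal sketch of k-nearest neighbors classification (assuming scikit-learn is available; the dataset and the choice of k = 5 are illustrative). KNN stores the labeled training points, computes the distance from a new point to all stored points, takes the k closest ones, and predicts by majority vote.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled training data (the iris dataset is used purely for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k = 5: each prediction is a majority vote over the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Predicted labels:", knn.predict(X_test[:5]))
print("Test accuracy:", knn.score(X_test, y_test))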
2. Explain how the RealDirect online real estate firm was able to address the issues related
to current data and brokers.

Case Study: RealDirect


 Doug Perlson, the CEO of RealDirect, has a background in real estate law, startups,
and online advertising. His goal with RealDirect is to use all the data he can access
about real estate to improve the way people sell and buy houses.
 RealDirect is addressing this problem by hiring a team of licensed real-estate agents
who work together and pool their knowledge. To accomplish this, it built an
interface for sellers, giving them useful data-driven tips on how to sell their house.
It also uses interaction data to give real-time recommendations on what to do next.
 The team of brokers also becomes data experts, learning to use information-
collecting tools to keep tabs on new and relevant data or to access publicly
available information. One problem with publicly available data is that it’s old
news—there’s a three-month lag between a sale and when the data about that sale
is available.
 RealDirect is working on real-time feeds on things like when people start searching
for a home, what the initial offer is, the time between offer and close, and how
people search for a home online.
 Ultimately, good information helps both the buyer and the seller, at least if they’re
honest.
How Does RealDirect Make Money?
 First, it offers a subscription to sellers, about $395 a month, to access the selling
tools.
 Second, it allows sellers to use RealDirect’s agents at a reduced commission,
typically 2% of the sale instead of the usual 2.5% or 3%.
 This is where the magic of data pooling comes in: it allows RealDirect to take a
smaller commission because it’s more optimized, and therefore gets more volume.
 The site itself is best thought of as a platform for buyers and sellers to manage their
sale or purchase process.
 There are statuses for each person on site: active, offer made, offer rejected,
showing, in contract, etc. Based on your status, different actions are suggested by
the software.

3. Explain Linear Regression with graphs to represent linear relationship, fitting the
Model and Cross Validation

Linear Regression
 One of the most common statistical methods is linear regression. It’s used to express the
mathematical relationship between two variables or attributes.
 The assumption is that there is a linear relationship between an outcome variable and a
predictor, between one variable and several other variables, in which case you’re modeling
the relationship as having a linear structure.
 It makes sense that changes in one variable correlate linearly with changes in another
variable. For example, it makes sense that the more umbrellas you sell, the more money
you make. In those cases you can feel good about the linearity assumption.
y = f(x) = β0 + β1*x

Fitting the model


The intuition behind linear regression is that you want to find the line that minimizes the distance
between all the points and the line.
Many lines look approximately correct, but your goal is to find the optimal one. Optimal could mean
different things, but let’s start with optimal to mean the line that, on average, is closest to all the points.
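A minimal sketch (assuming scikit-learn and NumPy; the data is synthetic and the numbers are illustrative) of fitting the line y = β0 + β1*x by least squares and evaluating it with k-fold cross-validation:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: e.g., umbrellas sold vs. money made (illustrative numbers)
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=150).reshape(-1, 1)
y = 5.0 * x.ravel() + 20 + rng.normal(0, 15, size=150)

# Fitting the model: least squares estimates of beta0 and beta1
model = LinearRegression().fit(x, y)
print("beta0 (intercept):", model.intercept_)
print("beta1 (slope):", model.coef_[0])

# 5-fold cross-validation: average R^2 across held-out folds
scores = cross_val_score(model, x, y, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())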

4. Discuss the data science process. Write about a Data Scientist’s role in this process.

We want to process this to make it clean for analysis. So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you want to call it. To do this we use tools such as Python, shell
scripts, R, or SQL, or all of the above.
Now the key here that makes data science special and distinct from statistics is that this data product then
gets incorporated back into the real world, and users interact with that product, and that generates more
data, which creates a feedback loop.
 A Data Scientist’s Role in This Process-
 This model so far seems to suggest this will all magically happen without human
intervention. By “human” here, we mean “data scientist.” Someone has to make the
decisions about what data to collect, and why.
 That person needs to be formulating questions and hypotheses and making a plan for how
the problem will be attacked. And that someone is the data scientist or our beloved data
science team.

5. Explain the Advantages and Disadvantages of K-means algorithm


Advantages:

 Simplicity and Speed:


K-means is straightforward and easy to understand, making it accessible for beginners. It is
relatively fast compared to other clustering algorithms, especially with small to medium-sized
datasets.
 Scalability:
K-means can handle large datasets efficiently. Its complexity is linear with respect to the
number of data points, making it scalable to large datasets.
 Convergence:
K-means is guaranteed to converge, usually in a modest number of iterations, although
possibly only to a local optimum.
 Versatility:
K-means is versatile and can be applied to various types of data and problems, including
image segmentation, document clustering, and customer segmentation

Disadvantages:

 Choosing K:
Number of Clusters: One significant challenge is determining the optimal number of clusters
(k). The algorithm requires k to be specified in advance, and selecting the right value is often
difficult.

 Outliers and Noise:


Susceptibility to Outliers: K-means is sensitive to outliers and noisy data. Outliers can
disproportionately affect the position of centroids and, consequently, the resulting clusters.
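A minimal K-means sketch (assuming scikit-learn; the synthetic blobs and the choice of k = 3 are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k must be chosen in advance; here we assume k = 3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])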
Module-3

1. Discuss the difference between filter and wrapper methods in feature selection.

Filter Methods:

 Filter methods evaluate the relevance of features by looking at the intrinsic properties of the
data, independent of any machine learning algorithm.
 They typically use statistical techniques to rank and select features based on their individual
performance.
 Common techniques include correlation coefficients, chi-square tests, mutual information, and
variance thresholding.
 These methods can handle large datasets effectively because they do not rely on iterative
training processes.

Wrapper Methods:

 Wrapper methods evaluate the usefulness of feature subsets by training and evaluating a
specific learning algorithm on different subsets of features.
 The selection process is guided by the model’s performance, typically using metrics such as
accuracy, precision, or recall.
 Common techniques include forward selection, backward elimination, and recursive feature
elimination (RFE).
 These methods optimize feature selection for the specific model and dataset, potentially
leading to higher model performance compared to filter methods.
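A minimal sketch contrasting the two approaches (assuming scikit-learn; the dataset and the choices of a chi-square filter and a logistic-regression wrapper are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features by a chi-square test, independent of any model
filter_selector = SelectKBest(score_func=chi2, k=10).fit(X, y)
print("Filter-selected feature indices:", filter_selector.get_support(indices=True))

# Wrapper method: recursive feature elimination guided by a specific model
estimator = LogisticRegression(max_iter=5000)
wrapper_selector = RFE(estimator, n_features_to_select=10).fit(X, y)
print("Wrapper-selected feature indices:", wrapper_selector.get_support(indices=True))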

2. Discuss feature selection and feature extraction in the context of machine learning

 The idea of feature selection is identifying the subset of data or transformed data that
you want to put into your model.
 It’s an important part of building statistical models and algorithms in general. Just
because you have data doesn’t mean it all has to go into the model.
 For example, it’s possible you have many redundancies or correlated variables in your
raw data, and so you don’t want to include all those variables in your model.
 Similarly, you might want to construct new variables by transforming the variables
with a logarithm, say, or turning a continuous variable into a binary variable, before
feeding them into the model.
If the number of features is larger than the number of observations, or if we have a sparsity
problem, then large isn’t necessarily good. And if the huge data just makes it hard to
manipulate because of computational reasons (e.g., it can’t all fit on one computer, so the data
needs to be shared across multiple machines) without improving our signal, then that’s a net
negative.
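A minimal sketch of the kinds of transformations mentioned above (assuming pandas and NumPy; the column names and the age threshold are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical raw data
df = pd.DataFrame({"income": [25000, 48000, 120000, 310000],
                   "age": [23, 35, 51, 44]})

# Construct a new variable with a logarithm to tame a skewed distribution
df["log_income"] = np.log(df["income"])

# Turn a continuous variable into a binary variable (threshold is illustrative)
df["is_over_40"] = (df["age"] > 40).astype(int)

print(df)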

3. Compare principal component analysis (PCA) with singular value decomposition (SVD)
in terms of their application to dimensionality reduction

Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are both
techniques used for dimensionality reduction in data analysis, but they differ in their
approaches and applications.

Principal Component Analysis (PCA):

 PCA aims to transform the original data into a new coordinate system such that the greatest
variance by any projection of the data comes to lie on the first coordinate (principal
component), the second greatest variance on the second coordinate, and so on.
 Uses the covariance matrix of the data. The principal components are the eigenvectors of the
covariance matrix, and the corresponding eigenvalues indicate the amount of variance captured
by each principal component.
 Involves calculating the covariance matrix, then performing eigenvalue decomposition on this
matrix to obtain the principal components. It can also be performed using SVD on the centered
data matrix.
 Selects the top k principal components to form a new subspace for the data, effectively
reducing the number of dimensions while preserving as much variance as possible.

Singular Value Decomposition (SVD):

 SVD decomposes a matrix into three other matrices: U, Σ, and V^T. In the context of
dimensionality reduction, SVD is used to approximate the original data matrix with lower-rank
matrices by keeping only the largest singular values and the corresponding singular vectors.
 Directly factors the data matrix A into A = UΣV^T. The columns of U are the left singular
vectors, the columns of V are the right singular vectors, and Σ is a diagonal matrix of singular
values.
 Is applied directly to the data matrix, making it more general; SVD can be applied to any
matrix without requiring it to be square or centered.
 Truncates the diagonal matrix Σ to keep the top k singular values, reconstructing the data with
a lower-rank approximation.
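A minimal sketch of the connection (assuming NumPy and scikit-learn; the random data is illustrative): PCA computed from the covariance matrix agrees, up to sign, with SVD applied to the centered data matrix, and truncating the singular values gives the low-rank approximation.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))

# PCA (via scikit-learn), which works from the covariance structure of the data
pca = PCA(n_components=2).fit(A)
print("PCA explained variance:", pca.explained_variance_)

# SVD applied directly to the centered data matrix
A_centered = A - A.mean(axis=0)
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)
print("Variance from singular values:", (S[:2] ** 2) / (A.shape[0] - 1))

# Rank-2 approximation: keep only the two largest singular values
A_rank2 = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]
print("Reconstruction error:", np.linalg.norm(A_centered - A_rank2))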

4. Discuss the Concept of Singular value decomposition method in Data Science.

Singular Value Decomposition (SVD)-


• Given an m×n matrix X of rank k, it is a theorem from linear algebra that we can
always decompose it into the product of three matrices as follows:

X = U S V^T

• where U is m×k, S is k×k, and V is k×n, the columns of U and V are pairwise
orthogonal, and S is diagonal. Note that the standard statement of SVD is slightly more
involved and has U and V both square unitary matrices, and has the middle
“diagonal” matrix rectangular. We’ll be using this form, because we’re going to be
taking approximations to X of increasingly smaller rank.
• Each row of U corresponds to a user, whereas V has a row for each item. The values
along the diagonal of the square matrix S are called the “singular values.”
Dimensionality reduction-
• Let’s think about how we reduce dimensions and create “latent features” internally
every day.
• For example, people invent concepts like “coolness,” but we can’t directly measure
how cool someone is. Other people exhibit different patterns of behavior, which we
internally map or reduce to our one dimension of “coolness.”
• Two things are happening here: the dimensionality is reduced to a single feature,
and that feature is latent.
• “Important” in this context means they explain the variance in the answers to the
various questions—in other words, they model the answers efficiently.
• Our goal is to build a model that has a representation in a low dimensional subspace
that gathers “taste information” to generate recommendations. So we’re saying here
that taste is latent but can be approximated by putting together all the observed
information we do have about the user.
Also consider that most of the time, the rating questions are binary (yes/no). To deal with
this, Hunch created a separate variable for every question. They also found that comparison
questions may be better at revealing preferences.
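A minimal sketch (assuming NumPy; the ratings matrix is a toy example) of approximating a user-item matrix X ≈ U S V^T with a small number of latent "taste" features:

import numpy as np

# Toy user-item ratings matrix X (rows = users, columns = items)
X = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep k = 2 latent features and reconstruct a low-rank approximation
k = 2
X_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print("Singular values:", S)
print("Rank-2 approximation of the ratings:\n", np.round(X_approx, 2))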

5. Explain Decision tree and Entropy with example


Decision trees-
• Decision trees have an intuitive appeal because outside the context of data science in our
everyday lives, we can think of breaking big decisions down into a series of questions.

The Decision Tree Algorithm-


• Build your decision tree iteratively, starting at the root. You need an algorithm to decide
which attribute to split on; e.g., which node should be the next one to identify.
• Choose that attribute in order to maximize information gain, because you’re getting the
most bang for your buck that way.
• Keep going until all the points at the end are in the same class or you end up with no
features left. In this case, you take the majority vote.
• Often people “prune the tree” afterwards to avoid overfitting. This just means cutting it
off below a certain depth.
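A minimal sketch (assuming NumPy and scikit-learn; the label list and the iris dataset are illustrative) of computing entropy, H = -Σ p_i log2(p_i), and fitting a small tree that splits to maximize information gain:

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    # Shannon entropy H = -sum(p * log2(p)) over the observed class proportions
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Example: a 50/50 split has the maximum entropy of 1 bit
print("Entropy of [yes, yes, no, no]:", entropy(["yes", "yes", "no", "no"]))

# Decision tree grown with the entropy criterion (information gain)
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print("Training accuracy:", tree.score(X, y))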

Module-4

1. What is Data Visualization? Explain the importance of Data Visualization

Data representation refers to the form in which you can store, process, and transmit data.
Representations are a useful apparatus to derive insights from the data. Thus, representations
transform data into useful information. Data visualization reveals patterns and correlations, such as the
positive relationship between body mass and maximum longevity in animals, which are not easily
discernible in raw data.

Importance of Data Visualization


 Visualizing data has many advantages, such as the following:
 Complex data can be easily understood.
 A simple visual representation of outliers, target audiences, and futures markets can be
created.
 Storytelling can be done using dashboards and animations.
 Data can be explored through interactive visualizations.

2. Explain Data Wrangling. Draw the data wrangling process to measure employee
engagement

Data wrangling is the process of transforming raw data into a suitable representation for various
tasks. It is the discipline of augmenting, cleaning, filtering, standardizing, and enriching data in a
way that allows it to be used in a downstream task, which in our case is data visualization.

3. Discuss Radar Chart with faceting for multiple variables. What are the design practices to
be followed for Radar Charts?

Radar charts (also known as spider or web charts) visualize multiple variables with each variable
plotted on its own axis, resulting in a polygon. All axes are arranged radially, starting at the
center with equal distances between one another, and have the same scale.
Design Practices
• Try to display 10 factors or fewer on a single radar chart to make it easier to read.
• Use faceting (displaying each variable in a separate plot) for multiple variables or groups in
order to maintain clarity.
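A minimal radar-chart sketch in Matplotlib (the factors and scores are hypothetical), using a polar Axes and closing the polygon by repeating the first value; faceting would simply repeat this on one polar subplot per group:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical factors and scores for one group
factors = ["Speed", "Quality", "Cost", "Support", "Usability"]
values = [4, 3, 2, 5, 4]

# One angle per factor; repeat the first point to close the polygon
angles = np.linspace(0, 2 * np.pi, len(factors), endpoint=False).tolist()
values = values + values[:1]
angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(factors)
plt.show()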

4. Explain the differences between a pie chart and Donut charts with respect to data
Visualization

Pie Chart
• Pie charts illustrate numerical proportions by dividing a circle into slices.
• Each arc length represents a proportion of a category.
• The full circle equates to 100%.

Design Practices
• Arrange the slices according to their size in increasing/decreasing order, either in a
clockwise or counterclockwise manner.
• Make sure that every slice has a different color.
Donut charts

Donut charts are also more space-efficient because the center is cut out, so this space can be used
to display information or to further divide groups into subgroups.

Design Practice
• Use the same color that’s used for the category for the subcategories. Use varying
brightness levels for the different subcategories.
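A minimal sketch of both charts in Matplotlib (the categories and values are hypothetical); a donut chart is drawn here as a pie chart whose wedges are given a width smaller than the radius:

import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
sizes = [40, 30, 20, 10]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Pie chart: the full circle equates to 100%
ax1.pie(sizes, labels=labels, autopct="%1.0f%%", startangle=90)
ax1.set_title("Pie chart")

# Donut chart: cut out the center by shrinking the wedge width
ax2.pie(sizes, labels=labels, startangle=90, wedgeprops={"width": 0.4})
ax2.set_title("Donut chart")

plt.show()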

5. Compare and Contrast Scatter Plot and Bubble plot

Scatter plots show data points for two numerical variables, displaying a variable on both axes.
Design Practices
• Start both axes at zero to represent data accurately.
• Use contrasting colors for data points and avoid using symbols for scatter plots with
multiple groups or categories

Uses
• You can detect whether a correlation (relationship) exists between two variables.
• They allow you to plot the relationship between multiple groups or categories using
different colors.

Bubble plot
• A bubble plot extends a scatter plot by introducing a third numerical variable.
• The value of the variable is represented by the size of the dots.
• The area of the dots is proportional to the value.
• A legend is used to link the size of the dot to an actual numerical value.

Design Practices
• The design practices for the scatter plot are also applicable to the bubble plot.
• Don't use bubble plots for very large amounts of data, since too many bubbles make the chart
difficult to read.

Uses
Bubble plots help to show a correlation between three variables
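A minimal bubble-plot sketch (assuming NumPy and Matplotlib; the data is synthetic): the third numerical variable is mapped to the marker size via the s parameter.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 2, 30)
third = rng.uniform(1, 100, 30)      # third numerical variable

# Scatter plot with marker area proportional to the third variable
plt.scatter(x, y, s=third * 5, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Bubble plot: size encodes a third variable")
plt.show()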

Module -5

1. Explain Box plot and Scatter plot in Matplotlib using plt.boxplot() and plt.scatter()
commands

Box Plot-
o The box plot shows multiple statistical measurements. The box extends from the lower
to the upper quartile values of the data, thereby allowing us to visualize the interquartile
range.
o The plt.boxplot(x) function creates a box plot.
o Important parameters:
• x: Specifies the input data. It specifies either a 1D array for a single box, or a sequence
of arrays for multiple boxes.
• notch: (optional) If true, notches will be added to the plot to indicate the confidence
interval around the median.
• labels: (optional) Specifies the labels as a sequence.
• showfliers: (optional) By default, it is true, and outliers are plotted beyond the caps.
• showmeans: (optional) If true, arithmetic means are shown

Scatter Plot-
o Scatter plots show data points for two numerical variables, displaying a variable on both
axes.
o plt.scatter(x, y) creates a scatter plot of y versus x, with optionally varying marker size
and/or color.
Important parameters:
• x, y: Specifies the data positions.
• s: (optional) Specifies the marker size in points squared.
• c: (optional) Specifies the marker color. If a sequence of numbers is specified, the
numbers will be mapped to the colors of the color map.
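A minimal sketch of both commands (assuming NumPy and Matplotlib; the data is synthetic):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
data = [rng.normal(0, std, 100) for std in (1, 2, 3)]   # three groups for the box plot
x = rng.uniform(0, 10, 50)
y = x + rng.normal(0, 1, 50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Box plot: notched boxes with labels, means shown (parameters described above)
ax1.boxplot(data, notch=True, labels=["g1", "g2", "g3"], showmeans=True)
ax1.set_title("Box plot")

# Scatter plot: y versus x, with optional marker size s and color c
ax2.scatter(x, y, s=20, c="tab:blue")
ax2.set_title("Scatter plot")

plt.show()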

2. Explain Plotting using Pandas data frames briefly in Matplotlib

• It is pretty straightforward to use a pandas.DataFrame as a data source. Instead of providing
x and y values, you can provide the pandas.DataFrame in the data parameter and give keys
for x and y, as follows:

plt.plot('x_key', 'y_key', data=df)

• If your data is already a pandas DataFrame, this is the preferred way.

Ticks-
• Tick locations and labels can be set manually if Matplotlib's default isn't sufficient.
• Considering the previous plot, it might be preferable to only have ticks at multiples of
one on the x-axis. One way to accomplish this is to use plt.xticks() and plt.yticks() to
either get or set the ticks manually.
• plt.xticks(ticks, [labels], [**kwargs]) sets the current tick locations and labels of the x-
axis.
Parameters:
• ticks: List of tick locations; if an empty list is passed, ticks will be disabled.
• labels (optional): You can optionally pass a list of labels for the specified locations.
• **kwargs (optional): matplotlib.text.Text() properties can be used to customize the
appearance of the tick labels. A quite useful property is rotation; this allows you to rotate
the tick labels to use space more efficiently.
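A minimal sketch combining both ideas (assuming pandas and Matplotlib; the DataFrame contents and tick locations are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical DataFrame used as the data source
df = pd.DataFrame({"x_key": [0, 1, 2, 3, 4, 5],
                   "y_key": [1, 4, 9, 16, 25, 36]})

# Provide the DataFrame via the data parameter and refer to columns by key
plt.plot("x_key", "y_key", data=df)

# Set tick locations and labels manually, rotating the labels to save space
plt.xticks(range(0, 6), ["0", "1", "2", "3", "4", "5"], rotation=45)
plt.show()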

3. Explain plt.figure() and plt.close()commands in Matplotlib

plt.figure() is used to create a new Figure.


• This function returns a Figure instance, but it is also passed to the backend.
• Every Figure-related command that follows is applied to the current Figure and does not
need to know the Figure instance.
• By default, the Figure has a width of 6.4 inches and a height of 4.8 inches with a dpi (dots
per inch) of 100. To change the default values of the Figure, we can use the parameters
figsize and dpi.
 #To change the width and the height
 plt.figure(figsize=(10, 5))
 #To change the dpi
 plt.figure(dpi=300)

plt.close()
Figures that are no longer used should be closed by explicitly calling plt.close(), which also
cleans up memory efficiently.
• If nothing is specified, the plt.close() command will close the current Figure.
• To close a specific Figure, you can either provide a reference to a Figure instance or
provide the Figure number. To find the number of a Figure object, we can make use of
the number attribute, as follows:
 plt.gcf().number
• The plt.close('all') command is used to close all active Figures. The following example
shows how a Figure can be created and closed:
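A minimal illustration of the pattern described above (the plotted values are arbitrary):

import matplotlib.pyplot as plt

# Create a Figure with a custom size and resolution
fig = plt.figure(figsize=(10, 5), dpi=100)
plt.plot([1, 2, 3], [4, 5, 6])
print("Current Figure number:", plt.gcf().number)

# Close this specific Figure, then make sure nothing is left open
plt.close(fig)
plt.close('all')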

4. Briefly explain these terms


i) Labels  ii) Titles  iii) Annotations  iv) Legends

i) Labels-
• Matplotlib provides a few label functions that we can use for setting labels to the x-
and y-axes.
• The plt.xlabel() and plt.ylabel() functions are used to set the label for the current
axes. The set_xlabel() and set_ylabel() functions are used to set the label for specified
axes.

ii) Titles-
• A title describes a particular chart/graph.
• The titles are placed above the axes in the center, left edge, or right edge. There are
two options for titles – you can either set the Figure title or the title of an Axes.
• The suptitle() function sets the title for the current or a specified Figure. The title()
function sets the title for the current or a specified Axes.

iii)Annotations-
• Compared to text that is placed at an arbitrary position on the Axes, annotations are used
to annotate some features of the plot.
• In annotations, there are two locations to consider: the annotated location, xy, and the
location of the annotation, text xytext. It is useful to specify the parameter arrowprops,
which results in an arrow pointing to the annotated location.

iv)Legends-
• Legend describes the content of the plot. To add a legend to your Axes, we have to
specify the label parameter at the time of plot creation.
• Calling plt.legend() for the current Axes or Axes.legend() for a specific Axes will add the
legend. The loc parameter specifies the location of the legend.
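A minimal sketch using all four elements (the text and coordinates are illustrative):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x^2")   # label is used by the legend

# Labels and titles
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Axes title")
fig.suptitle("Figure title")

# Annotation: annotate the point (2, 4), with an arrow from the text position
ax.annotate("point of interest", xy=(2, 4), xytext=(0.5, 6),
            arrowprops={"arrowstyle": "->"})

# Legend placed in the upper-left corner via loc
ax.legend(loc="upper left")
plt.show()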

5. Discuss the basic image operations that can be performed in Matplotlib with example

Basic Image Operations- The following are the basic operations for working with images.
 Loading Images-
In Matplotlib, loading images is part of the image submodule. We use the alias mpimg for the
submodule, as follows:
import matplotlib.image as mpimg
mpimg.imread(fname) reads an image and returns it as a numpy.array object. For
grayscale images, the returned array has shape (height, width); for RGB images, (height, width,
3); and for RGBA images, (height, width, 4).
The array values range from 0 to 255.
We can also load the image in the following manner:
