DP 100 Sample

The document outlines a series of questions and suggested answers related to configuring various modules in Azure Machine Learning for data science tasks, including feature selection, model training, and data visualization. Each question specifies the required configurations or methods to achieve specific objectives, such as identifying outliers, replacing missing data, and evaluating model performance. The suggested answers provide insights into the appropriate techniques and settings to use for effective data analysis and model development.

Uploaded by

Prashanth Venkategowda

DP-100 = Azure Data Scientist Associate = SAMPLE

QUESTION 1
You need to configure the Filter Based Feature Selection module based on the experiment requirements
and datasets.
How should you configure the module properties? To answer, select the appropriate options in the dialog
box in the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:
Suggested Answer:

Box 1: Mutual Information.

The mutual information score is particularly useful in feature selection because it maximizes the
mutual information between the joint distribution and target variables in datasets with many
dimensions.
Box 2: MedianValue -
MedianValue is the feature column; it is the predictor of the dataset.
Scenario: The MedianValue and AvgRoomsinHouse columns both hold data in numeric format.
You need to select a feature selection algorithm to analyze the relationship between the two
columns in more detail.
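
To make the mutual information score concrete, here is a minimal sketch in plain Python. It is not the module's implementation: the equal-width binning, the bin count, and the sample values for the two columns are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=4):
    """Estimate mutual information (in nats) between two numeric columns
    after discretising each one into equal-width bins."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # constant column: everything lands in bin 0
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))   # joint bin counts
    px, py = Counter(bx), Counter(by)  # marginal bin counts
    score = 0.0
    for (i, j), count in joint.items():
        p_xy = count / n
        score += p_xy * math.log(p_xy / ((px[i] / n) * (py[j] / n)))
    return score


# Hypothetical values standing in for AvgRoomsinHouse and MedianValue.
avg_rooms = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
median_value = [2.0 * rooms for rooms in avg_rooms]
print(mutual_information(avg_rooms, median_value))
```

A higher score means that knowing one column tells you more about the other; a constant column scores zero against anything.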

QUESTION 2
You need to set up the Permutation Feature Importance module according to the model training
requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Suggested Answer:
Box 1: Accuracy -
Scenario: You want to configure hyperparameters in the model learning process to speed up the
learning phase. In addition, this configuration should cancel the lowest
performing runs at each evaluation interval, thereby directing effort and resources towards models
that are more likely to be successful.
Box 2: R-Squared
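
The idea behind permutation feature importance can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the module's code: the `predict` callable, the row layout (a list of feature lists), and the seed are all hypothetical. Each feature column is shuffled in turn, and the drop in R-squared is reported as that feature's importance.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination for a list of predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, rows, y_true, n_features, seed=0):
    """Return one importance per feature: baseline score minus the score
    obtained after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = r_squared(y_true, [predict(row) for row in rows])
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j permuted.
        shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
        score = r_squared(y_true, [predict(row) for row in shuffled])
        importances.append(baseline - score)
    return importances


# Hypothetical model that uses only the first feature.
rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
targets = [3.0 * row[0] for row in rows]
print(permutation_importance(lambda row: 3.0 * row[0], rows, targets, n_features=2))
```

A feature the model ignores gets an importance of zero, because shuffling it leaves every prediction unchanged.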

QUESTION 3
You need to select a feature extraction method.
Which method should you use?
A. Mutual information
B. Pearson's correlation
C. Spearman correlation
D. Fisher Linear Discriminant Analysis

Box 1: Floating point -

The MedianValue column needs the floating point data type.
Scenario: An initial investigation shows that the datasets are identical in structure apart from the
MedianValue column. The smaller Paris dataset contains the
MedianValue in text format, whereas the larger London dataset contains the MedianValue in
numerical format.
Box 2: Unchanged -
Note: Select the Categorical option to specify that the values in the selected columns should be
treated as categories.
For example, you might have a column that contains the numbers 0, 1, and 2, but know that the
numbers actually mean "Smoker", "Non smoker" and "Unknown". In that case, by flagging the
column as categorical you can ensure that the values are not used in numeric calculations, only to
group data.
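
The two settings above (a floating point data type and the categorical flag) can be mimicked in plain Python. This is a rough sketch only: the helper names, the sample values, and the choice to turn unparseable text into `None` are assumptions, not behaviour documented for any Azure module.

```python
from collections import Counter


def to_float_column(text_values):
    """Convert text entries to floats; entries that cannot be parsed become None."""
    converted = []
    for value in text_values:
        try:
            converted.append(float(value.strip()))
        except (ValueError, AttributeError):
            converted.append(None)
    return converted


def count_by_category(codes, labels):
    """Treat numeric codes as category labels and count rows per category,
    so the codes are used only for grouping, never for arithmetic."""
    return Counter(labels[code] for code in codes)


# Hypothetical text-format MedianValue entries.
print(to_float_column(["21.5", " 30 ", "n/a"]))

# The smoker example from the note above.
smoker_labels = {0: "Smoker", 1: "Non smoker", 2: "Unknown"}
print(count_by_category([0, 1, 1, 2, 0, 1], smoker_labels))
```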

QUESTION 6
You need to identify the methods for dividing the data according to the testing requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Suggested Answer:
Scenario: Testing -
You must produce multiple partitions of a dataset based on sampling using the Partition and
Sample module in Azure Machine Learning Studio.
Box 1: Assign to folds -
Use Assign to folds option when you want to divide the dataset into subsets of the data. This option
is also useful when you want to create a custom number of folds for cross-validation, or to split
rows into several groups.
Not Head: Use Head mode to get only the first n rows. This option is useful if you want to test a
pipeline on a small number of rows, and don't need the data to be balanced or sampled in any way.
Not Sampling: The Sampling option supports simple random sampling or stratified random
sampling. This is useful if you want to create a smaller representative sample dataset for testing.
Box 2: Partition evenly -
Specify the partitioner method: Indicate how you want data to be apportioned to each partition,
using these options:
- Partition evenly: Use this option to place an equal number of rows in each partition. To specify
the number of output partitions, type a whole number in the
"Specify number of folds to split evenly into" text box.
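
As a rough illustration of what assigning rows to evenly sized folds means, the following plain-Python sketch deals shuffled row indices out to folds in round-robin order. The function name, the `randomize` flag, and the seed are assumptions made for this example and do not describe the module's internals.

```python
import random
from collections import Counter


def assign_to_folds(n_rows, n_folds, randomize=True, seed=0):
    """Return a fold number (0 .. n_folds - 1) for every row, keeping
    the fold sizes as equal as possible."""
    indices = list(range(n_rows))
    if randomize:
        random.Random(seed).shuffle(indices)
    folds = [0] * n_rows
    for position, row in enumerate(indices):
        folds[row] = position % n_folds  # round-robin keeps sizes within one row of each other
    return folds


fold_of_row = assign_to_folds(n_rows=10, n_folds=5)
print(Counter(fold_of_row))
```

When the row count is not a multiple of the fold count, the fold sizes differ by at most one row.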

QUESTION 7
You need to visually identify whether outliers exist in the Age column and quantify the outliers before the
outliers are removed.

Which three Azure Machine Learning Studio modules should you use? Each correct answer presents part
of the solution.

NOTE: Each correct selection is worth one point.

A. Create Scatterplot
B. Summarize Data
C. Clip Values
D. Replace Discrete Values
E. Build Counting Transform

Step 1: Sweep Clustering -

Start by using the "Tune Model Hyperparameters" module to select the best sets of parameters for
each of the models we're considering.
One of the interesting things about the "Tune Model Hyperparameters" module is that it outputs not only
the results of the tuning but also the trained model.
Step 2: Train Model -
Step 3: Evaluate Model -
Scenario: You need to provide the test results to the Fabrikam Residences team. You create data
visualizations to aid in presenting the results.
You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test
evaluation of the model. You need to select appropriate methods for producing the ROC curve in
Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class
Decision Jungle modules with one another.
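
The comparison that an ROC curve supports can be sketched in plain Python as follows. The labels and the two score lists are made-up values standing in for the outputs of two scored models; nothing here reproduces the Evaluate Model module itself.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate),
    sweeping the decision threshold from the highest score downwards."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Rows that share a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scored outputs for two classifiers on the same six rows.
labels = [1, 1, 0, 1, 0, 0]
forest_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
jungle_scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2]
print(area_under_curve(roc_points(labels, forest_scores)))
print(area_under_curve(roc_points(labels, jungle_scores)))
```

Plotting both point lists on the same axes gives the visual comparison; the area under each curve gives a single number for ranking the two models.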

QUESTION 9
You need to replace the missing data in the AccessibilityToHighway columns.

How should you configure the Clean Missing Data module? To answer, select the appropriate options in
the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:

Suggested Answer:
Box 1: Replace using MICE -
Replace using MICE: For each missing value, this option assigns a new value, which is calculated
by using a method described in the statistical literature as
"Multivariate Imputation using Chained Equations" or "Multiple Imputation by Chained
Equations". With a multiple imputation method, each variable with missing data is modeled
conditionally using the other variables in the data before filling in the missing values.
Scenario: The AccessibilityToHighway column in both datasets contains missing values. The
missing data must be replaced with new data so that it is modeled conditionally using the other
variables in the data before filling in the missing values.
Box 2: Propagate -
Cols with all missing values: Indicates whether columns that consist entirely of missing values
should be preserved in the output.
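
The phrase "modeled conditionally using the other variables" can be illustrated with a heavily simplified chained-equations loop. This sketch is an assumption-laden toy, not the MICE method as implemented in any library: it handles exactly two numeric columns, uses simple linear regression, and is deterministic, whereas real MICE adds random draws and produces multiple imputed datasets.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y as a function of x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def chained_imputation(col_a, col_b, iterations=20):
    """Fill None entries in two columns by repeatedly regressing each
    column on the other and re-predicting its missing cells."""
    cols = [list(col_a), list(col_b)]
    missing = [[value is None for value in col] for col in cols]
    n = len(cols[0])
    # Step 1: start from the mean of the observed values in each column.
    for col, flags in zip(cols, missing):
        observed = [value for value in col if value is not None]
        mean = sum(observed) / len(observed)
        for i in range(n):
            if flags[i]:
                col[i] = mean
    # Step 2: cycle through the columns, refitting and re-imputing.
    for _ in range(iterations):
        for target in (0, 1):
            predictor = 1 - target
            rows = [i for i in range(n) if not missing[target][i]]
            slope, intercept = fit_line(
                [cols[predictor][i] for i in rows],
                [cols[target][i] for i in rows],
            )
            for i in range(n):
                if missing[target][i]:
                    cols[target][i] = intercept + slope * cols[predictor][i]
    return cols[0], cols[1]


# Hypothetical columns where the second is twice the first.
filled_a, filled_b = chained_imputation(
    [1.0, 2.0, 3.0, 4.0, None, 6.0],
    [2.0, 4.0, None, 8.0, 10.0, 12.0],
)
print(filled_a, filled_b)
```

Observed values are never overwritten; only the cells that started as `None` are re-estimated on each pass.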

QUESTION 10
You plan to implement an Azure Machine Learning solution.
You have the following requirements: