
DP 100 Sample

The document outlines a series of questions and suggested answers related to configuring various modules in Azure Machine Learning for data science tasks, including feature selection, model training, and data visualization. Each question specifies the required configurations or methods to achieve specific objectives, such as identifying outliers, replacing missing data, and evaluating model performance. The suggested answers provide insights into the appropriate techniques and settings to use for effective data analysis and model development.

DP-100 = Azure Data Scientist Associate = SAMPLE

QUESTION 1
You need to configure the Filter Based Feature Selection module based on the experiment requirements
and datasets.
How should you configure the module properties? To answer, select the appropriate options in the dialog
box in the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:
Suggested Answer:

Box 1: Mutual Information.


The mutual information score is particularly useful in feature selection because it maximizes the
mutual information between the joint distribution and target variables in datasets with many
dimensions.
Box 2: MedianValue -
MedianValue is the feature column; it is the predictor of the dataset.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format.
You need to select a feature selection algorithm to analyze the relationship between the two
columns in more detail.
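
As a rough illustration of what a mutual information score measures, the sketch below estimates it for two numeric columns by first sorting each column into equal-width bins. The function name, the bin count, and the sample values are assumptions made for this example; this is not the Azure module's implementation.

```python
import math
from collections import Counter

def mutual_information(xs, ys, bins=4):
    """Estimate mutual information (in nats) between two numeric columns
    by discretizing each one into equal-width bins."""
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi

# Hypothetical sample values, not taken from the exam datasets.
avg_rooms = [1, 2, 3, 4, 5, 6, 7, 8]
median_value = [2 * r for r in avg_rooms]   # fully determined by avg_rooms
constant = [5] * len(avg_rooms)             # carries no information

mi_dependent = mutual_information(avg_rooms, median_value)
mi_constant = mutual_information(avg_rooms, constant)
```

A higher score means that knowing one column tells you more about the other; a column that never changes scores zero.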

QUESTION 2
You need to set up the Permutation Feature Importance module according to the model training
requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Suggested Answer:
Box 1: Accuracy -
Scenario: You want to configure hyperparameters in the model learning process to speed up the
learning phase. In addition, this configuration should cancel the lowest-performing runs at each
evaluation interval, thereby directing effort and resources toward models that are more likely
to be successful.
Box 2: R-Squared

QUESTION 3
You need to select a feature extraction method.
Which method should you use?
A. Mutual information
B. Pearson's correlation
C. Spearman correlation
D. Fisher Linear Discriminant Analysis

Suggested Answer: C

Spearman's rank correlation coefficient assesses how well the relationship between two variables
can be described using a monotonic function.
Note: Both Spearman's and Kendall's can be formulated as special cases of a more general
correlation coefficient, and they are both appropriate in this scenario.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format.
You need to select a feature selection algorithm to analyze the relationship between the two
columns in more detail.
Incorrect Answers:
B: The Spearman correlation between two variables is equal to the Pearson correlation between the
rank values of those two variables; while Pearson's correlation assesses linear relationships,
Spearman's correlation assesses monotonic relationships (whether linear or not).
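
The distinction between the two coefficients can be checked numerically. The following is a minimal sketch, assuming made-up sample values: Spearman's coefficient is computed as Pearson's coefficient on the ranks, so a monotonic but non-linear relationship scores a perfect 1 under Spearman and less than 1 under Pearson.

```python
import math

def pearson(xs, ys):
    """Pearson's linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's coefficient: Pearson's coefficient applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical values: monotonic, but clearly not linear.
avg_rooms = [1, 2, 3, 4, 5, 6]
median_value = [r ** 3 for r in avg_rooms]

rho = spearman(avg_rooms, median_value)
r = pearson(avg_rooms, median_value)
```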

QUESTION 4
You need to configure the Permutation Feature Importance module for the model training requirements.
What should you do? To answer, select the appropriate options in the dialog box in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Suggested Answer:

Box 1: 500 -
For Random seed, type a value to use as seed for randomization. If you specify 0 (the default), a
number is generated based on the system clock.
A seed value is optional, but you should provide a value if you want reproducibility across runs of
the same experiment.
Here we must replicate the findings.
Box 2: Mean Absolute Error -
Scenario: Given a trained model and a test dataset, you must compute the Permutation Feature
Importance scores of feature variables. You need to set up the Permutation Feature Importance
module to select the correct metric to investigate the model's accuracy and replicate the findings.
Regression: Choose one of the following: Precision, Recall, Mean Absolute Error, Root Mean
Squared Error, Relative Absolute Error, Relative Squared Error, Coefficient of Determination.
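
The idea behind the module can be sketched in a few lines: shuffle one feature column at a time and measure how much the chosen metric (here Mean Absolute Error) gets worse. The `predict` function below is a hypothetical stand-in for a trained model, and the data is invented; unlike the module, this sketch does not treat a seed of 0 as "use the system clock".

```python
import random

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, targets, seed=500):
    """Return one score per feature: MAE after shuffling that feature
    minus the baseline MAE. A fixed seed makes the shuffles repeatable."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(targets, [predict(r) for r in rows])
    scores = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        error = mean_absolute_error(targets, [predict(r) for r in permuted])
        scores.append(error - baseline)
    return scores

# Hypothetical "trained model": it only looks at feature 0.
def predict(row):
    return 10.0 * row[0]

rows = [[float(i), float((i * 7) % 5)] for i in range(1, 11)]
targets = [predict(r) for r in rows]

first_run = permutation_importance(predict, rows, targets, seed=500)
second_run = permutation_importance(predict, rows, targets, seed=500)
```

Running the function twice with the same seed yields identical scores, which is the reproducibility the scenario asks for. The feature the model ignores scores zero.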

QUESTION 5
You need to configure the Edit Metadata module so that the structure of the datasets matches.
Which configuration options should you select? To answer, select the appropriate options in the answer
area.
NOTE: Each correct selection is worth one point.
Hot Area:
Suggested Answer:

Box 1: Floating point -


The MedianValue column must be converted to floating point.
Scenario: An initial investigation shows that the datasets are identical in structure apart from the
MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas
the larger London dataset contains the MedianValue in numerical format.
Box 2: Unchanged -
Note: Select the Categorical option to specify that the values in the selected columns should be
treated as categories.
For example, you might have a column that contains the numbers 0, 1, and 2, but know that the
numbers actually mean "Smoker", "Non smoker" and "Unknown". In that case, by flagging the
column as categorical you can ensure that the values are not used in numeric calculations, only to
group data.
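
Outside the designer, the same data type change amounts to parsing the text column as floating point so that both datasets share one structure. The sketch below assumes rows stored as dictionaries and invents the sample values; the helper name `to_floating_point` is hypothetical.

```python
def to_floating_point(rows, column):
    """Return copies of `rows` with `column` parsed as float.
    Blank or unparseable text becomes None instead of raising."""
    converted = []
    for row in rows:
        new_row = dict(row)
        try:
            new_row[column] = float(row[column])
        except (TypeError, ValueError):
            new_row[column] = None
        converted.append(new_row)
    return converted

# Hypothetical rows: Paris stores MedianValue as text, London as numbers.
paris = [{"MedianValue": "24.5"}, {"MedianValue": " 31 "}, {"MedianValue": ""}]
london = [{"MedianValue": 22.0}]

paris_fixed = to_floating_point(paris, "MedianValue")
combined = london + paris_fixed
```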

QUESTION 6
You need to identify the methods for dividing the data according to the testing requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Suggested Answer:
Scenario: Testing -
You must produce multiple partitions of a dataset based on sampling using the Partition and
Sample module in Azure Machine Learning Studio.
Box 1: Assign to folds -
Use the Assign to folds option when you want to divide the dataset into subsets of the data. This
option is also useful when you want to create a custom number of folds for cross-validation, or to
split rows into several groups.
Not Head: Use Head mode to get only the first n rows. This option is useful if you want to test a
pipeline on a small number of rows, and don't need the data to be balanced or sampled in any way.
Not Sampling: The Sampling option supports simple random sampling or stratified random
sampling. This is useful if you want to create a smaller representative sample dataset for testing.
Box 2: Partition evenly -
Specify the partitioner method: Indicate how you want data to be apportioned to each partition,
using these options:
• Partition evenly: Use this option to place an equal number of rows in each partition. To specify
the number of output partitions, type a whole number in the Specify number of folds to split
evenly into text box.
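
An even split into folds can be sketched as a shuffle followed by round-robin assignment. The function name and parameters below are assumptions for illustration and do not mirror the module's exact options.

```python
import random

def assign_to_folds(rows, num_folds, seed=0, randomize=True):
    """Split `rows` into `num_folds` partitions whose sizes differ by at most one."""
    indices = list(range(len(rows)))
    if randomize:
        random.Random(seed).shuffle(indices)
    folds = [[] for _ in range(num_folds)]
    for position, index in enumerate(indices):
        folds[position % num_folds].append(rows[index])
    return folds

folds = assign_to_folds(list(range(10)), num_folds=3, seed=42)
fold_sizes = sorted(len(fold) for fold in folds)
```

With 10 rows and 3 folds, one fold receives 4 rows and the other two receive 3 each. Every row lands in exactly one fold.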

QUESTION 7
You need to visually identify whether outliers exist in the Age column and quantify the outliers before the
outliers are removed.

Which three Azure Machine Learning Studio modules should you use? Each correct answer presents part
of the solution.

NOTE: Each correct selection is worth one point.

A. Create Scatterplot
B. Summarize Data
C. Clip Values
D. Replace Discrete Values
E. Build Counting Transform

Suggested Answer: ABC


B: To get a global view, use the Summarize Data module. Add the module and connect it to the
dataset that needs to be visualized.
A: One way to quickly identify outliers visually is to create scatter plots.
C: The easiest way to treat outliers in Azure ML is to use the Clip Values module in Azure Machine
Learning Studio. It can identify and optionally replace data values that are above or below a
specified threshold. This is useful when you want to remove outliers or replace them with a mean,
a constant, or another substitute value.
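
The quantify-then-clip workflow can be approximated in plain Python. The sketch below assumes the common 1.5 × IQR rule for choosing thresholds, which the source does not specify, and uses invented Age values.

```python
import statistics

def summarize(values):
    """Five-number summary, similar in spirit to a summary statistics report."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

def clip_values(values, lower, upper):
    """Replace anything outside [lower, upper] with the nearest threshold."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical Age column with one implausible entry.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 120]

stats = summarize(ages)
iqr = stats["q3"] - stats["q1"]
lower = stats["q1"] - 1.5 * iqr   # assumed threshold rule (1.5 x IQR)
upper = stats["q3"] + 1.5 * iqr

outliers = [a for a in ages if a < lower or a > upper]   # quantify first
clipped = clip_values(ages, lower, upper)                # then treat
```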

QUESTION 8
You need to produce a visualization for the diagnostic test evaluation according to the data visualization
requirements.
Which three modules should you recommend be used in sequence? To answer, move the appropriate
modules from the list of modules to the answer area and arrange them in the correct order.
Select and Place:
Suggested Answer:

Step 1: Sweep Clustering -


Start by using the "Tune Model Hyperparameters" module to select the best sets of parameters for
each of the models we're considering.
One of the interesting things about the "Tune Model Hyperparameters" module is that it not only
outputs the results from the tuning, it also outputs the trained model.
Step 2: Train Model -
Step 3: Evaluate Model -
Scenario: You need to provide the test results to the Fabrikam Residences team. You create data
visualizations to aid in presenting the results.
You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test
evaluation of the model. You need to select appropriate methods for producing the ROC curve in
Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class
Decision Jungle modules with one another.
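
To make the ROC comparison concrete, the sketch below builds the curve for two sets of scores by sweeping the decision threshold and then compares the areas under the curves. The labels and both score lists are invented; they merely stand in for the outputs of the two models being compared.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold from the highest score to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
forest_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # hypothetical model A
jungle_scores = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2]   # hypothetical model B

forest_auc = auc(roc_points(labels, forest_scores))
jungle_auc = auc(roc_points(labels, jungle_scores))
```

Plotting both point lists on the same axes gives the visual comparison; the curve that sits closer to the top-left corner has the larger area.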

QUESTION 9
You need to replace the missing data in the AccessibilityToHighway columns.

How should you configure the Clean Missing Data module? To answer, select the appropriate options in
the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:

Suggested Answer:
Box 1: Replace using MICE -
Replace using MICE: For each missing value, this option assigns a new value, which is calculated
by using a method described in the statistical literature as "Multivariate Imputation using Chained
Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method,
each variable with missing data is modeled conditionally using the other variables in the data
before filling in the missing values.
Scenario: The AccessibilityToHighway column in both datasets contains missing values. The
missing data must be replaced with new data so that it is modeled conditionally using the other
variables in the data before filling in the missing values.
Box 2: Propagate -
The Cols with all missing values option indicates whether columns in which all values are missing
should be preserved in the output.
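
The "modeled conditionally using the other variables" idea can be illustrated with a deliberately simplified sketch: two columns, each regressed on the other in turn, with missing entries replaced by the regression prediction. Real MICE handles many columns and draws multiple imputations; the function names and data here are hypothetical and this is not the module's algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def chained_impute(col_a, col_b, rounds=5):
    """Fill None entries in two columns by regressing each on the other in turn."""
    cols = [list(col_a), list(col_b)]
    missing = [[v is None for v in c] for c in cols]
    # Start from mean imputation so every regression has complete predictors.
    for col, miss in zip(cols, missing):
        observed = [v for v in col if v is not None]
        mean = sum(observed) / len(observed)
        for i, is_missing in enumerate(miss):
            if is_missing:
                col[i] = mean
    for _ in range(rounds):
        for target, other in ((0, 1), (1, 0)):
            # Fit only on rows where the target was actually observed.
            rows = [i for i, m in enumerate(missing[target]) if not m]
            a, b = fit_line([cols[other][i] for i in rows],
                            [cols[target][i] for i in rows])
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    cols[target][i] = a + b * cols[other][i]
    return cols

# Hypothetical data: the second column is missing one value.
distance = [1.0, 2.0, 3.0, 4.0, 5.0]
accessibility = [2.0, 4.0, None, 8.0, 10.0]

_, filled = chained_impute(distance, accessibility)
```

The missing entry is filled from its relationship to the other column rather than from a column-wide mean or constant.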

QUESTION 10
You plan to implement an Azure Machine Learning solution.
You have the following requirements:

• Run a Jupyter notebook to interactively train a machine learning model.

• Deploy assets and workflows for machine learning proof of concept by using scripting rather than
custom programming.

You need to select a development technique for each requirement.

Which development technique should you use? To answer, select the appropriate options in the answer
area.

NOTE: Each correct selection is worth one point.


Suggested Answer:
