Databricks-Machine-Learning-Associate Question Set: Actual Databricks Exam Questions

Question 1

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

A. The data will be distributed across multiple executors during the inference process

B. The data will be limited to a single executor preventing the model from being loaded multiple times

C. The model only needs to be loaded once per executor rather than once per batch during the inference process

D. The model will be limited to a single executor preventing the data from being distributed

Correct answer: C
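The reasoning behind this answer can be illustrated without Spark. The sketch below is a plain-Python simulation, not the code block from the question: `load_model`, `predict_per_batch`, and `predict_iterator` are hypothetical names, and the "model" is a toy function. In PySpark the iterator form would be a pandas UDF whose type hints are `Iterator[pd.Series] -> Iterator[pd.Series]`; there the load happens once per iterator that Spark hands to the function, not once per batch.

```python
from typing import Callable, Dict, Iterator, List

# Counts how often the (hypothetical) expensive model load runs.
loads: Dict[str, int] = {"count": 0}


def load_model() -> Callable[[float], float]:
    """Stand-in for an expensive model load (assumption: a toy doubling model)."""
    loads["count"] += 1
    return lambda x: 2.0 * x


def predict_per_batch(batch: List[float]) -> List[float]:
    # Scalar-style function: the model is loaded inside every call,
    # which means once per batch.
    model = load_model()
    return [model(x) for x in batch]


def predict_iterator(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    # Iterator-style function: load once, then reuse the model for
    # every batch that arrives through the iterator.
    model = load_model()
    for batch in batches:
        yield [model(x) for x in batch]


batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

loads["count"] = 0
per_batch_out = [predict_per_batch(b) for b in batches]
per_batch_loads = loads["count"]

loads["count"] = 0
iterator_out = list(predict_iterator(iter(batches)))
iterator_loads = loads["count"]
```

Both variants produce the same predictions; only the number of model loads differs, which is the benefit option C describes.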


Question 2

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

A. Fewer hyperparameter values need to be tested when using a train-validation split

B. Reproducibility is achievable when using a train-validation split

C. Bias is avoidable when using a train-validation split

D. A holdout set is not necessary when using a train-validation split

E. Fewer models need to be trained when using a train-validation split

Correct answer: E
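The cost difference comes down to simple counting. As a rough sketch, assuming a hypothetical grid of 12 hyperparameter combinations and ignoring any final refit on the full training data, the number of model fits is the grid size multiplied by the number of folds:

```python
def models_trained(num_param_combinations: int, num_folds: int) -> int:
    """Number of model fits needed to evaluate every combination on every fold.

    A single train-validation split behaves like num_folds == 1.
    """
    return num_param_combinations * num_folds


grid_size = 12  # hypothetical number of hyperparameter combinations

cv_fits = models_trained(grid_size, num_folds=5)     # 5-fold cross-validation
split_fits = models_trained(grid_size, num_folds=1)  # one train-validation split
```

The same set of hyperparameter values is tested in both cases; what the train-validation split saves is the number of models trained, at the price of a noisier performance estimate.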

Question 3

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.
They use the following code block to create the objective_function:

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

A. Remove the mean operation that is wrapping the cross_val_score operation

B. Replace the fmin operation with the fmax operation

C. Add a random_state argument to the RandomForestRegressor operation

D. Add a test set validation process

E. Replace the r2 return value with -r2

Correct answer: E
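Hyperopt's `fmin` minimizes the value returned by the objective, so a score where higher is better (such as R²) has to be negated. The sketch below is not the code block from the question: `toy_fmin` is a hypothetical stand-in that simply picks the candidate with the smallest loss, and the R² values are made-up numbers used only to show the effect of the sign.

```python
# Hypothetical cross-validated R^2 per max_depth value (made-up numbers).
r2_by_depth = {2: 0.41, 5: 0.78, 10: 0.69}


def objective_returning_r2(depth: int) -> float:
    # Returns the raw score: a minimizer will favor the WORST model.
    return r2_by_depth[depth]


def objective_returning_neg_r2(depth: int) -> float:
    # Returns the negated score: minimizing -R^2 maximizes R^2.
    return -r2_by_depth[depth]


def toy_fmin(fn, space):
    """Stand-in for a minimizer: return the candidate with the smallest loss."""
    return min(space, key=fn)


wrong_choice = toy_fmin(objective_returning_r2, r2_by_depth)
right_choice = toy_fmin(objective_returning_neg_r2, r2_by_depth)
```

With the raw R² as the loss, the search settles on the least accurate configuration, which matches the symptom described in the question.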


Question 4

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

A. When the features are of the categorical type

B. When the features are of the boolean type

C. When the features contain no missing values

D. When the features contain a lot of extreme outliers

E. When the features contain no outliers

Correct answer: D
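A small numeric example shows why. The values below are invented for illustration: five observations near 50 and one extreme outlier. The mean is pulled far away from the typical value, while the median barely moves, so the median is the more representative fill value.

```python
from statistics import mean, median

# Hypothetical observed values of one feature; 5000.0 is an extreme outlier.
observed = [48.0, 50.0, 52.0, 51.0, 49.0, 5000.0]

mean_fill = mean(observed)      # dragged upward by the outlier
median_fill = median(observed)  # stays close to the typical value

# Impute a column that has missing entries (None) with the median.
column = [48.0, None, 52.0, None]
imputed = [median_fill if v is None else v for v in column]
```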


Question 5

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?

A. applyInPandas

B. train_model

C. predict

D. mapInPandas

E. groupedApplyIn

Correct answer: A
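The grouped pattern the question describes can be sketched without Spark. In PySpark, the grouped form of the Pandas Function API is `df.groupBy(...).applyInPandas(func, schema=...)`, which calls the function once per group, whereas `mapInPandas` is called on the DataFrame itself and receives an iterator of batches with no regard to groups. The simulation below is plain Python, not the question's code block: `grouped_apply`, the `device_id` column, and the mean-based "model" are all hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

Row = Dict[str, object]

rows: List[Row] = [
    {"device_id": "a", "y": 1.0},
    {"device_id": "a", "y": 3.0},
    {"device_id": "b", "y": 10.0},
    {"device_id": "b", "y": 20.0},
    {"device_id": "b", "y": 30.0},
]


def train_model(group_rows: List[Row]) -> Row:
    """Hypothetical per-group trainer: the 'model' is just the group mean."""
    return {
        "device_id": group_rows[0]["device_id"],
        "n": len(group_rows),
        "mean_y": mean(r["y"] for r in group_rows),
    }


def grouped_apply(data: List[Row], key: str, func: Callable[[List[Row]], Row]) -> List[Row]:
    # Split rows by key, then call func exactly once per group.
    groups = defaultdict(list)
    for row in data:
        groups[row[key]].append(row)
    return [func(group) for group in groups.values()]


results = sorted(grouped_apply(rows, "device_id", train_model),
                 key=lambda r: r["device_id"])
```

Each group yields one result row, which is the shape a per-group training function typically returns.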


Question 6

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.
Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

A. They can check the Databricks Runtime ML box when creating their clusters.

B. They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

C. They can set the runtime-version variable in their Spark session to "ml".

D. They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

Correct answer: B


Question 7

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.
Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

A. Spark ML decision trees test a random sample of feature variables in the splitting algorithm

B. Spark ML decision trees automatically prune overfit trees

C. Spark ML decision trees test binned feature values as representative split candidates

D. Spark ML decision trees test more split candidates in the splitting algorithm

E. Spark ML decision trees test every feature variable in the splitting algorithm

Correct answer: C
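The difference in split candidates can be illustrated with a simplified sketch. It assumes a single continuous feature and uses exact quantiles from the standard library; Spark ML's actual binning (controlled by the `maxBins` parameter) works on approximate quantiles computed over the distributed data, so this is an illustration of the idea, not Spark's implementation. The function names are hypothetical.

```python
from statistics import quantiles
from typing import List


def exact_candidates(values: List[float]) -> List[float]:
    """Midpoints between consecutive distinct sorted values (exhaustive search)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_candidates(values: List[float], max_bins: int) -> List[float]:
    """Quantile cut points for max_bins bins: max_bins - 1 thresholds."""
    return quantiles(values, n=max_bins)


feature = [float(i) for i in range(1, 101)]  # 100 distinct values

exact = exact_candidates(feature)                # one candidate per gap
binned = binned_candidates(feature, max_bins=4)  # only a few representatives
```

Because the binned search evaluates far fewer thresholds, the chosen split can differ from the one an exhaustive single-node search finds, even with identical data and hyperparameters.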


Question 8

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.
Which of the following describes why?

A. Gradient boosting requires access to all data at once which cannot happen during parallelization.

B. Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

C. Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

D. Gradient boosting is not a linear algebra-based algorithm which is required for parallelization.

Correct answer: C
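The sequential dependency is visible in even a minimal boosting loop. The sketch below is a toy implementation for squared error with one-split regression stumps on a single feature; the data, the learning rate, and the function names are assumptions made for illustration. Each stump is fitted to the residuals left by all earlier stumps, so tree t cannot start before tree t-1 has finished.

```python
from typing import Callable, List, Tuple


def fit_stump(xs: List[float], residuals: List[float]) -> Callable[[float], float]:
    """Best single-threshold regression stump under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs: List[float], ys: List[float], num_trees: int,
          learning_rate: float = 0.5) -> Tuple[List[float], List[float]]:
    predictions = [0.0] * len(xs)
    losses = []
    for _ in range(num_trees):
        # The residuals depend on the predictions of ALL earlier trees,
        # so iterations of this loop cannot run in parallel.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(xs))
    return predictions, losses


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
predictions, losses = boost(xs, ys, num_trees=5)
```

Work inside a single tree, such as evaluating candidate thresholds, can still be parallelized; it is the tree-after-tree ordering that resists it.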

