Updated and verified on March 30, 2025! Pass the Databricks-Machine-Learning-Associate exam, with a first-attempt pass guarantee [Q39-Q63]

The free Databricks-Machine-Learning-Associate sample covers 100% of the real exam questions (76 updated questions in total)

Question # 39
A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

  • A. Model evaluation
  • B. Model tuning
  • C. Model deployment
  • D. Exploratory data analysis

Correct answer: C

Explanation:
Databricks AutoML covers most of the modeling workflow inside the experiment: it prepares the dataset, trains and tunes a set of candidate models, and records evaluation metrics for each trial. It also generates a data exploration notebook with summary statistics for the input data, so basic exploratory data analysis is part of the experiment output. Deploying the chosen model, for example by registering it and serving it for batch or real-time inference, is a separate step that the data scientist performs after the experiment finishes.
Reference
Databricks documentation on AutoML: https://docs.databricks.com/applications/machine-learning/automl.html


Question # 40
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.
They use the following code block to create the objective_function:

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

  • A. Add a random_state argument to the RandomForestRegressor operation
  • B. Add test set validation process
  • C. Remove the mean operation that is wrapping the cross_val_score operation
  • D. Replace the fmin operation with the fmax operation
  • E. Replace the r2 return value with -r2

Correct answer: E

Explanation:
Hyperopt's fmin searches for the hyperparameters that minimize the value returned by the objective function. The objective here returns the R2 score from cross_val_score, which measures the proportion of variance in the label explained by the model, so higher values are better. Returning r2 directly therefore steers fmin toward the worst-scoring hyperparameters. Returning -r2 instead means that minimizing the objective maximizes R2, which leads to a more accurate model.
Reference
Hyperopt Documentation: http://hyperopt.github.io/hyperopt/
Scikit-Learn documentation on model evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html
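The sign issue can be seen without Hyperopt at all. The following is a minimal sketch in plain Python: pick_minimum is a hypothetical stand-in for a minimizer such as fmin, and the R2 scores are made-up numbers used only for illustration.

```python
# Hypothetical cross-validated R2 scores for three candidate settings
# (illustrative numbers, not results from any real model).
cv_r2 = {"max_depth=2": 0.41, "max_depth=5": 0.78, "max_depth=10": 0.69}


def pick_minimum(objective, candidates):
    """Stand-in for a minimizer such as fmin: return the candidate with the lowest objective."""
    return min(candidates, key=objective)


def objective_wrong(candidate):
    # Returning R2 directly makes the minimizer look for the WORST model.
    return cv_r2[candidate]


def objective_fixed(candidate):
    # Returning -R2 makes "lowest loss" coincide with "highest R2".
    return -cv_r2[candidate]


worst = pick_minimum(objective_wrong, cv_r2)
best = pick_minimum(objective_fixed, cv_r2)
print(worst, best)
```

With the unmodified objective the minimizer settles on the lowest-scoring setting; negating the score is enough to make it select the highest-scoring one.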


Question # 41
The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

  • A. Iterative optimization
  • B. Logistic regression
  • C. Singular value decomposition
  • D. Least-squares method

Correct answer: A

Explanation:
For large datasets, Spark ML distributes linear regression training with iterative optimization. When the number of features is small enough, LinearRegression can solve the normal equations directly; otherwise it falls back to an iterative solver such as L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno). In each iteration the executors compute partial results, such as gradient contributions, on their own partitions of the data, and these are aggregated to update the model parameters. Because each evaluation only needs an aggregation over the partitioned data, this approach scales to data that does not fit on a single machine.
Reference:
Databricks documentation on linear regression: Linear Regression in Spark ML
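The iterative pattern can be sketched in plain Python. This is an illustration under stated assumptions, not the solver Spark ML itself uses: it runs ordinary batch gradient descent on toy data, and the two lists in partitions stand in for data held on different workers.

```python
# Toy data for y = 2x + 1, split into "partitions" as a cluster would hold them.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]


def partial_gradient(rows, w, b):
    """Sum of squared-error gradients over one partition (computed locally)."""
    gw = sum(2 * (w * x + b - y) * x for x, y in rows)
    gb = sum(2 * (w * x + b - y) for x, y in rows)
    return gw, gb


def fit(partitions, lr=0.05, steps=2000):
    """Batch gradient descent where each step aggregates per-partition gradients."""
    n = sum(len(p) for p in partitions)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Each partition contributes a partial gradient; the driver adds them up.
        grads = [partial_gradient(p, w, b) for p in partitions]
        gw = sum(g[0] for g in grads) / n
        gb = sum(g[1] for g in grads) / n
        w -= lr * gw
        b -= lr * gb
    return w, b


w, b = fit(partitions)
print(round(w, 4), round(b, 4))
```

Only the small gradient tuples cross partition boundaries in each step, which is what makes this style of training easy to distribute.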


Question # 42
Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

  • A. pandas API on Spark DataFrames are unrelated to Spark DataFrames
  • B. pandas API on Spark DataFrames are more performant than Spark DataFrames
  • C. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
  • D. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
  • E. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

Correct answer: D

Explanation:
Pandas API on Spark (previously known as Koalas) provides a pandas-like API on top of Apache Spark. It allows users to perform pandas operations on large datasets using Spark's distributed compute capabilities. Internally, it uses Spark DataFrames and adds metadata that facilitates handling operations in a pandas-like manner, ensuring compatibility and leveraging Spark's performance and scalability.
Reference
pandas API on Spark documentation: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html


Question # 43
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?

  • A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
  • B. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
  • C. One-hot encoding is not supported by most machine learning libraries.
  • D. One-hot encoding is dependent on the target variable's values which differ for each application.
  • E. One-hot encoding is not a common strategy for representing categorical feature variables numerically.

Correct answer: A

Explanation:
One-hot encoding transforms categorical variables into a format that can be provided to machine learning algorithms to better predict the output. However, when done prematurely or universally within a feature repository, it can be problematic:
Dimensionality Increase: One-hot encoding significantly increases the feature space, especially with high cardinality features, which can lead to high memory consumption and slower computation.
Model Specificity: Some models handle categorical variables natively (like decision trees and boosting algorithms), and premature one-hot encoding can lead to inefficiency and loss of information (e.g., ordinal relationships).
Sparse Matrix Issue: It often results in a sparse matrix where most values are zero, which can be inefficient in both storage and computation for some algorithms.
Generalization vs. Specificity: Encoding should ideally be tailored to specific models and use cases rather than applied generally in a feature repository.
Reference
"Feature Engineering and Selection: A Practical Approach for Predictive Models" by Max Kuhn and Kjell Johnson (CRC Press, 2019).
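The growth in dimensionality and the sparsity described above can be seen in a small sketch. The one_hot helper and the city values are hypothetical and written in plain Python for illustration; they are not part of any feature store API.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


cities = ["tokyo", "osaka", "tokyo", "kyoto", "nagoya"]
categories, encoded = one_hot(cities)

# One input column becomes one column per distinct value, mostly zeros.
print(len(categories))
print(encoded[0])
```

A column with thousands of distinct values would produce thousands of mostly zero columns, which is why the encoding decision is usually left to the individual model pipeline.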


Question # 44
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?

  • A. The second model is much more accurate than the first model
  • B. The first model is much more accurate than the second model
  • C. The RMSE is an invalid evaluation metric for regression problems
  • D. The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE
  • E. The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

Correct answer: C

Explanation:
The Root Mean Squared Error (RMSE) is a standard and widely used metric for evaluating the accuracy of regression models. The statement that it is invalid is incorrect. Here's a breakdown of why the other statements are or are not valid:
Transformations and RMSE Calculation: If the model predictions were transformed (e.g., using log), they should be converted back to their original scale before calculating RMSE to ensure accuracy in the evaluation. Missteps in this conversion process can lead to misleading RMSE values.
Accuracy of Models: Without additional information, we can't definitively say which model is more accurate without considering their RMSE values properly scaled back to the original price scale.
Appropriateness of RMSE: RMSE is entirely valid for regression problems as it provides a measure of how accurately a model predicts the outcome, expressed in the same units as the dependent variable.
Reference
"Applied Predictive Modeling" by Max Kuhn and Kjell Johnson (Springer, 2013), particularly the chapters discussing model evaluation metrics.


Question # 45
A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.
Which of the following terms is used to describe this combination of models?

  • A. Ensemble learning
  • B. Bootstrap aggregation
  • C. Support vector machines
  • D. Stacking
  • E. Bucketing

Correct answer: A

Explanation:
Ensemble learning is a machine learning technique that involves combining several models to solve a particular problem. The scenario described fits the concept of ensemble learning, where two models, each performing well under different conditions, are combined to create a more robust model. This approach often leads to better performance as it combines the strengths of multiple models.
Reference
Introduction to Ensemble Learning: https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
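One simple way to combine the two models in this scenario is to route each input to the model that is reliable for that range of the feature. The sketch below assumes two hypothetical models, model_low and model_high, whose formulas are placeholders rather than trained models.

```python
def model_low(x):
    """Hypothetical model that is reliable when the feature is below 5."""
    return 2.0 * x


def model_high(x):
    """Hypothetical model that is reliable when the feature is 5 or more."""
    return 10.0 + 0.5 * (x - 5)


def combined_predict(x, threshold=5):
    # Route each input to the model that performs well in that region.
    return model_low(x) if x < threshold else model_high(x)


predictions = [combined_predict(x) for x in (1, 5, 8)]
print(predictions)
```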


Question # 46
The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

  • A. Iterative optimization
  • B. Logistic regression
  • C. Singular value decomposition
  • D. Spark ML cannot distribute linear regression training
  • E. Least-squares method

Correct answer: A

Explanation:
For large datasets with many variables, Spark ML distributes the training of a linear regression model using iterative optimization. Rather than decomposing a matrix on one machine, it uses an iterative solver such as L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) that repeatedly evaluates the loss and its gradient and updates the coefficients. These evaluations parallelize naturally: the data is partitioned across the nodes of the cluster, each node computes its share, and the results are combined.
Reference:
Spark MLlib Documentation (Linear Regression with Iterative Optimization).


Question # 47
A machine learning engineer is trying to scale a machine learning pipeline named pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.
Which of the following is a negative consequence of the approach suggested by the colleague?

  • A. The model will be refit one more time per cross-validation fold
  • B. The cross-validation process will no longer be
  • C. The cross-validation process will no longer be reproducible
  • D. The feature engineering stages will be computed using validation data
  • E. The model will take longer to train for each unique combination of hyperparameter values

Correct answer: D

Explanation:
If the model object is passed to the estimator parameter of CrossValidator and the cv object is placed as the final stage of the pipeline, the feature engineering stages run only once, which is why tuning gets faster. However, those stages are then fit on the full dataset before CrossValidator splits it into training and validation folds. Any statistics they learn, such as category indexes or scaling parameters, are computed using rows that later serve as validation data. This leaks information from the validation folds into training and can make the cross-validation metrics overly optimistic.
Reference:
Cross-validation and Pipeline Integration in MLlib (Avoiding Data Leakage in Pipelines).


Question # 48
Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

  • A. TrainValidationSplit
  • B. DataFrame.randomSplit
  • C. DataFrame.where
  • D. CrossValidator
  • E. TrainValidationSplitModel

Correct answer: B

Explanation:
The correct method to randomly split a Spark DataFrame into training and test sets is by using the randomSplit method. This method allows you to specify the proportions for the split as a list of weights and returns multiple DataFrames according to those weights. This is directly intended for splitting DataFrames randomly and is the appropriate choice for preparing data for training and testing in machine learning workflows.
Reference:
Apache Spark DataFrame API documentation (DataFrame Operations: randomSplit).
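A minimal sketch of such a split is shown below. The helper name split_train_test, the 80/20 fraction, and the seed are assumptions chosen for illustration; df is assumed to be an existing Spark DataFrame.

```python
def split_train_test(df, train_fraction=0.8, seed=42):
    """Randomly split a Spark DataFrame into (train_df, test_df).

    `df` is assumed to be a pyspark.sql.DataFrame; the fraction and seed are
    illustrative defaults, not values taken from the question.
    """
    # randomSplit takes a list of weights and an optional seed, and returns
    # one DataFrame per weight.
    train_df, test_df = df.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )
    return train_df, test_df
```

The resulting row counts only approximate the requested proportions, and fixing the seed helps keep the split repeatable across runs on the same data.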


Question # 49
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
* Hyperparameter 1: [2, 5, 10]
* Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?

  • A. 0
  • B. 1
  • C. 2
  • D. 3

Correct answer: A

Explanation:
To determine the number of machine learning models that can be trained in parallel, we need to calculate the total number of combinations of hyperparameters. The given hyperparameter grid includes:
Hyperparameter 1: [2, 5, 10] (3 values)
Hyperparameter 2: [50, 100] (2 values)
The total number of combinations is the product of the number of values for each hyperparameter: 3 (values of Hyperparameter 1) × 2 (values of Hyperparameter 2) = 6. With 3-fold cross-validation, each combination is evaluated 3 times, so the total number of models trained is 6 (combinations) × 3 (folds) = 18. However, the number of models that can be trained in parallel is equal to the number of hyperparameter combinations, not the total number of models across all folds. Therefore, 6 models can be trained in parallel.
Reference:
Databricks documentation on hyperparameter tuning: Hyperparameter Tuning
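The counts for this grid can be checked by enumerating it. The sketch below uses only the Python standard library; the dictionary keys are placeholder names for the two hyperparameters in the question.

```python
from itertools import product

grid = {
    "hyperparameter_1": [2, 5, 10],
    "hyperparameter_2": [50, 100],
}
num_folds = 3

# Every combination of values is one candidate configuration.
combinations = list(product(*grid.values()))

# Each configuration is fit once per fold.
total_fits = len(combinations) * num_folds

print(len(combinations), total_fits)
```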


Question # 50
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process.
Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?

  • A. quniform
  • B. fmin
  • C. objective_function
  • D. search_space
  • E. SparkTrials

Correct answer: E

Explanation:
The SparkTrials class in the Hyperopt library allows for parallel hyperparameter optimization on a Spark cluster. Passing a SparkTrials object to fmin distributes the trials across the cluster so that several hyperparameter settings are evaluated at the same time.
from hyperopt import fmin, tpe, hp, SparkTrials

# Search space: two independent uniform hyperparameters
search_space = {
    'x': hp.uniform('x', 0, 1),
    'y': hp.uniform('y', 0, 1),
}

def objective(params):
    # Loss to minimize
    return params['x'] ** 2 + params['y'] ** 2

# Evaluate up to 4 trials at a time on the Spark cluster
spark_trials = SparkTrials(parallelism=4)
best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=100, trials=spark_trials)
Reference:
Hyperopt Documentation


Question # 51
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

  • A. import pyspark.pandas as ps
    df = ps.DataFrame(spark_df)
  • B. import pandas as pd
    df = pd.DataFrame(spark_df)
  • C. spark_df.to_sql()
  • D. spark_df.to_pandas()
  • E. import pyspark.pandas as ps
    df = ps.to_pandas(spark_df)

Correct answer: A

Explanation:
The pandas API on Spark is designed to bridge the gap between the simplicity of pandas and the scalability of Spark. It is available as the pyspark.pandas module (the project was previously known as Koalas). Importing pyspark.pandas as ps and passing the Spark DataFrame to ps.DataFrame produces a pandas-on-Spark DataFrame, so the data scientist can work with the familiar pandas-like API on large datasets managed by Spark.
Reference
Pandas API on Spark Documentation: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html


Question # 52
The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

  • A. Iterative optimization
  • B. Logistic regression
  • C. Singular value decomposition
  • D. Spark ML cannot distribute linear regression training
  • E. Least-squares method

Correct answer: A


Question # 53
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

  • A. import pyspark.pandas as ps
    df = ps.DataFrame(spark_df)
  • B. import pandas as pd
    df = pd.DataFrame(spark_df)
  • C. spark_df.to_pandas()
  • D. import pyspark.pandas as ps
    df = ps.to_pandas(spark_df)

Correct answer: A

Explanation:
To use the pandas API on Spark, the data scientist can run the following code block:
import pyspark.pandas as ps
df = ps.DataFrame(spark_df)
This code imports the pandas API on Spark and converts the Spark DataFrame spark_df into a pandas-on-Spark DataFrame, allowing the data scientist to use familiar pandas functions for further feature engineering.
Reference:
Databricks documentation on pandas API on Spark: pandas API on Spark


Question # 54
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?

  • A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
  • B. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
  • C. One-hot encoding is dependent on the target variable's values which differ for each application.
  • D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.

Correct answer: A

Explanation:
The suggestion not to one-hot encode categorical feature variables within the feature repository is justified because one-hot encoding can be problematic for some machine learning algorithms. Specifically, one-hot encoding increases the dimensionality of the data, which can be computationally expensive and may lead to issues such as multicollinearity and overfitting. Additionally, some algorithms, such as tree-based methods, can handle categorical variables directly without requiring one-hot encoding.
Reference:
Databricks documentation on feature engineering: Feature Engineering


Question # 55
A machine learning engineer is trying to scale a machine learning pipeline by distributing its single-node model tuning process. After broadcasting the entire training data onto each core, each core in the cluster can train one model at a time. Because the tuning process is still running slowly, the engineer wants to increase the level of parallelism from 4 cores to 8 cores to speed up the tuning process. Unfortunately, the total memory in the cluster cannot be increased.
In which of the following scenarios will increasing the level of parallelism from 4 to 8 speed up the tuning process?

  • A. When the data is particularly wide in shape
  • B. When the model is unable to be parallelized
  • C. When the entire data can fit on each core
  • D. When the data is particularly long in shape
  • E. When the tuning process is randomized

Correct answer: C

Explanation:
Increasing the level of parallelism from 4 to 8 cores can speed up the tuning process if each core can handle the entire dataset. This ensures that each core can independently work on training a model without running into memory constraints. If the entire dataset fits into the memory of each core, adding more cores will allow more models to be trained in parallel, thus speeding up the process.
Reference:
Parallel Computing Concepts


Question # 56
A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.
They are using the following code block to evaluate the model:
regression_evaluator.setMetricName("rmse").evaluate(preds_df)
Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

  • A. They should take the log of the predictions before computing the RMSE
  • B. They should exponentiate the predictions before computing the RMSE
  • C. They should exponentiate the computed RMSE value
  • D. They should evaluate the MSE of the log predictions to compute the RMSE

Correct answer: B

Explanation:
When evaluating the RMSE for a model that predicts log-transformed prices, the predictions need to be transformed back to the original scale to obtain an RMSE that is comparable with the actual price values. This is done by exponentiating the predictions before computing the RMSE. The RMSE should be computed on the same scale as the original data to provide a meaningful measure of error.
Reference:
Databricks documentation on regression evaluation: Regression Evaluation
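The effect of the scale can be shown with a few made-up numbers. This is a sketch in plain Python, not the Spark evaluator from the question: the rmse helper, the prices, and the predictions are all hypothetical.

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(squared / len(actuals))


# Hypothetical prices and a model's predictions of log(price).
prices = [100.0, 200.0, 400.0]
log_predictions = [math.log(110.0), math.log(190.0), math.log(420.0)]

# RMSE in log units: not comparable with price.
rmse_log = rmse(log_predictions, [math.log(p) for p in prices])

# Exponentiate the predictions first to get an RMSE in price units.
price_predictions = [math.exp(lp) for lp in log_predictions]
rmse_price = rmse(price_predictions, prices)

print(rmse_log, rmse_price)
```

The two values differ by orders of magnitude even though they describe the same predictions, which is why the metric has to be computed on the price scale before it is compared with other price-scale results.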


Question # 57
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:

Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?

  • A. Utilize the Pipeline API to standardize the test data according to the training data's summary statistics
  • B. Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values
  • C. Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values
  • D. Utilize the Pipeline API to standardize the training data according to the test data's summary statistics
  • E. Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data

Correct answer: A

Explanation:
To address the concern about standardizing features prior to splitting the data, the correct approach is to use the Pipeline API to ensure that only the training data's summary statistics are used to standardize the test data. This is achieved by fitting the StandardScaler (or any scaler) on the training data and then transforming both the training and test data using the fitted scaler. This approach prevents information leakage from the test data into the model training process and ensures that the model is evaluated fairly.
Reference:
Best Practices in Preprocessing in Spark ML (Handling Data Splits and Feature Standardization).
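The fit-on-train, transform-both pattern can be sketched without Spark. SimpleStandardScaler below is a hypothetical stand-in written for illustration, not Spark's StandardScaler, and the data values are made up.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Minimal stand-in for a scaler: learns mean and standard deviation in fit()."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler = SimpleStandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_, test_scaled)
```

In Spark ML the same idea is usually expressed by fitting a Pipeline on the training DataFrame and calling transform on the test DataFrame with the fitted model.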


Question # 58
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.
As a result, they have the following code block:

Which of the following changes do they need to make to the above code block in order to accomplish the task?

  • A. Change fmin() to fmax()
  • B. Remove the algo=tpe.suggest argument
  • C. Change SparkTrials() to Trials()
  • D. Reduce num_evals to be less than 10
  • E. Remove the trials=trials argument

Correct answer: C

Explanation:
SparkTrials distributes the trials of a Hyperopt run across a Spark cluster, with each trial running on a single worker. It is intended for single-machine models such as scikit-learn estimators. Here the objective function wraps a Spark ML model, whose training is itself distributed across the cluster, so the trials cannot be shipped to individual workers. In this case the standard Trials class should be used: Hyperopt then evaluates the trials one after another from the driver, and each trial uses the cluster through Spark ML.
Reference
Hyperopt documentation: http://hyperopt.github.io/hyperopt/


Question # 59
A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime increases drastically and processing becomes slow.
Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

  • A. Spark SQL
  • B. PySpark DataFrame API
  • C. Feature Store
  • D. pandas API on Spark

Correct answer: D

Explanation:
The pandas API on Spark provides a way to scale pandas operations to big data while minimizing the need for refactoring existing pandas code. It allows users to run pandas operations on Spark DataFrames, leveraging Spark's distributed computing capabilities to handle large datasets more efficiently. This approach requires minimal changes to the existing code, making it a convenient option for scaling pandas-based feature engineering notebooks.
Reference:
Databricks documentation on pandas API on Spark: pandas API on Spark


Question # 60
A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

  • A. The process will leak data from the training set to the test set during the evaluation phase
  • B. The process will be unable to parallelize tuning due to the distributed nature of pipeline
  • C. The process will leak data prep information from the validation sets to the training sets for each model
  • D. The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

Correct answer: D

Explanation:
Including the entire pipeline as the estimator in the cross-validation process means that all stages of the pipeline, including data preprocessing steps like string indexing and vector assembling, will be refit or retransformed for each fold of the cross-validation. This results in a longer runtime because each fold requires re-execution of these preprocessing steps, which can be computationally expensive.
If only the random forest regressor (rfr) were included as the estimator, the preprocessing steps would be performed once, and only the model fitting would be repeated for each fold, significantly reducing the computational overhead.
Reference:
Databricks documentation on cross-validation: Cross Validation
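The difference in work can be made concrete with a toy simulation. CountingStage and cross_validate are hypothetical helpers written for this sketch, and the fold and grid sizes are arbitrary; nothing here calls Spark.

```python
class CountingStage:
    """Toy preprocessing stage that counts how many times it is fit."""

    def __init__(self):
        self.fit_calls = 0

    def fit(self, rows):
        self.fit_calls += 1
        return self


def cross_validate(fit_once, rows, num_folds, num_param_combos):
    """Call fit_once(training_rows) for every (fold, hyperparameter combination) pair."""
    for fold in range(num_folds):
        training_rows = [r for i, r in enumerate(rows) if i % num_folds != fold]
        for _ in range(num_param_combos):
            fit_once(training_rows)


rows = list(range(12))

# Pipeline as the estimator: the preprocessing stage is refit for every model.
stage_inside_cv = CountingStage()
cross_validate(stage_inside_cv.fit, rows, num_folds=3, num_param_combos=4)

# Model as the estimator: preprocessing is fit once, before cross-validation starts.
stage_outside_cv = CountingStage()
stage_outside_cv.fit(rows)
cross_validate(lambda training_rows: None, rows, num_folds=3, num_param_combos=4)

print(stage_inside_cv.fit_calls, stage_outside_cv.fit_calls)
```

The cheaper arrangement fits the preprocessing stage on all rows, including those later used for validation, so the runtime saving comes with a risk of leaking validation information into training.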


Question # 61
A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?

  • A. predict
  • B. train_model
  • C. mapInPandas
  • D. applyInPandas
  • E. groupedApplyIn

Correct answer: D

Explanation:
The code block groups df with groupby, and the goal is to apply train_model to each group. applyInPandas is the Pandas Function API method for this: called on the grouped DataFrame, it passes each group to the function as a separate pandas DataFrame and combines the returned pandas DataFrames into a new Spark DataFrame. mapInPandas, by contrast, is called on an ungrouped DataFrame and applies a function to iterators of batches, so it does not respect the grouping. applyInPandas is therefore the piece of code that completes the task.
Reference
PySpark Documentation on applying functions to grouped data: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
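The question's own code block is not reproduced here, so the following is only a sketch of the pattern. The column names device_id and label, the output schema, and the body of train_model are assumptions made for illustration; df is assumed to be an existing Spark DataFrame.

```python
def train_model(group_pdf):
    """Hypothetical per-group training function.

    Receives all rows of one group as a pandas DataFrame and must return a
    pandas DataFrame matching the declared schema. The "model" here is just
    the group's mean label, standing in for real model fitting.
    """
    return group_pdf[["device_id"]].head(1).assign(
        n_rows=len(group_pdf),
        mean_label=group_pdf["label"].mean(),
    )


def train_all_groups(df):
    """Apply train_model to each device_id group of a Spark DataFrame `df`."""
    return df.groupby("device_id").applyInPandas(
        train_model,
        schema="device_id string, n_rows long, mean_label double",
    )
```

Each group is processed independently, so Spark can train the group-specific models in parallel across the cluster, provided that every single group fits in the memory of one worker.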


Question # 62
A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

  • A. Reproducibility is achievable when using a train-validation split
  • B. Bias is avoidable when using a train-validation split
  • C. Fewer models need to be trained when using a train-validation split
  • D. A holdout set is not necessary when using a train-validation split
  • E. Fewer hyperparameter values need to be tested when using a train-validation split

Correct answer: C

Explanation:
A train-validation split is often preferred over k-fold cross-validation (with k > 2) when computational efficiency is a concern. With a train-validation split, each hyperparameter combination is fit once, on the training portion, and scored on the validation portion, whereas k-fold cross-validation fits k models for each combination (one per fold).
This reduction in the number of models trained can save significant computational resources and time, especially when dealing with large datasets or complex models.
Reference:
Model Evaluation with Train-Test Split


Question # 63
......
