Databricks Certified Machine Learning Associate — Question 32

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.
Which of the following approaches will guarantee a reproducible training and test set for each model?

Answer options

Correct answer: B

Explanation

The correct answer is B because writing out the split datasets to persistent storage ensures that the training and test sets are stored in a consistent manner, allowing for reproducibility. The other options do not guarantee consistent datasets; manual configuration, partitioning, or setting a speed may lead to varied results based on the cluster's state or processing conditions.