Google Cloud Professional Machine Learning Engineer — Question 114

You are experimenting with a built-in distributed XGBoost model in Vertex AI Workbench user-managed notebooks. You use BigQuery to split your data into training and validation sets using the following queries:

CREATE OR REPLACE TABLE `myproject.mydataset.training` AS
(SELECT * FROM `myproject.mydataset.mytable` WHERE RAND() <= 0.8);

CREATE OR REPLACE TABLE `myproject.mydataset.validation` AS
(SELECT * FROM `myproject.mydataset.mytable` WHERE RAND() <= 0.2);

After training the model, you achieve an area under the receiver operating characteristic curve (AUC ROC) value of 0.8, but after deploying the model to production, you notice that your model performance has dropped to an AUC ROC value of 0.65. What problem is most likely occurring?

Answer options

Correct answer: C

Explanation

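The correct answer is C. The two queries evaluate RAND() independently, so a row's membership in one table says nothing about its membership in the other: a given row lands in both tables with probability 0.8 × 0.2 = 0.16 and in neither table with probability 0.2 × 0.8 = 0.16. Because the training and validation tables share records, the AUC ROC measured on the validation table is inflated by rows the model already saw during training, and it does not generalize to unseen data. That is consistent with the drop from 0.8 to 0.65 in production. Options A and D describe issues related to data distribution or function behavior, but they do not address the overlap of records. Option B suggests insufficient training data, which does not explain the specific drop in performance observed after deployment.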
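
The overlap can be checked with a small simulation. The following is a minimal Python sketch, not the BigQuery queries themselves: `flawed_split` mimics the two independent RAND() filters, and `hashed_split` shows one common alternative, assigning each row to exactly one set from a stable hash of its identifier. The function names, the synthetic integer row ids, and the choice of SHA-256 are assumptions made for illustration; in BigQuery a similar deterministic split is often built on a hash function such as FARM_FINGERPRINT applied to a key column.

```python
import hashlib
import random


def flawed_split(row_ids, seed=0):
    """Mimic the two queries: each filter draws its own random number per row."""
    rng = random.Random(seed)
    training = {r for r in row_ids if rng.random() <= 0.8}
    # A second, independent draw per row, as in the second query.
    validation = {r for r in row_ids if rng.random() <= 0.2}
    return training, validation


def hashed_split(row_ids, train_buckets=80, total_buckets=100):
    """Assign each row to exactly one set using a stable hash of its id."""
    training, validation = set(), set()
    for r in row_ids:
        digest = hashlib.sha256(str(r).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % total_buckets
        if bucket < train_buckets:
            training.add(r)
        else:
            validation.add(r)
    return training, validation


# Hypothetical table of 100,000 rows identified by integer ids.
row_ids = range(100_000)
n = len(row_ids)

flawed_train, flawed_val = flawed_split(row_ids)
overlap_fraction = len(flawed_train & flawed_val) / n
unused_fraction = (n - len(flawed_train | flawed_val)) / n
print(f"flawed split: overlap={overlap_fraction:.3f}, unused={unused_fraction:.3f}")

hashed_train, hashed_val = hashed_split(row_ids)
print(f"hashed split: overlap={len(hashed_train & hashed_val)}, "
      f"train share={len(hashed_train) / n:.3f}")
```

With independent draws, both the shared and the unused fractions should come out close to 0.16, matching the arithmetic above. The hash-based split is disjoint by construction and is reproducible across runs, because the assignment depends only on the row id.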