Google Cloud Professional Machine Learning Engineer — Question 77
You are profiling the training time of your TensorFlow model and notice a performance issue caused by inefficiencies in the input data pipeline, which reads a dataset stored as a single 5-terabyte CSV file on Cloud Storage. You need to optimize the input pipeline performance. Which action should you try first to increase the efficiency of your pipeline?
Answer options
- A. Preprocess the input CSV file into a TFRecord file.
- B. Randomly select a 10 gigabyte subset of the data to train your model.
- C. Split into multiple CSV files and use a parallel interleave transformation.
- D. Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.
Correct answer: C
Explanation
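The best first action is to split the CSV into multiple files and read them with a parallel interleave transformation (Option C). A single 5-terabyte file on Cloud Storage is typically read as one sequential stream, so the reads cannot be parallelized; with many smaller files, tf.data can open several at once and overlap their reads, which removes the data-loading bottleneck. Option A changes the storage format but still leaves a single file to read sequentially, and it requires a full preprocessing pass over the data first. Option B discards most of the data, which is likely to hurt model quality, and does not make the pipeline itself any more efficient. Option D only controls whether the shuffle order changes between epochs (it already defaults to true) and has no effect on read throughput.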
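As a rough illustration of Option C, the sketch below splits a CSV into shards and then reads the shards with a parallel interleave. It is a minimal example under stated assumptions, not a production recipe: the helper names `split_csv` and `make_dataset`, the shard naming scheme, and the batch size are all hypothetical, and the CSV is assumed to have a single header row. A real 5-terabyte file would more likely be sharded with a distributed job than with a local loop.

```python
import csv
import os


def split_csv(src_path, out_dir, rows_per_shard):
    """Stream src_path into shards of at most rows_per_shard data rows.

    Assumes the first row is a header; it is repeated in every shard.
    Returns the list of shard paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    writer = None
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_shard == 0:
                # Start a new shard (hypothetical naming scheme).
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"shard-{len(shard_paths):05d}.csv")
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                shard_paths.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return shard_paths


def make_dataset(file_pattern, batch_size=256):
    """Read many CSV shards concurrently with a parallel interleave."""
    # Imported lazily so split_csv can be used without TensorFlow installed.
    import tensorflow as tf

    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    lines = files.interleave(
        # One line-reader per shard; skip(1) drops that shard's header row.
        lambda path: tf.data.TextLineDataset(path).skip(1),
        cycle_length=tf.data.AUTOTUNE,        # how many shards are open at once
        num_parallel_calls=tf.data.AUTOTUNE,  # read the open shards in parallel
        deterministic=False,                  # trade element order for throughput
    )
    return lines.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Something like `make_dataset("gs://my-bucket/shards/shard-*.csv")` (a hypothetical bucket path) would then yield batches of raw CSV lines; parsing them into features, for example with `tf.io.decode_csv`, depends on the column schema and is left out here.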