Data Engineering on Microsoft Azure — Question 26
You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?
Answer options
- A. Convert the files to JSON
- B. Convert the files to Avro
- C. Compress the files
- D. Merge the files
Correct answer: D
Explanation
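Merging the files is the best option because batch engines pay a fixed cost per file (listing, opening, and scheduling a task), so thousands of 4 KB files spend more time on overhead than on reading data. Microsoft's guidance for Data Lake Storage Gen2 is to organize data into larger files, roughly 256 MB to 100 GB each, for better performance. Converting to JSON or Avro changes the encoding but not the number of files, and JSON is typically more verbose than CSV. Compressing the files shrinks each one but leaves the file count unchanged, so the small-file overhead remains.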
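
As a rough illustration of what merging means for CSV data, the sketch below concatenates several small files that share a header into one larger file, writing the header only once. It is a minimal example under stated assumptions: it works on local paths with the Python standard library rather than against a real Data Lake Storage container, and the function name `merge_csv_files` and its parameters are hypothetical, not part of any Azure SDK.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_csv_files(sources: Iterable[Path], destination: Path) -> int:
    """Concatenate CSV files that share a header into one file.

    Hypothetical helper for illustration only; operates on local paths.
    Returns the number of data rows written (the header is not counted).
    """
    rows_written = 0
    header = None
    with destination.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for source in sources:
            with source.open("r", newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # First non-empty file defines the header, written once.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {source}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

In practice this kind of compaction would more likely run as a scheduled job in a batch engine, grouping the hourly files until each output approaches the target size.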