Data Engineering on Microsoft Azure — Question 28
You are implementing a batch dataset in the Parquet format.
Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution.
What should you do?
Answer options
- A. Use Snappy compression for the files.
- B. Use OPENROWSET to query the Parquet files.
- C. Create an external table that contains a subset of columns from the Parquet files.
- D. Store all data as string in the Parquet files.
Correct answer: A
Explanation
Compressing the Parquet files with Snappy reduces the number of bytes stored in Azure Data Lake Storage Gen2, which lowers storage costs. Snappy is designed for fast compression and decompression rather than the highest possible compression ratio, so the files stay quick to read. The other options do not reduce the amount of data stored: OPENROWSET (B) and an external table over a subset of columns (C) only change how the serverless SQL pool reads the files and leave the files themselves unchanged, while storing all data as strings (D) typically makes the files larger and therefore increases costs.
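The size argument can be illustrated with a minimal sketch that uses only the Python standard library. It does not write Parquet and does not use Snappy: `zlib` stands in for the compression codec, and the 4-byte length prefix for text values is an assumed layout chosen for illustration. The function name `encoded_sizes` and the sample data are hypothetical.

```python
import struct
import zlib


def encoded_sizes(values):
    """Return the byte size of the same integers under three encodings."""
    # Typed storage: each value as a fixed-width 64-bit integer (8 bytes).
    typed = struct.pack(f"<{len(values)}q", *values)

    # String storage: each value as decimal text behind a 4-byte length
    # prefix (an assumed layout for variable-length values).
    as_text = b"".join(
        struct.pack("<I", len(text)) + text
        for text in (str(v).encode("ascii") for v in values)
    )

    # Compressed typed storage. zlib is a stand-in for Snappy, which is
    # not part of the standard library.
    typed_compressed = zlib.compress(typed)

    return {
        "typed": len(typed),
        "as_text": len(as_text),
        "typed_compressed": len(typed_compressed),
    }


# Hypothetical sample: 10,000 consecutive microsecond-style timestamps.
sample = [1_700_000_000_000_000 + i for i in range(10_000)]
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

On this sample the text encoding takes more bytes than the typed encoding, and compressing the typed data takes fewer, which matches the reasoning for options D and A. Actual Parquet file sizes depend on the encodings and codec the writer applies, so the numbers here are only indicative.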