Data Engineering on Microsoft Azure — Question 23
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
What should you recommend?
Answer options
- A. JSON
- B. Parquet
- C. CSV
- D. Avro
Correct answer: B
Explanation
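The correct answer is B, Parquet. Parquet is a columnar file format with efficient compression and encoding, so analytical queries read only the columns they need and run quickly. Each file also embeds its schema, so data type information is retained and both Azure Databricks and PolyBase can read the files without guessing types. JSON and CSV are text formats: CSV carries no type information at all, JSON distinguishes only a few primitive types, and both are slower to scan and more prone to parsing errors. Avro does retain a schema, but it is row-oriented, which makes it less suited to analytical queries, and PolyBase in Azure Synapse Analytics does not support Avro as an external file format.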
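The type-retention point can be illustrated with a minimal sketch that uses only the Python standard library. It round-trips a single record through CSV and JSON and reports the resulting Python types. The record and its field names (`user_id`, `score`, `posted_at`) are hypothetical and are not part of the question. The sketch shows only the type loss in the text formats; it does not write Parquet, which would need a third-party library.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical social media event; field names are illustrative only.
event = {
    "user_id": 42,
    "score": 0.87,
    "posted_at": datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc),
}


def roundtrip_csv(record):
    """Write one record as CSV text and read it back."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
    buf.seek(0)
    return next(csv.DictReader(buf))


def roundtrip_json(record):
    """Serialize one record as JSON text and parse it back."""
    # JSON has no timestamp type, so datetimes are written as ISO 8601 strings.
    text = json.dumps(record, default=lambda v: v.isoformat())
    return json.loads(text)


# Map each field name to the type it has after the round trip.
csv_types = {k: type(v).__name__ for k, v in roundtrip_csv(event).items()}
json_types = {k: type(v).__name__ for k, v in roundtrip_json(event).items()}

print("CSV: ", csv_types)
print("JSON:", json_types)
```

After the CSV round trip every field comes back as a string, and after the JSON round trip the numbers survive but the timestamp is a string. A reader of such files has to infer or be told the intended types, which is the source of errors that a self-describing format such as Parquet avoids.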