Databricks Certified Associate Developer for Apache Spark — Question 200
A data analyst at an e-commerce company needs to process daily sales data. The data consists of approximately 50,000 records stored in a single CSV file, totaling about 20 MB. The analyst needs to perform aggregations and generate a summary report.
Which approach is most appropriate for the data analyst in this situation?
Answer options
- A. Deploy a real-time streaming solution using Spark Streaming to process incoming data.
- B. Use a local Python script with the pandas library to read and analyze the CSV file.
- C. Implement Apache Spark with a distributed cluster to process the data in parallel.
- D. Set up a Hadoop ecosystem with HDFS and MapReduce for distributed processing.
Correct answer: B
Explanation
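The correct answer is B. A 20 MB CSV file with about 50,000 records fits comfortably in memory on a single machine, so a local Python script using pandas can read and aggregate it without any cluster setup or coordination overhead. Options C and D add distributed infrastructure (a Spark cluster, or HDFS with MapReduce) that only pays off when the data is too large for one machine. Option A targets continuous, real-time processing, which a once-daily batch file does not require.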
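To make option B concrete, here is a minimal sketch of what such a local pandas script might look like. The question does not describe the file's schema, so the column names (`order_id`, `category`, `quantity`, `unit_price`), the inline sample data, and the `summarize_sales` helper are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the daily sales CSV. The column names
# are assumptions for illustration; the question does not specify a schema.
SAMPLE_CSV = """order_id,category,quantity,unit_price
1,books,2,10.0
2,toys,1,25.0
3,books,1,15.0
4,toys,3,5.0
"""


def summarize_sales(source):
    """Read a sales CSV (path or file-like object) and aggregate per category."""
    df = pd.read_csv(source)

    # Derive revenue per row before aggregating.
    df["revenue"] = df["quantity"] * df["unit_price"]

    # One row per category: order count, units sold, and total revenue,
    # sorted so the highest-revenue category comes first.
    summary = (
        df.groupby("category")
        .agg(
            orders=("order_id", "count"),
            units=("quantity", "sum"),
            revenue=("revenue", "sum"),
        )
        .sort_values("revenue", ascending=False)
    )
    return summary


summary = summarize_sales(io.StringIO(SAMPLE_CSV))
print(summary)
```

In practice the analyst would pass the path of the real file to `summarize_sales` instead of the in-memory sample. At roughly 50,000 rows the whole read-and-aggregate step runs in a single process, with no cluster to provision.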