A data scientist wants to ingest a directory full of plain text files so that each record…

Question

A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from. The first attempt at the code does read the text files but each record contains a single line. This code is shown below: raw_txt_path = '/datasets/raw_txt/*' corpus = spark.read.text(raw_txt_path)\ .select ("*', '_metadata.file path") Which code change can be implemented in a DataFrame that meets the data scientist’s requirements?

Accepted Answer

Correct answer: A. A. Add the option wholetext=True to the text() function — The correct answer is A because adding the option wholetext=True to the text() function allows Spark to read the entire contents of each file as a single record, which meets the data scientist's requirement. The other options do not achieve this; options B and D specify line separators, and option C incorrectly sets wholetext to false, which would lead to reading only single lines.

Databricks Certified Associate Developer for Apache Spark — Question 186

Answer options

Correct answer: A

Explanation