Databricks Certified Associate Developer for Apache Spark — Question 186
A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.
The first attempt at the code does read the text files but each record contains a single line. This code is shown below:
raw_txt_path = '/datasets/raw_txt/*'
corpus = spark.read.text(raw_txt_path)\
    .select('*', '_metadata.file_path')
Which change to the code meets the data scientist’s requirements?
Answer options
- A. Add the option wholetext=True to the text() function
- B. Add the option lineSep="\n" to the text() function
- C. Add the option wholetext=False to the text() function
- D. Add the option lineSep="," to the text() function
Correct answer: A
Explanation
The correct answer is A. Setting wholetext=True makes text() read each file as a single record, so the value column holds the entire contents of one file, and the existing _metadata.file_path selection supplies the path that text was read from. Options B and D only change the line separator, so a file is still split into multiple records (on newlines or on commas). Option C sets wholetext=False, which is already the default, so the output stays at one record per line.
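The corrected read can be sketched as follows. This is a minimal illustration, not an official solution: the helper name read_whole_files is hypothetical, and it assumes an existing SparkSession is passed in (as the predefined spark object is in a Databricks notebook) on a runtime that exposes the _metadata column, as the question's own code does.

```python
def read_whole_files(spark, path):
    """Return one row per file: the full file text plus its source path.

    `spark` is assumed to be an existing SparkSession; `read_whole_files`
    is a hypothetical helper name used only for this illustration.
    """
    return (
        # wholetext=True turns each matched file into a single record
        spark.read.text(path, wholetext=True)
        # keep the text column and add the path of the file it came from
        .select("*", "_metadata.file_path")
    )
```

In a notebook this would be called as corpus = read_whole_files(spark, raw_txt_path). The resulting DataFrame has a value column with the file contents and a file_path column with the source path.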