Databricks Certified Associate Developer for Apache Spark — Question 186

A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.

The first attempt at the code does read the text files but each record contains a single line. This code is shown below:

raw_txt_path = '/datasets/raw_txt/*'

corpus = spark.read.text(raw_txt_path)\
    .select("*", "_metadata.file_path")

Which code change will produce a DataFrame that meets the data scientist’s requirements?

Answer options

Correct answer: A

Explanation

The correct answer is A: passing wholetext=True to spark.read.text() makes Spark read the entire contents of each file as a single record instead of one record per line, which meets the data scientist's requirement. The _metadata.file_path column already selected in the original code supplies the full path of the source file. The other options do not achieve this: options B and D only specify line separators, and option C sets wholetext to False, which is the default and leaves each line as a separate record.