Spark 3.0 Read Binary File into DataFrame

Since Spark 3.0, Spark supports a binaryFile data source format for reading binary files (image, PDF, ZIP, gzip, tar, etc.) into a Spark DataFrame/Dataset. With the binaryFile format, the DataFrameReader converts each binary file into a single record (row); the resulting DataFrame contains the raw content and the metadata of each file.

1. Read a Single Binary File

The example below reads the spark.png image file into a DataFrame. The raw bytes of the file are loaded into the content column.


// Read a Single Binary File
  val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
  df.printSchema()
  df.show()

This returns the below schema and DataFrame.


// Output:
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina...|2020-07-25 10:11:...| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+

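The content column holds the raw bytes of the file, which show() displays in hexadecimal. The leading bytes 89 50 4E 47 are the standard PNG file signature.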
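
To make the content column more concrete, here is a minimal sketch that pulls the bytes of the first row onto the driver and checks them against the PNG signature. The object name, the isPng helper, and the local[*] master are illustrative assumptions, not part of the binaryFile API.

```scala
import org.apache.spark.sql.SparkSession

object InspectBinaryContent {
  // The 8-byte PNG signature: 89 50 4E 47 0D 0A 1A 0A
  val PngSignature: Array[Byte] =
    Array(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A).map(_.toByte)

  // Hypothetical helper: true when the bytes start with the PNG signature
  def isPng(bytes: Array[Byte]): Boolean =
    bytes.length >= PngSignature.length &&
      java.util.Arrays.equals(bytes.take(PngSignature.length), PngSignature)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .appName("InspectBinaryContent")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")

    // Each row carries the whole file as Array[Byte] in the content column
    val row = df.select("path", "length", "content").head()
    val bytes = row.getAs[Array[Byte]]("content")

    println(s"path=${row.getAs[String]("path")}, length=${row.getAs[Long]("length")}")
    println(s"looks like PNG: ${isPng(bytes)}")

    spark.stop()
  }
}
```

Keep in mind that head() brings the entire file into driver memory, so this kind of inspection is best kept to small files.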

To read binary files from Amazon S3, use one of the prefixes below in the path, along with the required third-party dependencies and credentials.

s3:// => First gen
s3n:// => Second gen
s3a:// => Third gen
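
As a rough sketch of the third-generation (s3a) variant, the example below sets credentials on the Hadoop configuration and then loads PNG files from a bucket. The bucket name, the key prefix, and the use of environment variables for credentials are assumptions for illustration; it also assumes the hadoop-aws module (and its matching AWS SDK) is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object ReadBinaryFromS3 {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .appName("ReadBinaryFromS3")
      .master("local[*]")
      .getOrCreate()

    // Assumption: credentials are supplied through environment variables.
    // sys.env(...) throws if a variable is missing.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Hypothetical bucket and prefix; replace with your own
    val df = spark.read.format("binaryFile").load("s3a://my-bucket/binary/*.png")

    // Show only the metadata columns to keep the output readable
    df.select("path", "length").show(false)

    spark.stop()
  }
}
```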

2. Read Multiple Binary Files

The example below reads all PNG image files from a path into a Spark DataFrame.


// Read Multiple Binary Files
  val df3 = spark.read.format("binaryFile").load("/tmp/binary/*.png")
  df3.printSchema()
  df3.show(false)

It reads all PNG files and converts each file into a single record in the DataFrame.

3. Read all Binary Files in a Folder

To read all binary files from a folder, just pass the folder path to the load() method.


// Read all Binary Files in a Folder
  val df2 = spark.read.format("binaryFile").load("/tmp/binary/")
  df2.printSchema()
  df2.show(false)

4. Reading Binary File Options

pathGlobFilter: Loads only files whose paths match the given glob pattern, while keeping the behavior of partition discovery.

For example, the following code reads all PNG files from the path, including any partitioned directories.


// Reading Binary File Options
  val df = spark.read.format("binaryFile")
              .option("pathGlobFilter", "*.png")
              .load("/tmp/binary/")

recursiveFileLookup: Ignores partition discovery and recursively searches for files under the input directory path.


// Recursively read PNG files from all subdirectories
  val df = spark.read.format("binaryFile")
            .option("pathGlobFilter", "*.png")
            .option("recursiveFileLookup", "true")
            .load("/tmp/binary/")

5. A Few Things to Note

When using the binaryFile data source, if you pass a text file to the load() method, it reads the contents of the text file as binary into the DataFrame.
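
If you do load a text file this way, the bytes can be turned back into text. The sketch below shows two ways: casting the content column to string inside Spark, and decoding the bytes on the driver. The file name notes.txt, the decodeUtf8 helper, and the assumption that the file is UTF-8 encoded are all illustrative.

```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object BinaryTextFile {
  // Hypothetical helper: decode raw bytes as UTF-8 text
  def decodeUtf8(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .appName("BinaryTextFile")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical text file, assumed to be UTF-8 encoded
    val df = spark.read.format("binaryFile").load("/tmp/binary/notes.txt")

    // Option 1: cast the binary column to a string inside Spark
    df.select(col("path"), col("content").cast("string").as("text")).show(false)

    // Option 2: fetch the bytes and decode them on the driver
    val bytes = df.select("content").head().getAs[Array[Byte]]("content")
    println(decodeUtf8(bytes))

    spark.stop()
  }
}
```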

The DataFrameReader does not yet have a binary() method, so you can’t use spark.read.binary("path"). I will update this article when it’s available.

Currently, the binary file data source does not support writing a DataFrame back to the binary file format.
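
One possible workaround, shown here only as a sketch, is to iterate over the rows on the driver and write each content array to disk with plain Java NIO. This is not a feature of the binaryFile data source; the output directory, the fileName helper, and the choice to write to the local file system are assumptions for illustration.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

object WriteBinaryBack {
  // Hypothetical helper: last segment of a path such as file:/tmp/binary/spark.png
  def fileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .appName("WriteBinaryBack")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.format("binaryFile").load("/tmp/binary/*.png")

    // Hypothetical output directory on the driver's local file system
    val outDir = Paths.get("/tmp/binary-out")
    Files.createDirectories(outDir)

    // Stream rows to the driver one partition at a time and write each file
    val rows = df.select("path", "content").toLocalIterator()
    while (rows.hasNext) {
      val row = rows.next()
      val name = fileName(row.getAs[String]("path"))
      Files.write(outDir.resolve(name), row.getAs[Array[Byte]]("content"))
    }

    spark.stop()
  }
}
```

Because every file passes through the driver, this approach only suits small collections of files, and files with the same name in different source folders would overwrite each other.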

Conclusion

In summary, Spark 3.0 provides a binaryFile data source to read binary files into a DataFrame, but it does not support writing a DataFrame back to binary files. It also provides the pathGlobFilter option to load only matching files while preserving partition discovery, and the recursiveFileLookup option to recursively load files from subdirectories while ignoring partition discovery.

Reference

https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-binaryFile.html

Happy Learning !!