Spark Convert Parquet file to JSON


In this Spark article, you will learn how to convert a Parquet file to JSON file format with a Scala example. To convert, we first read the Parquet file into a DataFrame and then write that DataFrame out as JSON.

What is Apache Parquet

Apache Parquet is a columnar file format supported by many data processing systems. It provides optimizations that speed up queries and is a far more efficient file format than CSV or JSON.

It is compatible with most of the data processing frameworks in the Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Spark SQL supports both reading and writing Parquet files and automatically captures the schema of the original data. Parquet also reduces data storage by 75% on average. Spark supports Parquet out of the box, so we don’t need to add any dependency libraries.
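To illustrate the schema capture mentioned above, here is a minimal sketch that writes a small DataFrame to Parquet and reads it back without supplying a schema. The object name, column names, and sample rows are made up for illustration, and the output goes to a temporary directory rather than to the zipcodes file used in this article. The comparison ignores nullability, because Spark marks all columns as nullable when it reads Parquet.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetSchemaRoundTrip {

  // Writes a small DataFrame to Parquet in a temp directory and reads it back.
  // Column names and values are hypothetical sample data.
  def roundTrip(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val original = Seq(("Alice", 34, 55000.0), ("Bob", 41, 62000.5))
      .toDF("name", "age", "salary")

    val path = Files.createTempDirectory("parquet-schema-")
      .resolve("people.parquet").toString
    original.write.parquet(path)

    // No schema is passed here; Spark takes it from the Parquet metadata
    (original, spark.read.parquet(path))
  }

  // Name and data type of every column (nullability is left out on purpose)
  def columnTypes(df: DataFrame): Seq[(String, String)] =
    df.schema.fields.toSeq.map(f => (f.name, f.dataType.simpleString))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ParquetSchemaRoundTrip")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val (original, restored) = roundTrip(spark)
    restored.printSchema()
    // Compare column names and types before and after the round trip
    println(columnTypes(original) == columnTypes(restored))

    spark.stop()
  }
}
```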

Apache Parquet Advantages:

Below are some of the advantages of using Apache Parquet. Combining these benefits with Spark improves performance and gives the ability to work with structured files.

  • Reduces IO operations.
  • Fetches only the specific columns that you need to access.
  • Consumes less space.
  • Supports type-specific encoding.
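The column-fetching advantage in the list above can be sketched as follows. This is only an illustration under assumptions: the object name, the column names (zipcode, city, state), and the sample rows are hypothetical and are not taken from the zipcodes.parquet file used in this article.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetColumnPruning {

  // Writes hypothetical sample data as Parquet and returns the output path
  def writeSample(spark: SparkSession): String = {
    import spark.implicits._
    val path = Files.createTempDirectory("parquet-prune-")
      .resolve("zips.parquet").toString
    Seq(("00704", "PARC PARQUE", "PR"), ("85209", "MESA", "AZ"))
      .toDF("zipcode", "city", "state")
      .write.parquet(path)
    path
  }

  // Selects a single column, so Spark only needs that column from the files
  def citiesOnly(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path).select("city")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ParquetColumnPruning")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val cities = citiesOnly(spark, writeSample(spark))
    cities.explain() // prints the physical plan for the pruned read
    cities.show()

    spark.stop()
  }
}
```

In the plan printed by explain(), the file scan would be expected to list only the selected column in its read schema, though the exact plan text varies between Spark versions.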

Reading Parquet file into DataFrame

Spark's DataFrameReader provides the parquet() function (spark.read.parquet), which reads Parquet files and creates a Spark DataFrame. In this example, we read data from an Apache Parquet file.


val df = spark.read.parquet("src/main/resources/zipcodes.parquet")

Alternatively, you can also write the above statement as:


  //read parquet file
  val df = spark.read.format("parquet")
    .load("src/main/resources/zipcodes.parquet")
  df.show()
  df.printSchema()

If you want to read more on Parquet, I would recommend checking how to Read and Write Parquet file with a specific schema along with the dependencies and how to use partitions.

Spark Convert Parquet to JSON file

In the previous section, we read the Parquet file into a DataFrame. Now let’s convert it by saving the DataFrame in JSON file format. Note that Spark treats the output path as a directory and writes one or more part files into it.


import org.apache.spark.sql.SaveMode

//convert to json, replacing any existing output at this path
df.write.mode(SaveMode.Overwrite)
  .json("/tmp/json/zipcodes.json")

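Alternatively, you can omit the save mode. Spark then uses its default mode (ErrorIfExists), so the write fails if the output path already exists.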


df.write
.json("/tmp/json/zipcodes.json")

If you want to read more on JSON, I would recommend checking how to Read and Write JSON file with a specific schema.
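Because the JSON output path is a directory of part files, it can help to check what was actually written. The sketch below is one way to do that, not the only one: it uses coalesce(1) to get a single part file, lists the part files, and reads the JSON back into a DataFrame. The object name, helper names, and sample rows are hypothetical, and the output goes to a temporary directory instead of /tmp/json/zipcodes.json.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object JsonOutputCheck {

  // Writes df as JSON into outDir, using one partition so one part file results
  def writeSingleJson(df: DataFrame, outDir: String): Unit =
    df.coalesce(1).write.mode(SaveMode.Overwrite).json(outDir)

  // Names of the JSON part files Spark created inside outDir
  def partFiles(outDir: String): Seq[String] =
    Option(new File(outDir).listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      .map(_.getName)
      .filter(n => n.startsWith("part-") && n.endsWith(".json"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("JsonOutputCheck")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    // Hypothetical sample rows standing in for the DataFrame read from Parquet
    val df = Seq(("00704", "PR"), ("85209", "AZ")).toDF("zipcode", "state")
    val outDir = Files.createTempDirectory("json-out-")
      .resolve("zipcodes.json").toString

    writeSingleJson(df, outDir)
    println(partFiles(outDir))      // the part file(s) inside the directory
    spark.read.json(outDir).show()  // read the JSON back to verify the rows

    spark.stop()
  }
}
```

Keep in mind that coalesce(1) funnels all data through a single task, so it is best kept for small outputs.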

Complete Example to convert Parquet file to JSON file format


package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.{SaveMode, SparkSession}

object ParquetToJson extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  //read parquet file
  val df = spark.read.format("parquet")
    .load("src/main/resources/zipcodes.parquet")
  df.show()
  df.printSchema()

  //convert to json
  df.write.mode(SaveMode.Overwrite)
    .json("/tmp/json/zipcodes.json")
}

Conclusion

In this Spark article, you have learned how to convert a Parquet file to JSON file format with Scala examples. Note that we don’t convert from Parquet to JSON directly: we first read the Parquet file into a DataFrame, and a DataFrame can then be saved in any format Spark supports.

Happy Learning !!

NNK
