Spark Convert JSON to Avro, CSV & Parquet

In this Spark article, you will learn how to read a JSON file into a DataFrame and convert or save the DataFrame to CSV, Avro, and Parquet file formats, with Scala examples.

Though the examples below use JSON as the source, once the data is in a DataFrame, we can convert it to any format Spark supports, regardless of how and from where it was read.

1. Read JSON into DataFrame

Using spark.read.json("path") or spark.read.format("json").load("path"), you can read a JSON file into a Spark DataFrame. These methods take a file path as an argument, and they also support reading multi-line JSON files and applying a custom schema.


// Read json file into dataframe
val df = spark.read.json("src/main/resources/zipcodes.json")
df.printSchema()
df.show(false)

This snippet prints the schema and sample data to the console.
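
As mentioned above, you can also read multi-line JSON and supply your own schema. Below is a minimal sketch, assuming the same zipcodes.json file; the field names and types shown are illustrative, not the full schema of the sample file.

// Read a multi-line JSON file with a user-supplied schema
// (field names/types below are illustrative; adjust to your data)
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true)
))

val dfWithSchema = spark.read
  .schema(schema)
  .option("multiline", "true")
  .json("src/main/resources/zipcodes.json")
dfWithSchema.printSchema()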

2. Spark Convert JSON to Avro file

Once we have the JSON in a Spark DataFrame, we can write the DataFrame in the Avro file format. First, let’s see what the Avro file format is, and then look at some examples in Scala.

Apache Avro is an open-source, row-based, data serialization and data exchange framework from the Hadoop ecosystem. Spark’s Avro support comes from the spark-avro library, originally developed by Databricks as an open-source project, which supports reading and writing data in the Avro file format. Avro is widely used with Apache Spark, especially in Kafka-based data pipelines. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.

Since the Avro library is external to Spark, DataFrameWriter doesn’t provide an avro() function; hence, we use the data source name “avro” (or the fully qualified “org.apache.spark.sql.avro”) to write a Spark DataFrame to an Avro file.


// Spark Convert JSON to Avro file
df.write.format("avro").save("/tmp/zipcodes.avro")

Spark DataFrameWriter provides the partitionBy() function to partition the Avro output at the time of writing. Partitioning improves read performance by reducing disk I/O for queries that filter on the partition columns.


df.write.partitionBy("State","Zipcode")
        .format("avro").save("/tmp/zipcodes_partition.avro")

If you want to read more on Avro, I would recommend checking how to Read and Write Avro files with a specific schema, along with the dependencies it needs.

3. Spark Convert JSON to Parquet file

Let’s see how to convert the Spark DataFrame we created from JSON to a Parquet file. First, let’s see what the Parquet file format is, and then look at some examples in Scala.

Apache Parquet is a columnar file format that provides optimizations to speed up queries and is far more efficient than CSV or JSON; it is supported by many data processing systems.

Using the df.write.parquet() function, we can write a Spark DataFrame to a Parquet file; the parquet() function is provided by the DataFrameWriter class. As mentioned earlier, Spark doesn’t need any additional packages or libraries to use Parquet, as support for it ships with Spark by default, so there are no version or compatibility issues to worry about. In this example, we write the DataFrame to a “zipcodes.parquet” file.


df.write.parquet("/tmp/zipcodes.parquet")
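
To verify the write, the file can be read straight back into a DataFrame with spark.read.parquet(). A minimal sketch:

// Read the Parquet file back and inspect schema and data
val dfParquet = spark.read.parquet("/tmp/zipcodes.parquet")
dfParquet.printSchema()
dfParquet.show(false)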

If you want to read more on Parquet, I would recommend checking how to Read and Write Parquet files with a specific schema, along with the dependencies it needs.

4. Spark Convert JSON to CSV file

Similar to Avro and Parquet, once we have a DataFrame created from a JSON file, we can easily convert or save it to a CSV file using df.write.csv("path").


// Spark Convert JSON to CSV file
df.write
  .option("header","true")
  .csv("/tmp/zipcodes.csv")

In this example, we used the header option to write the CSV file with a header row. Spark also supports many other options for reading and writing CSV files.
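
For example, options such as delimiter and nullValue can be combined with header. A minimal sketch (the option values here are illustrative):

// Write CSV with a header, a pipe delimiter, and a placeholder for nulls
// (option values are illustrative)
df.write
  .option("header", "true")
  .option("delimiter", "|")
  .option("nullValue", "NA")
  .mode("overwrite")
  .csv("/tmp/zipcodes_options.csv")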

5. Complete Example of converting JSON to Avro, Parquet and CSV file


package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.SparkSession

object JsonToAvroCsvParquet extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  // Read json file into dataframe
  val df = spark.read.json("src/main/resources/zipcodes.json")
  df.printSchema()
  df.show(false)

  // Convert to avro
  df.write.format("avro").save("/tmp/avro/zipcodes.avro")

  // Convert to avro by partition
  df.write.partitionBy("State","Zipcode")
    .format("avro").save("/tmp/avro/zipcodes_partition.avro")

  // Convert to parquet
  df.write.parquet("/tmp/parquet/zipcodes.parquet")

  // Convert to csv
  df.write.option("header","true").csv("/tmp/csv/zipcodes.csv")
}

Conclusion

In this Spark article, you have learned how to convert a JSON file to Avro, Parquet, and CSV files with Scala examples. Note that we don’t convert directly from one format to another: we first read the data into a DataFrame, and from there it can be written to any format Spark supports.

Happy Learning !!
