In this Spark article, you will learn how to convert an Avro file to the Parquet file format with a Scala example. To perform the conversion, we first read the Avro file into a DataFrame and then write it out as a Parquet file.
1. What is Apache Avro
Apache Avro is an open-source, row-based data serialization and data exchange framework for Hadoop projects. Spark reads and writes the Avro file format through the spark-avro module, which was originally developed by Databricks as an open-source library. Avro is widely used with Apache Spark, especially in Kafka-based data pipelines. When Avro data is stored in a file, its schema is stored with it, so the file can be processed later by any program.
It was built to serialize and exchange big data between different Hadoop-based projects. It serializes data in a compact binary format, while the schema is stored in JSON format and defines the field names and data types.
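To make the schema idea concrete, here is a minimal sketch of what an Avro schema in JSON could look like for zipcode-style data. The record and field names here are assumptions for illustration, not taken from the actual zipcodes.avro file used below.

```json
{
  "type": "record",
  "name": "Zipcode",
  "fields": [
    {"name": "Zipcode", "type": "int"},
    {"name": "City",    "type": "string"},
    {"name": "State",   "type": "string"}
  ]
}
```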
2. Avro Advantages
- Supports complex data structures such as arrays, maps, arrays of maps, and maps of arrays.
- A compact, binary serialization format that provides fast data transfer.
- Row-based data serialization system.
- Supports multiple languages, meaning data written in one language can be read in another.
- Code generation is not required to read or write data files.
- Simple integration with dynamic languages.
3. Read Avro into DataFrame

An avro() function is not provided in the Spark DataFrameReader, so we specify the data source format as “avro” (or “org.apache.spark.sql.avro”) and use load() to read the Avro file.
// Read avro file
val df = spark.read.format("avro")
.load("src/main/resources/zipcodes.avro")
df.show()
df.printSchema()
If your Avro data is partitioned, use the where() function to load a specific partition. The snippet below loads the Avro data and filters for Zipcode 19802 (note that col() requires importing org.apache.spark.sql.functions.col).
spark.read
.format("avro")
.load("zipcodes_partition.avro")
.where(col("Zipcode") === 19802)
.show()
If you want to read more on Avro, I would recommend checking out how to read and write an Avro file with a specific schema, along with the dependencies it needs.
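Since spark-avro is an external module rather than part of spark-sql, it must be added to your build. Here is a minimal build.sbt sketch; the version numbers are assumptions and should match the Spark and Scala versions of your cluster.

```scala
// build.sbt: spark-avro is not bundled with spark-sql, so add it explicitly.
// Versions below are examples; use the ones matching your Spark distribution.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.5.0",
  "org.apache.spark" %% "spark-avro" % "3.5.0"
)
```

Alternatively, for spark-shell or spark-submit you can pull the module at launch time with the --packages option instead of changing the build.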
4. Spark Convert Avro to Parquet file
In the previous section, we read the Avro file into a DataFrame; now let’s convert it to Parquet by saving it in the Parquet file format. Before we start, let’s first learn what Parquet is and what its advantages are.
4.1 What is Apache Parquet
Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON, supported by many data processing systems.
It is compatible with most of the data processing frameworks in the Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Spark SQL supports both reading and writing Parquet files, automatically preserving the schema of the original data. Parquet’s compression and encoding can also substantially reduce storage compared to text formats such as CSV or JSON. Spark supports Parquet out of the box, so we don’t need to add any dependency libraries. Below are some advantages of storing data in Parquet format.
4.2 Apache Parquet Advantages:
Below are some of the advantages of using Apache Parquet. Combining these benefits with Spark improves performance and enables working with structured files.
- Reduces IO operations.
- Fetches specific columns that you need to access.
- It consumes less space.
- Supports type-specific encoding.
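To see the column-pruning benefit in practice, here is a hedged sketch that reads back only the columns it needs from the Parquet file written in the next section. The column names "Zipcode" and "City" are assumptions about the sample data; adjust them to your schema.

```scala
// Reading only selected columns from Parquet: because Parquet is columnar,
// Spark scans just those column chunks instead of the whole file.
val zips = spark.read.parquet("/tmp/parquet/zipcodes.parquet")
  .select("Zipcode", "City")   // column names are assumptions from the sample data
zips.printSchema()
```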
Using the df.write.parquet() function, we can write a Spark DataFrame to a Parquet file; the parquet() function is provided by the DataFrameWriter class. As mentioned earlier, Spark doesn’t need any additional packages or libraries to use Parquet, as it is supported by default. Easy, isn’t it? We don’t have to worry about version and compatibility issues. In this example, we write the DataFrame to a “zipcodes.parquet” file.
// Convert to parquet
df.write.mode(SaveMode.Overwrite)
.parquet("/tmp/parquet/zipcodes.parquet")
If you want to read more on Parquet, I would recommend checking out how to read and write a Parquet file with a specific schema, along with the dependencies and how to use partitions.
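As a quick sketch of partitioned Parquet output, you could split the files on disk by a column value. This assumes the DataFrame has a "State" column and uses a hypothetical output path; adjust both to your data.

```scala
// Writing Parquet partitioned by a column creates one sub-directory per
// distinct value (e.g. .../State=DE/), which enables partition pruning on reads.
df.write.mode(SaveMode.Overwrite)
  .partitionBy("State")   // "State" is an assumed column from the zipcodes data
  .parquet("/tmp/parquet/zipcodes_by_state")
```

Queries that filter on the partition column can then skip entire directories instead of scanning every file.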
5. Complete Example of converting Avro file to Parquet file format
package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession}
object AvroToParquet extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
//read avro file
val df = spark.read.format("avro")
.load("src/main/resources/zipcodes.avro")
df.show()
df.printSchema()
//convert to parquet
df.write.mode(SaveMode.Overwrite)
.parquet("/tmp/parquet/zipcodes.parquet")
}
Conclusion
In this Spark article, you have learned how to convert an Avro file to the Parquet file format with a Scala example. Note that we don’t convert directly from Avro to Parquet: we first read the Avro data into a DataFrame, and the DataFrame can then be saved in any format Spark supports.
Happy Learning !!