Spark Unstructured vs Semi-structured vs Structured Data

Spark RDD natively supports reading text files, and later, with the DataFrame API, Spark added support for additional data sources such as CSV, JSON, Avro, Parquet, and many more. Depending on the data source you choose, you may need a third-party dependency, and Spark can read and write all these files from/to Windows (using winutils), Linux, HDFS, S3, Azure, GCP, and many more cloud platforms.

Unstructured data

Text file formats are considered unstructured data. To process text files, use <a href="https://sparkbyexamples.com/spark/spark-read-text-file-rdd-dataframe/">spark.read.text()</a> or <a href="https://sparkbyexamples.com/spark/spark-read-text-file-rdd-dataframe/">spark.read.textFile()</a>.
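A minimal sketch in Scala, assuming a local SparkSession and a placeholder input path; spark.read.text() returns a DataFrame with a single value column, while spark.read.textFile() returns a Dataset[String]:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReadTextFile")
  .master("local[*]")
  .getOrCreate()

// Returns a DataFrame with one string column named "value"
val df = spark.read.text("/tmp/data/input.txt")

// Returns a Dataset[String], one element per line
val ds = spark.read.textFile("/tmp/data/input.txt")

df.show(5, truncate = false)
```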

Semi-Structured data

CSV and TSV are considered semi-structured data, and to process a CSV file, we should use <a href="https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/">spark.read.csv()</a>, as shown below.
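A minimal sketch, reusing the SparkSession from the example above; the file paths are placeholders:

```scala
// Read a CSV file; header and inferSchema are optional reader settings
val csvDf = spark.read
  .option("header", "true")      // treat the first line as column names
  .option("inferSchema", "true") // let Spark guess each column's type
  .csv("/tmp/data/people.csv")

// TSV files use the same reader with a tab separator
val tsvDf = spark.read
  .option("sep", "\t")
  .csv("/tmp/data/people.tsv")
```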

XML and JSON file formats are considered semi-structured data, as the data in the file can represent values such as strings, integers, arrays, etc., but without explicitly declaring their data types.

Processing a JSON file in Spark can be done using <a href="https://sparkbyexamples.com/spark/spark-read-and-write-json-file/">spark.read.json("path")</a> or <a href="https://sparkbyexamples.com/spark/spark-read-and-write-json-file/">spark.read.format("json").load("path")</a>.
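Both forms below are equivalent; the paths are placeholders:

```scala
// spark.read.json() is shorthand for format("json").load()
val jsonDf1 = spark.read.json("/tmp/data/people.json")
val jsonDf2 = spark.read.format("json").load("/tmp/data/people.json")

// Spark expects one JSON object per line by default; enable multiLine
// for pretty-printed (multi-line) JSON documents
val multiLineDf = spark.read
  .option("multiLine", "true")
  .json("/tmp/data/people.json")
```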

Note that parsing unstructured and semi-structured data into a DataFrame or Dataset is comparatively slow, since Spark has to scan and parse the data to infer its schema.
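One common way to reduce this cost is to supply an explicit schema up front so Spark can skip the inference pass; a minimal sketch, with hypothetical field names:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// With an explicit schema, Spark avoids the extra pass over the data
// that schema inference would otherwise require
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val typedDf = spark.read.schema(schema).json("/tmp/data/people.json")
```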

Structured data

Avro and Parquet file formats are considered structured data, as these formats preserve the structure/schema of the data along with its data types.

The avro() function is not provided in Spark's DataFrameReader; hence, we should specify the data source format as "avro" (or the fully qualified name org.apache.spark.sql.avro) and use load() to read the Avro file, passing the file path (for example, an HDFS path) as an argument to the load function.
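A minimal sketch, assuming the external spark-avro package is on the classpath; the package version and the HDFS path below are placeholders:

```scala
// Submit with the spark-avro package, e.g.:
//   spark-submit --packages org.apache.spark:spark-avro_2.12:3.4.1 ...

val avroDf = spark.read
  .format("avro") // the fully qualified name "org.apache.spark.sql.avro" also works
  .load("hdfs://namenode:9000/data/users.avro")
```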

DataFrameReader provides the parquet() function (spark.read.parquet) to read Parquet files and create a Spark DataFrame.
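For example (the path is a placeholder); since Parquet stores the schema in the file itself, no reader options are needed:

```scala
// Parquet carries its own schema, so columns and types come from the file metadata
val parquetDf = spark.read.parquet("/tmp/data/users.parquet")
parquetDf.printSchema()
```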
