Spark saveAsTable() is a method on DataFrameWriter that saves the content of a DataFrame as the specified table. It takes a single string argument: the name of the table to write to.
1. What is Spark saveAsTable()
saveAsTable() is a method of the Spark DataFrameWriter class that allows you to save the content of a DataFrame or a Dataset as a table in a database. The table can be stored in various databases like Apache Hive, Apache HBase, or any other JDBC-compliant database.
When you use the saveAsTable() method, Spark creates a new table in the specified database if it doesn't exist. If the table already exists, the outcome depends on the save mode: by default Spark throws an error, and you can instead choose to append to, overwrite, or ignore the existing table. The method takes a string argument that specifies the name of the table to be created or written to.
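The existing-table behavior is controlled with the mode() setting on DataFrameWriter; here is a minimal sketch, where df and my_table are just placeholders:
import org.apache.spark.sql.SaveMode
//Fail if the table already exists (the default)
df.write.mode(SaveMode.ErrorIfExists).saveAsTable("my_table")
//Replace the existing table and its data
df.write.mode(SaveMode.Overwrite).saveAsTable("my_table")
//Add the rows of df to the existing table
df.write.mode(SaveMode.Append).saveAsTable("my_table")
//Do nothing if the table already exists
df.write.mode(SaveMode.Ignore).saveAsTable("my_table")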
Here is an example of how to use saveAsTable() to save a DataFrame to a Hive table:
import org.apache.spark.sql.SparkSession
//Create a Spark Session with Hive support enabled.
val spark = SparkSession.builder()
  .appName("SaveAsTableExample")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
//Import implicits so toDF() is available on a local Seq.
import spark.implicits._
//Create a DataFrame.
val df = Seq((1, "John"), (2, "Jane"), (3, "Bob")).toDF("id", "name")
//Make sure the target database exists, then save the DataFrame as a table.
spark.sql("CREATE DATABASE IF NOT EXISTS my_database")
df.write.saveAsTable("my_database.my_table")
In this example, we first create a SparkSession with Hive support enabled. Then, we create a DataFrame and use the saveAsTable() method to save it as a table in the Hive metastore. The table will be created in the my_database database with the name my_table.
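Once the table exists in the metastore, it can be read back into a DataFrame or queried with SQL by name:
//Read the table back into a DataFrame
val tableDf = spark.table("my_database.my_table")
//Or query it with Spark SQL
spark.sql("SELECT id, name FROM my_database.my_table").show()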
2. saveAsTable() Examples
2.1 Saving a DataFrame as a table in a default database:
//Create a DataFrame.
val df = Seq((1, "John"), (2, "Jane"), (3, "Bob")).toDF("id", "name")
//Save DataFrame into a table in the default database:
df.write.saveAsTable("my_table")
This will save the contents of df as a table called my_table in the default database.
2.2 Saving a DataFrame as a table in a specific database:
//Create a DataFrame.
val df = Seq((1, "John"), (2, "Jane"), (3, "Bob")).toDF("id", "name")
//Save DataFrame into a table in a specific database by qualifying the table name:
df.write.saveAsTable("my_database.my_table")
//OR set the current database first and use the unqualified name:
spark.catalog.setCurrentDatabase("my_database")
df.write.saveAsTable("my_table")
This will save the contents of df as a table called my_table in the my_database database.
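To confirm where the table landed, you can list the tables registered in that database through the catalog API:
//List the tables in my_database
spark.catalog.listTables("my_database").show()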
2.3 Saving a DataFrame as a table with a specific format:
//Create a DataFrame.
val df = Seq((1, "John"), (2, "Jane"), (3, "Bob")).toDF("id", "name")
//Save DataFrame into a table with a specific data format in the default database:
df.write.format("parquet").saveAsTable("my_table")
This will save the contents of df as a table called my_table using the Parquet file format.
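Other built-in formats such as orc, json, and csv can be passed to format() in the same way. To check which format and location a table ended up with, you can describe it in SQL; a quick sketch:
//Inspect the table's storage format, location, and other metadata
spark.sql("DESCRIBE FORMATTED my_table").show(100, truncate = false)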
2.4 Saving a DataFrame as a table with specific partition columns
//Create a DataFrame that includes the partition columns.
val df = Seq((1, "John", 2023, 1), (2, "Jane", 2023, 2), (3, "Bob", 2023, 3))
  .toDF("id", "name", "year", "month")
//Save DataFrame into a partitioned table in the default database:
df.write.partitionBy("year", "month").saveAsTable("my_table")
This will save the contents of df as a table called my_table and partition it by the year and month columns.
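Partitioning pays off when queries filter on the partition columns, because Spark can skip the partitions that don't match. A small sketch of such a query:
import org.apache.spark.sql.functions.col
//Only the year=2023/month=1 partition needs to be scanned
spark.table("my_table")
  .where(col("year") === 2023 && col("month") === 1)
  .show()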
2.5 Saving a DataFrame as an external table
//Create a DataFrame.
val df = Seq((1, "John"), (2, "Jane"), (3, "Bob")).toDF("id", "name")
//Save DataFrame as an external table by providing an explicit path:
df.write.option("path", "/path/to/table").saveAsTable("my_table")
This will save the contents of df as an external table called my_table with its data stored at the specified path.
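The practical difference from a managed table is that dropping an external table removes only the metastore entry, not the underlying files. A sketch, assuming the default Parquet source and the placeholder path from above:
//Drop the table definition from the metastore
spark.sql("DROP TABLE my_table")
//The data files are still on disk and can be read back by path
val stillThere = spark.read.parquet("/path/to/table")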
Note: The saveAsTable() method writes both the data and the table metadata for the metastore or catalog you are using. It is therefore best suited to cases where you want a table that can be queried by name, rather than plain files on a file system.
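If plain files are all you need, the format-specific writers are the usual alternative; for example (the output path is a placeholder):
//Write plain Parquet files with no metastore entry
df.write.mode("overwrite").parquet("/tmp/my_table_files")
//Read them back by path instead of by table name
val filesDf = spark.read.parquet("/tmp/my_table_files")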
3. Uses of Spark saveAsTable()
The saveAsTable() method in Apache Spark is used to save the content of a DataFrame or a Dataset as a table in a database. Here are some common use cases for the saveAsTable() method:
- Persisting data: saveAsTable() allows you to persist the data of a DataFrame or a Dataset as a table in a database. This is useful when you want to reuse the data later without re-reading it from the original data source. You can save the data to a database and then read it back into a DataFrame or Dataset when needed.
- Sharing data: saveAsTable() can be used to share data with other users or applications. When you save a DataFrame or a Dataset as a table, other users can access it using SQL queries, and you can also use the table in other Spark applications. This is particularly useful when a large amount of data needs to be shared across different teams or applications.
- Migrating data: saveAsTable() can also be used to migrate data from one database to another. You can read data from a source database, create a DataFrame or a Dataset, and then save it as a table in a destination database. This can be useful when you need to move data between different database systems.
- Querying data: saveAsTable() can be used to create a table that can be queried using SQL. You can use the table in Spark SQL or any other SQL client that supports the database you are using. This is useful when you need to perform complex queries on the data, or when you need to join the data with other tables (see the sketch after this list).
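As a rough illustration of the persisting and querying use cases, the sketch below assumes a sales database and two hypothetical DataFrames, ordersDf and customersDf, with the columns shown:
//Persist two DataFrames as tables (ordersDf and customersDf are hypothetical)
ordersDf.write.mode("overwrite").saveAsTable("sales.orders")
customersDf.write.mode("overwrite").saveAsTable("sales.customers")
//Later, or from another application, query and join them by name with SQL
val report = spark.sql(
  """SELECT c.name, COUNT(*) AS order_count
    |FROM sales.orders o
    |JOIN sales.customers c ON o.customer_id = c.id
    |GROUP BY c.name""".stripMargin)
report.show()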
4. Conclusion
Overall, Spark saveAsTable() provides a convenient way to persist data in a database, share it with other users or applications, and run complex queries on it. It is a powerful tool for data management and analytics in Apache Spark.