How to Load Parquet file into Snowflake table

Here, you will learn an introduction to Apache Parquet, its advantages, and the steps involved to load a Parquet data file into a Snowflake data warehouse table: first upload the file to a Snowflake internal stage using the PUT command, and then load it from the internal stage into the Snowflake database table using COPY INTO.

Related: Unload Snowflake table to Parquet file

Apache Parquet Introduction

Apache Parquet is a columnar file format that provides optimizations to speed up queries. It is a far more efficient file format than CSV or JSON and is supported by many data processing systems.

It is compatible with most of the data processing frameworks in the Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Spark SQL provides support for both reading and writing Parquet files and automatically captures the schema of the original data; it also reduces data storage by 75% on average. Spark supports Parquet out of the box, so you don’t need to add any dependency libraries.
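For instance, here is a minimal Spark SQL sketch (the file path /tmp/emp.snappy.parquet is assumed) that queries a Parquet file in place, letting Spark infer the schema from the file itself:


-- Spark SQL: query a Parquet file directly without creating a table first
SELECT * FROM parquet.`/tmp/emp.snappy.parquet`;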

Apache Parquet Advantages:

Below are some of the advantages of using Apache Parquet. Combining these benefits with Spark improves performance and gives you the ability to work with structured files.

  • Reduces I/O operations.
  • Fetches only the specific columns you need to access.
  • Consumes less storage space.
  • Supports type-specific encoding.

Load Parquet file to Snowflake table

Loading a Parquet data file into a Snowflake database table is a two-step process.

  • First, upload the data file to a Snowflake internal stage using the PUT command.
  • Second, load the file from the internal stage into the Snowflake table using COPY INTO.

PUT – Upload the file to Snowflake internal stage

Using the PUT command, you can upload the Parquet file to a Snowflake internal stage: a named stage, the table stage, or the user stage.

@~ – Is used to upload to the Snowflake user stage

@%<tablename> – Is used to upload to the Snowflake table stage

@<name> – Is used to upload to a named stage
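The following examples upload to the table stage and the user stage. For completeness, here is a hedged sketch of uploading to a named stage (the stage name MY_PARQUET_STAGE is illustrative and is created first):


-- Create a named internal stage (illustrative name)
CREATE OR REPLACE STAGE MY_PARQUET_STAGE;

-- Upload the Parquet file to the named stage
PUT file:///tmp/data1_0_0_0.snappy.parquet @MY_PARQUET_STAGE;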

The below example uploads the data1_0_0_0.snappy.parquet file to the internal stage of the EMP table.


PUT file:///tmp/data1_0_0_0.snappy.parquet @%EMP;
data1_0_0_0.snappy.parquet(0.00MB): [##########] 100.00% Done (0.241s, 0.01MB/s).
+----------------------------+----------------------------+-------------+-------------+--------------------+--------------------+----------+---------+
| source                     | target                     | source_size | target_size | source_compression | target_compression | status   | message |
|----------------------------+----------------------------+-------------+-------------+--------------------+--------------------+----------+---------|
| data1_0_0_0.snappy.parquet | data1_0_0_0.snappy.parquet |        1445 |        1445 | PARQUET            | PARQUET            | UPLOADED |         |
+----------------------------+----------------------------+-------------+-------------+--------------------+--------------------+----------+---------+
1 Row(s) produced. Time Elapsed: 2.068s

Let’s see an example with the user stage.


PUT file:///apps/sparkbyexamples/data1_0_0_0.snappy.parquet @~;

Another example with the user stage along with a path.


PUT file:///apps/sparkbyexamples/data1_0_0_0.snappy.parquet @~/tmp;
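PUT also accepts optional parameters such as PARALLEL and OVERWRITE. A small sketch (the values are illustrative) that uploads with four parallel threads and replaces a previously staged file of the same name:


-- Upload using 4 parallel threads and overwrite an existing staged file
PUT file:///apps/sparkbyexamples/data1_0_0_0.snappy.parquet @~/tmp PARALLEL=4 OVERWRITE=TRUE;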

LIST – List the files in an internal stage

To verify whether the file was uploaded successfully, use the LIST command with the stage as a parameter.


LIST @%EMP;
+----------------------------+------+----------------------------------+------------------------------+
| name                       | size | md5                              | last_modified                |
|----------------------------+------+----------------------------------+------------------------------|
| data1_0_0_0.snappy.parquet | 1456 | 41f9cd57a7639a3c937e097804e6aebf | Fri, 6 Mar 2020 06:09:12 GMT |
+----------------------------+------+----------------------------------+------------------------------+
1 Row(s) produced. Time Elapsed: 0.589s
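When a stage holds many files, LIST also accepts a PATTERN option that filters the results with a regular expression (a small sketch; the pattern is illustrative):


-- List only the Parquet files in the EMP table stage
LIST @%EMP PATTERN='.*parquet';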

COPY INTO – Load the Parquet file to Snowflake table

Once you have uploaded the Parquet file to the internal stage, use the COPY INTO <tablename> command to load the Parquet file into the Snowflake database table.

First, create a table EMP with a single column of type VARIANT.


CREATE OR REPLACE TABLE EMP(PARQUET_RAW VARIANT);

COPY INTO EMP from (select $1 from @%EMP/data1_0_0_0.snappy.parquet)
 file_format = (type=PARQUET COMPRESSION=SNAPPY);
+----------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file                       | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|----------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| data1_0_0_0.snappy.parquet | LOADED |           5 |           5 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
+----------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 1.472s

Alternatively, you can write the same statement as:


COPY INTO EMP from @%EMP/data1_0_0_0.snappy.parquet
  file_format = (type=PARQUET COMPRESSION=SNAPPY);
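COPY INTO also supports additional copy options. A hedged sketch (the option values are illustrative) that skips a problem file instead of aborting the statement and removes the staged file after a successful load:


COPY INTO EMP from @%EMP/data1_0_0_0.snappy.parquet
  file_format = (type=PARQUET COMPRESSION=SNAPPY)
  ON_ERROR = 'SKIP_FILE'
  PURGE = TRUE;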

Finally, let’s run the select and see if the data loaded successfully.


SELECT * FROM EMP;
+------------------------------------+
| PARQUET_RAW                        |
|------------------------------------|
| {                                  |
|   "_COL_0": "James",               |
|   "_COL_1": "Smith",               |
|   "_COL_2": 1.000000000000000e+04, |
|   "_COL_3": 10,                    |
|   "_COL_4": "M"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Michael",             |
|   "_COL_1": "Jones",               |
|   "_COL_2": 2.000000000000000e+04, |
|   "_COL_3": 10,                    |
|   "_COL_4": "M"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Robert",              |
|   "_COL_1": "Williams",            |
|   "_COL_2": 3.000400000000000e+03, |
|   "_COL_3": 10,                    |
|   "_COL_4": "M"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Maria",               |
|   "_COL_1": "Jones",               |
|   "_COL_2": 2.347650000000000e+04, |
|   "_COL_3": 20,                    |
|   "_COL_4": "F"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Jen",                 |
|   "_COL_1": "Mary",                |
|   "_COL_2": 1.234450000000000e+05, |
|   "_COL_3": 20,                    |
|   "_COL_4": "F"                    |
| }                                  |
+------------------------------------+
5 Row(s) produced. Time Elapsed: 0.667s

Note that the above example loads the Parquet file into a single VARIANT column (each record in the Parquet file is loaded into a single column of a row).
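Since the data lands in a single VARIANT column, you can extract individual fields with Snowflake’s path notation and a cast. Below is a small sketch using the _COL_0 through _COL_4 field names visible in the output above (the column aliases are illustrative guesses at their meaning):


SELECT PARQUET_RAW:"_COL_0"::STRING AS first_name,
       PARQUET_RAW:"_COL_1"::STRING AS last_name,
       PARQUET_RAW:"_COL_2"::NUMBER AS salary,
       PARQUET_RAW:"_COL_3"::INT    AS dept_id,
       PARQUET_RAW:"_COL_4"::STRING AS gender
FROM EMP;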

Conclusion

To load a Parquet file into a Snowflake table, you first upload the data file to a Snowflake internal stage and then load the file from the internal stage into the table. You can also change the compression, specify the date and time formats the loading file uses, and set many more loading options.
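As one example of those options, you can wrap the format settings in a reusable named file format instead of repeating them in every COPY INTO statement (a sketch; the name my_parquet_format is illustrative):


-- Define the Parquet file format once
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET
  COMPRESSION = SNAPPY;

-- Reference it by name when loading
COPY INTO EMP from @%EMP/data1_0_0_0.snappy.parquet
  file_format = (format_name = 'my_parquet_format');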


Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium