Load file from Amazon S3 into Snowflake table

In this Snowflake article, you will learn how to load a CSV/Parquet/Avro data file from an Amazon S3 bucket (external stage) into a Snowflake table.


Related: Unload Snowflake table to Amazon S3 bucket.

I assume you already have a CSV/Parquet/Avro file in the Amazon S3 bucket that you want to load into a Snowflake table. This article doesn’t cover how to upload a file to an S3 bucket.

COPY INTO – Loads the data file from Amazon S3 to Snowflake table

Using the COPY INTO <table name> SQL command, you can load data files from the internal table stage, internal user stage, internal named stages, external named stages, and external locations.

In this article, you will learn how to load from an Amazon S3 bucket external location into a Snowflake table. Snowflake also supports loading from Azure, GCP, and other clouds, which will be covered in other articles.

Once you upload the file to the Amazon S3 bucket, use the COPY INTO <tablename> command, either from SnowSQL or from the Snowflake web console, to load the CSV file into the Snowflake database table.
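
The COPY examples below assume the target table EMP already exists. Here is a minimal sketch of a matching table definition; the column types are assumptions inferred from the sample rows shown later.


-- Target table for the CSV examples; column types are assumptions based on the sample data
create or replace table EMP(
  FNAME      STRING,
  LNAME      STRING,
  SALARY     NUMBER(10,2),
  DEPARTMENT INT,
  GENDER     STRING
);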


COPY INTO EMP from 's3://mybucket/load/emp.csv'
credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');

+------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file       | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| emp.csv    | LOADED |           5 |           5 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
+------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 2.096s

By default, it also supports loading GZIP-compressed (.gz) files. Now, let’s run a SELECT and see if the data loaded successfully.


select * from EMP;
+---------+----------+----------+------------+--------+
| FNAME   | LNAME    |   SALARY | DEPARTMENT | GENDER |
|---------+----------+----------+------------+--------|
| James   | Smith    |  10000   |         10 | M      |
| Michael | Jones    |  20000   |         10 | M      |
| Robert  | Williams |   3000.4 |         10 | M      |
| Maria   | Jones    |  23476.5 |         20 | F      |
| Jen     | Mary     | 123445   |         20 | F      |
+---------+----------+----------+------------+--------+
5 Row(s) produced. Time Elapsed: 0.646s
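Alternatively, instead of passing AWS credentials on every COPY INTO statement, you can create a named external stage that points to the S3 bucket and load from it. Below is a minimal sketch; the stage name my_s3_stage is just an example.


-- Named external stage pointing at the S3 location (stage name is an example)
create or replace stage my_s3_stage
  url = 's3://mybucket/load/'
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');

-- Load the same file by referencing the stage instead of the raw S3 path
COPY INTO EMP from @my_s3_stage/emp.csv;
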

Options with COPY INTO

COMPRESSION – Load file with specified compression

Use COMPRESSION to specify the compression of the file you want to load. By default it is set to AUTO, which automatically detects common compression formats such as GZIP. You can also set it explicitly to one of AUTO | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE.


COPY INTO EMP from 's3://mybucket/load/emp.csv.bz2'
  file_format = (type=CSV COMPRESSION = BZ2)
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
+--------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file               | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|--------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| emp.csv.bz2        | LOADED |           5 |           5 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
+--------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 1.902s

RECORD_DELIMITER – Changes delimiter for record

If your file uses a record separator other than ‘\n’, use RECORD_DELIMITER. For example, the query below uses ‘\n\r’:


COPY INTO EMP from 's3://mybucket/load/emp.csv.gz'  
  file_format = (type=CSV RECORD_DELIMITER='\n\r')
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
1 Row(s) produced. Time Elapsed: 1.300s

FIELD_DELIMITER – Specify field delimiter of loading file

Use FIELD_DELIMITER to change the default field delimiter. By default it is the ‘,’ character, but you can change it to any character; for example, let’s see how to use the pipe delimiter.


COPY INTO EMP from 's3://mybucket/load/emp.csv.gz'
  file_format = (type=CSV FIELD_DELIMITER='|')
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
  
1 Row(s) produced. Time Elapsed: 1.300s

SKIP_HEADER – Skip header rows of the loading file

Use the SKIP_HEADER option to skip header lines at the top of the file you are loading. By default it is set to 0, meaning no lines are skipped; set it to 1 if the first line of your file contains column names.
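
For example, a minimal sketch that skips one header line, assuming the same emp.csv.gz file has a header row:


COPY INTO EMP from 's3://mybucket/load/emp.csv.gz'
  file_format = (type=CSV SKIP_HEADER=1)
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');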

DATE_FORMAT – Specify input date format

By default, date columns are expected in the ‘YYYY-MM-DD‘ format. If your file uses a different date format, specify the input format with the DATE_FORMAT option.


COPY INTO EMP from 's3://mybucket/load/emp.csv.gz'
  file_format = (type=CSV DATE_FORMAT='MM-DD-YYYY')
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
1 Row(s) produced. Time Elapsed: 1.300s

TIME_FORMAT – Specify Input time format

By default, time columns are expected in the ‘HH24:MI:SS‘ format. If your file uses a different time format, use the TIME_FORMAT option to specify the input format.


COPY INTO EMP from 's3://mybucket/load/emp.csv.gz'
  file_format = (type=CSV TIME_FORMAT='HH24:MI:SS')
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
  
1 Row(s) produced. Time Elapsed: 1.300s

TIMESTAMP_FORMAT – Specify input timestamp format

By default, timestamp columns are expected in the ‘YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM‘ format. If your file uses a different timestamp format, use the TIMESTAMP_FORMAT option to specify the input format.


COPY INTO EMP from 's3://mybucket/load/emp.csv.gz'
  file_format = (type=CSV TIMESTAMP_FORMAT='MM-DD-YYYY HH24:MI:SS.FF3 TZHTZM')
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
  
1 Row(s) produced. Time Elapsed: 1.300s

Load Parquet file from Amazon S3

First, upload the Parquet file either to Amazon S3 using AWS utilities or to a Snowflake internal stage. The example below reads the file from the EMP table stage (@%EMP); a sketch of loading the same file from the S3 bucket follows it. Once the file is staged, use the COPY INTO <tablename> command to load the Parquet file into the Snowflake database table.

First, create a table EMP with one column of type VARIANT; Parquet raw data can only be loaded into a single VARIANT column.


create or replace table EMP(PARQUET_RAW VARIANT);

Run the command below to load the file into the table.


COPY INTO EMP from (select $1 from @%EMP/data1_0_0_0.snappy.parquet)
 file_format = (type=PARQUET COMPRESSION=SNAPPY);
+----------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file                       | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|----------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| emp.snappy.parquet         | LOADED |           5 |           5 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
+----------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 1.472s
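
The command above reads the file from the EMP table stage. To load the same Parquet file from the S3 bucket instead, you can reference the named external stage created earlier; the sketch below assumes the stage my_s3_stage and the file name emp.snappy.parquet.


-- Load the Parquet file from the S3 bucket via the named external stage (names are examples)
COPY INTO EMP from (select $1 from @my_s3_stage/emp.snappy.parquet)
  file_format = (type=PARQUET COMPRESSION=SNAPPY);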

Every record from the emp.snappy.parquet file is loaded as JSON into the PARQUET_RAW column of the EMP table.


SELECT * FROM EMP;
+------------------------------------+
| PARQUET_RAW                        |
|------------------------------------|
| {                                  |
|   "_COL_0": "James",               |
|   "_COL_1": "Smith",               |
|   "_COL_2": 1.000000000000000e+04, |
|   "_COL_3": 10,                    |
|   "_COL_4": "M"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Michael",             |
|   "_COL_1": "Jones",               |
|   "_COL_2": 2.000000000000000e+04, |
|   "_COL_3": 10,                    |
|   "_COL_4": "M"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Robert",              |
|   "_COL_1": "Williams",            |
|   "_COL_2": 3.000400000000000e+03, |
|   "_COL_3": 10,                    |
|   "_COL_4": "M"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Maria",               |
|   "_COL_1": "Jones",               |
|   "_COL_2": 2.347650000000000e+04, |
|   "_COL_3": 20,                    |
|   "_COL_4": "F"                    |
| }                                  |
| {                                  |
|   "_COL_0": "Jen",                 |
|   "_COL_1": "Mary",                |
|   "_COL_2": 1.234450000000000e+05, |
|   "_COL_3": 20,                    |
|   "_COL_4": "F"                    |
| }                                  |
+------------------------------------+
5 Row(s) produced. Time Elapsed: 0.667s
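
Since each Parquet record lands as a single VARIANT value, you can pull out individual fields with path notation and cast them to the desired types. A minimal sketch based on the _COL_n keys shown in the output above:


-- Extract and cast fields from the VARIANT column (target types are assumptions)
SELECT PARQUET_RAW:"_COL_0"::STRING       AS FNAME,
       PARQUET_RAW:"_COL_1"::STRING       AS LNAME,
       PARQUET_RAW:"_COL_2"::NUMBER(10,2) AS SALARY,
       PARQUET_RAW:"_COL_3"::INT          AS DEPARTMENT,
       PARQUET_RAW:"_COL_4"::STRING       AS GENDER
FROM EMP;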

Similarly, you can also load JSON/Avro/CSV files from Amazon S3 into a Snowflake table.
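
For instance, an Avro file loads the same way into a single VARIANT column. A minimal sketch, assuming a hypothetical emp.avro file in the same bucket:


-- Avro, like Parquet and JSON, is semi-structured and loads into a VARIANT column
-- (table name and file name below are examples)
create or replace table EMP_AVRO(AVRO_RAW VARIANT);

COPY INTO EMP_AVRO from 's3://mybucket/load/emp.avro'
  file_format = (type=AVRO)
  credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');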

Conclusion

To load a CSV/Avro/Parquet file from an Amazon S3 bucket into a Snowflake table, use the COPY INTO <tablename> SQL command. You can run this SQL either from SnowSQL or from the Snowflake web console. With COPY INTO you can also change the compression, specify date and time formats, and set many more options.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium