
AWS Glue offers several PySpark extensions that help simplify the ETL process. These extensions facilitate converting, handling, and modifying data during ETL jobs. This tutorial will provide an overview of these extensions and demonstrate how to use them in your AWS Glue ETL scripts.


Prerequisites:

  1. An AWS account: You will need an AWS account to create and configure your AWS Glue resources.
  2. AWS CLI: The AWS Command Line Interface is a unified tool to manage your AWS services.
  3. Basic knowledge of AWS Glue, PySpark, and SQL.

1. AWS Glue PySpark Extensions:

 1.1 Accessing parameters using getResolvedOptions:

The getResolvedOptions method allows you to access the parameters passed when you start a job. This feature makes it easy to pass dynamic values to your job script.


import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'MY_PARAMETER'])
my_parameter = args['MY_PARAMETER']
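For context, here is one way to supply that parameter when starting the job. This is a minimal sketch using boto3 rather than the Glue console; the job name my-glue-job is a placeholder, and note that runtime argument keys carry a -- prefix.


# Sketch: start a Glue job with a custom argument via boto3.
# "my-glue-job" is a hypothetical job name; argument keys need a "--" prefix.
import boto3

glue = boto3.client("glue")
response = glue.start_job_run(
    JobName = "my-glue-job",
    Arguments = {"--MY_PARAMETER": "some-value"})
print(response["JobRunId"])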

2. PySpark Extension Types:

AWS Glue provides a few extension types to handle cases that standard Spark types don't cover. These types are DynamicFrame, DynamicRecord, TransformationContext, and DataSink. A DynamicRecord represents one self-describing record inside a DynamicFrame.
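To make DynamicRecord concrete: the function you pass to Glue's Map transform receives one DynamicRecord at a time, which you can read and modify much like a Python dictionary. A minimal sketch; the field names below are hypothetical.


# Sketch: Map calls the function once per DynamicRecord; fields are accessed
# dict-style. "first_name", "last_name", and "full_name" are made-up names.
from awsglue.transforms import Map

def add_full_name(record):
    record["full_name"] = record["first_name"] + " " + record["last_name"]
    return record

mapped_frame = Map.apply(frame = dynamic_frame, f = add_full_name)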

3. DynamicFrame class:

DynamicFrame is AWS Glue's extension of the PySpark DataFrame. Because each record is self-describing, a DynamicFrame does not require a schema up front, which makes it well suited to semi-structured data and sources with incomplete schema information. You can convert between DataFrame and DynamicFrame as needed.


# Create a DynamicFrame from a catalog table
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytable")

# Convert a DynamicFrame to a DataFrame
data_frame = dynamic_frame.toDF()

# Convert a DataFrame back to a DynamicFrame
from awsglue.dynamicframe import DynamicFrame
dynamic_frame = DynamicFrame.fromDF(data_frame, glueContext, "dynamic_frame")
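One DynamicFrame-specific operation worth knowing is resolveChoice, which resolves columns that arrive with ambiguous or mixed types. A minimal sketch; the column name price is hypothetical.


# Sketch: cast a mixed-type column to a single type. "price" is a made-up column.
resolved_frame = dynamic_frame.resolveChoice(specs = [("price", "cast:double")])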

4. DynamicFrameCollection class:

The DynamicFrameCollection class is a dictionary of DynamicFrame objects. It’s useful when your ETL process involves branching or when you need to apply the same transformation to multiple DynamicFrames.


# Creating a DynamicFrameCollection from named DynamicFrames
from awsglue.dynamicframe import DynamicFrameCollection
dynamic_frame_collection = DynamicFrameCollection({"frame1": dynamic_frame1, "frame2": dynamic_frame2}, glueContext)
frame1 = dynamic_frame_collection.select("frame1")  # look up one frame by key

5. DynamicFrameWriter class:

The DynamicFrameWriter class, exposed on the GlueContext as write_dynamic_frame, allows you to write DynamicFrame objects out to a variety of data targets.


# Writing a DynamicFrame to an S3 bucket in CSV format
glueContext.write_dynamic_frame.from_options(frame = dynamic_frame, connection_type = "s3", connection_options = {"path": "s3://mybucket/output"}, format = "csv")
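If the output should be partitioned, you can pass partition columns through connection_options. A sketch; the year and month columns are hypothetical.


# Sketch: write partitioned CSV output. "year" and "month" are made-up columns.
glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame,
    connection_type = "s3",
    connection_options = {"path": "s3://mybucket/output", "partitionKeys": ["year", "month"]},
    format = "csv")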

6. DynamicFrameReader class:

The DynamicFrameReader class, exposed on the GlueContext as create_dynamic_frame, provides the ability to create DynamicFrame objects from various data sources.


# Reading a DynamicFrame from an S3 bucket
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type = "s3", connection_options = {"paths": ["s3://mybucket/input"]}, format = "json")
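Readers also accept format_options for format-specific settings. For example, reading CSV files that carry a header row might look like the sketch below; the bucket path is a placeholder.


# Sketch: read CSV files with a header row. The S3 path is a placeholder.
csv_frame = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://mybucket/csv-input"]},
    format = "csv",
    format_options = {"withHeader": True})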

7. GlueContext class:

The GlueContext class wraps PySpark’s SparkContext. It’s the entry point of any Glue job and simplifies reading, transforming, and writing data. It also provides access to the AWS Glue Data Catalog, a Hive-compatible metastore.


# Creating a GlueContext
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
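In a full job script, the GlueContext is usually created alongside a Job object whose init and commit calls bracket the ETL logic. This is the standard Glue job skeleton:


# Typical Glue job skeleton: create the contexts, then place the ETL logic
# between job.init() and job.commit().
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... ETL logic goes here ...

job.commit()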

8. AWS Glue Data Catalog:

You can read tables registered in the AWS Glue Data Catalog from PySpark through the GlueContext. Here is an example of how to do it:


# Reading a table from Glue Catalog
glue_catalog_table = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytable")
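For partitioned catalog tables, a pushdown predicate limits the read to matching partitions. A sketch; year is a hypothetical partition column.


# Sketch: read only matching partitions. "year" is a made-up partition column.
recent_frame = glueContext.create_dynamic_frame.from_catalog(
    database = "mydatabase",
    table_name = "mytable",
    push_down_predicate = "year >= '2023'")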

9. Glue Transformations:

AWS Glue provides several transformations that can be used with DynamicFrames:

  • ApplyMapping: Transforms the columns in the DynamicFrame to match the specified mapping.

from awsglue.transforms import ApplyMapping, Filter, SelectFields

mapped_frame = ApplyMapping.apply(frame = dynamic_frame, mappings = [("source_column", "source_type", "target_column", "target_type")])

  • Filter: Filters the DynamicFrame to include only the rows that satisfy the specified predicate.

filtered_frame = Filter.apply(frame = dynamic_frame, f = lambda x: x["column_name"] == "value")

  • SelectFields: Transforms the DynamicFrame to include only the specified fields.

selected_frame = SelectFields.apply(frame = dynamic_frame, paths = ["column1", "column2"])
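Putting these together, a small pipeline might chain the three transforms. A hedged sketch; all column names below are made up.


# Sketch: chain ApplyMapping, Filter, and SelectFields. Column names are hypothetical.
from awsglue.transforms import ApplyMapping, Filter, SelectFields

mapped = ApplyMapping.apply(frame = dynamic_frame,
                            mappings = [("id", "string", "id", "long"),
                                        ("name", "string", "name", "string"),
                                        ("country", "string", "country", "string")])
filtered = Filter.apply(frame = mapped, f = lambda row: row["country"] == "US")
result = SelectFields.apply(frame = filtered, paths = ["id", "name"])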

10. Glue DataSink:

AWS Glue provides the DataSink class, which allows you to write DynamicFrames to a variety of external targets; it backs the write_dynamic_frame methods shown earlier:


glueContext.write_dynamic_frame.from_options(frame = dynamic_frame, connection_type = "s3", connection_options = {"path": "s3://mybucket/output"}, format = "json")
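You can also obtain a DataSink directly with glueContext.getSink, configure its format, and write the frame yourself. A minimal sketch of that pattern:


# Sketch: get a DataSink directly, set the output format, then write the frame.
sink = glueContext.getSink(connection_type = "s3", path = "s3://mybucket/output")
sink.setFormat("json")
sink.writeFrame(dynamic_frame)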

These are just a handful of the AWS Glue PySpark extensions available. They’re designed to simplify ETL scripting and aid in effectively transforming and enriching data in AWS Glue. For the most up-to-date information and additional functionality, always refer to the AWS Glue PySpark documentation.


Suman Debnath is a Principal Developer Advocate (Data Engineering) at Amazon Web Services, focusing primarily on data engineering, data analysis, and machine learning. He is passionate about large-scale distributed systems and is an avid fan of Python. His background is in storage performance and tool development, where he has built various performance benchmarking and monitoring tools.