Spark with AWS Glue - Getting Started with Data Processing and Analytics

Apache Spark and AWS Glue are powerful tools for data processing and analytics. This tutorial aims to provide a comprehensive guide for newcomers to AWS on how to use Spark with AWS Glue. We will cover the end-to-end configuration process, including setting up AWS services, creating a Glue job, and running Spark code using Python/PySpark. We will also work with sample dummy data to demonstrate the data processing capabilities of Spark and AWS Glue.

Prerequisites:

To follow this tutorial, you will need the following:

An AWS account: Sign up for an AWS account at https://aws.amazon.com/.
AWS Glue permissions: Ensure that your AWS account has the necessary permissions to create and run AWS Glue jobs. If you are using an existing account, make sure you have appropriate permissions or consult your AWS account administrator.
Sample dataset: You can use any dataset of your choice, but for this tutorial, we will use this SalesData.csv (here is how the data looks like):


Date,Salesperson,Lead Name,Segment,Region,Target Close,Forecasted Monthly Revenue,Opportunity Stage,Weighted Revenue,Closed Opportunity,Active Opportunity,Latest Status Entry
1/2/2011,Gerri Hinds,US_SMB_1317,SMB,US,2/2/2011,103699,Lead,10370,FALSE,FALSE,FALSE
1/3/2011,David King,EMEA_Enterprise_405,Enterprise,EMEA,4/9/2011,393841,Lead,39384,FALSE,FALSE,FALSE………
1/6/2011,James Swanger,US_Enterprise_1466,Enterprise,US,5/4/2011,326384,Lead,32638,FALSE,FALSE,FALSE
1/11/2011,Gerri Hinds,US_SMB_2291,SMB,US,2/14/2011,76316,Lead,7632,FALSE,FALSE,FALSE

Spark with AWS Glue step-by-step Tutorial

Step 1: Set up an S3 Bucket:

Create an S3 bucket to store your sample data and Glue job artifacts. Navigate to the S3 service in the AWS Management Console and click on “Create bucket.” Provide a unique name and select your desired region.
Once the bucket is created, create two sub-folders named:
1. cleaned_data
2. raw_data

Step 2: Prepare the Sample Data:

Upload the SalesData.csv file to your S3 bucket by clicking on your bucket name under the raw_data sub-folder, then “Upload.”

Step 3: Create a Glue Job:

In the AWS Glue console, select “ETL Jobs” in the left-hand menu, then select “Spark script editor” and click on “Create”

Provide a name for the job on the top.

Write the Spark (PySpark) code for your data processing tasks. In this Spark application we are performing the following operation:

Create a Spark session
Read the data from the SalesData.csv file which is stored in our S3 bucket located under the sub-folder raw_data and perform the following transformations:
1. Conversion of Date: The “Date” column is converted from string type to date type using the to_date function. This enables easier date-based analysis and operations.
2. Total Sales by Salesperson: The code calculates the total sales by salesperson by grouping the DataFrame by “Salesperson” and calculating the sum of the “Forecasted_Monthly_Revenue” column.
3. Average Revenue by Opportunity Stage: The code calculates the average revenue per opportunity stage by grouping the DataFrame by “Opportunity_Stage” and calculating the average of the “Weighted_Revenue” column.
4. Filtering Closed Opportunities: The code filters the DataFrame to include only closed opportunities by selecting rows where the “Closed_Opportunity” column is set to True.
5. Selection of Specific Columns: The code selects specific columns from the DataFrame to include in the cleaned dataset. Adjust the column names as needed based on your requirements.
Once everything is done, we are saving all the results in our S3 bucket located under the sub-folder cleaned_data

Step 4: Example of Spark with AWS Glue

Remember to replace <YOUR_BUCKET_LOCATION_OF_RAW_DATA> and <YOUR_BUCKET_LOCATION_OF_CLEANED_DATA> with your S3 bucket locations. Feel free to modify the transformations or add additional ones based on your specific needs.


# Examples of Spark with AWS Glue

# Imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DateType

S3_INPUT_DATA = 's3://mybuckect-spark-by-example/raw_data/'
S3_OUTPUT_DATA = 's3://mybuckect-spark-by-example/cleaned_data/'

def main():

# Creating the SparkSession
spark = SparkSession.builder.appName("My Demo ETL App").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

# Spark DataFrame (Raw) - Transformation
df = spark.read.option("Header", True).option("InferSchema", True).csv(S3_INPUT_DATA)

# Define a dictionary of replacements to replace spaces in column names with underscores
replacements = {c:c.replace(' ','_') for c in df.columns if ' ' in c}

# Select columns from the dataframe using the replacements dictionary to rename columns with spaces
df = df.select([F.col(c).alias(replacements.get(c, c)) for c in df.columns])

# Convert the "Date" column from string type to date type
df = df.withColumn("Date", F.to_date(F.col("Date"), "M/d/yyyy"))

# Calculate the total sales by salesperson
sales_by_salesperson = df.groupBy("Salesperson").agg(F.sum("Forecasted_Monthly_Revenue").alias("Total_Sales"))

# Calculate the average revenue per opportunity stage
avg_revenue_by_stage = df.groupBy("Opportunity_Stage").agg(F.avg("Weighted_Revenue").alias("Avg_Revenue"))

# Filter the dataset to include only closed opportunities
closed_opportunities = df.filter(F.col("Closed_Opportunity") == True)

# Select specific columns for the cleaned dataset
cleaned_df = df.select("Date", "Salesperson", "Segment", "Region", "Opportunity_Stage", "Weighted_Revenue")

# Print the total number of records in the cleaned dataset
print(f"Total no. of records in the cleaned dataset is: {cleaned_df.count()}")

try:
# Save the DataFrames under different folders within the S3_OUTPUT_DATA bucket
sales_by_salesperson.write.mode('overwrite').parquet(S3_OUTPUT_DATA + "/sales_by_salesperson")
avg_revenue_by_stage.write.mode('overwrite').parquet(S3_OUTPUT_DATA + "/avg_revenue_by_stage")
closed_opportunities.write.mode('overwrite').parquet(S3_OUTPUT_DATA + "/closed_opportunities")
cleaned_df.write.mode('overwrite').parquet(S3_OUTPUT_DATA + "/cleaned_df")
print('The cleaned data is uploaded')
except:
print('Something went wrong, please check the logs :P')

if __name__ == '__main__':
main()

Save the Spark code and click “Run”

Monitor the job run in the AWS Glue console. You can check the job logs and progress to ensure the data processing is successful.

Step 5: Verify the Results:

Navigate to the S3 bucket where you specified the job output path.

Congratulations! You have successfully processed data using Spark with AWS Glue.

Conclusion

This tutorial provided a comprehensive walkthrough for getting started with Spark and AWS Glue. You learned how to get started with AWS Glue, load data, define a Glue job, perform transformations, and finally write the processed data to S3. We built this application using PySpark and executed the job for data processing. By following these steps, you can leverage the power of Spark and AWS Glue to efficiently process and analyze large datasets. Feel free to explore more advanced features and capabilities of Spark and AWS Glue to enhance your data engineering and analytics workflows.