Spark Merge Two DataFrames with Different Columns or Schema

In Spark or PySpark, let's see how to merge/union two DataFrames with a different number of columns (different schema). In Spark 3.1, you can easily achieve this using the unionByName() transformation by passing allowMissingColumns with the value true. In older versions, this property is not available //Scala merged_df = df1.unionByName(df2, true)…
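
A minimal PySpark sketch of the same idea, assuming Spark 3.1+ (the DataFrames and column names below are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-example").getOrCreate()

    # Two DataFrames with different columns (names are illustrative)
    df1 = spark.createDataFrame([(1, "James")], ["id", "name"])
    df2 = spark.createDataFrame([(2, 40)], ["id", "age"])

    # Spark 3.1+: columns missing on either side are filled with nulls
    merged_df = df1.unionByName(df2, allowMissingColumns=True)
    merged_df.show()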

Continue Reading Spark Merge Two DataFrames with Different Columns or Schema

Spark Context ‘sc’ Not Defined?

Problem: When I try to use the SparkContext object 'sc' in a PySpark program, I get Spark Context 'sc' Not Defined, but sc works in the Spark/PySpark shell. Solution: Spark Context 'sc' Not Defined? In Spark/PySpark, 'sc' is a SparkContext object that's created upfront by default in the spark-shell/pyspark shell, this…
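
In a standalone script, a common fix is to create the SparkSession yourself and take 'sc' from it; a minimal sketch:

    from pyspark.sql import SparkSession

    # Outside the shell, 'sc' is not pre-created, so build a SparkSession
    # and get the SparkContext from it.
    spark = SparkSession.builder.appName("sc-example").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([1, 2, 3])
    print(rdd.count())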

Continue Reading Spark Context ‘sc’ Not Defined?

NameError: Name ‘Spark’ is not Defined

Problem: When I use spark.createDataFrame() I get NameError: Name 'Spark' is not Defined; if I use the same in the Spark or PySpark shell, it works without issue. Solution: NameError: Name 'Spark' is not Defined in PySpark Since Spark 2.0, 'spark' is a SparkSession object that is by default…
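
A minimal sketch of the usual fix, creating the 'spark' SparkSession explicitly in your script (app name and data are illustrative):

    from pyspark.sql import SparkSession

    # 'spark' is pre-created only in the shell; create it yourself in a script.
    spark = SparkSession.builder.appName("spark-example").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
    df.show()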

Continue Reading NameError: Name ‘Spark’ is not Defined

Pyspark: Exception: Java gateway process exited before sending the driver its port number

Problem: While running a PySpark application through spark-submit, Spyder, or even from the PySpark shell, I get Pyspark: Exception: Java gateway process exited before sending the driver its port number. Solution: Pyspark: Exception: Java gateway process exited before sending the driver its port number In order to run PySpark (Spark with…
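
One common cause is a missing or incorrect JAVA_HOME; a sketch of that particular fix only (the JDK path below is an assumption, adjust it to your installation):

    import os
    from pyspark.sql import SparkSession

    # Hypothetical JDK location -- point JAVA_HOME at a Java version your Spark supports.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

    spark = SparkSession.builder.appName("gateway-check").getOrCreate()
    print(spark.version)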

Continue Reading Pyspark: Exception: Java gateway process exited before sending the driver its port number

PySpark “ImportError: No module named py4j.java_gateway” Error

Problem: When I was running PySpark commands after successfully installing PySpark on Linux, I got the error "ImportError: No module named py4j.java_gateway". I spent some time understanding what the Py4J module is and how to resolve the issue, and I would like to share it here. ImportError: No module named py4j.java_gateway…
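
One common workaround is to put Spark's bundled py4j on sys.path before importing pyspark; a sketch assuming SPARK_HOME is set, with an illustrative py4j zip name (check the exact file under $SPARK_HOME/python/lib):

    import os
    import sys

    # Assumed locations -- adjust SPARK_HOME and the py4j zip name to your install.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.9-src.zip"))

    from pyspark.sql import SparkSession  # should now import without the py4j error
    spark = SparkSession.builder.getOrCreate()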

Continue Reading PySpark “ImportError: No module named py4j.java_gateway” Error

How to Import PySpark in Python Script

Let's see how to import the PySpark library in a Python script or how to use it in the shell. Sometimes, even after successfully installing Spark on Linux/Windows/Mac, you may have issues like "No module named pyspark" while importing PySpark libraries in Python; below I have explained some possible ways to resolve…
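
One possible approach uses the third-party findspark package (pip install findspark), assuming SPARK_HOME points at your Spark installation; a minimal sketch:

    # pip install findspark
    import findspark
    findspark.init()  # locates Spark via SPARK_HOME and adds pyspark to sys.path

    import pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("import-check").getOrCreate()
    print(pyspark.__version__)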

Continue Reading How to Import PySpark in Python Script

PySpark Replace Column Values in DataFrame

You can replace column values of a PySpark DataFrame by using the SQL string functions regexp_replace(), translate(), and overlay() with Python examples. In this article, I will cover examples of how to replace part of a string with another string, replace all columns, change values conditionally, replace values from a Python dictionary, replace…
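
A short sketch of the three functions mentioned, assuming Spark 3.0+ for overlay() (the DataFrame, column name, and values are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace, translate, overlay, lit

    spark = SparkSession.builder.appName("replace-example").getOrCreate()
    df = spark.createDataFrame([("125 ABC Street",)], ["address"])

    df = df.withColumn("address", regexp_replace("address", "Street", "St")) \
           .withColumn("address", translate("address", "125", "340")) \
           .withColumn("address", overlay("address", lit("999"), 1, 3))
    df.show(truncate=False)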

Continue Reading PySpark Replace Column Values in DataFrame

PySpark Retrieve DataType & Column Names of DataFrame

You can find all column names & data types (DataType) of PySpark DataFrame by using df.dtypes and df.schema and you can also retrieve the data type of a specific column name using df.schema["name"].dataType, let's see all these with PySpark(Python) examples. 1. PySpark Retrieve All Column DataType and Names By using…

Continue Reading PySpark Retrieve DataType & Column Names of DataFrame

Spark Get DataType & Column Names of DataFrame

In Spark, you can get all DataFrame column names and types (DataType) by using df.dtypes and df.schema, where df is a DataFrame object. Let's see some examples of how to get the data type and column name of all columns, and the data type of a selected column by name, using Scala…

Continue Reading Spark Get DataType & Column Names of DataFrame

Spark Get the Current SparkContext Settings

In Spark/PySpark you can get the currently active SparkContext and its configuration settings by accessing spark.sparkContext.getConf.getAll(), where spark is a SparkSession object and getAll() returns Array[(String, String)]; let's see this with examples using Spark with Scala & PySpark (Spark with Python). Spark Get SparkContext Configurations In the below Spark example,…
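
A short PySpark sketch; note that in Python getConf() is called as a method and getAll() returns a list of (key, value) pairs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conf-example").getOrCreate()

    # Print every configuration currently set on the active SparkContext
    for key, value in spark.sparkContext.getConf().getAll():
        print(key, "=", value)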

Continue Reading Spark Get the Current SparkContext Settings