
Spark Tutorials with Scala Examples

Spark regexp_replace() – Replace String Value

Spark org.apache.spark.sql.functions.regexp_replace is a string function that is used to replace part of a string (substring) value with another string on a DataFrame column by using a regular expression (regex). This function returns an org.apache.spark.sql.Column type after replacing the string value. In this article, I will explain the syntax and usage of regexp_replace()…
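A minimal sketch of how regexp_replace() might be used, assuming an active SparkSession named spark; the DataFrame and column names here are illustrative:

import org.apache.spark.sql.functions.{col, regexp_replace}
import spark.implicits._

// Hypothetical data: street addresses with an abbreviation to expand
val df = Seq((1, "14851 Jeffrey Rd"), (2, "43421 Margarita St")).toDF("id", "address")

// regexp_replace(column, pattern, replacement) returns a Column
df.withColumn("address", regexp_replace(col("address"), "Rd$", "Road")).show(false)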

Continue Reading Spark regexp_replace() – Replace String Value

Spark SQL like() Using Wildcard Example

In Spark & PySpark, the like() function is similar to the SQL LIKE operator and is used to match rows based on wildcard characters (percentage %, underscore _) to filter the rows. You can use this function to filter the DataFrame rows by single or multiple conditions, to derive a new column, use it on…
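A short sketch of both wildcards, assuming an existing SparkSession spark and an illustrative name column:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq("James", "Michael", "Jen").toDF("name")

// % matches any sequence of characters, _ matches exactly one character
df.filter(col("name").like("J%")).show()    // names starting with J
df.filter(col("name").like("J_n")).show()   // J, any one character, then n (e.g. Jen)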

Continue Reading Spark SQL like() Using Wildcard Example

Spark isin() & IS NOT IN Operator Example

Question: In Spark & PySpark, how do I use the isin() & IS NOT IN operators, which are similar to the IN & NOT IN operators available in SQL that check whether a DataFrame column value exists in a list of string values? When I tried to use isin(list_param) from the Column class, I am…
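A minimal sketch of both directions, with an illustrative state column; note that isin() takes varargs, so a Scala sequence is expanded with : _*:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(("James", "CA"), ("Ana", "NY"), ("Robert", "TX")).toDF("name", "state")
val states = Seq("CA", "NY")

df.filter(col("state").isin(states: _*)).show()    // IN
df.filter(!col("state").isin(states: _*)).show()   // NOT IN (negate with !)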

Continue Reading Spark isin() & IS NOT IN Operator Example

Spark – Get Size/Length of Array & Map Column

Question: In Spark & PySpark, how do I get the size/length of an ArrayType (array) column, and how do I find the size of a MapType (map/dict) column in a DataFrame? Could you also please explain with an example how to filter by array/map size? PySpark Example: How to Get Size of ArrayType, MapType…
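A rough example using the size() function, which works on both array and map columns; the column names below are assumptions:

import org.apache.spark.sql.functions.{col, size}
import spark.implicits._

val df = Seq(
  ("James", Seq("Java", "Scala"), Map("hair" -> "black")),
  ("Ana",   Seq("Python"),        Map("hair" -> "brown"))
).toDF("name", "languages", "properties")

df.withColumn("langCount", size(col("languages")))     // size of the array
  .withColumn("propCount", size(col("properties")))    // size of the map
  .filter(size(col("languages")) > 1)                  // filter by array size
  .show(false)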

Continue Reading Spark – Get Size/Length of Array & Map Column

Spark Using Length/Size Of a DataFrame Column

Question: In Spark & PySpark, is there a function to filter DataFrame rows by the length or size of a string column (including trailing spaces), and how do I create a DataFrame column with the length of another column? Solution: Filter DataFrame By Length of a Column Spark SQL…
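A minimal sketch using the length() function, assuming a hypothetical name column; length() does count trailing spaces:

import org.apache.spark.sql.functions.{col, length}
import spark.implicits._

val df = Seq("James ", "Ana", "Robert").toDF("name")

df.withColumn("nameLength", length(col("name")))   // derive a length column
  .filter(length(col("name")) > 5)                 // filter rows by length
  .show(false)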

Continue Reading Spark Using Length/Size Of a DataFrame Column

Spark rlike() Working with Regex Matching Examples

Similar to the SQL regexp_like() function, Spark & PySpark also support regex (regular expression) matching through the rlike() function, which is available in the org.apache.spark.sql.Column class. Use a regex expression with rlike() to filter rows case-insensitively (ignore case), to filter rows that contain only numeric digits, and more examples…
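Two rough rlike() sketches, one using the (?i) inline flag for a case-insensitive match and one anchoring the pattern to keep digit-only values; the column name is illustrative:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq("Spark SQL", "PYSPARK", "12345").toDF("value")

df.filter(col("value").rlike("(?i)spark")).show()   // ignore case
df.filter(col("value").rlike("^[0-9]+$")).show()    // only numeric/digits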

Continue Reading Spark rlike() Working with Regex Matching Examples

Spark Check String Column Has Numeric Values

Problem: In Spark, I have a string column on a DataFrame and want to check if this string column has all or any numeric values, wondering if there is any function similar to the isNumeric function in other tools/languages. Solution: Check String Column Has All Numeric Values Unfortunately, Spark doesn't have…
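One common workaround, sketched below, is to cast the string to a numeric type and test for null, since a failed cast yields null; the value column here is an assumption:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq("123", "45a", "678").toDF("value")

// cast returns null when the string is not a valid number
df.withColumn("isNumeric", col("value").cast("int").isNotNull).show()

// Alternative: a regex check with rlike
df.filter(col("value").rlike("^[0-9]+$")).show()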

Continue Reading Spark Check String Column Has Numeric Values

Spark Check Column Data Type is Integer or String

When you are dealing with large datasets with different data types (DataType) in Spark, you often need to check the data type of a DataFrame column, and sometimes you need to get all integer or string type columns to perform certain operations. Related: Different Ways to Get All Column Names…
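A short sketch of inspecting a DataFrame schema for column types; the DataFrame df and the column name "name" are assumed for illustration:

import org.apache.spark.sql.types.{IntegerType, StringType}

// All string and integer column names, read from the schema
val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
val intCols    = df.schema.fields.filter(_.dataType == IntegerType).map(_.name)

// Check a single column's type
val nameIsString = df.schema("name").dataType == StringType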

Continue Reading Spark Check Column Data Type is Integer or String

Spark Trim String Column on DataFrame

Problem: In Spark or PySpark, how do I remove white spaces (blanks) in a DataFrame string column, similar to trim() in SQL that removes left and right white spaces? Solution: Spark Trim String Column on DataFrame (Left & Right) In Spark & PySpark…
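A minimal sketch of the three trim variants, assuming an illustrative name column:

import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}
import spark.implicits._

val df = Seq("  James  ", " Ana ").toDF("name")

df.withColumn("trimmed", trim(col("name")))     // both sides
  .withColumn("ltrimmed", ltrim(col("name")))   // left side only
  .withColumn("rtrimmed", rtrim(col("name")))   // right side only
  .show(false)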

Continue Reading Spark Trim String Column on DataFrame

Spark Merge Two DataFrames with Different Columns or Schema

In Spark or PySpark, let's see how to merge/union two DataFrames with a different number of columns (different schema). In Spark 3.1, you can easily achieve this using the unionByName() transformation by passing allowMissingColumns with the value true. In older versions, this property is not available. //Scala val merged_df = df1.unionByName(df2, true)…
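A hedged sketch of both paths, assuming two illustrative DataFrames with partially overlapping columns:

import org.apache.spark.sql.functions.lit
import spark.implicits._

val df1 = Seq((1, "James")).toDF("id", "name")
val df2 = Seq((2, "CA")).toDF("id", "state")

// Spark 3.1+: missing columns are filled with null
val merged = df1.unionByName(df2, allowMissingColumns = true)
merged.show(false)

// Older versions: add the missing columns as null literals first
val mergedOld = df1.withColumn("state", lit(null).cast("string"))
  .unionByName(df2.withColumn("name", lit(null).cast("string")))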

Continue Reading Spark Merge Two DataFrames with Different Columns or Schema