PySpark SQL Functions - Spark By {Examples}

| *** Please Subscribe for Ad Free & Premium Content ***

Post author:Naveen Nelamali
Post category:PySpark
Post last modified:May 16, 2024
Reading time:42 mins read

You are currently viewing PySpark SQL Functions

PySpark SQL provides several built-in standard functions pyspark.sql.functions to work with DataFrame and SQL queries. All these PySpark Functions return pyspark.sql.Column type.

Advertisements

PySpark SQL Function Introduction

PySpark SQL Functions provide powerful functions for efficiently performing various transformations and computations on DataFrame columns within the PySpark environment. Leveraging these built-in functions offers several advantages. First, they are optimized for distributed processing, enabling seamless execution across large-scale datasets distributed across a cluster. Additionally, PySpark SQL Functions are well-documented, extensively tested, and maintained, ensuring reliability and consistency in data operations.

Sometimes, you might get tempted to create user-defined functions (UDFs) to perform certain operation, but that’s not good. The reason is that utilizing PySpark SQL Functions over user-defined functions (UDFs) is advantageous due to their native integration with PySpark’s underlying execution engine.

Unlike UDFs, which involve serialization and deserialization overheads, PySpark SQL Functions are optimized for distributed computation and can be pushed down to the underlying data sources, resulting in improved performance and scalability. Moreover, PySpark SQL Functions adhere to Spark’s Catalyst optimizer rules, enabling query optimization and efficient execution plans, further enhancing performance and resource utilization.

Importing SQL Functions in PySpark

To use PySpark SQL Functions, simply import them from the pyspark.sql.functions module and apply them directly to DataFrame columns within transformation operations. These functions offer a wide range of functionalities such as mathematical operations, string manipulations, date/time conversions, and aggregation functions.


// Import a specific function
// This imports when and col functions
from pyspark.sql.functions import when, col

You can import SQL functions with aliases to simplify their usage in your code. Here’s how you can do it:


# Import functions as alias
from pyspark.sql import functions as F

# Now you can use functions with 'F' alias
dataframe.select(F.col("columnName"))  # Example of using col function with alias 'F'

Here, F is the alias for pyspark.sql.functions. You can then use F followed by the function name to call SQL functions in your PySpark code, which can make your code more concise and readable.

PySpark grouped SQL functions into the categories below. Click on a category for a list of functions, syntax, and descriptions.

PySpark SQL Functions

String SQL Functions
Date & Time SQL Functions
Collection SQL Functions
Math SQL Functions
Aggregate SQL Functions
Window SQL Functions
Sort SQL Functions
Partition Transformation Functions
UDF Functions and
Other Misc Functions

I will explain a few functions from each category in the section below. For a complete list, refer to the official Spark documentation.

PySpark SQL String Functions

PySpark SQL String Functions provide a comprehensive set of functions for manipulating and transforming string data within PySpark DataFrames. These functions offer various functionalities for common string operations, such as substring extraction, string concatenation, case conversion, trimming, padding, and pattern matching.

Some commonly used PySpark SQL String Functions include:

String Functions	Description
`concat_ws`(sep, *cols)	Concat multiple strings into a single string with a specified separator
`format_number`(col, d)	Formats the number to ‘#,–#,–#.–’, rounded to d and returns the value in String
`format_string`(format, *cols)	Formats the input string to printf-style.
`length`(col)	Returns the length of the input string column.
`lower`(col) `upper`(col)	lower() – Converts all characters in a string to lowercase upper() – Convers all characters to uppercase
`lpad`(col, len, pad) `rpad`(col,len,pad)	lpad() – Add a specified character as padding on the left side. rpad() – Add a specified character as padding on the right side.
`ltrim`(col) `rtrim`()	ltrim() – Removes the space on the left side. rtrim() – Removes the spaces on the right side.
`repeat`(col, n)	Returns a new string after repeating a column n times
`split`(str, pattern[, limit])	Splits string by specified patterns
`substring`(str, pos, len)	Returns the substring from te string column.

Datetime Functions

Datetime Functions	Description
`add_months`(start, months)	Returns the date that is months months after start
`current_date`()	Returns the current date at the start of query evaluation as a `DateType` column.
`current_timestamp`()	Returns the current timestamp at the start of query evaluation as a `TimestampType` column.
`date_add`(start, days)	Returns the date that is days after start
`date_format`(date, format)	Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
`date_sub`(start, days)	Returns the date that is days days before start
`date_trunc`(format, timestamp)	Returns a timestamp truncated to the specified unit as defined by the format.
`datediff`(end, start)	Returns the number of days from start to end.
`dayofmonth`(col)	Extract the day of the month of a given date as integer.
`dayofweek`(col)	Extract the day of the week of a given date as integer.
`dayofyear`(col)	Extract the day of the year of a given date as integer.
`second`(col)	Extract the seconds of a given date as integer.
`weekofyear`(col)	Extract the week number of a given date as integer.
`year`(col)	Extract the year of a given date as integer.
`quarter`(col)	Extract the quarter of a given date as integer.
`month`(col)	Extract the month of a given date as integer.
`last_day`(date)	Returns the last day of the month to which the provided date corresponds.
`minute`(col)	Extract the minutes of a given date as integer.
`months_between`(date1, date2[, roundOff])	Returns number of months between dates date1 and date2.
`next_day`(date, dayOfWeek)	Returns the first date which is later than the value of the date column.
`hour`(col)	Extract the hours of a given date as integer.
`make_date`(year, month, day)	Returns a column with a date built from the year, month and day columns.
`from_unixtime`(timestamp[, format])	Converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp in the current system time zone, formatted as specified.
`unix_timestamp`([timestamp, format])	Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
`to_timestamp`(col[, format])	Converts a `Column` into `pyspark.sql.types.TimestampType` using the optionally specified format.
`to_date`(col[, format])	Converts a `Column` into `pyspark.sql.types.DateType` using the optionally specified format.
`trunc`(date, format)	Returns date truncated to the unit specified by the format.
`from_utc_timestamp`(timestamp, tz)	This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
`to_utc_timestamp`(timestamp, tz)	This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
`window`(timeColumn, windowDuration[, …])	Bucketize rows into one or more time windows given a timestamp specifying column.
`session_window`(timeColumn, gapDuration)	Generates session window given a timestamp specifying column.
`timestamp_seconds`(col)	New in version 3.1.0.

Collection Functions

Collection Functions	Description
`array_contains`(col, value)	Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
`arrays_overlap`(a1, a2)	Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
`slice`(x, start, length)	Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
`array_join`(col, delimiter[, null_replacement])	Concatenates the elements of column using the delimiter.
`concat`(*cols)	Concatenates multiple input columns together into a single column.
`array_position`(col, value)	Finds the position of the first occurrence of the specified value within the given array.
`element_at`(col, extraction)	Collection function: Returns element of array at given index in extraction if col is array.
`array_sort`(col)	Collection function: sorts the input array in ascending order.
`array_remove`(col, element)	Collection function: Remove all elements that equal to element from the given array.
`array_distinct`(col)	Collection function: removes duplicate values from the array.
`array_intersect`(col1, col2)	Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
`array_union`(col1, col2)	Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
`array_except`(col1, col2)	Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
`transform`(col, f)	Returns an array of elements after applying a transformation to each element in the input array.
`exists`(col, f)	Returns whether a predicate holds for one or more elements in the array.
`forall`(col, f)	Returns whether a predicate holds for every element in the array.
`filter`(col, f)	Returns an array of elements for which a predicate holds in a given array.
`aggregate`(col, initialValue, merge[, finish])	Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
`zip_with`(left, right, f)	Merge two given arrays, element-wise, into a single array using a function.
`transform_keys`(col, f)	Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
`transform_values`(col, f)	Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
`map_filter`(col, f)	Returns a map whose key-value pairs satisfy a predicate.
`map_zip_with`(col1, col2, f)	Merge two given maps, key-wise into a single map using a function.
`explode`(col)	Returns a new row for each element in the given array or map.
`explode_outer`(col)	Returns a new row for each element in the given array or map.
`posexplode`(col)	Returns a new row for each element with position in the given array or map.
`posexplode_outer`(col)	Returns a new row for each element with position in the given array or map.
`get_json_object`(col, path)	Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
`json_tuple`(col, *fields)	Creates a new row for a json column according to the given field names.
`from_json`(col, schema[, options])	Parses a column containing a JSON string into a `MapType` with `StringType` as keys type, `StructType` or `ArrayType` with the specified schema.
`schema_of_json`(json[, options])	Parses a JSON string and infers its schema in DDL format.
`to_json`(col[, options])	Converts a column containing a `StructType`, `ArrayType` or a `MapType` into a JSON string.
`size`(col)	Collection function: returns the length of the array or map stored in the column.
`sort_array`(col[, asc])	Collection function: sorts the input array in ascending or descending order according to the natural ordering of the array elements.
`array_max`(col)	Collection function: returns the maximum value of the array.
`array_min`(col)	Collection function: returns the minimum value of the array.
`shuffle`(col)	Collection function: Generates a random permutation of the given array.
`reverse`(col)	Collection function: returns a reversed string or an array with reverse order of elements.
`flatten`(col)	Collection function: creates a single array from an array of arrays.
`sequence`(start, stop[, step])	Generate a sequence of integers from start to stop, incrementing by step.
`array_repeat`(col, count)	Collection function: creates an array containing a column repeated count times.
`map_keys`(col)	Collection function: Returns an unordered array containing the keys of the map.
`map_values`(col)	Collection function: Returns an unordered array containing the values of the map.
`map_entries`(col)	Collection function: Returns an unordered array of all entries in the given map.
`map_from_entries`(col)	Collection function: Returns a map created from the given array of entries.
`arrays_zip`(*cols)	Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
`map_concat`(*cols)	Returns the union of all the given maps.
`from_csv`(col, schema[, options])	Parses a column containing a CSV string to a row with the specified schema.
`schema_of_csv`(csv[, options])	Parses a CSV string and infers its schema in DDL format.
`to_csv`(col[, options])	Converts a column containing a `StructType` into a CSV string.

Math Functions


`sqrt`(col)	Computes the square root of the specified float value.
`abs`(col)	Computes the absolute value.
`acos`(col)	Computes inverse cosine of the input column.
`acosh`(col)	Computes inverse hyperbolic cosine of the input column.
`asin`(col)	Computes inverse sine of the input column.
`asinh`(col)	Computes inverse hyperbolic sine of the input column.
`atan`(col)	Compute inverse tangent of the input column.
`atanh`(col)	Computes inverse hyperbolic tangent of the input column.
`atan2`(col1, col2)	New in version 1.4.0.
`bin`(col)	Returns the string representation of the binary value of the given column.
`cbrt`(col)	Computes the cube-root of the given value.
`ceil`(col)	Computes the ceiling of the given value.
`conv`(col, fromBase, toBase)	Convert a number in a string column from one base to another.
`cos`(col)	Computes cosine of the input column.
`cosh`(col)	Computes hyperbolic cosine of the input column.
`cot`(col)	Computes cotangent of the input column.
`csc`(col)	Computes cosecant of the input column.
`exp`(col)	Computes the exponential of the given value.
`expm1`(col)	Computes the exponential of the given value minus one.
`factorial`(col)	Computes the factorial of the given value.
`floor`(col)	Computes the floor of the given value.
`hex`(col)	Computes hex value of the given column, which could be `pyspark.sql.types.StringType`, `pyspark.sql.types.BinaryType`, `pyspark.sql.types.IntegerType` or `pyspark.sql.types.LongType`.
`unhex`(col)	Inverse of hex.
`hypot`(col1, col2)	Computes `sqrt(a^2 + b^2)` without intermediate overflow or underflow.
`log`(arg1[, arg2])	Returns the logarithm of the second argument using the first argument as the base.
`log10`(col)	Calculates the base 10 logarithm of the specified value.
`log1p`(col)	Calculates the natural logarithm of the given value plus one.
`log2`(col)	Returns the base-2 logarithm of the argument.
`pow`(col1, col2)	Returns the value of the first argument raised to the power of the second argument.
`rint`(col)	Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
`round`(col[, scale])	Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
`bround`(col[, scale])	Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
`sec`(col)	Computes secant of the input column.
`shiftleft`(col, numBits)	Shift the given value numBits left.
`shiftright`(col, numBits)	(Signed) shift the given value numBits right.
`shiftrightunsigned`(col, numBits)	Unsigned shift the given value numBits right.
`signum`(col)	Computes the signum of the given value.
`sin`(col)	Computes sine of the input column.
`sinh`(col)	Computes hyperbolic sine of the input column.
`tan`(col)	Computes tangent of the input column.
`tanh`(col)	Computes hyperbolic tangent of the input column.
`toDegrees`(col)	Deprecated since version 2.1.0.
`degrees`(col)	Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
`toRadians`(col)	Deprecated since version 2.1.0.
`radians`(col)	Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

Aggregate Functions

Window Functions

Sort Functions

UDF Functions


`pandas_udf`([f, returnType, functionType])	Creates a pandas user defined function (a.k.a.
`udf`([f, returnType])	Creates a user defined function (UDF).

Conclusion:

In this post, I have listed several commonly used built-in standard library PySpark SQL functions where you could read usage, syntax, and examples.

Happy Learning !!