The PySpark between() function is used to get the rows whose column values fall within a given range. Column.between() returns a boolean expression that evaluates to True when the value of the column is between the given lower and upper bounds (columns or literal values), and False otherwise. By combining this expression with where() you can select only the rows within the range.
Alternatively, you can use the BETWEEN operator in a SQL statement, where the syntax and behavior are the same as in RDBMS SQL.
Let’s create the PySpark DataFrame.
# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

# Prepare data
simpleData = (("James", "Sales", 4000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
columns = ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data=simpleData, schema=columns)
df.printSchema()
df.show(truncate=False)
Yields below output.
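root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |4000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+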
1. PySpark Column between()
The pyspark.sql.Column.between() function returns a boolean expression that evaluates to True when the column value falls between the two given columns or literal values, and False otherwise. The PySpark Column class has several other functions that also produce a boolean expression.
Note that the between() range is inclusive: both the lower-bound and upper-bound values are included.
# Syntax of between()
Column.between(lowerBound, upperBound)
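Because the bounds are inclusive, values equal to the lower or upper bound also return True. Here is a minimal sketch (not part of the original example) that checks the two boundary values using literal columns built with lit():

# Minimal sketch: both boundary values fall inside the range, so both columns show true
from pyspark.sql.functions import lit
spark.range(1).select(
    lit(3000).between(3000, 4000).alias("lower_bound_included"),
    lit(4000).between(3000, 4000).alias("upper_bound_included")
).show()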
In the example below, df.salary is of Column type, so we can call df.salary.between() on it.
# Using between
df.select(df.employee_name, df.salary.between(3000, 4000)) \
.show()
Yields below output.
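The boolean column is named after the underlying expression (the exact header can vary slightly across Spark versions), so the output looks similar to the following:

+-------------+---------------------------------------+
|employee_name|((salary >= 3000) AND (salary <= 4000))|
+-------------+---------------------------------------+
|        James|                                   true|
|      Michael|                                  false|
|       Robert|                                  false|
|        Maria|                                   true|
|        James|                                   true|
|        Scott|                                   true|
|          Jen|                                   true|
|         Jeff|                                   true|
|        Kumar|                                  false|
|         Saif|                                  false|
+-------------+---------------------------------------+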
2. Using where() with between()
You can combine between() with where() to filter the DataFrame to rows within a given range. The following example returns all rows with a salary between 3000 and 4000 (both boundary values are included in the result).
# Using where with between
df.select("employee_name","salary") \
.where(df.salary.between(3000, 4000)) \
.show()
Yields below output.
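+-------------+------+
|employee_name|salary|
+-------------+------+
|        James|  4000|
|        Maria|  3000|
|        James|  3000|
|        Scott|  3300|
|          Jen|  3900|
|         Jeff|  3000|
+-------------+------+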
3. Using col().between()
Alternatively, you can use the col() SQL function to refer to the column.
# using col().between()
from pyspark.sql.functions import col
# Using where with between
df.select("employee_name","salary") \
.where(col("salary").between(3000, 4000)) \
.show()
Yields the same output as above.
4. PySpark SQL between
PySpark also lets you execute native SQL statements, so you can use the BETWEEN operator, a logical operator that checks whether a value falls within a range.
You can also substitute the greater-than-or-equal-to (>=) and less-than-or-equal-to (<=) operators for BETWEEN in a SQL statement; however, the condition that uses BETWEEN is usually more readable, and which one to use is your choice.
# Using SQL
df.createOrReplaceTempView("employee")
df2 = spark.sql("select employee_name, salary from employee" \
" where salary between 3000 and 4000")
df2.show()
df2 = spark.sql("select employee_name, salary from employee" \
" where salary >= 3000 and salary <= 4000")
df2.show()
These two SQL statements yield the same output as above.
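If you need the opposite result, the range check can be negated. Below is a minimal sketch (not part of the original example) that returns rows outside the range, using the ~ operator on the DataFrame expression and NOT BETWEEN in SQL:

# Rows outside the 3000-4000 range using the DataFrame API
df.select("employee_name", "salary") \
  .where(~df.salary.between(3000, 4000)) \
  .show()

# Same result using SQL NOT BETWEEN
df3 = spark.sql("select employee_name, salary from employee" \
    " where salary not between 3000 and 4000")
df3.show()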
5. Complete Example
Following is a complete example of the between() predicate.
# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

# Prepare Data
simpleData = (("James", "Sales", 4000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
columns = ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data=simpleData, schema=columns)
df.printSchema()
df.show(truncate=False)
# Using between
df.select(df.employee_name, df.salary.between(3000, 4000)) \
.show()
# Using where with between
df.select("employee_name","salary") \
.where(df.salary.between(3000, 4000)) \
.show()
# using col().between()
from pyspark.sql.functions import col
# Using where with between
df.select("employee_name","salary") \
.where(col("salary").between(3000, 4000)) \
.show()
# Using SQL
df.createOrReplaceTempView("employee")
df2 = spark.sql("select employee_name, salary from employee" \
" where salary between 3000 and 4000")
df2.show()
df2 = spark.sql("select employee_name, salary from employee" \
" where salary >= 3000 and salary <= 4000")
df2.show()
Conclusion
In this article, you have learned how to use the PySpark between() function with several examples. It returns a boolean expression that evaluates to True if the column value falls between the given lower and upper bounds, and False otherwise.