PySpark between() Example

The PySpark between(lowerBound, upperBound) is used to get the rows whose column value lies between two values. Column.between() returns a boolean expression: it evaluates to True if the value of the column falls between the given bounds (column or literal values), and to False otherwise. By combining this function with where(), you can get the rows where the column value is between two values.


Alternatively, you can also use the BETWEEN operator in a SQL statement, where the syntax and behavior are the same as in RDBMS SQL.

Let’s create the PySpark DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("James", "Sales", 4000), \
    ("Michael", "Sales", 4600),  \
    ("Robert", "Sales", 4100),   \
    ("Maria", "Finance", 3000),  \
    ("James", "Sales", 3000),    \
    ("Scott", "Finance", 3300),  \
    ("Jen", "Finance", 3900),    \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000),\
    ("Saif", "Sales", 4100) \
  )
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

This prints the DataFrame schema and its contents to the console.


1. PySpark Column between()

The pyspark.sql.Column.between() method returns a boolean expression that evaluates to True when the column value lies between the two given bounds (columns or literal values), and to False otherwise. The PySpark Column class has several other functions that also result in a boolean expression.
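For example, a few other Column methods likewise return boolean expressions. The sketch below is illustrative only (the aliases are made-up names) and uses the DataFrame created above:

# A few other Column methods that also return boolean expressions
df.select(
    df.employee_name,
    df.department.isin("Sales", "Finance").alias("in_sales_or_finance"),  # membership test
    df.employee_name.startswith("J").alias("name_starts_with_j"),         # string prefix test
    df.salary.isNull().alias("salary_is_null")                            # null check
).show()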

Note that the between() range is inclusive: both the lower-bound and upper-bound values are included.


# Syntax of between()
Column.between(lowerBound, upperBound)

In the below example, df.salary is of Column type, so we can call df.salary.between().


# Using between
df.select(df.employee_name, df.salary.between(3000, 4000)) \
  .show()

This returns each employee name along with a boolean column indicating whether the salary falls between 3000 and 4000.

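Because the range is inclusive on both ends, between(3000, 4000) behaves the same as chaining >= and <= comparisons on the column; if you need to exclude the boundary values, use strict comparisons instead. A minimal sketch (the aliases are only for illustration):

# Equivalent of between(3000, 4000) using inclusive comparisons
df.select(
    df.employee_name,
    ((df.salary >= 3000) & (df.salary <= 4000)).alias("salary_in_range")
).show()

# Strict comparisons when the boundary values should be excluded
df.select(
    df.employee_name,
    ((df.salary > 3000) & (df.salary < 4000)).alias("salary_strictly_between")
).show()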

2. Using where() with between()

You can combine between() with where() to filter the DataFrame to rows whose values fall within a given range. The following example returns all rows with a salary between 3000 and 4000 (both boundary values are included in the result).


# Using where with between
df.select("employee_name","salary") \
  .where(df.salary.between(3000, 4000)) \
  .show()

This returns only the rows with a salary between 3000 and 4000, inclusive.
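Since where() is an alias for filter(), the same query can also be written with filter(); a minimal equivalent sketch:

# filter() is equivalent to where()
df.select("employee_name","salary") \
  .filter(df.salary.between(3000, 4000)) \
  .show()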

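Because between() returns a boolean Column expression, it can also be combined with other predicates using & (and) or | (or). A small sketch that additionally restricts the result to the Sales department (the extra condition is only an example):

# Combining between() with another condition
df.select("employee_name","department","salary") \
  .where(df.salary.between(3000, 4000) & (df.department == "Sales")) \
  .show()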

3. Using col().between()

Alternatively, you can also use the col() SQL function to refer to the column.


# using col().between()
from pyspark.sql.functions import col
# Using where with between
df.select("employee_name","salary") \
  .where(col("salary").between(3000, 4000)) \
  .show()

Yields the same output as above.
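As mentioned earlier, the bounds do not have to be literals; they can be Column expressions themselves. A small sketch with a hypothetical DataFrame (bands, band_low, and band_high are made-up names):

# Using Column expressions as the lower and upper bounds
bands = spark.createDataFrame(
    [("James", 4000, 3500, 4500), ("Kumar", 2000, 2500, 3500)],
    ["employee_name", "salary", "band_low", "band_high"]
)
bands.select(
    "employee_name",
    bands.salary.between(bands.band_low, bands.band_high).alias("within_band")
).show()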

4. PySpark SQL between

PySpark also provides a way to run operations as native SQL statements, so you can use the BETWEEN operator, a logical operator that checks whether a value falls within a range.

You can also substitute the BETWEEN operator with greater than or equal to (>=) and less than or equal to (<=) comparisons in the SQL statement; however, the condition that uses the BETWEEN operator is usually more readable, and it is your choice which one to use.

To run a SQL query, use the spark.sql() function; the temporary view created with createOrReplaceTempView() remains available until you end the current SparkSession.

spark.sql() returns a DataFrame, and here I have used show() to display its contents on the console.


# Using SQL
df.createOrReplaceTempView("employee")

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary between 3000 and 4000")
df2.show()

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary >= 3000 and salary <= 4000")
df2.show()

These two SQL statements yield the same output as above.
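If you prefer the SQL BETWEEN syntax but do not want to register a temporary view, the same condition can be passed to the DataFrame API through expr(); a minimal sketch:

# Using the SQL BETWEEN operator inside the DataFrame API via expr()
from pyspark.sql.functions import expr

df.select("employee_name","salary") \
  .where(expr("salary BETWEEN 3000 AND 4000")) \
  .show()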

5. Complete Example

Following is a complete example of the between() predicate.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("James", "Sales", 4000), \
    ("Michael", "Sales", 4600),  \
    ("Robert", "Sales", 4100),   \
    ("Maria", "Finance", 3000),  \
    ("James", "Sales", 3000),    \
    ("Scott", "Finance", 3300),  \
    ("Jen", "Finance", 3900),    \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000),\
    ("Saif", "Sales", 4100) \
  )
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using between
df.select(df.employee_name, df.salary.between(3000, 4000)) \
  .show()
    
# Using where with between
df.select("employee_name","salary") \
  .where(df.salary.between(3000, 4000)) \
  .show()

# using col().between()
from pyspark.sql.functions import col
# Using where with between
df.select("employee_name","salary") \
  .where(col("salary").between(3000, 4000)) \
  .show()

# Using SQL
df.createOrReplaceTempView("employee")

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary between 3000 and 4000")
df2.show()

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary >= 3000 and salary <= 4000")
df2.show()

Conclusion

In this article, you have learned how to use PySpark between() with several examples. It returns a boolean expression that evaluates to True if the column value is between the given bounds, and to False otherwise.
