
PySpark between() Example


The PySpark between(lowerBound, upperBound) is used to get the rows between two values. The Column.between() returns a boolean expression that evaluates to True if the value of the column is between the given bounds (columns or literal values) and False otherwise. By combining this expression with where() you can get the rows whose values fall between two values.

Alternatively, you can also use the BETWEEN operator in a SQL statement, where the syntax and behavior are the same as in RDBMS SQL.

Let’s create the PySpark DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# prepare Data
simpleData = (("James", "Sales", 4000), \
    ("Michael", "Sales", 4600),  \
    ("Robert", "Sales", 4100),   \
    ("Maria", "Finance", 3000),  \
    ("James", "Sales", 3000),    \
    ("Scott", "Finance", 3300),  \
    ("Jen", "Finance", 3900),    \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000),\
    ("Saif", "Sales", 4100) \
  )
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

Yields below output.

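The printSchema() and show() calls produce output roughly like the following (exact alignment can vary slightly by Spark version):

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |4000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+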

1. PySpark Column between()

The pyspark.sql.Column.between() returns a boolean expression that evaluates to True when the column value falls between the two given columns or literal values, and False otherwise. The PySpark Column class has several other functions that also result in a boolean expression.

Note that the between() range is inclusive: both the lower-bound and upper-bound values are included.


# Syntax of between()
Column.between(lowerBound, upperBound)
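between() is effectively shorthand for an inclusive range check built from two comparisons, so the following two filters are equivalent (a minimal sketch using the df created above):


# between() is an inclusive range check
df.where(df.salary.between(3000, 4000)).show()

# Equivalent check written with >= and <= combined by & (AND)
df.where((df.salary >= 3000) & (df.salary <= 4000)).show()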

In the below example, df.salary is of Column type, so we can call df.salary.between().


# Using between
df.select(df.employee_name, df.salary.between(3000, 4000)) \
  .show()

Yields below output.

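The result contains the employee_name column and a boolean column for the range check. It should look roughly like this; note that the generated header for the boolean column (the rendered expression) may differ between Spark versions:

+-------------+---------------------------------------+
|employee_name|((salary >= 3000) AND (salary <= 4000))|
+-------------+---------------------------------------+
|        James|                                   true|
|      Michael|                                  false|
|       Robert|                                  false|
|        Maria|                                   true|
|        James|                                   true|
|        Scott|                                   true|
|          Jen|                                   true|
|         Jeff|                                   true|
|        Kumar|                                  false|
|         Saif|                                  false|
+-------------+---------------------------------------+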

2. Using where() with between()

You can combine between() with where() to filter the DataFrame by selecting values within a given range. The following example returns all rows with a salary between 3000 and 4000 (both boundary values are included in the result).


# Using where with between
df.select("employee_name","salary") \
  .where(df.salary.between(3000, 4000)) \
  .show()

Yields below output.

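Only the rows whose salary falls in the inclusive 3000 to 4000 range are returned (row order may vary):

+-------------+------+
|employee_name|salary|
+-------------+------+
|        James|  4000|
|        Maria|  3000|
|        James|  3000|
|        Scott|  3300|
|          Jen|  3900|
|         Jeff|  3000|
+-------------+------+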

3. Using col().between()

Alternatively, you can also use the col() SQL function.


# using col().between()
from pyspark.sql.functions import col
# Using where with between
df.select("employee_name","salary") \
  .where(col("salary").between(3000, 4000)) \
  .show()

Yields the same output as above.
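Since col() references a column by name without needing the DataFrame variable, it is handy when building larger filter expressions. Below is a minimal sketch (using the same df) that combines between() with an additional condition:


# Combine between() with another predicate using & (AND)
from pyspark.sql.functions import col
df.select("employee_name", "department", "salary") \
  .where(col("salary").between(3000, 4000) & (col("department") == "Sales")) \
  .show()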

4. PySpark SQL between

PySpark also provides a way to run operations as native SQL statements, so you can use the BETWEEN operator, a logical operator that checks whether a value falls within a range.

You can also use the greater-than-or-equal-to (>=) and less-than-or-equal-to (<=) operators in place of BETWEEN in a SQL statement; however, the condition that uses the BETWEEN operator is more readable, and it is your choice which one to use.

To run a SQL query, use the spark.sql() function; the temporary view created with createOrReplaceTempView() is available until you end the current SparkSession.

spark.sql() returns a DataFrame, and here I have used show() to display its contents on the console.


# Using SQL
df.createOrReplaceTempView("employee")

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary between 3000 and 4000")
df2.show()

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary >= 3000 and salary <= 4000")
df2.show()

These two SQL statements yield the same output as above.
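If you need the opposite result, the range check can be negated: use the ~ operator on the boolean expression in the DataFrame API, or NOT BETWEEN in SQL. A minimal sketch (both return the rows outside the 3000 to 4000 range):


# Rows outside the range using the DataFrame API
df.select("employee_name", "salary") \
  .where(~df.salary.between(3000, 4000)) \
  .show()

# Rows outside the range using SQL NOT BETWEEN
spark.sql("select employee_name, salary from employee"
          " where salary not between 3000 and 4000").show()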

5. Complete Example

Following is a complete example of the between() predicate.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("James", "Sales", 4000), \
    ("Michael", "Sales", 4600),  \
    ("Robert", "Sales", 4100),   \
    ("Maria", "Finance", 3000),  \
    ("James", "Sales", 3000),    \
    ("Scott", "Finance", 3300),  \
    ("Jen", "Finance", 3900),    \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000),\
    ("Saif", "Sales", 4100) \
  )
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using between
df.select(df.employee_name, df.salary.between(3000, 4000)) \
  .show()
    
# Using where with between
df.select("employee_name","salary") \
  .where(df.salary.between(3000, 4000)) \
  .show()

# using col().between()
from pyspark.sql.functions import col
# Using where with between
df.select("employee_name","salary") \
  .where(col("salary").between(3000, 4000)) \
  .show()

# Using SQL
df.createOrReplaceTempView("employee")

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary between 3000 and 4000")
df2.show()

df2 = spark.sql("select employee_name, salary from employee"  \
                " where salary >= 3000 and salary <= 4000")
df2.show()

Conclusion

In this article, you have learned how to use the PySpark between() function with several examples. It returns a boolean expression that evaluates to True if the column value is between the given bounds and False otherwise.
