Sparklyr Sort DataFrame with Examples

How do you sort a DataFrame in sparklyr (R package)? The sparklyr package provides a high-level R interface for working with Apache Spark. One of the most common operations in data analysis is ordering and sorting data. In this article, we will discuss how you can sort a Spark DataFrame from sparklyr using dplyr's arrange() function and sparklyr's sdf_sql() function. The arrange() function plays the same role as orderBy() and sort() in PySpark.

Before we start, let’s first create a DataFrame to work with. Here we use R’s data.frame() function to create the data frame and then copy it to Spark.


# load the required packages
library(sparklyr)
library(dplyr)

# create the dataframe
simpleData <- data.frame(
   employee_name = c("James","Michael","Robert","Maria","Raman","Scott","Jen","Jeff","Kumar"),
   department = c("Sales","Sales","Sales","Finance","Finance","Finance","Finance","Marketing","Marketing"),
   state = c("NY","NY","CA","CA","CA","NY","NY","CA","NY"),
   salary = c(90000,86000,81000,90000,99000,83000,79000,80000,91000),
   age = c(34,56,30,24,40,36,53,25,50),
   bonus = c(10000,20000,23000,23000,24000,19000,15000,18000,21000)
 )

# initiate spark connection
sc <- spark_connect(master = 'local', 
                    spark_home = Sys.getenv("SPARK_HOME"), 
                    app_name = "SparkByExamples.com", 
                    method = 'shell',
                    version = '3.0.0')

# Load data into a Spark dataframe
df <- copy_to(sc, simpleData, 'df')

df %>% head(5)

The above code copies the data frame to Spark and prints its first five rows.

1. Sparklyr DataFrame Sort using arrange() Function

We can sort the rows of a sparklyr DataFrame by a specified column using the arrange() function from dplyr. By default, this function sorts the data in ascending order.

Let’s sort the DataFrame created above by the state column in ascending order.


# Sort dataframe using arrange() function
df %>% arrange(state)

The above code returns the rows ordered by state, with the CA rows listed before the NY rows.
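
Sorting by more than one column follows the same pattern: arrange() accepts several column names and uses the later ones to break ties in the earlier ones. The following is a minimal sketch, assuming a local Spark installation that spark_connect(master = "local") can find. The table name emp_multi and the trimmed-down data are made up for this illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# Small illustrative data set (hypothetical subset of the article's data)
emp <- data.frame(
  employee_name = c("James", "Michael", "Robert", "Maria", "Raman"),
  state = c("NY", "NY", "CA", "CA", "CA"),
  salary = c(90000, 86000, 81000, 90000, 99000)
)
emp_tbl <- copy_to(sc, emp, "emp_multi", overwrite = TRUE)

# Sort by state first, then by salary within each state (both ascending)
sorted_multi <- emp_tbl %>%
  arrange(state, salary) %>%
  collect()

print(sorted_multi)

spark_disconnect(sc)
```

Calling collect() brings the sorted result back into R as a regular tibble, which is handy for checking the order on small data but should be avoided on large tables.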

2. DataFrame Sort in Descending Order

If you want to sort the DataFrame in descending order, wrap the column in the desc() function and pass it to the arrange() function.


# Sort dataframe using arrange() function in descending order
df %>% arrange(desc(salary))

The above code returns the rows ordered by salary from highest to lowest, starting with Raman (99000).
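
A common use of a descending sort is picking the top few rows. The sketch below combines desc() with a second, ascending column as a tie-breaker and then keeps the first three rows with head(). It assumes a local Spark installation; the table name emp_top and the reduced data set are hypothetical and only serve the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# Small illustrative data set (hypothetical subset of the article's data)
emp <- data.frame(
  employee_name = c("James", "Michael", "Robert", "Maria", "Raman"),
  salary = c(90000, 86000, 81000, 90000, 99000)
)
emp_tbl <- copy_to(sc, emp, "emp_top", overwrite = TRUE)

# Highest salary first; employee_name (ascending) breaks the 90000 tie
top_earners <- emp_tbl %>%
  arrange(desc(salary), employee_name) %>%
  head(3) %>%
  collect()

print(top_earners)

spark_disconnect(sc)
```

Without the tie-breaker column, the relative order of James and Maria (both 90000) would not be guaranteed.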

3. Sparklyr Sort using sdf_sql()

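Sparklyr also provides a way to run SQL queries against the DataFrame. Because copy_to() registered the data in Spark under the table name df, the query can refer to it by that name. The sdf_sql() function takes the Spark connection and a Spark SQL query and returns the result as a Spark DataFrame. Below is an example of how to sort the data frame using SQL syntax.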


# Sort dataframe using sdf_sql() function in ascending order
df_asc <- sdf_sql(sc, 'select * from df ORDER BY department asc')

# Sort dataframe using sdf_sql() function in descending order
df_desc <- sdf_sql(sc, 'select * from df ORDER BY department desc')

The first query returns the rows ordered by department from Finance to Sales; the second returns them from Sales to Finance.
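
The SQL route also handles several sort keys with different directions in a single ORDER BY clause. The following is a minimal sketch, assuming a local Spark installation; the table name emp_sql and the reduced data set are hypothetical and chosen only for this example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# Small illustrative data set (hypothetical subset of the article's data)
emp <- data.frame(
  employee_name = c("James", "Robert", "Maria", "Raman", "Jeff"),
  department = c("Sales", "Sales", "Finance", "Finance", "Marketing"),
  salary = c(90000, 81000, 90000, 99000, 80000)
)
# copy_to() registers the data in Spark under the name "emp_sql"
emp_tbl <- copy_to(sc, emp, "emp_sql", overwrite = TRUE)

# Department ascending, then salary descending within each department
sorted_sql <- sdf_sql(
  sc,
  "SELECT * FROM emp_sql ORDER BY department ASC, salary DESC"
) %>%
  collect()

print(sorted_sql)

spark_disconnect(sc)
```

The query refers to the name passed to copy_to(), not to the R variable that holds the table reference.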

4. Dataframe Sorting Complete Example


# load the required packages
library(sparklyr)
library(dplyr)

# create the dataframe
simpleData <- data.frame(employee_name = c("James","Michael","Robert","Maria","Raman","Scott","Jen","Jeff","Kumar"),
                        department = c("Sales","Sales","Sales","Finance","Finance","Finance","Finance","Marketing","Marketing"),
                        state = c("NY","NY","CA","CA","CA","NY","NY","CA","NY"),
                        salary = c(90000,86000,81000,90000,99000,83000,79000,80000,91000),
                        age = c(34,56,30,24,40,36,53,25,50),
                        bonus = c(10000,20000,23000,23000,24000,19000,15000,18000,21000))

# initiate spark connection
sc <- spark_connect(master = 'local', 
                    spark_home = Sys.getenv("SPARK_HOME"), 
                    app_name = "SparkByExamples.com", 
                    method = 'shell',
                    version = '3.0.0')

# Load data into a Spark dataframe
df <- copy_to(sc, simpleData, 'df')

df %>% head(5)

# Sort dataframe using arrange() function in ascending order
df %>% arrange(state)

# Sort dataframe using arrange() function in descending order
df %>% arrange(desc(salary))

# Sort dataframe using sdf_sql() function in ascending order
df_asc <- sdf_sql(sc, 'select * from df ORDER BY department asc')

# Sort dataframe using sdf_sql() function in descending order
df_desc <- sdf_sql(sc, 'select * from df ORDER BY department desc')

5. Conclusion

In this tutorial, you learned how to sort the rows of a Sparklyr DataFrame by its columns using the arrange() function and, with Spark SQL, the sdf_sql() function. Both approaches were demonstrated in ascending and descending order.

Feel free to comment and ask questions. I will be more than happy to answer.

Happy learning!
