Spark SQL Sampling with Examples

Spark sampling is a mechanism to get random sample records from a dataset. It is helpful when you have a large dataset and want to analyze or test only a subset of the data, for example 10% of the original file.

Spark provides sampling methods on the RDD, DataFrame, and Dataset APIs. In this article, I will explain how to get random sample records, how to get the same random sample on every run, and more, with Scala examples.

Data sampling is commonly used by data analysts and data scientists to compute statistics on a subset of the data before applying a process to the full dataset.

Depending on the Spark API you choose, you can use the DataFrame.sample(), RDD.sample(), RDD.takeSample(), or DataFrameStatFunctions.sampleBy() functions to get sample data.

1. Spark DataFrame Sampling

Spark DataFrame sample() has several overloads. Every signature takes fraction as a mandatory argument, a double value between 0 and 1, and returns a new Dataset of randomly selected records.

1.1 DataFrame sample() Syntax


sample(fraction : scala.Double, seed : scala.Long) : Dataset[T]
sample(fraction : scala.Double) : Dataset[T]
sample(withReplacement : scala.Boolean, fraction : scala.Double, seed : scala.Long) : Dataset[T]
sample(withReplacement : scala.Boolean, fraction : scala.Double) : Dataset[T]

Parameters

fraction – Fraction of rows to generate, in the range [0.0, 1.0]. Note that the result is not guaranteed to contain exactly this fraction of the total records.

seed – Seed for sampling (default: a random seed). Use the same seed to reproduce the same sample.

withReplacement – Whether to sample with replacement (default: false).

1.2 DataFrame sample() Examples

Note: If you run the same examples on your system, you may see different results for Examples 1 and 3.

Example 1: Using fraction to get a random sample in Spark – Using a fraction between 0 and 1 returns approximately that fraction of the dataset's rows. For example, 0.1 returns roughly 10% of the rows; however, it is not guaranteed to return exactly 10% of the records.


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExample")
  .getOrCreate()
val df = spark.range(100)
println(df.sample(0.1).collect().mkString(","))
//Output: 7,8,27,36,40,48,51,52,63,76,85,88

My DataFrame has 100 records, so a 10% sample should contain 10 records, but the sample() function returned 12. This shows that sample() does not return the exact fraction specified.
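
If you need an exact number of rows rather than an approximate fraction, a common workaround (my addition here, not part of sample() itself) is to sort by a random key and take the first n rows. A minimal sketch:


import org.apache.spark.sql.functions.rand

// Shuffle the rows with a random sort key, then keep exactly 10.
// This is more expensive than sample() because it evaluates rand()
// for every row and sorts the full dataset.
val exactTen = df.orderBy(rand(123)).limit(10)
println(exactTen.count()) // always 10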

Example 2: Using seed to reproduce the same sample in Spark – Every time you run sample() it returns a different set of records. However, during development and testing you may need to regenerate the same sample on every run, so you can compare results against a previous run. To get the same random sample consistently, use the same seed value on every run; change the seed to get different results.


println(df.sample(0.1,123).collect().mkString(","))
//Output: 36,37,41,43,56,66,69,75,83
println(df.sample(0.1,123).collect().mkString(","))
//Output: 36,37,41,43,56,66,69,75,83
println(df.sample(0.1,456).collect().mkString(","))
//Output: 19,21,42,48,49,50,75,80

In the examples above, the first two use seed 123, so the sampling results are the same; the last uses seed 456, so it returns a different set of records.

Example 3: Sampling withReplacement (may contain duplicates) – Sometimes you may need a random sample that allows repeated values. Passing true for withReplacement lets the same record be selected more than once.


println(df.sample(true,0.3,123).collect().mkString(",")) //with Duplicates
//Output: 0,5,9,11,14,14,16,17,21,29,33,41,42,52,52,54,58,65,65,71,76,79,85,96
println(df.sample(0.3,123).collect().mkString(",")) // No duplicates
//Output: 0,4,17,19,24,25,26,36,37,41,43,44,53,56,66,68,69,70,71,75,76,78,83,84,88,94,96,97,98

In the first example, the values 14, 52, and 65 appear more than once.

2. Spark Stratified Sampling

Use sampleBy() from the DataFrameStatFunctions class to perform stratified sampling in Spark. It returns a sample from each stratum, that is, from each distinct value of a given column, according to the fraction you specify for that value.

2.1 sampleBy() Syntax


sampleBy[T](col : _root_.scala.Predef.String, fractions : _root_.scala.Predef.Map[T, scala.Double], seed : scala.Long) : DataFrame
sampleBy[T](col : _root_.scala.Predef.String, fractions : java.util.Map[T, java.lang.Double], seed : scala.Long) : DataFrame
sampleBy[T](col : org.apache.spark.sql.Column, fractions : _root_.scala.Predef.Map[T, scala.Double], seed : scala.Long) : DataFrame
sampleBy[T](col : org.apache.spark.sql.Column, fractions : java.util.Map[T, java.lang.Double], seed : scala.Long) : DataFrame

2.2 sampleBy() Example


Since every id value in our DataFrame is unique, sampling by id directly is not very meaningful, so the example below first derives a key column (id % 3) to create three strata with values 0, 1, and 2, then samples 10% of stratum 0 and 20% of stratum 1. Keys missing from the fractions map default to 0.0, so stratum 2 is excluded entirely.


val sampled = df.withColumn("key", df("id") % 3)
  .stat.sampleBy("key", Map(0 -> 0.1, 1 -> 0.2), 123)
println(sampled.collect().mkString(","))
//Returns roughly 10% of the key=0 rows and 20% of the key=1 rows
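
To sanity-check the strata proportions, you can count the sampled rows per key (a quick check I've added; exact counts will vary around the requested fractions):


// About 34 of the 100 ids have key 0 and about 33 have key 1, so expect
// roughly 3-4 rows for key 0 and 6-7 rows for key 1.
sampled.groupBy("key").count().show()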

3. Spark RDD Sampling

Spark RDD also provides a sample() function to get a random sample; in addition, it has a takeSample() function that returns an Array[T].

RDD takeSample() is an action, so you need to be careful when you use it: returning too much data causes an out-of-memory error on the driver, similar to collect().
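
To make the transformation-versus-action distinction concrete, here is a minimal sketch (the variable names are illustrative, and the same rdd is created again in the examples below):


val rdd = spark.sparkContext.range(0, 100)

// sample() is a transformation: it returns a new, lazily evaluated RDD,
// so the sampled data stays distributed across the cluster.
val sampledRdd: org.apache.spark.rdd.RDD[Long] = rdd.sample(false, 0.1, 0)

// takeSample() is an action: it brings the result back to the driver as a
// local Array, so a large num can exhaust driver memory, like collect().
val localSample: Array[Long] = rdd.takeSample(false, 10, 0)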

3.1 RDD sample() Syntax

The Spark RDD sample() function returns a random sample, similar to DataFrame sample(), and takes the same kinds of parameters but in a different order. Since I've already explained these parameters in the DataFrame section above, I won't repeat the explanation here; if you haven't read that section yet, I recommend doing so first.

sample() on an RDD returns a new RDD of randomly selected records. Below is the syntax.


sample(withReplacement : scala.Boolean, fraction : scala.Double, seed : scala.Long) : RDD[T]

3.2 RDD sample() Example


val rdd = spark.sparkContext.range(0,100)
println(rdd.sample(false,0.1,0).collect().mkString(","))
//Output: 1,20,29,42,53,62,63,70,82,87
println(rdd.sample(true,0.3,123).collect().mkString(","))
//Output: 1,4,21,30,32,32,32,33,42,45,46,52,53,54,55,58,58,66,66,68,70,70,74,86,92,96,98,99

3.3 RDD takeSample() Syntax

Below is the syntax of the RDD takeSample() function.


takeSample(withReplacement : scala.Boolean, num : scala.Int, seed : scala.Long) : scala.Array[T] 

3.4 RDD takeSample() Example


println(rdd.takeSample(false,10,0).mkString(","))
//Output: 62,30,27,29,21,16,86,7,20,91
println(rdd.takeSample(true,30,123).mkString(","))
//Output: 85,49,61,16,90,5,33,98,89,38,89,29,5,48,24,60,41,33,13,40,14,33,56,95,40,48,61,36,82,9
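
One more difference from sample(): when sampling without replacement, takeSample() returns exactly num elements (as long as the RDD has at least that many), not an approximate fraction:


// The result size is exact, unlike sample(fraction).
println(rdd.takeSample(false, 10, 0).length) // always 10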

Conclusion

In summary, Spark sampling can be done on RDDs and DataFrames. To sample, you specify how much of the data you want to retrieve as a fraction.

Use seed to regenerate the same sample across multiple runs.

Use withReplacement if you are okay with repeated records in the sample.

Thanks for reading. If you recognize my effort or like the articles here, please do comment or provide any suggestions for improvement in the comments section!

