Pandas Select Rows Based on Column Values

In Pandas, you can select rows based on column values using boolean indexing, or using methods like the DataFrame.loc[] attribute, DataFrame.query(), or DataFrame.apply() with a lambda function.


In this article, I will explain how to select rows based on single or multiple column values (including values from a list), and also how to select rows that have None or NaN values, with several examples.

Key Points –

  • Pandas allows the selection of rows based on column values using boolean arrays.
  • Boolean expressions can involve logical operators like & (and), | (or), and ~ (not) for complex conditions.
  • Methods like loc[], apply(), query(), and isin() offer versatile ways to select rows based on column values.
  • Optimizing boolean expressions for selecting rows can significantly improve performance, especially when dealing with large datasets, by leveraging Pandas’ internal optimizations.
  • Avoid chained indexing (df[][]), as it can lead to unpredictable behavior; replace it with more explicit methods such as loc[] (see the sketch below).
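
To illustrate the last two points, here is a minimal sketch using a small throwaway DataFrame (the example data for this article is created in the next section):


# Combine conditions with & (and) and ~ (not)
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
mask = (df['A'] > 1) & ~(df['B'] == 30)
print(df.loc[mask])

# Avoid chained indexing like df[df['A'] > 1]['B'];
# prefer a single explicit loc[] call
print(df.loc[df['A'] > 1, 'B'])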

Create DataFrame

Now, let’s create a Pandas DataFrame using data from a Python dictionary, where the columns are Courses, Fee, Duration, and Discount.


# Create pandas DataFrame 
import pandas as pd
import numpy as np

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee':[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','40days',None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print("Original DataFrame:\n",df)

Yields below output.

# Output:
# Original DataFrame:
#    Courses    Fee Duration  Discount
#0    Spark  22000   30days      1000
#1  PySpark  25000   50days      2300
#2   Hadoop  23000   40days      1000
#3   Python  24000     None      1200
#4   Pandas  26000      NaN      2500

Select Rows Based on Column Values

You can use boolean indexing to select rows based on column values in a Pandas DataFrame. Selecting rows based on column values means extracting the specific rows from the DataFrame that satisfy certain conditions defined on one or more columns.

The following program creates a DataFrame named df and then selects rows where the Courses column has the value Spark. The resulting DataFrame, df2, contains only the rows related to the Spark course.


# Select rows based on column values
df2=df[df["Courses"] == 'Spark'] 
print("Selected rows where 'Courses' is 'Spark':\n", df2)

In the above example, this prints the rows where the value in the Courses column is equal to Spark.

# Output:
# Selected rows where 'Courses' is 'Spark':
#   Courses    Fee Duration  Discount
#0   Spark  22000   30days      1000

You can also write the above statement with a variable. The following selects rows from the DataFrame df where the value in the Courses column equals the value stored in the variable value, which is Spark in this case, and assigns the result to df2. This is a concise and effective way to select rows based on a specific column value.


value="Spark"
df2=df[df["Courses"] == value]
print(df2)

# Output:
#   Courses    Fee Duration  Discount
#0   Spark  22000   30days      1000

Use the != operator to select rows where the column value does not equal a certain value. For instance, df[df["Courses"] != 'Spark'].


df2=df[df["Courses"] != 'Spark']
print(df2)

# Output:
#    Courses    Fee Duration  Discount
#1  PySpark  25000   50days      2300
#2   Hadoop  23000   40days      1000
#3   Python  24000     None      1200
#4   Pandas  26000      NaN      2500

Using DataFrame.query()

Alternatively, you can use the DataFrame.query() function. It selects rows from the DataFrame where the value in the Courses column is equal to Spark.

The DataFrame.query() method allows you to write conditions as strings, making it more concise. In this case, the condition is specified as a string: "Courses == 'Spark'". It selects rows where the “Courses” column is equal to ‘Spark’. Adjust the column name and the value in the query string based on your requirements.


# Select rows based on column values 
# Using DataFrame.query()
df2 = df.query("Courses == 'Spark'")
print("Selected rows where 'Courses' is 'Spark':\n", df2)

# Output:
# Selected rows where 'Courses' is 'Spark':
#   Courses    Fee Duration  Discount
#0   Spark  22000   30days      1000
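
As a side note, query() can also reference a local Python variable by prefixing it with @, which is handy when the value comes from elsewhere in your program:


# Reference a Python variable inside query() using @
value = 'Spark'
df2 = df.query("Courses == @value")
print(df2)

# Output:
#   Courses    Fee Duration  Discount
#0   Spark  22000   30days      1000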

Using DataFrame.loc[] to Select Based on Column Values

Select rows from the DataFrame df where the value in the Courses column is equal to 'Spark' using DataFrame.loc[].


# Select rows based on column values using DataFrame.loc
df2 = df.loc[df['Courses'] == 'Spark']
print("\nSelected rows where 'Courses' is 'Spark':\n", df2)

In the above example, df['Courses'] == 'Spark' creates a boolean mask, and df.loc[] is then used to select rows where this condition is True. Adjust the column name and the value in the condition according to your specific requirements.
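
One advantage of DataFrame.loc[] over plain boolean indexing is that it can filter rows and select a subset of columns in a single call. For example:


# Filter rows and pick specific columns in one loc[] call
df2 = df.loc[df['Courses'] == 'Spark', ['Courses', 'Fee']]
print(df2)

# Output:
#  Courses    Fee
#0   Spark  22000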

Select Rows Based on List of Column Values

You can use the isin() method to select rows where the Courses column is in a list of values. For instance, df['Courses'].isin(values) creates a boolean mask where rows are selected if the ‘Courses’ column value is in the list values. Adjust the column name ('Courses') and the list of values (values) according to your specific DataFrame and requirements.


# Select rows where 'Courses' column is in the list of values
# Using DataFrame.isin()
values = ['Spark', 'PySpark', 'Hadoop']
df2 = df[df['Courses'].isin(values)]
print("Selected rows where 'Courses' is in the list:\n", df2)

# Using df.loc
values = ['Spark', 'PySpark', 'Hadoop']
df2 = df.loc[df['Courses'].isin(values)]
print("Selected rows where 'Courses' is in the list:\n", df2)

# Output:
#Selected rows where 'Courses' is in the list:
#    Courses    Fee Duration  Discount
#0    Spark  22000   30days      1000
#1  PySpark  25000   50days      2300
#2   Hadoop  23000   40days      1000

Similarly, you can use the ~ operator to select rows where a column value is not in a list of values.


# Select rows where 'Courses' column is NOT in the list of excluded values
values = ['Spark', 'PySpark', 'Hadoop']
df2=df.loc[~df['Courses'].isin(values)]
print("Selected rows where 'Courses' is NOT in the list:\n", df2)

# Output:
# Selected rows where 'Courses' is NOT in the list:
#   Courses    Fee Duration  Discount
#3  Python  24000     None      1200
#4  Pandas  26000      NaN      2500

In the above example, the ~ operator negates the boolean condition created by isin(), effectively selecting rows where the 'Courses' column is not in the list of excluded values (values). Adjust the column name ('Courses') and the list of excluded values (values) according to your specific DataFrame and requirements.
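
A related pattern, assuming you want case-insensitive matching (this is an assumption, not part of the example above), is to normalize the column before calling isin():


# Case-insensitive membership check
values = ['spark', 'pyspark']
df2 = df[df['Courses'].str.lower().isin(values)]
print(df2)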

Using Multiple Column Conditions

To select rows based on multiple column conditions in Pandas, you can use logical operators like & (AND) and | (OR) to combine the conditions. The following selects rows based on multiple conditions on the Discount column.


# Select rows based on multiple conditions
print(df.loc[(df['Discount'] >= 1500) & (df['Discount'] <= 2300)])
print(df[(df['Discount'] >= 1500) & (df['Discount'] <= 2300)])
print(df.query("Discount >= 1500 & Discount <= 2300"))

# Output:
#    Courses    Fee Duration  Discount
#1  PySpark  25000   50days      2300

In Pandas, you can use multiple columns to select DataFrame rows by specifying conditions for each column.


# Using multiple column conditions 
df2=df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
print(df2)

# Output:
#    Courses    Fee Duration  Discount
#1  PySpark  25000   50days      2300
#3   Python  24000     None      1200
#4   Pandas  26000      NaN      2500

Here,

  • (df['Discount'] >= 1200) creates a boolean mask for rows where the ‘Discount’ column value is greater than or equal to 1200.
  • (df['Fee'] >= 23000) creates another boolean mask for rows where the ‘Fee’ column value is greater than or equal to 23000.
  • The conditions for both columns are combined using the & (AND) operator to ensure that both conditions are satisfied simultaneously.
  • The resulting boolean mask is used to select rows from the DataFrame df.
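
The | (OR) operator combines conditions the same way, selecting rows where at least one of the conditions is True:


# Using | (OR) to select rows where either condition is True
df2 = df.loc[(df['Discount'] >= 2300) | (df['Fee'] <= 22000)]
print(df2)

# Output:
#    Courses    Fee Duration  Discount
#0    Spark  22000   30days      1000
#1  PySpark  25000   50days      2300
#4   Pandas  26000      NaN      2500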

Select Rows Using DataFrame.apply()

DataFrame.apply() with a lambda function can be used to apply a function row-by-row (by passing axis=1) and return the rows that match specific conditions.

The following programs use apply() with a lambda function to select rows where the Courses column is either Spark or PySpark. The first filters each column with a boolean mask, while the second evaluates the condition once per row with axis=1. Adjust the conditions in the lambda function according to your specific requirements.


# Filter each column by a boolean mask (apply() is column-wise
# by default, so the lambda receives each column, not each row)
df2 = df.apply(lambda col: col[df['Courses'].isin(['Spark','PySpark'])])
print(df2)

# Selecting rows where 'Courses' is 'Spark' or 'PySpark' 
# Using apply() with a lambda function
df2 = df[df.apply(lambda row: row['Courses'] in ['Spark', 'PySpark'], axis=1)]
print(df2)

# Output:
#    Courses    Fee Duration  Discount
#0    Spark  22000   30days      1000
#1  PySpark  25000   50days      2300
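
Keep in mind that apply() with axis=1 evaluates the lambda once per row in Python, so on large DataFrames the vectorized isin() shown earlier is usually much faster:


# Equivalent vectorized version - generally faster than apply()
df2 = df[df['Courses'].isin(['Spark', 'PySpark'])]
print(df2)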

Working With None & NaN Data

The dropna() method is used to drop rows that have None or NaN values in any column. For instance, this will drop any rows from the DataFrame df where at least one column value is None or NaN, and the resulting DataFrame df2 will contain only rows with complete data.


# Drop rows that have None or NaN values
df2=df.dropna()
print(df2)

# Output:
#    Courses    Fee Duration  Discount
#0    Spark  22000   30days      1000
#1  PySpark  25000   50days      2300
#2   Hadoop  23000   40days      1000

If you want to drop columns where column values are None or NaN, you can use the DataFrame.dropna() method with axis='columns' parameter.


# Drop columns where column values are None or NaN
df2 = df.dropna(axis='columns')
print(df2)

# Output:
#   Courses    Fee  Discount
#0    Spark  22000      1000
#1  PySpark  25000      2300
#2   Hadoop  23000      1000
#3   Python  24000      1200
#4   Pandas  26000      2500

In the above example, this will drop any columns from the DataFrame df where at least one value is None or NaN, and the resulting DataFrame df2 will contain only columns with complete data.
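
If you only care about missing values in specific columns, you can pass the subset parameter to dropna() so that NaN values in other columns are ignored:


# Drop rows only when the 'Duration' column is None or NaN
df2 = df.dropna(subset=['Duration'])
print(df2)

# Output:
#    Courses    Fee Duration  Discount
#0    Spark  22000   30days      1000
#1  PySpark  25000   50days      2300
#2   Hadoop  23000   40days      1000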

Other Examples of Selecting Rows Based on Column Values


# Select rows where the value contains "Spark"
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values to lowercase
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select rows where values start with "P"
print(df[df['Courses'].str.startswith("P")])
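
str.contains() also accepts a case parameter, which avoids the explicit str.lower() conversion used above:


# Case-insensitive match without converting the values first
print(df[df['Courses'].str.contains("spark", case=False)])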

Frequently Asked Questions on Select Rows Based on Column Values

How can I select rows where a specific column has a certain value?

You can use boolean indexing to select rows where a specific column has a certain value. For instance, df['Column_Name'] == 3 creates a boolean mask, and then you use this mask to select the rows where the condition is True. Adjust the column name (‘Column_Name’) and the value (3) according to your specific case.

Can I select rows based on multiple conditions?

You can select rows based on multiple conditions using boolean indexing in Pandas. You can combine conditions using logical operators such as & (AND) and | (OR).
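For example, df[(df['Fee'] > 23000) & (df['Discount'] >= 1200)] keeps only the rows that satisfy both conditions.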

How can I select rows where a column value is in a list?

To select rows where a column value is in a list, you can use the isin() function in Pandas. For example, df['Column_Name'].isin([2, 4]) creates a boolean mask for rows where the ‘Column_Name’ values are either 2 or 4. Adjust the column name and the list of values according to your specific case.

Can I select rows based on partial-string matching?

You can select rows based on partial string matching in Pandas using the str.contains() method. This method allows you to check whether a string column contains a specific substring.

How do I select rows where a column value is not equal to a specific value?

To select rows where a column value is not equal to a specific value, you can use the != operator in conjunction with boolean indexing.
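For example, df[df['Column_Name'] != 3] returns all rows where the column value is anything other than 3.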

Conclusion

In this article, I have explained selecting rows based on column values in Pandas. This is a fundamental operation in data analysis and manipulation. Throughout this process, several methods are used to efficiently filter and extract the desired rows from a DataFrame.

Happy Learning !!
