In pandas, you can select rows based on column values using boolean indexing, or using methods like DataFrame.loc[], DataFrame.query(), or the DataFrame.apply() method with a lambda function.
In this article, I will explain how to select rows based on single or multiple column values (values from a list) and also how to select rows that have None or NaN values, with several examples.
Key Points –
- Pandas allows the selection of rows based on column values using boolean arrays.
- Boolean expressions can involve logical operators like & (and), | (or), and ~ (not) for complex conditions.
- Methods like loc[], apply(), query(), and isin() offer versatile ways to select rows based on column values.
- Optimizing boolean expressions for selecting rows can significantly improve performance, especially when dealing with large datasets, by leveraging Pandas' internal optimizations.
- Avoid using chained indexing (df[][]), as it can lead to unpredictable behavior; replace it with more explicit methods such as loc[].
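The last point can be illustrated with a short sketch (a hypothetical DataFrame; the exact warning behavior depends on your pandas version):

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [22000, 25000]})

# Chained indexing: two separate [] lookups. Reading may work, but
# assigning through it can modify a temporary copy instead of df
# (pandas raises SettingWithCopyWarning in this situation).
# df[df["Fee"] > 23000]["Fee"] = 0      # unreliable -- avoid

# Explicit .loc: row mask and column label in a single call
df.loc[df["Fee"] > 23000, "Fee"] = 0
print(df)
```

With .loc, the assignment always targets the original df, so the PySpark row's Fee is reliably set to 0.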
Create DataFrame
Now, let's create a Pandas DataFrame using data from a Python dictionary, where the columns are Courses, Fee, Duration, and Discount.
# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '40days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
print("Original DataFrame:\n", df)
# Output:
# Original DataFrame:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300
# 2   Hadoop  23000   40days      1000
# 3   Python  24000     None      1200
# 4   Pandas  26000      NaN      2500
Yields the output shown above.
Select Rows Based on Column Values
You can use boolean indexing to select rows based on column values in Pandas DataFrame. Selecting rows based on column values in a Pandas dataframe means extracting specific rows from the dataframe that satisfy certain conditions defined on one or more columns.
The following program creates a DataFrame named df
and then selects rows where the Courses
column has the value Spark
. The resulting DataFrame, df2
, contains only the rows related to the Spark
course.
# Select rows based on column values
df2 = df[df["Courses"] == 'Spark']
print("Selected rows where 'Courses' is 'Spark':\n", df2)
In the above example, this will print out the rows where the value in the Courses column is equal to Spark.
You can also write the above statement with a variable. The following selects rows from the DataFrame df where the value in the Courses column is equal to the value stored in the variable value, which is Spark in this case. The resulting DataFrame is assigned to df2. This is a concise and effective way to select rows based on a specific column value.
value = "Spark"
df2 = df[df["Courses"] == value]
print(df2)
# Output:
# Courses Fee Duration Discount
#0 Spark 22000 30days 1000
Use the != operator to select rows where the column value does not equal a certain value. For instance, df[df["Courses"] != 'Spark'].
df2 = df[df["Courses"] != 'Spark']
print(df2)
# Output:
# Courses Fee Duration Discount
#1 PySpark 25000 50days 2300
#2 Hadoop 23000 40days 1000
#3 Python 24000 None 1200
#4 Pandas 26000 NaN 2500
Using DataFrame.query()
Alternatively, you can use the DataFrame.query() function. It selects rows from the DataFrame where the value in the Courses column is equal to Spark.
The DataFrame.query() method allows you to write conditions as strings, making the code more concise. In this case, the condition is specified as the string "Courses == 'Spark'". It selects rows where the Courses column is equal to 'Spark'. Adjust the column name and the value in the query string based on your requirements.
# Select rows based on column values
# Using DataFrame.query()
df2 = df.query("Courses == 'Spark'")
print("Selected rows where 'Courses' is 'Spark':\n", df2)
# Output:
# Selected rows where 'Courses' is 'Spark':
# Courses Fee Duration Discount
#0 Spark 22000 30days 1000
Using DataFrame.loc[] to Select Based on Column Values
Select rows from the DataFrame df where the value in the "Courses" column is equal to 'Spark' using DataFrame.loc[].
# Select rows based on column values using DataFrame.loc
df2 = df.loc[df['Courses'] == 'Spark']
print("\nSelected rows where 'Courses' is 'Spark':\n", df2)
In the above example, df['Courses'] == 'Spark' creates a boolean mask, and df.loc[] is then used to select the rows where this condition is True. Adjust the column name and the value in the condition according to your specific requirements.
Select Rows Based on List of Column Values
You can use the isin() method to select rows where the Courses column is in a list of values. For instance, df['Courses'].isin(values) creates a boolean mask where rows are selected if the 'Courses' column value is in the list values. Adjust the column name ('Courses') and the list of values (values) according to your specific DataFrame and requirements.
# Select rows where 'Courses' column is in the list of values
# Using DataFrame.isin()
values = ['Spark', 'PySpark', 'Hadoop']
df2 = df[df['Courses'].isin(values)]
print("Selected rows where 'Courses' is in the list:\n", df2)
# Using df.loc
values = ['Spark', 'PySpark', 'Hadoop']
df2 = df.loc[df['Courses'].isin(values)]
print("Selected rows where 'Courses' is in the list:\n", df2)
# Output:
#Selected rows where 'Courses' is in the list:
# Courses Fee Duration Discount
#0 Spark 22000 30days 1000
#1 PySpark 25000 50days 2300
#2 Hadoop 23000 40days 1000
Similarly, you can use the ~ operator to select rows where a column value is not in a list of values.
# Select rows where 'Courses' column is NOT in the list of excluded values
values = ['Spark', 'PySpark', 'Hadoop']
df2 = df.loc[~df['Courses'].isin(values)]
print("Selected rows where 'Courses' is NOT in the list:\n", df2)
# Output:
# Selected rows where 'Courses' is NOT in the list:
# Courses Fee Duration Discount
#3 Python 24000 None 1200
#4 Pandas 26000 NaN 2500
In the above example, the ~ operator negates the boolean condition created by isin(), effectively selecting rows where the 'Courses' column is not in the list of excluded values (values). Adjust the column name ('Courses') and the list of excluded values (values) according to your specific DataFrame and requirements.
Using Multiple Column Conditions
To select rows based on multiple column conditions in Pandas, you can use logical operators like & (AND) and | (OR) to combine the conditions. The following examples select rows based on multiple conditions on the Discount column.
# Select rows based on multiple conditions
print(df.loc[(df['Discount'] >= 1500) & (df['Discount'] <= 2300)])
print(df[(df['Discount'] >= 1500) & (df['Discount'] <= 2300)])
print(df.query("Discount >= 1500 & Discount <= 2300"))
# Output:
# Courses Fee Duration Discount
#1 PySpark 25000 50days 2300
In Pandas, you can use multiple columns to select DataFrame rows by specifying conditions for each column.
# Using multiple column conditions
df2 = df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]
print(df2)
# Output:
# Courses Fee Duration Discount
#1 PySpark 25000 50days 2300
#3 Python 24000 None 1200
#4 Pandas 26000 NaN 2500
Here,
- (df['Discount'] >= 1200) creates a boolean mask for rows where the 'Discount' column value is greater than or equal to 1200.
- (df['Fee'] >= 23000) creates another boolean mask for rows where the 'Fee' column value is greater than or equal to 23000.
- The conditions for both columns are combined using the & (AND) operator to ensure that both conditions are satisfied simultaneously.
- The resulting boolean mask is used to select rows from the DataFrame df.
Select Rows Using DataFrame.apply()
DataFrame.apply() with a lambda function can be used to evaluate a condition row by row, and the resulting boolean mask selects the rows that match.
This program uses the loc[] accessor along with a lambda function (applied with axis=1, so the lambda receives each row) to select rows where the 'Courses' column is either 'Spark' or 'PySpark'. Adjust the conditions in the lambda function according to your specific requirements.
# By using lambda function
df2 = df.loc[df.apply(lambda row: row['Courses'] in ['Spark', 'PySpark'], axis=1)]
print(df2)
# Selecting rows where 'Courses' is 'Spark' or 'PySpark'
# Using apply() with a lambda function
df2 = df[df.apply(lambda row: row['Courses'] in ['Spark', 'PySpark'], axis=1)]
print(df2)
# Output:
# Courses Fee Duration Discount
#0 Spark 22000 30days 1000
#1 PySpark 25000 50days 2300
Using With None & NaN Data
The dropna() method is used to drop rows with None or NaN values in column values. For instance, this will drop any rows from the DataFrame df where at least one column value is None or NaN, and the resulting DataFrame df2 will contain only rows with complete data.
# Drop rows that have None or NaN values
df2 = df.dropna()
print(df2)
# Output:
# Courses Fee Duration Discount
#0 Spark 22000 30days 1000
#1 PySpark 25000 50days 2300
#2 Hadoop 23000 40days 1000
If you want to drop columns where column values are None or NaN, you can use the DataFrame.dropna() method with the axis='columns' parameter.
# Drop columns where column values are None or NaN
df2 = df.dropna(axis='columns')
print(df2)
# Output:
# Courses Fee Discount
#0 Spark 22000 1000
#1 PySpark 25000 2300
#2 Hadoop 23000 1000
#3 Python 24000 1200
#4 Pandas 26000 2500
In the above example, this will drop any columns from the DataFrame df where at least one value is None or NaN, and the resulting DataFrame df2 will contain only columns with complete data.
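As a related sketch (an addition, not from the examples above): dropna() also accepts a subset parameter when you only want missing values in particular columns to trigger the drop.

```python
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '40days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)

# Drop rows only when 'Duration' is missing; other columns are ignored
df2 = df.dropna(subset=['Duration'])
print(df2)
```

Here only the Python and Pandas rows are dropped, since their Duration values are None and NaN.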
Other Examples of Select Rows Based On Column Values
# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])
# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])
# Select startswith
print(df[df['Courses'].str.startswith("P")])
Frequently Asked Questions on Select Rows Based on Column Values
How do I select rows where a column has a specific value?
You can use boolean indexing to select rows where a specific column has a certain value. For instance, df['Column_Name'] == 3 creates a boolean mask, and then you use this mask to select the rows where the condition is True. Adjust the column name ('Column_Name') and the value (3) according to your specific case.
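A minimal sketch of that pattern, using a hypothetical DataFrame with a placeholder column named Column_Name:

```python
import pandas as pd

# Hypothetical data; 'Column_Name' stands in for any column of yours
df = pd.DataFrame({'Column_Name': [1, 3, 3, 5]})

mask = df['Column_Name'] == 3   # boolean Series: True where value is 3
df2 = df[mask]                  # keeps only the matching rows
print(df2)
```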
How do I select rows based on multiple conditions?
You can select rows based on multiple conditions using boolean indexing in Pandas. You can combine conditions using logical operators such as & (AND) and | (OR).
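A short sketch of combining two conditions (hypothetical data; note that each condition must be wrapped in parentheses because & binds tighter than the comparisons):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'Fee': [22000, 25000, 23000],
                   'Discount': [1000, 2300, 1000]})

# & keeps rows satisfying both conditions; | would keep rows satisfying either
df2 = df[(df['Fee'] >= 23000) & (df['Discount'] >= 2000)]
print(df2)
```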
How do I select rows where a column value is in a list of values?
To select rows where a column value is in a list, you can use the isin() function in Pandas. For example, df['Column_Name'].isin([2, 4]) creates a boolean mask for rows where the 'Column_Name' values are either 2 or 4. Adjust the column name and the list of values according to your specific case.
How do I select rows based on partial string matching?
You can select rows based on partial string matching in Pandas using the str.contains() method. This method allows you to check whether a string column contains a specific substring.
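A brief sketch with hypothetical data; case=False and na=False are optional parameters of str.contains() for case-insensitive matching and for treating missing values as non-matches:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'PySpark', 'Hadoop', None]})

# na=False makes the missing value a non-match instead of propagating NaN
df2 = df[df['Courses'].str.contains('spark', case=False, na=False)]
print(df2)
```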
How do I select rows where a column value is not equal to a specific value?
To select rows where a column value is not equal to a specific value, you can use the != operator in conjunction with boolean indexing.
Conclusion
In this article, I have explained selecting rows based on column values in Pandas. This is a fundamental operation in data analysis and manipulation. Throughout this process, several methods are used to efficiently filter and extract the desired rows from a DataFrame.
Happy Learning !!
Related Articles
- Pandas – Get Column Index For Column Name
- Pandas Select Rows by Index (Position/Label)
- Pandas Select Columns by Name or Index
- Pandas Drop Rows Based on Column Value
- How to Drop Column From Pandas DataFrame
- Pandas Drop Last Column From DataFrame
- Pandas – Get All Column Names as List from DataFrame
- Pandas – Select All Columns Except One Column
- Select Rows From List of Values in Pandas DataFrame
- Different Ways to Rename Pandas DataFrame Column
- Pandas Find Row Values for Column Maximal
- Pandas Create New DataFrame By Selecting Specific Columns