• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:23 mins read
You are currently viewing Pandas Create Conditional Column in DataFrame

You can create a conditional column in pandas DataFrame by using np.where(), np.select(), DataFrame.map(), DataFrame.assign(), DataFrame.apply(), DataFrame.loc[]. Additionally, you can also use mask() method transform() and lambda functions to create single and multiple functions. In this article, I will explain several ways of how to create a conditional DataFrame column (new) with examples.

Advertisements

Adding a new column by conditionally checking values on existing columns is required when you would need to curate the DataFrame or derive a new column from the existing columns.

Key Points –

  • Use boolean conditions to create a new column based on specific criteria within a Pandas DataFrame.
  • Utilize the DataFrame syntax, such as df['new_column'], to assign values to the newly created column.
  • Use NumPy’s np.where function as a concise way to apply conditional logic and assign values to the new column.
  • Use chained assignments with .loc to set values for a new column based on specific conditions, ensuring modifications are made to the original DataFrame.
  • Leverage the apply function along with a lambda function to apply conditional logic to each row of the DataFrame.
  • Handle multiple conditions by combining them with logical operators like & (AND), | (OR), and ~ (NOT) to create complex conditional expressions for the new column.

1. Quick Examples of Pandas Create Conditional DataFrame Column

If you are in a hurry, below are some quick examples of creating conditional pandas DataFrame columns.


# Below are some quick examples.

# Create conditional DataFrame column by np.where() function.
df['Discount'] = np.where(df['Courses']=='Spark', 1000, 2000)

# Another way to create column conditionally.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]

# Create conditional DataFrame column by map() and lambda.
df['Discount'] = df.Courses.map( lambda x: 1000 if x == 'Spark' else 2000)

# Create conditional DataFrame column by np.select() function.
conditions = [
    (df['Courses'] == 'Spark') & (df['Duration'] == '30days'),
    (df['Courses'] == 'Spark') & (df['Duration'] == '35days'),
    (df['Duration'] == '50days')]
choices = [1000, 1050,200]
df['Discount'] = np.select(conditions,choices, default=0)

# Using Dictionary to map new values.
Discount_dictionary ={'Spark' : 1500, 'PySpark' : 800, 'Spark' : 1200}
df['Discount'] = df['Courses'].map(Discount_dictionary)

# Pandas create conditional DataFrame column by dictionary
df['Discount'] = [Discount_dictionary.get(v, None) for v in df['Courses']]

# Using DataFrame.assign() method.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    else:
        return 2000
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))

# Conditions with multiple rand multiple columns.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    elif row["Fee"] == 25000:
        return 2000
    else:
        return 0
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))

# Using .loc[] property for single condition.
df.loc[(df['Courses']=="Spark"), 'Discount'] = 1000

# Using loc[] method for Multiple conditions.
df.loc[(df['Courses']=="Spark")&(df['Fee']==23000)|(df['Fee']==25000), 'Discount'] = 1000

# Using DataFrame.apply() method with lambda function.
df['Discount'] = df['Courses'].apply(lambda x: '1000' if x=='Spark' else 1000)

# Pandas create conditional column using mask() method.
# Replace values where the condition is True
df['Discount'] = df['Discount'].mask(df['Courses']=='Spark', other=1000)

# Using where()
df['Discount'] = df['Discount'].where(df['Courses']=='Spark', other=1000)

# Using transform() with a lambda function.
df['Discount'] = df['Courses'].transform(lambda x: 1000 if x == 'Spark' else 2000)

Let’s create a pandas DataFrame with a few rows and columns and execute these examples and validate results. Our DataFrame contains column names CoursesFee and Duration.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)

Yields below output.


# Output:
   Courses         Fee   Duration
0    Spark        22000   30days
1  PySpark        25000   50days
2    Spark        23000   30days
3   Python        24000     None
4  PySpark        26000      NaN

2. Create Conditional DataFrame Column by np.where() function

To create a conditional DataFrame column in pandas use np.where() function. Although pandas also have the similar where() function, Numpy.where() is very different. The difference being, where() function of Numpy provides greater flexibility and it treats the given condition differently from Pandas.

pandas where() function only allows for updating the values that do not meet the given condition. However, the where function of Numpy allows for updating values that meet and do not meet the given condition.


import numpy as np
# Create conditional DataFrame column by np.where() function.
df['Discount'] = np.where(df['Courses']=='Spark', 1000, 2000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None      2000
4    Spark  26000      NaN      1000

Another way to create a column conditionally.


# Another way to create another column conditionally.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]
print(df)

Yields the same output as above.

Similarly, you can also create by using Series.map() and lambda. The lambda functions are defined using the keyword lambda. They can have any number of arguments but only one expression. These are very helpful when we have to perform small tasks with less code.


# Create conditional DataFrame column by lambda.
df['Discount'] = df.Courses.map( lambda x: 1000 if x == 'Spark' else 2000)
print(df)

Yields the same output as above.

3. Create Conditional DataFrame Column by numpy.select() function

You can create a conditional DataFrame column by checking multiple columns using numpy.select() function. The select() function is more capable than the previous methods. We can use it to give a set of conditions and a set of values. Thus, we are able to assign a specific value for each condition.

When no condition matches, it assigns the default value to the new column.


# Create conditional DataFrame column by np.select function.
conditions = [
    (df['Courses'] == 'Spark') & (df['Duration'] == '30days'),
    (df['Courses'] == 'Spark') & (df['Duration'] == '35days'),
    (df['Duration'] == '50days')]
choices = [1000, 1050,200]
df['Discount'] = np.select(conditions,choices, default=0)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days       200
2    Spark  23000   35days      1050
3   Python  24000     None         0
4    Spark  26000      NaN         0

4. Using DataFrame.map() Function

We can use DataFrame.map() function to achieve the same goal. It is a straight forward method where we use a dictionary to simply map values to the newly added column based on the key. Map values of Series according to input correspondence. It is used for substituting each value in a Series with another value.


# Using Dictionary to map new values.
Discount_dictionary ={'Spark' : 1500, 'PySpark' : 800, 'Spark' : 1200}
df['Discount'] = df['Courses'].map(Discount_dictionary)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days    1200
1  PySpark  25000   50days     800
2    Spark  23000   35days    1200
3   Python  24000     None     NaN
4    Spark  26000      NaN    1200

5. Using Dict to Create Conditional DataFrame Column

Another method to create pandas conditional DataFrame column is by creating a Dict with key-value pair. dict.get. The get() method returns the value of the item with the specified key. However, if the key is not found when you use dict[key] it assigns NaN.


# Pandas create conditional DataFrame column by dict.get
df['Discount'] = [Discount_dictionary.get(v, None) for v in df['Courses']]
print(df)

Yields the same output as above.

6. Using Series.apply() Function

We can use Series.apply() function, when we have more than two values, in that case, we can use a dictionary to map new values onto the keys. It provides a lot of flexibility when we are having a larger number of categories for which we want to assign different values to the newly added column.


# Using DataFrame.apply() function.
def map_values(row, discount_dict):
    return discount_dict[row]

discount_dict = {'Spark': 1000, 'PySpark': 2000, 'Python': 0}
df['Discount'] = df['Courses'].apply(map_values, args = (discount_dict,))
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   30days      1000
3   Python  24000     None         0
4  PySpark  26000      NaN      2000

7. Using DataFrame.assign() Method

The DataFrame.assign() function is used to assign new columns to a DataFrame. Returns a new object with all original columns in addition to new ones. Note that all the above examples create a new column on the existing DataFrame, this example creates a new DataFrame with the new column.


# Using DataFrame.assign() method.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    else:
        return 2000
df2 = df.assign(Discount=df.apply(Courses_Discount, axis=1))
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None      2000
4    Spark  26000      NaN      1000

8. Using Multiple Columns by Using DataFrame.assign() Method

If you need to check multiple columns to create a new column use DataFrame.assign() function, you can see below example-


# Multiple discounts and multiple columns.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    elif row["Fee"] == 25000:
        return 2000
    else:
        return 0
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None         0
4    Spark  26000      NaN      1000

9. Using DataFrame.loc[] Method

The loc[] property is used to access a group of rows and columns by label(s) or a boolean array. The loc[] is primarily label-based, but may also be used with a boolean array. You can apply a loc[] property for a single condition-


# Using .loc() property for single condition.
df.loc[(df['Courses']=="Spark"), 'Discount'] = 1000
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days    1000.0
1  PySpark  25000   50days       NaN
2    Spark  23000   35days    1000.0
3   Python  24000     None       NaN
4    Spark  26000      NaN    1000.0

NOTE: Alternatively, to apply loc() property for multiple conditions and create a new column in pandas DataFrame. For example


# Using loc() method for Multiple conditions.
df.loc[(df['Courses']=="Spark")&(df['Fee']==23000)|(df['Fee']==25000), 'Discount'] = 1000
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days    1000.0
1  PySpark  25000   50days    1000.0
2    Spark  23000   35days    1000.0
3   Python  24000     None       NaN
4    Spark  26000      NaN    1000.0

10. Using DataFrame.apply() method with lambda Function

You can also create conditional DataFrame column by using DataFrame.apply() method with lambda function. The apply function along an axis of the DataFrame. The lambda functions are defined using the keyword lambda.


# Using DataFrame.apply method with lambda function.
df['Discount'] = df['Courses'].apply(lambda x: '1000' if x=='Spark' else 1000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration Discount
0    Spark  22000   30days     1000
1  PySpark  25000   50days     1000
2    Spark  23000   30days     1000
3   Python  24000     None     1000
4  PySpark  26000      NaN     1000

11. Pandas Create Conditional Column Using Mask() Methods

Let’s see by using mask() method. The mask() method is used to replace values where the condition is True.


# Pandas create conditional column using mask() method.
# Replace values where the condition is True
df['Discount'] = df['Discount'].mask(df['Courses']=='Spark', other=1000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration Discount
0    Spark  22000   30days     1000
1  PySpark  25000   50days     1000
2    Spark  23000   30days     1000
3   Python  24000     None     1000
4  PySpark  26000      NaN     1000

NOTE: You can replace values where the condition is false by Series.where() method. The where() method is used to check a DataFrame for one or more conditions and return the result accordingly.


# Replace values where the condition is False.
df['Discount'] = df['Discount'].where(df['Courses']=='Spark', other=1000)
print(df)

Yields the same output as above.

12. Using transform() with lambda function

Finally, you can use the method transform() with a lambda function. The transform() function returns a self-produced DataFrame with transformed values after applying the function specified in its parameter.


# Using transform with a lambda function.
df['Discount'] = df['Courses'].transform(lambda x: 1000 if x == 'Spark' else 2000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None      2000
4    Spark  26000      NaN      1000

Frequently Asked Questions on Pandas Create Conditional Column

How do I create a conditional column in a Pandas DataFrame?

To create a conditional column in Pandas, you can use boolean conditions, apply functions, or np.where to define the conditions and assign values to the new column.

Can I use multiple conditions for creating a conditional column?

You can use logical operators like & (AND), | (OR), and ~ (NOT) to combine multiple conditions and create complex expressions for the conditional column.

What is boolean indexing in the context of creating conditional columns?

Boolean indexing involves using boolean conditions to filter rows in a DataFrame. By applying boolean indexing, you can selectively assign values to a new column based on specific conditions.

Is it possible to modify the original DataFrame when creating a conditional column?

Using chained assignments with .loc allows you to modify the original DataFrame when creating a conditional column. This ensures that changes are made directly to the DataFrame rather than a copy.

Are there alternative methods to np.where for creating conditional columns?

Besides np.where, you can use other methods such as boolean indexing, apply functions with lambda expressions, or chained assignments with .loc to achieve the same goal of creating conditional columns. Choose the method that best fits your specific use case.

Conclusion

In this article, you have learned how Pandas create DataFrame conditional column by using np.where(), np.select(), DataFrame.apply(), DataFrame.assign(), DataFrame.map(), loc[], mask() method, transform() and lambda functions to create single and multiple functions.

References

Leave a Reply