Pandas Create Conditional Column in DataFrame

  • Post author:
  • Post category:Pandas
  • Post last modified:October 5, 2023

You can create a conditional column in pandas DataFrame by using np.where(), np.select(), DataFrame.map(), DataFrame.assign(), DataFrame.apply(), DataFrame.loc[]. Additionally, you can also use mask() method transform() and lambda functions to create single and multiple functions. In this article, I will explain several ways of how to create a conditional DataFrame column (new) with examples.

Adding a new column by conditionally checking values on existing columns is required when you would need to curate the DataFrame or derive a new column from the existing columns.

1. Quick Examples of Pandas Create Conditional DataFrame Column

If you are in a hurry, below are some quick examples.


# Below are some quick examples.

# Create conditional DataFrame column by np.where() function.
df['Discount'] = np.where(df['Courses']=='Spark', 1000, 2000)

# Another way to create column conditionally.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]

# Create conditional DataFrame column by map() and lambda.
df['Discount'] = df.Courses.map( lambda x: 1000 if x == 'Spark' else 2000)

# Create conditional DataFrame column by np.select() function.
conditions = [
    (df['Courses'] == 'Spark') & (df['Duration'] == '30days'),
    (df['Courses'] == 'Spark') & (df['Duration'] == '35days'),
    (df['Duration'] == '50days')]
choices = [1000, 1050,200]
df['Discount'] = np.select(conditions,choices, default=0)

# Using Dictionary to map new values.
Discount_dictionary ={'Spark' : 1500, 'PySpark' : 800, 'Spark' : 1200}
df['Discount'] = df['Courses'].map(Discount_dictionary)

# Pandas create conditional DataFrame column by dictionary
df['Discount'] = [Discount_dictionary.get(v, None) for v in df['Courses']]

# Using DataFrame.assign() method.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    else:
        return 2000
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))

# Conditions with multiple rand multiple columns.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    elif row["Fee"] == 25000:
        return 2000
    else:
        return 0
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))

# Using .loc[] property for single condition.
df.loc[(df['Courses']=="Spark"), 'Discount'] = 1000

# Using loc[] method for Multiple conditions.
df.loc[(df['Courses']=="Spark")&(df['Fee']==23000)|(df['Fee']==25000), 'Discount'] = 1000

# Using DataFrame.apply() method with lambda function.
df['Discount'] = df['Courses'].apply(lambda x: '1000' if x=='Spark' else 1000)

# Pandas create conditional column using mask() method.
# Replace values where the condition is True
df['Discount'] = df['Discount'].mask(df['Courses']=='Spark', other=1000)

# Using where()
df['Discount'] = df['Discount'].where(df['Courses']=='Spark', other=1000)

# Using transform() with a lambda function.
df['Discount'] = df['Courses'].transform(lambda x: 1000 if x == 'Spark' else 2000)

Let’s create a pandas DataFrame with a few rows and columns and execute these examples and validate results. Our DataFrame contains column names Courses, Fee and Duration.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)

Yields below output.


# Output:
   Courses         Fee   Duration
0    Spark        22000   30days
1  PySpark        25000   50days
2    Spark        23000   30days
3   Python        24000     None
4  PySpark        26000      NaN

2. Create Conditional DataFrame Column by np.where() function

To create a conditional DataFrame column in pandas use np.where() function. Although pandas also have the similar where() function, Numpy.where() is very different. The difference being, where() function of Numpy provides greater flexibility and it treats the given condition differently from Pandas.

pandas where() function only allows for updating the values that do not meet the given condition. However, the where function of Numpy allows for updating values that meet and do not meet the given condition.


import numpy as np
# Create conditional DataFrame column by np.where() function.
df['Discount'] = np.where(df['Courses']=='Spark', 1000, 2000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None      2000
4    Spark  26000      NaN      1000

Another way to create a column conditionally.


# Another way to create another column conditionally.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]
print(df)

Yields same output as above.

Similarly, you can also create by using Series.map() and lambda. The lambda functions are defined using the keyword lambda. They can have any number of arguments but only one expression. These are very helpful when we have to perform small tasks with less code.


# Create conditional DataFrame column by lambda.
df['Discount'] = df.Courses.map( lambda x: 1000 if x == 'Spark' else 2000)
print(df)

Yields same output as above.

3. Create Conditional DataFrame Column by numpy.select() function

You can create a conditional DataFrame column by checking multiple columns using numpy.select() function. The select() function is more capable than the previous methods. We can use it to give a set of conditions and a set of values. Thus, we are able to assign a specific value for each condition.

When no condition matches, it assigns the default value to the new column.


# Create conditional DataFrame column by np.select function.
conditions = [
    (df['Courses'] == 'Spark') & (df['Duration'] == '30days'),
    (df['Courses'] == 'Spark') & (df['Duration'] == '35days'),
    (df['Duration'] == '50days')]
choices = [1000, 1050,200]
df['Discount'] = np.select(conditions,choices, default=0)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days       200
2    Spark  23000   35days      1050
3   Python  24000     None         0
4    Spark  26000      NaN         0

4. Using DataFrame.map() Function

We can use DataFrame.map() function to achieve the same goal. It is a straight forward method where we use a dictionary to simply map values to the newly added column based on the key. Map values of Series according to input correspondence. It is used for substituting each value in a Series with another value.


# Using Dictionary to map new values.
Discount_dictionary ={'Spark' : 1500, 'PySpark' : 800, 'Spark' : 1200}
df['Discount'] = df['Courses'].map(Discount_dictionary)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days    1200
1  PySpark  25000   50days     800
2    Spark  23000   35days    1200
3   Python  24000     None     NaN
4    Spark  26000      NaN    1200

5. Using Dict to Create Conditional DataFrame Column

Another method to create pandas conditional DataFrame column is by creating a Dict with key-value pair. dict.get. The get() method returns the value of the item with the specified key. However, if the key is not found when you use dict[key] it assigns NaN.


# Pandas create conditional DataFrame column by dict.get
df['Discount'] = [Discount_dictionary.get(v, None) for v in df['Courses']]
print(df)

Yields same output as above.

6. Using Series.apply() Function

We can use Series.apply() function, when we have more than two values, in that case, we can use a dictionary to map new values onto the keys. It provides a lot of flexibility when we are having a larger number of categories for which we want to assign different values to the newly added column.


# Using DataFrame.apply() function.
def map_values(row, discount_dict):
    return discount_dict[row]

discount_dict = {'Spark': 1000, 'PySpark': 2000, 'Python': 0}
df['Discount'] = df['Courses'].apply(map_values, args = (discount_dict,))
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   30days      1000
3   Python  24000     None         0
4  PySpark  26000      NaN      2000

7. Using DataFrame.assign() Method

The DataFrame.assign() function is used to assign new columns to a DataFrame. Returns a new object with all original columns in addition to new ones. Note that all the above examples create a new column on the existing DataFrame, this example creates a new DataFrame with the new column.


# Using DataFrame.assign() method.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    else:
        return 2000
df2 = df.assign(Discount=df.apply(Courses_Discount, axis=1))
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None      2000
4    Spark  26000      NaN      1000

8. Using Multiple Columns by Using DataFrame.assign() Method

If you need to check multiple columns to create a new column use DataFrame.assign() function, you can see below example-


# Multiple discounts and multiple columns.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    elif row["Fee"] == 25000:
        return 2000
    else:
        return 0
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None         0
4    Spark  26000      NaN      1000

9. Using DataFrame.loc[] Method

The loc[] property is used to access a group of rows and columns by label(s) or a boolean array. The loc[] is primarily label based, but may also be used with a boolean array. You can apply a loc[] property for single condition-


# Using .loc() property for single condition.
df.loc[(df['Courses']=="Spark"), 'Discount'] = 1000
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days    1000.0
1  PySpark  25000   50days       NaN
2    Spark  23000   35days    1000.0
3   Python  24000     None       NaN
4    Spark  26000      NaN    1000.0

NOTE: Alternatively, to apply loc() property for multiple conditions and create a new column in pandas DataFrame. For example


# Using loc() method for Multiple conditions.
df.loc[(df['Courses']=="Spark")&(df['Fee']==23000)|(df['Fee']==25000), 'Discount'] = 1000
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days    1000.0
1  PySpark  25000   50days    1000.0
2    Spark  23000   35days    1000.0
3   Python  24000     None       NaN
4    Spark  26000      NaN    1000.0

10. Using DataFrame.apply() method with lambda Function

You can also create conditional DataFrame column by using DataFrame.apply() method with lambda function. The apply function along an axis of the DataFrame. The lambda functions are defined using the keyword lambda.


# Using DataFrame.apply method with lambda function.
df['Discount'] = df['Courses'].apply(lambda x: '1000' if x=='Spark' else 1000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration Discount
0    Spark  22000   30days     1000
1  PySpark  25000   50days     1000
2    Spark  23000   30days     1000
3   Python  24000     None     1000
4  PySpark  26000      NaN     1000

11. Pandas Create Conditional Column Using Mask() Methods

Let’s see by using mask() method. The mask() method is used to replace values where the condition is True.


# Pandas create conditional column using mask() method.
# Replace values where the condition is True
df['Discount'] = df['Discount'].mask(df['Courses']=='Spark', other=1000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration Discount
0    Spark  22000   30days     1000
1  PySpark  25000   50days     1000
2    Spark  23000   30days     1000
3   Python  24000     None     1000
4  PySpark  26000      NaN     1000

NOTE: You can replace values where the condition is false by Series.where() method. The where() method is used to check a DataFrame for one or more conditions and return the result accordingly.


# Replace values where the condition is False.
df['Discount'] = df['Discount'].where(df['Courses']=='Spark', other=1000)
print(df)

Yields same output as above.

12. Using transform() with lambda function

Finally, you can use the method transform() with a lambda function. The transform() function returns a self-produced DataFrame with transformed values after applying the function specified in its parameter.


# Using transform with a lambda function.
df['Discount'] = df['Courses'].transform(lambda x: 1000 if x == 'Spark' else 2000)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2000
2    Spark  23000   35days      1000
3   Python  24000     None      2000
4    Spark  26000      NaN      1000

13.  Complete Example For Pandas Create Conditional Column


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]

          }
df = pd.DataFrame(technologies)

# Create conditional DataFrame column by np.where function.
df['Discount'] = np.where(df['Courses']=='Spark', 1000, 2000)
print(df)

# Another way to create another column conditionally.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]
print(df)

# Create conditional DataFrame column by lambda.
df['Discount'] = df.Courses.map( lambda x: 1000 if x == 'Spark' else 2000)
print(df)

# Create conditional DataFrame column by np.select function.
conditions = [
    (df['Courses'] == 'Spark') & (df['Duration'] == '30days'),
    (df['Courses'] == 'Spark') & (df['Duration'] == '35days'),
    (df['Duration'] == '50days')]
choices = [1000, 1050,200]
df['Discount'] = np.select(conditions,choices, default=0)
print(df)

# Using Dictionary to map new values.
Discount_dictionary ={'Spark' : 1500, 'PySpark' : 800, 'Spark' : 1200}
df['Discount'] = df['Courses'].map(Discount_dictionary)
print(df)

# Pandas create conditional DataFrame column by dict.get
df['Discount'] = [Discount_dictionary.get(v, None) for v in df['Courses']]
print(df)

# Using DataFrame.assign() method.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    else:
        return 2000
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))
print(df)

# Multiple discounts and multiple columns.
def Courses_Discount(row):
    if row["Courses"] == "Spark":
        return 1000
    elif row["Fee"] == 25000:
        return 2000
    else:
        return 0
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))
print(df)

# Using .loc() property for single condition.
df.loc[(df['Courses']=="Spark"), 'Discount'] = 1000
print(df)

# Using loc() method for Multiple conditions.
df.loc[(df['Courses']=="Spark")&(df['Fee']==23000)|(df['Fee']==25000), 'Discount'] = 1000
print(df)

# Using DataFrame.apply method with lambda function.
df['Discount'] = df['Courses'].apply(lambda x: '1000' if x=='Spark' else 1000)
print(df)

# Pandas create conditional column using mask() method.
# Replace values where the condition is True
df['Discount'] = df['Discount'].mask(df['Courses']=='Spark', other=1000)
print(df)

# Replace values where the condition is False.
df['Discount'] = df['Discount'].where(df['Courses']=='Spark', other=1000)
print(df)

# Using transform with a lambda function.
df['Discount'] = df['Courses'].transform(lambda x: 1000 if x == 'Spark' else 2000)
print(df)

Conclusion

In this article, you have learned how Pandas create DataFrame conditional column by using np.where(), np.select(), DataFrame.apply(), DataFrame.assign(), DataFrame.map(), loc[], mask() method, transform() and lambda functions to create single and multiple functions.

References

Leave a Reply

You are currently viewing Pandas Create Conditional Column in DataFrame