A pandas DataFrame is a two-dimensional, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame consists of three principal components: data, rows, and columns. In this article, we’ll explain how to create a DataFrame from dictionaries, lists, and custom indexes, how to use the fillna() and dropna() methods, how to iterate over rows and columns, and finally how to use some common functions, with examples.
1. DataFrame Features
- Columns can be of different types.
- DataFrame size is mutable.
- A DataFrame has labeled axes (rows and columns).
- You can perform arithmetic operations on rows and columns.
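The features above can be seen in a few lines. This is a minimal sketch; the sample values and the Discount and NetFee column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark"],
    "Fee": [20000, 25000],
})

# Columns can hold different types (text and integers here).
print(df.dtypes)

# Size is mutable: a new column can be added after creation.
df["Discount"] = [1000, 2300]
print(df.shape)  # (2, 3)

# Arithmetic between columns is applied element by element.
df["NetFee"] = df["Fee"] - df["Discount"]
print(df["NetFee"].tolist())  # [19000, 22700]
```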
2. DataFrame Methods
The following are the most commonly used pandas DataFrame methods and attributes.
METHOD / ATTRIBUTE | DESCRIPTION |
---|---|
index | Attribute that returns the index (row labels) of the DataFrame. |
axes | Attribute that returns a list representing the axes of the DataFrame. |
insert() | Inserts a column into a DataFrame at a given position. |
add() | Returns the addition of DataFrame and other, element-wise (binary operator add) |
sub() | Returns subtraction of DataFrame and other, element-wise (binary operator sub) |
mul() | Returns multiplication of DataFrame and other, element-wise (binary operator mul) |
div() | Returns floating division of DataFrame and other, element-wise (binary operator truediv) |
dtypes | Attribute that returns a Series with the data type of each column. |
unique() | Series method that extracts the unique values of a column, for example df['Fee'].unique(). |
loc[] | Retrieves rows (and optionally columns) based on their labels. |
drop() | Removes rows or columns and returns a new DataFrame. |
pop() | Removes a single column from the DataFrame and returns it as a Series. |
columns | Attribute that holds the column labels; assign to it to change the column names. |
dropna() | Drops rows or columns that contain null values, with parameters that control which ones are removed. |
fillna() | Replaces NaN values with a value you specify. |
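As a rough illustration of how a few of these entries behave, the sketch below uses a small made-up DataFrame. Note that index, axes, dtypes, and columns are accessed without parentheses, since they are attributes rather than methods.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000, 25000]})

# Attributes: no parentheses.
print(df.index)
print(df.dtypes)
print(len(df.axes))  # 2 (row axis and column axis)

# insert() adds a column at position 1, modifying df in place.
df.insert(1, "Duration", ["30days", "40days"])

# pop() removes one column and returns it as a Series.
duration = df.pop("Duration")

# mul() returns a new DataFrame; df itself is unchanged.
doubled = df[["Fee"]].mul(2)

# drop() returns a new DataFrame without the given row label.
without_first = df.drop(index=0)
```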
3. Create a DataFrame Using a Dictionary of Ndarrays/Lists
To create a pandas DataFrame from a dictionary of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the default index is range(n), where n is the array length.
# Create DataFrame
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
})
df = pd.DataFrame(technologies)
print(df)
Yields below output
# Output:
Courses Fee Duration
0 Spark 20000 30day
1 PySpark 25000 40days
2 Hadoop 26000 35days
3 Python 22000 40days
4 pandas 24000 60days
5 Oracle 21000 50days
6 Java 22000 55days
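The length rule described in this section can be checked directly. The following is a small sketch with made-up data; the labels c1 to c3 and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
data = {"Courses": ["Spark", "PySpark", "Hadoop"], "Fee": [20000, 25000, 26000]}

# No index passed: pandas assigns the default 0..n-1 index.
df_default = pd.DataFrame(data)
print(list(df_default.index))  # [0, 1, 2]

# An explicit index must have the same length as the arrays.
df_labeled = pd.DataFrame(data, index=["c1", "c2", "c3"])
print(list(df_labeled.index))  # ['c1', 'c2', 'c3']

# Lists of different lengths are rejected with a ValueError.
mismatch_raised = False
try:
    pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000]})
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```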
4. Using List of Dictionaries
You can create a pandas DataFrame in several ways, for example by loading a dataset from existing storage such as an Excel file, a CSV file, or a SQL database. A DataFrame can also be created from lists, a dictionary, or a list of dictionaries. In a list of dictionaries, each dictionary becomes one row and its keys become the column labels.
# Using List of Dictionaries
import pandas as pd
# Each dictionary is one row; keys are column labels
data = [{'a': 5, 'b': 9, 'c': 8}, {'a': 6, 'b': 8, 'c': 4}]
df = pd.DataFrame(data)
print(df)
Yields below output
# Output:
   a  b  c
0  5  9  8
1  6  8  4
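When the list holds real dictionaries (key: value pairs), the dictionaries do not need identical keys. A minimal sketch, with made-up records, of how a missing key turns into NaN:

```python
import pandas as pd

# Hypothetical records; the second one has no 'b', the first has no 'c'.
records = [{"a": 5, "b": 9}, {"a": 6, "c": 4}]
df = pd.DataFrame(records)

# Columns are the union of all keys; missing entries become NaN.
print(df)
print(df.isna())
```

Because NaN is a floating-point value, the columns b and c are stored as floats here even though the inputs were integers.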
5. Using 2D List
You can create a DataFrame from a two-dimensional list, where each inner list is one row and the columns argument names the columns.
# Using 2D List
import pandas as pd
data = [['William', 28], ['mia', 25], ['juli', 21], ['messi',30]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
Yields below output
# Output:
Name Age
0 William 28
1 mia 25
2 juli 21
3 messi 30
6. Using Indexes
You can supply your own row labels through the index argument.
# Using Indexes
import pandas as pd
data = {'Name':['William', 'Mia', 'messi', 'juli'], 'marks':[98, 96, 94, 90]}
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
print(df)
Yields below output
# Output:
Name marks
rank1 William 98
rank2 Mia 96
rank3 messi 94
rank4 juli 90
7. Locate Named Indexes
Use the df.loc[] indexer to return rows by their index label. For instance, df.loc['rank3'] retrieves the row labeled rank3 as a Series whose name is rank3.
# Locate Named Indexes
import pandas as pd
data = {'Name':['William', 'Mia', 'messi', 'juli'], 'marks':[98, 96, 94, 90]}
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
print(df.loc['rank3'])
Yields below output
# Output:
Name messi
marks 94
Name: rank3, dtype: object
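Beyond a single label, loc[] also accepts a list of labels, a label slice, or a row and column pair. The sketch below reuses the same data; the variable names subset, sliced, and mark are chosen here for illustration.

```python
import pandas as pd

data = {'Name': ['William', 'Mia', 'messi', 'juli'], 'marks': [98, 96, 94, 90]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

# A list of labels returns a DataFrame with those rows.
subset = df.loc[['rank1', 'rank3']]

# A label slice includes both endpoints.
sliced = df.loc['rank2':'rank3']

# A row label plus a column label returns a single value.
mark = df.loc['rank3', 'marks']
print(mark)  # 94
```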
8. Using zip() Function
# Using zip() Function
import pandas as pd
name = ['William', 'Mia', 'messi', 'juli']
age = [30, 20, 28, 32]
list_of_tuples = list(zip(name, age))
df = pd.DataFrame(list_of_tuples, columns = ['Name', 'Age'])
print(df)
Yields below output
# Output:
Name Age
0 William 30
1 Mia 20
2 messi 28
3 juli 32
9. fillna() Method
Use the fillna() method to fill null values in a dataset. It replaces NA/NaN values with a value you specify.
# fillna() Method
import pandas as pd
import numpy as np
dataset = {
"Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
"Age" : [33, 32, np.nan, 30, np.nan],
"Height": [5.6, 5.8, 6.2, np.nan, np.nan],
"Job": [True, np.nan, np.nan, np.nan, np.nan]
}
data = pd.DataFrame(dataset)
print(data.fillna("None"))
Yields below output
# Output:
Name Age Height Job
0 Messi 33.0 5.6 True
1 Ronaldo 32.0 5.8 None
2 Alisson None 6.2 None
3 Mohamed 30.0 None None
4 None None None None
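Filling every column with the string "None" turns numeric columns into object columns. One alternative, sketched here with made-up data, is to pass a dictionary so each column gets a replacement of a suitable type; the choice of 0 and the column mean is only an example.

```python
import pandas as pd
import numpy as np

# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({
    "Name": ["Messi", "Ronaldo", np.nan],
    "Age": [33, np.nan, np.nan],
    "Height": [5.6, 5.8, np.nan],
})

# A dict maps each column to its own fill value.
filled = data.fillna({
    "Name": "Unknown",
    "Age": 0,
    "Height": data["Height"].mean(),  # mean of the non-null heights
})
print(filled)
```

With this approach Age and Height remain floating-point columns, so later arithmetic on them still works.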
10. dropna() Method
To drop null values from a DataFrame, use the dropna() method. It drops rows or columns that contain null values, and its parameters control which ones are removed.
# dropna() Method
import pandas as pd
import numpy as np
dataset = {
"Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
"Age" : [33, 32, np.nan, 30, np.nan],
"Height": [5.9, 5.8, 6.2, np.nan, np.nan],
"Designation": ['football player', np.nan, 'fp', np.nan,'true']
}
data = pd.DataFrame(dataset)
print(data)
Yields below output.
# Output:
Name Age Height Designation
0 Messi 33.0 5.9 football player
1 Ronaldo 32.0 5.8 NaN
2 Alisson NaN 6.2 fp
3 Mohamed 30.0 NaN NaN
4 NaN NaN NaN true
Now we drop every row that contains at least one null value.
import pandas as pd
import numpy as np
dataset = {
"Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
"Age" : [33, 32, np.nan, 30, np.nan],
"Height": [5.9, 5.8, 6.2, np.nan, np.nan],
"Designation": ['football player', np.nan, np.nan, np.nan,'true']
}
data = pd.DataFrame(dataset)
print(data.dropna())
Yields below output.
# Output:
Name Age Height Designation
0 Messi 33.0 5.9 football player
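The "different ways" of dropping come from the how, subset, and thresh parameters of dropna(). A short sketch on made-up data, showing which row labels survive in each case:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({
    "Name": ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age": [33, 32, np.nan, np.nan, np.nan],
    "Height": [5.9, np.nan, 6.2, np.nan, np.nan],
})

# Default: drop rows with at least one null value.
complete = data.dropna()

# how="all": drop only rows where every value is null.
not_empty = data.dropna(how="all")

# subset: only consider nulls in the listed columns.
has_age = data.dropna(subset=["Age"])

# thresh=2: keep rows with at least two non-null values.
two_or_more = data.dropna(thresh=2)

print(complete.index.tolist())     # [0]
print(not_empty.index.tolist())    # [0, 1, 2, 3]
print(has_age.index.tolist())      # [0, 1]
print(two_or_more.index.tolist())  # [0, 1, 2]
```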
11. Pandas DataFrame Iterating over rows and columns
Sometimes you need to process all the data values of a DataFrame. In such a case, writing separate statements to access individual values is cumbersome. A pandas DataFrame supports iterating over rows and columns; let’s see this with some examples.
To iterate over rows : df.iterrows()
To iterate over columns : df.items() (named iteritems() in older versions; iteritems() was removed in pandas 2.0)
# Pandas DataFrame Iterating over rows and columns
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 20000 30day
1 PySpark 25000 40days
2 Hadoop 26000 35days
3 Python 22000 40days
4 pandas 24000 60days
5 Oracle 21000 50days
6 Java 22000 55days
11.1 Iterating over rows
To iterate over rows, you can use iterrows() or itertuples(); items() (named iteritems() before pandas 2.0) iterates over columns instead. We can apply iterrows() to get each row as an (index, Series) pair.
# Iterating over rows
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
})
df = pd.DataFrame(technologies)
# x is the index label, y is the row as a Series
for x, y in df.iterrows():
    print(x, y)
    print()
Yields below output.
# Output:
0 Courses Spark
Fee 20000
Duration 30day
Name: 0, dtype: object
1 Courses PySpark
Fee 25000
Duration 40days
Name: 1, dtype: object
2 Courses Hadoop
Fee 26000
Duration 35days
Name: 2, dtype: object
3 Courses Python
Fee 22000
Duration 40days
Name: 3, dtype: object
4 Courses pandas
Fee 24000
Duration 60days
Name: 4, dtype: object
5 Courses Oracle
Fee 21000
Duration 50days
Name: 5, dtype: object
6 Courses Java
Fee 22000
Duration 55days
Name: 6, dtype: object
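itertuples() is the other row iterator mentioned above. A minimal sketch on a shortened version of the data, assuming the column names are valid Python identifiers so they can be used as attribute names:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [20000, 25000, 26000],
})

# Each row is a namedtuple; the first field, Index, is the row label.
for row in df.itertuples():
    print(row.Index, row.Courses, row.Fee)

# index=False omits the row label; name=None yields plain tuples.
pairs = list(df.itertuples(index=False, name=None))
print(pairs)  # [('Spark', 20000), ('PySpark', 25000), ('Hadoop', 26000)]
```

Unlike iterrows(), this does not build a Series for every row, which usually makes it the lighter option for simple loops.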
11.2 Iterating over columns
To iterate over columns, we create a list of the DataFrame's column labels and iterate through that list to pull out each column. The example below retrieves the fifth element (index label 4) of each column.
# Iterating over columns
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
})
df = pd.DataFrame(technologies)
columns = list(df)
for i in columns:
    print(df[i][4])
Yields below output.
# Output:
pandas
24000
60days
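Columns can also be iterated without building a list of labels first. In pandas 2.0 and later, DataFrame.items() yields one (label, Series) pair per column. A small sketch on a shortened version of the data; the summary dictionary is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [20000, 25000, 26000],
})

# items() yields (column label, column as Series) pairs.
summary = {}
for label, column in df.items():
    # tolist() converts to plain Python values; [-1] takes the last one.
    summary[label] = column.tolist()[-1]

print(summary)  # {'Courses': 'Hadoop', 'Fee': 26000}
```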