Pandas Read TSV with Examples

How to read TSV file in pandas? TSV stands for Tab Separated File use pandas which is a text file where each field is separated by tab (\t). In Pandas, you can read the TSV file into DataFrame by using the read_table() function.

In this article, I will explain the read_table() function and using its syntax, parameters, and usage how to read a TSV (Tab-Separated Values) file in Pandas involves understanding various options like handling headers, skipping rows or columns, and setting columns as indices.

Key Points –

Pandas provides the read_csv() function which can be utilized to read TSV files by specifying the sep='\t' parameter, allowing for efficient data loading and manipulation.
When reading TSV files, it’s important to consider whether the file contains a header row. Pandas can infer the header row automatically (header='infer') or you can specify header=None if the file doesn’t have a header.
Pandas offers flexibility in handling TSV files, allowing you to skip specific rows or columns using parameters like skiprows and usecols, respectively.

Syntax of read_table()

Following is the syntax of the read_table() function


# Syntax of read_table()
pandas.read_table(filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default, index_col=None, usecols=None, squeeze=None, prefix=NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)

Pandas Read TSV into DataFrame

To read a TSV file with tab (\t) delimiter use pandas read_table() function. This also supports optionally iterating or breaking the file into chunks. In the above syntax, you can see that there are several optional parameters available for reading TSV files, each serving a specific purpose. These parameters become particularly valuable when handling large files, as they can significantly enhance the efficiency of loading TSV data.

I’ll utilize the provided data to demonstrate reading a TSV file. You can access the data file on GitHub.


# Import pandas
import pandas as pd

# Read TSV file into DataFrame
df = pd.read_table('courses.tsv')
print(df)

# Output:
#  Courses    Fee Duration  Discount
# 0   Spark  25000  50 Days      2000
# 1  Pandas  20000  35 Days      1000
# 2    Java  15000      NaN       800
# 3  Python  15000  30 Days       500
# 4     PHP  18000  30 Days       800

When using Pandas’ read_csv() function to read a TSV file, by default, it assumes the first row contains column names (header) and creates an incremental numerical index starting from zero if no index column is specified.

Alternatively, you can also read_csv() but you need to use explicitly param sep or delimiter with '\t'

Using read_table() to Set Column as Index

To set a column as the index while reading a TSV file in Pandas, you can use the index_col parameter. Here, pd.read_csv() reads the TSV file named ‘courses.tsv’, sep='\t' specifies that the file is tab-separated, and index_col='Courses' sets the Courses column as the index of the DataFrame.


# Set column as Index
df = pd.read_tsv('courses.tsv', index_col='Courses')
print(df)

# Output:
#           Fee Duration  Discount
# Courses                          
# Spark    25000  50 Days      2000
# Pandas   20000  35 Days      1000
# Java     15000      NaN       800
# Python   15000  30 Days       500
# PHP      18000  30 Days       800

The above program will read the TSV file named courses.tsv, set the Courses column as the index, and print the DataFrame. Make sure the file path is correct and the file courses.tsv exists in the specified location.

Skip Rows Using read_table()

The skiprows and skipfooter parameters in pandas read_table function is useful for skipping rows at the beginning and end of a file, respectively.


# Skip first few rows
df = pd.read_table('courses.tsv', header=None, skiprows=2)
print(df)

# Output:
#        0      1        2     3
# 0  Pandas  20000  35 Days  1000
# 1    Java  15000      NaN   800
# 2  Python  15000  30 Days   500
# 3     PHP  18000  30 Days   800

This program reads the TSV file courses.tsv, skips the first two rows, and loads the remaining rows into a DataFrame df. By setting header=None.

Read CSV by Ignoring Column Names

To read a CSV file while ignoring the column names and assigning custom names, you can use header=None to indicate that there is no header row, and then use the names parameter to specify the custom column names.


# Ignore header and assign new columns
columns = ['courses','course_fee','course_duration','course_discount']
df = pd.read_table('/courses.tsv', header=None,names=columns,skiprows=1)
print(df)

# Output:
#  courses  course_fee course_duration  course_discount
# 0   Spark       25000         50 Days             2000
# 1  Pandas       20000         35 Days             1000
# 2    Java       15000             NaN              800
# 3  Python       15000         30 Days              500
# 4     PHP       18000         30 Days              800

The above code will read the courses.tsv file, ignore the header row, and assign the new column names specified in the columns list. If courses.tsv is in your current working directory, this code should work perfectly. Otherwise, make sure to provide the correct file path.

Load only Selected Columns Using read_table()

You can use read_table() to load only selected columns by specifying the usecols parameter. For instance, you define the columns list containing the names of the columns you want to load. Then, you use read_table() to read the TSV file, and the usecols parameter is set to columns, which specifies the columns to load.


# Load only selected columns
columns = ['courses','course_fee','course_duration','course_discount']
df = pd.read_table('courses.tsv', usecols =['Courses','Fee','Discount'])
print(df)

# Output:
#  Courses    Fee  Discount
# 0   Spark  25000      2000
# 1  Pandas  20000      1000
# 2    Java  15000       800
# 3  Python  15000       500
# 4     PHP  18000       800

The above example columns contains the names of the columns you want to load from the CSV file. The pd.read_csv() function reads the courses.csv file and loads only the specified columns using the usecols parameter.

Set DataTypes to Columns Using read_table()

By default, pd.read_table() will infer the data types for each column based on the data present in the file. It will automatically choose the data type that best fits the data in each column.


# Set column data types
df = pd.read_table('courses.tsv', dtype={'Courses':'string','Fee':'float'})
print(df.dtypes)

# Output:
# Courses      string
# Fee         float64
# Duration     object
# Discount      int64
# dtype: object

Parameters of pandas read_table()

nrows – Specifies how many rows to read from the file.
true_value – Values to consider as True.
false_values – Values to consider as False.
mangle_dupe_cols – If True, duplicate columns will be specified as ‘X’, ‘X.1’, ‘X.2’, etc., rather than ‘X’, ‘X’, ‘X’, etc.
converters – Dictionary containing column names and function pairs to convert values.
skipinitialspace – Skips spaces after delimiter.
na_values – Values to consider as NaN/NA.
keep_default_na – Specify whether to keep the default NaN values.
na_filter – Detect missing values. Set to False to improve performance.
skip_blank_lines – Skips empty lines without data.
parse_dates – Columns to parse as dates. Can be a list of column names or indices.
thousands– Separator for thousands.
decimal – Character for decimal point.
lineterminator – Line separator.
quotechar – Use a quote character when the delimiter is within a value.

FAQ on Pandas Read TSV

How do I read a TSV file into a pandas DataFrame?

To read a TSV (Tab-Separated Values) file into a pandas DataFrame, you can use the pd.read_csv() function with the sep parameter set to ‘\t’ to specify that the file is tab-separated.

Can I specify column names when reading a TSV file?

You can specify column names when reading a TSV (Tab-Separated Values) file into a pandas DataFrame. If the TSV file doesn’t have a header row with column names, you can provide a list of column names using the names parameter in the pd.read_csv() function.

How do I skip rows when reading a TSV file?

You can skip specific rows when reading a TSV (Tab-Separated Values) file into a pandas DataFrame by using the skiprows parameter of the pd.read_csv() function. The skiprows parameter allows you to specify a list of row indices or a range of rows that should be skipped during the reading process.

Can I specify a different encoding when reading a TSV file?

You can specify a different encoding when reading a TSV (Tab-Separated Values) file into a pandas DataFrame using the encoding parameter of the pd.read_csv() function. The encoding parameter allows you to specify the character encoding used in the file.

How can I read a TSV file with a specific index column?

To read a TSV (Tab-Separated Values) file into a pandas DataFrame with a specific column as the index, you can use the index_col parameter of the pd.read_csv() function. The index_col parameter allows you to specify the column that should be used as the index.

Conclusion

In this article, I have explained the read_table() function and using its syntax, parameters, and usage how to read a TSV file, and how to load it into Pandas DataFrame. Also learned without headers, skip rows or columns, set columns as index, and many more with examples.

Happy Learning!!