How to read TSV file in pandas? TSV stands for Tab Separated File use pandas which is a text file where each field is separated by tab (\t). In Pandas, you can read the TSV file into DataFrame by using the read_table()
function.
In this article, I will explain the read_table()
function and using its syntax, parameters, and usage how to read a TSV (Tab-Separated Values) file in Pandas involves understanding various options like handling headers, skipping rows or columns, and setting columns as indices.
Key Points –
- Pandas provides the
read_csv()
function which can be utilized to read TSV files by specifying thesep='\t'
parameter, allowing for efficient data loading and manipulation. - When reading TSV files, it’s important to consider whether the file contains a header row. Pandas can infer the header row automatically (
header='infer'
) or you can specifyheader=None
if the file doesn’t have a header. - Pandas offers flexibility in handling TSV files, allowing you to skip specific rows or columns using parameters like
skiprows
andusecols
, respectively.
Syntax of read_table()
Following is the syntax of the read_table() function
# Syntax of read_table()
pandas.read_table(filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default, index_col=None, usecols=None, squeeze=None, prefix=NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)
Pandas Read TSV into DataFrame
To read a TSV file with tab (\t) delimiter use pandas read_table()
function. This also supports optionally iterating or breaking the file into chunks. In the above syntax, you can see that there are several optional parameters available for reading TSV files, each serving a specific purpose. These parameters become particularly valuable when handling large files, as they can significantly enhance the efficiency of loading TSV data.
I’ll utilize the provided data to demonstrate reading a TSV file. You can access the data file on GitHub.
# Import pandas
import pandas as pd
# Read TSV file into DataFrame
df = pd.read_table('courses.tsv')
print(df)
# Output:
# Courses Fee Duration Discount
# 0 Spark 25000 50 Days 2000
# 1 Pandas 20000 35 Days 1000
# 2 Java 15000 NaN 800
# 3 Python 15000 30 Days 500
# 4 PHP 18000 30 Days 800
When using Pandas’ read_csv()
function to read a TSV file, by default, it assumes the first row contains column names (header) and creates an incremental numerical index starting from zero if no index column is specified.
Alternatively, you can also read_csv() but you need to use explicitly param sep
or delimiter
with '\t'
Using read_table() to Set Column as Index
To set a column as the index while reading a TSV file in Pandas, you can use the index_col
parameter. Here, pd.read_csv()
reads the TSV file named ‘courses.tsv’, sep='\t'
specifies that the file is tab-separated, and index_col='Courses'
sets the Courses
column as the index of the DataFrame.
# Set column as Index
df = pd.read_tsv('courses.tsv', index_col='Courses')
print(df)
# Output:
# Fee Duration Discount
# Courses
# Spark 25000 50 Days 2000
# Pandas 20000 35 Days 1000
# Java 15000 NaN 800
# Python 15000 30 Days 500
# PHP 18000 30 Days 800
The above program will read the TSV file named courses.tsv
, set the Courses
column as the index, and print the DataFrame. Make sure the file path is correct and the file courses.tsv
exists in the specified location.
Skip Rows Using read_table()
The skiprows
and skipfooter
parameters in pandas read_table
function is useful for skipping rows at the beginning and end of a file, respectively.
# Skip first few rows
df = pd.read_table('courses.tsv', header=None, skiprows=2)
print(df)
# Output:
# 0 1 2 3
# 0 Pandas 20000 35 Days 1000
# 1 Java 15000 NaN 800
# 2 Python 15000 30 Days 500
# 3 PHP 18000 30 Days 800
This program reads the TSV file courses.tsv
, skips the first two rows, and loads the remaining rows into a DataFrame df
. By setting header=None
.
Read CSV by Ignoring Column Names
To read a CSV file while ignoring the column names and assigning custom names, you can use header=None
to indicate that there is no header row, and then use the names
parameter to specify the custom column names.
# Ignore header and assign new columns
columns = ['courses','course_fee','course_duration','course_discount']
df = pd.read_table('/courses.tsv', header=None,names=columns,skiprows=1)
print(df)
# Output:
# courses course_fee course_duration course_discount
# 0 Spark 25000 50 Days 2000
# 1 Pandas 20000 35 Days 1000
# 2 Java 15000 NaN 800
# 3 Python 15000 30 Days 500
# 4 PHP 18000 30 Days 800
The above code will read the courses.tsv
file, ignore the header row, and assign the new column names specified in the columns
list. If courses.tsv
is in your current working directory, this code should work perfectly. Otherwise, make sure to provide the correct file path.
Load only Selected Columns Using read_table()
You can use read_table()
to load only selected columns by specifying the usecols
parameter. For instance, you define the columns
list containing the names of the columns you want to load. Then, you use read_table()
to read the TSV file, and the usecols
parameter is set to columns
, which specifies the columns to load.
# Load only selected columns
columns = ['courses','course_fee','course_duration','course_discount']
df = pd.read_table('courses.tsv', usecols =['Courses','Fee','Discount'])
print(df)
# Output:
# Courses Fee Discount
# 0 Spark 25000 2000
# 1 Pandas 20000 1000
# 2 Java 15000 800
# 3 Python 15000 500
# 4 PHP 18000 800
The above example columns
contains the names of the columns you want to load from the CSV file. The pd.read_csv()
function reads the courses.csv
file and loads only the specified columns using the usecols
parameter.
Set DataTypes to Columns Using read_table()
By default, pd.read_table()
will infer the data types for each column based on the data present in the file. It will automatically choose the data type that best fits the data in each column.
# Set column data types
df = pd.read_table('courses.tsv', dtype={'Courses':'string','Fee':'float'})
print(df.dtypes)
# Output:
# Courses string
# Fee float64
# Duration object
# Discount int64
# dtype: object
Parameters of pandas read_table()
nrows
– Specifies how many rows to read from the file.true_value
– Values to consider as True.false_values
– Values to consider as False.mangle_dupe_cols
– If True, duplicate columns will be specified as ‘X’, ‘X.1’, ‘X.2’, etc., rather than ‘X’, ‘X’, ‘X’, etc.converters
– Dictionary containing column names and function pairs to convert values.skipinitialspace
– Skips spaces after delimiter.na_values
– Values to consider as NaN/NA.keep_default_na
– Specify whether to keep the default NaN values.na_filter
– Detect missing values. Set to False to improve performance.skip_blank_lines
– Skips empty lines without data.parse_dates
– Columns to parse as dates. Can be a list of column names or indices.thousands
– Separator for thousands.decimal
– Character for decimal point.lineterminator
– Line separator.quotechar
– Use a quote character when the delimiter is within a value.
FAQ on Pandas Read TSV
To read a TSV (Tab-Separated Values) file into a pandas DataFrame, you can use the pd.read_csv()
function with the sep
parameter set to ‘\t’ to specify that the file is tab-separated.
You can specify column names when reading a TSV (Tab-Separated Values) file into a pandas DataFrame. If the TSV file doesn’t have a header row with column names, you can provide a list of column names using the names
parameter in the pd.read_csv()
function.
You can skip specific rows when reading a TSV (Tab-Separated Values) file into a pandas DataFrame by using the skiprows
parameter of the pd.read_csv()
function. The skiprows
parameter allows you to specify a list of row indices or a range of rows that should be skipped during the reading process.
You can specify a different encoding when reading a TSV (Tab-Separated Values) file into a pandas DataFrame using the encoding
parameter of the pd.read_csv()
function. The encoding
parameter allows you to specify the character encoding used in the file.
To read a TSV (Tab-Separated Values) file into a pandas DataFrame with a specific column as the index, you can use the index_col
parameter of the pd.read_csv()
function. The index_col
parameter allows you to specify the column that should be used as the index.
Conclusion
In this article, I have explained the read_table()
function and using its syntax, parameters, and usage how to read a TSV file, and how to load it into Pandas DataFrame. Also learned without headers, skip rows or columns, set columns as index, and many more with examples.
Happy Learning!!
Related Articles
- pandas ExcelWriter Usage with Examples
- pandas write CSV file
- Read Excel file into pandas DataFrame
- Pandas Read Text with Examples
- Pandas Read JSON File with Examples
- Pandas read_csv() with Examples
- Pandas Read Excel with Examples
- Pandas Select Rows Based on List Index
- Pandas Read SQL Query or Table with Examples
- Pandas Read Multiple CSV Files into DataFrame
References
- https://pandas.pydata.org/docs/reference/api/pandas.read_table.html
- https://fileinfo.com/extension/tsv