How to read CSV in Pandas with Column Names or Regular Expressions

A practical example of how to load columns based on columns names or using regex

George Pipis


A common scenario is to work with massive datasets consisting of many columns which are not needed for our analysis. Usually, we tend to load the whole dataset and then keep the columns of our interest. Let’s see a more efficient way to load only the columns of our interest.

For this tutorial, we will work with the following dataset:

import pandas as pd
import re


With the following column names:

['campaign_id', 'user_id', 'cdate', 'att_gander', 'att_agegroup','variant', 'response']

How to read a CSV by keeping only specific column names

Assuming that we want to load only three columns, such as the user_id, the variant and the response. In the read_csv function, there is a parameter called usecols which will be used in all the examples below.

usecols : list-like or callable, optional
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). If ``names`` are given, the document
header row(s) are not taken into account. For example, a valid list-like
`usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
To instantiate a DataFrame from ``data`` with element order preserved use
``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
in ``['foo', 'bar']`` order or
``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
for ``['bar', 'foo']`` order.

Let’s load only the columns of our interest:

pd.read_csv('example.csv', usecols = lambda x: x in ['user_id', 'variant'…



George Pipis

Sr. Director, Data Scientist @ Persado | Co-founder of the Data Science blog: