
Python to Read Large Excel/CSV File Faster

(Featured image source: https://unsplash.com/photos/Wpnoqo2plFA)

Read a CSV with PyArrow

In Pandas 1.4, released in January 2022, there is a new backend for CSV reading, relying on the Arrow library’s CSV parser. It’s still marked as experimental, and it doesn’t support all the features of the default parser—but it is faster.1

CSV parser    Elapsed time    CPU time (user+sys)
Default C     13.2 seconds    13.2 seconds
PyArrow       2.7 seconds     6.5 seconds
import pandas as pd

pd.read_csv('Example.csv', engine='pyarrow')
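
If you want to reproduce such a comparison on your own file, a minimal sketch (Example.csv is a placeholder file name) could look like this:

import time
import pandas as pd

# Compare the default C parser against the PyArrow-backed parser.
for engine in ('c', 'pyarrow'):
    start = time.time()
    df = pd.read_csv('Example.csv', engine=engine)
    print(f'{engine}: {time.time() - start:.1f}s')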

Note that the pyarrow engine is only available for pd.read_csv(), not pd.read_excel().

In pd.read_excel():

engine: str, default None

If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”.2
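
For a modern .xlsx workbook the usual choice is openpyxl; a minimal sketch (Example.xlsx is just an illustrative file name):

import pandas as pd

# openpyxl reads .xlsx; xlrd only handles the legacy .xls format.
df = pd.read_excel('Example.xlsx', engine='openpyxl')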

Upgrade Pandas

pip install --upgrade pandas --user

Note that --user is needed for Windows users to handle:

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: ...
Consider using the `--user` option or check the permissions.

Read Large Excel File Faster

Parallel

Let’s imagine that you received Excel files and have no other choice but to load them as is. You can use joblib to parallelize the loading across files3; compared with a plain sequential loop, only the per-file loading function needs to change.4 A sketch is shown below.
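
A minimal sketch with joblib, assuming a hypothetical list of workbooks that all contain a "Position" sheet:

from joblib import Parallel, delayed
import pandas as pd

# Hypothetical input files; replace with your own paths.
paths = ['Positions_2021.xlsx', 'Positions_2022.xlsx', 'Positions_2023.xlsx']

def load_one(path):
    # Each worker loads one workbook into a DataFrame.
    return pd.read_excel(path, sheet_name='Position')

# Load the files in parallel and concatenate the results.
frames = Parallel(n_jobs=-1)(delayed(load_one)(p) for p in paths)
df = pd.concat(frames, ignore_index=True)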

Just One File

A standard experiment with the usecols, nrows, and skiprows parameters.

import time

import pandas as pd


def load_in_positions():
    '''
    @return: Positions DataFrame
    '''
    print(f'Retrieving Positions.xlsx with Pandas--{pd.__version__}...Please hold on...')

    # Full read: every column of the "Position" sheet.
    start = time.time()
    xlsx = pd.read_excel(CFG.Positions_xlsx_path, sheet_name="Position")  # CFG holds the project's file paths
    print(f'time used:\n{time.time()-start}\nxlsx:\n{xlsx}')

    # Restricted read: only the columns we actually need.
    start = time.time()
    xlsx = pd.read_excel(CFG.Positions_xlsx_path, sheet_name="Position", usecols=[
        'Date', 'ISIN', 'MSCI ESG Rating', 'Bloomberg ESG Disclosure Score', 'Sustainalytics ESG Risk Rating', 'Sustainalytics ESG Risk Score'
    ])
    print(f'time used:\n{time.time()-start}\nxlsx:\n{xlsx}')
    return xlsx
Output:

time used:
40.02475333213806
time used:
38.12591814994812
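
nrows and skiprows work the same way when only part of the sheet is needed; a minimal sketch, with illustrative row counts:

# Read only the first 1,000 data rows of the sheet.
head = pd.read_excel(CFG.Positions_xlsx_path, sheet_name="Position", nrows=1000)

# Skip the first 5,000 data rows (row 0 is the header and is kept).
rest = pd.read_excel(
    CFG.Positions_xlsx_path,
    sheet_name="Position",
    skiprows=range(1, 5001),
)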

For other ideas to reduce reading time, such as chunking, see Big Data from Excel to Pandas | Python Charmers.
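
As a quick illustration of the chunking idea for CSV (pd.read_excel has no chunksize parameter; Example.csv is a placeholder file name):

import pandas as pd

total = 0
# Process 100,000 rows at a time instead of loading the whole file at once.
for chunk in pd.read_csv('Example.csv', chunksize=100_000):
    total += len(chunk)
print(f'rows processed: {total}')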