Python to Read Large Excel/CSV File Faster
Read a CSV with PyArrow
In Pandas 1.4, released in January 2022, there is a new backend for CSV reading, relying on the Arrow library’s CSV parser. It’s still marked as experimental, and it doesn’t support all the features of the default parser—but it is faster.1
| CSV parser | Elapsed time | CPU time (user+sys) |
|---|---|---|
| Default C | 13.2 seconds | 13.2 seconds |
| PyArrow | 2.7 seconds | 6.5 seconds |
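A minimal sketch of switching to the Arrow-backed parser (the file name is hypothetical, and a small sample CSV is generated so the snippet is self-contained):

```python
import pandas as pd

# Write a small sample CSV so the sketch is self-contained.
pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}).to_csv("sample.csv", index=False)

# engine="pyarrow" selects the Arrow-backed parser (pandas >= 1.4).
try:
    df = pd.read_csv("sample.csv", engine="pyarrow")
except ImportError:
    # Fall back to the default C engine if pyarrow is not installed.
    df = pd.read_csv("sample.csv")

print(df.shape)  # (3, 2)
```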
Note that the PyArrow engine is only available in `pd.read_csv()`, not `pd.read_excel()`.
In `pd.read_excel()`:

engine : str, default None
    If io is not a buffer or path, this must be set to identify io. Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".2
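For example, the engine can be passed explicitly when reading an `.xlsx` file (a sketch with a hypothetical one-row workbook; requires openpyxl to be installed):

```python
import pandas as pd

# Create a one-row workbook so the sketch is self-contained.
pd.DataFrame({"a": [1]}).to_excel("one.xlsx", index=False)

# Pass engine explicitly; openpyxl handles modern .xlsx files.
df = pd.read_excel("one.xlsx", engine="openpyxl")
print(df.shape)  # (1, 1)
```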
Upgrade Pandas
pip install --upgrade pandas --user
Note that `--user` is needed for Windows users to avoid permission errors.
Read Large Excel File Faster
Parallel
Let’s imagine that you received Excel files and have no other choice but to load them as is. You can use `joblib` to parallelize this.3 Compared to our pickle code from above, we only need to update the loop function.4
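A sketch of a parallel loader with `joblib` (the file names are hypothetical; small sample workbooks are generated so the snippet is self-contained, and it requires joblib and openpyxl):

```python
import pandas as pd
from joblib import Parallel, delayed

# Create three small sample workbooks so the sketch is self-contained.
paths = []
for i in range(3):
    path = f"sample_{i}.xlsx"
    pd.DataFrame({"x": range(5), "y": range(5)}).to_excel(path, index=False)
    paths.append(path)

def load_one(path):
    # The per-file "loop function": read one workbook into a DataFrame.
    return pd.read_excel(path)

# n_jobs=-1 spreads the reads across all available CPU cores.
frames = Parallel(n_jobs=-1)(delayed(load_one)(p) for p in paths)
df = pd.concat(frames, ignore_index=True)
print(len(df))  # 15 rows total
```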
Just One File
A standard experiment with `usecols`, `nrows`, and `skiprows`.
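A sketch of such an experiment (the file name, columns, and row counts are hypothetical; a sample workbook is generated so the snippet is self-contained):

```python
import pandas as pd

# Self-contained sample workbook with 100 data rows and 3 columns.
pd.DataFrame(
    {"a": range(100), "b": range(100), "c": range(100)}
).to_excel("big.xlsx", index=False)

# usecols limits parsing to two columns, skiprows drops the first 50
# data rows (row 0 is the header), and nrows stops after 10 rows —
# each option trims the amount of work the parser does.
df = pd.read_excel(
    "big.xlsx",
    usecols=["a", "b"],
    skiprows=range(1, 51),
    nrows=10,
)
print(df.shape)  # (10, 2)
```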
time used: 40.02475333213806
time used: 38.12591814994812
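Measurements like the "time used" figures above can be produced with a simple wall-clock wrapper along these lines (a sketch, not the original benchmark code; the workload shown is a stand-in):

```python
import time

start = time.time()
# Stand-in workload; replace with the pd.read_excel(...) call being measured.
total = sum(range(1_000_000))
elapsed = time.time() - start
print("time used:", elapsed)
```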
For other ideas to reduce time spent reading data, such as chunking, see Big Data from Excel to Pandas | Python Charmers.