Python to Read Large Excel/CSV File Faster
Read a CSV with PyArrow
In Pandas 1.4, released in January 2022, there is a new backend for CSV reading, relying on the Arrow library’s CSV parser. It’s still marked as experimental, and it doesn’t support all the features of the default parser—but it is faster.1
| CSV parser | Elapsed time | CPU time (user+sys) |
|---|---|---|
| Default C | 13.2 seconds | 13.2 seconds |
| PyArrow | 2.7 seconds | 6.5 seconds |
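A minimal sketch of switching to the Arrow-backed parser (the file name is hypothetical, and a small sample CSV is generated so the snippet is self-contained):

```python
import pandas as pd

# Write a small sample CSV so the sketch is self-contained.
pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}).to_csv("sample.csv", index=False)

# engine="pyarrow" selects the Arrow-backed parser (pandas >= 1.4).
try:
    df = pd.read_csv("sample.csv", engine="pyarrow")
except ImportError:
    # Fall back to the default C engine if pyarrow is not installed.
    df = pd.read_csv("sample.csv")

print(df.shape)  # (3, 2)
```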
Note that the PyArrow engine is only available in `pd.read_csv()`, not `pd.read_excel()`.
In `pd.read_excel()`:

engine : str, default None
    If io is not a buffer or path, this must be set to identify io. Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".2
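For example, the engine can be passed explicitly when reading an `.xlsx` file (a sketch with a hypothetical one-row workbook; requires openpyxl to be installed):

```python
import pandas as pd

# Create a one-row workbook so the sketch is self-contained.
pd.DataFrame({"a": [1]}).to_excel("one.xlsx", index=False)

# Pass engine explicitly; openpyxl handles modern .xlsx files.
df = pd.read_excel("one.xlsx", engine="openpyxl")
print(df.shape)  # (1, 1)
```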
Upgrade Pandas
pip install --upgrade pandas --user
Note that `--user` is needed for Windows users to avoid permission errors.
Read Large Excel File Faster
Parallel
Let’s imagine that you received Excel files and have no other choice but to load them as is. You can use `joblib` to parallelize this.3 Compared to our pickle code from above, we only need to update the loop function.4
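A sketch of a parallel loader with `joblib` (the file names are hypothetical; small sample workbooks are generated so the snippet is self-contained, and it requires joblib and openpyxl):

```python
import pandas as pd
from joblib import Parallel, delayed

# Create three small sample workbooks so the sketch is self-contained.
paths = []
for i in range(3):
    path = f"sample_{i}.xlsx"
    pd.DataFrame({"x": range(5), "y": range(5)}).to_excel(path, index=False)
    paths.append(path)

def load_one(path):
    # The per-file "loop function": read one workbook into a DataFrame.
    return pd.read_excel(path)

# n_jobs=-1 spreads the reads across all available CPU cores.
frames = Parallel(n_jobs=-1)(delayed(load_one)(p) for p in paths)
df = pd.concat(frames, ignore_index=True)
print(len(df))  # 15 rows total
```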
Just One File
A standard experiment with `usecols`, `nrows`, and `skiprows`.
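A sketch of such an experiment (the file name, columns, and row counts are hypothetical; a sample workbook is generated so the snippet is self-contained):

```python
import pandas as pd

# Self-contained sample workbook with 100 data rows and 3 columns.
pd.DataFrame(
    {"a": range(100), "b": range(100), "c": range(100)}
).to_excel("big.xlsx", index=False)

# usecols limits parsing to two columns, skiprows drops the first 50
# data rows (row 0 is the header), and nrows stops after 10 rows —
# each option trims the amount of work the parser does.
df = pd.read_excel(
    "big.xlsx",
    usecols=["a", "b"],
    skiprows=range(1, 51),
    nrows=10,
)
print(df.shape)  # (10, 2)
```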
time used: 40.02475333213806
time used: 38.12591814994812
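Measurements like the "time used" figures above can be produced with a simple wall-clock wrapper along these lines (a sketch, not the original benchmark code; the workload shown is a stand-in):

```python
import time

start = time.time()
# Stand-in workload; replace with the pd.read_excel(...) call being measured.
total = sum(range(1_000_000))
elapsed = time.time() - start
print("time used:", elapsed)
```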
For other ideas to reduce time spent reading data, such as chunking, see Big Data from Excel to Pandas | Python Charmers.