Python Extract HTML Table (Convert to Pandas DataFrame) Tutorial
Examine the HTML
Use an HTML beautifier such as Best HTML Viewer, HTML Beautifier, HTML Formatter (codebeautify.org) to view the HTML.
We can simply use pandas.read_html() to read the tables inside a given HTML document.
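As a minimal sketch of this call, the snippet below parses a small hypothetical HTML string (the table content and the names HTML, tables and first_table are made up for illustration). It assumes pandas is installed together with a parser backend such as lxml, or bs4 with html5lib.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content, used only for illustration.
HTML = """
<html><body>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Tea</td><td>30</td></tr>
  <tr><td>Coffee</td><td>45</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> it finds.
# Wrapping the string in StringIO avoids the deprecation warning that
# recent pandas versions emit for literal HTML strings.
tables = pd.read_html(StringIO(HTML))
first_table = tables[0]
print(first_table)
```

Because the result is always a list, even a page with a single table needs the [0] index.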
If you ever run into the error
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 4204: illegal multibyte sequence
simply add the parameter encoding="utf-8" to the open() call. [1]
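The fix can be sketched with the standard library alone. The file name page.html and its content are hypothetical; the curly apostrophe is chosen because its UTF-8 encoding starts with the byte 0xe2 named in the error message.

```python
import os
import tempfile

# Hypothetical file content. The curly apostrophe (U+2019) is stored as the
# bytes e2 80 99 in UTF-8, so it contains the 0xe2 byte from the error above.
html_text = "<table><tr><td>It\u2019s a table</td></tr></table>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)

    # Without encoding=..., open() uses the locale's default codec
    # (cp950 on a Traditional Chinese Windows setup), which may fail here.
    with open(path, encoding="utf-8") as f:
        loaded = f.read()

print(loaded)
```

Passing the encoding explicitly also makes the script behave the same on a local Windows machine and on a Linux server.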
But what if we have an HTML body that contains nested tables?
We can work on the raw string: find the n-th occurrence of '<table' to filter out the unwanted <table>, then use the header parameter to anchor the right header row.
Example:
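A minimal sketch of the string approach, assuming the table we want does not itself contain further nested tables. The helpers find_nth and slice_nth_table and the sample markup NESTED_HTML are hypothetical names introduced for illustration, not part of pandas.

```python
def find_nth(haystack: str, needle: str, n: int) -> int:
    """Return the index of the n-th (1-based) occurrence of needle, or -1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    index = -1
    for _ in range(n):
        index = haystack.find(needle, index + 1)
        if index == -1:
            return -1
    return index


def slice_nth_table(html: str, n: int) -> str:
    """Return the markup from the n-th '<table' up to the next '</table>'."""
    start = find_nth(html, "<table", n)
    if start == -1:
        raise ValueError(f"fewer than {n} tables found")
    end = html.find("</table>", start)
    if end == -1:
        raise ValueError("no closing </table> after the selected table")
    return html[start:end + len("</table>")]


# Hypothetical page: an outer layout table wrapping the data table we want.
NESTED_HTML = (
    "<table><tr><td>layout"
    "<table><tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>30</td></tr></table>"
    "</td></tr></table>"
)

inner_table = slice_nth_table(NESTED_HTML, 2)
print(inner_table)
```

The resulting fragment can then be handed to pandas, for example pd.read_html(io.StringIO(inner_table), header=0), where header=0 tells pandas to use the first row as the column names.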
But how can we transform the table into the format we want?
Transpose/Transform
Let's set aside the more complex DataFrame and transpose operations. A simple and intuitive approach is to loop through the DataFrame and create a new DataFrame.
For tables like the ones above, I've written some example code you can refer to.
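As one illustration of the loop-and-rebuild idea, the sketch below reshapes a hypothetical wide table (one column per month) into one row per item and month. The column names and values are assumptions made up for the example and may not match the layout of the tables above.

```python
import pandas as pd

# Hypothetical wide table: one row per item, one column per month.
wide = pd.DataFrame(
    {"Item": ["Tea", "Coffee"], "Jan": [30, 45], "Feb": [32, 47]}
)

# Walk every row and every month column, collecting one record per cell.
records = []
for _, row in wide.iterrows():
    for month in ("Jan", "Feb"):
        records.append(
            {"Item": row["Item"], "Month": month, "Price": row[month]}
        )

# Build the new DataFrame from the collected records.
long_df = pd.DataFrame(records)
print(long_df)
```

Collecting plain dictionaries and building the DataFrame once at the end is usually simpler and faster than appending to a DataFrame inside the loop.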
What about datetime? How can we handle the local datetime issue if we want to deploy the app to the cloud?
Time Zone
With the classmethod datetime.now(tz=None) [2], we can pass a tzinfo to get a specific local time. However, before Python 3.9 added the zoneinfo module, the standard library did not define any time zones, at least not well (the toy example given in the documentation does not handle subtle problems like the ones mentioned here). [3]
My suggestion is to use timedelta to shift the time returned by utcnow() instead. For example, if we want the local time fixed to Taipei time (UTC+8), we can simply add a timedelta of eight hours to the UTC time.
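A minimal sketch of the fixed-offset approach follows. The names TAIPEI_OFFSET, to_taipei and taipei_now are hypothetical. Because datetime.utcnow() is deprecated since Python 3.12, the sketch obtains the same naive UTC value through datetime.now(timezone.utc) with the tzinfo stripped; this is a substitution, not the call named above.

```python
from datetime import datetime, timedelta, timezone

# Taipei is UTC+8 and currently observes no daylight saving time,
# so a fixed offset is enough for this example.
TAIPEI_OFFSET = timedelta(hours=8)


def to_taipei(utc_naive: datetime) -> datetime:
    """Shift a naive UTC datetime to Taipei wall-clock time."""
    return utc_naive + TAIPEI_OFFSET


def taipei_now() -> datetime:
    """Return the current Taipei wall-clock time as a naive datetime."""
    # Same value as the deprecated datetime.utcnow().
    utc_now = datetime.now(timezone.utc).replace(tzinfo=None)
    return to_taipei(utc_now)


print(taipei_now())
```

The result does not depend on the server's own time zone setting, which is the point when deploying to the cloud. If a timezone-aware value is preferred, datetime.now(timezone(TAIPEI_OFFSET)) gives the same wall-clock time with the offset attached.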