Contents

Python Extract HTML Table (Convert to Pandas DataFrame) Tutorial

Examine the HTML

Use Best HTML Viewer, HTML Beautifier, HTML Formatter and to Test / Preview HTML Output (codebeautify.org) beautifier to view html.

We can simply use Pandas.read_html() to read the tables inside a given html.

If you ever faced the problem UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 4204: illegal multibyte sequence

Simply add a parameter encoding="utf-8" to the open.1

But, what if we have a HTML body that has nested tables.

1
2
3
4
<table ...>
  <th>
    <table ...>
      ...

featured-image.png
nested tables

We can play with the string by finding the n-th occurence '<table' to filter out the unwanted <table>. Then use the header parameter to anchor the right header.

Example:

1
2
3
4
5
6
with open('test.txt', 'r', encoding="utf-8") as file:
    html = file.read()

tables = html.split('<table')

target_DF = pd.read_html('<table'+tables[5], header=[1])[0]

But how can we transform the table to the format we want?

Transpose/Transform

/blog/posts/2022/python-extract-html-table-convert-to-pandas-dataframe-tutorial/Transpose-or-Transform-the-table.png
Transpose-or-Transform-the-table

Let’s ignore the complex DataFrame, transpose things. A simple and intuitive approach will be loop through the DataFrame and Create a new DF.

Like the above tables, I’ve written an example code you can refer to.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
new_DF = pd.DataFrame(columns=['Date', 'Bank', 'Currency', 'Tenor', 'Fixed Deposit Rate', 'Loan Rate', 'Remarks'])

Bank=[]; Currency=[]; Tenor=[]; FixedDepositRate=[]; Remarks=[]
Tenor_temp = list(map(lambda item: item.split()[0]+'M', target_DF.columns[1:4].to_list()))
for i, row in target_DF.iterrows():
    Bank.extend(['UBS']*3)
    Currency.extend([row['Currency']]*3)
    Tenor.extend(Tenor_temp)
    FixedDepositRate.extend(row[1:4])
    Remarks.extend([row['Minimum size per deposit']]*3)
new_DF['Bank'] = Bank
new_DF['Currency'] = Currency
new_DF['Tenor'] = Tenor
new_DF['Fixed Deposit Rate'] = FixedDepositRate
new_DF['Remarks'] = Remarks

How about the datetime? How can we handle the local datetime issue if we want to deploy the app to the cloud.

Time Zone

With classmethod datetime.now(tz=None)2, we have tzinfo to get the certain local time. Though the standard library does not define any timezones – at least not well (the toy example given in the documentation does not handle subtle problems like the ones mentioned here).3

My suggestion is to use timedelta to change the local time from utcnow() instead. For example we want the local time to be fixed to Taipei time (utc+8). We can just use timedelta:

1
2
3
4
from datetime import datetime
from datetime import timedelta

_today = datetime.utcnow() + timedelta(hours=8)