How to parse a large malformed HTML page in Python?


I'm trying to parse a large HTML page with malformed table markup. There are around 7,000 to 10,000 rows in the table. The problem is that none of the tr, th, or td elements are closed, so the markup looks like this:

    <html>
    <head>
    </head>
    <body>
    <center>
        <table border = 1>
            <tr height=40><th colspan = 16><font size=4>dummy content
            <tr><th>a
                <th>b
                <th>c
                <th>d
                <th>e
                <th>f
                <th>g
            <tr><td>a
                <td>b
                <td>c
                <td>d
                <td>e
            <tr><td>a
                <td>b
                <td>c
                <td>d
                <td>e
            .........
            .........
        </table>
    </center>
    </body>
    </html>

I tried BeautifulSoup's prettify() to fix it, but BeautifulSoup runs into a maximum recursion depth error. I then tried lxml, as follows:

    from lxml import html

    root = html.fromstring(htmltext)
    print(len(root.find('.//tr')))

But this returns a length of around 50, while there are more than 7,000 tr's.

Is there a way to parse this HTML and extract the content of each row?
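One thing worth checking in the lxml snippet from the question: find returns only the first matching element, and len() of an element counts its direct children, so the printed number is not a row count. The sketch below shows the difference using the standard library's xml.etree.ElementTree, which shares the find/findall interface with lxml. The sample is a small well-formed table invented for illustration, not the asker's page, and this may not be the whole explanation for the number 50.

```python
import xml.etree.ElementTree as ET

# Small, well-formed sample (hypothetical data, not the page from the question).
sample = (
    "<table>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "<tr><td>d</td><td>e</td><td>f</td></tr>"
    "</table>"
)
root = ET.fromstring(sample)

first_row = root.find(".//tr")     # the first matching element only
all_rows = root.findall(".//tr")   # a list of every matching element

children_of_first_row = len(first_row)  # counts the <td> children of one row
row_count = len(all_rows)               # counts the <tr> elements
print(children_of_first_row, row_count)
```

To count rows, take len() of the findall result rather than of the single element returned by find.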

I hope this is what you are looking for.

    import re

    trs = re.findall(r'(?<=<tr>).*?(?=<tr>)', your_string, re.DOTALL)
    print(trs)

This regex returns everything between two <tr> tags. If you want to search between two other tags, change the first <tr> and the second <tr> to whatever you need.
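The pattern above only recognizes a bare <tr>, so a row opened as <tr height=40> does not start a match, and the last row is skipped because no <tr> follows it. A possible variation, sketched here under the assumption that the page holds a single table with no nested tables and no ">" inside attribute values, splits on row and cell start tags instead. The function name extract_rows and the sample string are made up for illustration.

```python
import re

ROW_START = re.compile(r"<tr\b[^>]*>", re.IGNORECASE)      # <tr> with or without attributes
CELL_START = re.compile(r"<t[hd]\b[^>]*>", re.IGNORECASE)  # <td ...> or <th ...>
TABLE_END = re.compile(r"</table\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")

def extract_rows(html_text):
    """Return the cell texts of each row as a list of lists (hypothetical helper)."""
    # Keep only the text before the first closing </table> tag.
    body = TABLE_END.split(html_text, maxsplit=1)[0]
    rows = []
    # Piece 0 is whatever precedes the first <tr>, so skip it.
    for chunk in ROW_START.split(body)[1:]:
        # Piece 0 of each row is whatever precedes its first cell.
        cells = CELL_START.split(chunk)[1:]
        # Remove leftover inline tags such as <font ...> and trim whitespace.
        rows.append([ANY_TAG.sub("", cell).strip() for cell in cells])
    return rows

sample = """<table border = 1>
<tr height=40><th colspan = 16><font size=4>dummy content
<tr><th>a <th>b
<tr><td>1 <td>2
</table></center>"""

rows = extract_rows(sample)
print(rows)
```

Because it never builds a tree, this approach is not affected by nesting depth, but it remains a text-level heuristic rather than a real HTML parse.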

I ran a little test and it worked for me. Let me know if this helped you.
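If a regex feels too fragile, another option is the event-based HTMLParser from Python 3's standard library. It reports start tags, end tags, and text as it reads them and does not build a tree, so unclosed cells should not lead to deep recursion. The sketch below is a minimal illustration, not a tested solution for the asker's page: the class name RowCollector and the sample string are hypothetical, and it assumes a single table without nesting.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect cell texts per row, treating each <tr>/<td>/<th> start as closing the previous one."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text pieces of the open cell, or None if no cell is open

    def _close_cell(self):
        # Store the open cell (if any) in the current row.
        if self._cell is not None and self.rows:
            self.rows[-1].append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_cell()
            self.rows.append([])
        elif tag in ("td", "th"):
            self._close_cell()
            if not self.rows:   # a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._close_cell()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_cell()      # flush a cell left open at end of input

sample = """<table border = 1>
<tr height=40><th colspan = 16><font size=4>dummy content
<tr><th>a <th>b
<tr><td>1 <td>2
</table></center>"""

parser = RowCollector()
parser.feed(sample)   # feed() may be called repeatedly with chunks of a large page
parser.close()
print(parser.rows)
```

Since feed() accepts partial input, a large file can be read and passed in pieces instead of being loaded as one string.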

