I'm trying to parse a large HTML page with malformed table markup. There are around 7000-10000 rows in the table. The problem is that none of the tr, th, or td elements are closed. So the markup looks like this:

    <html>
    <head> </head>
    <body>
    <center>
    <table border = 1>
    <tr height=40><th colspan = 16><font size=4>dummy content
    <tr><th>a <th>b <th>c <th>d <th>e <th>f <th>g
    <tr><td>a <td>b <td>c <td>d <td>e
    <tr><td>a <td>b <td>c <td>d <td>e
    .........
    .........
    </table>
    </center>
    </body>
    </html>

I tried BeautifulSoup.prettify() to fix it, but BeautifulSoup runs into a maximum recursion depth error. I also tried lxml, as follows:

    from lxml import html

    root = html.fromstring(htmltext)
    print(len(root.find('.//tr')))

But this returns a length of around 50, although there are more than 7000 tr's.
Is there a way to parse this HTML and extract the content of each row?
I hope this is what you are looking for.
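
    import re

    # re.DOTALL lets "." match newlines, so a row may span several lines
    trs = re.findall(r'(?<=<tr>).*?(?=<tr>)', your_string, re.DOTALL)
    print(trs)

This regex returns everything between two <tr> tags. If you want to search between two other tags, change the first tr and the second tr to whatever you need.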
I ran a little test and it worked for me. Let me know if it helped you.
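One caveat with the pattern above: it only matches a bare <tr>, so a row opened as <tr height=40> is skipped, and the last row is lost because no <tr> follows it. Below is a minimal sketch of a split-based variant that may cope better with markup like the one in the question. It is not a full HTML parser. The sample string, the pattern names, and the extract_rows helper are made up for illustration, and it assumes cells do not contain nested tables.

```python
import re

# Hypothetical sample modeled on the markup in the question.
sample = """<table border = 1>
<tr height=40><th colspan = 16><font size=4>dummy content
<tr><th>a <th>b <th>c
<tr><td>1 <td>2 <td>3
<tr><td>4 <td>5 <td>6
</table>"""

# Opening <tr>, <th> and <td> tags, with or without attributes.
ROW_OPEN = re.compile(r'<tr\b[^>]*>', re.IGNORECASE)
CELL_OPEN = re.compile(r'<t[hd]\b[^>]*>', re.IGNORECASE)
ANY_TAG = re.compile(r'<[^>]+>')


def extract_rows(markup):
    """Return a list of rows, each a list of cell text strings."""
    rows = []
    # Everything before the first <tr> is not a row, so skip chunk 0.
    for chunk in ROW_OPEN.split(markup)[1:]:
        cells = []
        # Likewise, skip whatever precedes the first cell tag in the row.
        for cell in CELL_OPEN.split(chunk)[1:]:
            # Drop leftover tags such as <font> or </table>, then trim.
            cells.append(ANY_TAG.sub('', cell).strip())
        rows.append(cells)
    return rows


rows = extract_rows(sample)
print(len(rows))   # 4
print(rows[2])     # ['1', '2', '3']
```

Splitting on the opening tags avoids depending on closing tags at all, which is why it tolerates the unclosed tr, th and td elements. It works on plain strings without recursion, so the number of rows should not matter.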