Python - Extract HTML tags and data from text


I'm using Python 2.7 to try to make a simple call to a website and extract HTML data from it. So far I've managed the code below.

import requests
from HTMLParser import HTMLParser

name = "mark"
surname = "jacobs"

def req_getpagehtml(name, surname):
    # Build the request URL and return the page body as unicode text
    url = "http://sample.com/page.aspx&name=" + name + "&surname=" + surname
    response = requests.get(url).text
    return response

page_code = req_getpagehtml(name, surname)

htmlp = HTMLParser()

# Feed the page to the parser and print whatever feed() returns
print htmlp.feed(page_code)

The next thing I want to do is parse the unicode response (print type(page_code) returns unicode) and extract information from it.

Specifically, below is a sample of the HTML I get back. I want to extract the values (the numbers marked in the HTML below with a leading >; the marker doesn't exist in the real HTML, I added it so you can identify them).

...
<tr class="tr1" onclick="lockbac();">
    <td class="tdb" rowspan="2" nowrap="nowrap">1</td>
    <td class="tdb" rowspan="2" nowrap="nowrap">jacobs d <br/>mark</td>
    <td class="tdb" rowspan="2" align="center">math speciality</td>
    <td class="tdb" rowspan="2" align="center">advanced user</td>
>   <td class="tdb" rowspan="2" align="center">6.95</td>
>   <td class="tdb" rowspan="2" align="center">7.9</td>
>   <td class="tdb" rowspan="2" align="center">7.9</td>
    <td class="tdb" colspan="4" align="center"></td>
    <td class="tdb" rowspan="2" align="center">english</td>
    <td class="tdb" rowspan="2" align="center">b2-b2-b2-b2-b2</td>
    <td class="tdb" colspan="3" align="center">mathematics math-info</td>
    <td class="tdb" colspan="3" align="center">informatics</td>
    <td bgcolor="lightgreen" class="tdb" rowspan="2" align="center"></td>
    <td class="tdb" rowspan="2" align="center">8.88</td>
    <td class="tdb" rowspan="2" align="center">success</td>
</tr>
<tr class="tr1" onclick="lockbac();">
    <td class="tdb"></td>
    <td class="tdb"></td>
    <td class="tdb"></td>
    <td class="tdb"></td>
>   <td class="tdb">9.35</td>
>   <td class="tdb"></td>
>   <td class="tdb">9.35</td>
>   <td class="tdb">9.4</td>
    <td class="tdb"></td>
>   <td class="tdb">9.4</td>
</tr>
...

These numbers represent exam scores, which I will later put in a DB.

Now, I'm trying to find an efficient way to extract these numbers. I'd prefer to leave parsing the text of each element manually (substr and so on) as a last option.

I did come across HTMLParser, which, as you can see, I imported in the code, but the print at the bottom returns None.

I was under the impression that I could use this library to parse the text received in the response, and that there would be an easy way to specify a tag id (or similar) and extract the relevant information (as shown in the HTMLParser examples section), but I can't get the information I want using the feed method.

Maybe I'm not understanding it correctly, or maybe I'm not using the appropriate tool, which is why I explained my goal.

I would appreciate any help in correcting me or pointing me in the right direction.

I'm not sure how to make what you have tried work, but I have a different method.

You can grab lxml, a Python library that helps with scraping XML and HTML. It seems you already have requests in your project.

import requests
from lxml import html

page = requests.get('http://www.example.com')
tree = html.fromstring(page.text)

The tree variable now contains the whole HTML document, which you can parse as you wish. Using XPath, you would have something like

scores = tree.xpath('//td[@class="tdb"]/text()') 
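An XPath expression of this kind returns the text of every matching cell, so names and labels come back alongside the numbers. Below is a minimal sketch of narrowing such a list down to numeric values. The helper name parse_scores and the hand-written cells list are assumptions for illustration, not real output from the page.

```python
def parse_scores(cell_texts):
    """Return the cell texts that parse as numbers, converted to float."""
    scores = []
    for text in cell_texts:
        try:
            scores.append(float(text.strip()))
        except ValueError:
            # Not a number (a name, a label, ...), so skip it
            continue
    return scores


# Hypothetical stand-in for a list of cell texts
cells = ['jacobs d ', 'mark', 'math speciality', '6.95', '7.9', 'english', '8.88', 'success']
print(parse_scores(cells))
```

Note that a plain integer cell such as the row number would also pass this filter, so you may still need to pick the scores by position or tighten the XPath.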

Hope this helps.
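If you would rather stay with the standard library, here is a rough sketch of how HTMLParser is meant to be used. feed() always returns None; the parser instead calls handler methods that you override in a subclass. The class name TdTextCollector and the short sample string are made up for illustration, and the import fallback is only there so the same code runs on Python 2.7 and Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TdTextCollector(HTMLParser):
    """Collects the non-empty text found inside <td class="tdb"> cells."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") == "tdb":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_cell and text:
            self.cells.append(text)


# Hypothetical, shortened sample row
sample = '<tr><td class="tdb">mark</td><td class="tdb">6.95</td><td class="tdb"></td></tr>'
parser = TdTextCollector()
parser.feed(sample)  # returns None; results accumulate on the parser object
parser.close()
print(parser.cells)
```

This works, but you have to track the parsing state yourself, which is why a library with XPath support tends to be less code for this kind of table.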
