I'm using Python 2.7 to try something simple: call a website and extract data from its HTML. So far I've managed the code below.
    import requests
    from HTMLParser import HTMLParser

    name = "mark"
    surname = "jacobs"

    def req_getpagehtml(name, surname):
        # Build the query URL and return the page body as unicode text
        url = "http://sample.com/page.aspx&name=" + name + "&surname=" + surname
        response = requests.get(url).text
        return response

    page_code = req_getpagehtml(name, surname)
    htmlp = HTMLParser()
    print htmlp.feed(page_code)

The next thing I want is to parse the unicode response (print type(page_code) returns unicode) and extract some information from it.
Specifically, below is a sample of the HTML I can get back, and I want to extract some values from it: the numbers on the lines prefixed with > in the HTML below. The > doesn't exist in the real HTML; it's only there so you can identify the lines.
    ...
      <tr class="tr1" onclick="lockbac();">
      <td class="tdb" rowspan="2" nowrap="nowrap">1</td>
      <td class="tdb" rowspan="2" nowrap="nowrap">jacobs d <br/>mark</td>
      <td class="tdb" rowspan="2" align="center">math speciality</td>
      <td class="tdb" rowspan="2" align="center">advanced user</td>
    > <td class="tdb" rowspan="2" align="center">6.95</td>
    > <td class="tdb" rowspan="2" align="center">7.9</td>
    > <td class="tdb" rowspan="2" align="center">7.9</td>
      <td class="tdb" colspan="4" align="center"></td>
      <td class="tdb" rowspan="2" align="center">english</td>
      <td class="tdb" rowspan="2" align="center">b2-b2-b2-b2-b2</td>
      <td class="tdb" colspan="3" align="center">mathematics math-info</td>
      <td class="tdb" colspan="3" align="center">informatics</td>
      <td bgcolor="lightgreen" class="tdb" rowspan="2" align="center"></td>
      <td class="tdb" rowspan="2" align="center">8.88</td>
      <td class="tdb" rowspan="2" align="center">success</td>
      </tr>
      <tr class="tr1" onclick="lockbac();">
      <td class="tdb"></td>
      <td class="tdb"></td>
      <td class="tdb"></td>
      <td class="tdb"></td>
    > <td class="tdb">9.35</td>
    > <td class="tdb"></td>
    > <td class="tdb">9.35</td>
    > <td class="tdb">9.4</td>
      <td class="tdb"></td>
    > <td class="tdb">9.4</td>
      </tr>
    ...

These numbers represent exam scores, which I will later put in a DB.
Now, I'm trying to find an efficient way to extract these numbers, and I'd prefer to leave parsing the text of each element by hand (manual substr and so on) as a last option.
I did come across HTMLParser, which as you can see I imported in the code, but the print at the bottom returns None.
I was under the impression that I can use this library to parse the text received in the response, and that there is an easier way to specify a tag id (or similar) and extract the relevant information (as shown in the HTMLParser examples section), but I can't get the information I want using the feed method.
Maybe I'm not understanding this correctly, and maybe I'm not using the appropriate tool, which is why I explained my goal.
I would appreciate any help in correcting me or pointing me in the right direction.
I'm not sure how to make what you have tried work, but I have a different method.
You can grab lxml, a Python library that helps with scraping XML and HTML. It looks like you already have requests set up in your project.
    import requests
    from lxml import html

    page = requests.get('http://www.example.com')
    tree = html.fromstring(page.text)

The tree variable now contains the whole HTML document, which you can parse however you wish. Using XPath, you would have something like:

    scores = tree.xpath('//td[@class="tdb"]/text()')

Hope this helps.
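If you would rather stay with the HTMLParser from the question, note that feed() always returns None: the parser reports what it finds by calling handler methods, so you subclass it and collect the data yourself. Below is a minimal sketch of that idea, not a definitive implementation. The names ScoreParser, extract_scores and sample are made up for illustration, and it assumes every score sits in a td with class="tdb" as in the sample HTML. It keeps every such cell whose text parses as a number, so the row index (1) and the 8.88 cell would also come through on the real page; picking the exact cells by position is left to you.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7, as in the question
except ImportError:
    from html.parser import HTMLParser  # Python 3


class ScoreParser(HTMLParser):
    """Hypothetical helper: collects the text of every <td class="tdb">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") == "tdb":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks (e.g. around <br/>), so append
        if self.in_cell:
            self.cells[-1] += data


def extract_scores(page_code):
    """Return the numeric cell values found in page_code as floats."""
    parser = ScoreParser()
    parser.feed(page_code)  # returns None; results are in parser.cells
    parser.close()
    scores = []
    for text in parser.cells:
        try:
            scores.append(float(text.strip()))
        except ValueError:
            pass  # empty or non-numeric cell, skip it
    return scores


# A shortened, made-up excerpt in the shape of the sample HTML
sample = (
    '<tr class="tr1">'
    '<td class="tdb" rowspan="2" nowrap="nowrap">jacobs d <br/>mark</td>'
    '<td class="tdb" rowspan="2" align="center">6.95</td>'
    '<td class="tdb" rowspan="2" align="center">7.9</td>'
    '<td class="tdb"></td>'
    '<td class="tdb">9.35</td>'
    '</tr>'
)
print(extract_scores(sample))
```

With the real page you would call extract_scores(page_code) on the text returned by requests. The same numeric filter can be applied to the list returned by the XPath query above if you go the lxml route.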