Use Python lxml to extract information from an HTML webpage

I am trying to make a Python script that scrapes specific information from a webpage with the limited knowledge I have, but I guess that knowledge does not suffice. I need to extract 7-8 pieces of information. The tags are as follows:

<a class="ui-magnifier-glass" href="here goes link want extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>

<a href="link extract" title="title extract" rel="category tag" data-spm-anchor-id="0.0.0.0">or maybe word instead of title</a>

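If I can get an idea of how to extract information such as the href from these tags, I will be able to do the rest of the work myself.

And if someone can help me write the code that adds this information to a CSV file, it would be highly appreciated.

I have started with this code: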

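    import requests
    from lxml import html

    url = input('url : ')

    # Download the page and build an element tree from its HTML
    page = requests.get(url)
    tree = html.fromstring(page.text)

    # Each xpath() call returns a list of matching text nodes
    productname = tree.xpath('//h1[@class="product-name"]/text()')
    price = tree.xpath('//span[@id="sku-discount-price"]/text()')

    print('\n' + productname[0])
    print('\n' + price[0])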

You can use the lxml and csv modules to do what you want. lxml supports XPath expressions for selecting the elements you want.

    from csv import DictWriter
    from io import StringIO

    from lxml import etree

    f = StringIO('''
        <html><body>
        <a class="ui-magnifier-glass"
           href="here goes link want extract"
           data-spm-anchor-id="0.0.0.0"
           style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
        ></a>
        <a href="link extract"
           title="title extract"
           rel="category tag"
           data-spm-anchor-id="0.0.0.0"
        >or maybe word instead of title</a>
        </body></html>
    ''')
    doc = etree.parse(f)

    data = []
    # Select the links that have data-spm-anchor-id="0.0.0.0"
    r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')

    # Iterate through each matched <a></a> element
    for elem in r:
        # Attributes can be read with get()
        link = elem.get('href')
        title = elem.get('title')
        # The text inside the tag is accessible as .text
        text = elem.text

        data.append({
            'link': link,
            'title': title,
            'text': text
        })

    # newline='' stops the csv module from writing blank lines on Windows
    with open('file.csv', 'w', newline='') as csvfile:
        fieldnames = ['link', 'title', 'text']
        writer = DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for row in data:
            writer.writerow(row)
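
The snippet above parses a small, well-formed sample. Pages fetched from the web are often not well-formed XML, so a minimal sketch for that case could use lxml.html.fromstring (as the question's own code does) and let XPath return the attribute values directly by ending the expression in /@href. The sample markup, the example.com URLs and the helper name extract_links are assumptions made for illustration; they do not come from the original page.

```python
from lxml import html

# Hypothetical sample markup; in practice this would be page.text
# from requests.get(url). Note the unclosed <p>, which the HTML
# parser tolerates.
SAMPLE = '''
<html><body>
<a class="ui-magnifier-glass" href="http://example.com/image.jpg"></a>
<a href="http://example.com/category" title="Some category" rel="category tag">Category</a>
<p>Unclosed paragraph
</body></html>
'''


def extract_links(markup):
    """Return (magnifier hrefs, category link dicts) from an HTML string."""
    tree = html.fromstring(markup)
    # Ending the expression in /@href yields the attribute values as strings
    magnifier = tree.xpath('//a[@class="ui-magnifier-glass"]/@href')
    # Selecting the elements themselves gives access to several attributes
    categories = [
        {
            'link': a.get('href'),
            'title': a.get('title'),
            'text': a.text_content().strip(),
        }
        for a in tree.xpath('//a[@rel="category tag"]')
    ]
    return magnifier, categories


magnifier_links, category_links = extract_links(SAMPLE)
print(magnifier_links)
print(category_links)
```

Selecting on class or rel rather than on data-spm-anchor-id is a design choice of this sketch, not something the original page guarantees; which attribute is stable enough to select on depends on the real markup.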

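One thing to watch for when combining the two pieces: indexing an XPath result with [0], as the question's code does, raises IndexError when nothing matches. A hedged sketch of one way to guard against that and write a single product row to CSV follows. The helper name first, the sample markup and the column names are assumptions for illustration only; the two product XPath expressions are taken from the question.

```python
import csv
import io

from lxml import html

# Hypothetical markup standing in for a downloaded product page
SAMPLE = '''
<html><body>
<h1 class="product-name">Sample product</h1>
<a class="ui-magnifier-glass" href="http://example.com/image.jpg"></a>
</body></html>
'''


def first(tree, expression, default=''):
    """Return the first stripped XPath result, or default if nothing matched."""
    results = tree.xpath(expression)
    return results[0].strip() if results else default


tree = html.fromstring(SAMPLE)
row = {
    'name': first(tree, '//h1[@class="product-name"]/text()'),
    # SAMPLE has no such element, so this falls back to the default
    'price': first(tree, '//span[@id="sku-discount-price"]/text()'),
    'image': first(tree, '//a[@class="ui-magnifier-glass"]/@href'),
}

# Written to an in-memory buffer here; a file opened with
# open('file.csv', 'w', newline='') can be passed to DictWriter the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'price', 'image'])
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```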