I am trying to make a Python script to scrape specific information from a webpage with the limited knowledge I have, but I guess my limited knowledge is not sufficient. I need to extract 7-8 pieces of information. The tags are as follows:
1.

    <a class="ui-magnifier-glass" href="here goes link want extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>

2.

    <a href="link extract" title="title extract" rel="category tag" data-spm-anchor-id="0.0.0.0">or maybe word instead of title</a>

If anyone has an idea how to extract the information from such tags (the href and title attributes, or the text inside the tag), I will be able to do the rest of the work myself.
And if anyone could help me write the code to add this information to a CSV file, it would be highly appreciated.

I have started with this code:
    import requests
    from lxml import html

    url = raw_input('url : ')
    page = requests.get(url)
    tree = html.fromstring(page.text)
    productname = tree.xpath('//h1[@class="product-name"]/text()')
    price = tree.xpath('//span[@id="sku-discount-price"]/text()')
    print '\n' + productname[0]
    print '\n' + price[0]
You can use lxml and the csv module to do what you want. lxml supports XPath expressions, which let you select the elements you want.
    from lxml import etree
    from StringIO import StringIO
    from csv import DictWriter

    f = StringIO('''
    <html><body>
    <a class="ui-magnifier-glass" href="here goes link want extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;" ></a>
    <a href="link extract" title="title extract" rel="category tag" data-spm-anchor-id="0.0.0.0" >or maybe word instead of title</a>
    </body></html>
    ''')
    doc = etree.parse(f)

    data = []
    # Get all links with data-spm-anchor-id="0.0.0.0"
    r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')

    # Iterate through each <a></a> element
    for elem in r:
        # You can access the attributes with get()
        link = elem.get('href')
        title = elem.get('title')
        # and the text inside the tag is accessible with .text
        text = elem.text

        data.append({
            'link': link,
            'title': title,
            'text': text
        })

    with open('file.csv', 'w') as csvfile:
        fieldnames = ['link', 'title', 'text']
        writer = DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in data:
            writer.writerow(row)
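For anyone on Python 3, here is a rough sketch of the same idea built on `lxml.html.fromstring`, which is what the question's own code already uses on `page.text`. The sample markup, the `example.com` URLs, and the helper names `extract_links` and `write_rows` are made up for illustration, and it assumes the real page still carries the `data-spm-anchor-id="0.0.0.0"` attribute in the HTML that `requests` downloads.

```python
import csv
import io

from lxml import html

# Hypothetical sample markup modelled on the two tags from the question.
SAMPLE = """
<html><body>
<a class="ui-magnifier-glass" href="http://example.com/image" data-spm-anchor-id="0.0.0.0"></a>
<a href="http://example.com/category" title="Some title" rel="category tag" data-spm-anchor-id="0.0.0.0">Some text</a>
</body></html>
"""

FIELDNAMES = ['link', 'title', 'text']


def extract_links(page_text):
    """Return one dict per <a> element that carries the marker attribute."""
    tree = html.fromstring(page_text)
    rows = []
    for elem in tree.xpath('//a[@data-spm-anchor-id="0.0.0.0"]'):
        rows.append({
            'link': elem.get('href'),
            # get() returns None for a missing attribute, and .text is None
            # for an empty tag, so fall back to an empty string.
            'title': elem.get('title') or '',
            'text': (elem.text or '').strip(),
        })
    return rows


def write_rows(rows, fileobj):
    """Write the rows as CSV, with a header line, to any file-like object."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


rows = extract_links(SAMPLE)
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, something like `with open('file.csv', 'w', newline='') as csvfile: write_rows(rows, csvfile)` should work; the csv module documentation recommends `newline=''` on Python 3. If you only need the attribute values, an XPath such as `tree.xpath('//a[@class="ui-magnifier-glass"]/@href')` returns them directly as a list of strings, in the same style as the `/text()` queries in the question.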