I am trying to use Python and lxml to import a list of text from a page. Here is what I have so far.
test_page.html source:
    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tbody>
    <tr><td><a title="this page cool" class="producttitlelink" href="about:mozilla">this page cool</a></td></tr>
    <tr height="10"></tr>
    <tr><td class="plaintext">this cool description cool page.</td></tr>
    <tr><td class="plaintext">published: 7/15/15</td></tr>
    <tr><td class="plaintext"> </td></tr>
    <tr><td class="plaintext"> </td></tr>
    <tr><td class="plaintext"> </td></tr>
    <tr><td class="plaintext"> </td></tr>
    </tbody>
    </table>
    </body>
    </html>

Python code:
    from lxml import html
    import requests

    page = requests.get('http://127.0.0.1/test_page.html')
    tree = html.fromstring(page.text)
    description = tree.xpath('//table//td[@class="plaintext"]/text()')

    >>> print(description)
    ['this cool description cool page.', 'published: 7/15/15', '\n\t\t\n\t\t\t\t\n\t\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t']
    >>>

However, my desired end result is:

    ['this cool description cool page. published: 7/15/15']

I had thought that using [1], as in

    tree.xpath('//table//td[@class="plaintext"][1]/text()')

might allow me to receive only the first line:

    ['this cool description cool page.']

However, it still pulls the entire list.

Is there a way to specify a single line, or a list of lines, using XPath on HTML?
You can try it this way:

    from lxml import html

    source = """html posted in question here"""
    tree = html.fromstring(source)
    # Keep only text nodes that contain something other than whitespace.
    tds = tree.xpath('//table//td[@class="plaintext"]/text()[normalize-space()]')
    description = ' '.join(tds)
    print(description)

The XPath predicate [normalize-space()] applied to text() returns only the non-whitespace text nodes.

Using the HTML posted in the question, the output of the above code is as requested:
this cool description cool page. published: 7/15/15
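
On the [1] attempt from the question: in XPath 1.0 a positional predicate on a step such as td[...][1] is evaluated relative to each parent, and every td here is the only plaintext cell in its own tr, so all of them match. Wrapping the path in parentheses applies the index to the whole result in document order instead. The following is a minimal sketch of that idea, assuming a trimmed copy of the test page (the SOURCE string and variable names are illustrative, not from the original post):

```python
from lxml import html

# Trimmed copy of the question's test page. Assumption: the empty cells
# contain only whitespace, as the printed output in the question suggests.
SOURCE = """
<html><body>
<table>
  <tbody>
    <tr><td><a class="producttitlelink" href="about:mozilla">this page cool</a></td></tr>
    <tr><td class="plaintext">this cool description cool page.</td></tr>
    <tr><td class="plaintext">published: 7/15/15</td></tr>
    <tr><td class="plaintext">
    </td></tr>
  </tbody>
</table>
</body></html>
"""

tree = html.fromstring(SOURCE)

# [1] on the td step is evaluated per parent <tr>, so every cell matches.
per_parent = tree.xpath('//table//td[@class="plaintext"][1]/text()')

# Parentheses make [1] index the whole node-set in document order.
first = tree.xpath('(//table//td[@class="plaintext"])[1]/text()')

# position() selects a range of lines from the same node-set.
first_two = tree.xpath('(//table//td[@class="plaintext"])[position() <= 2]/text()')

print(per_parent)
print(first)
print(first_two)
```

With this input, first should hold only the description line and first_two the description and the published line, while per_parent still contains every cell's text, including the whitespace-only one.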