How to find a specific td class in a table using XPath with lxml and Python


I am trying to use Python and lxml to import a list of text from a page. Here is what I have so far.

test_page.html source:

    <html>
    <head>
        <title>test</title>
    </head>
    <body>
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tbody>
        <tr><td><a title="this page cool" class="producttitlelink" href="about:mozilla">this page cool</a></td></tr>
        <tr height="10"></tr>
        <tr><td class="plaintext">this cool description cool page.</td></tr>
        <tr><td class="plaintext">published: 7/15/15</td></tr>
        <tr><td class="plaintext">        </td></tr>
        <tr><td class="plaintext">       </td></tr>
        <tr><td class="plaintext">       </td></tr>
        <tr><td class="plaintext">      </td></tr>
    </tbody>
    </table>
    </body>

Python code:

    from lxml import html
    import requests

    page = requests.get('http://127.0.0.1/test_page.html')
    tree = html.fromstring(page.text)
    description = tree.xpath('//table//td[@class="plaintext"]/text()')

    >>> print(description)
    ['this cool description cool page.', 'published: 7/15/15', '\n\t\t\n\t\t\t\t\n\t\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t']

However, my desired end result is:

['this cool description cool page. published: 7/15/15'] 

I had thought that using [1], as in

tree.xpath('//table//td[@class="plaintext"][1]/text()')  

might allow me to receive only the first line:

['this cool description cool page.']  

However, it still pulls the entire list.

Is there a way to select a single line, or a list of lines, using XPath on HTML?

You can try it this way:

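    from lxml import html

    source = """html posted in question here"""
    tree = html.fromstring(source)

    # keep only the text nodes that contain something other than whitespace
    tds = tree.xpath('//table//td[@class="plaintext"]/text()[normalize-space()]')

    # join the remaining pieces into a single string
    description = ' '.join(tds)
    print(description)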

The XPath predicate [normalize-space()] applied to text() returns only the text nodes that contain non-whitespace characters.
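To see the effect of the predicate in isolation, here is a minimal sketch that runs the same query with and without it. The inline HTML is a trimmed-down stand-in for the question's page (an assumption made for brevity, not the full document), and the variable names are only illustrative.

```python
from lxml import html

# Trimmed-down stand-in for the question's table (assumption: shortened copy).
source = """<html><body>
<table>
  <tr><td class="plaintext">this cool description cool page.</td></tr>
  <tr><td class="plaintext">published: 7/15/15</td></tr>
  <tr><td class="plaintext">
      </td></tr>
</table>
</body></html>"""

tree = html.fromstring(source)

# Without the predicate, the whitespace-only text node is returned as well.
all_text = tree.xpath('//td[@class="plaintext"]/text()')

# normalize-space() strips and collapses whitespace; for a whitespace-only
# node the result is an empty string, which XPath treats as false, so the
# node is filtered out.
non_blank = tree.xpath('//td[@class="plaintext"]/text()[normalize-space()]')

print(len(all_text))        # number of text nodes before filtering
print(non_blank)            # only the nodes with visible text
print(' '.join(non_blank))  # joined into a single description string
```

Filtering inside the XPath keeps the Python side down to a single join; stripping and filtering the list in Python afterwards would probably work just as well if you prefer that.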

Using the HTML posted in the question, the output of the above code is as requested:

this cool description cool page. published: 7/15/15 
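As for why [1] returned the entire list: in XPath a predicate binds to the step it follows, so td[...][1] means "the first matching td within each parent tr", and every row in that table has exactly one such td. Wrapping the whole path in parentheses makes [1] apply to the combined result instead. A small sketch of the difference, again using a shortened stand-in for the question's HTML (an assumption, not the full page):

```python
from lxml import html

# Shortened stand-in for the question's table (assumption: trimmed copy).
source = """<html><body>
<table>
  <tr><td class="plaintext">this cool description cool page.</td></tr>
  <tr><td class="plaintext">published: 7/15/15</td></tr>
</table>
</body></html>"""

tree = html.fromstring(source)

# [1] is evaluated per parent row, so each row contributes its first match.
per_row = tree.xpath('//table//td[@class="plaintext"][1]/text()')

# Parentheses collect all matches first, then [1] picks the first overall.
first_only = tree.xpath('(//table//td[@class="plaintext"])[1]/text()')

print(per_row)
print(first_only)
```

Since xpath() returns an ordinary Python list here, indexing or slicing the result (for example description[0]) is another way to pick out individual lines.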
