Python - BeautifulSoup .get_text() is not specific enough for my HTML parsing

Given the HTML below, I want to output the text of the h1 but not "details ", which is the text of the span (which is encapsulated by the h1).

My current output gives:

details   new men's genuine leather bifold id credit card money holder wallet black

I would like:

new men's genuine leather bifold id credit card money holder wallet black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemtitle"><span class="g-hdn">details  &nbsp;</span>new men&#039;s genuine leather bifold id credit card money holder wallet black</h1>

Here is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

Note: I do not want to just truncate the string, because I would like the code to have some reusability. The best solution would be code that crops out any text bounded by the span.

You can use extract() to remove the unwanted span tags before calling get_text():

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Remove every span nested in the h1 so its text is left out
    for s in line('span'):
        s.extract()
    print(line.get_text())
    # => new men's genuine leather bifold id credit card money holder wallet black
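
Note that extract() modifies the parsed tree, so the span is gone for any later lookups. If that matters, one possible alternative is to join only the strings that are direct children of the h1. The sketch below is self-contained and rests on some assumptions: the helper name title_without_spans is made up for illustration, the built-in "html.parser" is just one parser choice, and the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question
HTML = ('<h1 class="it-ttl" itemprop="name" id="itemtitle">'
        '<span class="g-hdn">details  &nbsp;</span>'
        'new men&#039;s genuine leather bifold id credit card money holder wallet black</h1>')


def title_without_spans(h1):
    """Hypothetical helper: text of h1, skipping anything inside child tags."""
    # Keep only strings that are direct children of the h1.
    # Nested tags (such as the span) are skipped, and the tree is not modified.
    direct_text = (c for c in h1.children if isinstance(c, NavigableString))
    return "".join(direct_text).strip()


soup = BeautifulSoup(HTML, "html.parser")
titles = [title_without_spans(h1)
          for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]

for title in titles:
    print(title)
```

This drops the text of every nested tag, not only spans, so it fits best when the wanted text sits directly inside the h1, as it does here.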
