Python BeautifulSoup extract text between element

To extract the text, it helps to know how to navigate the parse tree in BeautifulSoup. The parse tree is made of Tag objects and NavigableString objects (a string such as THIS IS A TEXT is a NavigableString). An example:

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree, use contents and string:

  • contents is an ordered list of the Tag and NavigableString objects
    contained within a page element

  • if a tag has only one child node, and that child node is a string,
    the child node is made available as tag.string, as well as
    tag.contents[0]

For the document above, this means you get:

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

When a tag has several child nodes, contents holds all of them. For instance:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

Here you can index into contents to pick out the child you want.
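
The same navigation can be sketched for the newer bs4 package on Python 3. This is a minimal sketch under assumptions: bs4 is installed, the built-in html.parser is used, and the markup is a simplified copy of the first paragraph above with the closing </p> written out.

```python
from bs4 import BeautifulSoup

# Simplified, explicitly closed version of the first paragraph above.
html = '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.p

# .contents lists the direct children in document order:
# a string, the <b> tag, and another string.
children = p_tag.contents
first_text = children[0]   # NavigableString before <b>
bold_tag = children[1]     # the <b> Tag
last_text = children[2]    # NavigableString after <b>

# <b> has exactly one child and it is a string, so .string returns it.
bold_text = bold_tag.string

# <p> has several children, so .string is None.
p_string = p_tag.string

print(repr(first_text), bold_tag, repr(last_text))
print(bold_text, p_string)
```

Note that .string is only useful on a tag with a single string child; for mixed content like the paragraph here, contents is the way in.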

You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
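
A rough Python 3 equivalent, assuming bs4 with the built-in html.parser and a version of the markup in which both paragraphs are explicitly closed:

```python
from bs4 import BeautifulSoup

html = (
    "<body>"
    "<p>This is paragraph <b>one</b>.</p>"
    "<p>This is paragraph <b>two</b>.</p>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Iterating over a tag yields its direct children, in document order.
paragraphs = [str(child) for child in soup.body]

# The same children are available as a list through .contents.
same_children = list(soup.body) == soup.body.contents

for paragraph in paragraphs:
    print(paragraph)
```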

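To get only the text that sits directly inside an element, use .children and keep the NavigableString children that are not comments: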

from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT
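
For readers on Python 3 with bs4, the same filter might look like the sketch below. The markup is hypothetical, since the original page is not shown here; only the class name MYCLASS and the target text come from the snippets above. The helper name direct_text is made up for illustration, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup: the wanted text sits directly inside the div,
# next to child tags and a comment that should be ignored.
html = """
<div class="MYCLASS">
  <b>Label</b>
  <!-- a comment -->
  THIS IS MY TEXT
  <span>more</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Join the tag's own strings, skipping child tags and comments."""
    return "".join(
        str(child)
        for child in tag.children
        # Comment is a subclass of NavigableString, so exclude it explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )


for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    # strip() drops the whitespace left over from the surrounding markup.
    print(direct_text(hit).strip())
```

Calling strip() on the result removes the blank lines that appear in the output above.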

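Alternatively, if the text always sits at the same position among the element's children, you can index into .contents directly (the right index depends on the markup):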

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
... 
THIS IS MY TEXT
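
One way to find the right index is to list the children with their positions first. The sketch below assumes Python 3 with bs4 and the built-in html.parser, and uses hypothetical markup in which the text lands at index 2 rather than 6.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between nodes so the
# child positions are easy to count.
html = (
    '<div class="MYCLASS">'
    "<b>Label</b><br/>THIS IS MY TEXT<span>more</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
hit = soup.find(attrs={"class": "MYCLASS"})

# List every direct child with its position to find the right index.
for index, child in enumerate(hit.contents):
    print(index, repr(child))

# In this markup the wanted string is the third child (index 2).
text = hit.contents[2].strip()
print(text)
```

Indexing is short but brittle: any change in the markup, such as added whitespace or a new tag, shifts the positions.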
