Python BeautifulSoup extract text between element
Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made of tags and NavigableStrings (such as THIS IS A TEXT). An example:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
To move down the parse tree you have contents and string.
- contents is an ordered list of the Tag and NavigableString objects contained within a page element.
- If a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
For the example above, that means you can get
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
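The snippets in this answer use Python 2 and the old BeautifulSoup 3 import. As a rough sketch of the same idea, assuming Python 3 with the bs4 package and the built-in html.parser (the shortened markup is mine, not from the answer), this compares string and contents[0], and also shows that string is None once a tag has more than one child:

```python
from bs4 import BeautifulSoup

# Shortened, well-formed version of the sample document (an assumption for this sketch).
html = (
    '<html><body>'
    '<p id="firstpara">This is paragraph <b>one</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# <b> has exactly one child and it is a string, so both accessors agree.
b_string = soup.b.string
b_first = soup.b.contents[0]

# <p> has three children (string, <b> tag, string), so .string is None.
p_string = soup.p.string
p_child_count = len(soup.p.contents)

print(repr(b_string), repr(b_first), p_string, p_child_count)
```

So string is only a convenience for the single-child case; for anything else you have to go through contents.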
When a tag has several child nodes, you get for instance
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
so here you can index into contents and take the child at the position you want.
You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance,
for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
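A small sketch of the iteration shortcut, again assuming Python 3 with bs4 and html.parser. The markup below is my own trimmed sample with newlines between the tags, to show that whitespace between elements also comes back as string children:

```python
from bs4 import BeautifulSoup, Tag

# Trimmed sample markup with newlines between tags (an assumption for this sketch).
html = (
    '<html><body>\n'
    '<p id="firstpara">This is paragraph <b>one</b>.</p>\n'
    '<p id="secondpara">This is paragraph <b>two</b>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Iterating over a tag walks its direct children, in the same order as .contents.
children = list(soup.body)
same_as_contents = children == soup.body.contents

# The newlines between the <p> tags are string nodes, so keep only Tag
# objects when just the child elements are wanted.
paragraph_ids = [child["id"] for child in soup.body if isinstance(child, Tag)]

print(same_as_contents, len(children), paragraph_ids)
```

If the markup has no whitespace between the tags, as in the joined doc list above, the isinstance filter makes no difference.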
Alternatively, use .children and keep only the strings that are direct children of the element:
from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment))
Yes, this is a bit of a dance.
Output:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
...
THIS IS MY TEXT
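For anyone on Python 3, here is roughly the same dance as a self-contained sketch. The markup is hypothetical (the original page is not shown in this thread), and direct_text is just a helper name I made up for illustration:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup; only the class name MYCLASS comes from the thread.
html = (
    '<div class="MYCLASS">'
    '<h1>A TITLE</h1>'
    '<!-- a comment -->'
    'THIS IS MY TEXT'
    '<span>nested text</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    return "".join(
        str(child)
        for child in tag.children
        # Comment is a NavigableString subclass, so it must be excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )


texts = [direct_text(hit).strip() for hit in soup.find_all(class_="MYCLASS")]
print(texts)
```

Text inside nested tags (the h1 and span here) is left out on purpose, since only direct children are inspected.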
You can use .contents:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
...
THIS IS MY TEXT
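Note that a hard-coded index like 6 only fits the markup it was written for. A sketch of how you might find the right index for your own page, assuming Python 3 with bs4 and a hypothetical snippet of markup (in this made-up sample the text ends up at index 2, not 6):

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup; only the class name MYCLASS comes from the thread.
html = (
    '<p class="MYCLASS">'
    '<b>Label</b>'
    '<i>note</i>'
    ' THIS IS MY TEXT '
    '<u>footer</u>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")
hit = soup.find(class_="MYCLASS")

# List each child with its index to see which position holds the wanted text.
for index, child in enumerate(hit.contents):
    print(index, type(child).__name__, repr(child))

# Or locate the first non-blank string child instead of hard-coding the index.
text_index = next(
    i for i, child in enumerate(hit.contents)
    if isinstance(child, NavigableString) and child.strip()
)
text = hit.contents[text_index].strip()
print(text_index, text)
```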