h_b_div_paragraphs = soup.html.body.div.p
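Gets the first <p> element inside the first <div> inside the <body> inside the <html> element. Dotted navigation returns only the first matching tag at each step; use find_all() to get every match.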
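A minimal sketch of the difference between dotted navigation and find_all(). The sample markup and variable names are made up for illustration, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><body>
<div><p>first</p><p>second</p></div>
<div><p>third</p></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Dotted access stops at the first match at every step.
first_p = soup.html.body.div.p
print(first_p.get_text())  # first

# To collect every <p> inside the first <div>, use find_all().
first_div_paragraphs = [p.get_text() for p in soup.html.body.div.find_all("p")]
print(first_div_paragraphs)  # ['first', 'second']
```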
div_children = soup.div.children
div_contents = soup.div.contents
These get the direct children of the element being looked at: .children is an iterator, .contents is a list.
tag.attrs
You can access a tag's attributes by treating the tag like a dictionary (tag['id']), and you can access that dictionary directly as .attrs.
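A short sketch of dictionary-style attribute access, using made-up markup. Note that class is a multi-valued attribute, so Beautiful Soup returns it as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<a id="home" class="nav main" href="/index.html">Home</a>', "html.parser")
tag = soup.a

print(tag["href"])        # /index.html
print(tag["class"])       # ['nav', 'main']
print(tag.get("title"))   # None (.get() avoids a KeyError for missing attributes)
print(sorted(tag.attrs))  # ['class', 'href', 'id']
```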
div_descendants = soup.div.descendants
This gets all descendants of the element being looked at, recursively: its children, their children, and so on, text nodes included.
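To compare direct children with all descendants, here is a small sketch on made-up markup. The Tag import is only used to separate tags from text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
soup = BeautifulSoup("<div><p>Hello <b>world</b></p><a>link</a></div>", "html.parser")
div = soup.div

# Direct children only: the <p> and the <a>.
child_names = [child.name for child in div.children]
print(child_names)  # ['p', 'a']

# .contents holds the same nodes, but as a list you can index.
print(div.contents[0].name)  # p

# .descendants walks the whole subtree, text nodes included.
descendants = list(div.descendants)
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]
print(len(descendants))   # 6 (three tags and three strings)
print(descendant_tags)    # ['p', 'b', 'a']
```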
div_link_text = soup.div.a.string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string. It returns None (the Python value, not the text 'None') if no single string is found, for example when the tag contains more than one child.
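A brief sketch of when .string gives a value and when it gives None, on made-up markup. get_text() is shown as one way to get all the text regardless.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<div><a>only text</a><p>one <b>two</b></p></div>", "html.parser")

link_text = soup.a.string   # <a> has exactly one child, a string
para_text = soup.p.string   # <p> has two children: "one " and <b>

print(link_text)            # only text
print(para_text)            # None
print(soup.p.get_text())    # one two
```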
.strings and .stripped_strings
div_text = soup.div.strings
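If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator, or .stripped_strings to skip whitespace-only strings and strip surrounding whitespace from the rest.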
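A small sketch contrasting the two generators on made-up markup that contains indentation and newlines.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with whitespace between the tags.
soup = BeautifulSoup("<div>\n  <p> first </p>\n  <p>second</p>\n</div>", "html.parser")

# .strings yields every string, including whitespace-only ones.
raw = list(soup.div.strings)
print(len(raw))  # 5

# .stripped_strings drops whitespace-only strings and strips the others.
clean = list(soup.div.stripped_strings)
print(clean)  # ['first', 'second']
```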
title = soup.title.string.parent
You can access an element's parent with the .parent attribute. The string in the title tag has a parent, the title tag.
link = soup.a
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document to the very top of the document.
.(next/previous)_(sibling/element)(s)
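The .next_sibling/.previous_sibling and .next_element/.previous_element attributes navigate between page elements and return a single element, or None if there are no more. The plural forms (.next_siblings, .previous_siblings, .next_elements, .previous_elements) are generators over all the remaining elements.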
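A minimal sketch of sibling and element navigation on a made-up list. The markup has no whitespace between tags, so the siblings are the <li> tags themselves rather than text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration (no whitespace between the tags).
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")
first = soup.li

print(first.next_sibling.string)   # two
print(first.previous_sibling)      # None (nothing comes before the first <li>)

# The plural form is a generator over all following siblings.
following = [li.string for li in first.next_siblings]
print(following)  # ['two', 'three']

# .next_element follows parse order, so it is the string inside the first <li>.
print(first.next_element)  # one
```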
.find() / .find_all() / .find_...()
(find_parent(s)(), find_(next/previous)_sibling(s)(), find_all_(next/previous)())
Each returns either the first result (None if nothing matches) or a list of all the results (empty if nothing matches).
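A short sketch of a few members of the find family and what they return when nothing matches. The markup and the 'box' id are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<div id='box'><p>a</p><p>b</p></div>", "html.parser")

first = soup.find("p")            # first match only
all_p = soup.find_all("p")        # every match, as a list-like ResultSet
print(first.string, len(all_p))   # a 2

# No match: find() gives None, find_all() gives an empty list.
print(soup.find("table"))         # None
print(soup.find_all("table"))     # []

# The variants search in other directions from a starting element.
print(first.find_parent("div")["id"])        # box
print(first.find_next_sibling("p").string)   # b
```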
The limit argument
soup.find_all("a", limit=2)
The recursive argument
soup.find_all("a", recursive=False)
Limits the returned results either to a maximum number (limit), or to only the direct children (recursive=False).
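A small sketch of both arguments on made-up markup. Note that recursive=False is relative to the element you call it on, so calling it on the soup object itself only looks at the top-level <html> tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<html><body><a>1</a><div><a>2</a><a>3</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all("a")))  # 3

# limit stops the search after that many matches.
limited = [a.string for a in soup.find_all("a", limit=2)]
print(limited)  # ['1', '2']

# recursive=False only considers direct children of the starting element.
direct = [a.string for a in soup.body.find_all("a", recursive=False)]
print(direct)  # ['1']

# The soup object's only direct child here is <html>, so nothing matches.
print(soup.find_all("a", recursive=False))  # []
```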
Changing tag names and attributes
tag.name = "blockquote"
tag['class'] = 'verybold'
Changes a tag's name or attributes (attributes are set like key-value pairs in a dictionary).
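A minimal sketch of renaming a tag and editing its attributes, on made-up markup. The 'quote-1' id is a placeholder added and then removed to show del.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<b class="boldest">text</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"      # rename the tag
tag["class"] = "verybold"    # overwrite an existing attribute
tag["id"] = "quote-1"        # add a new attribute (placeholder value)
del tag["id"]                # remove it again

print(tag)  # <blockquote class="verybold">text</blockquote>
```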
tag = soup.a
tag.string = "New link text."
Replaces the tag's contents with the string you give
.append()
Adds content to the end of a tag's contents. It works just like calling .append() on a Python list.
.new_string() and .new_tag()
You can .append() a new string or new tag to the document; create them with soup.new_string() and soup.new_tag().
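A short sketch of creating a string and a tag and appending both, on made-up markup. The example.com URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
p = soup.p

# new_string() and new_tag() are called on the soup object, not on a tag.
p.append(soup.new_string(", "))
link = soup.new_tag("a", href="http://example.com")  # placeholder URL
link.string = "world"
p.append(link)

print(p)  # <p>Hello, <a href="http://example.com">world</a></p>
```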
.insert()
Like .append(), but the tag or string is inserted at whatever numeric position you give, as with .insert() on a Python list.
.insert_before() and .insert_after()
The .insert_before() and .insert_after() methods insert a tag or string immediately before or after the target element.
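A small sketch combining .insert(), .insert_before() and .insert_after() on made-up markup, to show where each one places the new content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p><b>bold</b></p>", "html.parser")
b = soup.b

# insert() takes a position within the parent's contents.
soup.p.insert(0, "start ")

# insert_after()/insert_before() are relative to the target element.
b.insert_after(" end")
i = soup.new_tag("i")
i.string = "italic "
b.insert_before(i)

print(soup.p)  # <p>start <i>italic </i><b>bold</b> end</p>
```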
tag.clear()
Removes the contents of a tag
tag.extract()
Removes a tag or string from the tree. It returns the tag or string that was extracted
tag.decompose()
Removes a tag from the tree, then completely destroys it
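A minimal sketch of the three removal methods on made-up markup. The practical difference is that extract() hands the removed element back so it can be reused, while decompose() discards it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<div><p>keep</p><span>move</span><i>drop</i></div>", "html.parser")

# extract() removes the element and returns it, detached from the tree.
removed = soup.span.extract()
print(removed.string, removed.parent)  # move None

# decompose() removes the element and destroys it.
soup.i.decompose()
print(soup.div)  # <div><p>keep</p></div>

# clear() empties a tag but leaves the tag itself in place.
soup.p.clear()
print(soup.div)  # <div><p></p></div>
```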
tag.replace_with(replacement)
Removes a tag or string from the tree, and replaces it with the tag or string of your choice
tag.wrap()
Wraps an element in the tag you specify and returns the new wrapper
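A short sketch of replace_with() and wrap() on made-up markup. The tag names chosen for the replacement and the wrapper are arbitrary.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p>I like <b>tea</b></p>", "html.parser")

# replace_with() swaps the element and returns the one that was removed.
new_tag = soup.new_tag("i")
new_tag.string = "coffee"
old = soup.b.replace_with(new_tag)
print(old.string)  # tea
print(soup.p)      # <p>I like <i>coffee</i></p>

# wrap() puts the element inside the given tag and returns that wrapper.
wrapper = soup.i.wrap(soup.new_tag("strong"))
print(wrapper.name)  # strong
print(soup.p)        # <p>I like <strong><i>coffee</i></strong></p>
```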
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
The filters used inside these methods can take various forms: a string, a regex (re.compile("regex")), a list, True (which matches every tag it can), or a function that returns True if the right tag was found and False if not.
The function above returns True if a tag defines the class attribute but doesn't define the id attribute.
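To make the filter kinds concrete, here is a minimal sketch that applies a string, a regex, a list and True to the same made-up document and prints the tag names each one matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<html><body><b>x</b><a id="l1">y</a><p class="c">z</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

by_string = [t.name for t in soup.find_all("b")]              # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with "b"
by_list = [t.name for t in soup.find_all(["a", "p"])]         # any name in the list
by_true = [t.name for t in soup.find_all(True)]               # every tag, no strings

print(by_string)  # ['b']
print(by_regex)   # ['body', 'b']
print(by_list)    # ['a', 'p']
print(by_true)    # ['html', 'body', 'b', 'a', 'p']
```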
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))
for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
The function above returns True if a tag is surrounded by string objects.
soup.find('p', {'style': 'display:inline'})
The filters can become quite specific. Here we get the first p element whose style attribute is set to exactly 'display:inline'.
soup.find_all(href=re.compile("number"))
Or match on an attribute whose value contains a certain string (using a regex, which needs import re).
soup.find_all(class_=re.compile("ink"))
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
soup.find_all(class_=has_six_characters)
As with any keyword argument, you can pass class_ a string, a regular expression (re.compile(regex)), a function, or True.
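A small sketch of the four kinds of class_ filter on made-up markup. The class names are chosen so that each filter matches a different set of tags.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<a class="link ext">a</a><a class="button">b</a><p class="inkpot">c</p><p>d</p>'
soup = BeautifulSoup(html, "html.parser")

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

# A string matches any one of a tag's classes.
by_string = [t.string for t in soup.find_all(class_="link")]
# A regex is searched for within each class.
by_regex = [t.string for t in soup.find_all(class_=re.compile("ink"))]
# A function is called with each class value.
by_function = [t.string for t in soup.find_all(class_=has_six_characters)]
# True matches every tag that has a class attribute at all.
by_true = [t.string for t in soup.find_all(class_=True)]

print(by_string)    # ['a']
print(by_regex)     # ['a', 'c']
print(by_function)  # ['b', 'c']
print(by_true)      # ['a', 'b', 'c']
```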