No Soup for You: When Beautiful Soup Doesn’t Like Your XML

Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy’s landing page for the library even says:

Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

You can plot me directly in the “days” column. When I’m starting a Python project that requires me to parse HTML data, the first dependency I’ll pull in is BeautifulSoup. It turns what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I’ve learned that BS can be a huge boon for XML data too, but not without a couple of speed bumps.
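For the uninitiated, here’s a minimal sketch of that happy path on HTML (the markup below is invented purely for illustration):

from __future__ import print_function
from bs4 import BeautifulSoup

html = '<html><body><a href="http://www.crummy.com">Beautiful Soup</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Traversal and search are one-liners -- no regex surgery required
for anchor in soup.find_all('a'):
    print(anchor['href'], anchor.get_text())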

Enter data from the client (not real data, but play along for a moment):

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <link>
          <url>http://atlanta.braves.mlb.com</url>
        </link>
    </team>
  </sport>
</xml>

Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.

from __future__ import print_function
from bs4 import BeautifulSoup

# Make our soup
with open('data.xml') as infile:
    blob = infile.read()
# Use LXML for blazing speed
soup = BeautifulSoup(blob, 'lxml')
print(soup.find_all('link'))

We get in return:

[<link/>]

Wait, what? What happened to our content? This is a pretty basic Beautiful Soup use case, but something strange is happening.
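Prettifying the whole soup, rather than just the search results, at least shows that the content didn’t vanish entirely. A quick diagnostic sketch, continuing from the soup we just made (exact output will vary by parser version):

# Inspect the full tree rather than just the search hits
print(soup.prettify())
# The <url> tag and its text are still in the tree -- they just
# aren't nested inside <link> anymore

Well, I’ll come back to this and start working with the client’s other, equally hypothetical data sets, where <link> tags become <resource> tags but the rest of the data is structured exactly the same. This time around: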

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <resource><!-- Changed to resource -->
          <url>http://atlanta.braves.mlb.com</url>
        </resource><!-- Corresponding close -->
    </team>
  </sport>
</xml>

…and corresponding result…

>>> print(soup.find_all('resource'))
[<resource>
<url>http://atlanta.braves.mlb.com</url>
</resource>]

Interesting! To compound our problem, we’re on a customer site where we don’t have internet access to grab that sweet, sweet documentation we crave. After toiling through a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO co-founder Jeff Atwood: Read the Source. But what was I looking for? Well, let’s dig around for <link> tags and see what turns up.

Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home: bs4/builder/__init__.py, lines 228-229 in v4.3.2).

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                          'spacer', 'link', 'frame', 'base'])

We have a seemingly harmless word, “link”, in our XML, but it means something very different in HTML – and, more specifically, to the TreeBuilder implementation that lxml is using. As a test, if I change our <link>-turned-<resource> tags into <base> tags (another member of that set), we get the same result – no content. It also turns out that if you have lxml installed, BeautifulSoup4 will prefer it for parsing; uninstalling it grants us the results we want – tags with content. The stricter (but faster) TreeBuilder implementations from lxml take precedence over the built-in HTMLParser and over html5lib (if you have it installed).
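We can put the competing tree builders side by side with a quick experiment. Here’s a minimal sketch, assuming lxml is installed alongside the standard library parser (the exact rendering of the results can vary with parser versions):

from __future__ import print_function
from bs4 import BeautifulSoup

blob = '''<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <link>
          <url>http://atlanta.braves.mlb.com</url>
        </link>
    </team>
  </sport>
</xml>'''

# Same document, three different feature strings. lxml's HTML
# builder closes <link> immediately (a void element in HTML),
# while the stdlib parser and the XML builder keep its children.
for features in ('html.parser', 'lxml', 'xml'):
    soup = BeautifulSoup(blob, features)
    print(features, '->', soup.find_all('link'))

How do we know about that precedence ordering? Back to the source code!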

bs4/builder/__init__.py, lines 304-321

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last resort.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass
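The effect of that registration order is easy to poke at from a REPL. Here’s a minimal sketch using bs4’s builder registry (which classes come back depends on what you have installed):

from bs4.builder import builder_registry

# Feature strings resolve to tree builder classes via the registry.
# With lxml installed, both 'html' and 'lxml' resolve to lxml's
# HTML builder, while 'xml' resolves to its XML builder.
print(builder_registry.lookup('html'))
print(builder_registry.lookup('lxml'))
print(builder_registry.lookup('xml'))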

As it turns out, when creating your soup, ‘lxml’ != ‘xml’. The ‘lxml’ feature string selects lxml’s HTML tree builder, so BeautifulSoup was still handing our data to an HTML builder – which is why we were seeing those results when specifying ‘lxml’. Changing the soup creation gets us the results we’re looking for (UPDATE: the corresponding documentation was “helpfully” pointed out by a Reddit commenter).

# Use XML for sanity
soup = BeautifulSoup(blob, 'xml')
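Putting it all together, here’s a corrected sketch of the original script (pulling each nested <url>’s text is just one way to hand the links back to the customer):

from __future__ import print_function
from bs4 import BeautifulSoup

with open('data.xml') as infile:
    blob = infile.read()

# 'xml' selects lxml's XML tree builder, which has no notion of
# HTML void elements, so <link> keeps its children
soup = BeautifulSoup(blob, 'xml')
for link in soup.find_all('link'):
    print(link.url.get_text())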

While I didn’t find that magic code snippet to fix everything on my own (UPDATE: Thanks, Reddit), we found our problem, even if we took a really roundabout route to get there. Understanding why it was happening made me feel a lot better in the end. It’s easy to get frustrated when coding, but always remember: read the docs and – Read the Source, Luke. It might just help you understand the problem.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.
