No Soup for You: When Beautiful Soup Doesn’t Like Your XML

Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy’s landing page for the library even says:

Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

You can plot me directly in the “days” column. When I’m starting a Python project that requires parsing HTML, the first dependency I’ll pull is BeautifulSoup. It turns what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I’ve learned that BS can be a huge boon for XML data too, but not without a couple of speed bumps.

Enter data from the client (not real data, but play along for a moment):

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <link>
          <url>http://atlanta.braves.mlb.com</url>
        </link>
    </team>
  </sport>
</xml>

Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.

from __future__ import print_function
from bs4 import BeautifulSoup

# Make our soup
with open('data.xml') as infile:
    blob = infile.read()
# Use LXML for blazing speed
soup = BeautifulSoup(blob, 'lxml')
print(soup.find_all('link'))

We get in return:

[<link/>]

Wait, what? What happened to our content? This is a pretty basic BS use case, but something strange is happening. Let’s set that aside for a moment and move on to the client’s other (equally hypothetical) data sets, where <link> tags become <resource> tags but the rest of the data is structured exactly the same. This time around:

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <resource><!-- Changed to resource -->
          <url>http://atlanta.braves.mlb.com</url>
        </resource><!-- Corresponding close -->
    </team>
  </sport>
</xml>

…and corresponding result…

>>> print(soup.find_all('resource'))
[<resource>
<url>http://atlanta.braves.mlb.com</url>
</resource>]

Interesting! To compound our problem, we’re on a customer site with no internet access to grab that sweet, sweet documentation we crave. After toiling through a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO’s founder, Jeff Atwood: Read the Source. But what was I looking for? Well, let’s dig around for <link> tags and see what turns up.

Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home: bs4/builder/__init__.py, lines 228–229 in v4.3.2).

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                          'spacer', 'link', 'frame', 'base'])

We have a seemingly harmless word, “link”, in our XML, but it means something very different in HTML – and, more specifically, to the TreeBuilder implementation that lxml is using: it’s treated as an empty element, so its contents are discarded. As a test, if I change our <link>-turned-<resource> tags into <base> tags (another entry in that set), we get the same result – no content. It also turns out that if you have lxml installed, BeautifulSoup4 will pick it for parsing: the stricter (but faster) TreeBuilder implementations from lxml take precedence over the built-in HTMLParser and html5lib (if you have it installed). Uninstalling lxml grants us the results we want – tags with content. How do we know that? Back to the source code!

bs4/builder/__init__.py, lines 304:321

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last result.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass

As it turns out, when creating your soup, ‘lxml’ != ‘xml’. Changing the soup creation gets us the results we’re looking for (UPDATE: corresponding doc “helpfully” pointed out by a Reddit commenter here). With ‘lxml’, BeautifulSoup was still using lxml’s HTML builder – which is why we were seeing the results we were – while ‘xml’ selects lxml’s XML builder instead.

# Use the XML parser for sanity
soup = BeautifulSoup(blob, 'xml')
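
To see the difference side by side, here’s a quick check against the same data.xml (a minimal sketch; note that the ‘xml’ parser also relies on lxml being installed):

from bs4 import BeautifulSoup

with open('data.xml') as infile:
    blob = infile.read()

# lxml's HTML builder treats <link> as an empty element and drops its children
print(BeautifulSoup(blob, 'lxml').find_all('link'))
# [<link/>]

# The XML builder has no notion of empty HTML elements, so the content survives
print(BeautifulSoup(blob, 'xml').find_all('link'))
# Roughly: [<link><url>http://atlanta.braves.mlb.com</url></link>]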

I didn’t find that one magic code snippet to fix everything on my own (UPDATE: thanks, Reddit), and we took a pretty roundabout route to our problem, but understanding why it was happening made me feel a lot better in the end. It’s easy to get frustrated when coding, but always remember: read the docs and – Read the Source, Luke. It might help you understand the problem.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

Supercharging your reddit API access

Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we’ve recently started working with is reddit.

Compared to the walled gardens of Facebook and LinkedIn, reddit’s API is as open as open can be; everything is nice and RESTful, rate limits are sane, the developers are open to enhancement requests, and one can do quite a bit without needing to authenticate.
The most common objects we collect from reddit are submissions (posts) and comments. A submission can either be a link, or a self post with a text body, and can have an arbitrary number of comments. Comments contain text, as well as references to parent nodes (if they’re not root nodes in the comment tree). Pulling this data is as simple as GET http://www.reddit.com/r/washingtondc/new.json. (Protip: pretty much any view in reddit has a corresponding API endpoint that can be generated by appending ‘.json’ to the URL.)
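
For example, that endpoint can be hit directly without any wrapper at all (a quick sketch using the requests library; the structure is the standard reddit listing JSON):

import requests

# Appending '.json' to a reddit view returns its API representation
resp = requests.get('http://www.reddit.com/r/washingtondc/new.json',
                    headers={'User-Agent': 'Hello world application.'})
for child in resp.json()['data']['children']:
    print(child['data']['title'])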

With little effort a developer could hack together a quick ‘n dirty reddit scraper. However, as additional features appear and collection breadth grows, the quick ‘n dirty scraper becomes more dirty than quick, and you discover bugs (er, “features”) that others utilizing the API have already encountered and possibly addressed. API wrappers help consolidate communal knowledge and best practices for the good of all. We considered several and, being a Python shop, settled on PRAW (the Python Reddit API Wrapper).

With PRAW, getting a list of posts is pretty easy:

import praw
r = praw.Reddit(user_agent='Hello world application.')
for post in r.get_subreddit('WashingtonDC') \
             .get_hot(limit=5):
    print(str(post))

$ python parse_bot_2000.py
209 :: /r/WashingtonDC's Official Guide to the City!
29 :: What are some good/active meetups in DC that are easy to join?
17 :: So no more tailgating at the Nationals park anymore...
3 :: Anyone know of a juggling club in DC
2 :: The American Beer Classic: Yay or Nay?

The Problem

Now, let’s try something a little more complicated. Our mission, if we choose to accept it, is to capture all incoming comments to a subreddit. For each comment we should collect the author’s username, the URL for the submission, a permalink to the comment, as well as its body.

Here’s what this looks like:

import praw
from datetime import datetime
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.permalink,
        comment.author.name,
        comment.body_html,
        comment.submission.url)

That was pretty easy. For the sake of this demo, the save_comment function has been stubbed out to print its arguments, but anything can go there.
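
If you’re curious what a real sink might look like, here’s a hypothetical save_comment that appends each record to a JSON-lines file (purely illustrative; a database or message queue would work just as well):

import json

def save_comment(permalink, author, body_html, url):
    # Hypothetical sink: write one comment per line as JSON
    record = {'permalink': permalink,
              'author': author,
              'body_html': body_html,
              'url': url}
    with open('comments.jsonl', 'a') as outfile:
        outfile.write(json.dumps(record) + '\n')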

If you run the snippet, you’ll observe the following pattern:

... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
(repeating...)

The process also seems to take far longer than a plain HTTP request would. As anyone working with large amounts of data should, let’s quantify it.

Using the wonderful, indispensable IPython:

In [1]: %%timeit
import requests
requests.get('http://www.reddit.com/r/python/comments.json?limit=200')
   ....:
1 loops, best of 3: 136 ms per loop

In [2]: %%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.permalink,
          comment.author.name,
          comment.body_html,
          comment.submission.url)
1 loops, best of 3: 6min 43s per loop

Ouch. While this difference in run-times is fine for a one-off, contrived example, such inefficiency is disastrous when dealing with large volumes of data. What could be causing this behavior?

Digging

According to the PRAW documentation,

Each API request to Reddit must be separated by a 2 second delay, as per the API rules. So to get the highest performance, the number of API calls must be kept as low as possible. PRAW uses lazy objects to only make API calls when/if the information is needed.

Perhaps we’re doing something that is triggering additional HTTP requests. Such behavior would explain the intermittent printing of comments to the output stream. Let’s verify this hypothesis.

To see the underlying requests, we can override PRAW’s default log level:

from datetime import datetime
import logging
import praw

logging.basicConfig(level=logging.DEBUG)

r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.author.name,
                 comment.submission.url,
                 comment.permalink,
                 comment.body_html)

And what does the output look like?

DEBUG:requests.packages.urllib3.connectionpool:"PUT /check HTTP/1.1" 200 106
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ak14j.json HTTP/1.1" 200 888
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aies0.json HTTP/1.1" 200 2889
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aiier.json HTTP/1.1" 200 14809
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ajam1.json HTTP/1.1" 200 1091
.. comment ..
.. comment ..
.. comment ..

Those intermittent requests – one extra fetch per comment – back up our claim. Now, let’s see what’s causing this.

Prettifying the response JSON from one of those individual requests yields the following (edited for brevity):

{
   "kind": "Listing",
   "data": {
      "children": [
         {
            "kind": "t3",
            "data": {
               "id": "2alblh",
               "media": null,
               "score": 0,
               "permalink": "/r/Python/comments/2alblh/django_middleware_that_prints_query_stats_to/",
               "name": "t3_2alblh",
               "created": 1405227385.0,
               "url": "https://github.com/bradmontgomery/django-querycount",
               "title": "Django middleware that prints query stats to stderr after each request. pip install django-querycount"
            }
         }
      ]
   }
}

Let’s compare that to what we get when listing comments from the /r/python/comments endpoint:

{
   "kind": "Listing",
   "data": {
      "children": [
         {
            "kind": "t1",
            "data": {
               "link_title": "Django middleware that prints query stats to stderr after each request. pip install django-querycount",
               "link_author": "mrrrgn",
               "author": "mrrrgn",
               "parent_id": "t3_2alblh",
               "body": "Try django-devserver for query counts, displaying the full queries, profiling, reporting memory usage, etc. \n\nhttps://pypi.python.org/pypi/django-devserver",
               "link_id": "t3_2alblh",
               "name": "t1_ciwbo37",
               "link_url": "https://github.com/bradmontgomery/django-querycount"
            }
         }
      ]
   }
}

Now we’re getting somewhere – there are fields in the per-comment response that aren’t in the subreddit listing. Of the four fields we’re collecting, the submission URL and the comment permalink aren’t available as ready-made attributes from the subreddit comments endpoint, so accessing those properties causes PRAW’s lazy evaluation to fire off additional requests. If we can infer these values from the data we already have, we can avoid wasting time on a query for each comment.

Doing Work

Submission URLs

Submission URLs are a combination of the subreddit name, the post ID, and title. We can easily get the post ID fragment:

post_id = comment.link_id.split('_')[1]

However, no URL-ready slug of the title is returned. Luckily, it turns out that we don’t need one.

subreddit = 'python'
post_id = comment.link_id.split('_')[1]
url = 'http://reddit.com/r/{}/{}/' \
          .format(subreddit, post_id)
print(url) # A valid submission permalink!
# OUTPUT: http://reddit.com/r/python/2alblh/

Great! This also gets us most of the way to constructing the second URL we need – a permalink to the comment.

Comment Permalinks

Maybe we can append the comment’s ID to the end of the submission URL?

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = sub_comments_url + comment_id
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/ciwbo37

Sadly, that URL doesn’t work because reddit expects a segment for the submission’s title slug to precede the comment ID. Referring to the subreddit comment’s JSON object, we can see that no URL-ready slug is returned. This is curious: why is the title important? Reddit already has a globally unique ID for the post and can display it just fine without the title (as demonstrated by the code sample immediately preceding this). Perhaps reddit wanted to make it easier for users to identify a link and is just parsing a forward-slash-delimited series of parameters. If so, any placeholder in the title’s position should do, and putting the comment ID in its usual spot should give us a valid URL. Let’s give it a shot:

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = '{}-/{}'.format(sub_comments_url, comment_id)
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/-/ciwbo37

Following that URL takes us to the comment!
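
Putting the pieces together, the collection loop can now build both URLs from fields that already ship with the listing (a sketch under the same assumptions as before – the subreddit name is known up front, and save_comment is our print stub from earlier):

from datetime import datetime
import praw

SUBREDDIT = 'python'
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit(SUBREDDIT).get_comments(limit=200):
    # Both IDs come with the listing, so no lazy per-comment requests fire
    post_id = comment.link_id.split('_')[1]    # 't3_2alblh'  -> '2alblh'
    comment_id = comment.name.split('_')[1]    # 't1_ciwbo37' -> 'ciwbo37'
    submission_url = 'http://reddit.com/r/{}/{}/'.format(SUBREDDIT, post_id)
    permalink = 'http://reddit.com/r/{}/comments/{}/-/{}'.format(
        SUBREDDIT, post_id, comment_id)
    save_comment(permalink, comment.author.name,
                 comment.body_html, submission_url)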

Victory Lap

Let’s see how much we’ve improved our execution time:

%%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.author.name,
          comment.body_html,
          comment.id,
          comment.submission.id)
1 loops, best of 3: 3.57 s per loop

Wow! 403 seconds to 3.6 seconds – a factor of 111. Deploying this improvement to production not only increased the volume of data we were able to process, but also provided the side benefit of reducing the number of 504 errors we encountered during reddit’s peak hours. Remember, always be on the lookout for ways to improve your stack. A bunch of small wins can add up to something significant.

[Does this sort of stuff interest you? Love hacking and learning new things? Good news – we’re hiring!]