Tag Archives: python

No Soup for You: When Beautiful Soup Doesn’t Like Your XML

Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy’s landing page for the library even says:

Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

You can plot me directly in the “days” column. When I’m starting a Python project that requires me to parse through HTML data, the first dependency I’ll pull is BeautifulSoup. It turns what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I’ve also learned that BS can be a huge boon for XML data, but not without a couple of speed bumps.

Enter data from the client (not real data, but play along for a moment):

  <sport name="baseball">
    <team name="Braves" city="Atlanta">

Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.

from __future__ import print_function
from bs4 import BeautifulSoup

# Make our soup
with open('data.xml') as infile:
    blob = infile.read()
# Use LXML for blazing speed
soup = BeautifulSoup(blob, 'lxml')

We get in return:


Wait, what? What happened to our content? This is a pretty basic BS use case, but something strange is happening. Well, I’ll come back to this and start working with their other very hypothetical data sets where <link> tags become <resource> tags, but the rest of the data is structured exactly the same. This time around:

  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <resource><!-- Changed to resource -->
        </resource><!-- Corresponding close -->

…and corresponding result…

>>> print(soup.find_all('resource'))

Interesting! To compound our problem, we’re on a customer site where we don’t have internet access to grab that sweet, sweet documentation we crave. After toiling on a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO’s founder, Jeff Atwood. Read the Source. But what was I looking for? Well, let’s dig around for <link> tags and see what turns up.

Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home, bs4/builder/__init__.py, lines 228/229 in v4.3.2).

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                          'spacer', 'link', 'frame', 'base'])

We have a seemingly harmless “link” tag in our XML, but it means something very different in HTML and, more specifically, in the TreeBuilder implementation that LXML is using: tags in this set are treated as empty elements, so they never hold content. As a test, if I change our <link>-turned-<resource> tags into <base> tags, we get the same result – no content. It also turns out that if you have LXML installed, BeautifulSoup4 will default to it for parsing. Uninstalling it grants us the results we want – tags with content. The stricter (but faster) TreeBuilder implementations from LXML take precedence over the built-in HTMLParser or html5lib (if you have it installed). How do we know that? Back to the source code!

bs4/builder/__init__.py, lines 304:321

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last result.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass

As it turns out, when creating your soup, ‘lxml’ != ‘xml’. Specifying ‘lxml’ selects LXML’s HTML parser, so BeautifulSoup was still using an HTML builder, which is why we saw the results we did. Changing the soup creation to ‘xml’ gets us the results we’re looking for (UPDATE: corresponding doc “helpfully” pointed out by a Reddit commenter here).

# Use XML for sanity
soup = BeautifulSoup(blob, 'xml')
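
If LXML isn’t available at all (say, on a machine with no internet access), the standard library can act as a sanity check. The sketch below is not Beautiful Soup: it swaps in xml.etree.ElementTree, which knows nothing about HTML’s empty-element tags, so a <link> element keeps its content. The sample document, its URL, and the extract_links helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the client's data; the URL is invented.
SAMPLE = """
<sport name="baseball">
  <team name="Braves" city="Atlanta">
    <link>http://example.com/braves</link>
  </team>
</sport>
"""


def extract_links(xml_text, tag='link'):
    """Return the text content of every <tag> element in the document."""
    root = ET.fromstring(xml_text.strip())
    return [elem.text for elem in root.iter(tag)]


links = extract_links(SAMPLE)
print(links)
```

Because a pure XML parser has no list of special tag names, renaming the tag to <resource> or <base> makes no difference to the result.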

I never did find that magic code snippet to fix everything on my own (UPDATE: Thanks Reddit). We found our problem, but took a really roundabout way to get there. Understanding why it was happening made me feel a lot better in the end. It’s easy to get frustrated when coding, but always remember: read the docs and – Read the Source, Luke. It might help you understand the problem.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

(Machine) Learning About Love: Who will Leave The Bachelor next?

What is love?  Can you define love, or do you merely recognize its presence or absence without binding it in a definition?  A computer may be able to provide a definition of love, regurgitated from a database, but can a computer recognize love?  If defining love is so difficult for us, many of whom might cop out with “I know it when I see it”, how much more difficult will a computer find this challenge?  On a lark, I decided to implement a machine learning system to predict a few things about a television show about love, ABC’s “The Bachelor”.

My fiancée regularly watches The Bachelor; valuing our relationship, I am sometimes drawn in.  The premise of the show is this: Chris Soules (The Bachelor), a single man, acts as the sole decision maker in a competition among 30 eligible women to identify his potential bride.  Each week, one or more of the 30 women are eliminated as the bachelor gets to know them.  This is the 19th season of the show (there have been 10 iterations of its partner show, “The Bachelorette”).

If we can teach a computer to recognize some measure of love, it is certainly possible to predict who will depart the show next.  Regardless of how we frame this question, there are challenges.  First off, we have no clue how the Chris Soules “mental machine” works.  More importantly, the 30 data points we have (one per contestant) contain far too many features (most of which aren’t quantifiable) of which we only know a dozen or so.  Given a dozen features, we cannot build a statistically reliable model for prediction.  For the sake of entertainment and self-enrichment, I decided to make an attempt in the face of these challenges.

Machine Learning Techniques

I decided to use a Decision Tree so that I could inspect the deciding factors myself, and then expanded to a Support Vector Machine as well.  An understanding of the SVM coupled with a quick glance at the Scikit-Learn algorithm cheat-sheet drove my decision to incorporate both.

Data Collection

I first wrote a scraper to pull some data off the open Internet about each candidate: a photograph, age, and some ancillary data.  From the photograph, I manually identified features such as ethnicity as well as hair types (length, curliness, color).  From the ancillary data, I acquired other quantifiable numbers: number of tattoos and age as well as whether or not that contestant had been eliminated.

Each individual is encoded in JSON like so:

     "hair_wavy": "straight",
     "hometown_name": "Hamilton",
     "height_inches": 64,
     "age": 24,
     "num_tattoos": 0,
     "hair_color": "dark",
     "name": "Alissa Giambrone",
     "hair_length": "chest",
     "date_fear": " Running into recent exes",
     "occupation": "Flight Attendant",
     "likes": [
     "free_text": " If I never had to upset others, I would be very happy.  If I never got to play with puppies, I would be very sad.  If you could be any animal, what would you be? A wild mustang. Free to run and explore, they're unpredictable and beautiful, and are loyal to their herd.  If you won the lottery, what would you do with your winnings? Adopt dogs and charter a jet for my friends anf fmaily to fly around Europe, with unlimited champagne and a hot air balloon ride over Greece.  What's your most embarrassing moment? I was in-depth stalking a guy's Facebook page and sent my friend a long, detailed text about my findings...except I sent it to him. Oops.  What is your greatest achievement to date? Getting my yoga certification because I've been able to inspire others to do yoga and become instructors!  ",
     "eliminated": true,
     "photo_url": "http://static.east.abc.go.com/service/image/ratio/id/90fe9103-e788-418d-9eb1-4204b7ba1b96/dim/690.1x1.jpg?cb=51351345661",
     "hometown_state": " NJ",
     "ethnicity": "caucasian",
     "goneweek": 2,
     "featured": true,
     "featured_num": 6,
     "intro_order": 21

Some data properties are categorical (such as ethnicity and hair color); some data properties are ordinal (hair length), and some data properties are ratios (such as age or number of tattoos). The Python script encodes categorical data into ordinal, numeric values through a lookup dictionary.

hair_length = dict({"neck": 1,
                    "shoulder": 2,
                    "chest": 3,
                    "stomach": 4})

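As a rough sketch of how such lookups turn one contestant record into a row of numbers (the field names and the hair_length table come from the post; the HAIR_COLOR codes, the "light" value, and the encode_contestant helper are assumptions for illustration):

```python
# Ordinal lookup taken from the post: longer hair maps to a larger number.
HAIR_LENGTH = {"neck": 1, "shoulder": 2, "chest": 3, "stomach": 4}

# Hypothetical lookup for a purely categorical field; the codes are
# arbitrary labels and carry no real ordering.
HAIR_COLOR = {"light": 0, "dark": 1}


def encode_contestant(record):
    """Turn one contestant's JSON record into a flat list of numbers."""
    return [
        record["age"],
        record["height_inches"],
        record["num_tattoos"],
        HAIR_LENGTH[record["hair_length"]],
        HAIR_COLOR[record["hair_color"]],
    ]


# A trimmed copy of the record shown above.
alissa = {"age": 24, "height_inches": 64, "num_tattoos": 0,
          "hair_length": "chest", "hair_color": "dark"}
features = encode_contestant(alissa)
print(features)
```

One caveat with this approach: giving a categorical field like hair color integer codes implies an ordering that does not really exist, which a tree can tolerate better than an SVM.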

Given this sparse data, it’s hard to train the machine.  More importantly, we can’t break the data into training and testing sets; because there’s so little data, we need to use all of it.  For this reason, I decided to randomly pick 25% of the data for training the tree or SVM, then run all of the points, including those same training points, through the classifier.  To offset this horrible violation of machine learning principles, I’ll run the model a thousand times and see who gets mis-classified as “eliminated” the most.

Here’s some code I wrote to train the samples:

import random


def week_predict(tgt_data, elims, tgt_week, sc_learn):
    """
    Given some data, make some predictions and return the average accuracy.
    :param tgt_data: json structure of data to be formatted
    :param elims: eliminated
    :param tgt_week:
    :param sc_learn: scikit-learn classifier to train (tree or SVM)
    """
    learn_values = data_formatter(tgt_data, elims, tgt_week)
    x = list()
    y = list()
    samples = set()
    # Pick a random 25% of the contestants to train on.
    while len(samples) < (len(learn_values) * 0.25):
        samples.add(random.randint(0, len(learn_values) - 1))
    learn_arr = list(learn_values.values())
    # print("Sample selection: ", samples)
    for index in samples:
        x.append(learn_arr[index][0])  # feature vector
        y.append(learn_arr[index][1])  # label: 0 means not eliminated

    # These next two lines courtesy of:
    # http://scikit-learn.org/stable/modules/tree.html
    clf = sc_learn
    try:
        clf = clf.fit(x, y)
    except ValueError:
        # Sometimes we try to train with 0 classes.
        return dict({"accuracy": 0, "departures": []})
    c = 0
    departures = list()
    for item in learn_values:
        # predict() expects a list of samples, so wrap the single vector.
        if clf.predict([learn_values[item][0]])[0] != learn_values[item][1]:
            c += 1
            if learn_values[item][1] == 0:  # if they aren't eliminated
                departures.append(item)
    accuracy = round((len(learn_values) - c) / float(len(learn_values)) * 100, 2)
    ret_val = dict({"accuracy": accuracy,
                    "departures": departures})
    return ret_val
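
The “run it a thousand times and count” step is not shown above, so here is a minimal sketch of how the votes could be tallied. The tally_departures helper and the fake_run stand-in are hypothetical; in the real script, each round would presumably be a call to week_predict, with its "departures" list fed into the counter.

```python
import random
from collections import Counter

_rng = random.Random(0)  # fixed seed so the demo is repeatable


def tally_departures(run_once, n_runs=1000):
    """Call run_once() n_runs times and count how often each name is flagged."""
    votes = Counter()
    for _ in range(n_runs):
        votes.update(run_once())
    return votes


def fake_run():
    """Hypothetical stand-in for one train-and-predict round."""
    names = ["A", "B", "C"]
    return [name for name in names if _rng.random() < 0.5]


votes = tally_departures(fake_run, n_runs=1000)
for name, count in votes.most_common():
    print("Departing: {} votes: {}".format(name, count))
```

Sorting with most_common() gives the same descending “votes” layout as the output listed below.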


These are my leading predictors for departure.  For the decision tree, here we have it:

Departing: Britt Nilsson votes: 640
Departing: Carly Waddell votes: 581
Departing: Jade Roper votes: 538
Departing: Becca Tilley votes: 520
Departing: Megan Bell votes: 506
Departing: Kaitlyn Bristowe votes: 365

For the SVM, the output:

Departing: Britt Nilsson votes: 749
Departing: Megan Bell votes: 722
Departing: Becca Tilley votes: 714
Departing: Jade Roper votes: 707
Departing: Carly Waddell votes: 629
Departing: Kaitlyn Bristowe votes: 619

I’ve arranged each of these in descending order; the more votes someone has for departure, the more frequently they are miscategorized as eliminated based on who has already been eliminated and who is still there.


Does this make sense given the current state of the contest? After discussing these predicted outcomes with expert viewers, they were adamant that Britt, the contestant I’ve pegged as being most likely to leave, has almost no chance of departing.  She made a strong impression on Chris and seems to have a strong connection with him, so for her to leave would be a shocker; on the other hand, many of these same viewers thought Kaitlyn had a good chance of winning, which would validate her low departure score.


This blog post didn’t make it to the Internet in time for these predictions to be evaluated in public.  We now know (spoiler alert) that Britt and Carly departed in Week 7, followed by Jade in Week 8.  As such, I will make some predictions for next week:

Decision Tree results:

Departing: Becca Tilley: 3590
Departing: Whitney Bischoff: 3357
Departing: Kaitlyn Bristowe: 3320

Support Vector Machine results:

Departing: Becca Tilley: 1799
Departing: Whitney Bischoff: 1760
Departing: Kaitlyn Bristowe: 1715

Analysis 2.0

We’ve trained an SVM with just a few variables that don’t capture the essence of the problem very well.  Furthermore, we did no fine tuning for the parameters in the SVM, nor did we prune the decision tree.

In order to rectify the data issue, I have put out a call to fans of The Bachelor for additional data about any facet of the show, including:

  • How many kisses per episode per contestant
  • Who gets what date?
  • Number of minutes of on-air time featured per episode, per contestant
  • Distance from contestant hometown to Arlington, Iowa
  • Any other data

Tools Used

I used mostly stock Python libraries to gather data and produce these predictions; for web scraping, I used selenium and BeautifulSoup.  For machine learning, scikit-learn.  For keeping the code clean, pylint.

Code is available at: https://github.com/jamesfe/bachelor_tree

If you have questions, feel free to tweet them to me at @jimmysthoughts!

Beer Caps to Coffee Tables

Some years ago, I lived in Norfolk, VA with a friend who loved craft beer as much as I did. Our hoppy passion motivated us to brew beer at home, visit local craft beer stores, and generally enjoy our nights with a brew here or there. Subconsciously knowing it would be a brilliant idea, we used a tupperware box next to our refrigerator to house a growing collection of beer caps, feeding it a single cap as each beer departed the cold, destined for enjoyment.

After a few years, I realized the box was close to being filled to the brim. The mechanical gears in my head were cranking out an idea to do something creative with the bottle caps. I wanted to incorporate some things I’d learned in my classes related to imaging, computer vision, data mining, and spectral signatures. I also had an old square coffee table (36″x36″) that could be the canvas to this project. This was the beginning of a fun exploration in image processing with python to create a bottle cap magnum opus.

Getting it Done
There were a few things I needed to do; first, I needed more caps.  Given the size of a bottle cap (1″) minus some width for a border around the table and a little space between the caps, I found that the table would accept a 30 cap by 30 cap grid; therefore, I needed at least 900 caps.  A local bar supported this effort by hosting a 5 gallon bucket for the bartender to drop caps into in lieu of the trash can.
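
For anyone sizing their own table, the arithmetic can be sketched as below. The 36″ table and 1″ cap are from this post; the 1.5″ border and 0.1″ gap are assumed values, chosen only because they reproduce the 30 by 30 grid, and caps_per_side is a made-up helper name.

```python
def caps_per_side(table_in, cap_in, border_in, gap_in):
    """How many caps fit along one side, given a border and a gap between caps."""
    usable = table_in - 2 * border_in
    # n caps need n * cap_in + (n - 1) * gap_in inches of usable width,
    # which rearranges to n <= (usable + gap_in) / (cap_in + gap_in).
    return int((usable + gap_in) // (cap_in + gap_in))


# Border and gap are assumptions, not measurements from the actual table.
per_side = caps_per_side(table_in=36.0, cap_in=1.0, border_in=1.5, gap_in=0.1)
print(per_side, per_side ** 2)
```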

Second, I needed a way to extract the data (color; specifically red, green, and blue) from each individual cap.  I originally arranged the images on a gridded sheet of paper, took a photo, and extracted them row-by-column after performing some image transforms to account for the skewed nature of the image.  As it turns out, there is no good way to get a completely top-down image of the caps; it will always turn out to be smaller at the top and larger at the bottom depending on the angle you held the camera at. The data wasn’t fantastic and as more caps came in, I knew I needed a better way; a quick survey of computer vision capabilities turned up a concept called a Hough Transform.

Hough transforms are a method of identifying shapes that can be approximated by equations in images; in this case, a circle with a roughly known radius was the target. This synopsis is a vast simplification, but for each pixel in the output image (of the same size as the input image), the algorithm polls those surrounding pixels that might form a circle and assigns a value to that pixel. Based on all the pixels in the image, one can then surmise that the pixels with values above a certain threshold are the center point for a circle of known radius. To facilitate the discovery of circles of a specific radius (as many caps have concentric circle patterns), I used a Sobel operator to convert the image into an edge-only image.
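
To make the voting idea concrete, here is a toy, pure-Python version for a single known radius. It is a sketch of the general technique, not the code used for the table, and it is far slower than OpenCV; the function names and the synthetic circle are invented for the example.

```python
import math
from collections import Counter


def circle_points(cx, cy, radius, n_points=36):
    """Integer pixel coordinates roughly on a circle (a synthetic edge image)."""
    pts = set()
    for k in range(n_points):
        theta = 2 * math.pi * k / n_points
        pts.add((int(round(cx + radius * math.cos(theta))),
                 int(round(cy + radius * math.sin(theta)))))
    return pts


def hough_circle_votes(edge_points, radius, n_angles=72):
    """Each edge pixel votes for every candidate centre `radius` away from it."""
    votes = Counter()
    for x, y in edge_points:
        candidates = set()
        for k in range(n_angles):
            theta = 2 * math.pi * k / n_angles
            candidates.add((int(round(x - radius * math.cos(theta))),
                            int(round(y - radius * math.sin(theta)))))
        votes.update(candidates)  # one vote per edge pixel per candidate
    return votes


edges = circle_points(20, 20, 8)
votes = hough_circle_votes(edges, radius=8)
best_centre, best_votes = votes.most_common(1)[0]
print(best_centre, best_votes, len(edges))
```

The pixel with the most votes is the one that every edge pixel agrees on, which is the centre of the circle; on a real photo you would keep every pixel above a vote threshold rather than just the single best one.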

A subset of the caps, spread out and photographed.

Although I initially wrote my own Sobel filter and Hough transform in Python, I think it’s smarter to use OpenCV, which I discovered later.  OpenCV is a set of computer vision related functions written in C/C++ with wrappers for Python; Hough transforms are only one of the things that OpenCV can simplify from a multi-hour debacle into a few minutes of digging in the documentation for the right tool. Here is a quick snippet of something that took me dozens of lines of code in earlier editions:

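import cv2
import cv2.cv as cv  # OpenCV 2.4 naming; OpenCV 3+ exposes cv2.HOUGH_GRADIENT instead


def cv2hough(sobel_image, color_image, minrad, maxrad):
    """
    Identifies and cuts out to a directory potentially circular
    shapes that match a known radius range.
    :param sobel_image: sobel-filtered image
    :param color_image: non-sobel filtered version of colored_image
    :param minrad: smallest circle radius to accept, in pixels
    :param maxrad: largest circle radius to accept, in pixels
    """
    # Read the edge image as grayscale (flag 0).
    in_cv_image = cv2.imread(sobel_image, 0)
    avg_rad = (minrad + maxrad) / 2
    circles = cv2.HoughCircles(in_cv_image, cv.CV_HOUGH_GRADIENT, 1, 65,
                               param1=90, param2=15,
                               minRadius=minrad, maxRadius=maxrad)
    exportCircles(circles, color_image, 'caps', avg_rad)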

(Documentation for cv2.HoughCircles)

This code snippet concludes by sending a list of all the places on the image where it thinks a cap might be to “exportCircles”, where those caps are then cut out of the image and sent to a 30px x 30px JPG file.

A cap, post-extraction. It’s a rough approximation, but the Hough transform was sufficient.

Once we have a directory containing hundreds and hundreds of caps, we can begin some analysis. (Full disclosure: I manually scrolled through the images and deleted those that appeared unusable.) I used Python to calculate statistics for each bottle cap image and stored the data in a structure. Each pixel carries a red (R), green (G), and blue (B) value. We can average these to get an (R,G,B) triplet for each bottle cap; that is to say, a cap is “mostly” red, or “mostly” blue, etc. I soon found that these numbers weren’t that descriptive and began to work in the Hue, Saturation, and Value color space. (HSL & HSV: Wikipedia).

Protip: I used a quick one-liner to convert from RGB to HSV before storing the data:
colorsys.rgb_to_hsv(rgb[0], rgb[1], rgb[2])
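
Putting the averaging and the conversion together might look something like this. It is a minimal sketch: the helper names and the synthetic pixels are made up, and it scales the channels to 0..1 first, since that is the range the colorsys documentation describes.

```python
import colorsys


def average_rgb(pixels):
    """Mean (R, G, B) of an iterable of 0-255 pixel triplets."""
    pixels = list(pixels)
    n = float(len(pixels))
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def cap_hsv(pixels):
    """Average a cap's pixels, then convert to HSV (each component in 0..1)."""
    r, g, b = average_rgb(pixels)
    # colorsys works on floats in the 0..1 range, so scale down from 0..255.
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


red_cap = [(255, 0, 0)] * 4  # synthetic pixels, not real cap data
print(cap_hsv(red_cap))
```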

To save time cataloging 900 bottle caps and associated pixels, I pickle (serialize) each cap as I generate the data on it so I can quickly reference a data structure containing the caps and rearrange them as I see fit.
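
A bare-bones version of that caching step, using only the standard library, could look like the following. The record layout, file names, and helper functions are hypothetical; the real structure in the repository may differ.

```python
import os
import pickle
import tempfile


def save_caps(caps, path):
    """Serialize the list of cap records to disk."""
    with open(path, "wb") as handle:
        pickle.dump(caps, handle)


def load_caps(path):
    """Read the cap records back without re-processing any images."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical records: file name plus average HSV for each cap.
caps = [{"file": "cap_001.jpg", "hsv": (0.0, 1.0, 1.0)},
        {"file": "cap_002.jpg", "hsv": (0.6, 0.5, 0.8)}]
cache_path = os.path.join(tempfile.mkdtemp(), "caps.pickle")
save_caps(caps, cache_path)
restored = load_caps(cache_path)
```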

The caps by average, as the computer sees them.
The caps as we see them; the final arrangement was sorted by hue then saturation.

Final Steps: Output
I originally wanted to see what it would look like when I rearranged the bottle caps into different patterns. This is where we do that, and based on the data we have and the framework we’ve built, outputting an image or an HTML document (this is a really hasty solution!) becomes fairly easy, minus one hiccup.

def first_sort(self):
    Reach into the internal config and find out what we want to have as
    our first order sort on the caps.
    self.sortedDatList = sorted(self.imageDatList,
                                key=lambda thislist: thislist[
    del self.imageDatList
    self.imageDatList = self.sortedDatList

def second_sort(self):
    Second sort on the caps: sort rows individually.
    Example: We want a 30x30 grid of caps, sorted first by hue, then by
    We sort by hue in first_sort, then in second_sort, we sort 30 lists
    of 30 caps each individually and replace them into the list.
    ranges = [[self.conf['out_caps_cols'] * (p - 1),
               self.conf['out_caps_cols'] * p] for p in
              range(1, self.conf['out_caps_rows'] + 1)]
    for r in ranges:
        self.imageDatList[r[0]:r[1]] = sorted(self.imageDatList[r[0]:r[1]],
                                              key=lambda thislist: thislist[

Here we see two functions: first_sort and second_sort. You can guess that we sort the caps once, then sort them again; what may not be apparent from this code is that the second sort (which is performed on a different attribute) is applied row by row, using the out_caps_rows and out_caps_cols settings. This is the hiccup: you must sort multiple sub-segments of the array of bottle caps, in essence sorting each row after the first sort has decided which caps land in it. Otherwise each row of bottle caps will have a random pattern as the image trends from one extreme (top) to the other (bottom).
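
Stripped of the class and config plumbing, the two-stage idea can be sketched on plain tuples. This is an illustration only: two_stage_sort is a made-up name, and the (hue, saturation) pairs are invented rather than taken from real caps.

```python
def two_stage_sort(caps, rows, cols, first_key, second_key):
    """Sort all caps by first_key, then sort each row of `cols` caps by second_key."""
    ordered = sorted(caps, key=first_key)
    for r in range(rows):
        start, end = r * cols, (r + 1) * cols
        # Re-sort only this row's slice, leaving other rows untouched.
        ordered[start:end] = sorted(ordered[start:end], key=second_key)
    return ordered


# Invented (hue, saturation) pairs for a tiny 2x2 grid.
caps = [(0.9, 0.1), (0.1, 0.8), (0.5, 0.3), (0.2, 0.2)]
grid = two_stage_sort(caps, rows=2, cols=2,
                      first_key=lambda cap: cap[0],    # hue
                      second_key=lambda cap: cap[1])   # saturation
print(grid)
```

The first row still holds the two lowest hues and the second row the two highest, but within each row the caps now run from least to most saturated.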

Caps, laid out and ready to be transferred to the table.

The Physical Construction
To finish the project, I physically arranged a pattern based on informal votes from friends and acquaintances. I built a 1/2″ lip around the table with “Quarter-Round” molding. I glued this down, sealed the seams with wood glue, and painted everything black.

I filled the bottom of this open-top area with playground sand. This minimized the volume that I would have to fill in with expensive epoxy resin. It had the secondary benefit of lightening up the entire display, knowing the epoxy would darken it. I ordered two kits of “AeroMarine 300/21 Epoxy Resin” (for a total of 3 Gallons) from the internet. Ultimately, I arranged the caps on a sheet of butcher’s paper and transferred them one-by-one to the sand box.

A few friends, the famous roommate, and I gathered (with beers); we mixed the resin and began the process of pouring the resin onto the table. The one unforeseen consequence of the sand was that when the epoxy entered the sand, it pushed air upwards, causing the caps to float; we quickly used toothpicks to push the caps down into the sand, then poured more resin over top.

The table, post-drying and finished.

Code for this was hacked together over a number of weeks; I’ve consolidated it here: https://github.com/jamesfe/capViz.  Unfortunately, most of this code was written in a disheveled, disconnected manner and I’m just now beginning to tie it together into one well documented, good-practice PEP-8 compliant piece, but feel free to poke around and let me know what you think!

Thanks to The Birch for all the beer caps!

If this interested you, follow me on Twitter! @jimmysthoughts

Supercharging your reddit API access

Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we’ve recently started working with is reddit.

Compared to the walled gardens of Facebook and LinkedIn, reddit’s API is as open as open can be: everything is nice and RESTful, rate limits are sane, the developers are open to enhancement requests, and one can do quite a bit without needing to authenticate.
The most common objects we collect from reddit are submissions (posts) and comments. A submission can either be a link, or a self post with a text body, and can have an arbitrary number of comments. Comments contain text, as well as references to parent nodes (if they’re not root nodes in the comment tree). Pulling this data is as simple as GET http://www.reddit.com/r/washingtondc/new.json. (Protip: pretty much any view in reddit has a corresponding API endpoint that can be generated by appending ‘.json’ to the URL.)

With little effort a developer could hack together a quick ‘n dirty reddit scraper. However, as additional features appear and collection breadth grows, the quick ‘n dirty scraper becomes more dirty than quick, and you discover bugs (sorry, “features”) that others utilizing the API have already encountered and possibly addressed. API wrappers help consolidate communal knowledge and best practices for the good of all. We considered several, and, being a Python shop, settled on PRAW (Python Reddit API Wrapper).

With PRAW, getting a list of posts is pretty easy:

import praw
r = praw.Reddit(user_agent='Hello world application.')
for post in r.get_subreddit('WashingtonDC') \
$ python parse_bot_2000.py
209 :: /r/WashingtonDC's Official Guide to the City!
29 :: What are some good/active meetups in DC that are easy to join?
17 :: So no more tailgating at the Nationals park anymore...
3 :: Anyone know of a juggling club in DC
2 :: The American Beer Classic: Yay or Nay?

The Problem

Now, let’s try something a little more complicated. Our mission, if we choose to accept it, is to capture all incoming comments to a subreddit. For each comment we should collect the author’s username, the URL for the submission, a permalink to the comment, as well as its body.

Here’s what this looks like:

import praw
from datetime import datetime
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \

That was pretty easy. For the sake of this demo the save_comment method has been stubbed out, but anything can go there.

If you run the snippet, you’ll observe the following pattern:

... comment ...
... comment ...
... comment ...
... comment ...
... comment ...
... comment ...

This process also seems to be taking longer than a normal HTTP request. As anyone working with large amounts of data should do, let’s quantify this.

Using the wonderful, indispensable IPython:

In [1]: %%timeit
1 loops, best of 3: 136 ms per loop

In [2]: %%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
for comment in r.get_subreddit('Python') \
1 loops, best of 3: 6min 43s per loop

Ouch. While this difference in run-times is fine for a one-off, contrived example, such inefficiency is disastrous when dealing with large volumes of data. What could be causing this behavior?


According to the PRAW documentation,

Each API request to Reddit must be separated by a 2 second delay, as per the API rules. So to get the highest performance, the number of API calls must be kept as low as possible. PRAW uses lazy objects to only make API calls when/if the information is needed.

Perhaps we’re doing something that is triggering additional HTTP requests. Such behavior would explain the intermittent printing of comments to the output stream. Let’s verify this hypothesis.

To see the underlying requests, we can override PRAW’s default log level:

from datetime import datetime
import logging
import praw


r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \

And what does the output look like?

DEBUG:requests.packages.urllib3.connectionpool:"PUT /check HTTP/1.1" 200 106
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ak14j.json HTTP/1.1" 200 888
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aies0.json HTTP/1.1" 200 2889
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aiier.json HTTP/1.1" 200 14809
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ajam1.json HTTP/1.1" 200 1091
.. comment ..
.. comment ..
.. comment ..
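
The DEBUG:logger:message shape of those lines is what Python’s standard logging module produces once a logger is set to DEBUG. The exact configuration line isn’t shown above, so treat this as a sketch of one way to get such output; the enable_debug_logging helper and the sample log message are made up.

```python
import io
import logging


def enable_debug_logging(logger_name, stream):
    """Send DEBUG-and-above records from logger_name (and its children) to stream."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
enable_debug_logging('requests', buf)

# A child logger inherits the DEBUG level and propagates up to our handler.
child = logging.getLogger('requests.packages.urllib3.connectionpool')
child.debug('"GET /comments/example.json HTTP/1.1" 200 888')
print(buf.getvalue())
```

In a throwaway script, logging.basicConfig(level=logging.DEBUG) is the shorter route to the same format, at the cost of turning on debug output for every library at once.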

Those intermittent requests for individual comments back up our claim. Now, let’s see what’s causing this.

Prettifying the response JSON yields the following schema (edited for brevity):

               'title':'Should I? why?',

Let’s compare that to what we get when listing comments from the /r/python/comments endpoint:

               'link_title':'Django middleware that prints query stats to stderr after each request. pip install django-querycount',
               'body':'Try django-devserver for query counts, displaying the full queries, profiling, reporting memory usage, etc. \n\nhttps://pypi.python.org/pypi/django-devserver',

Now we’re getting somewhere – there are fields in the per-comment response that aren’t in the subreddit listing. Of the four fields we’re collecting, the submission URL and permalink properties are not returned by the subreddit comments endpoint. Accessing those causes a lazy evaluation to fire off additional requests. If we can infer these values from the data we already have, we can avoid having to waste time querying for each comment.
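
If lazy objects are new to you, this toy class shows the general pattern. It is not PRAW’s actual implementation: LazyComment, fake_fetch, and the field values are all invented to illustrate why touching a missing attribute costs an extra request.

```python
class LazyComment(object):
    """Toy model of a lazily loaded comment; not PRAW's real code."""

    def __init__(self, listing_fields, fetch_full):
        self._fields = dict(listing_fields)  # cheap: came with the listing
        self._fetch_full = fetch_full        # expensive: one extra request
        self._fetched = False

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        if name not in self._fields and not self._fetched:
            self._fields.update(self._fetch_full())
            self._fetched = True
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)


requests_made = []


def fake_fetch():
    """Pretend to hit the per-comment endpoint and record that we did."""
    requests_made.append('GET /comments/<id>.json')
    return {'permalink': 'http://example.com/permalink'}


comment = LazyComment({'author': 'someone', 'body': 'hello'}, fake_fetch)
comment.author                      # served from the listing, no request
comment.body
n_before = len(requests_made)
comment.permalink                   # missing field triggers the extra request
n_after = len(requests_made)
print(n_before, n_after)
```

With a mandatory delay between requests, one such attribute access per comment adds up quickly.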

Doing Work

Submission URLs

Submission URLs are a combination of the subreddit name, the post ID, and title. We can easily get the post ID fragment:

post_id = comment.link_id.split('_')[1]

However, there is no title returned! Luckily, it turns out that it’s not needed.

subreddit = 'python'
post_id = comment.link_id.split('_')[1]
url = 'http://reddit.com/r/{}/{}/' \
          .format(subreddit, post_id)
print(url) # A valid submission permalink!
# OUTPUT: http://reddit.com/r/python/2alblh/

Great! This also gets us most of the way to constructing the second URL we need – a permalink to the comment.

Comment Permalinks

Maybe we can append the comment’s ID to the end of the submission URL?

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = sub_comments_url + comment_id
# OUTPUT: http://reddit.com/r/python/comments/2alblh/ciwbo37

Sadly, that URL doesn’t work because reddit expects the submission’s title to precede the ID. Referring to the subreddit comment’s JSON object, we can see that the title is not returned. This is curious: why is the title important? They already have a globally unique ID for the post, and can display the post just fine without (as demonstrated by the code sample immediately preceding this). Perhaps reddit wanted to make it easier for users to identify a link and are just parsing a forward-slash delimited series of parameters. If we put the comment ID in the appropriate position, the URL should be valid. Let’s give it a shot:

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = '{}-/{}'.format(sub_comments_url, comment_id)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/-/ciwbo37

Following that URL takes us to the comment!
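
Both URL tricks can be folded into one small helper. This is a sketch rather than the production code: the function names are made up, and it assumes link_id and name arrive with reddit’s type prefixes (t3_ for the submission, t1_ for the comment), which is consistent with the split('_') calls above.

```python
def strip_type_prefix(fullname):
    """'t3_2alblh' -> '2alblh' (drops reddit's type tag such as t1_ or t3_)."""
    return fullname.split('_', 1)[1]


def comment_urls(subreddit, link_id, comment_name):
    """Build the submission URL and comment permalink without extra requests."""
    post_id = strip_type_prefix(link_id)
    comment_id = strip_type_prefix(comment_name)
    submission_url = 'http://reddit.com/r/{}/comments/{}/'.format(
        subreddit, post_id)
    # '-' stands in for the title slug that reddit expects in this position.
    permalink = '{}-/{}'.format(submission_url, comment_id)
    return submission_url, permalink


submission_url, permalink = comment_urls('python', 't3_2alblh', 't1_ciwbo37')
print(submission_url)
print(permalink)
```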

Victory Lap

Let’s see how much we’ve improved our execution time:

import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
for comment in r.get_subreddit('Python') \
1 loops, best of 3: 3.57 s per loop

Wow! 403 seconds to 3.57 seconds – a speedup of roughly 113x. Deploying this improvement to production not only increased the volume of data we were able to process, but also provided the side benefit of reducing the number of 504 errors we encountered during reddit’s peak hours. Remember, always be on the lookout for ways to improve your stack. A bunch of small wins can add up to something significant.

[Does this sort of stuff interest you? Love hacking and learning new things? Good news – we’re hiring!]