Category Archives: tech tip

Profiling in Python

Here at HumanGeo we use a lot of Python, and it is tons of fun. Python is a great language for writing beautiful and functional code amazingly fast, and it is most definitely my favorite language to use both privately and professionally. However, even though it is a wonderful language, Python can be painfully slow. Luckily, there are some amazing tools to help profile your code so that you can keep your beautiful code fast.

When I started working here at HumanGeo, I was tasked with taking a program that took many hours to run, finding the bottlenecks, and then doing whatever I could to make it run faster. I used many tools, including cProfile, PyCallGraph (source), and even PyPy (an alternate, fast interpreter for Python), to determine the best ways to optimize the program. I will go through how I used all of these tools, except for PyPy (which I ruled out to maintain interpreter consistency in production), and how they can help even the most seasoned developers find ways to better optimize their code.

Disclaimer: do not prematurely optimize! I’ll just leave this here.

Tools

Let's talk about some of the handy tools that you can use to profile Python code.

cProfile

The CPython distribution comes with two profiling tools, profile and cProfile. Both share the same API and should behave the same; however, the former has greater runtime overhead, so we'll stick with cProfile for this post.

cProfile is a handy tool for getting a nice, greppable profile of your code and a good idea of where its hot spots are. Let's look at some sample slow code:

-> % cat slow.py
import time

def main():
    sum = 0
    for i in range(10):
        sum += expensive(i // 2)
    return sum

def expensive(t):
    time.sleep(t)
    return t

if __name__ == '__main__':
    print(main())

Here we are simulating a long-running program by calling time.sleep and pretending that the result matters. Let's profile this and see what we find:

-> % python -m cProfile slow.py
20
         34 function calls in 20.030 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 __future__.py:48(<module>)
        1    0.000    0.000    0.000    0.000 __future__.py:74(_Feature)
        7    0.000    0.000    0.000    0.000 __future__.py:75(__init__)
       10    0.000    0.000   20.027    2.003 slow.py:11(expensive)
        1    0.002    0.002   20.030   20.030 slow.py:2(<module>)
        1    0.000    0.000   20.027   20.027 slow.py:5(main)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {print}
        1    0.000    0.000    0.000    0.000 {range}
       10   20.027    2.003   20.027    2.003 {time.sleep}

Now, this is very trivial code, but I find the default output not as helpful as it could be. The list of calls is ordered by "standard name" (essentially alphabetically), which has no importance to us; I would much rather see the list sorted by number of calls or by cumulative run time. Luckily, the -s argument exists, and we can sort the list and see the hot parts of our code right now!

-> % python -m cProfile -s calls slow.py
20
         34 function calls in 20.028 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.000    0.000   20.025    2.003 slow.py:11(expensive)
       10   20.025    2.003   20.025    2.003 {time.sleep}
        7    0.000    0.000    0.000    0.000 __future__.py:75(__init__)
        1    0.000    0.000   20.026   20.026 slow.py:5(main)
        1    0.000    0.000    0.000    0.000 __future__.py:74(_Feature)
        1    0.000    0.000    0.000    0.000 {print}
        1    0.000    0.000    0.000    0.000 __future__.py:48(<module>)
        1    0.003    0.003   20.028   20.028 slow.py:2(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}

Ah! Now we see that the hot code is in our expensive function, which ends up calling time.sleep enough times to cause an annoying slowdown.

The list of valid arguments to the -s parameter can be found in the Python documentation.  Make sure to use the output option, -o, if you want to save these results to a file!
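
If you would rather profile from inside your code than from the command line, the same machinery is available programmatically. Here is a minimal sketch (assuming the slow.py module above is importable, and with a made-up stats filename) that saves the stats to a file and prints the ten most expensive entries by cumulative time:

import cProfile
import pstats

from slow import main

# Profile a single call to main() and dump the raw stats to disk
cProfile.run('main()', 'slow.prof')

# Load the stats, sort by cumulative time, and show the top ten entries
stats = pstats.Stats('slow.prof')
stats.sort_stats('cumulative').print_stats(10)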

With the basics down, let's look at some other ways that we can find hot code using profiling tools.

PyCallGraph

PyCallGraph can be seen as a visual extension of cProfile, giving us a nice Graphviz image through which to follow the flow of the code. PyCallGraph is not part of the standard Python installation, but it can be installed simply with:

-> % pip install pycallgraph

We can run this graphical application with the following command:

-> % pycallgraph graphviz -- python slow.py

A pycallgraph.png file should be created in the directory where you ran the script, and it should give you some familiar results (if you have already run cProfile). The numbers should be the same as the data we got from cProfile; however, the benefit of PyCallGraph is its ability to show the relationships between the functions being called.

Let's look at what that graph looks like:

This is so handy! It shows the flow of the program and nicely labels each function, module, and file that the program runs through, along with run time and number of calls. Running this on a big application generates a large image, but with the coloring, it is quite easy to find the code that matters. Here is a graph from the PyCallGraph documentation, showing the flow of code involving complex regular expression calls:

source code of graph
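
As an aside, you can also drive PyCallGraph from inside your code instead of wrapping the whole script. A minimal sketch (the output filename here is just an assumption) looks something like this; only the code inside the with-block gets traced:

from pycallgraph import PyCallGraph
from pycallgraph.output import GraphvizOutput

from slow import main

# Trace only the call to main() and render the result with Graphviz
output = GraphvizOutput(output_file='slow_callgraph.png')
with PyCallGraph(output=output):
    main()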

What can we do with this information?

Once we determine the cause of the slow code, we can choose an appropriate course of action to speed it up. Let's talk about some possible solutions to slow code, organized by the kind of issue.

I/O

If you find your code is heavily input/output bound, for example because it sends many web requests, then you may be able to solve this problem by using Python's standard threading module. Threading is not suited to CPU-bound work in Python, because CPython's GIL prevents more than one thread from executing Python bytecode at a time.
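
As a rough sketch of what that can look like (the URLs are made up, and this assumes the requests library plus a Python version that ships concurrent.futures), a thread pool lets the network waits overlap:

from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical list of pages to fetch
urls = ['http://example.com/page/{}'.format(i) for i in range(20)]

def fetch(url):
    # Each worker spends most of its time blocked on the network, so the
    # GIL is released and the requests overlap instead of running serially.
    return url, requests.get(url).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)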

Regex

You know what they say: once you decide to use regular expressions to fix a problem, you now have two problems. Regular expressions are hard to get right and hard to maintain. I could write a whole separate blog post on this (and I will not; regexes are hard, and there are much better posts out there than one I could write), but I will give a few quick tips:

  1. Avoid .* where you can; greedy catch-alls are slow, and using character classes as much as possible helps.
  2. Do not use regex! Many regexes can be replaced with simple string methods such as str.startswith and str.endswith. Check out the str documentation for tons of great info.
  3. Use re.VERBOSE! Python's regex engine is great and super helpful, use it! (See the sketch below.)
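
Here is a small, entirely hypothetical example of what re.VERBOSE buys you; whitespace and comments inside the pattern are ignored, so the expression can be laid out and annotated like ordinary code:

import re

# With re.VERBOSE the pattern documents itself
us_phone = re.compile(r"""
    (\d{3})    # area code
    [-.\s]?    # optional separator
    (\d{3})    # prefix
    [-.\s]?    # optional separator
    (\d{4})    # line number
    """, re.VERBOSE)

print(us_phone.match('703-555-0123').groups())  # ('703', '555', '0123')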

That's all I will say on regex; there are some great posts all over the internet if you want more information.

Python Code

In the case of the code I was profiling, we were running a Python function tens of thousands of times in order to stem English words. The best part about finding that this was the culprit was that this kind of operation is easily cacheable. We were able to save the results of this function, and in the end, the code ran 10 times as fast as it did before. Creating a cache in Python is super easy:

from functools import wraps
def memoize(f):
    cache = {}
    @wraps(f)
    def inner(arg):
        if arg not in cache:
            cache[arg] = f(arg)
        return cache[arg]
    return inner

This technique is called memoization, and it is shown here implemented as a decorator, which can be applied to Python functions like so:

import time
@memoize
def slow(you):
    time.sleep(3)
    print("Hello after 3 seconds, {}!".format(you))
    return 3

Now if we run this function multiple times with the same argument, the result will only be computed once.

>>> slow("Davis")
Hello after 3 seconds, Davis!
3
>>> slow("Davis")
3
>>> slow("Visitor")
Hello after 3 seconds, Visitor!
3
>>> slow("Visitor")
3

This was a great speedup for the project, and the code runs without a hitch.

Disclaimer: make sure this is only used for pure functions! If memoization is applied to functions with side effects such as I/O, the caching can have unintended results.
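
As an aside, if you are on Python 3.2 or newer, the standard library already ships a ready-made version of this decorator, functools.lru_cache, which also supports bounding the cache size; the same pure-function caveat applies. A minimal sketch:

from functools import lru_cache
import time

@lru_cache(maxsize=None)  # maxsize=None caches everything, like our memoize above
def slow(you):
    time.sleep(3)
    return "Hello after 3 seconds, {}!".format(you)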

Other cases

If your code is not readily memoizable, your algorithm is not something obviously fixable (like an O(n!) monster), or your profile is 'flat' (there are no obvious hot sections of code), then maybe you should look into another runtime or language. PyPy is a great option, as is writing a C extension for the meat of your algorithm. Luckily, what I was working on did not come to this, but the option is there if it is needed.

Conclusion

Profiling code can help you understand the flow of a project, where the hot code is, and what you can do as a developer to speed it up. Python's profiling tools are great: super easy to use and detailed enough to help you get to the root cause fast. Python is not meant to be a fast language, but that does not mean you should be writing slow code! Take charge of your algorithms, do not forget to profile, and never prematurely optimize.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

No Soup for You: When Beautiful Soup Doesn’t Like Your XML

Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy’s landing page for the library even says:

Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

You can plot me directly in the "days" column. When I'm starting a Python project that requires me to parse HTML data, the first dependency I'll pull in is BeautifulSoup. It makes what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I've also learned that BS can be a huge boon for XML data, but not without a couple of speed bumps.

Enter data from the client (not real data, but play along for a moment):

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <link>
          <url>http://atlanta.braves.mlb.com</url>
        </link>
    </team>
  </sport>
</xml>

Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.

from __future__ import print_function
from bs4 import BeautifulSoup

# Make our soup
with open('data.xml') as infile:
    blob = infile.read()
# Use LXML for blazing speed
soup = BeautifulSoup(blob, 'lxml')
print(soup.find_all('link'))

We get in return:

[<link/>]

Wait, what? What happened to our content? This is a pretty basic BS use case, but something strange is happening. Well, I'll come back to this and start working with their other (very hypothetical) data sets, where <link> tags become <resource> tags but the rest of the data is structured exactly the same. This time around:

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <resource><!-- Changed to resource -->
          <url>http://atlanta.braves.mlb.com</url>
        </resource><!-- Corresponding close -->
    </team>
  </sport>
</xml>

…and corresponding result…

>>> print(soup.find_all('resource'))
[<resource>
<url>http://atlanta.braves.mlb.com/index.jsp?c_id=atl</url>
</resource>]

Interesting! To compound our problem, we’re on a customer site where we don’t have internet access to grab that sweet, sweet documentation we crave. After toiling on a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO’s founder, Jeff Atwood. Read the Source. But what was I looking for? Well, let’s dig around for <link> tags and see what turns up.

Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home, bs4.builder.__init__.py, lines 228/229 in v4.3.2).

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                          'spacer', 'link', 'frame', 'base'])

We have a seemingly harmless word, "link", in our XML, but it means something very different in HTML and, more specifically, to the TreeBuilder implementation that LXML is using. As a test, if I change our <link>-turned-<resource> tags into <base> tags, we get the same result – no content. It also turns out that if you have LXML installed, BeautifulSoup4 will prefer it for parsing; uninstalling it grants us the results we want – tags with content. The stricter (but faster) TreeBuilder implementations from LXML take precedence over the built-in HTMLParser or html5lib (if you have it installed). How do we know that? Back to the source code!

bs4/builder/__init__.py, lines 304:321

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last result.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass

As it turns out, when creating your soup, 'lxml' != 'xml'. Changing the soup creation gets us the results we're looking for (UPDATE: the corresponding doc was "helpfully" pointed out by a Reddit commenter here). BeautifulSoup was still falling back to HTML builders, which is why we were seeing those results when specifying 'lxml'.

# Use the XML parser for sanity
soup = BeautifulSoup(blob, 'xml')
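
For completeness, here's a minimal side-by-side sketch of the difference the parser choice makes (this assumes lxml is installed, since the 'xml' feature depends on it); the printed output is approximate:

from bs4 import BeautifulSoup

blob = ('<xml><sport name="baseball"><team name="Braves" city="Atlanta">'
        '<link><url>http://atlanta.braves.mlb.com</url></link></team></sport></xml>')

# HTML tree builders treat <link> as an empty element, so its contents vanish
print(BeautifulSoup(blob, 'lxml').find_all('link'))   # roughly: [<link/>]

# The XML builder has no notion of HTML empty elements, so the content survives
print(BeautifulSoup(blob, 'xml').find_all('link'))    # roughly: [<link><url>...</url></link>]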

While I didn't find that magic code snippet to fix everything at first (UPDATE: thanks, Reddit), we did find our problem, even if we went a really roundabout way to get there. Understanding why it was happening made me feel a lot better in the end. It's easy to get frustrated when coding, but always remember: read the docs and Read the Source, Luke. It might help you understand the problem.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

Building the HumanGeo Website Map

HumanGeo is a relatively new company – the founding date is a little bit fuzzy, but four years old seems to be the prevailing notion.  Our website hadn't changed much in three-plus years, and it was looking a little dated, even by three-year-old website standards, so last year we decided to take the plunge and update it.

Here’s an Internet Archive snapshot of the old site, complete with shadowy HumanGeo guy (emphasis on the “Human” in HumanGeo, perhaps?)

It’s not that the old site was bad, it just wasn’t exciting, and it didn’t seem to capture all the things that make HumanGeo unique – it just sort of blended in with every other corporate website out there.  In building the new website, we set out to create an experience that is a little less corporate and a little more creative, with a renewed focus on concisely conveying what it is we actually do here.  Distilling all of the things we do across projects into a cohesive, clear message was a challenge on par with building the website itself, but I’ll save that discussion for another time.   Since HumanGeo is a young company that operates primarily in the services space, our corporate identity, while centered on a few core themes, is constantly evolving – very much driven by the needs of our partners and the solutions we collectively build.  One of the core themes that has remained despite the subtle changes in our identity over time is our focus on building geospatial solutions.  It’s the “Geo” in HumanGeo; this is precisely why we chose to display a map as the hero image on our main page (sorry HumanGeo guy).  This article is all about building that map, which is less a static image and more a dynamic intro into HumanGeo’s world.  My goal with this article is to highlight the complexity that went into building the map and some of the ways in which the Leaflet DVF and a few other libraries (Stamen, TopoJSON, Parallax JS) simplified development or added something special.

The HumanGeo Website Map

When you arrive at our site, you’re greeted with a large Leaflet JS-based map with a Stamen watercolor background and some cool parallax effects (at least I think it’s cool).  It’s more artsy than realistic (using the Stamen watercolor map tiles pretty much forced our hand stylistically), an attempt to emphasize our creative and imaginative side while still highlighting the fact that geospatial is a big part of many of the solutions that we build.  The map is centered on DC with HumanGeo’s office in Arlington in the lower-left and some pie charts and colored markers in the DC area. The idea here is that you’re looking at our world from above, and it’s sort of a snapshot in time.  At any given point in time there are lots of data points being generated from a variety of sources (e.g. people using social media, sensors, etc.), and HumanGeo is focused on helping our partners capture, analyze and make sense of those data points.  As you move your mouse  over the map or tilt your device, the clouds, satellite, plane, and map all move – simulating the effect of changing perspective that you might see if you were holding the world in your hands.  In general, I wanted this area (the typical hero image/area) to be more than just a large static stock image that you might see on many sites across the web – I wanted it to be dynamic.  If you spend any time browsing our site, you’ll see that this dynamic hero area is a common element across the pages of our site.

Parallax

Parallax scrolling is popular these days, making its way into pretty much every slick, out of the box website template out there.  Usually you see it in the form of a background image that scrolls at a different speed than the main text on a page.  It gives the page dimension, as if the page is composed of different layers in three dimensional space.  In general, I’m a fan of this effect.  However, I decided to go with a different approach for this redesign – mainly because I saw a cool library that applied parallax effects based on mouse or device movement rather than scrolling, and the thought of combining it with a map intrigued me.  The library I’m referring to is Parallax JS, a great library that makes these type of effects easy.  To create a parallax scene, you simply add an unordered list (ul) element – representing a scene – to the body of your page along with a set of child list item (li) elements that represent the different layers in the scene.  Each li element includes a data-depth attribute that ranges from 0.0 to 1.0 and specifies how close to or far away from the viewer that layer will appear (closeness implies that the movement will be more exaggerated).  A value of 0.0 indicates that the layer will be lowest in the list and won’t move, while a value of 1.0 means that the layer will move the most.   In this case, my scene is composed of the map as the first li element with various layers stacked on top of it including the satellite imaging boxes, the satellite, the clouds, and the plane.  The map has a data-depth value of 0.05, which means it will move a little, and each layer over top of the map has an increasing data-depth value.  Here’s what the markup looks like:


<ul id="scene">
<li class="layer maplayer" data-depth="0.05">
 <div id="map"></div>
</li>
<li class="layer imaginglayer" data-depth="0.30">
 <div style="opacity:0.2; top: 190px; left: 1100px; width: 400px; height: 100px;" />
</li>
<li class="layer imaginglayer" data-depth="0.40">
 <div style="opacity:0.3; top: 180px; left: 1150px; width: 300px; height: 75px;" />
</li>
<li class="layer imaginglayer" data-depth="0.50">
 <div style="opacity:0.4; top: 170px; left: 1200px; width: 200px; height: 50px;" />
</li>
<li class="layer imaginglayer" data-depth="0.60">
 <div style="opacity:0.5; top: 160px; left: 1250px; width: 100px; height: 25px;" />
</li>
<li class="layer cloudlayer" data-depth="0.60"></li>
<li class="layer planelayer" data-depth="0.90"></li>
<li class="layer satellitelayer" data-depth="0.95">
 <img src="images/satellite.png" style="width: 200px; height: 200px; top: 20px; left: 1200px;">
</li>
</ul>

A Moving Plane

On modern browsers, the plane moves via CSS 3 animations, defined using keyframes.  Keyframes establish the properties that will be applied to the HTML element being animated at different points in the animation.  In this case, the plane will move in a linear path from (left: 0px, top: 800px) to (left: 250px, top: -350px) over the duration of the animation.


@keyframes plane1 {

     0% {
          top: 800px;
          left: 0px;
     }

     100% {
          top: -350px;
          left: 250px;
     }

}

@-webkit-keyframes plane1 {

     0% {
          top: 800px;
          left: 0px;
     }

     100% {
          top: -350px;
          left: 250px;
     }

}

@-moz-keyframes plane1 ...

@-ms-keyframes plane1 ...

#plane-1 {
     animation: plane1 30s linear infinite normal 0 running forwards;
     -webkit-animation: plane1 30s linear infinite normal 0 running forwards;
     -moz-animation: plane1 30s linear infinite normal 0 running forwards;
     -ms-animation: plane1 30s linear infinite normal 0 running forwards;
}

On to the Map

The map visualizations use a mix of real and simulated data.  Since we’re located in the DC area, I wanted to illustrate displaying geo statistics in DC, but first, I actually needed to find some statistics related to DC.  I came across DC’s Office of the City Administrator’s Data Catalog site that lists a number of statistical data resources related to DC.  Among those resources are Comma Separated Value (CSV) files of all crime incident locations for a given year, such as this file, which lists crime incidents for 2013.  The CSV file categorizes each incident by a number of attributes and administrative boundary categories, including DC voting precinct.  I decided to use DC voting precincts as the basis for displaying geo statistics, since this would nicely illustrate the use of lower level administrative boundaries, and there are enough DC voting precincts to make the map interesting versus using another administrative boundary level like the eight wards in DC.  I downloaded the shapefile of DC precincts from the DC Atlas website and then used QGIS to convert it to GeoJSON to make it easily parseable in JavaScript.  At this point, I ran into the first challenge.  The resulting GeoJSON file of DC precincts is 2.5 MB, which is obviously pretty big from the perspective of ensuring that the website loads quickly and isn’t a frustrating experience for end users.

To make this file useable, I decided to compress it using the TopoJSON library. TopoJSON is a clever way of compressing GeoJSON, invented by Mike Bostock of D3 and New York Times visualization fame.  TopoJSON compresses GeoJSON by eliminating redundant line segments shared between polygons (e.g. shared borders of voting precinct boundaries). Instead of each polygon in a GeoJSON file sharing redundant line segments, TopoJSON stores the representation of those shared line segments, and then each polygon that contains those segments references the shared segments. It does amazing things when you have GeoJSON of administrative boundaries where those boundaries often include shared borders. After downloading TopoJSON, I ran the script on my DC precincts GeoJSON file:

topojson '(path to GeoJSON file)' -o '(path to output TopoJSON file)' -p

Note that the -p parameter is important, as it tells the TopoJSON script to include GeoJSON feature properties in the output.  If you leave this parameter off, the resulting TopoJSON features won’t include properties that existed in the input GeoJSON.  In my case, I needed the voting_precinct property to exist, since the visualizations I was building relied on attaching statistics to voting precinct polygons, and those polygons are retrieved via the voting_precinct property.  Running the TopoJSON script yielded a 256 KB file, reducing the size of the input file by 90%!

It’s important to note that in rare cases using TopoJSON doesn’t make sense.  You’ll have to weigh the size of your GeoJSON versus including the TopoJSON library (a light 8 KB minified) and the additional processing required by the browser to read GeoJSON out of the input TopoJSON.  So, if you have a small GeoJSON file with a relatively small number of polygons, or the polygon features in that GeoJSON file don’t share borders, it might not make sense to compress it using TopoJSON.  If you’re building choropleth maps using more than a few administrative boundaries, though, it almost always makes sense.

Once you have TopoJSON, you’ll need to decode it into GeoJSON to make it useable.  In this case:


precincts = topojson.feature(precinctsTopo, precinctsTopo.objects.dcprecincts);

precinctsTopo is the JavaScript variable that references the TopoJSON version of the precinct polygons.  precinctsTopo.objects.dcprecincts is a GeoJSON GeometryCollection that contains each precinct polygon.  Calling the topojson.feature method will turn that input TopoJSON into GeoJSON.

Now that I had data and associated polygons, I decided to create a couple of visualizations.  The first visualization illustrates crimes in DC by voting precinct using a single L.DataLayer instance but provides different visuals based on the input precinct.  For some precincts, I decided to display pie charts sized by total crime, and for others I decided to draw a MarkerGroup of stacked CircleMarker instances sized and colored by crime.


The reason for doing this is that I wanted to illustrate two different capabilities that we provide our partners, event detection from social media and geo statistics.  Including these in the same DataLayer instance ensures that the statistical data only get parsed once.  For the event detection layer, I also wanted to spice it up a bit by adding an L.Graph instance that illustrates the propagation of an event in social media from its starting point to surrounding areas.  In this case, I generated some fake data that represent the movement of information between two precincts, where each record is an edge.  Here’s what the fake data look like:


var precinctConnections = [
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 127',
 'cnt': '120'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 142',
 'cnt': '5'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 128',
 'cnt': '89'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 2',
 'cnt': '65'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 1',
 'cnt': '220'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 90',
 'cnt': '28'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 130',
 'cnt': '180'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 89',
 'cnt': '150'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 131',
 'cnt': '300'
 }
];

Here’s the code for instantiating an L.Graph instance and adding it to the map.


var connectionsLayer = new L.Graph(precinctConnections, {
     recordsField: null,
     locationMode: L.LocationModes.LOOKUP,
     fromField: 'p1',
     toField: 'p2',
     codeField: null,
     locationLookup: precincts,
     locationIndexField: 'voting_precinct',
     locationTextField: 'voting_precinct',
     getEdge: L.Graph.EDGESTYLE.ARC,
     layerOptions: {
          fill: false,
          opacity: 0.8,
          weight: 0.5,
          fillOpacity: 1.0,
          distanceToHeight: new L.LinearFunction([0, 5], [5000, 1000]),
          color: '#777'
     },
     tooltipOptions: {
          iconSize: new L.Point(90,76),
          iconAnchor: new L.Point(-4,76)
     },
     displayOptions: {
          cnt: {
                displayName: 'Count',
                weight: new L.LinearFunction([0, 2], [300, 6]),
                opacity: new L.LinearFunction([0, 0.7], [300, 0.9])
          }
     }
});

map.addLayer(connectionsLayer);

The locations for nodes in the graph are specified via the precincts GeoJSON retrieved from the TopoJSON file I discussed earlier.  fromField and toField tell the L.Graph instance how to draw lines between nodes.

The pie charts include random slice values.  I could have used the actual data and broken down the crime in DC by type, but I wanted to keep the pie charts simple, and three slices seemed like a good number (more than three types of crime were committed in DC in 2013).  As a bit of an aside, here's a representation of crimes in DC color-coded by type using L.StackedRegularPolygonMarker instances.


The code for this visualization can be found in the DVF examples here.  It’s more artistic than practical.  One of the nice features of the Leaflet DVF is the ability to easily swap visualizations by changing a few parameters – so in this example, I can easily swap between pie charts, bar charts, and other charts just by changing the class that’s used (e.g. L.PieChartDataLayer, L.BarChartDataLayer, etc.) and modifying some of the options.

After I decided what information I wanted to display using pie charts, I ventured over to the Color Brewer website and settled on some colors – the 3 class Accent color scheme in the qualitative section here – that work well together while fitting in with the pastel color scheme prevalent in the Stamen watercolor background tiles.  The Leaflet DVF has built in support for Color Brewer color schemes (using the L.ColorBrewer object), so if you see a color scheme you like on the Color Brewer website, it’s easy to apply it to the map visualization you’re building using the L.CustomColorFunction class, like so:

var colorFunction = new L.CustomColorFunction(1, 55, L.ColorBrewer.Qualitative.Accent['3'], {
 interpolate: false
});

I also wanted to highlight some of the work we’ve been doing related to mobility analysis, so I repurposed the running data from the Leaflet DVF Run Map example and styled the WeightedPolyline so that it fit in better with our Stamen watercolor map, using a white to purple color scheme.  The running data look like this:


var data={
 gpx: {
 wpt: [
 {
 lat: '38.90006',
 lon: '-77.05691',
 ele: '4.98652',
 name: 'Start'
 },
 {
 lat: '38.90082',
 lon: '-77.0572',
 ele: '4.9469',
 name: 'Run 001'
 },
 {
 lat: '38.90098',
 lon: '-77.05725',
 ele: '5.23951',
 name: 'Run 002'
 },
 ...
 {
 lat: '38.89517',
 lon: '-77.02822',
 ele: '4.74573',
 name: 'Run 109'
 }
 ]
 }
}

Check out the Run Map example to see how this data can be turned into a variable weighted line or the GPX Analyzer example that lets you drag and drop GPS Exchange Format (GPX) files (a common format for capturing GPS waypoint/trip data) onto the map to see variations in the speed of your run/bike/or other trip in space and time.

To make the map a little more interesting, I added some additional layers for our office, employees, and some boats in the water (the employee and boat locations are made up).  Now that I had the map layers I wanted to use, I incorporated them into a tour that allows a user to navigate the map like a PowerPoint presentation.  I built a simple tour class that has methods for navigating forward and backward.  Each tour action consists of an HTML element with some explanatory text and an associated function to execute when that action occurs.  In this case, the function to execute usually involves panning/zooming to a different layer on the map or showing/hiding map layers.  The details of the tour interaction go beyond collecting, transforming and visualizing geospatial data, so feel free to explore the code here and check out the tour for yourself by visiting our site and clicking on the HumanGeo logo.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

Supercharging your reddit API access

Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we’ve recently started working with is reddit.

Compared to the walled gardens of Facebook and LinkedIn, reddit's API is as open as open can be: everything is nice and RESTful, rate limits are sane, the developers are open to enhancement requests, and one can do quite a bit without needing to authenticate.

The most common objects we collect from reddit are submissions (posts) and comments. A submission can either be a link or a self post with a text body, and can have an arbitrary number of comments. Comments contain text, as well as references to parent nodes (if they're not root nodes in the comment tree). Pulling this data is as simple as GET http://www.reddit.com/r/washingtondc/new.json. (Protip: pretty much any view in reddit has a corresponding API endpoint that can be generated by appending '.json' to the URL.)
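
That protip is easy to try with nothing more than the requests library; this little sketch (the User-Agent string is arbitrary, and the field names follow the listing schema shown later in this post) prints the newest submission titles in /r/washingtondc:

import requests

# Any reddit view has a JSON twin: just append '.json' to the URL.
resp = requests.get('http://www.reddit.com/r/washingtondc/new.json',
                    headers={'User-Agent': 'quick and dirty demo scraper'})

for child in resp.json()['data']['children']:
    print(child['data']['title'])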

With little effort a developer could hack together a quick 'n dirty reddit scraper. However, as additional features appear and collection breadth grows, the quick 'n dirty scraper becomes more dirty than quick, and you discover bugs (er, features) that others utilizing the API have already encountered and possibly addressed. API wrappers help consolidate communal knowledge and best practices for the good of all. We considered several and, being a Python shop, settled on PRAW (the Python Reddit API Wrapper).

With PRAW, getting a list of posts is pretty easy:

import praw
r = praw.Reddit(user_agent='Hello world application.')
for post in r.get_subreddit('WashingtonDC') \
             .get_hot(limit=5):
    print(str(post))

$ python parse_bot_2000.py
209 :: /r/WashingtonDC's Official Guide to the City!
29 :: What are some good/active meetups in DC that are easy to join?
17 :: So no more tailgating at the Nationals park anymore...
3 :: Anyone know of a juggling club in DC
2 :: The American Beer Classic: Yay or Nay?

The Problem

Now, let’s try something a little more complicated. Our mission, if we choose to accept it, is to capture all incoming comments to a subreddit. For each comment we should collect the author’s username, the URL for the submission, a permalink to the comment, as well as its body.

Here’s what this looks like:

import praw
from datetime import datetime
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.permalink,
        comment.author.name,
        comment.body_html,
        comment.submission.url)

That was pretty easy. For the sake of this demo the save_comment method has been stubbed out, but anything can go there.

If you run the snippet, you’ll observe the following pattern:

... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
(repeating...)

This process also seems to be taking longer than a normal HTTP request. As anyone working with large amounts of data should do, let’s quantify this.

Using the wonderful, indispensable IPython:

In [1]: %%timeit
requests.get('http://www.reddit.com/r/python/comments.json?limit=200')
   ....:
1 loops, best of 3: 136 ms per loop

In [2]: %%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.permalink,
          comment.author.name,
          comment.body_html,
          comment.submission.url)
1 loops, best of 3: 6min 43s per loop

Ouch. While this difference in run-times is fine for a one-off, contrived example, such inefficiency is disastrous when dealing with large volumes of data. What could be causing this behavior?

Digging

According to the PRAW documentation,

Each API request to Reddit must be separated by a 2 second delay, as per the API rules. So to get the highest performance, the number of API calls must be kept as low as possible. PRAW uses lazy objects to only make API calls when/if the information is needed.

Perhaps we’re doing something that is triggering additional HTTP requests. Such behavior would explain the intermittent printing of comments to the output stream. Let’s verify this hypothesis.

To see the underlying requests, we can override PRAW’s default log level:

from datetime import datetime
import logging
import praw

logging.basicConfig(level=logging.DEBUG)

r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.author.name,
                 comment.submission.url,
                 comment.permalink,
                 comment.body_html)

And what does the output look like?

DEBUG:requests.packages.urllib3.connectionpool:"PUT /check HTTP/1.1" 200 106
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ak14j.json HTTP/1.1" 200 888
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aies0.json HTTP/1.1" 200 2889
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aiier.json HTTP/1.1" 200 14809
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ajam1.json HTTP/1.1" 200 1091
.. comment ..
.. comment ..
.. comment ..

Those intermittent requests for individual comments back up our claim. Now, let’s see what’s causing this.

Prettifying the response JSON yields the following schema (edited for brevity):

{
   'kind':'Listing',
   'data':{
      'children':[
         {
            'kind':'t3',
            'data':{
               'id':'2alblh',
               'media':null,
               'score':0,
               'permalink':'/r/Python/comments/2alblh/django_middleware_that_prints_query_stats_to/',
               'name':'t3_2ajam1',
               'created':1405227385.0,
               'url':'http://imgur.com/QBSOAZB',
               'title':'Should I? why?',
            }
         }
      ]
   }
}

Let's compare that to what we get when listing comments from the /python/comments endpoint:

{
   'kind':'Listing',
   'data':{
      'children':[
         {
            'kind':'t1',
            'data':{
               'link_title':'Django middleware that prints query stats to stderr after each request. pip install django-querycount',
               'link_author':'mrrrgn',
               'author':'mrrrgn',
               'parent_id':'t3_2alblh',
               'body':'Try django-devserver for query counts, displaying the full queries, profiling, reporting memory usage, etc. \n\nhttps://pypi.python.org/pypi/django-devserver',
               'link_id':'t3_2alblh',
               'name':'t1_ciwbo37',
               'link_url':'https://github.com/bradmontgomery/django-querycount',
            }
         }
      ]
   }
}

Now we're getting somewhere – there are fields in the per-comment response that aren't in the subreddit listing's. Of the four fields we're collecting, the submission URL and permalink properties are not returned by the subreddit comments endpoint. Accessing those causes lazy evaluation to fire off additional requests. If we can infer these values from the data we already have, we can avoid wasting time on a query for each comment.

Doing Work

Submission URLs

Submission URLs are a combination of the subreddit name, the post ID, and title. We can easily get the post ID fragment:

post_id = comment.link_id.split('_')[1]

However, there is no title returned! Luckily, it turns out that it’s not needed.

subreddit = 'python'
post_id = comment.link_id.split('_')[1]
url = 'http://reddit.com/r/{}/{}/' \
          .format(subreddit, post_id)
print(url) # A valid submission permalink!
# OUTPUT: http://reddit.com/r/python/2alblh/

Great! This also gets us most of the way to constructing the second URL we need – a permalink to the comment.

Comment Permalinks

Maybe we can append the comment’s ID to the end of the submission URL?

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = sub_comments_url + comment_id
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/ciwbo37

Sadly, that URL doesn't work, because reddit expects the submission's title to precede the comment ID. Referring to the subreddit comment's JSON object, we can see that the title is not returned. This is curious: why is the title important? Reddit already has a globally unique ID for the post and can display it just fine without the title (as demonstrated by the code sample immediately preceding this). Perhaps reddit wanted to make it easier for users to identify a link and is just parsing a forward-slash-delimited series of parameters. If we put the comment ID in the appropriate position, the URL should be valid. Let's give it a shot:

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = '{}-/{}'.format(sub_comments_url, comment_id)
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/-/ciwbo37

Following that URL takes us to the comment!
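
Putting the pieces together, a rough consolidation of the collector looks something like this (save_comment is still a stub, and only attributes that come back in the listing itself are touched, so no extra requests fire):

import praw

SUBREDDIT = 'Python'
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(args)

for comment in r.get_subreddit(SUBREDDIT).get_comments(limit=200):
    post_id = comment.link_id.split('_')[1]     # e.g. 't3_2alblh'  -> '2alblh'
    comment_id = comment.name.split('_')[1]     # e.g. 't1_ciwbo37' -> 'ciwbo37'

    # Build both URLs from listing data instead of touching lazy attributes
    submission_url = 'http://reddit.com/r/{}/{}/'.format(SUBREDDIT, post_id)
    permalink = 'http://reddit.com/r/{}/comments/{}/-/{}'.format(
        SUBREDDIT, post_id, comment_id)

    save_comment(permalink, comment.author.name, comment.body_html, submission_url)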

Victory Lap

Let’s see how much we’ve improved our execution time:

%%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.author.name,
          comment.body_html,
          comment.id,
          comment.submission.id)
1 loops, best of 3: 3.57 s per loop

Wow! 403 seconds to 3.6 seconds – a factor of 111. Deploying this improvement to production not only increased the volume of data we were able to process, but also provided the side benefit of reducing the number of 504 errors we encountered during reddit’s peak hours. Remember, always be on the lookout for ways to improve your stack. A bunch of small wins can add up to something significant.

[Does this sort of stuff interest you? Love hacking and learning new things? Good news – we’re hiring!]

Dynamically Cache Static Files using Django and Nginx

As a software developer who mainly works with web applications, there's nothing I find more frustrating than working on a new feature or fixing a bug and not seeing my changes reflected in the web browser because it is using an out-of-date cached version of the JS or CSS files I updated.  Worse, it's not an option to explain to puzzled customers that doing a full-page refresh or emptying the browser cache will fix a problem that has already been fixed.  At HumanGeo, we've developed a technique to get around this by using Nginx to dynamically "smart cache" some of our static files, using its built-in URL regular expressions to control which files are served and how browsers cache them.  We were already using Nginx to serve our static content, so with just a couple of updates to the Nginx configuration and the Django code, we're guaranteed to always serve the most up-to-date version of specific static files while still ensuring that browsers cache them properly.

Here’s an example from one of our hosted CSS files:

After doing a basic GET request (note: the requested file was taken from the cache):

After doing a full-page refresh (Note: The requested file got the appropriate status code 304 from the server):


The technique is actually pretty straightforward:  using the URL regular expression format that Nginx provides, we look for a pattern within the given URL and find the corresponding file on the file system.  So essentially, the browser will request /static/js/application.<pattern>.js and Nginx will actually serve /static/js/application.js.  Since the browser is requesting a filename with a particular pattern, it will still cache files with that pattern, but we have much more control over the user experience and what files they see when they hit the page, without requiring them to clear their cache.  To further automate the process, the pattern we use is the MD5 hash of the specified file.  Using the MD5 and a little help from a Django context processor, we can ensure that the pattern is always up-to-date and that we’re always using the most recent version of the file.  However, the pattern could be any unique string you’d like – such as a file version number, application version number, the date the file was last modified, etc.

Details:

Nginx configuration – how to use the URL regex:

# alias for /static/js/application.js
location ~ ^/static/js/application\.(.*)\.js$ {
    alias /srv/www/project/static_files/js/application.js;
    expires 30d;
}

# serve remaining static files
location /static/ {
    alias /srv/www/project/static_files/;
    expires 30d;
}

Getting the MD5 for specific files: Django settings.py

A function to generate the MD5 hash:
import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f:
        for chunk in iter(lambda: f.read(128*md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Generate and save the MD5 hash for application.js:

try:
    JS_FILE_PATH = os.path.join(STATIC_ROOT, 'js', 'application.js')
    JS_MD5 = md5sum(JS_FILE_PATH)
except:
    JS_MD5 = ""

Ensure that a custom template context processor has been added to TEMPLATE_CONTEXT_PROCESSORS:

TEMPLATE_CONTEXT_PROCESSORS = (
    ....
    "project.template_context_processors.settings_context_processor",
)

Make the JS_MD5 variable visible to the templates in template_context_processors.py:

from django.conf import settings
....
def settings_context_processor(request):
    return {
        'JS_MD5': settings.JS_MD5,
    }

Reference the JS_MD5 variable in html templates:

base.html

<script type="text/javascript" src="{{ STATIC_URL }}js/application.{{ JS_MD5 }}.js"></script>

When the HTML templates are rendered, the script tag will be generated with the MD5 hash baked into the filename (assuming STATIC_URL is '/static/'):

<script type="text/javascript" src="/static/js/application.<MD5 hash>.js"></script>

Summary

With a couple of updates to Django and a few lines of Nginx configuration, it's easy to create dynamic smart caching of static resources that works across all browsers.  This ensures that project stakeholders and all of the application's users get the correct version of each static resource, while their browsers still cache each resource correctly.