Elasticsearch Data Engineering

Last month, our team hosted a hackathon for about a dozen Data Scientists who work for one of our customers. The Scientists are a mixed group of econometricians and general statisticians and we wanted to see what they could do with some time series data – in particular news articles and social media (mostly tweets).

A few days before the start of the hackathon, one of our customer contacts asked if I would load a few billion tweets into a temporary Elasticsearch cluster so that the hackathon participants could use it when they arrived.

I quickly violated some well-learned muscle memory:

  1. I said: “yes, I can do that.”
  2. I chose to install the most recent version of Elasticsearch (2.0 at the time.)

Installation of 2.0 is about what you would expect if you have installed ES before. The most notable differences are how you install Marvel and the new Marvel user interface. I want to use the Marvel screenshot below to tell this story.

Indexing

To ingest, I wrote a Python script that iterated through a massive repository of zipped GNIP files in Amazon’s Simple Storage Service (S3). The first pain point with an ingest like this is defining the Elasticsearch mapping. I thought I had it just right, let the ingest script rip, then checked back in on it in the morning. It turns out I had missed quite a few fields (GNIP data has shifted formats over time), so I had to reingest about 40 million tweets. You can see my frustration in the first peak on the indexing rate graph below (also mirrored in the Index Size and Lucene Memory graphs).
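
The ingest script itself wasn’t fancy. Here’s a minimal sketch of the approach – the bucket, prefix, and index names are placeholders, and it assumes the boto3 and elasticsearch Python clients against an ES 2.0-era index that still uses types:

import gzip
import json

import boto3
from elasticsearch import Elasticsearch, helpers

# Placeholder names -- the real bucket, prefix, and index were different.
BUCKET = 'gnip-archive'
PREFIX = 'tweets/'
INDEX = 'tweets'

s3 = boto3.resource('s3')
es = Elasticsearch(['http://localhost:9200'])

def tweet_actions():
    '''Yield a bulk-index action for every tweet in every zipped GNIP file.'''
    for obj in s3.Bucket(BUCKET).objects.filter(Prefix=PREFIX):
        body = obj.get()['Body'].read()
        # GNIP activity streams are gzipped, newline-delimited JSON.
        for line in gzip.decompress(body).decode('utf-8').splitlines():
            if line.strip():
                yield {'_index': INDEX, '_type': 'tweet', '_source': json.loads(line)}

helpers.bulk(es, tweet_actions())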

20151120-marvel-hackathon.png

After ingesting all weekend, it was clear that I was never going to make it to a billion tweets by the start of the hackathon. I reached out to some of the HumanGeo Gurus and got some advice on how to tweak the cluster to improve ingest speed. The one piece of advice visible in the graph is in the blue circles: at that time I set index.number_of_replicas=0. You can tell that the size on disk became dramatically smaller (expected), and there is a small inflection point in the rate of ingest which you can only see in the document count line graph. Very disconcertingly, the indexing rate (blue rectangle) decreased! Yet the document count over time clearly grew faster. I think this is because Marvel double(ish)-counts the indexing rate while replicas are enabled, since it sees the same documents being indexed into the primary and replica shards at the same time.
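
For reference, dropping the replicas is a one-line settings call. Here’s a sketch using the requests library against a placeholder host and index, with the corresponding call to turn replicas back on once the bulk load is done:

import requests

ES = 'http://localhost:9200'  # placeholder host

# Turn replicas off for faster ingest...
requests.put(ES + '/tweets/_settings',
             json={'index': {'number_of_replicas': 0}})

# ...and turn them back on after the bulk load; ES rebuilds them in the background.
requests.put(ES + '/tweets/_settings',
             json={'index': {'number_of_replicas': 1}})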

Trouble

I was now resigned to having a database of ~800 million tweets, which was still great but short of my 1bn personal goal. Additional fun occurred at the red circle. One of the nodes ran out of disk space, and in doing so it corrupted its transaction log. This ground indexing to a halt – bulk index requests were no longer completing cleanly because Elasticsearch was missing shards. The cluster was missing shards because this failed node had the only copy (remember when I turned off replicas?!), and that copy was now offline.

The transaction log is one of those data store concepts that is supposed to save you from situations like this. I was running the cluster on AWS EC2, so the first thing I did was to stop Elasticsearch on that node, provision and move the index to a larger EBS volume, and start it back up. Elasticsearch tried to recover the shard by reading the transaction log, then discovered that it was corrupted, then gave up, then repeated the process.

One of the tools in your arsenal at this point is to say: forget about those transactions. So I removed the transaction log and restarted Elasticsearch. No dice – because of this bug in Elasticsearch 2.0, Elasticsearch can neither recover from a corrupted transaction log nor reinitialize a missing one.

I was now 250 million tweets shy of even that 800 million goal. But those tweets were still indexed! I was just being held up by a few bad eggs in that transaction log!

Several paragraphs ago, the harder-core reader was already punching their monitor in frustration because I hadn’t considered trying to hack around the transaction log. The transaction log is a binary format specific to Elasticsearch, and Elasticsearch creates a new one whenever you create a new index. What if I created an empty transaction log? Could I get my shard back online?

To get started, I created a new index in Elasticsearch which dropped a pristine transaction log on disk. I copied that transaction log into the right place in my broken index and restarted Elasticsearch. Elasticsearch complained that the UUID in the copied transaction log didn’t match the UUID that the index was expecting. A UUID is just a unique identifier – the transaction log in a brand new index is otherwise identical to the transaction log in any other brand new index. In the log, Elasticsearch said it saw a UUID of 0xDEADBEEF when it expected 0x00C0FFEE. I opened the transaction log in hexedit and could see 0xDEADBEEF.  I copied the correct UUID over the incorrect UUID, saved it, restarted Elasticsearch, the shard came online, and then that gap in the red circle was filled in!
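
If you ever have to do the same surgery, the byte swap itself is easy to script instead of doing by hand. Here’s a rough sketch, where the path and UUID values are placeholders and the assumption (borne out by what I saw in hexedit) is that the UUID appears as a plain byte string in the translog header:

# Rough sketch only -- path and UUID values are placeholders.
TRANSLOG = '/path/to/broken/index/0/translog/translog-1.tlog'
PRISTINE_UUID = b'DEADBEEF...'   # UUID baked into the copied-in translog
EXPECTED_UUID = b'00C0FFEE...'   # UUID the broken shard wants (from the ES log)

with open(TRANSLOG, 'rb') as f:
    data = f.read()

# Refuse to write anything surprising back out.
assert data.count(PRISTINE_UUID) == 1
assert len(PRISTINE_UUID) == len(EXPECTED_UUID)

with open(TRANSLOG, 'wb') as f:
    f.write(data.replace(PRISTINE_UUID, EXPECTED_UUID))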

With a repaired transaction log and all shards online I was able to turn on the replicas to get back to a cluster with some durability and better search performance. Elasticsearch took a few more hours to build the replicas and balance the cluster.

The hackathon went live at the green circle. Immediately our sharp Scientists started issuing queries that completely blew up the fielddata in Elasticsearch. Funny thing about fielddata – it’s the special part of memory that Elasticsearch uses to keep aggregations fast. It’s slow to load data into fielddata so it tries to only do that once. So by default, the policy is unbounded. Fielddata will grow until you’re out of memory and never evict data. Which basically means that once you’ve done an aggregation that pushed data into fielddata, it’s there until that data is deleted or you restart Elasticsearch. So if you don’t actually have enough RAM to hold all of the possible fielddata, it will by definition be fully used after a hard query and then newer (possibly more relevant) data can never fit in.

I think in many cases that default makes sense, but I didn’t have the luxury of adding additional resources. So I set indices.fielddata.cache.size=55%. The 55% is deliberate: the fielddata circuit breaker trips at 60% of the heap by default, so the cache cap needs to sit just below it. When you do this, you’re accepting that ES will evict the least recently used fielddata when it’s under memory pressure. I suspect that most of the early queries our users ran were basically bad ideas, so I didn’t want to punish future generations of queries with the mistakes of the past. (That sounds weirdly political.) Anyway, you can see the huge spike drop once I put that policy in place.
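
A note on mechanics: as far as I can tell, indices.fielddata.cache.size is a static node-level setting in 2.0, so it goes in elasticsearch.yml on each data node and takes a restart to apply. Afterwards you can keep an eye on fielddata usage and evictions through the nodes stats API; here’s a small sketch (the host is a placeholder):

# In elasticsearch.yml on each data node (static setting, restart required):
#
#   indices.fielddata.cache.size: 55%
#
# Then watch fielddata memory and evictions per node:
import requests

stats = requests.get('http://localhost:9200/_nodes/stats/indices/fielddata').json()
for node in stats['nodes'].values():
    fielddata = node['indices']['fielddata']
    print(node['name'], fielddata['memory_size_in_bytes'], fielddata['evictions'])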

Phew!

Hopefully the above helps you if you end up in similar situations. Once the hackathon was underway it was a great success. I even overheard one of the participants say how great it was that there was already data available in the database – that’s not normally how these things go, he said.

Some key takeaways from my recent Elasticsearch experience:

  • For ingest performance, turn off replicas
    • Beware! You are flying without an autopilot here. If something goes wrong, it will go very wrong.
  • For greenfield data exploration clusters, set indices.fielddata.cache.size=55% (or get moar RAM.)
  • Learn the internals of your database. This was one such exercise.
    • We see this one all the time with our clients. If you don’t know the failure modes, the query considerations, and the ingest concerns of your database, you’ll be buried under the weight of an early, non-data-driven decision.

Sounds Fun?

We’re hiring. Do Data Science using Elasticsearch and Social Media with us.

Engineering Coda

If you’re the kind of person who likes to be able to make computers do things while also writing the code that makes computers do things, I can’t say enough about two tools that help me do this stuff. One is a clusterssh clone for iTerm2 called i2cssh and the other is called htop. i2cssh lets you log in to a bunch of computers at the same time and have your typing go to all of them simultaneously (or individually). Check out the matrix-like screenshots below where I’m running htop on 4 nodes and then 7 nodes (I increased the size of the cluster at one point).
20151120-htop-hackathon1.png
20151120-htop-hackathon2.png

Thanks to Michele Bueno, Scott Fairgrieve, and Aaron Gussman for reviewing drafts of this post.

Profiling in Python

Here at HumanGeo we use a lot of Python, and it is tons of fun. Python is a great language for writing beautiful and functional code amazingly fast, and it is most definitely my favorite language to use both privately and professionally. However, even though it is a wonderful language, Python can be painfully slow. Luckily, there are some amazing tools to help profile your code so that you can keep your beautiful code fast.

When I started working here at HumanGeo, I was tasked with taking a program that took many hours to run, finding the bottlenecks, and then doing whatever I could to make it run faster. I used many tools, including cProfile, PyCallGraph, and even PyPy (an alternate, fast interpreter for Python), to determine the best ways to optimize the program. I will go through how I used all of these tools, except for PyPy (which I ruled out to maintain interpreter consistency in production), and how they can help even the most seasoned developers find ways to better optimize their code.

Disclaimer: do not prematurely optimize! I’ll just leave this here.

Tools

Let’s talk about some of the handy tools that you can use to profile Python code.

cProfile

The CPython distribution comes with two profiling tools, profile and cProfile. Both share the same API and should act the same; however, the former has greater runtime overhead, so we shall stick with cProfile for this post.

cProfile is a handy tool for getting a nice, greppable profile of your code and a good idea of where its hot parts are. Let’s look at some sample slow code:

-> % cat slow.py
import time

def main():
    sum = 0
    for i in range(10):
        sum += expensive(i // 2)
    return sum

def expensive(t):
    time.sleep(t)
    return t

if __name__ == '__main__':
    print(main())

Here we are simulating a long-running program by calling time.sleep and pretending that the result matters. Let’s profile this and see what we find:

-> % python -m cProfile slow.py
20
         34 function calls in 20.030 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 __future__.py:48(<module>)
        1    0.000    0.000    0.000    0.000 __future__.py:74(_Feature)
        7    0.000    0.000    0.000    0.000 __future__.py:75(__init__)
       10    0.000    0.000   20.027    2.003 slow.py:11(expensive)
        1    0.002    0.002   20.030   20.030 slow.py:2(<module>)
        1    0.000    0.000   20.027   20.027 slow.py:5(main)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {print}
        1    0.000    0.000    0.000    0.000 {range}
       10   20.027    2.003   20.027    2.003 {time.sleep}

Now, this is very trivial code, but I find this output not as helpful as it could be. The list of calls is sorted alphabetically, which has no importance to us; I would much rather see the list sorted by number of calls or by cumulative run time. Luckily, the -s argument exists, and we can sort the list and see the hot parts of our code right now!

-> % python -m cProfile -s calls slow.py
20
         34 function calls in 20.028 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.000    0.000   20.025    2.003 slow.py:11(expensive)
       10   20.025    2.003   20.025    2.003 {time.sleep}
        7    0.000    0.000    0.000    0.000 __future__.py:75(__init__)
        1    0.000    0.000   20.026   20.026 slow.py:5(main)
        1    0.000    0.000    0.000    0.000 __future__.py:74(_Feature)
        1    0.000    0.000    0.000    0.000 {print}
        1    0.000    0.000    0.000    0.000 __future__.py:48(<module>)
        1    0.003    0.003   20.028   20.028 slow.py:2(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}

Ah! Now we see that the hot code is in our expensive function, which ends up calling time.sleep enough times to cause an annoying slowdown.

The list of valid arguments to the -s parameter can be found in the Python documentation.  Make sure to use the output option, -o, if you want to save these results to a file!
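
If you do save a profile with -o, the standard library’s pstats module lets you re-sort and slice it later without re-running the program. A quick sketch, assuming the profile was written to slow.prof:

import pstats

# Assumes you ran: python -m cProfile -o slow.prof slow.py
stats = pstats.Stats('slow.prof')

# Ten most expensive entries by cumulative time...
stats.sort_stats('cumulative').print_stats(10)

# ...or by call count, like the -s calls example above.
stats.sort_stats('calls').print_stats(10)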

With the basics down, let’s look at some other ways that we can find hot code using profiling tools.

PyCallGraph

PyCallGraph can be seen as a visual extension of cProfile, where we can follow the flow of the code with a nice Graphviz image to look through. PyCallGraph is not part of the standard Python installation, but it can be simply installed with:

-> % pip install pycallgraph

We can generate the call graph with the following command:

-> % pycallgraph graphviz -- python slow.py

A pycallgraph.png file should be created in the directory where you ran the script, and it should give you some familiar results (if you have already run cProfile). The numbers should be the same as the data we got from cProfile; the benefit of PyCallGraph, however, is its ability to show the relationships between the functions being called.

Let’s take a look at what that graph looks like:

This is so handy! It shows the flow of the program, and nicely notifies us of each function, module, and file that the program runs through, along with runtime and number of calls. Running this in a big application generates a large image, but with the coloring, it is quite easy to find the code that matters. Here is a graph from the PyCallGraph documentation, showing the flow of code involving complex regular expression calls:

source code of graph

What can we do with this information?

Once we determine the cause of the slow code, we can choose an appropriate course of action to speed it up. Let’s talk about some possible solutions to slow code, depending on the issue.

I/O

If you find your code is heavily input/output dependent – for example, sending many web requests – then you may be able to solve this problem by using Python’s standard threading module. Threading for non-I/O work is not well suited to Python, due to CPython’s GIL, which precludes it from using more than one core at a time for CPU-bound tasks.
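
As a rough sketch of what that looks like, here’s the standard-library route: a thread pool fetching a few placeholder URLs concurrently. (concurrent.futures ships with Python 3; on Python 2 it’s available via the futures backport.)

from concurrent.futures import ThreadPoolExecutor

import requests

urls = [
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/c',
]

def fetch(url):
    # Each request spends most of its time waiting on the network, which is
    # exactly when threads are allowed to overlap despite the GIL.
    return url, requests.get(url, timeout=5.0).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)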

Regex

You know what they say, once you decide to use regular expressions to fix a problem, you now have two problems. Regular expressions are hard to get right, and hard to maintain. I could write a whole separate blog post on this (and I will not, regexes are hard, and there are much better posts than one that I could write), but I will give a few quick tips:

  1. Avoid .* – greedy catch-alls are slow; using character classes as much as possible can help with this.
  2. Don’t use a regex at all! Many regexes can be replaced with simple string methods such as str.startswith and str.endswith (see the sketch after this list). Check out the str documentation for tons of great info.
  3. Use re.VERBOSE! Python’s regex engine is great and super helpful – use it!
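
To make tips 2 and 3 concrete, here’s a quick sketch – a string-method check that replaces a regex entirely, and a re.VERBOSE pattern that documents itself (the log line is made up):

import re

line = 'ERROR 2015-11-20 something went wrong'

# Tip 2: a string method instead of a regex.
is_error = line.startswith('ERROR')

# Tip 3: re.VERBOSE lets you document the pattern inline.
date_pattern = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
    """, re.VERBOSE)

match = date_pattern.search(line)
if match:
    print(match.group('year'), match.group('month'), match.group('day'))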

That’s all I will say on regex; there are some great posts all over the internet if you want more information.

Python Code

In the case of the code I was profiling, we were running a Python function tens of thousands of times in order to stem English words. The best part about finding that this was the culprit was that this kind of operation is easily cacheable. We were able to save the results of this function, and in the end, the code ran 10 times as fast as it did before. Creating a cache in Python is super easy:

from functools import wraps
def memoize(f):
    cache = {}
    @wraps(f)
    def inner(arg):
        if arg not in cache:
            cache[arg] = f(arg)
        return cache[arg]
    return inner

This technique is called memoization. It is shown here implemented as a decorator, which can be easily applied to Python functions like so:

import time
@memoize
def slow(you):
    time.sleep(3)
    print("Hello after 3 seconds, {}!".format(you))
    return 3

Now if we run this function multiple times, then the result will only be computed once.

>>> slow("Davis")
Hello after 3 seconds, Davis!
3
>>> slow("Davis")
3
>>> slow("Visitor")
Hello after 3 seconds, Visitor!
3
>>> slow("Visitor")
3

This was a great speedup for the project, and the code runs without a hitch.

Disclaimer: make sure this is only used for pure functions! If memoization is used for functions with side effects such as I/O, then caching can have unintended results.
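
Also worth knowing: Python 3.2+ ships this pattern in the standard library as functools.lru_cache (there’s a functools32 backport for Python 2), with the bonus of a size limit so the cache can’t grow without bound:

import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def slow(you):
    time.sleep(3)
    return "Hello after 3 seconds, {}!".format(you)

slow("Davis")   # takes ~3 seconds
slow("Davis")   # returns instantly from the cache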

Other cases

If your code is not readily memoizable, your algorithm is not something crazy like O(n!) that you can improve, or your profile is ‘flat’ – as in there are no obvious hot sections of code – then maybe you should look into another runtime or language. PyPy is a great option, along with possibly writing a C extension for the meat of your algorithm. Luckily, what I was working on did not come to this, but the option is there if it is needed.

Conclusion

Profiling code can help you understand the flow of the project, where the hot code is, and what you can do as a developer to speed it up. Python’s profiling tools are great, super easy to use, and in-depth enough to help you get to the root cause fast. Python is not meant to be a fast language, but that does not mean that you should be writing slow code! Take charge of your algorithms, do not forget to profile, and never prematurely optimize.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

No Soup for You: When Beautiful Soup Doesn’t Like Your XML

Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy’s landing page for the library even says:

Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

You can plot me directly in the “days” column. When I’m starting a Python project that requires me to parse through HTML data, the first dependency I’ll pull in is BeautifulSoup. It turns what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I’ve learned that BS can be a huge boon for XML data too, but not without a couple of speed bumps.

Enter data from the client (not real data, but play along for a moment):

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <link>
          <url>http://atlanta.braves.mlb.com</url>
        </link>
    </team>
  </sport>
</xml>

Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.

from __future__ import print_function
from bs4 import BeautifulSoup

# Make our soup
with open('data.xml') as infile:
    blob = infile.read()
# Use LXML for blazing speed
soup = BeautifulSoup(blob, 'lxml')
print(soup.find_all('link'))

We get in return:

[<link/>]

Wait, what? What happened to our content? This is a pretty basic BS use case, but something strange is happening. Well, I’ll come back to this and start working with their other very hypothetical data sets where <link> tags become <resource> tags, but the rest of the data is structured exactly the same. This time around:

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <resource><!-- Changed to resource -->
          <url>http://atlanta.braves.mlb.com</url>
        </resource><!-- Corresponding close -->
    </team>
  </sport>
</xml>

…and corresponding result…

>>> print(soup.find_all('resource'))
[<resource>
<url>http://atlanta.braves.mlb.com/index.jsp?c_id=atl</url>
</resource>]

Interesting! To compound our problem, we’re on a customer site where we don’t have internet access to grab that sweet, sweet documentation we crave. After toiling on a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO’s founder, Jeff Atwood. Read the Source. But what was I looking for? Well, let’s dig around for <link> tags and see what turns up.

Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home, bs4/builder/__init__.py, lines 228-229 in v4.3.2).

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                          'spacer', 'link', 'frame', 'base'])

We have a seemingly harmless word with “link” in our XML, but it means something very different in HTML and more specifically, the TreeBuilder implementation that LXML is using. As a test, if I change our <link> turned <resource> tags into <base> tags we get the same result – no content. It also turns out that if you have LXML installed, BeautifulSoup4 will fall back to that for parsing. Uninstalling it grants us the results we want – tags with content. The stricter (but faster) TreeBuilder implementations from LXML take precedence over the built-in HTMLParser or html5lib (if you have it installed). How do we know that? Back to the source code!

bs4/builder/__init__.py, lines 304:321

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last result.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass

As it turns out, when creating your soup, ‘lxml’ != ‘xml’. Changing the soup creation gets us the results we’re looking for (UPDATE: corresponding doc “helpfully” pointed out by a Reddit commenter here). BeautifulSoup was still falling back to the HTML builders, which is why we were seeing those results when specifying ‘lxml’.

# Use the XML parser for sanity
soup = BeautifulSoup(blob, 'xml')

While I didn’t find that magic code snippet to fix everything (UPDATE: thanks, Reddit), we found our problem – we just went really roundabout to get there. Understanding why it was happening made me feel a lot better in the end. It’s easy to get frustrated when coding, but always remember: read the docs and Read the Source, Luke. It might help you understand the problem.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

Building the HumanGeo Website Map

HumanGeo is a relatively new company – the founding date is a little bit fuzzy, but 4 years old seems to be the prevailing notion.  Our website hadn’t changed much in three plus years, and it was looking a little bit dated, even by three year old website standards, so last year we decided to take the plunge and update it.

Screen Shot 2015-06-22 at 7.08.36 AM
Here’s an Internet Archive snapshot of the old site, complete with shadowy HumanGeo guy (emphasis on the “Human” in HumanGeo, perhaps?)

It’s not that the old site was bad, it just wasn’t exciting, and it didn’t seem to capture all the things that make HumanGeo unique – it just sort of blended in with every other corporate website out there.  In building the new website, we set out to create an experience that is a little less corporate and a little more creative, with a renewed focus on concisely conveying what it is we actually do here.  Distilling all of the things we do across projects into a cohesive, clear message was a challenge on par with building the website itself, but I’ll save that discussion for another time.   Since HumanGeo is a young company that operates primarily in the services space, our corporate identity, while centered on a few core themes, is constantly evolving – very much driven by the needs of our partners and the solutions we collectively build.  One of the core themes that has remained despite the subtle changes in our identity over time is our focus on building geospatial solutions.  It’s the “Geo” in HumanGeo; this is precisely why we chose to display a map as the hero image on our main page (sorry HumanGeo guy).  This article is all about building that map, which is less a static image and more a dynamic intro into HumanGeo’s world.  My goal with this article is to highlight the complexity that went into building the map and some of the ways in which the Leaflet DVF and a few other libraries (Stamen, TopoJSON, Parallax JS) simplified development or added something special.

The HumanGeo Website Map

When you arrive at our site, you’re greeted with a large Leaflet JS-based map with a Stamen watercolor background and some cool parallax effects (at least I think it’s cool).  It’s more artsy than realistic (using the Stamen watercolor map tiles pretty much forced our hand stylistically), an attempt to emphasize our creative and imaginative side while still highlighting the fact that geospatial is a big part of many of the solutions that we build.  The map is centered on DC with HumanGeo’s office in Arlington in the lower-left and some pie charts and colored markers in the DC area. The idea here is that you’re looking at our world from above, and it’s sort of a snapshot in time.  At any given point in time there are lots of data points being generated from a variety of sources (e.g. people using social media, sensors, etc.), and HumanGeo is focused on helping our partners capture, analyze and make sense of those data points.  As you move your mouse  over the map or tilt your device, the clouds, satellite, plane, and map all move – simulating the effect of changing perspective that you might see if you were holding the world in your hands.  In general, I wanted this area (the typical hero image/area) to be more than just a large static stock image that you might see on many sites across the web – I wanted it to be dynamic.  If you spend any time browsing our site, you’ll see that this dynamic hero area is a common element across the pages of our site.

Parallax

Parallax scrolling is popular these days, making its way into pretty much every slick, out of the box website template out there.  Usually you see it in the form of a background image that scrolls at a different speed than the main text on a page.  It gives the page dimension, as if the page is composed of different layers in three dimensional space.  In general, I’m a fan of this effect.  However, I decided to go with a different approach for this redesign – mainly because I saw a cool library that applied parallax effects based on mouse or device movement rather than scrolling, and the thought of combining it with a map intrigued me.  The library I’m referring to is Parallax JS, a great library that makes these type of effects easy.  To create a parallax scene, you simply add an unordered list (ul) element – representing a scene – to the body of your page along with a set of child list item (li) elements that represent the different layers in the scene.  Each li element includes a data-depth attribute that ranges from 0.0 to 1.0 and specifies how close to or far away from the viewer that layer will appear (closeness implies that the movement will be more exaggerated).  A value of 0.0 indicates that the layer will be lowest in the list and won’t move, while a value of 1.0 means that the layer will move the most.   In this case, my scene is composed of the map as the first li element with various layers stacked on top of it including the satellite imaging boxes, the satellite, the clouds, and the plane.  The map has a data-depth value of 0.05, which means it will move a little, and each layer over top of the map has an increasing data-depth value.  Here’s what the markup looks like:


<ul id="scene">
<li class="layer maplayer" data-depth="0.05">
 <div id="map"></div>
</li>
<li class="layer imaginglayer" data-depth="0.30">
 <div style="opacity:0.2; top: 190px; left: 1100px; width: 400px; height: 100px;" />
</li>
<li class="layer imaginglayer" data-depth="0.40">
 <div style="opacity:0.3; top: 180px; left: 1150px; width: 300px; height: 75px;" />
</li>
<li class="layer imaginglayer" data-depth="0.50">
 <div style="opacity:0.4; top: 170px; left: 1200px; width: 200px; height: 50px;" />
</li>
<li class="layer imaginglayer" data-depth="0.60">
 <div style="opacity:0.5; top: 160px; left: 1250px; width: 100px; height: 25px;" />
</li>
<li class="layer cloudlayer" data-depth="0.60"></li>
<li class="layer planelayer" data-depth="0.90"></li>
<li class="layer satellitelayer" data-depth="0.95">
 <img src="images/satellite.png" style="width: 200px; height: 200px; top: 20px; left: 1200px;">
</li>
</ul>

A Moving Plane

On modern browsers, the plane moves via CSS 3 animations, defined using keyframes.  Keyframes establish properties that will be applied to the HTML element being animated at different points in the animation.  In this case, the plane will move in a linear path from (0, 800) to (250, -350) through the duration of the animation.


@keyframes plane1 {

     0% {
          top: 800px;
          left: 0px;
     }

     100% {
          top: -350px;
          left: 250px;
     }

}

@-webkit-keyframes plane1 {

     0% {
          top: 800px;
          left: 0px;
     }

     100% {
          top: -350px;
          left: 250px;
     }

}

@-moz-keyframes plane1 ...

@-ms-keyframes plane1 ...

#plane-1 {
     animation: plane1 30s linear infinite normal 0 running forwards;
     -webkit-animation: plane1 30s linear infinite normal 0 running forwards;
     -moz-animation: plane1 30s linear infinite normal 0 running forwards;
     -ms-animation: plane1 30s linear infinite normal 0 running forwards;
}

On to the Map

The map visualizations use a mix of real and simulated data.  Since we’re located in the DC area, I wanted to illustrate displaying geo statistics in DC, but first, I actually needed to find some statistics related to DC.  I came across DC’s Office of the City Administrator’s Data Catalog site that lists a number of statistical data resources related to DC.  Among those resources are Comma Separated Value (CSV) files of all crime incident locations for a given year, such as this file, which lists crime incidents for 2013.  The CSV file categorizes each incident by a number of attributes and administrative boundary categories, including DC voting precinct.  I decided to use DC voting precincts as the basis for displaying geo statistics, since this would nicely illustrate the use of lower level administrative boundaries, and there are enough DC voting precincts to make the map interesting versus using another administrative boundary level like the eight wards in DC.  I downloaded the shapefile of DC precincts from the DC Atlas website and then used QGIS to convert it to GeoJSON to make it easily parseable in JavaScript.  At this point, I ran into the first challenge.  The resulting GeoJSON file of DC precincts is 2.5 MB, which is obviously pretty big from the perspective of ensuring that the website loads quickly and isn’t a frustrating experience for end users.

To make this file useable, I decided to compress it using the TopoJSON library. TopoJSON is a clever way of compressing GeoJSON, invented by Mike Bostock of D3 and New York Times visualization fame.  TopoJSON compresses GeoJSON by eliminating redundant line segments shared between polygons (e.g. shared borders of voting precinct boundaries). Instead of each polygon in a GeoJSON file sharing redundant line segments, TopoJSON stores the representation of those shared line segments, and then each polygon that contains those segments references the shared segments. It does amazing things when you have GeoJSON of administrative boundaries where those boundaries often include shared borders. After downloading TopoJSON, I ran the script on my DC precincts GeoJSON file:

topojson '(path to GeoJSON file)' -o '(path to output TopoJSON file)' -p

Note that the -p parameter is important, as it tells the TopoJSON script to include GeoJSON feature properties in the output.  If you leave this parameter off, the resulting TopoJSON features won’t include properties that existed in the input GeoJSON.  In my case, I needed the voting_precinct property to exist, since the visualizations I was building relied on attaching statistics to voting precinct polygons, and those polygons are retrieved via the voting_precinct property.  Running the TopoJSON script yielded a 256 KB file, reducing the size of the input file by 90%!

It’s important to note that in rare cases using TopoJSON doesn’t make sense.  You’ll have to weigh the size of your GeoJSON versus including the TopoJSON library (a light 8 KB minified) and the additional processing required by the browser to read GeoJSON out of the input TopoJSON.  So, if you have a small GeoJSON file with a relatively small number of polygons, or the polygon features in that GeoJSON file don’t share borders, it might not make sense to compress it using TopoJSON.  If you’re building choropleth maps using more than a few administrative boundaries, though, it almost always makes sense.

Once you have TopoJSON, you’ll need to decode it into GeoJSON to make it useable.  In this case:


precincts = topojson.feature(precinctsTopo, precinctsTopo.objects.dcprecincts);

precinctsTopo is the JavaScript variable that references the TopoJSON version of the precinct polygons.  precinctsTopo.objects.dcprecincts is a GeoJSON GeometryCollection that contains each precinct polygon.  Calling the topojson.feature method will turn that input TopoJSON into GeoJSON.

Now that I had data and associated polygons, I decided to create a couple of visualizations.  The first visualization illustrates crimes in DC by voting precinct using a single L.DataLayer instance but provides different visuals based on the input precinct.  For some precincts, I decided to display pie charts sized by total crime, and for others I decided to draw a MarkerGroup of stacked CircleMarker instances sized and colored by crime.

Screen Shot 2014-09-18 at 7.20.11 PM

Screen Shot 2014-09-18 at 7.19.44 PM

The reason for doing this is that I wanted to illustrate two different capabilities that we provide our partners, event detection from social media and geo statistics.  Including these in the same DataLayer instance ensures that the statistical data only get parsed once.  For the event detection layer, I also wanted to spice it up a bit by adding an L.Graph instance that illustrates the propagation of an event in social media from its starting point to surrounding areas.  In this case, I generated some fake data that represent the movement of information between two precincts, where each record is an edge.  Here’s what the fake data look like:


var precinctConnections = [
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 127',
 'cnt': '120'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 142',
 'cnt': '5'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 128',
 'cnt': '89'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 2',
 'cnt': '65'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 1',
 'cnt': '220'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 90',
 'cnt': '28'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 130',
 'cnt': '180'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 89',
 'cnt': '150'
 },
 {
 'p1': 'Precinct 129',
 'p2': 'Precinct 131',
 'cnt': '300'
 }
];

Here’s the code for instantiating an L.Graph instance and adding it to the map.


var connectionsLayer = new L.Graph(precinctConnections, {
     recordsField: null,
     locationMode: L.LocationModes.LOOKUP,
     fromField: 'p1',
     toField: 'p2',
     codeField: null,
     locationLookup: precincts,
     locationIndexField: 'voting_precinct',
     locationTextField: 'voting_precinct',
     getEdge: L.Graph.EDGESTYLE.ARC,
     layerOptions: {
          fill: false,
          opacity: 0.8,
          weight: 0.5,
          fillOpacity: 1.0,
          distanceToHeight: new L.LinearFunction([0, 5], [5000, 1000]),
          color: '#777'
     },
     tooltipOptions: {
          iconSize: new L.Point(90,76),
          iconAnchor: new L.Point(-4,76)
     },
     displayOptions: {
          cnt: {
                displayName: 'Count',
                weight: new L.LinearFunction([0, 2], [300, 6]),
                opacity: new L.LinearFunction([0, 0.7], [300, 0.9])
          }
     }
});

map.addLayer(connectionsLayer);

The locations for nodes in the graph are specified via the precincts GeoJSON retrieved from the TopoJSON file I discussed earlier.  fromField and toField tell the L.Graph instance how to draw lines between nodes.

The pie charts include random slice values.  I could have used the actual data and broken down the crime in DC by type, but I wanted to keep the pie charts simple, and three slices seemed like a good number.  There are more than three crime types that were committed in DC in 2013.  As a bit of an aside, here’s a representation of crimes in DC color-coded by type  using L.StackedRegularPolygonMarker instances.

dccrimes

The code for this visualization can be found in the DVF examples here.  It’s more artistic than practical.  One of the nice features of the Leaflet DVF is the ability to easily swap visualizations by changing a few parameters – so in this example, I can easily swap between pie charts, bar charts, and other charts just by changing the class that’s used (e.g. L.PieChartDataLayer, L.BarChartDataLayer, etc.) and modifying some of the options.

After I decided what information I wanted to display using pie charts, I ventured over to the Color Brewer website and settled on some colors – the 3 class Accent color scheme in the qualitative section here – that work well together while fitting in with the pastel color scheme prevalent in the Stamen watercolor background tiles.  The Leaflet DVF has built in support for Color Brewer color schemes (using the L.ColorBrewer object), so if you see a color scheme you like on the Color Brewer website, it’s easy to apply it to the map visualization you’re building using the L.CustomColorFunction class, like so:

var colorFunction = new L.CustomColorFunction(1, 55, L.ColorBrewer.Qualitative.Accent['3'], {
 interpolate: false
});

I also wanted to highlight some of the work we’ve been doing related to mobility analysis, so I repurposed the running data from the Leaflet DVF Run Map example and styled the WeightedPolyline so that it fit in better with our Stamen watercolor map, using a white to purple color scheme.  The running data look like this:


var data={
 gpx: {
 wpt: [
 {
 lat: '38.90006',
 lon: '-77.05691',
 ele: '4.98652',
 name: 'Start'
 },
 {
 lat: '38.90082',
 lon: '-77.0572',
 ele: '4.9469',
 name: 'Run 001'
 },
 {
 lat: '38.90098',
 lon: '-77.05725',
 ele: '5.23951',
 name: 'Run 002'
 },
 ...
 {
 lat: '38.89517',
 lon: '-77.02822',
 ele: '4.74573',
 name: 'Run 109'
 }
 ]
 }
}

Check out the Run Map example to see how this data can be turned into a variable weighted line or the GPX Analyzer example that lets you drag and drop GPS Exchange Format (GPX) files (a common format for capturing GPS waypoint/trip data) onto the map to see variations in the speed of your run/bike/or other trip in space and time.

To make the map a little more interesting, I added some additional layers for our office, employees, and some boats in the water (the employee and boat locations are made up).  Now that I had the map layers I wanted to use, I incorporated them into a tour that allows a user to navigate the map like a PowerPoint presentation.  I built a simple tour class that has methods for navigating forward and backward.  Each tour action consists of an HTML element with some explanatory text and an associated function to execute when that action occurs.  In this case, the function to execute usually involves panning/zooming to a different layer on the map or showing/hiding map layers.  The details of the tour interaction go beyond collecting, transforming and visualizing geospatial data, so feel free to explore the code here and check out the tour for yourself by visiting our site and clicking on the HumanGeo logo.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

(Machine) Learning About Love: Who will Leave The Bachelor next?

What is love?  Can you define love, or do you merely recognize its presence or absence without binding it in a definition?  A computer may be able to provide a definition of love, regurgitated from a database, but can a computer recognize love?  If defining love is so difficult for us, many of whom might cop out with “I know it when I see it”, how much more difficult will a computer find this challenge?  On a lark, I decided to implement a machine learning system to predict a few things about a television show about love, ABC’s “The Bachelor”.

My girlfriend (now fiancée) regularly watches The Bachelor; valuing our relationship, I am sometimes drawn in.  The premise of the show is this: Chris Soules (The Bachelor), a single man, acts as the sole decision maker in a competition among 30 eligible women to identify his potential bride.  Each week, one or more of the 30 women is eliminated as the bachelor gets to know them.  This is the 19th season of the show (there have been 10 iterations of its partner show, “The Bachelorette”).

If we can teach a computer to recognize some measure of love, it is certainly possible to predict who will depart the show next.  Regardless of how we frame this question, there are challenges.  First off, we have no clue how the Chris Soules “mental machine” works.  More importantly, the 30 data points we have (one per contestant) contain far too many features (most of which aren’t quantifiable) of which we only know a dozen or so.  Given a dozen features, we cannot build a statistically reliable model for prediction.  For the sake of entertainment and self-enrichment, I decided to make an attempt in the face of these challenges.

Machine Learning Techniques

I decided to use a Decision Tree so that I could investigate the factors personally, and then expanded to also use a Support Vector Machine (SVM).  An understanding of the SVM, coupled with a quick glance at the scikit-learn algorithm cheat sheet, drove my decision to incorporate both.
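
Both classifiers come straight from scikit-learn, so the setup is only a few lines. Here’s a minimal sketch with a toy stand-in for the real feature vectors (the real features come out of the data formatter used later):

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real feature vectors: [age, num_tattoos, hair_length].
X = [[24, 0, 3],
     [28, 2, 2],
     [26, 0, 4],
     [30, 1, 1]]
# 1 = eliminated, 0 = still in the running.
y = [1, 0, 1, 0]

tree_clf = DecisionTreeClassifier().fit(X, y)
svm_clf = SVC().fit(X, y)

print(tree_clf.predict([[25, 0, 2]]))
print(svm_clf.predict([[25, 0, 2]]))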

Data Collection

I first wrote a scraper to pull some data off the open Internet about each candidate: a photograph, age, and some ancillary data.  From the photograph, I manually identified features such as ethnicity and hair type (length, curliness, color).  From the ancillary data, I acquired other quantifiable numbers – number of tattoos and age – as well as whether or not the contestant had been eliminated.

Each individual is encoded in JSON like so:

 {
     "hair_wavy": "straight",
     "hometown_name": "Hamilton",
     "height_inches": 64,
     "age": 24,
     "num_tattoos": 0,
     "hair_color": "dark",
     "name": "Alissa Giambrone",
     "hair_length": "chest",
     "date_fear": " Running into recent exes",
     "occupation": "Flight Attendant",
     "likes": [
        "Family",
        "friends",
        "laughter",
        "hope",
        "faith"
     ],
     "free_text": " If I never had to upset others, I would be very happy.  If I never got to play with puppies, I would be very sad.  If you could be any animal, what would you be? A wild mustang. Free to run and explore, they're unpredictable and beautiful, and are loyal to their herd.  If you won the lottery, what would you do with your winnings? Adopt dogs and charter a jet for my friends anf fmaily to fly around Europe, with unlimited champagne and a hot air balloon ride over Greece.  What's your most embarrassing moment? I was in-depth stalking a guy's Facebook page and sent my friend a long, detailed text about my findings...except I sent it to him. Oops.  What is your greatest achievement to date? Getting my yoga certification because I've been able to inspire others to do yoga and become instructors!  ",
     "eliminated": true,
     "photo_url": "http://static.east.abc.go.com/service/image/ratio/id/90fe9103-e788-418d-9eb1-4204b7ba1b96/dim/690.1x1.jpg?cb=51351345661",
     "hometown_state": " NJ",
     "ethnicity": "caucasian",
     "goneweek": 2,
     "featured": true,
     "featured_num": 6,
     "intro_order": 21
}

Some data properties are categorical (such as ethnicity and hair color); some data properties are ordinal (hair length), and some data properties are ratios (such as age or number of tattoos). The Python script encodes categorical data into ordinal, numeric values through a lookup dictionary.

hair_length = dict({"neck": 1,
                    "shoulder": 2,
                    "chest": 3,
                    "stomach": 4})

Workflow

Given this sparse data, it’s hard to train the machine.  More importantly, we can’t break the data into training and testing sets; because there’s so little data, we need to use all of it.  For this reason, I decided to randomly pick 25% of the data for training the tree or SVM, then run those very same points back through the classifier.  To offset this horrible violation of machine learning principles, I’ll run the model a thousand times and see who gets misclassified as “eliminated” the most.

Here’s some code I wrote to train the samples:

def week_predict(tgt_data, elims, tgt_week, sc_learn):
    """
    Given some data, make some predictions and return the average accuracy.
    :param tgt_data: json structure of data to be formatted
    :param elims: contestants already eliminated
    :param tgt_week: week of the season to predict through
    :param sc_learn: scikit-learn classifier to train
    :return: dict with the model's accuracy and predicted departures
    """

    learn_values = data_formatter(tgt_data, elims, tgt_week)
    x = list()
    y = list()
    samples = set()
    while len(samples) < (len(learn_values) * 0.25):
        samples.add(random.randint(0, len(learn_values) - 1))
    learn_arr = learn_values.values()
    # print "Sample selection: ", samples
    for index in samples:
        x.append(learn_arr[index][0])
        y.append(learn_arr[index][1])

    # These next two lines courtesy of:
    # http://scikit-learn.org/stable/modules/tree.html
    clf = sc_learn
    try:
        clf = clf.fit(x, y)
    except ValueError:
        # Sometimes we try to train with 0 classes.
        return dict({"accuracy": 0, "departures": []})
    c = 0
    departures = list()
    for item in learn_values:
        if clf.predict(learn_values[item][0]) != learn_values[item][1]:
            c += 1
            if learn_values[item][1] == 0:  # if they aren't eliminated
                departures.append(item)
    accuracy = round((len(learn_values) - c) / float(len(learn_values)) * 100, 2)
    ret_val = dict({"accuracy": accuracy,
                    "departures": departures})
    return ret_val

Output

These are my leading predictors for departure.  For the decision tree, here are the results:

Departing: Britt Nilsson votes: 640
Departing: Carly Waddell votes: 581
Departing: Jade Roper votes: 538
Departing: Becca Tilley votes: 520
Departing: Megan Bell votes: 506
Departing: Kaitlyn Bristowe votes: 365

For the SVM, the output:

Departing: Britt Nilsson votes: 749
Departing: Megan Bell votes: 722
Departing: Becca Tilley votes: 714
Departing: Jade Roper votes: 707
Departing: Carly Waddell votes: 629
Departing: Kaitlyn Bristowe votes: 619

I’ve arranged each of these in descending order; the more votes someone has for departure, the more frequently they are miscategorized as eliminated based on who has already been eliminated and who is still there.

Analysis

Does this make sense given the current state of the contest? When I discussed these predicted outcomes with expert viewers, they were adamant that Britt, the contestant I’ve pegged as most likely to leave, has almost no chance of departing.  She made a strong impression on Chris and seems to have a strong connection with him, so for her to leave would be a shocker; on the other hand, many of these same viewers thought Kaitlyn had a good chance of winning, which would validate her low departure score.

UPDATE: 

This blog post didn’t make it to the Internet in time for these predictions to be evaluated in public.  As such, (spoiler alert) we know that Britt and Carly departed in Week 7, followed by Jade in Week 8.  So, here are some predictions for next week:

Decision Tree results:

Departing: Becca Tilley: 3590
Departing: Whitney Bischoff: 3357
Departing: Kaitlyn Bristowe: 3320

Support Vector Machine results:

Departing: Becca Tilley: 1799
Departing: Whitney Bischoff: 1760
Departing: Kaitlyn Bristowe: 1715

Analysis 2.0

We’ve trained an SVM with just a few variables that don’t capture the essence of the problem very well.  Furthermore, we did no fine-tuning of the SVM’s parameters, nor did we prune the decision tree.

In order to rectify the data issue, I have put out a call to fans of The Bachelor for additional data about any facet of the show, including:

  • How many kisses per episode per contestant
  • Who gets what date?
  • Number of minutes of on-air time featured per episode, per contestant
  • Distance from contestant hometown to Arlington, Iowa
  • Any other data

Tools Used

I used mostly off-the-shelf Python libraries to gather data and produce these predictions: for web scraping, selenium and BeautifulSoup; for machine learning, scikit-learn; for keeping the code clean, pylint.

Code is available at: https://github.com/jamesfe/bachelor_tree

If you have questions, feel free to tweet them to me at @jimmysthoughts!

String tokenization

Sometimes you’ve just got to tokenize a string.

Let’s say you find you need to parse a string based on the tokens it comprises. It’s not an uncommon need, but there’s more than one way to go about it. What does “token” mean to you in this situation? How can you break a string down into a list of them?

It can be daunting answering such an open-ended question. But not to fear! Here I’ll be explaining your best options and how you’ll know when to use them. I’m using Python for these examples, but you should feel free to implement the same algorithms in your language of choice.

Tokenization by delimiters via split

By far the most straightforward approach is one that you’ve surely used before, and that’s quickly done in essentially any language. It’s to split a string on each occurrence of some delimiter.

Say you were reading a CSV file and for each row your tokens were to be the values associated with each column. It’s quick, easy, and almost always sufficient to split each line on the commas. Here’s a simple example utilizing a string’s split method:

>>> string = 'Hello world!'
>>> tokens = string.split(' ')
>>> tokens
['Hello', 'world!']

But the usefulness of splitting doesn’t end there. You could also introduce some regex and define multiple characters as being delimiters.

>>> import re
>>> string = "Don't you forget about me."
>>> tokens = re.split( r'[e ]', string )
>>> tokens
["Don't", 'you', 'forg', 't', 'about', 'm', '.']

Unfortunately, delimiters can only get you so far when doing tokenization. What if you needed to break crazier things like “abc123” into separate tokens, one for the letters and one for the numbers? Or maybe you wanted your delimiters to be represented as separate tokens, not to be tossed entirely? You wouldn’t be able to do it using a generic split method.
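
One middle ground is worth noting before giving up on splitting entirely: re.split will hand the delimiters back as tokens if you wrap the pattern in a capturing group – though it still can’t carve up letter/digit boundaries like “abc123”:

>>> import re
>>> string = "Don't you forget about me."
>>> re.split( r'([e ])', string )
["Don't", ' ', 'you', ' ', 'forg', 'e', 't', ' ', 'about', ' ', 'm', 'e', '.']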

Tokenization by categorization of chars

One option for when strictly splitting on delimiters won’t work is assigning tokens by character categorization. It’s still fast, and its flexibility is likely to be enough for most cases.

Here’s an example where the characters in a string are iterated through and tokens are assigned by the categories each character falls into. When the category changes, the token made up of accumulated characters so far is added to the list and a new token is begun.

>>> def tokenize( string, categories ):
...     token = ''
...     tokens = []
...     category = None
...     for char in string:
...         if token:
...             if category and char in category:
...                 token += char
...             else:
...                 tokens.append( token )
...                 token = char
...                 category = None
...                 for cat in categories:
...                     if char in cat:
...                         category = cat
...                         break
...         else:
...             category = None
...             if not category:
...                 for cat in categories:
...                     if char in cat:
...                         category = cat
...                         break
...             token += char
...     if token:
...         tokens.append( token )
...     return tokens
...
>>> string = 'abc, e4sy as 123!'
>>> tokens = tokenize( string, [ '0123456789', ' ', ',.;:!?', 'abcdefghijklmnopqrstuvwxyz' ] )
>>> tokens
['abc', ',', ' ', 'e', '4', 'sy', ' ', 'as', ' ', '123', '!']

There are a lot of improvements you could make to this tokenization function! For example, you could record which category each token belonged to. You could make the categories more specific, you could implement some flexibility in how they’re handled. You could even use regular expressions for verifying whether a current token plus the next character still fits some category. These are left as exercises to the reader.

Tokenization using regular expressions

But there’s a much better way to tokenize when you need to do more than the basics. Here you can find a lexical scanner package which uses regular expressions to define what tokens should look like: https://github.com/pineapplemachine/lexscan. (“Lexical scanner” is a fancy word for a tokenizer that would typically be the first step in a lexical analysis, which is how interpreters and compilers make sense of code. But their applications are in no way limited to just those.) Installation would entail placing the repo’s lexscan directory somewhere that your code can access it.

It’s very handy! Here’s a simple example:

>>> import lexscan
>>> from pprint import pprint
>>> spaceexp = lexscan.ScanExp( r'\s+', significant=False )
>>> wordexp = lexscan.ScanExp( r'[a-z]+' )
>>> numexp = lexscan.ScanExp( r'\d+' )
>>> string = '2001: A Space Odyssey'
>>> tokens = lexscan.tokenize( string, ( spaceexp, wordexp, numexp ) )
>>> pprint( tokens )
[1:0: '2001' \d+,
 1:4: ':' None,
 1:6: 'A' [a-z]+,
 1:8: 'Space' [a-z]+,
 1:14: 'Odyssey' [a-z]+]

Tokens are assigned based on the longest match among the various expressions provided, starting at the end of the previous token. The concept is simple, but tokenization doesn’t get much better. The functionality really shines when you’re using more complex regular expressions. What if you needed to differentiate between integers and floating point numbers? What if you wanted to capture a string literal enclosed in quotes as a single large token? You can do that with a proper scanner, and all it takes is the wherewithal to put together a set of regular expressions that do what you need. Here’s a more interesting example, closer to what you would actually want to use an implementation like this for.

>>> import lexscan
>>> from pprint import pprint
>>> spaceexp = lexscan.ScanExp( r'\s+', significant=False, name='whitespace' )
>>> identifierexp = lexscan.ScanExp( r'[a-z_]\w+', name='identifier' )
>>> delimiterexp = lexscan.ScanExp( r'[,;]', name='delimiter' )
>>> intexp = lexscan.ScanExp( r'\d+', name='integer' )
>>> floatexp = lexscan.ScanExp( r'[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?', name='float' )
>>> stringexp = lexscan.ScanExp( r'"((?:\\.|[^\\"])*)"', name='string' )
>>> assignexp = lexscan.ScanExp( r'[=:]', name='assignment' )
>>> string = 'strings: "foobar", "don\'t stop me n0w"; numbers: 450, 13.37'
>>> tokens = lexscan.tokenize( string, ( spaceexp, identifierexp, delimiterexp, intexp, floatexp, stringexp, assignexp ) )
>>> pprint( tokens )
[1:0: 'strings' identifier,
 1:7: ':' assignment,
 1:9: '"foobar"' string,
 1:17: ',' delimiter,
 1:19: '"don't stop me n0w"' string,
 1:38: ';' delimiter,
 1:40: 'numbers' identifier,
 1:47: ':' assignment,
 1:49: '450' integer,
 1:52: ',' delimiter,
 1:54: '13.37' float]

Of course, splitting a string will always be faster than using regular expressions for tokenizing. Be judicious in deciding which approach to use!
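For instance, when whitespace is the only delimiter you care about, plain str.split does the job in a single call; the trade-off is that punctuation stays glued to the words:

>>> 'abc, e4sy as 123!'.split()
['abc,', 'e4sy', 'as', '123!']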

Keeping Up With the Cool Kids: Elasticsearch and Urban Dictionary – Part 1

At HumanGeo, we love Elasticsearch and we love social media. Elasticsearch lends itself well to a variety of interesting ways to process the vast amount of content in social media. But like most things on the internet, keeping up with slang and trends in social media text can be an increasingly difficult barrier to analyzing the data (so can getting beyond your teenage years). So how do we get past this barrier? If the web is so powerful, can’t we use it to help us understand what’s really being said?

Enter Urban Dictionary. Usually, if you’re looking for answers, UD might be the last place on the internet you want to look unless you have a large jug of mindbleach waiting on standby. Aside from proving that the internet is a cold, dark place, Urban Dictionary has a large amount of crowd-sourced data that can help us get some insight into today’s communication medium, whether it’s 140 characters or beyond.

In this post, my goal is to 1) collect a bunch of data from Urban Dictionary, 2) index it in such a way that I can use it to “decipher” lousy internet slang and 3) query it with “normal” terms and get extended results.

The Data

To get started, we needed the words themselves. To do this, I built a simple web scraper to crawl UD and extract the words. Here’s a snippet that pulls the words out of the DOM using Python, via Requests and Beautiful Soup.

import requests
from bs4 import BeautifulSoup

WORD_LINK = 'http://www.urbandictionary.com/popular.php?character={0}'

def make_alphabet_soup(self, letter, link=WORD_LINK):
    '''Make soup from the list of words for a given letter on the page.'''
    r = requests.get(link.format(letter))
    soup = BeautifulSoup(r.text)
    return soup

def parse_words(self, letter, soup=None):
    '''Scrape the webpage and return the words present.'''
    if not soup:
        soup = self.make_alphabet_soup(letter)
    word_divs = soup.find(id='columnist').find_all('a')
    words = [div.text for div in word_divs]
    return words

This is the basic building block, but I extended from there. For every word I grabbed, I threw it against the Urban Dictionary API and got a definition.

# Redacted
API_LINK = 'http://ud_api.com'

def define(self, word, url=API_LINK):
    '''Send a request with the given word to the UD JSON API.'''
    r = requests.get(url, params={'term': word}, timeout=5.0)
    j = r.json()
    # Add our search term to the document for context
    j.update({'word': word})
    return j

Using this method, I ended up with about 100k “popular” words, as defined by UD. An example response from the API looks something like:

{
 "tags": [
    "black",
    "ozzy",
    "sabbath",
    "black sabbath",
    "geezer",
    "metal",
    "osbourne",
    "tony",
    "bill",
    "butler"
 ],
 "result_type":"exact",
 "list": [
    {
       "defid": 772739,
       "word": "Iommi",
       "author": "Matthew McDonnell",
       "permalink": "http://iommi.urbanup.com/772739",
       "definition": "Iommi = a Godlike person. A master of their chosen craft. Someone or something extremely cool",
       "example": "Example 1. Hey rick, that motorcyle stunt you did was really Iommi! \r\n\r\nExample 2. That guy is SO Iommi! \r\n\r\nExample 3. Be Iommi, man!",
       "thumbs_up": 57,
       "thumbs_down": 3,
       "current_vote": ""
    }
 ],
 "sounds":[]
}

Now that I had the data, it was time to make something out of it.

The Process

With our data in hand, it’s time to utilize Elasticsearch. More specifically, it’s time to take advantage of the Synonym Token Filter when indexing data into Elasticsearch.

A quick interjection about indexing: this is a good time to talk about “the guts” of how data is indexed into Elasticsearch. If you don’t specify your mappings when indexing data, you can get unexpected results if you’re not familiar with the mapping/analysis process. By default, string data is tokenized upon indexing, which is great for full-text search but not when we want exact matches against multi-word terms. For example, if I search my index for an exact match on “brown fox”, I will still get results for the sentence “John Brown was attacked by a fox.” You can read more about that behavior here. A good strategy is to create a subfield of “word”, such as “word.raw”, that is set to not_analyzed in your mapping, along the lines of the sketch below.
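A minimal sketch of that multi-field pattern, using an index and field name invented for this example (not the mapping used later in this post), might look like:

PUT /slang
{
   "mappings": {
      "word": {
         "properties": {
            "word": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}

Exact-match queries can then target word.raw, while full-text queries keep using the analyzed word field.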

Using the data we collected, we can generate the Solr synonym file required by the token filter. To do this, I used the “tags” area of the definition. These definitely are not true synonyms (sometimes you just get a bunch of racism and filth), but they do provide (potentially) related words for the original word; I’ll show what a generated synonym line looks like right after the list. For example, here are the tags for the word “internet”:

  • “facebook”
  • “web”
  • “computer”
  • “myspace”
  • “lol”
  • “google”
  • “online”
  • “porn”
  • “youtube”
  • “internets”

I mean, they’re not wrong. Here’s an example of adding the mapping I used on the “test” index, applied to the “name” field:

justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test" -d '
{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "synonym": {
                  "tokenizer": "whitespace",
                  "filter": [
                     "synonym"
                  ]
               }
            },
            "filter": {
               "synonym": {
                  "type": "synonym",
                  "synonyms_path": "/tmp/solr-synonyms.txt"
               }
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "name": {
               "type":"string",
               "index":"analyzed",
               "analyzer":"synonym"
            }
         }
      }
   }
}'

The Search

Now that we have our index set up, it’s time to put a search in action. Until I went down this rabbit hole, I had no idea calling something Iommi was a thing (it probably isn’t). As someone who likes Black Sabbath, I want to find other words in my index that are totally Iommi. Using the mapping I specified above, I indexed a few sample documents with the “name” field set to tags that UD relates to Iommi, as well as some bogus filler. Example tags (and no, I did not make this example up):

  • “sabbath”
  • “black sabbath”
  • “geezer”
  • “metal”

Our query (in Sense, against the ‘test’ index), and the results:

POST _search
{
    "query": {
       "term": {
          "name": "iommi"
      }
   }
}

Awesome! This installment is more about showcasing how the filter works, so it’s not entirely practical. Look out for a future installment where we use real social media data to do “extended search” and display results with Elasticsearch highlighting, to show a practical example of this in action.

Beer Caps to Coffee Tables

Some years ago, I lived in Norfolk, VA with a friend who loved craft beer as much as I did. Our hoppy passion motivated us to brew beer at home, visit local craft beer stores, and generally enjoy our nights with a brew here or there. Subconsciously knowing it would be a brilliant idea, we used a tupperware box next to our refrigerator to house a growing collection of beer caps, feeding it a single cap as each beer departed the cold, destined for enjoyment.

After a few years, I realized the box was close to being filled to the brim. The mechanical gears in my head were cranking out an idea to do something creative with the bottle caps. I wanted to incorporate some things I’d learned in my classes related to imaging, computer vision, data mining, and spectral signatures. I also had an old square coffee table (36″x36″) that could be the canvas to this project. This was the beginning of a fun exploration in image processing with python to create a bottle cap magnum opus.

Getting it Done
There were a few things I needed to do; first, I needed more caps. Given the size of a bottle cap (1″), minus some width for a border around the table and a little space between the caps, I found that the table would accept a 30 cap by 30 cap grid, so I needed at least 900 caps. A local bar supported this effort by hosting a 5 gallon bucket for the bartender to drop caps into in lieu of the trash can.

Second, I needed a way to extract the data (color; specifically red, green, and blue) from each individual cap. I originally arranged the caps on a gridded sheet of paper, took a photo, and extracted them row-by-column after performing some image transforms to account for the skew of the image. As it turns out, there is no good way to get a completely top-down photo of the caps; it will always turn out smaller at the top and larger at the bottom, depending on the angle you hold the camera at. The data wasn’t fantastic, and as more caps came in I knew I needed a better way; a quick survey of computer vision capabilities turned up a concept called the Hough transform.

Hough transforms are a method of identifying shapes in images that can be approximated by equations; in this case, a circle with a roughly known radius was the target. This synopsis is a vast simplification, but for each pixel in the output image (of the same size as the input image), the algorithm polls the surrounding pixels that might form a circle and assigns a value to that pixel. Based on all the pixels in the image, one can then surmise that the pixels with values above a certain threshold are the center points of circles of the known radius. To facilitate the discovery of circles of a specific radius (many caps have concentric circle patterns), I first used a Sobel operator to convert the image into an edge-only image.

caps_on_sheet
A subset of the caps, spread out and photographed.

Although I initially wrote my own Sobel filter and Hough transform in Python, I later discovered OpenCV, and it’s much smarter to just use that. OpenCV is a set of computer vision functions written in C/C++ with wrappers for Python; Hough transforms are only one of the things OpenCV can simplify from a multi-hour debacle into a few minutes of digging in the documentation for the right tool. Here is a quick snippet of something that took me dozens of lines of code in earlier editions:

import cv2
import cv2.cv as cv

def cv2hough(sobel_image, color_image, minrad, maxrad):
    '''
    Identifies and cuts out to a directory potentially circular
    shapes that match a known radius range.
    :param sobel_image: path to the Sobel-filtered (edge-only) image
    :param color_image: path to the unfiltered, color version of the same image
    :return:
    '''
    # Load the edge image as grayscale; the color image is only used when
    # the detected circles are cut out and exported.
    in_cv_image = cv2.imread(sobel_image, 0)
    avg_rad = (minrad + maxrad) / 2
    circles = cv2.HoughCircles(in_cv_image, cv.CV_HOUGH_GRADIENT, 1, 65,
                               param1=90, param2=15,
                               minRadius=minrad, maxRadius=maxrad)
    exportCircles(circles, color_image, 'caps', avg_rad)

(Documentation for cv2.HoughCircles)

This snippet concludes by sending a list of every place in the image where a cap might be to “exportCircles”, where each candidate cap is cut out of the color image and saved as a 30px × 30px JPG file.
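exportCircles isn’t shown here, but a minimal sketch of such a helper (the cropping and file-naming details are illustrative assumptions, not the original code) could be:

import os

import cv2

def exportCircles(circles, color_image, out_dir, avg_rad, size=30):
    '''Crop a square around each detected circle from the color image and
    save it as a small JPG; the detected radius is ignored in favor of the
    average cap radius.'''
    if circles is None:
        return
    image = cv2.imread(color_image)
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    for idx, (x, y, r) in enumerate(circles[0]):
        x, y = int(x), int(y)
        x0, y0 = max(x - avg_rad, 0), max(y - avg_rad, 0)
        crop = image[y0:y + avg_rad, x0:x + avg_rad]
        if crop.size == 0:
            continue  # detection too close to the image edge to crop
        crop = cv2.resize(crop, (size, size))
        cv2.imwrite(os.path.join(out_dir, 'circle_{0}.jpg'.format(idx)), crop)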

circle_3_img
A cap, post-extraction. It’s a rough approximation, but the Hough transform was sufficient.

Once we have a directory containing hundreds and hundreds of caps, we can begin some analysis. (Full disclosure: I manually scrolled through the images and deleted those that appeared unusable.) I used Python to calculate statistics for each bottle cap image and eventually stored the data in a structure. Each pixel carries a red (R), green (G), and blue (B) value; averaging these gives an average (R, G, B) triplet for each bottle cap, that is to say, a cap is “mostly” red, or “mostly” blue, etc. I soon found that these numbers weren’t that descriptive and began to work in the Hue, Saturation, and Value (HSV) color space instead. (HSL & HSV: Wikipedia).

Protip: I used a quick one-liner to convert from RGB to HSV before storing the data:
colorsys.rgb_to_hsv(rgb[0], rgb[1], rgb[2])

To save time cataloging 900 bottle caps and their pixels, I pickle (serialize) the data for each cap as I generate it, so I can quickly load a data structure containing all of the caps and rearrange them as I see fit.
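Putting those pieces together, a minimal sketch of that cataloging pass might look like the following, using Pillow and numpy; the file layout and the (path, h, s, v) structure are illustrative assumptions, not the original capViz code:

import colorsys
import glob
import pickle

import numpy as np
from PIL import Image

def average_hsv(cap_path):
    '''Average all pixels of one extracted cap image and return (h, s, v).'''
    pixels = np.asarray(Image.open(cap_path).convert('RGB'), dtype=float)
    r, g, b = pixels.reshape(-1, 3).mean(axis=0)
    return colorsys.rgb_to_hsv(r, g, b)

# Catalog every extracted cap once, then pickle the results so the slow
# image pass never has to be repeated while experimenting with layouts.
caps = [(path,) + average_hsv(path) for path in glob.glob('caps/*.jpg')]
with open('caps.pkl', 'wb') as handle:
    pickle.dump(caps, handle)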

caps_avg
The caps by average, as the computer sees them.
comp_output
The caps as we see them; the final arrangement was sorted by hue then saturation.

Final Steps: Output
I originally wanted to see what it would look like when I rearranged the bottle caps into different patterns. This is where we do that: based on the data we have and the framework we’ve built, outputting an image or an HTML document (this is a really hasty solution!) becomes fairly easy, minus one hiccup.

def first_sort(self):
    '''
    Reach into the internal config and find out what we want to have as
    our first order sort on the caps.
    :return:
    '''
    self.sortedDatList = sorted(self.imageDatList,
                                key=lambda thislist: thislist[
                                    self.sorts[self.conf['first_sort']]])
    del self.imageDatList
    self.imageDatList = self.sortedDatList

def second_sort(self):
    '''
    Second sort on the caps: sort rows individually.
    Example: We want a 30x30 grid of caps, sorted first by hue, then by
    saturation.
    We sort by hue in first_sort, then in second_sort, we sort 30 lists
    of 30 caps each individually and replace them into the list.
    :return:
    '''
    ranges = [[self.conf['out_caps_cols'] * (p - 1),
               self.conf['out_caps_cols'] * p] for p in
              range(1, self.conf['out_caps_rows'] + 1)]
    for r in ranges:
        self.imageDatList[r[0]:r[1]] = sorted(self.imageDatList[r[0]:r[1]],
                                              key=lambda thislist: thislist[
                                                  self.sorts[self.conf[
                                                      'second_sort']]])

Here we see two functions: first_sort and second_sort. You can guess that we sort the caps once, then sort them again; what may not be apparent from this code is that the second sort (which is performed on a different attribute) works row by row, based on the out_caps_rows and out_caps_cols settings. This is the hiccup: you must sort multiple sub-segments of the array of bottle caps, in essence sorting each row individually after the overall order has been established. Otherwise each row of bottle caps will have a random pattern as the image trends from one extreme (top) to the other (bottom).

caps_organized
Caps, laid out and ready to be transferred to the table.

The Physical Construction
To finish the project, I physically arranged a pattern based on informal votes from friends and acquaintances. I built a 1/2″ lip around the table with “Quarter-Round” molding. I glued this down, sealed the seams with wood glue, and painted everything black.

I filled the bottom of this open-top area with playground sand. This minimized the volume that I would have to fill in with expensive epoxy resin, and had the secondary benefit of lightening up the entire display, since I knew the epoxy would darken it. I ordered two kits of “AeroMarine 300/21 Epoxy Resin” (for a total of 3 gallons) from the internet. Ultimately, I arranged the caps on a sheet of butcher’s paper and transferred them one-by-one to the sand box.

A few friends, the famous roommate, and I gathered (with beers); we mixed the resin and began the process of pouring it onto the table. The one unforeseen consequence of the sand was that when the epoxy entered it, it pushed air upwards, causing the caps to float; we quickly used toothpicks to push the caps down into the sand, then poured more resin over top.

caps_final
The table, post-drying and finished.

Code for this was hacked together over a number of weeks; I’ve consolidated it here: https://github.com/jamesfe/capViz. Unfortunately, most of it was written in a disheveled, disconnected manner and I’m only now beginning to tie it together into one well-documented, PEP-8 compliant piece, but feel free to poke around and let me know what you think!

Thanks to The Birch for all the beer caps!

If this interested you, follow me on Twitter! @jimmysthoughts

Supercharging your reddit API access

Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we’ve recently started working with is reddit.

Compared to the walled gardens of Facebook and LinkedIn, reddit’s API is as open as open can be; everything is nice and RESTful, rate limits are sane, the developers are open to enhancement requests, and one can do quite a bit without needing to authenticate.
The most common objects we collect from reddit are submissions (posts) and comments. A submission can either be a link, or a self post with a text body, and can have an arbitrary number of comments. Comments contain text, as well as references to parent nodes (if they’re not root nodes in the comment tree). Pulling this data is as simple as GET http://www.reddit.com/r/washingtondc/new.json. (Protip: pretty much any view in reddit has a corresponding API endpoint that can be generated by appending ‘.json’ to the URL.)
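For example, fetching the newest posts in a subreddit with nothing but Requests might look like this sketch (the JSON layout is reddit’s standard Listing envelope):

import requests

# Any reddit listing page becomes a JSON endpoint when '.json' is appended.
url = 'http://www.reddit.com/r/washingtondc/new.json'
response = requests.get(url, headers={'User-Agent': 'Hello world application.'})

for child in response.json()['data']['children']:
    post = child['data']
    print(post['score'], '::', post['title'])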

With little effort a developer could hack together a quick ‘n dirty reddit scraper. However, as additional features appear and collection-breadth grows, the quick ‘n dirty scraper becomes more dirty than quick, and you discover bugs (ahem, “features”) that others utilizing the API have already encountered and possibly addressed. API wrappers help consolidate communal knowledge and best practices for the good of all. We considered several, and, being a Python shop, settled on PRAW (Python Reddit API Wrapper).

With PRAW, getting a list of posts is pretty easy:

import praw
r = praw.Reddit(user_agent='Hello world application.')
for post in r.get_subreddit('WashingtonDC') \
             .get_hot(limit=5):
    print(str(post))

$ python parse_bot_2000.py
209 :: /r/WashingtonDC's Official Guide to the City!
29 :: What are some good/active meetups in DC that are easy to join?
17 :: So no more tailgating at the Nationals park anymore...
3 :: Anyone know of a juggling club in DC
2 :: The American Beer Classic: Yay or Nay?

The Problem

Now, let’s try something a little more complicated. Our mission, if we choose to accept it, is to capture all incoming comments to a subreddit. For each comment we should collect the author’s username, the URL for the submission, a permalink to the comment, as well as its body.

Here’s what this looks like:

import praw
from datetime import datetime
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.permalink,
        comment.author.name,
        comment.body_html,
        comment.submission.url)

That was pretty easy. For the sake of this demo the save_comment method has been stubbed out, but anything can go there.

If you run the snippet, you’ll observe the following pattern:

... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
(repeating...)

This process also seems to take much longer than a normal HTTP request would. As anyone working with large amounts of data should, let’s quantify this.

Using the wonderful, indispensable IPython:

In [1]: %%timeit
requests.get('http://www.reddit.com/r/python/comments.json?limit=200')
   ....:
1 loops, best of 3: 136 ms per loop

In [2]: %%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.permalink,
          comment.author.name,
          comment.body_html,
          comment.submission.url)
1 loops, best of 3: 6min 43s per loop

Ouch. While this difference in run-times is fine for a one-off, contrived example, such inefficiency is disastrous when dealing with large volumes of data. What could be causing this behavior?

Digging

According to the PRAW documentation,

Each API request to Reddit must be separated by a 2 second delay, as per the API rules. So to get the highest performance, the number of API calls must be kept as low as possible. PRAW uses lazy objects to only make API calls when/if the information is needed.

Perhaps we’re doing something that is triggering additional HTTP requests. Such behavior would explain the intermittent printing of comments to the output stream. Let’s verify this hypothesis.

To see the underlying requests, we can override PRAW’s default log level:

from datetime import datetime
import logging
import praw

logging.basicConfig(level=logging.DEBUG)

r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.author.name,
                 comment.submission.url,
                 comment.permalink,
                 comment.body_html)

And what does the output look like?

DEBUG:requests.packages.urllib3.connectionpool:"PUT /check HTTP/1.1" 200 106
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ak14j.json HTTP/1.1" 200 888
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aies0.json HTTP/1.1" 200 2889
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aiier.json HTTP/1.1" 200 14809
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ajam1.json HTTP/1.1" 200 1091
.. comment ..
.. comment ..
.. comment ..

Those intermittent requests for individual comments back up our claim. Now, let’s see what’s causing this.

Prettifying the response JSON yields the following schema (edited for brevity):

{
   'kind':'Listing',
   'data':{
      'children':[
         {
            'kind':'t3',
            'data':{
               'id':'2alblh',
               'media':null,
               'score':0,
               'permalink':'/r/Python/comments/2alblh/django_middleware_that_prints_query_stats_to/',
               'name':'t3_2ajam1',
               'created':1405227385.0,
               'url':'http://imgur.com/QBSOAZB',
               'title':'Should I? why?',
            }
         }
      ]
   }
}

Let’s compare that to what we get when listing comments from the /python/comments endpoint:

{
   'kind':'Listing',
   'data':{
      'children':[
         {
            'kind':'t1',
            'data':{
               'link_title':'Django middleware that prints query stats to stderr after each request. pip install django-querycount',
               'link_author':'mrrrgn',
               'author':'mrrrgn',
               'parent_id':'t3_2alblh',
               'body':'Try django-devserver for query counts, displaying the full queries, profiling, reporting memory usage, etc. \n\nhttps://pypi.python.org/pypi/django-devserver',
               'link_id':'t3_2alblh',
               'name':'t1_ciwbo37',
               'link_url':'https://github.com/bradmontgomery/django-querycount',
            }
         }
      ]
   }
}

Now we’re getting somewhere – there are fields in the per-comment’s response that aren’t in the subreddit listing’s. Of the four fields we’re collecting, the submission URL and permalink properties are not returned by the subreddit comments endpoint. Accessing those causes a lazy evaluation to fire off additional requests. If we can infer these values from the data we already have, we can avoid having to waste time querying for each comment.

Doing Work

Submission URLs

Submission URLs are a combination of the subreddit name, the post ID, and title. We can easily get the post ID fragment:

post_id = comment.link_id.split('_')[1]

However, there is no title returned! Luckily, it turns out that it’s not needed.

subreddit = 'python'
post_id = comment.link_id.split('_')[1]
url = 'http://reddit.com/r/{}/{}/' \
          .format(subreddit, post_id)
print(url) # A valid submission permalink!
# OUTPUT: http://reddit.com/r/python/2alblh/

Great! This also gets us most of the way to constructing the second URL we need – a permalink to the comment.

Comment Permalinks

Maybe we can append the comment’s ID to the end of the submission URL?

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = sub_comments_url + comment_id
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/ciwbo37

Sadly, that URL doesn’t work, because reddit expects the submission’s title to precede the comment ID. Referring to the subreddit comment’s JSON object, we can see that the URL-friendly form of the title isn’t returned. This is curious: why is the title important? Reddit already has a globally unique ID for the post, and can display the post just fine without it (as demonstrated by the code sample immediately preceding this). Perhaps reddit wanted to make it easier for users to identify a link and is just parsing a forward-slash delimited series of parameters. If we put the comment ID in the appropriate position, the URL should still be valid. Let’s give it a shot:

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = '{}-/{}'.format(sub_comments_url, comment_id)
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/-/ciwbo37

Following that URL takes us to the comment!

Victory Lap

Let’s see how much we’ve improved our execution time:

%%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.author.name,
          comment.body_html,
          comment.id,
          comment.submission.id)
1 loops, best of 3: 3.57 s per loop

Wow! 403 seconds down to 3.6 seconds – a speedup of roughly 113x. Deploying this improvement to production not only increased the volume of data we were able to process, but also provided the side benefit of reducing the number of 504 errors we encountered during reddit’s peak hours. Remember, always be on the lookout for ways to improve your stack. A bunch of small wins can add up to something significant.

[Does this sort of stuff interest you? Love hacking and learning new things? Good news – we’re hiring!]

Drawing Boundaries In Python

As a technologist at HumanGeo, you’re often asked to perform some kind of analysis on geospatial data, and quickly! We frequently work on short turnaround times for our customers so anything that gives us a boost is welcome, which is probably why so many of us love Python. As evidenced by the volume of scientific talks at PyCon 2014, we can also lean on the great work of the scientific community. Python lets us go from zero to answer within hours or days, not weeks.

I recently had to do some science on the way we can observe clusters of points on the map – to show how regions of social significance emerge. Luckily I was able to lean heavily on Shapely which is a fantastic Python library for performing geometric operations on points, shapes, lines, etc. As an aside, if you are doing any sort of geospatial work with Python, you’ll want to pip install shapely. Once we found a cluster of points which we believed were identifying a unique region, we needed to draw a boundary around the region so that it could be more easily digested by a geospatial analyst. Boundaries are just polygons that enclose something, so I’ll walk through some of your options and attempt to provide complete code examples.

The first step towards geospatial analysis in Python is loading your data. In the example below, I have a shapefile containing a number of points which I generated manually with QGIS. I’ll use the fiona library to read the file in, and then create point objects with shapely.

import fiona
import shapely.geometry as geometry
input_shapefile = 'concave_demo_points.shp'
shapefile = fiona.open(input_shapefile)
points = [geometry.shape(point['geometry'])
          for point in shapefile]

The points list can now be manipulated with Shapely. First, let’s plot the points to see what we have.

import pylab as pl
x = [p.coords.xy[0] for p in points]
y = [p.coords.xy[1] for p in points]
pl.figure(figsize=(10,10))
_ = pl.plot(x,y,'o', color='#f16824')

H_sparse

We can now interrogate the collection. Many shapely operations result in a different kind of geometry than the one you’re currently working with. Since our geometry is a collection of points, I can instantiate a MultiPoint, and then ask that MultiPoint for its envelope, which is a Polygon. Easily done like so:

point_collection = geometry.MultiPoint(list(points))
point_collection.envelope

We should take a look at that envelope. matplotlib can help us out, but polygons aren’t functions, so we need to use PolygonPatch.

from descartes import PolygonPatch

def plot_polygon(polygon):
    fig = pl.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    margin = .3

    x_min, y_min, x_max, y_max = polygon.bounds

    ax.set_xlim([x_min-margin, x_max+margin])
    ax.set_ylim([y_min-margin, y_max+margin])
    patch = PolygonPatch(polygon, fc='#999999',
                         ec='#000000', fill=True,
                         zorder=-1)
    ax.add_patch(patch)
    return fig

_ = plot_polygon(point_collection.envelope)
_ = pl.plot(x,y,'o', color='#f16824')

H_envelope

So without a whole lot of code, we were able to get the envelope of the points, which is the smallest rectangle that contains all of the points. In the real world, boundaries are rarely so uniform and straight, so we were naturally led to experiment with the convex hull of the points. Convex hulls are polygons drawn around points too – as if you took a pencil and connected the dots on the outer-most points. Shapely has convex hull as a built in function so let’s try that out on our points.

convex_hull_polygon = point_collection.convex_hull
_ = plot_polygon(convex_hull_polygon)
_ = pl.plot(x,y,'o', color='#f16824')

H_convex

A tighter boundary, but it ignores those places in the “H” where the points dip inward. For many applications, this is probably good enough but we wanted to explore one more option which is known as a concave hull or alpha shape. At this point we’ve left the built-in functions of Shapely and we’ll have to write some more code. Thankfully, smart people like Sean Gillies, the author of Shapely and fiona, have done the heavy lifting. His post on the fading shape of alpha gave me a great place to start. I had to fill in some gaps that Sean left so I’ll recreate the entire working function here.


from shapely.ops import cascaded_union, polygonize
from scipy.spatial import Delaunay
import numpy as np
import math

def alpha_shape(points, alpha):
    """
    Compute the alpha shape (concave hull) of a set
    of points.

    @param points: Iterable container of points.
    @param alpha: alpha value to influence the
        gooeyness of the border. Smaller numbers
        don't fall inward as much as larger numbers.
        Too large, and you lose everything!
    """
    if len(points) < 4:
        # When you have a triangle, there is no sense
        # in computing an alpha shape.
        return geometry.MultiPoint(list(points)).convex_hull

    def add_edge(edges, edge_points, coords, i, j):
        """
        Add a line between the i-th and j-th points,
        if not in the list already
        """
        if (i, j) in edges or (j, i) in edges:
            # already added
            return
        edges.add( (i, j) )
        edge_points.append(coords[ [i, j] ])

    coords = np.array([point.coords[0]
                       for point in points])

    tri = Delaunay(coords)
    edges = set()
    edge_points = []
    # loop over triangles:
    # ia, ib, ic = indices of corner points of the
    # triangle
    for ia, ib, ic in tri.vertices:
        pa = coords[ia]
        pb = coords[ib]
        pc = coords[ic]

        # Lengths of sides of triangle
        a = math.sqrt((pa[0]-pb[0])**2 + (pa[1]-pb[1])**2)
        b = math.sqrt((pb[0]-pc[0])**2 + (pb[1]-pc[1])**2)
        c = math.sqrt((pc[0]-pa[0])**2 + (pc[1]-pa[1])**2)

        # Semiperimeter of triangle
        s = (a + b + c)/2.0

        # Area of triangle by Heron's formula
        area = math.sqrt(s*(s-a)*(s-b)*(s-c))
        circum_r = a*b*c/(4.0*area)

        # Here's the radius filter.
        #print circum_r
        if circum_r < 1.0/alpha:
            add_edge(edges, edge_points, coords, ia, ib)
            add_edge(edges, edge_points, coords, ib, ic)
            add_edge(edges, edge_points, coords, ic, ia)

    m = geometry.MultiLineString(edge_points)
    triangles = list(polygonize(m))
    return cascaded_union(triangles), edge_points

concave_hull, edge_points = alpha_shape(points,
                                        alpha=1.87)

_ = plot_polygon(concave_hull)
_ = pl.plot(x,y,'o', color='#f16824')

That’s a mouthful, but the gist is that we compute a Delaunay triangulation, which establishes connections between each point and its nearby points, and then we discard the triangles that are too spread out. The removal part is key: a triangle only contributes edges to the boundary when its circumscribed circle has a radius smaller than 1/alpha, so with alpha=1.87, triangles with a circumradius above roughly 0.53 map units are dropped. By identifying candidates for removal we are saying that these points are too far from their connected points, so don’t use that connection as part of the boundary. The result looks like this.

H_concave_womp_womp

Better, but not great.

It turns out that the alpha value and the scale of the points matters a lot when it comes to how well the Delaunay triangulation method will work. You can usually play with the alpha value to find a suitable response, but unless you can scale up your points it might not help. For the sake of a good example, I’ll do both: scale up the “H” and try some different alpha values.

To get more points, I opened up QGIS, drew an “H” like polygon, used the tool to generate regular points, and then spatially joined them to remove any points outside the “H”. My new dataset looks like this:


input_shapefile = 'demo_poly_scaled_points_join.shp'
new_shapefile = fiona.open(input_shapefile)
new_points = [geometry.shape(point['geometry'])
              for point in new_shapefile]
x = [p.coords.xy[0] for p in new_points]
y = [p.coords.xy[1] for p in new_points]
pl.figure(figsize=(10,10))
_ = pl.plot(x,y,'o', color='#f16824')

H_dense

When we try the alpha shape transformation on these points we get a much more satisfying boundary. We can try a few permutations to find the best alpha value for these points with the following code. I combined each plot into an animated gif below.

from matplotlib.collections import LineCollection

for i in range(9):
    alpha = (i+1)*.1
    concave_hull, edge_points = alpha_shape(new_points,
                                            alpha=alpha)

    #print concave_hull
    lines = LineCollection(edge_points)
    pl.figure(figsize=(10,10))
    pl.title('Alpha={0} Delaunay triangulation'.format(
        alpha))
    pl.gca().add_collection(lines)
    delaunay_points = np.array([point.coords[0]
                                for point in new_points])
    pl.plot(delaunay_points[:,0], delaunay_points[:,1],
            'o', hold=1, color='#f16824')

    _ = plot_polygon(concave_hull)
    _ = pl.plot(x,y,'o', color='#f16824')

H_concave_animated_optimized

So in this case, alpha of about 0.4 looks pretty good. We can use shapely’s buffer operation to clean up that polygon a bit and smooth out any of the jagged edges.

alpha = .4
concave_hull, edge_points = alpha_shape(new_points,
                                        alpha=alpha)

plot_polygon(concave_hull)
_ = pl.plot(x,y,'o', color='#f16824')
plot_polygon(concave_hull.buffer(1))
_ = pl.plot(x,y,'o', color='#f16824')

H_0.4_concave H_0.4_concave_buffer

And there you have it. Hopefully this has been a useful tour through some of your geometric boundary options in python. I recommend exploring the Shapely manual to find out about all of the other easy geometric operations you have at your fingertips. Also, if you dig Python and playing with maps, we want to hear from you.

Information into Insight