
Elasticsearch Data Engineering

Last month, our team hosted a hackathon for about a dozen Data Scientists who work for one of our customers. The Scientists are a mixed group of econometricians and general statisticians and we wanted to see what they could do with some time series data – in particular news articles and social media (mostly tweets).

A few days before the start of the hackathon, one of our customer contacts asked if I would load a few billion tweets into a temporary Elasticsearch cluster so that the hackathon participants could use it when they arrived.

I quickly violated some well-learned muscle memory:

  1. I said: “yes, I can do that.”
  2. I chose to install the most recent version of Elasticsearch (2.0 at the time.)

Installation of 2.0 is as you would expect if you have installed ES before. The most notable differences are the way you install Marvel and the new Marvel user interface. I'll use the Marvel screenshot below to tell this story.

Indexing

To ingest, I wrote a Python script that iterated through a massive repository of zipped GNIP files in Amazon's Simple Storage Service (S3). The first pain point with an ingest like this is defining the Elasticsearch mapping. I thought I had it just right, let the ingest script rip, then checked back in on it in the morning. It turns out I missed quite a few fields (GNIP data has shifted formats over time), so I had to reingest about 40 million tweets. You can see my frustration in the first peak on the indexing rate graph below (also mirrored in the Index Size and Lucene Memory graphs).
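For the curious, the heart of that ingest script looked roughly like the sketch below. This is a from-memory reconstruction rather than the production code – the bucket name, index name, and host are placeholders, and the real script also handled the mapping, retries, and logging.

import gzip
import io
import json

import boto3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

BUCKET = 'gnip-archive'  # placeholder bucket name
s3 = boto3.client('s3')
es = Elasticsearch(['http://es-node-1:9200'])  # placeholder host

def actions(bucket, key):
    """Yield one bulk-index action per tweet in a gzipped GNIP file."""
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    with gzip.GzipFile(fileobj=io.BytesIO(body)) as gz:
        for line in gz:
            line = line.strip()
            if line:
                yield {'_index': 'tweets', '_type': 'tweet', '_source': json.loads(line.decode('utf-8'))}

# Walk every zipped file in the repository and bulk index its tweets
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get('Contents', []):
        bulk(es, actions(BUCKET, obj['Key']), chunk_size=1000)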

[Image: 20151120-marvel-hackathon.png – Marvel dashboard during the hackathon ingest]

After ingesting all weekend, it was clear that I was never going to make it to a billion tweets by the start of the hackathon. I reached out to some of the HumanGeo gurus and got some advice on how to tweak the cluster to improve ingest speed. The one piece of advice visible in the graph is in the blue circles: at that time I set index.number_of_replicas=0. You can tell that the size on disk became dramatically smaller (expected), and there is a small inflection point in the rate of ingest which you can only see in the document count line graph. Very disconcertingly, the indexing rate (blue rectangle) decreased! But the document count over time clearly increased. I think this is because Marvel double(ish)-counts your indexing rate while replicas are enabled – it sees the same documents being indexed into the primary and the replica shards – so once the replicas were gone, the reported rate dropped even though real throughput went up.
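If you want to make the same change, it's a one-liner against the index settings API. A minimal sketch with elasticsearch-py (the index name and host are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es-node-1:9200'])  # placeholder host
# Drop replicas so each document is only written once during the bulk ingest
es.indices.put_settings(index='tweets', body={'index': {'number_of_replicas': 0}})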

Trouble

I was now resigned to having a database of ~800 million tweets, which was still great but short of my 1bn personal goal. Additional fun occurred at the red circle. One of the nodes ran out of disk space, and in doing so it corrupted its transaction log. This ground indexing to a halt – bulk index requests weren't completing cleanly any more because Elasticsearch was missing shards. The cluster was missing shards because this failed node had the only copy (remember when I turned off replicas?!) and that copy was now offline.

The transaction log is one of those data store concepts that is supposed to save you from situations like this. I was running the cluster on AWS EC2, so the first thing I did was to stop Elasticsearch on that node, provision and move the index to a larger EBS volume, and start it back up. Elasticsearch tried to recover the shard by reading the transaction log, then discovered that it was corrupted, then gave up, then repeated the process.

One of the tools in your arsenal at this point is to say: forget about those transactions. So I removed the transaction log and restarted Elasticsearch. No dice – because of this bug in Elasticsearch 2.0 – Elasticsearch can neither recover from a corrupted transaction log nor reinitialize a missing transaction log.

I was now 250 million tweets shy of even that 800 million goal. But those tweets were still indexed! I was just being held up by a few bad eggs in that transaction log!

It was several paragraphs ago that the harder-core readers started punching their monitors in frustration because I hadn't considered trying to hack around the transaction log. The transaction log is a binary format specific to Elasticsearch, and Elasticsearch creates a new one whenever you create a new index. What if I created an empty transaction log? Could I get my shard back online?

To get started, I created a new index in Elasticsearch which dropped a pristine transaction log on disk. I copied that transaction log into the right place in my broken index and restarted Elasticsearch. Elasticsearch complained that the UUID in the copied transaction log didn’t match the UUID that the index was expecting. A UUID is just a unique identifier – the transaction log in a brand new index is otherwise identical to the transaction log in any other brand new index. In the log, Elasticsearch said it saw a UUID of 0xDEADBEEF when it expected 0x00C0FFEE. I opened the transaction log in hexedit and could see 0xDEADBEEF.  I copied the correct UUID over the incorrect UUID, saved it, restarted Elasticsearch, the shard came online, and then that gap in the red circle was filled in!
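I did that edit by hand in hexedit, but the same byte surgery could be scripted. Here's a rough sketch of the idea – the path and UUID strings are placeholders rather than the real values, and it assumes the two UUIDs have the same length so no offsets in the file shift:

TRANSLOG = '/var/data/elasticsearch/.../translog/translog-1.tlog'  # placeholder path
donor_uuid = b'uuid-from-the-pristine-translog_'     # what the copied translog contains
expected_uuid = b'uuid-the-broken-shard-expects___'  # what the Elasticsearch log asked for

with open(TRANSLOG, 'rb') as f:
    data = f.read()

# Same-length swap so the rest of the file's offsets stay untouched
assert len(donor_uuid) == len(expected_uuid)
assert data.count(donor_uuid) == 1

with open(TRANSLOG, 'wb') as f:
    f.write(data.replace(donor_uuid, expected_uuid))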

With a repaired transaction log and all shards online I was able to turn on the replicas to get back to a cluster with some durability and better search performance. Elasticsearch took a few more hours to build the replicas and balance the cluster.

The hackathon went live at the green circle. Immediately our sharp Scientists started issuing queries that completely blew up the fielddata in Elasticsearch. Funny thing about fielddata – it's the special part of memory that Elasticsearch uses to keep aggregations fast. Loading data into fielddata is slow, so Elasticsearch tries to do it only once, and by default the cache is unbounded: fielddata will grow until you're out of memory and never evict anything. Which basically means that once you've done an aggregation that pushed data into fielddata, it stays there until that data is deleted or you restart Elasticsearch. So if you don't actually have enough RAM to hold all of the possible fielddata, it will by definition be fully used after a hard query, and newer (possibly more relevant) data can never fit in.

I think in many cases that default makes sense, but I didn't have the luxury of adding resources. So I set indices.fielddata.cache.size=55%. The 55% is deliberate: the fielddata circuit breaker trips at 60% of the heap by default, so the cache cap needs to sit just below it. When you do this, you're accepting that ES will evict the least recently used fielddata when it is under memory pressure. I suspect that most of the queries our users ran in the beginning were basically bad ideas, so I didn't want to punish future generations of queries with the mistakes of the past. (That sounds weirdly political.) Anyway, you can see the huge spike drop once I put that policy in place.
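For reference, that is a node-level setting, so it goes into elasticsearch.yml on each data node (and needs a restart to take effect):

# elasticsearch.yml on every data node
indices.fielddata.cache.size: 55%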

Phew!

Hopefully the above helps you if you end up in similar situations. Once the hackathon was underway it was a great success. I even overheard one of the participants say how great it was that there was already data available in the database – that’s not normally how these things go, he said.

Some key takeaways from my recent Elasticsearch experience:

  • For ingest performance, turn off replicas
    • Beware! You are flying without an autopilot here. If something goes wrong, it will go very wrong.
  • For greenfield data exploration clusters, set indices.fielddata.cache.size=55% (or get moar RAM.)
  • Learn the internals of your database. This was one such exercise.
    • We see this one all the time with our clients. If you don’t know the failure modes, the query considerations, and the ingest concerns of your database, you’ll be buried under the weight of an early, non-data-driven decision.

Sounds Fun?

We’re hiring. Do Data Science using Elasticsearch and Social Media with us.

Engineering Coda

If you’re the kind of person who likes to be able to make computers do things while also writing the code that makes computers do things, I can’t say enough about two tools that help me do this stuff. One is a clusterssh clone for iTerm2 called i2cssh and the other is called htop. i2cssh lets you login to a bunch of computers at the same time and have your typing go to all of them simultaneously (or individually.) Check out the matrix like screen shots below where I’m running htop on 4 nodes and then 7 nodes (I increased the size of the cluster at one point.)
[Image: 20151120-htop-hackathon1.png – htop on 4 nodes]
[Image: 20151120-htop-hackathon2.png – htop on 7 nodes]

Thanks to Michele Bueno, Scott Fairgrieve, and Aaron Gussman for reviewing drafts of this post.

String tokenization

Sometimes you’ve just got to tokenize a string.

Let’s say you find you need to parse a string based on the tokens it comprises. It’s not an uncommon need, but there’s more than one way to go about it. What does “token” mean to you in this situation? How can you break a string down into a list of them?

It can be daunting answering such an open-ended question. But not to fear! Here I’ll be explaining your best options and how you’ll know when to use them. I’m using Python for these examples, but you should feel free to implement the same algorithms in your language of choice.

Tokenization by delimiters via split

By far the most straightforward approach is one that you’ve surely used before, and that’s quickly done in essentially any language. It’s to split a string on each occurrence of some delimiter.

Say you were reading a csv file and for each row your tokens were to be the value associated with each column. It’s quick, easy, and almost always sufficient to split each line on the commas. Here’s a simple example utilizing a string’s split method:

>>> string = 'Hello world!'
>>> tokens = string.split(' ')
>>> tokens
['Hello', 'world!']

But the usefulness of splitting doesn’t end there. You could also introduce some regex and define multiple characters as being delimiters.

>>> import re
>>> string = "Don't you forget about me."
>>> tokens = re.split( r'[e ]', string )
>>> tokens
["Don't", 'you', 'forg', 't', 'about', 'm', '.']

Unfortunately, delimiters can only get you so far when doing tokenization. What if you needed to break crazier things like “abc123” into separate tokens, one for the letters and one for the numbers? Or maybe you wanted your delimiters to be represented as separate tokens, not to be tossed entirely? You wouldn’t be able to do it using a generic split method.

Tokenization by categorization of chars

One option, when strictly splitting on delimiters won't do the job, is assigning tokens by character categorization. It's still fast, and its flexibility is likely to be enough for most cases.

Here’s an example where the characters in a string are iterated through and tokens are assigned by the categories each character falls into. When the category changes, the token made up of accumulated characters so far is added to the list and a new token is begun.

>>> def tokenize( string, categories ):
...     token = ''
...     tokens = []
...     category = None
...     for char in string:
...         if token:
...             if category and char in category:
...                 token += char
...             else:
...                 tokens.append( token )
...                 token = char
...                 category = None
...                 for cat in categories:
...                     if char in cat:
...                         category = cat
...                         break
...         else:
...             category = None
...             for cat in categories:
...                 if char in cat:
...                     category = cat
...                     break
...             token += char
...     if token:
...         tokens.append( token )
...     return tokens
...
>>> string = 'abc, e4sy as 123!'
>>> tokens = tokenize( string, [ '0123456789', ' ', ',.;:!?', 'abcdefghijklmnopqrstuvwxyz' ] )
>>> tokens
['abc', ',', ' ', 'e', '4', 'sy', ' ', 'as', ' ', '123', '!']

There are a lot of improvements you could make to this tokenization function! For example, you could record which category each token belonged to. You could make the categories more specific, or implement some flexibility in how they're handled. You could even use regular expressions to verify whether the current token plus the next character still fits some category. These are left as exercises for the reader (though see the sketch below for a starting point on the first one).
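If you want a starting point for that first improvement, here's one possible variant (a sketch, not the one true answer) that returns (token, category) pairs instead of bare tokens. Characters that fall in no category still become one-character tokens, as before.

>>> def tokenize_with_categories( string, categories ):
...     # Same idea as above, but each token is returned with its category
...     tokens = []
...     token, category = '', None
...     for char in string:
...         cat = next( ( c for c in categories if char in c ), None )
...         if token and category is not None and cat is category:
...             token += char
...         else:
...             if token:
...                 tokens.append( ( token, category ) )
...             token, category = char, cat
...     if token:
...         tokens.append( ( token, category ) )
...     return tokens
...
>>> tokenize_with_categories( 'abc123', [ 'abcdefghijklmnopqrstuvwxyz', '0123456789' ] )
[('abc', 'abcdefghijklmnopqrstuvwxyz'), ('123', '0123456789')]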

Tokenization using regular expressions

But there’s a much better way to tokenize when you need to do more than the basics. Here you can find a lexical scanner package which uses regular expressions to define what tokens should look like: https://github.com/pineapplemachine/lexscan. (“Lexical scanner” is a fancy word for a tokenizer that would typically be the first step in a lexical analysis, which is how interpreters and compilers make sense of code. But their applications are in no way limited to just those.) Installation would entail placing the repo’s lexscan directory somewhere that your code can access it.

It’s very handy! Here’s a simple example:

>>> import lexscan
>>> from pprint import pprint
>>> spaceexp = lexscan.ScanExp( r'\s+', significant=False )
>>> wordexp = lexscan.ScanExp( r'[a-z]+' )
>>> numexp = lexscan.ScanExp( r'\d+' )
>>> string = '2001: A Space Odyssey'
>>> tokens = lexscan.tokenize( string, ( spaceexp, wordexp, numexp ) )
>>> pprint( tokens )
[1:0: '2001' \d+,
 1:4: ':' None,
 1:6: 'A' [a-z]+,
 1:8: 'Space' [a-z]+,
 1:14: 'Odyssey' [a-z]+]

Tokens are assigned based on the longest match among the expressions provided, starting at the end of the previous token. The concept is simple, but tokenization doesn't get much better. The functionality really shines when you're using more complex regular expressions. What if you needed to differentiate between integers and floating point numbers? What if you wanted to capture a string literal enclosed in quotes as a single large token? You can do that with a proper scanner, and all it takes is the wherewithal to put together a set of regular expressions which do what you need. Here's a more interesting example, closer to what you'd actually want to use an implementation like this for.

>>> import lexscan
>>> from pprint import pprint
>>> spaceexp = lexscan.ScanExp( r'\s+', significant=False, name='whitespace' )
>>> identifierexp = lexscan.ScanExp( r'[a-z_]\w+', name='identifier' )
>>> delimiterexp = lexscan.ScanExp( r'[,;]', name='delimiter' )
>>> intexp = lexscan.ScanExp( r'\d+', name='integer' )
>>> floatexp = lexscan.ScanExp( r'[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?', name='float' )
>>> stringexp = lexscan.ScanExp( r'"((?:\\.|[^\\"])*)"', name='string' )
>>> assignexp = lexscan.ScanExp( r'[=:]', name='assignment' )
>>> string = 'strings: "foobar", "don\'t stop me n0w"; numbers: 450, 13.37'
>>> tokens = lexscan.tokenize( string, ( spaceexp, identifierexp, delimiterexp, intexp, floatexp, stringexp, assignexp ) )
>>> pprint( tokens )
[1:0: 'strings' identifier,
 1:7: ':' assignment,
 1:9: '"foobar"' string,
 1:17: ',' delimiter,
 1:19: '"don't stop me n0w"' string,
 1:38: ';' delimiter,
 1:40: 'numbers' identifier,
 1:47: ':' assignment,
 1:49: '450' integer,
 1:52: ',' delimiter,
 1:54: '13.37' float]

Of course, splitting a string will always be faster than using regular expressions for tokenizing. Be judicious in deciding which approach to use!

Dynamically Cache Static Files using Django and Nginx

As a software developer who mainly works with web applications, there's nothing I find more frustrating than working on a new feature or fixing a bug and not seeing my changes reflected in the web browser because it is using an out-of-date cached version of the JS or CSS files I updated.  And even if a hard refresh fixes it for me, it's not an option to explain to puzzled customers that doing a full-page refresh or emptying the browser cache will fix the problem.  At HumanGeo, we've developed a technique to get around this problem by using Nginx to dynamically “smart cache” some of our static files, using its built-in URL regular expressions to control what files are served and cached by browsers.  We were already using Nginx to serve our static content, so with just a couple of updates to the Nginx configuration and the Django code, we're guaranteed to always get the most up-to-date version of specific static files while ensuring that the browser will properly cache them.

Here’s an example from one of our hosted CSS files:

After doing a basic GET request (Note: The requested file was taken from the cache):

[Image: css-refresh]

After doing a full-page refresh (Note: The requested file got the appropriate status code 304 from the server):

[Image: css-get]

The technique is actually pretty straightforward:  using the URL regular expression format that Nginx provides, we look for a pattern within the given URL and find the corresponding file on the file system.  So essentially, the browser will request /static/js/application.<pattern>.js and Nginx will actually serve /static/js/application.js.  Since the browser is requesting a filename with a particular pattern, it will still cache files with that pattern, but we have much more control over the user experience and what files they see when they hit the page, without requiring them to clear their cache.  To further automate the process, the pattern we use is the MD5 hash of the specified file.  Using the MD5 and a little help from a Django context processor, we can ensure that the pattern is always up-to-date and that we’re always using the most recent version of the file.  However, the pattern could be any unique string you’d like – such as a file version number, application version number, the date the file was last modified, etc.

Details:

Nginx configuration – how to use the URL regex:

# alias for /static/js/application.js
location ~ ^/static/js/application\.(.*)\.js$ {
    alias /srv/www/project/static_files/js/application.js;
    expires 30d;
}

# serve remaining static files
location /static/ {
    alias /srv/www/project/static_files/;
    expires 30d;
}

Getting the MD5 for specific files: Django settings.py

A function to generate the MD5 hash:
import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f:
        for chunk in iter(lambda: f.read(128*md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Generate and save the MD5 hash for application.js:

try:
    JS_FILE_PATH = os.path.join(STATIC_ROOT, 'js', 'application.js')
    JS_MD5 = md5sum(JS_FILE_PATH)
except Exception:
    # the file may not exist yet (e.g. before collectstatic has run)
    JS_MD5 = ""

Ensure that a custom template context processor has been added to TEMPLATE_CONTEXT_PROCESSORS:

TEMPLATE_CONTEXT_PROCESSORS = (
    ....
    "project.template_context_processors.settings_context_processor",
)

Make the JS_MD5 variable visible to the templates in template_context_processors.py:

from django.conf import settings
....
def settings_context_processor(request):
    return {
        'JS_MD5': settings.JS_MD5,
    }

Reference the JS_MD5 variable in html templates:

base.html

<script type="text/javascript" src="{{ STATIC_URL }}js/application.{{ JS_MD5 }}.js"></script>

When the HTML templates are rendered, the script tag comes out with the template variables substituted, e.g. (assuming STATIC_URL is "/static/"):

<script type="text/javascript" src="/static/js/application.<MD5 hash>.js"></script>

Summary

With a couple of updates to Django and a few lines of Nginx configuration, it's easy to create dynamic smart caching of static resources that works across all browsers.  This also ensures that project stakeholders and all of the application's users get the correct version of each static resource, and that their browsers cache each resource correctly.

Leaflet DVF Overview

In my last blog entry, I introduced HumanGeo’s Leaflet Data Visualization Framework (DVF) and provided insight into the motivations driving the development of the framework.  Now let’s take a closer look at some of the new features that the framework provides for simplifying thematic mapping using Leaflet.

New Marker Types

In terms of visualizing point data, Leaflet offers image-based (L.Marker), HTML-based (L.Marker with L.DivIcon), and circle markers (L.CircleMarker).  While these can be useful tools for symbolizing data values (particularly the CircleMarker), it’s always nice to have some variety.  The framework adds several new marker types that are geared towards illustrating dynamic data values.

  • L.RegularPolygonMarker
  • L.PieChartMarker, L.BarChartMarker, L.CoxcombChartMarker, and L.RadialBarChartMarker
  • L.RadialMeterMarker
  • L.StackedRegularPolygonMarker

Mapping Data Properties to Leaflet Styles

The framework includes classes for dynamically mapping data values to Leaflet style values (e.g. radius, fillColor, etc.).  These classes are similar to D3’s scale concept.  Mapping values from one scale/format to another is a common aspect of creating thematic maps, so this is a critical feature despite its relative simplicity.  The main classes to consider are:

  • L.LinearFunction:  This class maps a data value from one scale to another.  One example use might be to map a numeric data property (e.g. temperature, counts, earthquake magnitude, etc.) to a radius in pixels.  You create a new LinearFunction by passing in two data points composed of x and y values (L.Point instances or any object with x and y properties), where these points represent the bounds of possible values – x values represent the range of input values and y values represent the range of output values.  If you remember back to your algebra days, you'll recall the concept of linear equations, where given two points in Cartesian space, you can calculate the slope (m) and y-intercept (b) values from those points in order to determine the equation for the line that passes through those two points (y = mx + b).  This is really all that the LinearFunction class is doing behind the scenes.  Call the evaluate method of a LinearFunction to get an output value from a provided input value; this method interpolates a y value based on the provided x value using the pre-determined linear equation.  The LinearFunction class also includes options for pre-processing the provided x value (preProcess) and post-processing the returned y value (postProcess) whenever evaluate is called.  This can be useful for translating a non-numeric value into a numeric value or translating a numeric output into some non-numeric value (e.g. a boolean value, category string, etc.).  It can also be used to chain linear functions together.
  • L.PiecewiseFunction:  It's not always possible to produce the desired mapping from data property to style property using just a single LinearFunction.  The PiecewiseFunction class allows you to produce more complicated mappings and is literally based on the Piecewise Function concept.  For instance, if you wanted to keep the radius of a marker constant at 5 pixels until your data property reaches a value of 100, and then increase the radius from 5 pixels to 20 pixels between the values of 100 and 200, you could do so by using a PiecewiseFunction composed of two LinearFunctions, as illustrated in the example below.
    var radiusFunction = new L.PiecewiseFunction([new L.LinearFunction(new L.Point(0, 5), new L.Point(100, 5)), new L.LinearFunction(new L.Point(100, 5), new L.Point(200, 20))]);
    
  • Color Functions:  Color is an important aspect of data visualization, so the framework provides classes derived from LinearFunction that make it easy to translate data properties into colors.  The framework relies heavily on Hue, Saturation, Luminosity/Lightness (HSL) color space over the more familiar, ubiquitous Red, Green, Blue (RGB) color space.  HSL color space offers some advantages over RGB for data visualizations, particularly with respect to numeric data.  Hue, the main component used to determine a color in HSL space, is an angle on the color wheel that varies from 0 degrees to 360 degrees according to the visible spectrum/colors of the rainbow (red, orange, yellow, green, blue, indigo, violet, back to red).  This makes it easy to map a numeric input value to an output hue using the same LinearFunction concept described previously and gives us nice color scales – green to red, yellow to red, blue to red, etc – that work well for illustrating differences between low and high values.  Achieving the same effect with RGB color requires varying up to three variables at once, leading to more code and complexity.
    • L.HSLHueFunction:  This class produces a color value along a rainbow color scale that varies from one hue to another, while keeping saturation and luminosity constant.
    • L.HSLLuminosityFunction:  This class varies the lightness/darkness of a color value dynamically according to the value of some data property, while keeping hue and saturation constant.
    • L.HSLSaturationFunction:  This class varies the saturation of a color value dynamically according to the value of some data property, while keeping hue and luminosity constant.

Data Layers

As I mentioned in my previous post, one point of the framework is to standardize and simplify the way in which thematic mapping data are loaded and displayed; keeping this in mind, the framework provides classes for loading and displaying data in any JSON format.  The framework introduces the concept of a DataLayer, which serves as a standard foundation for loading/visualizing data from any JavaScript object that has a geospatial component.

  • L.DataLayer:  Visualizes data as dynamically styled points/proportional symbols using regular polygon or circle markers
  • L.ChoroplethDataLayer:  This class allows you to build a choropleth map from US state or country codes in the data.  The framework provides built-in support for creating US state and country choropleth maps without needing server-side components.  Simply import the JavaScript file for state boundaries or the JavaScript file for country boundaries if you’re interested in building a state or country level choropleth.  In addition, states and countries can be referenced using a variety of codes.
  • ChartDataLayers – L.PieChartDataLayer, L.BarChartDataLayer, L.CoxcombChartDataLayer, L.RadialBarChartDataLayer, L.StackedRegularPolygonDataLayer:  These classes visualize multiple data properties at each location using pie charts, bar charts, etc.

Support for custom/non-standard location formats (e.g. addresses)

Data doesn’t always come with nicely formatted latitude and longitude locations.  Often there is work involved in translating those location values into a format that’s useable by Leaflet.  DataLayer classes allow you to pass a function called getLocation as an option.  This function takes a location identified in a record and allows you to provide custom code that turns that location into a format that’s suitable for mapping.  Part of this conversion could involve using an external web service (e.g. geocoding an address).

Support for automatically generating a legend that describes your visualization

Legends are common thematic mapping tools that help users better understand and interpret what a given map is showing.  Simply call getLegend on any DataLayer instance to get a chunk of HTML that can be added to your application, or add the L.Control.Legend control to your Leaflet map.  This control will automatically display the legend for any DataLayer instance that has been added to the map.

A Quick Example

Here’s a quick example choropleth map of electoral votes by state with states colored from green to red based on the number of electoral votes:

[Image: Electoral Votes Choropleth – Electoral Votes by State Colored from Green to Red]
//Setup mapping between number of electoral votes and color/fillColor.   In this case, we're going to vary color from green (hue of 120) to red (hue of 0) with a darker border (lightness of 25%) and lighter fill (lightness of 50%)
var colorFunction = new L.HSLHueFunction(new L.Point(1, 120), new L.Point(55, 0), {outputSaturation: '100%', outputLuminosity: '25%'});
var fillColorFunction = new L.HSLHueFunction(new L.Point(1, 120), new L.Point(55, 0), {outputSaturation: '100%', outputLuminosity: '50%'});

var electionData = {…};
var options = {
	recordsField: 'locals',
	locationMode: L.LocationModes.STATE,
	codeField: 'abbr',
	displayOptions: {
		electoral: {
			displayName: 'Electoral Votes',
			color: colorFunction,
			fillColor: fillColorFunction
		}
	},
	layerOptions: {
		fillOpacity: 0.5,
		opacity: 1,
		weight: 1
	},
	tooltipOptions: {
		iconSize: new L.Point(80,55),
		iconAnchor: new L.Point(-5,55)
	}
};

// Create a new choropleth layer from the available data using the specified options
var electoralVotesLayer = new L.ChoroplethDataLayer(electionData, options);

// Create and add a legend
$('#legend').append(electoralVotesLayer.getLegend({
	numSegments: 20,
	width: 80,
	className: 'well'
}));

map.addLayer(electoralVotesLayer);

I want to highlight a few details in the code above. One is that there’s not a lot of code. Most of the code is related to setting up options for the DataLayer. Compare this to the Leaflet Choropleth tutorial example, and you’ll see that there’s less code in the example above (34 lines vs. about 89 lines in the Leaflet tutorial). It’s not a huge reduction in lines of code given that the framework handles some of the functions that the Leaflet tutorial provides (e.g. mouseover interactivity), but the Leaflet tutorial is using GeoJSON, which as I mentioned earlier is well handled by Leaflet, and the example above is not.  I’ve omitted the data for this example, but it comes from Google’s election 2008 data and looks like this:

{
    ...,
    "locals":{
        ...,
        "Mississippi": {
            "name": "Mississippi",
            "electoral": 6,
            ...,
            "abbr": "MS"
        },
        "Oklahoma": {
            "name": "Oklahoma",
            "electoral": 7,
            ...,
            "abbr": "OK"
        },
        ...
    },
        ...
}

When configuring the L.ChoroplethDataLayer, I tell the DataLayer where to look for records in the data (the locals field), what property of each record identifies each boundary polygon (the abbr field), and what property/properties to use for styling (the electoral field).  In this case, the L.ChoroplethDataLayer expects codeField to point to a field in the data that identifies a political/admin boundary by a state code.  In general, DataLayer classes can support any JSON-based data structure; you simply have to point them (using JavaScript-style dot notation) to where the records to be mapped reside (recordsField), the property of each record that identifies the location (codeField, latitudeField/longitudeField, etc. – depending on the specific locationMode value), and the set of one or more properties to use for dynamic styling (displayOptions).  Another feature illustrated in the example above is that there's no picking of a set of colors to use for representing the various possible ranges of numeric data values.  In the example above, color varies continuously with respect to a given data value, based on the range that I've specified using an L.HSLHueFunction, which as I mentioned earlier varies the hue of a color along a rainbow color scale.  The last feature I want to highlight is that the framework makes it as easy as one function call to generate a legend that describes your DataLayer to users.  There's no need to write custom HTML in order to generate a legend.

That’s it for now.  Hopefully this overview has given you a better sense of the key features that the framework provides.  Detailed documentation is still in the works, but check out the examples on GitHub.  In my next post, I’ll walk through the Earthquakes example, which is basically just a recreation of the USGS Real-Time Earthquakes map that I alluded to in my previous post.

Introducing HumanGeo’s Leaflet Data Visualization Framework

[Image: Leaflet logo]

At HumanGeo, we're fans of Leaflet, CloudMade's JavaScript web mapping framework.  In the past, we've used other JavaScript mapping frameworks like OpenLayers and Google Maps, and while these frameworks are great, we like Leaflet for its smooth animation, simple API, and good documentation and examples.  In fact, we like Leaflet so much that we've been using it in all of our web projects that require a mapping component.  Since Leaflet is a relative newcomer in the world of JavaScript mapping frameworks (2011), and since the developers have focused on keeping the library lightweight, there are plenty of opportunities to extend the basic functionality that Leaflet offers.

As a side project at HumanGeo, we’ve been working on extending Leaflet’s capabilities to better support visualizing data, and these efforts have produced HumanGeo’s Leaflet Data Visualization Framework.  Before I delve into some of the features of the framework, I’d like to provide a little background on why we created the framework in the first place, in particular, I’d like to focus on the challenges that developers face when creating data-driven geospatial visualizations.

When visualizing data on a 2D map, we often wish to illustrate differences in data values geographically by varying point, line, and polygon styles (color, fill color, line weight, opacity, etc.) dynamically based on the values of those data.  The goal is for users to look at our map and quickly understand geographic differences in the data being visualized.  This technique is commonly referred to as thematic mapping, and it is frequently used in infographics and for illustrating concepts related to human geography.  Within the realm of thematic mapping, proportional symbols and choropleth maps are two widely used approaches for illustrating variations in data.

[Image: Symbol Styling Options – a subset of Leaflet layer style options]

The proportional symbol approach highlights variations in data values at point locations using symbols sized proportionally to a given data value.  In addition to varying symbol size, we can also vary symbol color or other style properties in order to highlight multiple data properties at each point.  The image on the right shows some of the available style properties that we can vary for circle markers in Leaflet.

A good example of this approach is the USGS Earthquake Map, which by default shows earthquakes of magnitude 2.5 or greater occurring in the past week.  This map denotes earthquakes using circles that are sized by magnitude and colored by age (white – older to yellow – past week to red – past hour).  In an upcoming blog post, I’ll describe how we can use the Leaflet Data Visualization Framework along with USGS’ real-time earthquakes web service to easily reproduce this map.

[Image: USGS Real-Time Earthquakes Map]

Choropleth mapping involves styling polygons – typically country, state, or other administrative/political boundaries – based on statistical data associated with each polygon.  In US election years, we see tons of maps and infographics showing breakdowns of voter preferences and predicted winners/losers by state, county, or other US political boundary.  Infographics are becoming more and more popular, and there’s no shortage of news outlets like the New York Times, The Washington Post, and others producing maps illustrating various statistics at a range of administrative boundary levels.  On the web, these choropleth maps are often custom developed, static or potentially interactive, single purpose maps that typically make use of a variety of frameworks, including more advanced all-purpose visualization frameworks like D3 rather than web mapping frameworks like Leaflet.  This single purpose approach is not a problem when you’re using the map to show only one thing, but what if you want to show multiple data views or a variety of data on the same map?  Nothing against D3 (which is awesome) or other general purpose visualization frameworks, but why would I want to learn the intricacies of some other framework in order to produce a choropleth map?  If I’m already using Leaflet for all of my web mapping activities, why wouldn’t I use it for creating thematic maps?

Fortunately, Leaflet provides a number of built-in capabilities that enable creating thematic maps, including excellent support for GeoJSON and the ability to draw points using image-based, HTML-based, and circle markers as well as support for drawing lines and polygons and styling those features dynamically.  There are a few tutorials on the Leaflet website that explain how to use Leaflet’s GeoJSON capabilities for displaying point, line, and polygon data and creating choropleth maps.  I recommend checking these tutorials out if you’re interested in using Leaflet or want to better understand what capabilities Leaflet provides out of the box (screenshots of these tutorials plus links to the tutorials appear below).

[Image: Loading GeoJSON using Leaflet tutorial]
[Image: Creating a Choropleth Map using Leaflet tutorial]

While Leaflet’s out of the box capabilities simplify thematic map creation, there are still several challenges that developers face when trying to create thematic maps using Leaflet or any web mapping framework for that matter, particularly when GeoJSON data is not available.  The first challenge is a common one for developers – no standard format.  The Internet is full of data that can be used to build thematic maps, but this data comes in a variety of formats (CSV, TSV, XML, JSON, GeoJSON, etc.).  This makes building reusable code that works with and across most datasets challenging.  The underlying issue here, and perhaps the main reason that data is and will continue to be created in a variety of formats, is that the people creating this data aren’t typically creating the data with geospatial visualization in mind, so there will almost always be some aspect of massaging the data so that it can be loaded and displayed using the desired mapping framework.

Mapping data on a political/admin boundary level comes with its own set of challenges.  Often the data driving choropleth and thematic map visualizations related to political boundaries are numbers and categories associated with a code for a given boundary.  These codes can include (among other options): a two-letter state code, a Federal Information Processing Standard (FIPS) code, an International Organization for Standardization (ISO) two- or three-letter country code, a state name, a country name, etc.  This again comes back to the issue of a lack of standardization; for country level statistics, for instance, you might see country names, two- or three-letter codes, numeric codes, or other codes being used across data providers.  Very rarely are the geometries associated with each boundary included in the source data, and even more rare is pre-packaged GeoJSON that contains boundary polygons along with statistics as properties of each polygon.  This introduces a challenge for developers in that we must find and associate boundary polygons with those boundary identifiers on the client side or on the server in order to build a thematic map.  On the client side, this may involve interacting with a web service (e.g. an OGC Web Feature Service (WFS)) that serves up those polygons, particularly in the case where we're creating choropleth maps for lower level admin/political boundaries.  In general, the two most common types of choropleth maps that people tend to create are country and state level maps.  If I'm building a state or country choropleth, I'm probably going to be using all of the state or country polygons that are available, so making requests to a web service to get each polygon might be a little excessive.  In addition, if we're trying to display symbols based on state or country codes, we need the centroids of each political boundary in order to position each symbol correctly.  This means we need to calculate the centroid dynamically or include it as a property of the boundary polygon.

In addition to challenges with data formats, there are often redundant tasks that developers must perform when creating thematic maps.  These include:

  1. Retrieving data from a local/remote web server and translating those data into JavaScript objects
  2. Translating data locations into web mapping framework appropriate point/layer objects
  3. Dynamically styling point/layer objects based on some combination of data properties
  4. Setting up interactive features – handling mouseover and click events
  5. Displaying a legend to make it easier for the user to understand what’s being shown

In the grand scheme of things, these are not monumental challenges, but they do make the jobs of developers more difficult.  HumanGeo’s Leaflet Data Visualization Framework helps to alleviate some of these challenges by abstracting many of these details from the developer.  In particular, the framework seeks to:

  • Enable cool, interactive, infographic style visualizations
  • Support displaying data in any JSON-based data format (not just GeoJSON)
  • Eliminate redundant code/tasks related to displaying data
  • Provide tools for simplifying mapping one value to another (e.g. temperature to radius, count to color, etc.)
  • Standardize the way in which data are loaded/displayed
  • Minimize dependency on server-side components for typical use cases (e.g. visualizing data by US state or country)
  • Remain consistent with Leaflet's API/coding style so that it's easy to pick up and use if you're already familiar with Leaflet

It wouldn’t be a proper introduction to a framework that’s all about visuals without showing you some sample visualizations, so here are a few example maps created using the framework to give you an idea of the type of maps you can create:

[Image: Election Mapping]
[Image: Multivariate Data Display – Country Level Data Display]
[Image: Ethnic Enclaves in New York City]

In the next few blog entries, I’ll provide more details and examples illustrating how HumanGeo’s Leaflet Data Visualization Framework simplifies geospatial data visualization and thematic map creation.  In the meantime, check out the code and examples on GitHub, and send me an e-mail if you’re interested, have questions, or want to contribute.