Keeping Up With the Cool Kids: Elasticsearch and Urban Dictionary – Part 1

fry-slangAt HumanGeo, we love Elasticsearch and we love social media. Elasticsearch lends itself well to a variety of interesting ways to process the vast amount of content in social media. But like most things on the internet, keeping up with slang and trends in social media text can be an increasingly difficult barrier to entry to analysing the data (so can getting beyond your teenage years). So how do we get past this barrier? If the web is so powerful, can’t we use it to help us understand what’s really being said?

Enter Urban Dictionary. Usually, if you’re looking for answers, UD might be the last place on the internet you want to look unless you have a large jug of mindbleach waiting on standby. Aside from proving that the internet is a cold, dark place, Urban Dictionary has a large amount of crowd-sourced data that can help us get some insight into today’s communication medium, whether it’s 140 characters or beyond.

In this post, my goal is to 1) collect a bunch of data from Urban Dictionary, 2) index it in such a way that I can use it to “decipher” lousy internet slang and 3) query it with “normal” terms and get extended results.

The Data

To get started, we needed to get the words themselves. To do this, I built a simple web scraper to scroll UD and extract the words. Here’s a snippet to extract the words out of the DOM using Python via Requests and Beautiful Soup.

import requests
from bs4 import BeautifulSoup

WORD_LINK = 'http://www.urbandictionary.com/popular.php?character={0}'

def make_alphabet_soup(self, letter, link=WORD_LINK):
    '''Make soup from the list of letters on the page.'''
    r = requests.get(link.format(letter))
    soup = BeautifulSoup(r.text)
    return soup
def parse_words(self, letter, soup=None):
    '''Scrape the webpage and return the words present.'''
    if not soup:
        soup = self.make_alphabet_soup(letter)
    word_divs = soup.find(id='columnist').find_all('a')
    words = [div.text for div in word_divs]
    return popular_words

This is the basic building block, but I extended from there. For every word I grabbed, I threw it against the Urban Dictionary API and got a definition.

# Redacted
API_LINK = 'http://ud_api.com'

def define(self, word, url=API_LINK):
'''Send a request with the given word to the UD JSON API.'''
r = requests.get(url, params={'term': word}, timeout=5.0)
j = r.json()
# Add our search term to the document for context
j.update({'word': word})
return j

Using this method, I ended up with about 100k “popular” words, as defined by UD. An example response from the API looks something like:

{
 "tags": [
    "black",
    "ozzy",
    "sabbath",
    "black sabbath",
    "geezer",
    "metal",
    "osbourne",
    "tony",
    "bill",
    "butler"
 ],
 "result_type":"exact",
 "list": [
    {
       "defid": 772739,
       "word": "Iommi",
       "author": "Matthew McDonnell",
       "permalink": "http://iommi.urbanup.com/772739",
       "definition": "Iommi = a Godlike person. A master of their chosen craft. Someone or something extremely cool",
       "example": "Example 1. Hey rick, that motorcyle stunt you did was really Iommi! \r\n\r\nExample 2. That guy is SO Iommi! \r\n\r\nExample 3. Be Iommi, man!",
       "thumbs_up": 57,
       "thumbs_down": 3,
       "current_vote": ""
    }
 ],
 "sounds":[]
}

Now that I had the data, it was time to make something out of it.

The Process

With our data in hand, it’s time to utilize Elasticsearch. More specifically, it’s time to take advantage of the Synonym Token Filter when indexing data into Elasticsearch.

A quick interjection about indexing: this is an good time to talk about “the guts” of how data is indexed into Elasticsearch. If you don’t specify your mappings when indexing data, you can get unexpected results if you’re not familiar with the mapping/analysis process. By default, the data is tokenized upon indexing, which is great for full-text search but not when we want exact matches to multiple words. For example, if I’m searching for exactly “brown fox” in my index (for example, an exact match against my query string), I will get results for the sentence “John Brown was attacked by a fox.” You can read more about that behavior here. A good strategy is to create a subfield of “word” such as “.raw” where the “.raw” is set to not_analyzed in your mapping.

Using the data, we collected, we can generate the Solr synonym file required by the token filter. To do this, I used the “tags” area of the definition. This definitely is not a set of synonyms (sometimes you just get a bunch of racism and filth), but it does provide (potentially) related words to the original word. For example, here are the tags for word “internet”:

  • “facebook”
  • “web”
  • “computer”
  • “myspace”
  • “lol”
  • “google”
  • “online”
  • “porn”
  • “youtube”
  • “internets”

I mean, they’re not wrong. Here’s an example of adding the mapping I used on the “test” index in the “name” field:

justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test" -d
{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "synonym": {
                  "tokenizer": "whitespace",
                  "filter": [
                     "synonym"
                  ]
               }
            },
            "filter": {
               "synonym": {
                  "type": "synonym",
                  "synonyms_path": "/tmp/solr-synonyms.txt"
               }
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "name": {
               "type":"string",
               "index":"analyzed",
               "analyzer":"synonym"
            }
         }
      }
   }
}

The Search

Now that we have our index set up, it’s time to put a search in action.  Until I went down this rabbit hole, I had no idea calling something Iommi was a thing (it probably isn’t). As someone who likes Black Sabbath, I want to find other words in my index that are totally Iommi. Using the mapping I specified above, I indexed a few sample documents with “name” field set to tags that UD relates to Iommi, as well as some bogus filler. Example tags (and no, I did not make this example up):

  • “sabbath”
  • “black sabbath”
  • “geezer”
  • “metal”

Screen Shot 2014-12-10 at 10.11.14 PMOur query (in Sense, against the ‘test’ index), and the results:

POST _search
{
    "query": {
       "term": {
          "name": "iommi"
      }
   }
}

Awesome! This installment is more about showcasing how the filter works, so it’s not entirely practical. Look out for a future installment where we use real social media data to do “extended search” and display results with the Elasticsearch highlighting to show a practical example of this in action.

Beer Caps to Coffee Tables

Some years ago, I lived in Norfolk, VA with a friend who loved craft beer as much as I did. Our hoppy passion motivated us to brew beer at home, visit local craft beer stores, and generally enjoy our nights with a brew here or there. Subconsciously knowing it would be a brilliant idea, we used a tupperware box next to our refrigerator to house a growing collection of beer caps, feeding it a single cap as each beer departed the cold, destined for enjoyment.

After a few years, I realized the box was close to being filled to the brim. The mechanical gears in my head were cranking out an idea to do something creative with the bottle caps. I wanted to incorporate some things I’d learned in my classes related to imaging, computer vision, data mining, and spectral signatures. I also had an old square coffee table (36″x36″) that could be the canvas to this project. This was the beginning of a fun exploration in image processing with python to create a bottle cap magnum opus.

Getting it Done
There were a few things I needed to do; first, I needed more caps.  Given the size of a bottle cap (1″) minus some width for a border around the table and a little space between the caps, I found that the table would accept a 30 cap by 30 cap grid, therefore I needed at least 900 caps.  A local bar supported this effort by hosting a 5 gallon bucket for the bartender to drop caps into in lieu of the trash can.

Second, I needed a way to extract the data (color; specifically red, green, and blue) from each individual cap.  I originally arranged the images on a gridded sheet of paper, took a photo, and extracted them row-by-column after performing some image transforms to account for the skewed nature of the image.  As it turns out, there is no good way to get a completely top-down image of the caps; it will always turn out to be smaller at the top and larger at the bottom depending on the angle you held the camera at. The data wasn’t fantastic and as more caps came in, I knew I needed a better way; a quick survey of computer vision capabilities turned up a concept called a Hough Transform.

Hough transforms are a method of identifying shapes that can be approximated by equations in images; in this case, a circle with a roughly known radius was the target. This synopsis is a vast simplification, but for each pixel in the output image (of the same size as the input image), the algorithm polls those surrounding pixels that might form a circle and assigns a value to that pixel. Based on all the pixels in the image, one can then surmise that the pixels with values above a certain threshold are the center point for a circle of known radius. To facilitate the discovery of circles of a specific radius (as many caps have concentric circle patterns), I used a Sobel operator to convert the image into an edge-only image.

caps_on_sheet
A subset of the caps, spread out and photographed.

Although I initially wrote my own Sobel Filter and Hough Transform in Python, I think it’s smarter to use OpenCV, which I later discovered.  OpenCV is a set of computer vision related functions written in C/C++ with wrappers for Python; Hough transforms are only one of the things that OpenCV can simplify from a multi-hour debacle into a few minutes of digging in the documentation for the right tool. Here  is a quick snippet of something that took me dozens of lines of code in earlier editions:

def cv2hough(sobel_image, color_image, minrad, maxrad):
    '''
    Identifies and cuts out to a directory potentially circular
    shapes that match a known radius range.
    :param sobel_image: sobel-filtered image
    :param color_image: non-sobel filtered version of colored_image
    :return:
    '''
    in_cv_image = cv2.imread(sobel_image, 0)
    avg_rad = (minrad+maxrad)/2
    circles = cv2.HoughCircles(in_cv_image, cv.CV_HOUGH_GRADIENT, 1, 65,
    param1=90, param2=15,
    minRadius=minrad, maxRadius=maxrad)
    exportCircles(circles, color_image, 'caps', avg_rad)

(Documentation for cv2.HoughCircles)

This code snippet concludes by sending a list of all the places on the image where it thinks a cap might be to “exportCircles”, where those caps are then cut out of the image and sent to a 30px x 30px JPG file.

circle_3_img
A cap, post-extraction. It’s a rough approximation, but the hough transform was sufficient.

Once we have a directory containing hundreds and hundreds of caps, we can begin some analysis. (Full disclosure: I manually scrolled through the images and deleted those that appeared unusable.) Python was used to calculate statistics for each bottle cap image, and eventually stored the data in a structure. Each pixel is calculated applying a red (R), green (G), and blue (B) value. We can average these and get average (R,G,B) triplets for each bottle cap, that is to say; a cap is “mostly” red, or “mostly” blue, etc. I soon found that these numbers weren’t that descriptive and began to work in the Hue, Saturation, and Value color spaces. (HSL & HSV: Wikipedia).

Protip: I used a quick one-liner to convert from RGB to HSV before storing the data:
colorsys.rgb_to_hsv(rgb[0], rgb[1], rgb[2])

To save time cataloging 900 bottle caps and associated pixels, I pickle (serialize) each cap as I generate the data on it so I can quickly reference a data structure containing the caps and rearrange them as I see fit.

caps_avg
The caps by average, as the computer sees them.
comp_output
The caps as we see them; the final arrangement was sorted by hue then saturation.

Final Steps: Output
I originally wanted to see what it would look like when I rearranged the bottle caps into different patterns. This is where we do that, and based on the data we have and the framework we’ve built, outputting an image or a HTML document (this is a really hasty solution!) becomes fairly easy, minus one hiccup.

def first_sort(self):
    '''
    Reach into the internal config and find out what we want to have as
    our first order sort on the caps.
    :return:
    '''
    self.sortedDatList = sorted(self.imageDatList,
                                key=lambda thislist: thislist[
                                    self.sorts[self.conf['first_sort']]])
    del self.imageDatList
    self.imageDatList = self.sortedDatList

def second_sort(self):
    '''
    Second sort on the caps: sort rows individually.
    Example: We want a 30x30 grid of caps, sorted first by hue, then by
    saturation.
    We sort by hue in first_sort, then in second_sort, we sort 30 lists
    of 30 caps each individually and replace them into the list.
    :return:
    '''
    ranges = [[self.conf['out_caps_cols'] * (p - 1),
               self.conf['out_caps_cols'] * p] for p in
              range(1, self.conf['out_caps_rows'] + 1)]
    for r in ranges:
        self.imageDatList[r[0]:r[1]] = sorted(self.imageDatList[r[0]:r[1]],
                                              key=lambda thislist: thislist[
                                                  self.sorts[self.conf[
                                                      'second_sort']]])

Here we see two functions: firstSort and secondSort. You can guess that we sort the caps once, then sort them again; what may not be apparent from this code is that the second sort (which is performed on a different attribute), is performed based on the out_caps_rows attribute. This is the hiccup; you must sort multiple sub-segments of the array of bottle caps, in essence, sorting the rows after the columns have been generated. Otherwise each row of bottle caps will have a random pattern as the image trends from one extreme (top) to the other (bottom).

caps_organized
Caps, laid out and ready to be transferred to the table.

The Physical Construction
To finish the project, I physically arranged a pattern based on informal votes from friends and acquaintances. I built a 1/2″ lip around the table with “Quarter-Round” molding. I glued this down, sealed the seams with wood glue, and painted everything black.

I filled the bottom of this open-top area with playground sand. This minimized the volume that I would have to fill in with expensive epoxy resin. It had the secondary benefit of lightening up the entire display, knowing the epoxy would darken it. I ordered two kits of “AeroMarine 300/21 Epoxy Resin” (for a total of 3 Gallons) from the internet. Ultimately, I arranged the caps on a sheet of butcher’s paper and transferred them one-by-one to the sand box.

A few friends, the famous roommate, and I gathered (with beers); we mixed the resin and began the process of pouring the resin onto the table. The one unseen consequence of the sand was that when the epoxy entered the sand, it pushed air upwards, causing the caps to float; we quickly used toothpicks to push the caps down into the sand, then poured more resin over top.

caps_final
The table, post-drying and finished.

Code for this was hacked together over a number of weeks, I’ve consolidated it here: https://github.com/jamesfe/capViz.  Unfortunately, most of this code was written in a disheveled, disconnected manner and I’m just now beginning to tie it together into one well documented, good-practice PEP-8 compliant piece, but feel free to poke around and let me know what you think!

Thanks to The Birch for all the beer caps!

If this interested you, follow me on Twitter! @jimmysthoughts