
No Soup for You: When Beautiful Soup Doesn’t Like Your XML

Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy’s landing page for the library even says:

Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

You can plot me directly in the “days” column. When I’m starting a Python project that requires me to parse through HTML data, the first dependency I’ll pull is BeautifulSoup. It turns what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I’ve learned that BS can be a huge boon for XML data too, but not without a couple of speed bumps.

Enter data from the client (not real data, but play along for a moment):

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <link>
          <url>http://atlanta.braves.mlb.com</url>
        </link>
    </team>
  </sport>
</xml>

Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.

from __future__ import print_function
from bs4 import BeautifulSoup

# Make our soup
with open('data.xml') as infile:
    blob = infile.read()
# Use LXML for blazing speed
soup = BeautifulSoup(blob, 'lxml')
print(soup.find_all('link'))

We get in return:

[<link/>]

Wait, what? What happened to our content? This is a pretty basic BS use case, but something strange is happening. Well, I’ll come back to this and start working with the client’s other (equally hypothetical) data sets, where <link> tags become <resource> tags but the rest of the data is structured exactly the same. This time around:

<xml>
  <sport name="baseball">
    <team name="Braves" city="Atlanta">
        <resource><!-- Changed to resource -->
          <url>http://atlanta.braves.mlb.com</url>
        </resource><!-- Corresponding close -->
    </team>
  </sport>
</xml>

…and corresponding result…

>>> print(soup.find_all('resource'))
[<resource>
<url>http://atlanta.braves.mlb.com</url>
</resource>]

Interesting! To compound our problem, we’re on a customer site where we don’t have internet access to grab that sweet, sweet documentation we crave. After toiling through a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO’s co-founder, Jeff Atwood: Read the Source. But what was I looking for? Well, let’s dig around for <link> tags and see what turns up.

Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home, bs4/builder/__init__.py, lines 228 and 229 in v4.3.2).

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                          'spacer', 'link', 'frame', 'base'])

We have a seemingly harmless word with “link” in our XML, but it means something very different in HTML – and, more specifically, to the TreeBuilder implementation that lxml is using. As a test, if I change our <link>-turned-<resource> tags into <base> tags, we get the same result – no content. It also turns out that if you have lxml installed, BeautifulSoup4 will prefer it for parsing; uninstalling it grants us the results we want – tags with content. The stricter (but faster) TreeBuilder implementations from lxml take precedence over the built-in HTMLParser or html5lib (if you have it installed). How do we know that? Back to the source code!

bs4/builder/__init__.py, lines 304:321

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last result.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass

As it turns out, when creating your soup, ‘lxml’ != ‘xml’. Specifying ‘lxml’ still selects an HTML tree builder, which is why we were seeing those results; changing the soup creation gets us the results we’re looking for (UPDATE: corresponding doc “helpfully” pointed out by a Reddit commenter here).

# Use the XML parser for sanity
soup = BeautifulSoup(blob, 'xml')
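
For a quick sanity check, here’s a minimal side-by-side run against the same data.xml (assuming lxml is installed – the ‘xml’ parser requires it):

from bs4 import BeautifulSoup

with open('data.xml') as infile:
    blob = infile.read()

# The HTML tree builders treat <link> as an empty-element tag, so its children get orphaned
print(BeautifulSoup(blob, 'lxml').find_all('link'))
# [<link/>]

# The XML tree builder has no list of void tags, so the content survives
print(BeautifulSoup(blob, 'xml').find_all('link'))
# [<link><url>http://atlanta.braves.mlb.com</url></link>]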

While I didn’t initially find that magic code snippet to fix everything (UPDATE: thanks, Reddit), we found our problem – we just went a really roundabout way to get there. Understanding why it was happening made me feel a lot better in the end. It’s easy to get frustrated when coding, but always remember: read the docs and – Read the Source, Luke. It might help you understand the problem.

We’re hiring!  If you’re interested in geospatial, big data, social media analytics, Amazon Web Services (AWS), visualization, and/or the latest UI and server technologies, drop us an e-mail at info@thehumangeo.com.

Keeping Up With the Cool Kids: Elasticsearch and Urban Dictionary – Part 1

At HumanGeo, we love Elasticsearch and we love social media. Elasticsearch lends itself well to a variety of interesting ways to process the vast amount of content in social media. But like most things on the internet, keeping up with slang and trends in social media text can be an increasingly difficult barrier to entry when analyzing the data (so can getting beyond your teenage years). So how do we get past this barrier? If the web is so powerful, can’t we use it to help us understand what’s really being said?

Enter Urban Dictionary. Usually, if you’re looking for answers, UD might be the last place on the internet you want to look unless you have a large jug of mindbleach waiting on standby. Aside from proving that the internet is a cold, dark place, Urban Dictionary has a large amount of crowd-sourced data that can help us get some insight into today’s communication medium, whether it’s 140 characters or beyond.

In this post, my goal is to 1) collect a bunch of data from Urban Dictionary, 2) index it in such a way that I can use it to “decipher” lousy internet slang and 3) query it with “normal” terms and get extended results.

The Data

To get started, we needed to get the words themselves. To do this, I built a simple web scraper to crawl UD and extract the words. Here’s a snippet to extract the words out of the DOM using Python via Requests and Beautiful Soup.

import requests
from bs4 import BeautifulSoup

WORD_LINK = 'http://www.urbandictionary.com/popular.php?character={0}'

def make_alphabet_soup(letter, link=WORD_LINK):
    '''Fetch the popular-words page for a letter and make soup from it.'''
    r = requests.get(link.format(letter))
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def parse_words(letter, soup=None):
    '''Scrape the webpage and return the words present.'''
    if not soup:
        soup = make_alphabet_soup(letter)
    word_divs = soup.find(id='columnist').find_all('a')
    words = [div.text for div in word_divs]
    return words

This is the basic building block, but I extended from there. For every word I grabbed, I threw it against the Urban Dictionary API and got a definition.

# Redacted
API_LINK = 'http://ud_api.com'

def define(word, url=API_LINK):
    '''Send a request with the given word to the UD JSON API.'''
    r = requests.get(url, params={'term': word}, timeout=5.0)
    j = r.json()
    # Add our search term to the document for context
    j.update({'word': word})
    return j
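
Tying the two snippets together, a rough driver loop might look something like this (a sketch – the function name and error handling are mine, and it reuses parse_words and define from above):

import string

def collect_definitions():
    '''Walk the alphabet, scrape the popular words, and define each one.'''
    documents = []
    for letter in string.ascii_uppercase:
        for word in parse_words(letter):
            try:
                documents.append(define(word))
            except (requests.RequestException, ValueError):
                # Skip words that time out, error out, or return bad JSON
                continue
    return documents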

Using this method, I ended up with about 100k “popular” words, as defined by UD. An example response from the API looks something like:

{
 "tags": [
    "black",
    "ozzy",
    "sabbath",
    "black sabbath",
    "geezer",
    "metal",
    "osbourne",
    "tony",
    "bill",
    "butler"
 ],
 "result_type":"exact",
 "list": [
    {
       "defid": 772739,
       "word": "Iommi",
       "author": "Matthew McDonnell",
       "permalink": "http://iommi.urbanup.com/772739",
       "definition": "Iommi = a Godlike person. A master of their chosen craft. Someone or something extremely cool",
       "example": "Example 1. Hey rick, that motorcyle stunt you did was really Iommi! \r\n\r\nExample 2. That guy is SO Iommi! \r\n\r\nExample 3. Be Iommi, man!",
       "thumbs_up": 57,
       "thumbs_down": 3,
       "current_vote": ""
    }
 ],
 "sounds":[]
}

Now that I had the data, it was time to make something out of it.

The Process

With our data in hand, it’s time to utilize Elasticsearch. More specifically, it’s time to take advantage of the Synonym Token Filter when indexing data into Elasticsearch.

A quick interjection about indexing: this is a good time to talk about “the guts” of how data is indexed into Elasticsearch. If you don’t specify your mappings when indexing data, you can get unexpected results if you’re not familiar with the mapping/analysis process. By default, the data is tokenized upon indexing, which is great for full-text search but not when we want exact matches to multiple words. For example, if I’m searching for exactly “brown fox” in my index (say, an exact match against my query string), I will get results for the sentence “John Brown was attacked by a fox.” You can read more about that behavior here. A good strategy is to create a subfield of “word”, such as “.raw”, where the “.raw” field is set to not_analyzed in your mapping.
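
As a sketch of that strategy (using the pre-2.x string mapping syntax this post assumes), a multi-field keeps the analyzed “word” for full-text search alongside an untouched “word.raw” for exact matches:

"word": {
   "type": "string",
   "fields": {
      "raw": {
         "type": "string",
         "index": "not_analyzed"
      }
   }
}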

Using the data we collected, we can generate the Solr synonym file required by the token filter (a sketch of generating that file appears after the tag list below). To do this, I used the “tags” area of the definition. This definitely is not a set of synonyms (sometimes you just get a bunch of racism and filth), but it does provide (potentially) related words to the original word. For example, here are the tags for the word “internet”:

  • “facebook”
  • “web”
  • “computer”
  • “myspace”
  • “lol”
  • “google”
  • “online”
  • “porn”
  • “youtube”
  • “internets”

I mean, they’re not wrong. Here’s an example of adding the mapping I used on the “test” index in the “name” field:

justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test" -d '
{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "synonym": {
                  "tokenizer": "whitespace",
                  "filter": [
                     "synonym"
                  ]
               }
            },
            "filter": {
               "synonym": {
                  "type": "synonym",
                  "synonyms_path": "/tmp/solr-synonyms.txt"
               }
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "name": {
               "type":"string",
               "index":"analyzed",
               "analyzer":"synonym"
            }
         }
      }
   }
}'
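
Before searching, the _analyze API is a handy way to confirm the synonym filter is actually picking up the file (a sketch – it assumes the index above exists and the synonym file has a line for “iommi”):

justin@macbook ~/p/urban> curl "http://localhost:9200/test/_analyze?analyzer=synonym&text=iommi"

The response lists the tokens the analyzer emits, so the related tags should show up alongside the original term if the file is being read.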

The Search

Now that we have our index set up, it’s time to put a search into action.  Until I went down this rabbit hole, I had no idea calling something Iommi was a thing (it probably isn’t). As someone who likes Black Sabbath, I want to find other words in my index that are totally Iommi. Using the mapping I specified above, I indexed a few sample documents with the “name” field set to tags that UD relates to Iommi, as well as some bogus filler (a sketch of that indexing step follows the list below). Example tags (and no, I did not make this example up):

  • “sabbath”
  • “black sabbath”
  • “geezer”
  • “metal”
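
Indexing those sample documents might look something like this (the document bodies are hypothetical, and the bogus filler docs are omitted):

justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test/test" -d '{"name": "sabbath"}'
justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test/test" -d '{"name": "black sabbath"}'
justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test/test" -d '{"name": "geezer"}'
justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test/test" -d '{"name": "metal"}'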

Our query (in Sense, against the ‘test’ index):

POST _search
{
    "query": {
       "term": {
          "name": "iommi"
      }
   }
}

Awesome! This installment is more about showcasing how the filter works, so it’s not entirely practical. Look out for a future installment where we use real social media data to do “extended search” and display results with Elasticsearch highlighting to show a practical example of this in action.