String tokenization

Sometimes you’ve just got to tokenize a string.

Let’s say you find you need to parse a string based on the tokens it comprises. It’s not an uncommon need, but there’s more than one way to go about it. What does “token” mean to you in this situation? How can you break a string down into a list of them?

It can be daunting answering such an open-ended question. But not to fear! Here I’ll be explaining your best options and how you’ll know when to use them. I’m using Python for these examples, but you should feel free to implement the same algorithms in your language of choice.

Tokenization by delimiters via split

By far the most straightforward approach is one that you’ve surely used before, and that’s quickly done in essentially any language. It’s to split a string on each occurrence of some delimiter.

Say you were reading a csv file and for each row your tokens were to be the value associated with each column. It’s quick, easy, and almost always sufficient to split each line on the commas. Here’s a simple example utilizing a string’s split method:

>>> string = 'Hello world!'
>>> tokens = string.split(' ')
>>> tokens
['Hello', 'world!']

But the usefulness of splitting doesn’t end there. You could also introduce some regex and define multiple characters as being delimiters.

>>> import re
>>> string = "Don't you forget about me."
>>> tokens = re.split( r'[e ]', string )
>>> tokens
["Don't", 'you', 'forg', 't', 'about', 'm', '.']

Unfortunately, delimiters can only get you so far when doing tokenization. What if you needed to break crazier things like “abc123” into separate tokens, one for the letters and one for the numbers? Or maybe you wanted your delimiters to be represented as separate tokens, not to be tossed entirely? You wouldn’t be able to do it using a generic split method.

Tokenization by categorization of chars

One option when dealing with tokens that strictly splitting on delimiters won’t work is assigning tokens by character categorization. It’s still fast, and its flexibility is likely to be enough for most cases.

Here’s an example where the characters in a string are iterated through and tokens are assigned by the categories each character falls into. When the category changes, the token made up of accumulated characters so far is added to the list and a new token is begun.

>>> def tokenize( string, categories ):
...     token = ''
...     tokens = []
...     category = None
...     for char in string:
...         if token:
...             if category and char in category:
...                 token += char
...             else:
...                 tokens.append( token )
...                 token = char
...                 category = None
...                 for cat in categories:
...                     if char in cat:
...                         category = cat
...                         break
...         else:
...             category = None
...             if not category:
...                 for cat in categories:
...                     if char in cat:
...                         category = cat
...                         break
...             token += char
...     if token:
...         tokens.append( token )
...     return tokens
>>> string = 'abc, e4sy as 123!'
>>> tokens = tokenize( string, [ '0123456789', ' ', ',.;:!?', 'abcdefghijklmnopqrstuvwxyz' ] )
>>> tokens
['abc', ',', ' ', 'e', '4', 'sy', ' ', 'as', ' ', '123', '!']

There are a lot of improvements you could make to this tokenization function! For example, you could record which category each token belonged to. You could make the categories more specific, you could implement some flexibility in how they’re handled. You could even use regular expressions for verifying whether a current token plus the next character still fits some category. These are left as exercises to the reader.

Tokenization using regular expressions

But there’s a much better way to tokenize when you need to do more than the basics. Here you can find a lexical scanner package which uses regular expressions to define what tokens should look like: (“Lexical scanner” is a fancy word for a tokenizer that would typically be the first step in a lexical analysis, which is how interpreters and compilers make sense of code. But their applications are in no way limited to just those.) Installation would entail placing the repo’s lexscan directory somewhere that your code can access it.

It’s very handy! Here’s a simple example:

>>> import lexscan
>>> from pprint import pprint
>>> spaceexp = lexscan.ScanExp( r'\s+', significant=False )
>>> wordexp = lexscan.ScanExp( r'[a-z]+' )
>>> numexp = lexscan.ScanExp( r'\d+' )
>>> string = '2001: A Space Odyssey'
>>> tokens = lexscan.tokenize( string, ( spaceexp, wordexp, numexp ) )
>>> pprint( tokens )
[1:0: '2001' \d+,
 1:4: ':' None,
 1:6: 'A' [a-z]+,
 1:8: 'Space' [a-z]+,
 1:14: 'Odyssey' [a-z]+]

Tokens are assigned based on the longest match of the various expressions provided starting at the end of the previous token. The concept is simple, but tokenization doesn’t get much better. The functionality really shines when you’re using more complex regular expressions. What if you needed to differentiate between integers and floating point numbers? What if you wanted to capture an string literal enclosed in quotes as a single large token? You can do that using a proper scanner and all it takes is the wherewithal to put together a set of regular expressions which do what you need. Here’s a more interesting example that looks more like what you would want to use an implementation like this one for.

>>> import lexscan
>>> from pprint import pprint
>>> spaceexp = lexscan.ScanExp( r'\s+', significant=False, name='whitespace' )
>>> identifierexp = lexscan.ScanExp( r'[a-z_]\w+', name='identifier' )
>>> delimiterexp = lexscan.ScanExp( r'[,;]', name='delimiter' )
>>> intexp = lexscan.ScanExp( r'\d+', name='integer' )
>>> floatexp = lexscan.ScanExp( r'[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?', name='float' )
>>> stringexp = lexscan.ScanExp( r'"((?:\\.|[^\\"])*)"', name='string' )
>>> assignexp = lexscan.ScanExp( r'[=:]', name='assignment' )
>>> string = 'strings: "foobar", "don\'t stop me n0w"; numbers: 450, 13.37'
>>> tokens = lexscan.tokenize( string, ( spaceexp, identifierexp, delimiterexp, intexp, floatexp, stringexp, assignexp ) )
>>> pprint( tokens )
[1:0: 'strings' identifier,
 1:7: ':' assignment,
 1:9: '"foobar"' string,
 1:17: ',' delimiter,
 1:19: '"don't stop me n0w"' string,
 1:38: ';' delimiter,
 1:40: 'numbers' identifier,
 1:47: ':' assignment,
 1:49: '450' integer,
 1:52: ',' delimiter,
 1:54: '13.37' float]

Of course, splitting a string will always be faster than using regular expressions for tokenizing. Be judicious in deciding which approach to use!