(Machine) Learning About Love: Who will Leave The Bachelor next?

What is love?  Can you define love, or do you merely recognize its presence or absence without binding it in a definition?  A computer may be able to provide a definition of love, regurgitated from a database, but can a computer recognize love?  If defining love is so difficult for us, many of whom might cop out with “I know it when I see it”, how much more difficult will a computer find this challenge?  On a lark, I decided to implement a machine learning system to predict a few things about a television show about love, ABC’s “The Bachelor”.

My fiancée regularly watches The Bachelor; valuing our relationship, I am sometimes drawn in.  The premise of the show is this: Chris Soules (The Bachelor), a single man, acts as the sole decision maker in a competition among 30 eligible women to identify his potential bride.  Each week, one or more of the 30 women is eliminated as the bachelor gets to know them.  This is the 19th season of the show (there have been 10 iterations of its partner show, “The Bachelorette”).

If we can teach a computer to recognize some measure of love, it is certainly possible to predict who will depart the show next.  Regardless of how we frame this question, there are challenges.  First off, we have no clue how the Chris Soules “mental machine” works.  More importantly, we have only 30 data points (one per contestant), each described by far more features than we can quantify, and of those features we know only a dozen or so.  With 30 samples and a dozen features, we cannot build a statistically reliable model for prediction.  For the sake of entertainment and self-enrichment, I decided to make an attempt in the face of these challenges.

Machine Learning Techniques

I decided to use a Decision Tree so that I could inspect the contributing factors myself, then expanded the experiment to include a Support Vector Machine as well.  An understanding of the SVM, coupled with a quick glance at the scikit-learn algorithm cheat-sheet, drove my decision to incorporate both.
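Neither model’s configuration appears in this post; as a minimal sketch, assuming scikit-learn’s defaults, the two classifiers could be set up like this:

# Minimal sketch, assuming scikit-learn defaults; the actual parameters
# used in the project are not shown in the code excerpts below.
from sklearn import svm, tree

tree_clf = tree.DecisionTreeClassifier()  # interpretable: the splits can be inspected
svm_clf = svm.SVC()                       # default RBF kernel, no tuning

Either object can then be handed to the training routine shown later as its sc_learn argument.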

Data Collection

I first wrote a scraper to pull some data off the open Internet about each candidate: a photograph, an age, and some ancillary data.  From the photograph, I manually identified features such as ethnicity and hair type (length, curliness, color).  From the ancillary data, I acquired other quantifiable values, such as number of tattoos, as well as whether or not that contestant had been eliminated.
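The scraper itself isn’t reproduced in this post; as a rough sketch of the approach, assuming selenium to load each bio page and BeautifulSoup to parse it (the URL and selectors below are placeholders, not ABC’s actual markup):

# Rough sketch only: bio_url and the tag/class names are hypothetical
# placeholders; the real ABC pages require their own selectors.
from bs4 import BeautifulSoup
from selenium import webdriver

bio_url = "http://abc.go.com/..."  # placeholder for a contestant's bio page
driver = webdriver.Firefox()
driver.get(bio_url)
soup = BeautifulSoup(driver.page_source, "html.parser")

contestant = {
    "name": soup.find("h1").get_text(strip=True),
    "photo_url": soup.find("img")["src"],
    "free_text": soup.find("div", {"class": "bio"}).get_text(strip=True),
}
driver.quit()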

Each individual is encoded in JSON like so:

 {
     "hair_wavy": "straight",
     "hometown_name": "Hamilton",
     "height_inches": 64,
     "age": 24,
     "num_tattoos": 0,
     "hair_color": "dark",
     "name": "Alissa Giambrone",
     "hair_length": "chest",
     "date_fear": " Running into recent exes",
     "occupation": "Flight Attendant",
     "likes": [
        "Family",
        "friends",
        "laughter",
        "hope",
        "faith"
     ],
     "free_text": " If I never had to upset others, I would be very happy.  If I never got to play with puppies, I would be very sad.  If you could be any animal, what would you be? A wild mustang. Free to run and explore, they're unpredictable and beautiful, and are loyal to their herd.  If you won the lottery, what would you do with your winnings? Adopt dogs and charter a jet for my friends anf fmaily to fly around Europe, with unlimited champagne and a hot air balloon ride over Greece.  What's your most embarrassing moment? I was in-depth stalking a guy's Facebook page and sent my friend a long, detailed text about my findings...except I sent it to him. Oops.  What is your greatest achievement to date? Getting my yoga certification because I've been able to inspire others to do yoga and become instructors!  ",
     "eliminated": true,
     "photo_url": "http://static.east.abc.go.com/service/image/ratio/id/90fe9103-e788-418d-9eb1-4204b7ba1b96/dim/690.1x1.jpg?cb=51351345661",
     "hometown_state": " NJ",
     "ethnicity": "caucasian",
     "goneweek": 2,
     "featured": true,
     "featured_num": 6,
     "intro_order": 21
}

Some data properties are categorical (such as ethnicity and hair color), some are ordinal (hair length), and some are ratios (such as age or number of tattoos). The Python script encodes the categorical and ordinal properties into numeric values through lookup dictionaries.

hair_length = dict({"neck": 1,
                    "shoulder": 2,
                    "chest": 3,
                    "stomach": 4})

Workflow

Given this sparse data, it’s hard to train the machine.  More importantly, we can’t break the data into separate training and testing sets; because there’s so little data, we need to use all of it.  For this reason, I decided to randomly pick 25% of the data to train the tree or SVM, then run the entire data set (including those same training points) back through the classifier.  To offset this horrible violation of machine learning principles, I’ll run the model a thousand times and see who gets misclassified as “eliminated” the most.

Here’s the code I wrote to train on the samples:

import random  # needed for the random subsampling below


def week_predict(tgt_data, elims, tgt_week, sc_learn):
    """
    Given some data, make some predictions and return the accuracy.
    :param tgt_data: json structure of contestant data to be formatted
    :param elims: elimination data, passed through to data_formatter
    :param tgt_week: the week we are predicting departures for
    :param sc_learn: a scikit-learn classifier (decision tree or SVM)
    :return: dict with the overall accuracy and the predicted departures
    """

    learn_values = data_formatter(tgt_data, elims, tgt_week)
    x = list()
    y = list()
    samples = set()
    # Randomly choose 25% of the contestants as the training sample.
    while len(samples) < (len(learn_values) * 0.25):
        samples.add(random.randint(0, len(learn_values) - 1))
    learn_arr = list(learn_values.values())
    # print "Sample selection: ", samples
    for index in samples:
        x.append(learn_arr[index][0])
        y.append(learn_arr[index][1])

    # These next two lines courtesy of:
    # http://scikit-learn.org/stable/modules/tree.html
    clf = sc_learn
    try:
        clf = clf.fit(x, y)
    except ValueError:
        # Sometimes we try to train with 0 classes.
        return dict({"accuracy": 0, "departures": []})
    c = 0
    departures = list()
    # Run every contestant (including the training points) back through the
    # classifier and record who gets misclassified.
    for item in learn_values:
        if clf.predict([learn_values[item][0]])[0] != learn_values[item][1]:
            c += 1
            if learn_values[item][1] == 0:  # if they aren't eliminated
                departures.append(item)
    accuracy = round((len(learn_values) - c) / float(len(learn_values)) * 100, 2)
    ret_val = dict({"accuracy": accuracy,
                    "departures": departures})
    return ret_val
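The driver that produces the vote counts below isn’t shown in this post; a rough sketch of how the thousand runs might be tallied, assuming the week_predict() function above (data, elims, and tgt_week are placeholders):

# Hypothetical driver sketch: tally how often each still-active contestant
# is misclassified as eliminated across 1,000 random retrainings.
from collections import Counter
from sklearn import tree

votes = Counter()
for _ in range(1000):
    result = week_predict(data, elims, tgt_week, tree.DecisionTreeClassifier())
    votes.update(result["departures"])

for name, count in votes.most_common():
    print("Departing: {0} votes: {1}".format(name, count))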

Output

These are my leading predictions for departure.  First, the decision tree:

Departing: Britt Nilsson votes: 640
Departing: Carly Waddell votes: 581
Departing: Jade Roper votes: 538
Departing: Becca Tilley votes: 520
Departing: Megan Bell votes: 506
Departing: Kaitlyn Bristowe votes: 365

For the SVM, the output:

Departing: Britt Nilsson votes: 749
Departing: Megan Bell votes: 722
Departing: Becca Tilley votes: 714
Departing: Jade Roper votes: 707
Departing: Carly Waddell votes: 629
Departing: Kaitlyn Bristowe votes: 619

I’ve arranged each of these in descending order: the more votes a contestant has for departure, the more frequently the model, trained on who has already been eliminated and who remains, misclassified her as eliminated.

Analysis

Does this make sense given the current state of the contest? When I discussed these predicted outcomes with expert viewers, they were adamant that Britt, the contestant I’ve pegged as most likely to leave, has almost no chance of departing.  She made a strong impression on Chris and seems to have a strong connection with him, so for her to leave would be a shocker; on the other hand, many of these same viewers thought Kaitlyn had a good chance of winning, which would validate her low departure score.

UPDATE: 

This blog post didn’t make it to the Internet in time for these predictions to be evaluated in public.  As it stands, (spoiler alert) we know that Britt and Carly departed in Week 7, followed by Jade in Week 8.  With that in mind, I will make some predictions for next week:

Decision Tree results:

Departing: Becca Tilley: 3590
Departing: Whitney Bischoff: 3357
Departing: Kaitlyn Bristowe: 3320

Support Vector Machine results:

Departing: Becca Tilley: 1799
Departing: Whitney Bischoff: 1760
Departing: Kaitlyn Bristowe: 1715

Analysis 2.0

We’ve trained a decision tree and an SVM on just a few variables that don’t capture the essence of the problem very well.  Furthermore, we did no fine-tuning of the SVM’s parameters, nor did we prune the decision tree.
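Both of those gaps are addressable; as a sketch of what that tuning could look like, assuming scikit-learn’s GridSearchCV (the parameter grids here are illustrative, not values from the project):

# Illustrative tuning sketch, not part of the original project.
from sklearn import svm, tree
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

svm_search = GridSearchCV(svm.SVC(),
                          {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                          cv=3)
tree_search = GridSearchCV(tree.DecisionTreeClassifier(),
                           {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 2, 4]},
                           cv=3)
# svm_search.fit(x, y) and tree_search.fit(x, y) would select the best
# parameters by cross-validation, though with only ~30 contestants the
# estimates would still be very noisy.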

In order to rectify the data issue, I have put out a call to fans of The Bachelor for additional data about any facet of the show, including:

  • Number of kisses per episode, per contestant
  • Who gets which date?
  • Number of minutes of on-air time per episode, per contestant
  • Distance from each contestant’s hometown to Arlington, Iowa (see the sketch below)
  • Any other data
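The hometown-distance feature is straightforward to compute once the hometowns are geocoded; a hypothetical sketch using the standard haversine formula (coordinates below are approximate and for illustration only):

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    radius = 3959.0  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Approximate coordinates for Hamilton, NJ and Arlington, IA (illustrative only).
print(haversine_miles(40.22, -74.66, 42.75, -91.67))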

Tools Used

I used Python, with mostly stock libraries, to gather data and produce these predictions: selenium and BeautifulSoup for web scraping, scikit-learn for machine learning, and pylint for keeping the code clean.

Code is available at: https://github.com/jamesfe/bachelor_tree

If you have questions, feel free to tweet them to me at @jimmysthoughts!