Monday, June 5, 2017

Training an RNN on the Archer Scripts

Introduction

So all the hype these days is around "AI" as opposed to "machine learning" (though I've yet to hear an exact distinction between the two), and one of the tools that seems to get talked about most is Google's TensorFlow.

I wanted to play around with TensorFlow and RNNs a little, since they're not the type of machine learning I'm most familiar with, and see what kind of outputs I could come up with without a big investment of time.

Background

A little digging and I came across this tutorial, which is a good brief intro to RNNs; it uses Keras and works at the character level.

This in turn led me to word-rnn-tensorflow, which, expanding on the work of others, uses a word-based model (instead of a character-based one).

I wasn't about to spend my whole weekend rebuilding RNNs from scratch - no sense reinventing the wheel - so I just thought it'd be interesting to play around a little with this one, and perhaps give it a more interesting dataset. Shakespeare is OK, but why not something a little more culturally relevant... like, I dunno, say, the scripts from a certain cartoon featuring a dysfunctional, foul-mouthed spy agency?


Data Acquisition
Googling the Archer scripts turns up the whole lot of them at Springfield! Springfield!.

Unfortunately, since it looks like the scripts have been laboriously transcribed by ardent fans, there isn't any dialogue tagging like you'd see in a true script, but this is a limitation of the data set we'll just have to live with. Hopefully the style of the dialogue and the writers will still come through when we train the RNN on it (especially since there's sometimes not much difference between the different characters' dialogue, given how terrible they all are, and the number of non sequiturs in the show).

I suppose I could have gone through and copy-pasted all 93 episodes into a corpus for training, but I'm pretty sure that would have taken longer than just putting together the Python script below, using BeautifulSoup and building on some previous work:

from bs4 import BeautifulSoup
import urllib2

# CREATE SOUP
def soupify(url):

    # Open the request and create the soup
    req = urllib2.Request(url)
    response = urllib2.urlopen(req, timeout = 10.0)
    soup = BeautifulSoup(response.read(), "lxml")
    return soup


# GET SCRIPT AND CLEAN
def get_script(url):
    soup = soupify(url)
    script = soup.findAll("div", {"class":"episode_script"})[0]
    
    # Clean
    for br in script.find_all("br"):
        br.replace_with("\n")
    scripttext = script.text
    scripttext = scripttext.replace('-',' ').replace('\n',' ')
    scripttext = scripttext.strip()

    return scripttext

# GET SCRIPT URLS
def get_episode_urls(showurl):
    
    soup = soupify(showurl)

    # Get the urls and add the base URL to each in the list
    urls = soup.findAll("a", {"class":"season-episode-title"})
    baseurl = 'http://www.springfieldspringfield.co.uk/'
    urls = map(lambda x: baseurl + x['href'], list(urls))

    return urls

### MAIN

def do_scrape():

    # Get the episode list from the main show page
    urls = get_episode_urls('http://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=archer')

    # Scrape the script from each URL and add to a list
    episodes = list()
    for url in urls:
        print url
        episodes.append(get_script(url))

    # Write the output to a single corpus file
    f = open('archer_scripts.txt', 'w')
    for episode in episodes:
        f.write(episode.encode('ascii', 'ignore'))
    f.close()

if __name__ == '__main__':
    do_scrape()

Basically, the script gets the list of episode URLs from the show page, then scrapes each script in turn and exports the lot to a text file. And I didn't even have to do any error handling; it just worked on the first shot! Wow. (Isn't it nice when things just work?)

After a little manual data cleansing, we are ready to feed the data to the RNN model.
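The cleansing I did was by hand, but for the curious, a minimal sketch of the sort of thing involved might look like the following (this isn't exactly what I did, and the archer_scripts_clean.txt output name is just for illustration):

import re

# Read in the scraped corpus
with open('archer_scripts.txt') as f:
    text = f.read()

# Collapse leftover runs of whitespace into single spaces
text = re.sub(r'\s+', ' ', text).strip()

# Write the cleaned corpus back out
with open('archer_scripts_clean.txt', 'w') as f:
    f.write(text)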

Training the model
Since this is the easy part where we're relying on the already-built model, there's not much to say here. Just rename the file and plunk it into a data directory like the demo file, then run
python train.py --data_dir data/archer
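(For concreteness: if memory serves, word-rnn-tensorflow looks for a file named input.txt inside the data directory, so the "rename and plunk" step is just something like
mkdir data/archer
cp archer_scripts.txt data/archer/input.txt
but check the repo's README in case the naming has changed.)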
And let the model do its thing. My poor little laptop doesn't even have a GPU, so the model was grinding away overnight and then some, but it eventually finished.

The end of the grind and testing the model output.

word-rnn-tensorflow also conveniently pickles your model, so you can use it again later or continue training a previously trained model. I'd have made the model files available, but unfortunately they are rather large (~200 MB).
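If I remember right, continuing from a saved checkpoint is done with something along the lines of
python train.py --data_dir data/archer --init_from save
though I'm going from memory here, so check train.py in the repo for the exact flag.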

Anyhow, once the training is done, you can get the model to spit out samples by running:
python sample.py
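sample.py also takes a few knobs if you want longer samples or a priming phrase; if I recall correctly it's something like
python sample.py -n 500 --prime "Lana"
(the number of words to generate and some seed text, respectively), but again, check the script's arguments to be sure.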
Here are some sample outputs from the model which I split up and tidied a bit:

Oh, what do you mean "Lana hasn't called"? 
I mean, you don't know how to tell you how to discipline my servant! 
I think I was gonna say "flap hammered. " 
Oh, what are you talking about? 
Hitting on the communist insurgency. 
I don't do dossiers. 
Besides, this is a booty call, I'm flattered, but Oh my God, BPA! 
I Transgendered frogs! [frog bribbit]
Shh, shh, shh. 

Coyly. relieve in the cerebrospinal fluid at the dimmer switch in the bullpen, maybe spilled a miracle. And so sorry if you don't clamp that artery! 
One! Brett? What the hell is wrong with you?! And what are you doing? 
Esatto! Che cosa stiamo facendo?! 
Aww man, we go to Plan Manfred. And then Ruby Foo's. 
Yeah, I don't know what grits are, or hush puppies! 
Are you sure? 
I don't know. 
Push it out of that shitheap of a weak, like the rest of our business! Oh, and speaking of an exquisitely prepared entre... 
No, I don't even know what grits are, or hush puppies! 
Are you sure? 
That's what I was gonna say "flap hammered. " 
Oh, how are you bleaching mucho. 
But I don't know what grits are, or hush puppies! 
Are you sure? 
That's what I was gonna say "flap hammered. " 
Oh, how are you bleaching mucho. 
But I don't know what grits are, or hush puppies! 
Are you finished? 
Yes. 
No, no, no, no, no! [crashing] [crashing stops] [coughing] 
Oh, shit, uh whatcha madoo HVAC. 
Ooh! 
Well? 
God damn it. Off! 
peasy zombie squeezy. 

Yeah, of the sake of argument, let's leave him Fister Roboto. 
But it looks like bubble wrap. 
What is your proposal? 
I know the burns. And if you were "confused verging on angry" before...
Aaagh! Son of a fanboy request. 
And you don't know how to share, beyotch. 
Easy, tough guy. 
When do you think it was squash, sir. 
I don't know. I don't know. Warranty's in raspy Street, you know. 
Woowoowoowoowoowoowoof! 
What.
No, coordinate with Rome, then let me go. (wheezy coughing) (gasping) 
Well, I am just a DJ?

Learning experience? Well, joke's on sale, will you not?
She's battled ivory poachers, raging rivers, even tuberculosis, and now Oscar winner Rona Thorne takes on the planet. 
Look: CIA, Ml6, Interpol. 
We can't believe you don't clamp that artery! 
One! Brett? 
What the hell was that? 
Microfilm. 
It was all the shocks damaged my frontal lobe. 
In the brain's language center?

About the output you would expect. Nothing that's going to pass the Turing test, but if you've watched the show, you can picture Archer, Lana, and Cyril having an argument that might contain some of the above (with maybe a couple of other cast members thrown in... like that Italian line from The Papal Chase). And the model seems to be stitching together whole phrases rather than copying lines wholesale, since many of the outputs are unique.

Some of the output is not that bad - there are what could be some comedic gems in there if you look hard enough, ones that aren't verbatim from the original scripts (e.g. "Son of a fanboy request!").

Conclusion

A fun little romp doing some web scraping and playing with RNNs. Unfortunately, using someone else's code meant this model was even more of a black box than neural networks usually are; however, this was just for fun. If you want to know more or play around yourself, check out the resources below and what I've saved on GitHub.

References

Google TensorFlow

Creating a Text Generator Using A Recurrent Neural Network

word-rnn-tensorflow

Archer scripts (at Springfield! Springfield!)

Python code and model input on GitHub:
