This week has been all about testing and optimizing the bot Train process.
There was a point where I knew it was going to take too long to be satisfactory but out of a perverse geek curiosity I couldn’t bring myself to cancel the training… I just had to see how slow (bad) the process really was! 😛
If we want to use this as the core of a bot/system that can do more than just parts of speech tagging (and we do), it needs to be FAST! To do anything really fun, it just can’t take 3 weeks to process 1 million words!
And yes, the Brown Corpus only needs to be learned once, but any additional learning by the bot would be just as slow…
Why was it so slow???
Basically the code was self-blocking. Every step had to complete before the next step could begin.
All the words & tags were added to the database before the trigrams could be learned and we had to wait each time for the database to generate an ID.
I did cache the IDs for the words and tags the bot encountered in memory for faster lookup but… it was ultimately just a slow process regardless of this optimization.
The cache did help keep the average train time per document fairly consistent, but each trigram was still looked up by searching the database for any row where A, B and C all matched (LIMIT 1). Needless to say, that is super inefficient!
Though clearly not the Training process we wanted, it had a few things going for it:
- Quick to Build & Functional though Verbose:
What was nice about this method was that the code was easy to write and, though long, it should be fairly easy to read. Also, it can’t be overstated that the best way to get on the right track is to build a functional proof of concept and then iterate and try to improve your system.
- Direct Numbers are Good for Bots
We also gained the ability to do some interesting transforms by having a unique numeric value that the words, tags and trigrams are tied to. We’ll discuss this more in future posts so I’ll leave this at that for now.
3 Weeks is too slow!
~1,000,000 words / ~504 hours (3 weeks) ≈ 1,984 words per hour.
That equates to roughly 1 document an hour for 3 weeks straight!
- Can’t Divide and Conquer
As the Train process was written, it was impossible to split the data set among more than 1 system and then later combine the tables without quite a bit of manual post processing… which is just not a very pleasant thought! 😛 This is because the database assigned the IDs based on the order in which it encountered a word, tag or trigram. So, if you have two systems and split your training files between them, they will assign different IDs to the same word, tag or trigram. Later we would have to read through all the tags, words and trigrams and change the IDs so they were the same before we could merge the tables.
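To make the problem concrete, here is a minimal sketch. The function `assignSequentialIds()` is a hypothetical stand-in for a database auto-increment column, not code from Train.php, and the word lists are made up.

```php
<?php
// Hypothetical stand-in for a database auto-increment column:
// each new word gets the next integer in the order it is first seen.
function assignSequentialIds(array $words): array
{
    $ids = [];
    foreach ($words as $word) {
        if (!isset($ids[$word])) {
            $ids[$word] = count($ids) + 1;
        }
    }
    return $ids;
}

// Two machines receive different parts of the training data.
$machineA = assignSequentialIds(['The', 'quick', 'brown', 'fox']);
$machineB = assignSequentialIds(['fox', 'jumps', 'over', 'The']);

// The same word ends up with a different ID on each machine.
echo "A: fox = {$machineA['fox']}, The = {$machineA['The']}", PHP_EOL; // A: fox = 4, The = 1
echo "B: fox = {$machineB['fox']}, The = {$machineB['The']}", PHP_EOL; // B: fox = 1, The = 4
```

Merging those two tables would mean rewriting every ID on one side first, which is exactly the manual post processing described above.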
What changed in the refactor?
We switched to a batch process method where we process 10 files in memory then transfer the data to the database, clear the memory and process the next batch of 10 files until we have processed all the files.
This allows us to keep the memory requirements of the training process very low with each batch of 10 training files only requiring on average ~25 MB of RAM to go from raw text to database, which the bot quickly empties when it’s done.
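The batching idea can be sketched roughly like this. It is only an illustration: `trainInBatches()`, `$processFile` and `$flushToDatabase` are hypothetical names, and the dummy callables stand in for the real parsing and database code.

```php
<?php
// Hypothetical sketch: $processFile and $flushToDatabase are placeholder
// callables, not functions taken from Train.php.
function trainInBatches(array $files, int $batchSize, callable $processFile, callable $flushToDatabase): int
{
    $batches = 0;
    foreach (array_chunk($files, $batchSize) as $batch) {
        $buffer = [];                    // in-memory data for this batch only
        foreach ($batch as $file) {
            $buffer[] = $processFile($file);
        }
        $flushToDatabase($buffer);       // one transfer per batch
        unset($buffer);                  // release the memory before the next batch
        $batches++;
    }
    return $batches;
}

// Usage with dummy callables: 25 file names, 10 per batch.
$files = array_map(fn (int $i): string => "doc{$i}.txt", range(1, 25));
$flushedSizes = [];
$batchCount = trainInBatches(
    $files,
    10,
    fn (string $file): string => strtoupper($file),       // pretend "processing"
    function (array $rows) use (&$flushedSizes): void {    // pretend "database write"
        $flushedSizes[] = count($rows);
    }
);

echo $batchCount, ' batches: ', implode(', ', $flushedSizes), PHP_EOL; // 3 batches: 10, 10, 5
```

Because the buffer only ever holds one batch, peak memory depends on the batch size rather than on the size of the whole corpus.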
Which brings us to hashing.
Hashing to the Rescue!
You might be asking… isn’t this a lot of work for something that seems simple? Why bother with hashing at all? Isn’t the batch processing memory trick enough? Well, batch processing was a response to implementing hashing.
You see, we needed a way to reduce the number of comparisons when doing lookups.
Consider this comparison:
(A == Wa && B == Wb && C == Wc)
That’s three compares (all must be true) in order for the Trigram to be good, but if A & B are correct and C isn’t, that’s still 3 evaluations before you know to move on. If we could reduce those comparisons without losing the information we gain from doing them, then we might save a lot of time during training as well as when using the bot later!
We also needed a way to have 2 or more machines assign the same “ID” value to a word, tag or trigram. This would allow us to split the training set among as many computers as we can get our hands on and make quick work of future training data.
Hashing solves both of these problems!
If you & I hash the same value using the same algorithm we will get the same result regardless of our distance from each other, the time of day or any other factor you can think of. We can do this without ever having to speak to each other and our computers need not ever communicate directly. This property of hashing makes it an ideal solution for generating IDs that will line up without a centralized system issuing them. The same deterministic property is one of the building blocks of blockchain technology, though this is far simpler.
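Here is a small sketch of what that buys us, assuming for illustration that a word’s ID is simply its MD5 hash and that each machine keeps a table of ID => count. The helpers `countByHashId()` and `mergeTables()` are hypothetical, not part of Train.php.

```php
<?php
// Assumption for illustration: the ID of a word is simply its MD5 hash,
// and each machine keeps a table of ID => occurrence count.
function countByHashId(array $words): array
{
    $table = [];
    foreach ($words as $word) {
        $id = hash('md5', $word);
        $table[$id] = ($table[$id] ?? 0) + 1;
    }
    return $table;
}

// Merge two tables by adding counts for matching IDs; no ID conversion needed.
function mergeTables(array $a, array $b): array
{
    foreach ($b as $id => $count) {
        $a[$id] = ($a[$id] ?? 0) + $count;
    }
    return $a;
}

// Two machines that never talk to each other.
$machineA = countByHashId(['The', 'quick', 'brown', 'fox']);
$machineB = countByHashId(['fox', 'jumps', 'over', 'The']);
$merged   = mergeTables($machineA, $machineB);

echo count($merged), PHP_EOL;              // 6 distinct words
echo $merged[hash('md5', 'fox')], PHP_EOL; // 2
```

Both machines arrive at the same ID for “fox” on their own, so the merge is just a matter of adding up the counts.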
Hashing also allows us to reduce 3 comparisons to 1 because we concatenate Wa + Wb + Wc and hash the result, like this:
```php
<?php
// notice these two are the same - and always give the same result
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a

// notice these two are the same but different capitalization - different result
echo hash('md5', 'fox'); // 2b95d1f09b8b66c5c43622a4d9ec9a04
echo hash('md5', 'Fox'); // de7b8fdc57c8a948bc0cf52b31b617f3

// A specific value always returns that specific result
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
?>
```
Hashing yields an effectively unique (case sensitive) value that represents the three words in the trigram and as such, when we are looking for a specific trigram, we can hash the values and obtain its exact ID rather than do an A&B&C comparison.
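As a rough sketch of the single-lookup idea: `trigramId()` below is a hypothetical helper, and the array is a made-up in-memory stand-in for the trigram table.

```php
<?php
// Hypothetical helper: build a trigram ID by hashing the concatenated words,
// mirroring the concatenate-then-hash idea described above.
function trigramId(string $a, string $b, string $c): string
{
    return hash('md5', $a . $b . $c);
}

// A tiny in-memory "table" keyed by trigram ID (the stored fields are made up).
$trigrams = [
    trigramId('The', 'quick', 'brown') => ['count' => 1],
    trigramId('jumps', 'over', 'the')  => ['count' => 1],
];

// One key lookup replaces the three-way A && B && C comparison.
$known   = isset($trigrams[trigramId('The', 'quick', 'brown')]);
$unknown = isset($trigrams[trigramId('the', 'quick', 'brown')]); // different case

var_dump($known);   // bool(true)
var_dump($unknown); // bool(false)
```

One thing to keep in mind with plain concatenation: 'ab' + 'c' + 'd' and 'a' + 'bc' + 'd' produce the same string and therefore the same ID. Putting a separator between the words would be one possible way around that, though it would change the IDs from the ones shown in this post.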
It’s worth noting that hashing would add to the memory requirements of the bot (as hashing a word makes a longer word in most cases) so batch processing was added to address the increased memory demands of the hashed data.
The batch process eliminates the negative of having more information in memory (caused by hashing) by limiting how much RAM the program will need at any given moment.
Here’s a pros vs cons overview.
- Divide and Conquer!
We can split the training data among as many computers as we have available.
- No Significant processing required to merge tables
All IDs will be the same so there is no need to convert them.
- ID Lookups are Eliminated
Because the ID is the hashed representation of the word, tag or trigram we never need to look up an ID. You just hash the value you are checking and then use that as the ID.
- Hashing isn’t Fast!
While approximately 4812% faster (roughly 49 times) and no longer taking 3 weeks, this code is still slow: it took 10 hours, 15 minutes and 50.4 seconds to process 1 million words into trigram patterns and store them in the database.
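If you want to check the speedup yourself, the arithmetic is quick to reproduce. This sketch only uses the figures quoted in this post; the variable names are mine.

```php
<?php
// Figures taken from the post: ~1,000,000 words, 3 weeks before, 10h 15m 50.4s after.
$words      = 1000000;
$oldSeconds = 3 * 7 * 24 * 3600;           // 504 hours
$newSeconds = 10 * 3600 + 15 * 60 + 50.4;  // 36,950.4 seconds

$speedup        = $oldSeconds / $newSeconds; // how many times faster
$wordsPerSecond = $words / $newSeconds;

printf("Speedup: %.1fx\n", $speedup);                            // Speedup: 49.1x
printf("Throughput: %.1f words per second\n", $wordsPerSecond);  // Throughput: 27.1 words per second
```

That works out to roughly 49 times faster, which is in the same ballpark as the percentage quoted above.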
If you would like to obtain a copy of the new Training code you can get that on my GitHub here: Train.php
And of course what you’ve all been waiting for… the data:
Parts of Speech Tagger Data:
You don’t need to wait 10 hours by running Train.php to get started using the Brown Corpus in your own projects! I’ve made the data available on my GitHub profile where you can download it for free in SQL and CSV formats.
I wanted to release the data as XML as well but the files were larger than GitHub would allow and even the SQL and CSV files were just barely under the allowed upload limit. GitHub complained… Oh, the things I do for my readers! 😛
I hope you are enjoying building SkyNet… er… this Parts of Speech Tagger as much as I am. 😛
In the next post in this series we will look at how to feed the bot some text and use Trigrams to tag the words, so remember to like and follow so you won’t miss a single post!
Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.
And before you go, consider helping me grow…
Help Me Grow
Your direct financial support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.
Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content. It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier, now can we? 😛
And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.