
Adding Bigrams & Skipgrams

Welcome, today we’re going to talk about AB & BC Bi-grams as well as AC Skip-grams for Parts of Speech tagging.

If you’re just finding my content here are the other posts in this series:

An Introduction to Writer Bot

Rule Based Stories

Artificial & Natural Language Processing

 

In my last post (Building A Faster Bot) I said that I wanted to look at how to feed our bot some text and use Trigrams to tag the words in this post but… the truth is we have one final step before we’re ready to do that, so we’ll get to tagging next week for sure!

For now though we need to compute Bigrams & Skipgrams for our Trigrams.

 

What are “Grams” again?

Essentially, an “n-gram” is a fixed-length sequence of items in a pattern that we want to model, and a “gram” is one item in the n-gram set.

We can use tags, numbers or in some cases the items themselves (like we did with words) as the grams. By collecting and categorizing the unique possibilities into groups of n-grams and then observing how often each one occurs, we can extract “hidden” probability information about our subject.

That’s what we accomplished during training by collecting and counting all the word Trigrams present in The Brown Corpus Database.

In addition to modeling probability we modeled meaning by linking sets of trigram words together with sets of trigram tags and we cemented those links by hashing the grams together which also imparts a boost to the lookup speed.

 

So, Bigrams and Skipgrams are?

If “Tri-grams” are “three” word patterns then “Bi-grams” are “two” word patterns & “Skip-grams” are just a special kind of n-gram that skips some of the grams. In this case our skip-grams are bi-grams, though technically you can have a skip-gram that comprises more than two grams and skips more than one gram.

Additionally, practicality aside, nothing prevents a longer skip-gram from containing multiple nonconsecutive skips… such a long, complex word skip-gram would have dubious value, though there are always exceptions to rules and who says you are modeling words in your case?

Here is an example that should hopefully make things clearer.

Given this tri-gram:

(one)(of)(the)

Hash: 320263779473e9ac2252940e0173a5b8

 

We can extract the following AB bi-gram:

(one)(of)

Hashed we get Hash_AB: hash('md5', 'oneof') = 44d1d5bf437689cced8a62e192cdc49f

And this BC Bigram:

(of)(the)

Hashed we get Hash_BC: hash('md5', 'ofthe') = d2861a779f19cac959f0e0a6bc0bda24

Which leaves the Hash_AC skip-gram:

(one)(the)

Hashed we get Hash_AC: hash('md5', 'onethe') = adff8ebf224c1abcf98893cedb6db248

We’ll do this for all the available trigrams in the database.

 

And… Why add Bigrams & Skipgrams?

This last step imparts additional speed and coverage benefits to our bot, because the Trigrams alone will be insufficient to properly identify and tag every word in every sentence.

Think about it: the bot knows 56,057 words (which is more than the average native English speaker… so more than you and m… well… perhaps you… 😛 ), and the Oxford English Dictionary claims there are a little fewer than 200K words in English. That count is almost certainly low if we include ancillary colloquialisms and slang as part of English, and for the purposes of parts of speech tagging (if not AI research in general) we’d almost certainly have to, since we’re sourcing training materials from the web.

The number of trigrams our bot knows is 878,037 which, like the bot’s vocabulary, is limited when compared to what is possible.

This is because our bot only trained on the Brown Corpus so it only knows the Trigrams which were present in the training material, but because we know that the training material was real text and not random gibberish, we know the trigrams are “high quality” learning material for our bot.

If we wanted to know the upper limit of how many trigrams there could be we simply need to know how many words the bot knows and then “cube” the bot’s vocabulary:

56,057³ = pow(56057, 3) = 1.7615280201719E+14

This means that if any combination counts (including combinations like (the)(the)(the)) then there are 176,152,802,017,193 (One Hundred Seventy-Six Trillion, One Hundred Fifty-Two Billion, Eight Hundred Two Million, Seventeen Thousand, One Hundred Ninety-Three) possible combinations… far more than the 878K we currently have!

But we know that combinations like (the)(the)(the) are bad, so we could count only the permutations without repeats instead, but we don’t really gain much by doing that: it only tells us how many trigrams are hypothetically possible, not which ones are actually valid.

Beyond whether a Trigram is valid simply because it isn’t the same word repeated, some words never work together and are invalid anyway, so knowing a trigram can exist isn’t enough; otherwise we could just generate all possible 3 word combinations and be done.

To get things to work right we also need to correlate each trigram’s probability with the other patterns, which is what the count does.

But since we can’t look at all possible valid combinations (we’re not Google 😛 ) we have to get creative.

We can improve the bot’s ability to tag words by allowing it to solve the problem with less information, and we do that by computing AC Skip-grams and AB + BC Bi-grams.

This retains the same number of Trigrams but we gain 2,634,111 additional gram patterns (ways of evaluating text) that are otherwise hidden behind costly multi-field comparisons at run time.

Basically this means that when a Trigram isn’t exactly what we want (but very close) we can “back off” from the trigram and use a Bi-gram or Skip-gram to tag a word instead, then combine the results.

Either way, the hashing is a shortcut and simply makes the comparisons we need to do when tagging text faster and reduces the strain on the database.

Now, because we won’t model all possible Bi-grams and Skip-grams either, there will be gaps that even they fail to fill, and in those cases we will need to rely on aggregate uni-grams. There is no need to hash uni-grams though, since a uni-gram is simply a word by itself, so it’s faster (and computationally cheaper) to just compare the words directly at that point.
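
To make the back-off idea concrete, here’s a minimal sketch of how tagging the middle word of a window might fall back from the Trigram to the Bi-grams, then the Skip-gram, and finally a plain word comparison. Note that LookupByHash() and LookupUnigram() are hypothetical placeholder helpers (they would query the Trigrams and Words tables), not the tagger we’ll build next week.

<?php
// A minimal back-off sketch. LookupByHash() and LookupUnigram() are
// hypothetical helpers that query the Trigrams/Words tables and return
// candidate tag information, or NULL when nothing matches.
function TagMiddleWord($word_a, $word_b, $word_c, &$conn){

  // 1. Try the full Trigram: hash A.B.C and check the `Hash` field
  $tags = LookupByHash('Hash', hash('md5', $word_a . $word_b . $word_c), $conn);

  // 2. Back off to the AB and then the BC Bi-gram
  if ($tags === NULL) {
    $tags = LookupByHash('Hash_AB', hash('md5', $word_a . $word_b), $conn);
  }
  if ($tags === NULL) {
    $tags = LookupByHash('Hash_BC', hash('md5', $word_b . $word_c), $conn);
  }

  // 3. Back off to the AC Skip-gram
  if ($tags === NULL) {
    $tags = LookupByHash('Hash_AC', hash('md5', $word_a . $word_c), $conn);
  }

  // 4. Last resort: aggregate uni-gram, a plain word comparison (no hashing needed)
  if ($tags === NULL) {
    $tags = LookupUnigram($word_b, $conn);
  }

  return $tags; // candidate tag(s), or NULL if the word is completely unknown
}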

 

The Code: AddHashes.php

Here’s a link to the AddHashes.php code in the GitHub repo for this project.

<?php
/*
This program will connect to the PartsOfSpeechTagger database and add 3 additional fields 
directly after the 'Hash' field.
We need to add 2 fields for 'Bigrams'
// hash(A && B) 
// hash(B && C) 
We also need 1 field for 'Skip-grams'
// hash(A && C) 
*/
// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';
// Create connection
$conn = new mysqli($server, $username, $password, $db);
// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}
// Add additional Hash fields
$sql = "ALTER TABLE `Trigrams` ADD `Hash_AB` VARCHAR(33) NOT NULL AFTER `Hash`, ADD `Hash_BC` VARCHAR(33) NOT NULL AFTER `Hash_AB`, ADD `Hash_AC` VARCHAR(33) NOT NULL AFTER `Hash_BC`";
$conn->query($sql);
// Add the Bigram and Skipgram hashes
$sql = "SELECT * FROM `Trigrams` WHERE `Hash_AB` = '' OR `Hash_BC` = '' OR `Hash_AC` = ''";
$result = $conn->query($sql);
$i = 1;
if ($result->num_rows > 0) {
  // output data of each row
  while($row = $result->fetch_assoc()) {
    
     // We already generated the Trigrams
     // A && B && C
     // Generate Bigram hashes
     // A && B 
     $Hash_AB = hash('md5', $row["Word_A"] . $row["Word_B"]);
     // B && C
     $Hash_BC = hash('md5', $row["Word_B"] . $row["Word_C"]);
     
     // Generate Skip-gram hashes
     // A && C
     $Hash_AC = hash('md5', $row["Word_A"] . $row["Word_C"]);
     
     // Generate SQL
     $sql_AB = "UPDATE `Trigrams` SET `Hash_AB` = '$Hash_AB' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     $sql_BC = "UPDATE `Trigrams` SET `Hash_BC` = '$Hash_BC' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     $sql_AC = "UPDATE `Trigrams` SET `Hash_AC` = '$Hash_AC' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     
     // Update Database
     $conn->query($sql_AB);
     $conn->query($sql_BC);
     $conn->query($sql_AC);
     echo $i . PHP_EOL;
     $i++;
  }
}
$conn->close();

Run this code overnight and you are ready to use your Parts of Speech tagging bot, which we’ll cover next week.

With that, please like this post & leave your thoughts in the comments.

Also, don’t forget to share this post with someone you think would find it interesting and hit that follow button to make sure you get all my new posts!

And before you go, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy


Building A Faster Bot

This week has been all about testing and optimizing the bot Train process.

The prototype we looked at in The Brown Corpus is way too slow and needed to be refactored before we could proceed.

There was a point where I knew it was going to take too long to be satisfactory but out of a perverse geek curiosity I couldn’t bring myself to cancel the training… I just had to see how slow (bad) the process really was! 😛

If we want to use this as the core of a bot/system that can do more than just parts of speech tagging (and we do), it needs to be FAST! To do anything really fun it just can’t take 3 weeks to process 1 million words!

And yes, the Brown Corpus only needs to be learned once but any additional learning by the bot would also be just as slow…

Why was it so slow???

Basically the code was self-blocking. Every step had to complete before the next step could begin.

All the words & tags were added to the database before the trigrams could be learned and we had to wait each time for the database to generate an ID.

I did cache the ID’s for the words and tags the bot encountered in memory for faster lookup but… it was ultimately just a slow process regardless of this optimization.

What that did, however, was help keep the average training time per document fairly consistent. Trigrams, on the other hand, were queried by searching the database for any trigram where A & B & C were all present, LIMIT 1. Needless to say, that is super inefficient!

Though clearly not the Training process we wanted, it had a few things going for it:

Pros:

  • Quick to Build & Functional though Verbose:
    What was nice about this method was that the code was easy to write and, though long, it should be fairly easy to read. Also, it can’t be overstated that the best way to get on the right track is to build a functional proof of concept and then iterate and try to improve your system.
  • Direct Numbers are Good for Bots
    We also gained the ability to do some interesting transforms by having a unique numeric value that the words, tags and trigrams are tied to. We’ll discuss this more in future posts so I’ll leave this at that for now.

Cons:

  • SLOW!!!!
    3 Weeks is too slow!

~1,000,000 words / ~504 hours (3 weeks) = 1984.12698413 words per hour.

That equates to roughly 1 document an hour for 3 weeks straight!

  •  Can’t Divide and Conquer
    As the Train process was written, it was impossible to split the data set among more than 1 system and then later combine the tables without quite a bit of manual post-processing… which is just not a very pleasant thought! 😛 This is because the database assigned the ID’s based on the order in which it encountered a word, tag or trigram. So, if you have two systems and split your training files between them, they will each assign different ID’s to the same word, tag or trigram, and later we would have to read through all the tags, words and trigrams and change the ID’s so they matched before we could merge the tables.

 

What changed in the refactor?

We switched to a batch process method where we process 10 files in memory then transfer the data to the database, clear the memory and process the next batch of 10 files until we have processed all the files.

This keeps the memory requirements of the training process very low: each batch of 10 training files only requires on average ~25 MB of RAM to go from raw text to database, and the bot quickly empties that memory when it’s done.
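
As a rough sketch of that batching idea (not the exact Train.php code; ProcessFileInMemory() and FlushBatchToDatabase() are hypothetical stand-ins for the real work), the loop looks something like this:

<?php
// Rough sketch of the batch approach: accumulate 10 files worth of words,
// tags and trigrams in memory, write them out, then free the memory.
$batch_size = 10;

foreach (array_chunk($training_files, $batch_size) as $batch) {

  $in_memory = array('words' => array(), 'tags' => array(), 'trigrams' => array());

  // Build everything for this batch in RAM (~25 MB per batch of 10 files)
  foreach ($batch as $training_file) {
    ProcessFileInMemory($training_file, $in_memory); // hypothetical helper
  }

  // One transfer to MySQL per batch instead of one query per word/tag/trigram
  FlushBatchToDatabase($in_memory, $conn); // hypothetical helper

  unset($in_memory); // clear the memory before starting the next batch
}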

Which brings us to hashing.

Hashing to the Rescue!

You might be asking… isn’t this a lot of work for something that seems simple? Why bother with hashing at all? Isn’t the batch processing memory trick enough? Well, batch processing was a response to implementing hashing.

You see, we needed a way to reduce the number of comparisons when doing lookups.

Consider this comparison:

(A == Wa && B == Wb && C == Wc)

That’s three comparisons (all must be true) for the Trigram to match, and even if A & B are correct but C isn’t, that’s still 3 evaluations before you know to move on. If we could reduce those comparisons without losing the information we gain from doing them, then we might save a lot of time during training as well as when using the bot later!

We also needed a way to have 2 or more machines assign the same “ID” value to a word, tag and trigram. This would allow us to split the training set among as many computers as we can get our hands on and make quick work of future training data.

Hashing solves both of these problems!

If you & I hash the same value using the same algorithm we will get the same result regardless of our distance from each other, the time of day or any other factor you can think of. We can do this without ever having to speak to each other and our computers need not ever communicate directly. This property of hashing makes it an ideal solution for generating ID’s that will line up without a centralized system issuing ID’s. It’s loosely similar to how block-chain technology works, though this is far simpler.

Hashing also allows us to reduce 3 comparisons to 1 because we concatenate W_A + W_B + W_C like this:

<?php
// notice these two are the same - and always give the same result
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a

// notice these two are the same but different capitalization - different result
echo hash('md5', 'fox'); // 2b95d1f09b8b66c5c43622a4d9ec9a04
echo hash('md5', 'Fox'); // de7b8fdc57c8a948bc0cf52b31b617f3

// A specific value always returns that specific result
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
?>

Hashing yields an effectively unique (case-sensitive) value that represents the three words in the trigram and as such, when we are looking for a specific trigram we can hash the values and obtain its exact ID rather than do an A&B&C comparison.
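
In database terms, that’s the difference between the multi-field lookup the prototype used and a single comparison against one field. Here’s a minimal sketch of the two queries side by side (illustrative only; the column names match the Trigrams table):

<?php
// Old prototype lookup: three (or more) fields must all match.
$sql = "SELECT * FROM `Trigrams` WHERE `Word_A` = $Word_A AND `Word_B` = $Word_B AND `Word_C` = $Word_C LIMIT 1";

// Hashed lookup: one comparison against a single indexed field.
$hash = hash('md5', $word_a . $word_b . $word_c);
$sql  = "SELECT * FROM `Trigrams` WHERE `Hash` = '$hash' LIMIT 1";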

It’s worth noting that hashing would add to the memory requirements of the bot (the hash string is usually longer than the original word), so batch processing was added to address the increased memory demands of the hashed data.

The batch process eliminates the negative of having more information in memory (caused by hashing) by limiting how much RAM the program will need at any given moment.

Here’s a pros vs cons overview.

Pros:

  • Divide and Conquer!
    We can split the training data among as many computers as we have available.
  • No Significant processing required to merge tables
    All ID’s will be the same so there is no need to convert them.
  • ID Lookup’s are Eliminated
    Because the ID is the hashed representation of the word, tag or trigram we never need to lookup an ID. You just hash the value you are checking and then use that as the ID.

Cons:

  • Hashing isn’t Fast!
    While approximately 4,812% faster and no longer taking 3 weeks, this code is still slow & took 10 hours, 15 minutes and 50.4 seconds to process 1 million words into trigram patterns and store them in the database.

 

If you would like to obtain a copy of the new Training code you can get that on my GitHub here: Train.php

And of course what you’ve all been waiting for… the data:

Parts of Speech Tagger Data:

You don’t need to wait 10 hours running Train.php to get started using the Brown Corpus in your own projects! I’ve made the data available on my GitHub profile where you can download it for free in SQL and CSV formats.

I wanted to release the data as XML as well but the files were larger than GitHub would allow, and even the SQL and CSV files were just barely under the allowed upload limit. GitHub complained… Oh, the things I do for my readers! 😛

MySQL CSV
Tags.sql Tags.csv
Words.sql Words.csv
Trigrams.sql Trigrams.csv

 

I hope you are enjoying building SkyNet… er… this Parts of Speech Tagger as much as I am. 😛

In the next post in this series we will look at how to feed the bot some text and use Trigrams to tag the words, so remember to like and follow so you won’t miss a single post!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your direct financial support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Polybius VR

What follows is hard to explain and admittedly sounds like something out of a twilight zone episode, and maybe… that’s just what it was. A moment in time where the laws of reality broke.

About two weeks ago I was on the app store on my phone looking for new Cardboard VR apps… what can I say? I like VR and Cardboard VR is affordable! And if you use one of those micro USB to USB type A converters you can plug in a USB hub and attach a keyboard or even a mouse and get some decent VR capability for relatively cheap! 😛 😉

Anyway, after a little browsing of the “new releases” I stumbled across an app called Polybius VR by a company I had never heard of before (Sinneslöschen Inc.) and it didn’t have any reviews yet and not many downloads either but since it was free I thought why not, it’s easy enough to uninstall if it’s not interesting, right?

The app was huge (a few hundred MB) and took a couple of minutes to download which isn’t that unusual for VR apps.

I was home alone and would be all evening, so while Polybius was downloading I microwaved some Ramen noodles and refilled my JPL Women In Space mug with coffee, I love mine slightly bitter and black.

I sat back down at my bedroom desk then paused to lookout my window and enjoy the blood red and purple Los Angeles sky as the sun sank low.

I sipped my coffee and got comfy in my chair. My keyboard was on my lap so I could use it while in VR. I put nail polish on the WASD keys as well as a few others so that I can find them without having to remove the VR headset. 😎 😉

I start Polybius and slide my phone into the headset and adjust the straps while the app loads.

In all directions I see an infinite abyss except for directly below me, where I see the sentence “(C) 2018 Sinneslöschen Inc.” glowing blue like copper sulphate crystals. The font is unusual, blocky and almost pixel like.

Out of the murky black void in front of me the words Polybius VR erupt and grow to become the only thing in my view.

The words seemed to be about the size of a single-story building and were wrapped in polygonal chains that seemed to crawl like cellular automatons. The lines vibrated and jittered all over the text, changing shape to envelop the words. The effect oscillated between a smooth gradient and jagged pixel edges, which gave the impression that sometimes the lines were eating away at the text like acid.

It’s at this point where things start to get weird.

For lack of a better term I’d describe it as “missing time”. The experience for me seemed to only last a few minutes and all I recall seeing was the copyright text followed by the Polybius VR logo, which flashed several times in very rapid succession, and then the screen on my cellphone just went black.

I pulled off the headset and frantically removed my phone to reveal that the screen was cracked, which was disappointing and that’s putting it mildly!

That’s when I gazed out my window again, only to realize that I could see stars in the sky!

As I said, my perceived experience was that at most only a few minutes had passed, but the clock on the wall begged to differ when it read 8:57 PM. My computer and microwave also confirmed that roughly three hours had passed from when I first sat down.

After a more thorough examination of my phone it appeared that the battery had exploded and fused the entire thing into a paperweight.

I remembered I had installed the app on my SD card so I thought to recover it from the scrap but frustratingly it was also damaged beyond recovery, though thankfully I had my photos backed up!

I used my desktop to access the app store where I immediately searched for Polybius VR but nothing related came up.

Desperate for some reassurance of my own sanity I turned to Google like anyone in my position would and typed ‘Polybius’.

The very first link returned did little to alleviate my growing concern. The Wikipedia article “Polybius (urban legend)” opens with this sentence:

“Polybius is a fictitious arcade game, the subject of an urban legend…”

How could it be an urban legend? It was real! I installed it and it fried my phone, not to mention distorting my perception of time for 3 hours!

I spent the better part of the next week researching the Polybius urban legend only to turn up myth after half truth. Website after website full of internet rumors, hoaxers and fake news.

I even reached out to a couple of grey-hats I used to work with to see if they knew of anyone who was working in wetware that might be able to pull off a hack that would cause something like missing time.

They both told me the same thing… Polybius was a myth and nobody was even close to that level of biohacking.

Piecing all the “facts” together for myself the Polybius urban legend seems to go as follows…

In the summer of 1981 somebody (usually claimed to be the CIA but sometimes it’s shadow mega corporations… aliens?) formed the mythical shell company “Sinneslöschen Inc.” with the clandestine charge of conducting civilian thought control experiments on unsuspecting people.

I half expect Fox Mulder and Dana Scully to show up any minute!

The Polybius project is said to have centered around using arcade games (it was the 80s) to attempt to turn anyone into a mindless puppet.

Few credible witnesses have ever come forward but one overwhelmingly reoccurring theme among all the Polybius stories is an “addictive” effect experienced by players.

Some claimed it became the only thing they could think about even when they weren’t playing… which I guess these days is pretty understandable. I mean, we’ve all known someone (or been that someone) who was so into a game that we describe them as ‘addicted’.

The typical scenario told goes something like this: Polybius players would leave the house in the morning feeling the weight of pockets full of quarters!

Then wait, aching long hours while the clock shortened the distance between them and their next chance to play.

Once let out of work or school on their own recognizance the race between them and everyone else who wanted ‘THEIR’ machine was on!

You were lucky if you were an adult because it meant you had a car and could get to the arcade first, pump the machine full of quarters till basic economics forced you to relinquish the machine to the next player, who in turn did the exact same thing.

Hours would churn and abused buttons induced pixels to strobe and undulate in hypnotic patterns while the polyphonic beeps rhythmically danced their strange digital melody.

In addition to the addictive effect described by players, some reports describe a “Polybius intoxication”; others refer to it as a sort of “madness” or “stupor”.

People who played Polybius for long stretches were said to occasionally experience something like a seizure, followed by what appeared to be a coma lasting anywhere from minutes to hours, after which they would wake up & remain “blank & zombie like” for some time.

Additional side effects caused by playing Polybius were reported to be: amnesia, insomnia, night terrors, hallucinations, and rare unprovoked aggressive episodes.

One site poorly sourced a quote from an Oregon newspaper from the early 80’s describing a public arcade event where several Polybius machines were observed by an audience of a few dozen people for an extended amount of time. It was reported that several of the players & audience members became sick; they described some of the symptoms of Polybius intoxication as well as “zombie like behavior” by those afflicted.

Putting urban legends aside, I’m still left with the question: what really happened that night?

I’d dismiss all of this outright as myth, half truth, internet rumors, hoaxers and fake news if it wasn’t for my experience with Polybius VR.

Is it possible that a neurohacker terrorist somewhere discovered a technique that could perhaps “reboot” a brain just by showing you some images?

The idea isn’t as sci-fi as it may seem. It turns out that a condition called confabulation can occur in both biological and artificial neural networks, so maybe someone figured out how to trigger a buffer overflow in a brain and packaged it in a VR app!

The idea that some faceless attacker could take control over your mind seems to be the ultimate violation of self.

Maybe my phone battery just died and my computer, microwave and the atomic clock on my wall collectively suffered from the same glitch. Perhaps this was just an elaborate prank at the expense of anyone who was unfortunate enough to install Polybius VR…

Or perhaps it’s all true and there really are monsters that lurk in the shadows ready to devour us just as soon as the flashlight battery dies.

It stands to reason that I will never truly know what happened to me that night and the events of those three hours will remain forever shrouded in my nightmares.

I guess if there’s a moral to be found here at all it would be… be careful what you install!

And with that, Happy Halloween everyone! Be safe out there tonight & If you come across an app called Polybius VR in the app store, do yourself a favor and take a pass on that one! 😉

Remember to like, share and follow & if you are one of the few other people to have downloaded Polybius VR before it mysteriously disappeared from the app store, consider leaving your personal experience below in the comments and since you’re still reading, why not help me grow…

Help Me Grow

Your support finances my work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of your reading material. It also helps me eat, pay rent and now of course I have to add a new cellphone to the budget! 😛

If you feel inclined to give me money and add your name (or business) on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

The Brown Corpus Database

Welcome back, today we’re going to peek inside the database for the Parts of Speech tagger.

Unfortunately my Raspberry Pi that I am using to train with is slow (cough… and my code is super un-optimized 😛 ) so… it’s still working on learning the complete Brown Corpus, though we’re almost there: fewer than 190 training files remaining!

Before proceeding here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

The Database

Here’s a recap of the database. It consists of three tables: Words, Tags & Trigrams. You can find the complete MySQL Database Setup script here: Create.PartsOfSpeech.DB.sql.

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


 

Words Table

The Words table keeps track of all the words the tagger knows.

The bot uses the ID’s in place of the words so given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to lookup each word (provided it knows it) and replace it with the ID of the word in the Words table, like this:

1 43524 70488 515610 1149954 7158 1 266303 56280 309 43578 53868 1212 zero-liability 238658 482081 32358 423 7158

Notice that the bot was unable to look up the ID for the word “zero-liability”. This is because it never saw that word during training, and it would need to be “learned” by the system by assigning it a new ID and adding it to the database.
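
A minimal sketch of that lookup-or-learn step might look something like the following (assuming an open mysqli connection in $conn; the version the bot actually uses is the GetIDForWord() / AddWordToMySQLAndMemory() pair in Train.php from the original training post below):

<?php
// Sketch: map a word to its ID, learning it if it has never been seen before.
function WordToID($word, $conn){

  // Try to look the word up first
  $result = $conn->query("SELECT `ID` FROM `Words` WHERE `Word` = '$word' LIMIT 1");
  if ($result->num_rows > 0) {
    $row = $result->fetch_assoc();
    return $row['ID'];
  }

  // Unknown word (like "zero-liability"), so learn it by inserting a new row
  $conn->query("INSERT INTO `Words` (`ID`, `Word`) VALUES (NULL, '$word')");
  return $conn->insert_id; // the auto-increment ID MySQL just assigned
}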

Here’s an infographic that might help you understand the Words table:

[Infographic: Words Table]

 

Tags Table

The Tags table keeps track of all the tags the tagger knows.

The bot uses the ID’s in place of the tags so given these words:

fox, jump, jumps, jumped

We would assign these tags:

fox/nn = singular or mass noun

jump/vb = verb, base form

jumps/vbz = verb, 3rd. singular present

jumped/vbd = verb, past tense

And the ID’s for the tags would be represented as such:

21, 246, 138, 12

Here’s an infographic that might help you understand the Tags table:

[Infographic: Tags Table]

 

Trigrams Table

The Trigrams table is the heart of the system and its job is to keep track of the associations between word trigrams (groups of 3 words) and tag trigrams (groups of 3 tags).

The Brown Corpus training data is split up into trigrams of words and tags so that when the bot learns it isn’t just learning individual words and tags but chains of words and tags.

This helps the bot learn that some words can have more than one meaning or role in a sentence. It also keeps a count of each time it sees a trigram so it can calculate the probability of each trigram and tag set.
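
To make the counting idea concrete: once every occurrence is counted, the relative frequency of each tag trigram for a given word trigram is just its count divided by the total. Here’s a small hedged sketch (it assumes $rows holds all Trigrams records that share the same three word IDs):

<?php
// Sketch: turn raw counts into probabilities for one word trigram.
// $rows is assumed to be every `Trigrams` row with the same Word_A/B/C
// but (possibly) different Tag_A/B/C combinations.
$total = 0;
foreach ($rows as $row) {
  $total += $row['Count'];
}

foreach ($rows as $row) {
  $probability = $row['Count'] / $total; // relative frequency of this tag set
  echo $row['Tag_A'] . ' ' . $row['Tag_B'] . ' ' . $row['Tag_C'] . ' => ' . round($probability, 4) . PHP_EOL;
}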

Given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to extract the following trigrams represented here as JSON:

[
	["The","quick","brown"],
	["quick","brown","fox"],
	["brown","fox","jumps"],
	["fox","jumps","over"],
	["jumps","over","the"],
	["over","the","lazy"],
	["the","lazy","dog"],
	["lazy","dog","A"],
	["dog","A","long-term"],
	["A","long-term","contract"], 
	["long-term","contract","with"],
	["contract","with","zero-liability"],
	["with","zero-liability","protection"],
	["zero-liability","protection","Let's"],
	["protection","Let's","think"],
	["Let's","think","it"],
	["think","it","over"]
]

 

And of course since we’re actually using word ID’s and not the words themselves we could change the words to their ID’s in the JSON:

 [
	["1","43524","70488"],
	["43524","70488","515610"],
	["70488","515610","1149954"],
	["515610","1149954","7158"],
	["1149954","7158","1"],
	["7158","1","266303"],
	["1","266303","56280"],
	["266303","56280","309"],
	["56280","309","43578"],
	["309","43578","53868"],
	["43578","53868","1212"],
	["53868","1212","1161931"],
	["1212","1161931","238658"],
	["1161931","238658","482081"],
	["238658","482081","32358"],
	["482081","32358","423"],
	["32358","423","7158"]
]

The same can be done with the Tags.

 

Here’s an infographic that might help you understand the Trigrams table:

[Infographic: Trigrams Table]

 

 

This is as far as we’ll get this week so remember to like, and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

It goes toward helping me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

The Brown Corpus

Welcome, we’re going to talk about training a Parts of Speech tagging bot using the Brown Corpus.

The Brown Corpus

What’s the Brown Corpus? Basically, two linguists (Henry Kučera and W. Nelson Francis) combined their efforts at Brown University (thus ‘Brown Corpus’) in the early 1960s to create an English-language corpus that computer scientists and AI researchers could use as a standard.

The corpus comprises 500 samples of English-language text; each text is approximately 2,000 words long, give or take a few exceptions.

It covers topics such as Religion, Politics, News and even Science (Fiction & Non) not to mention multiple genres in each topic. Check out the Sample Distribution section on the Wikipedia page for the specifics if you are curious but suffice it to say it’s extensive!

 

What Can You Do With It? 

Well, generally speaking its purpose is to act as a well documented & ‘tagged’ data set that you can compare your bot, word tagging system, or even something else… against to determine the accuracy of your model.

The thing is, that also means it makes a great resource to train a Parts of Speech tagging bot from. And  well… that’s what we’re going to do! 😛

Before proceeding here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

Further, you may not sell it without obtaining permission from the licence holder. As far as I am aware you may not use it commercially.

As for the bot, once you understand how this system operates it’s relatively trivial to make modifications to this tri-gram tagging system or build your own from scratch.

The real difficulty in using a system like this is obtaining a well-tagged corpus of text with a commercially permissible use licence. They do exist for purchase, or you might find a CC0 or MIT licenced corpus, or here again… you could build your own from scratch… but that is a huge undertaking for a single developer or a small group of developers.

How Do We Build It?

Ok, we all know I’m going to publish everything on my GitHub profile when I’m done and, since I’m awesome, I’m probably going to export the data in convenient formats (JSON, XML…) as well. 😉

Are there formats other than MySQL, JSON & XML that you would like? CSV? Let me know in the comments.

To get started if you want to follow along you can find the GitHub repo and corpus HERE.

MySQL Database

I’m using a MySQL database to hold the “training data”. Once “trained” we can export the data to the other formats. Note that this database is unoptimized and is merely a rough prototype that focuses on function over form.

--
-- Database: `PartsOfSpeechTagger`
--
CREATE DATABASE IF NOT EXISTS `PartsOfSpeechTagger` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
USE `PartsOfSpeechTagger`;

-- --------------------------------------------------------

-- --------------------------------------------------------

--
-- Table structure for table `Tags`
--

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `Trigrams`
--

CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `Words`
--

CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Indexes for dumped tables
--

--
-- Indexes for table `Tags`
--
ALTER TABLE `Tags`
  ADD PRIMARY KEY (`ID`),
  ADD UNIQUE KEY `Tag` (`Tag`);

--
-- Indexes for table `Trigrams`
--
ALTER TABLE `Trigrams`
  ADD PRIMARY KEY (`ID`);

--
-- Indexes for table `Words`
--
ALTER TABLE `Words`
  ADD PRIMARY KEY (`ID`),
  ADD UNIQUE KEY `Word` (`Word`);

--
-- AUTO_INCREMENT for dumped tables
--

--
-- AUTO_INCREMENT for table `Tags`
--
ALTER TABLE `Tags`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
--
-- AUTO_INCREMENT for table `Trigrams`
--
ALTER TABLE `Trigrams`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
--
-- AUTO_INCREMENT for table `Words`
--
ALTER TABLE `Words`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;

 

Train.php

Next I wrote this terribly un-optimized training script that is, as my grandfather would have put it, “slower than molasses in January”, but once done it need not ever be run again (I’ve been training since the 6th 😛 ) so save yourself the trouble and don’t run this code! Wait for me to publish the finished data to the repo... hopefully sometime over the weekend or early next week.

<?php

// Create & return $conn object to hold connection to MySQL
function ConnectToMySQL($servername, $username, $password, $dbname){

  // Create connection
  $conn = new mysqli($servername, $username, $password, $dbname);
  // Check connection
  if ($conn->connect_error) {
    die("MYSQL DB Connection failed: " . $conn->connect_error);
  }
  
  return $conn;
}

// Disconnect $conn object that holds the connection to MySQL
function DisconnectFromMySQL(&$conn){
  $conn->close();
}


// If the word is in memory we know it, move on
// otherwise try adding it to the database
// if we add it to the database keep a copy in memory 
// to avoid unnecessary DB queries in the future
function AddWordToMySQLAndMemory($word, &$conn){
  
  global $words_to_id;
  
  // if the word isn't in memory try to add it to the database
  if(empty($words_to_id[$word])){
    $sql = "INSERT INTO `Words` (`ID`, `Word`) VALUES (NULL, '$word')";
    if ($conn->query($sql) === TRUE) {
        //echo "New word added successfully" . PHP_EOL;
      // add to memory for faster look up in the future
      $words_to_id[$word] = GetIDForWord($word, $conn); // get ID DB Assigned
    } else {
      // weird the Word exists - did you reboot?
      // echo "Word exists" . PHP_EOL; 
      // add to memory for faster look up in the future
      $words_to_id[$word] = GetIDForWord($word, $conn); // get ID DB Assigned
    }
  }
}


// If the tag is in memory we know it, move on
// otherwise try adding it to the database
// if we add it to the database keep a copy in memory 
// to avoid unnecessary DB queries in the future
function AddTagToMySQLAndMemory($tag, &$conn){
  
  global $tags_to_id;
  
  // if the tag isn't in memory try to add it to the database
  if(empty($tags_to_id[$tag])){
    
    $sql = "INSERT INTO `Tags` (`ID`, `Tag`) VALUES (NULL, '$tag')";

    if ($conn->query($sql) === TRUE) {
        //echo "New tag added successfully" . PHP_EOL;
      // add to memory for faster look up in the future
      $tags_to_id[$tag] = GetIDForTag($tag, $conn); // get ID DB Assigned
    } else {
      // weird the Tag exists - did you reboot?
      // echo "Tag exists" . PHP_EOL;
      // add to memory for faster look up in the future
      $tags_to_id[$tag] = GetIDForTag($tag, $conn); // get ID DB Assigned
    }
  }
}



function AddTrigramToMySQL($gram_set, &$conn){
  
  
  $Word_A = GetIDForWord($gram_set['words'][0], $conn);
  $Word_B = GetIDForWord($gram_set['words'][1], $conn);
  $Word_C = GetIDForWord($gram_set['words'][2], $conn);
  $Tag_A = GetIDForTag($gram_set['tags'][0], $conn);
  $Tag_B = GetIDForTag($gram_set['tags'][1], $conn);
  $Tag_C = GetIDForTag($gram_set['tags'][2], $conn);
  
  $complete_trigram_set = true;
  
  if($Word_A == NULL || $Word_B == NULL || $Word_C == NULL ||
     $Tag_A == NULL || $Tag_B == NULL || $Tag_C == NULL){
    $complete_trigram_set = false;
  }

  if($complete_trigram_set == true){
      
    // Select the trigram if it exists in the database
    $sql = "SELECT * FROM `Trigrams` WHERE `Word_A`=$Word_A AND `Word_B`=$Word_B AND `Word_C`=$Word_C AND `Tag_A`=$Tag_A AND `Tag_B`=$Tag_B AND `Tag_C`=$Tag_C  LIMIT 1";

    $result = $conn->query($sql);

    // there is an instance of this pair
    if ($result->num_rows > 0) {
      
      // Obtain the record for the gram_set
      while($row = $result->fetch_assoc()) {
        $id = $row['ID'];
        $count = $row['Count'];
      }
      $count++; //gram_set encountered again, increment it.
      
      // push updated count to database
      $sql = "UPDATE `Trigrams` SET Count='$count' WHERE ID=$id";

      if ($conn->query($sql) === TRUE) {
          //echo "Trigram Count updated successfully" . PHP_EOL;
      } else {
          //echo "Error: " . $sql . PHP_EOL . $conn->error . PHP_EOL;
      }
    } else { // no previous gram_set instance
      
      // Add this gram_set
      $sql = "INSERT INTO `Trigrams` (`Count`, `Word_A`, `Word_B`, `Word_C`, `Tag_A`, `Tag_B`, `Tag_C`) VALUES ('1', '$Word_A', '$Word_B', '$Word_C', '$Tag_A', '$Tag_B', '$Tag_C')";
      if ($conn->query($sql) === TRUE) {
          //echo "New Trigram added successfully";
      } else {
          //echo "Error: " . $sql . PHP_EOL . $conn->error . PHP_EOL;
      }    
    }
  }
}


// Pull the id for a given word from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetIDForWord($word, &$conn){
  
  global $words_to_id;
  
  // if the word isn't in memory try to get it from the database
  if(empty($words_to_id[$word])){
  
    $sql = "SELECT * FROM `Words` WHERE `Word`='$word' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// word exists
      // Output the ID for this Word
      while($row = $result->fetch_assoc()) {
        return $row['ID'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $words_to_id[$word];
  }  
}


// Pull the word for a given id from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetWordForID($ID, &$conn){
  global $ids_to_words;
  
  // if the ID isn't in memory try to get it from the database
  if(empty($ids_to_words[$ID])){
    
    $sql = "SELECT * FROM `Words` WHERE `ID`='$ID' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// id exists
      // Output the Word for this ID
      while($row = $result->fetch_assoc()) {
        return $row['Word'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $ids_to_words[$ID];
  }
}


// Pull the id for a given tag from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetIDForTag($tag, &$conn){
  global $tags_to_id;
  
  // if the Tag isn't in memory try to get it from the database
  if(empty($tags_to_id[$tag])){
    $sql = "SELECT * FROM `Tags` WHERE `Tag`='$tag' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// tag exists
      // Output the ID for this tag
      while($row = $result->fetch_assoc()) {
        return $row['ID'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $tags_to_id[$tag];
  }
}


// Pull the tag for a given id from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetTagForID($ID, &$conn){
  global $ids_to_tags;
  
  // if the Tag isn't in memory try to get it from the database
  if(empty($ids_to_tags[$ID])){
    $sql = "SELECT * FROM `Tags` WHERE `ID`='$ID' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// ID exists
      // Output the Tag for this ID
      while($row = $result->fetch_assoc()) {
        return $row['Tag'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $ids_to_tags[$ID];
  }
}


// Get contents of a training file as a string
function GetFile($filename){
  $filename =  'brown' . DIRECTORY_SEPARATOR . $filename;
  $handle = fopen($filename, 'r');
  $contents = fread($handle, filesize($filename));
  fclose($handle);
  return $contents;
}


// data is a text file with word/tag
// capture the word and tag as group 1 & 2 split by a forward slash.
// example: (word || symbol)[/](tag)   the/article blue/adjective cat/noun ./.
// (1)(2): (the)(article) (blue)(adjective) (cat)(noun) (.)(.)
function PrepareData($textdata){
  
  $re = '/([^\s]+)[\/]([^\s]+)/m';
  preg_match_all($re, $textdata, $matches, PREG_SET_ORDER, 0);
  
  $data = array();
  foreach($matches as $key=>$match){
    $data['words'][$key] = $match[1];
    $data['tags'][$key] = $match[2];
  }
  return $data;
}


// data is an array
// $data['words'][i] = word or symbol
// $data['tags'][i] = tag for the associated word
function ExtractTrigrams($data){
  
  $trigrams = array();
  
  $word_count = count($data['words']);
  for($i=2; $i < $word_count; $i++){

    $w_a = $data['words'][$i-2];
    $w_b = $data['words'][$i-1];
    $w_c = $data['words'][$i];
    $t_a = $data['tags'][$i-2];
    $t_b = $data['tags'][$i-1];
    $t_c = $data['tags'][$i];
    
    $pack['words'] = array($w_a, $w_b, $w_c);
    $pack['tags'] = array($t_a, $t_b, $t_c);
    
    $trigrams[] = $pack;
  }
  
  return $trigrams;
}


// Get all the words from the DB with the word as the key and the id as the value
function GetAllWords(&$conn){
  $sql = "SELECT * FROM `Words`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['Word']] = $row['ID'];
    }
    return $words;
  }
  return NULL;
}


// Get all the words from the DB with the id as the key and the word as the value
function GetAllWordIDs(&$conn){
  $sql = "SELECT * FROM `Words`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['ID']] = $row['Word'];
    }
    return $words;
  }
  return NULL;
}


// Get all the tags from the DB with the tag as the key and the id as the value
function GetAllTags(&$conn){
  $sql = "SELECT * FROM `Tags`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['Tag']] = $row['ID'];
    }
    return $words;
  }
  return NULL;
}


// Get all the tags from the DB with the id as the key and the tag as the value
function GetAllTagIDs(&$conn){
  $sql = "SELECT * FROM `Tags`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['ID']] = $row['Tag'];
    }
    return $words;
  }
  return NULL;
}



$training_files = array('ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05', 'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17', 'cb18', 'cb19', 'cb20', 'cb21', 'cb22', 'cb23', 'cb24', 'cb25', 'cb26', 'cb27', 'cc01', 'cc02', 'cc03', 'cc04', 'cc05', 'cc06', 'cc07', 'cc08', 'cc09', 'cc10', 'cc11', 'cc12', 'cc13', 'cc14', 'cc15', 'cc16', 'cc17', 'cd01', 'cd02', 'cd03', 'cd04', 'cd05', 'cd06', 'cd07', 'cd08', 'cd09', 'cd10', 'cd11', 'cd12', 'cd13', 'cd14', 'cd15', 'cd16', 'cd17', 'ce01', 'ce02', 'ce03', 'ce04', 'ce05', 'ce06', 'ce07', 'ce08', 'ce09', 'ce10', 'ce11', 'ce12', 'ce13', 'ce14', 'ce15', 'ce16', 'ce17', 'ce18', 'ce19', 'ce20', 'ce21', 'ce22', 'ce23', 'ce24', 'ce25', 'ce26', 'ce27', 'ce28', 'ce29', 'ce30', 'ce31', 'ce32', 'ce33', 'ce34', 'ce35', 'ce36', 'cf01', 'cf02', 'cf03', 'cf04', 'cf05', 'cf06', 'cf07', 'cf08', 'cf09', 'cf10', 'cf11', 'cf12', 'cf13', 'cf14', 'cf15', 'cf16', 'cf17', 'cf18', 'cf19', 'cf20', 'cf21', 'cf22', 'cf23', 'cf24', 'cf25', 'cf26', 'cf27', 'cf28', 'cf29', 'cf30', 'cf31', 'cf32', 'cf33', 'cf34', 'cf35', 'cf36', 'cf37', 'cf38', 'cf39', 'cf40', 'cf41', 'cf42', 'cf43', 'cf44', 'cf45', 'cf46', 'cf47', 'cf48', 'cg01', 'cg02', 'cg03', 'cg04', 'cg05', 'cg06', 'cg07', 'cg08', 'cg09', 'cg10', 'cg11', 'cg12', 'cg13', 'cg14', 'cg15', 'cg16', 'cg17', 'cg18', 'cg19', 'cg20', 'cg21', 'cg22', 'cg23', 'cg24', 'cg25', 'cg26', 'cg27', 'cg28', 'cg29', 'cg30', 'cg31', 'cg32', 'cg33', 'cg34', 'cg35', 'cg36', 'cg37', 'cg38', 'cg39', 'cg40', 'cg41', 'cg42', 'cg43', 'cg44', 'cg45', 'cg46', 'cg47', 'cg48', 'cg49', 'cg50', 'cg51', 'cg52', 'cg53', 'cg54', 'cg55', 'cg56', 'cg57', 'cg58', 'cg59', 'cg60', 'cg61', 'cg62', 'cg63', 'cg64', 'cg65', 'cg66', 'cg67', 'cg68', 'cg69', 'cg70', 'cg71', 'cg72', 'cg73', 'cg74', 'cg75', 'ch01', 'ch02', 'ch03', 'ch04', 'ch05', 'ch06', 'ch07', 'ch08', 'ch09', 'ch10', 'ch11', 'ch12', 'ch13', 'ch14', 'ch15', 'ch16', 'ch17', 'ch18', 'ch19', 'ch20', 'ch21', 'ch22', 'ch23', 'ch24', 'ch25', 'ch26', 'ch27', 'ch28', 'ch29', 'ch30', 'cj01', 'cj02', 'cj03', 'cj04', 'cj05', 'cj06', 'cj07', 'cj08', 'cj09', 'cj10', 'cj11', 'cj12', 'cj13', 'cj14', 'cj15', 'cj16', 'cj17', 'cj18', 'cj19', 'cj20', 'cj21', 'cj22', 'cj23', 'cj24', 'cj25', 'cj26', 'cj27', 'cj28', 'cj29', 'cj30', 'cj31', 'cj32', 'cj33', 'cj34', 'cj35', 'cj36', 'cj37', 'cj38', 'cj39', 'cj40', 'cj41', 'cj42', 'cj43', 'cj44', 'cj45', 'cj46', 'cj47', 'cj48', 'cj49', 'cj50', 'cj51', 'cj52', 'cj53', 'cj54', 'cj55', 'cj56', 'cj57', 'cj58', 'cj59', 'cj60', 'cj61', 'cj62', 'cj63', 'cj64', 'cj65', 'cj66', 'cj67', 'cj68', 'cj69', 'cj70', 'cj71', 'cj72', 'cj73', 'cj74', 'cj75', 'cj76', 'cj77', 'cj78', 'cj79', 'cj80', 'ck01', 'ck02', 'ck03', 'ck04', 'ck05', 'ck06', 'ck07', 'ck08', 'ck09', 'ck10', 'ck11', 'ck12', 'ck13', 'ck14', 'ck15', 'ck16', 'ck17', 'ck18', 'ck19', 'ck20', 'ck21', 'ck22', 'ck23', 'ck24', 'ck25', 'ck26', 'ck27', 'ck28', 'ck29', 'cl01', 'cl02', 'cl03', 'cl04', 'cl05', 'cl06', 'cl07', 'cl08', 'cl09', 'cl10', 'cl11', 'cl12', 'cl13', 'cl14', 'cl15', 'cl16', 'cl17', 'cl18', 'cl19', 'cl20', 'cl21', 'cl22', 'cl23', 'cl24', 'cm01', 'cm02', 'cm03', 'cm04', 'cm05', 'cm06', 'cn01', 'cn02', 'cn03', 'cn04', 'cn05', 'cn06', 'cn07', 'cn08', 
'cn09', 'cn10', 'cn11', 'cn12', 'cn13', 'cn14', 'cn15', 'cn16', 'cn17', 'cn18', 'cn19', 'cn20', 'cn21', 'cn22', 'cn23', 'cn24', 'cn25', 'cn26', 'cn27', 'cn28', 'cn29', 'cp01', 'cp02', 'cp03', 'cp04', 'cp05', 'cp06', 'cp07', 'cp08', 'cp09', 'cp10', 'cp11', 'cp12', 'cp13', 'cp14', 'cp15', 'cp16', 'cp17', 'cp18', 'cp19', 'cp20', 'cp21', 'cp22', 'cp23', 'cp24', 'cp25', 'cp26', 'cp27', 'cp28', 'cp29', 'cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09');
$total_files = count($training_files);

$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';
$conn = ConnectToMySQL($server, $username, $password, $db);


// Get all known current words and IDs; inefficient redundant calls but it only runs once.
$words_to_id = GetAllWords($conn);
$ids_to_words = GetAllWordIDs($conn);
$tags_to_id = GetAllTags($conn);
$ids_to_tags = GetAllTagIDs($conn);

$log = fopen('Log.txt', 'w+'); // log file

foreach($training_files as $filenumber=>$training_file){
  echo "Processing file $filenumber of $total_files." . PHP_EOL;
  fwrite($log, $training_file . PHP_EOL); // log the name of the file we are working on
  
  // Get data and get it ready for the bot to learn
  $training_data = GetFile($training_file);
  $training_data = PrepareData($training_data);
  $training_data = ExtractTrigrams($training_data);
  //var_dump($training_data);
  
  foreach($training_data as $key=>$set){
    foreach($set as $group=>$trigrams){
      if($group == 'words'){
        // add words
        AddWordToMySQLAndMemory($trigrams[0], $conn);
        AddWordToMySQLAndMemory($trigrams[1], $conn);
        AddWordToMySQLAndMemory($trigrams[2], $conn);
      }
      elseif($group == 'tags'){
        // add tags
        AddTagToMySQLAndMemory($trigrams[0], $conn);
        AddTagToMySQLAndMemory($trigrams[1], $conn);
        AddTagToMySQLAndMemory($trigrams[2], $conn);
      }
    }
    // We know the words and tags are now in the DB & Memory
    // process the trigrams
    AddTrigramToMySQL($set, $conn);
  }
}
fclose($log);


DisconnectFromMySQL($conn);

 

I had hoped we could go farther this week and discuss trigrams but… as I said, I’m still training the model so we’ll cover how to use it next week. In the meantime, remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

It goes toward helping me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Tokenizing & Lexing Natural Language

Well, I guess the next question to ask is… can we lex a natural language?

When we lexed the PHP code in Can a Bot Understand a Sentence? we applied tags to the lexemes so that the intended meaning of each could be understood using grammar rules.

Well, it turns out that when talking about natural languages, lexing is referred to as Parts of Speech Tagging.

The top automated parts of speech taggers have achieved something like 97-98% accuracy when tagging previously unseen (though grammatically correct) text. I would say that pretty much makes this a solved problem!

Further, linguists and elementary school teachers have been doing this by hand for years! 😛

In practice everyone’s results will vary, but on average a potential of approximately 2 mis-tagged words out of 100 means that the challenge of building a natural language lexer shouldn’t be too difficult. Of course, even a 2% variance (a bot that gets the tags wrong 2% of the time) can mean that the bot does the wrong thing (perhaps significantly) 2% of the time.

In any case, before we can tag our lexemes we need to come up with a way to ‘tokenize’ a natural language sentence, so let’s talk about that.

Tokenizing

Tokenizing is a verb, which makes it an action. The word means turning raw text into ‘tokens’ by determining the bounds of each “word unit” or “part of speech” so that each one can be treated as a separate component that can be programmatically acted upon as a lexeme.

In this case we can use the individual characters as our tokens.

If we use this sentence:

“The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think ‘it’ over. john.doe@web_server.com”

Tokens

We want our system to use all the characters (including spaces and punctuation) in the string as tokens like this:

["T","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n"," ","f","o","x"," ","j","u","m","p","s"," ","o","v","e","r"," ","t","h","e"," ","l","a","z","y"," ","d","o","g","."," ","A"," ","l","o","n","g","-","t","e","r","m"," ","c","o","n","t","r","a","c","t"," ","w","i","t","h"," ","\"","z","e","r","o","-","l","i","a","b","i","l","i","t","y","\""," ","p","r","o","t","e","c","t","i","o","n","!"," ","L","e","t","'","s"," ","t","h","i","n","k"," ","'","i","t","'"," ","o","v","e","r","."," ","j","o","h","n",".","d","o","e","@","w","e","b","_","s","e","r","v","e","r",".","c","o","m"]

Lexemes

Once we have the tokens we want the system to process them into lexemes that a person would naturally say are “whole” lexemes.

In this case, “whole” means that it’s a complete part of speech, so sometimes a lexeme is a multi-character word and sometimes it is a single-character delimiter.

Most of the time a lexeme will contain only letters, numbers or symbols, but sometimes it should contain a mixed combination, as is the case with a hyphenated compound word (e.g. zero-liability) or a contraction (e.g. Let’s).

Notice that we want the system to use the apostrophe to merge Let and s into Let’s because it’s a contraction and therefore a “whole” lexeme. However, we don’t want the apostrophes around the word ‘it’ (which follows the word ‘think’) combined, because the lexeme in that case is the word it; the surrounding apostrophes act as ‘single quotes’ and should therefore be treated as separate lexemes, just like the “double quotes” around zero-liability.

Also, we want the system to capture the complex pattern of the email (john.doe@web_server.com) as a single lexeme.

Here’s what that looks like:

[
    "The",
    " ",
    "quick",
    " ",
    "brown",
    " ",
    "fox",
    " ",
    "jumps",
    " ",
    "over",
    " ",
    "the",
    " ",
    "lazy",
    " ",
    "dog",
    ".",
    " ",
    "A",
    " ",
    "long-term",
    " ",
    "contract",
    " ",
    "with",
    " ",
    "\"",
    "zero-liability",
    "\"",
    " ",
    "protection",
    "!",
    " ",
    "Let's",
    " ",
    "think",
    " ",
    "'",
    "it",
    "'",
    " ",
    "over",
    ".",
    " ",
    "john.doe@web_server.com"
]

 

Of course we would still need to apply tags to this list to complete the lexing process, but this solves the first problem: splitting natural language text into tokens and processing those tokens into a list of lexemes ready to be tagged.

We’ll work on tagging next week, but for now let’s look at the code that does this.

The Code

Here is the complete code that implements tokenization and lexeme extraction. I’ll explain what is happening below, but the code is commented for the programmers who are following along.

<?php 

function Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols){  
      
  $temp = '';                   // A temporary string used to hold incomplete lexemes
  $lexemes = array();           // Complete lexemes will be stored here for return
  $chars = str_split($text, 1); // Split the text string into characters.
  
  //var_dump(json_encode($chars, 1)); // convert $chars array to JSON and dump to screen

  // Step through all character tokens in the $chars array
  foreach($chars as $key=>$char){
        
    // If this $char token is in the $delimiters array
    // Then stop building $temp and add it and the delimiter to the $lexemes array
    if(in_array($char, $delimiters)){
      
      // Does temp contain data?
      if(strlen($temp) > 0){
        // $temp is a complete lexeme add it to the array
        $lexemes[] = $temp;
      }      
      $temp = ''; // Make sure $temp is empty
      
      $lexemes[] = $char; // Capture delimiter as a whole lexeme
    }
    else{// This $char token is NOT in the $delimiters array
      // Add $char to $temp and continue to next $char
      $temp .= $char; 
    }
    
  } // Step through all character tokens in the $chars array


  // Check if $temp still contains any residual lexeme data?
  if(strlen($temp) > 0){
    // $temp is a complete lexeme add it to the array
    $lexemes[] = $temp;
  }
  
  // We have processed all character tokens in the $chars array
  // Free the memory and garbage collect $chars & $temp
  $chars = NULL;
  $temp = NULL;
  unset($chars);
  unset($temp);


  // We now have the simplest lexemes extracted. 
  // Next we need to recombine compound-words, contractions 
  // And do any other processing with the lexemes.

  // If there are $chars in the $compound_word_symbols array
  if(!empty($compound_word_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $compound_word_symbols array
      if(in_array($lexeme, $compound_word_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        if($key > 0){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // If $before isn't a $delimiter
          if(!in_array($before, $delimiters)){
            // Merge it with the compound symbol
            $lexeme = $before . $lexeme;
            // And remove the $before $lexeme from $lexemes
            $lexemes[$key - 1] = NULL;
          }
        }
        
        // If this isn't the last $lexeme in $lexemes
        if($key < $number_of_lexemes - 1){ // - 1 so $lexemes[$key + 1] always exists
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          // If $after isn't a $delimiter
          if(!in_array($after, $delimiters)){
            // Merge the compound symbol with it
            $lexemes[$key + 1] = $lexeme . $after;
            // And remove the $lexeme
            $lexeme = NULL;
          }
        }
        
      } // Check if lexeme is in the $compound_word_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $compound_word_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the compound word merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  
  
  // If there are $chars in the $contraction_symbols array
  if(!empty($contraction_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $contraction_symbols array
      if(in_array($lexeme, $contraction_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        // and If this isn't the last $lexeme in $lexemes
        if($key > 0 && $key < $number_of_lexemes - 1){ // - 1 so $lexemes[$key + 1] always exists
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          
          // If $before isn't a $delimiter
          // and $after isn't a $delimiter
          if(!in_array($before, $delimiters) && !in_array($after, $delimiters)){
            // Merge the contraction tokens
            $lexemes[$key + 1] = $before . $lexeme . $after;
            
            // Remove $before
            $lexemes[$key - 1] = NULL;
            // And remove this $lexeme
            $lexeme = NULL;            
          }

        }
        
      } // Check if lexeme is in the $contraction_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $contraction_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the contraction merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  

  // Return the $lexemes array.
  return $lexemes;
}

// Delimiters (Lexeme Boundaries)
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

// Symbols used to detect compound-words
$compound_word_symbols = array('-', '_');

// Symbols used to detect contractions
$contraction_symbols = array("'", '.', '@');

// Text to Tokenize and Lex
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com';

// Tokenize and extract the $lexemes from $text
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols);
echo json_encode($lexemes, JSON_PRETTY_PRINT); // output $lexemes as pretty printed JSON

 

Splitting the Tokens

One way to do this would be to use regular expressions (regex) to match a pattern and it’s the method we used for the Email Relationship Classifier.

As part of that project I released a “Tokenizer” Class File that relied heavily on regex to match patterns, but that isn’t the method we use in the Natural Language lexer, though we could have used it.

Common “advice” you will receive as a developer is that “you should NEVER use regex”, and while this is well-meaning advice, it is certainly wrong!

I find regex works best when you understand the patterns you are looking for really well and they won’t change much throughout your data set, even if the pattern itself is very complex.

Now, the reason you are often advised to avoid regex pattern matching is that it’s complicated: understanding the pattern string is not always immediately intuitive and sometimes it’s downright difficult! This can be the case even if you are generally comfortable working with regex.

So the difficulty most developers have with regex is a factor in my choice not to use it in this case, but the main reason is simply that it’s not needed and it’s actually a lot simpler to accomplish our goal without it.
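For the curious, here is roughly what a regex based version could look like — a minimal sketch, not the method used in this post — using preg_split() to keep every non-alphanumeric character as its own token:

<?php
// A minimal regex sketch (NOT the approach used below): split on every
// non-alphanumeric character and keep those characters as separate tokens.
// Note it does none of the compound-word or contraction merging we need.
$text = 'The quick brown fox jumps over the lazy dog.';
$tokens = preg_split('/([^A-Za-z0-9])/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

echo json_encode($tokens); // ["The"," ","quick"," ", ... ,"dog","."]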

So if not Regex then how?

Use Delimiters as a Guide to Word Boundaries

First create an array of delimiters that we can use as automatic word boundaries. In this case we can use a list of all the typable symbols that are not letters or numbers.

// Delimiters (Lexeme Boundaries) 
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

 

Use Compound Symbols to Grow Words

Next we need a group of symbols we know are always used to create compound words. Basically this means hyphens and underscores, which should always be joined with their parent lexeme. The distinction here is that even if these symbols show up before, in the middle of, or after another lexeme, they should be considered part of that lexeme, e.g. pre- or long-term or _something or Jane_Doe

// Symbols used to detect compound-words 
$compound_word_symbols = array('-', '_'); 

 

Use Contractions to Merge Ideas

Quotes (‘single’ & “double”) should be treated as separate lexemes and never be merged with the lexeme they contain.  However, apostrophes should actually be merged with the lexemes before & after them, provided that neither is a delimiter. Also, sometimes a period and the @ symbol can behave like contraction symbols, as is the case with the example email: john.doe@web_server.com

// Symbols used to detect contractions 
$contraction_symbols = array("'", '.', '@'); 

 

Our Example Natural Language Text

Here is the test string of natural language.

// Text to Tokenize and Lex 
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com'; 

 

Extract Lexemes from Tokens Using Delimiter Symbols

We can now call the Tokenize() function with our data and capture the result in an array, which we format and echo as JSON.

// Tokenize and extract the $lexemes from $text 
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols); 
echo json_encode($lexemes, JSON_PRETTY_PRINT); // output $lexemes as pretty printed JSON

 

Now, if we run our code we get all the lexemes extracted from the natural language test string.

Next week we will look at how we can tag the lexemes to complete the Lexical Analysis of a natural language so remember to like, and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Can a Bot Understand a Sentence?

What if we could teach our writer bot (see Bot Generated Stories & Bot Generated Stories II) how to understand the meaning of a sentence? Would that improve the bot’s ability to understand what was said?

Well, conceptually it’s not that different from what we do with programming languages and if you ask me, it sure seems like a good place to start with Natural Languages!

Natural Language

A “Natural Language” is any language humans use (though I guess we would include alien languages should we ever meet some 😛 ) that evolved over time through use rather than by design.

One of the first things you might notice when you contrast a natural language, like English, with artificially designed languages that are used for specific purposes (like programming a computer) is that natural languages have much more complexity and variation.

Further, natural languages tend to be far more “general purpose” than even the most capable artificial general purpose programming languages.

For example, just think of all the ways you can say you love Ice Cream or that the room temperature is hot.

Now if you are a programmer, contrast that with how many ways there are to create a loop in a program?

The answer is more than a few (off the top of my head I can think of for, foreach, while, do while, goto, recursive functions) but certainly many times less than the number of all the possible combinations of words you could use to describe how green something is!

How then, do computers understand what programmers say?


Write Code

First, a programmer must write some valid code, like this for example:

$pi = 3.1415926535898;
for($i = 0; $i < $pi; $i++){
    echo $i . PHP_EOL;
}
echo 'PI is equal to: ' . ($pi + PI()) / 2;

Result:

0
1
2
3
PI is equal to: 3.1415926535898

The computer can understand what this code means and does exactly what it was asked to do, but how?

Lexical Analysis

Lexical Analysis occurs in the early stages of the Compilation/Interpretation process, where the source code or script for a program is scanned by a program called a lexer, which tries to find the smallest chunk of “whole” information, called a “Lexeme“, and assigns it a “type” or “tag” that denotes its specific purpose or function.
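Incidentally, PHP exposes its own lexer through the built-in token_get_all() function, so we can peek at a real token stream; here’s a small sketch using the first line of the example code:

<?php
// Ask PHP's built-in lexer for the token stream of a tiny snippet of code.
foreach(token_get_all('<?php $pi = 3.1415926535898;') as $token){
    if(is_array($token)){ // most tokens come back as [token id, token text, line number]
        echo token_name($token[0]) . ' => ' . $token[1] . PHP_EOL;
    }
    else{ // single character tokens like ; or = come back as plain strings
        echo 'character => ' . $token . PHP_EOL;
    }
}
// Outputs something like: T_OPEN_TAG, T_VARIABLE ($pi), T_WHITESPACE, "=", T_DNUMBER (3.1415926535898), ";"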

Lexed Code

You might be wondering what lexed code looks like. If we lex the example code from above we get a list that would be something like this if we represent it as JSON:

[
	["identifier","$pi"],
	["operator-equals","="],
	["literal-float","3.1415926535898"],
	["separator-terminator",";"],
	["keyword-for","for"],
	["separator-open-parentheses","("],
	["identifier","$i"],
	["operator-equals","="],
	["literal-integer","0"],
	["separator-terminator",";"],
	["identifier","$i"],
	["operator-less-than","<"],
	["identifier","$pi"],
	["separator-terminator",";"],
	["identifier","$i"],
	["operator-increment","++"],
	["separator-close-parentheses",")"],
	["open-curl","{"],
	["keyword-echo","echo"],
	["identifier","$i"],
	["operator-concatenate","."],
	["keyword-end-of-line","PHP_EOL"],
	["separator-terminator",";"],
	["separator-close-curl","}"],
	["keyword-echo","echo"],
	["literal-string","PI is equal to: "],
	["operator-concatenate","."],
	["separator-open-parentheses","("],
	["identifier","$pi"],
	["operator-plus","+"],
	["keyword-pi","PI"],
	["separator-close-parentheses",")"],
	["operator-divide","/"],
	["literal-integer","2"],
	["separator-terminator",";"]
]

What we’ve just done is give each lexeme a tag that is unambiguous as to what its intended role or function is.

Semantic Analysis

Then Semantic Analysis, sometimes called Parsing, checks the code to ensure that there are no mistakes and establishes a hierarchy of relationships and meaning so the code can be evaluated using the rules of the language.

Semantic Hierarchy

Parsing will group the expressions into a tree hierarchy that makes it explicitly clear to the computer what we want it to do.

Here is the code above parsed and represented as JSON:

[
   {
      "tags":[
         "identifier",
         "operator-equals",
         "literal-float"
      ],
      "lexemes":[
         "$pi",
         "=",
         "3.1415926535898"
      ],
      "child-expressions":[

      ]
   },
   {
      "tags":[
         "keyword-for"
      ],
      "lexemes":[
         "for"
      ],
      "child-expressions":[
         {
            "tags":[
               "identifier",
               "operator-equals",
               "literal-integer"
            ],
            "lexemes":[
               "$i",
               "=",
               "0"
            ],
            "child-expressions":[

            ]
         },
         {
            "tags":[
               "identifier",
               "operator-less-than",
               "identifier"
            ],
            "lexemes":[
               "$i",
               "<",
               "$pi"
            ],
            "child-expressions":[
               {
                  "tags":[
                     "keyword-echo",
                     "identifier",
                     "operator-concatenate",
                     "keyword-end-of-line"
                  ],
                  "lexemes":[
                     "echo",
                     "$i",
                     ".",
                     "PHP_EOL"
                  ],
                  "child-expressions":[

                  ]
               }
            ]
         },
         {
            "tags":[
               "identifier",
               "operator-increment"
            ],
            "lexemes":[
               "$i",
               "++"
            ],
            "child-expressions":[

            ]
         }
      ]
   },
   {
      "tags":[
         "keyword-echo",
         "literal-string",
         "operator-concatenate"
      ],
      "lexemes":[
         "echo",
         "PI is equal to: ",
         "."
      ],
      "child-expressions":[
         {
            "tags":[
               "operator-divide",
               "literal-integer"
            ],
            "lexemes":[
               "\/",
               "2"
            ],
            "child-expressions":[
               {
                  "tags":[
                     "identifier",
                     "operator-plus",
                     "keyword-pi"
                  ],
                  "lexemes":[
                     "$pi",
                     "+",
                     "PI"
                  ],
                  "child-expressions":[

                  ]
               }
            ]
         }
      ]
   }
]

Code Evaluation

Once the code has been analysed it can be Evaluated.

And since you’re curious, here’s what this code does:

  1. Declare a variable named $pi and set its value to the number Pi to 13 decimal places.
  2. A for loop is initialized.
  3. Expression 1 in the for loop is only evaluated once before the loop begins and it declares a variable named $i.
  4. Each time the for loop runs Expression 2 is evaluated using Boolean Algebra and if the result is logically TRUE then the code inside the loop runs. The expression is a comparison of the value of $i & $pi where if $i is less than $pi then the loop runs.
  5. Expression 3 is the third and final expression in the for loop declaration. It runs after each iteration of the loop. The value of $i is incremented by 1 using the ++ increment operator.
  6. Each iteration of the loop “echo‘s” the value of $i to the screen.
  7. Once the for loop terminates, the computer takes the value of $pi and adds it to the value provided by the PHP language function PI() (which returns the value of Pi).
  8. That resulting sum is then divided by 2 giving us exactly Pi.
  9. This value is then concatenated with the string “PI is equal to: ” and the whole string value is then echoed to the screen.

This code does nothing of real value but it’s sufficiently short for us to lex by hand and long enough that it provides interesting results with nested child expressions, see the infographic hierarchy above.

How Does This Apply To Writer Bot?

Well, I guess the next question to ask is… can we lex a natural language? We’ll talk about that in my next post so remember to like, and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Why Rule Based Story Generation Sucks

Welcome back, today we’re going to talk about why rule based story generation sucks and I’ll also present some code that you can use to create your very own rule based story generator!

But before we get started I’d like to draw your attention to the title header image I used in my last article, Rule Based Story Generation, which you should also read because it adds context to this article… anyway, back to the image. It presents a set of scrabble blocks that spell out:

“LETS GO ON ADVENTURES”.

I added over the image my usual title text in Pacifico font and placed it as though to subtly imply it too might simply be a randomly selected block of text just slotted in wherever it would fit.

Combined with the new “title rule” it reads:

“LETS GO ON ‘Rule Based Story Generation’ ADVENTURES”.

Perhaps it may be too subtle, I know! 😛

In any case, I like how the scrabble tiles’ capitalization is broken up by the script.

It almost illustrates the sort of mishmash process we’re using to build sentences that are sometimes almost elegant though much of the time, rigid and repetitive.

It’s important to understand that developing an artificial intelligence that can write as well as a human is still an open (unsolved) area of research, which makes it a wonderfully exciting challenge for us to work on and we’re really only just getting started! 😉

Of course this isn’t as far as we can go (see Rule Based Story Generation & A Halloween Tale) and I am currently working on how we go even farther than I already have… but we’ll get there.

 

A Flawed Stepping Stone

Rule based story generation is an important, yet flawed, stepping stone in helping us understand the problem of building a writer bot that will write best sellers!

Despite the flaws with using rules to create stories, we may in the future partially rely on “bot written” rules in a layered generative system so none of this is necessarily wrong, just woefully incomplete, especially by my standards as I like to publish fully functioning prototypes… more or less. 😉

Before I present the code however let’s briefly look at why rules suck!

 

Reasons Rule Based Generation Sucks

Here’s a non-exhaustive list of reasons why rule based story generation sucks:

  • Some rules create Ambiguity.
  • Lack of correlation between Independent Clauses.
  • Complete lack of Dependent Clauses that will need to correlate with their associated independent clauses.
  • Run-On Sentences are WAY easy to generate!
  • Random, or hard coded Verb Tense & Aspect is used.
  • Forced, improper or unusual grammar.
  • Random events, and no cohesive growth of ideas over time (lack of Narrative).
  • No Emergence means that all meaningful possibilities are input by a person… manually! 😦
  • Placeholder for anything that I may have forgotten to include here.

If we intend to build the “writer bot” AI described in Bot Generated Stories & Bot Generated Stories II we will have to find ways of mitigating or eliminating the issues in this list!

We could be proactive at trying to resolve some of these issues with the rule based system but most of our effort would boil down to writing additional rules and (if this, then that) preconditions and really… nobody has time for that!

Besides, if a person is writing the rules… even if the specific details selected for a sentence are random or possibly even stochastic (random but based on probability), wouldn’t you still say that the person who wrote the rules, kinda sorta wrote the text too?

I mean, even if there’s an asterisk next to the writer’s name and in itty-bitty tiny text somewhere it says (*written with the aid of a bot) or vice versa, it’s still hard to say that whoever wrote the rules and made the lists didn’t write the resulting text… to some extent, right?

If you agree… disagree? Feel free to present your argument in the comments!

Ultimately for the reasons listed above (and a fair amount of testing) I am confident that hand written rule based story generation is not the way to go!

 

New Rules

In addition to a new rule (name action.) I am including a little something special in the code below.

I call them Rule 8 & Rule 9 because they were written in that order, but what makes them unique from rules 1 – 7 (which I wrote) is that they were effectively written by a bot.

What I mean when I say the bot “wrote the rule” is that the pattern used in the rules was extracted by a learning bot (pattern matching algorithm/process).

Here are examples of the new rules:

Rule 7

Axton went fishing.
Briana mopped the floor.
Kenny felt pride.
Ryann played sports.
Chaim felt content.
Alaina road a horse.
Elian setup a tent.
Brian had fun.
Meadow heard a noise.
Jewel learned a new skill.

 

Rule 8 – Bot Generated Rule

Freya handled Along the ship!
Bethany dipped Inside the dollar!
Kyla appointed Regarding the scenery!
Aryanna filed yieldingly the honoree.
Madeline demanded of the pannier!
Kailey repaid there the courage.
Finley came With the button.
Sawyer criticised owing to majestically icicle!
Armani included again down canopy!
Genevieve snapped Behind the computer!

 

Rule 9 – Bot Generated Rule

Ulises besides Maxim approved the scraper Near.
Nova that is Louisa ringed the scraper Down.
Alec besides Killian eased the cope Outside.
Sylas consequently Zain beat safely exit Above.
Conrad yet Alfredo owed the definition Within.
Danica that is Jackson paid the sweater fully.
Hugh and Kori substituted the pitching heavily.
Julissa but Colton separated the lie Down.
Liberty but Barbara reformed the lamp kissingly.
Zion yet Rosemary ascertained true fat under.
Neither Desiree nor Nadia filed the protocol Ahead of.
Neither Rudy nor Rowan aided the weedkiller commonly.
Grace indeed Jad caused the beast best.
Jaelynn besides Maddux cheered the panda Against.
Ari yet Ayla elected the seaside without.
Blakely moreover Karsyn stimulated jealousy shadow owlishly.
Prince further Lennon exhibited the worm Except.
Clay thus Rohan embraced the tsunami each.
Sabrina but Avery stressed far paste Excluding.
Gregory so Dallas engaged new egghead clearly.
Neither Lydia nor Walter escaped naturally margin previously.
Dylan namely Elaina kept suspiciously shed oddly.
Neither Jedidiah nor Karsyn devised the bathhouse kookily.
Kareem so River pointed wetly yoga Ahead of.
Ansley accordingly Alessandro laughed the brood By.
Omar otherwise Sofia obtained the clipper Per.
Walker so August summoned the tile yeah.
Remy moreover Cody raised the handball loyally.
Aadhya so Adelynn allocated the fear Amidst.
Mohamed likewise Hudson inspected the hyphenation Like.

While that does make generating rules easier, and it can also aid in resolving some of the issues with generating stories using rules, it really only amounts to an interesting “babble” generator at best. Perhaps, though, it could be coupled with several other systems in layers to create something closer to a story?

Maybe through the use of “rewrite rules” that could fix the verb tenses and pronouns perhaps?
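For example, one trivial “rewrite rule” might just lowercase the prepositions the bot tends to capitalize mid-sentence (e.g. “Freya handled Along the ship!”). Here’s a rough sketch, with an illustrative word list rather than the generator’s real lists:

<?php
// A toy "rewrite rule": lowercase prepositions left capitalized mid-sentence.
// The word list here is illustrative only, not the generator's real lists.
$prepositions = array('Along', 'Inside', 'Regarding', 'With', 'Behind', 'Near', 'Down');

$sentence = 'Freya handled Along the ship!';
$pattern = '/(?<=\s)(' . implode('|', $prepositions) . ')\b/';
$fixed = preg_replace_callback($pattern, function($match){
    return strtolower($match[1]); // only matches that are NOT the first word of the sentence
}, $sentence);

echo $fixed . PHP_EOL; // Freya handled along the ship!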

Here are the results of 15 randomly selected rules:

Leanna Odom is a woman but to the south of a
sports stadium, a bird built a nest for in the
zoo, a guy road a bike nor next to a city jail, a
car got a flat tire for Ricky Solis is very garish
and outside a farm , robots attacked yet Armani
Lowery is very attentive but Jeffrey and Zaniyah
permitted the extent Excluding. nor Azariah seemed
fatally the route! but Maddux proved hopefully
monthly wasp! nor Blaine wrote a poem. and inside
an abandoned ghost town, a book was written and
Neither Seth nor Callan behaved the steeple In
addition to. but Lucille ate a nice meal. and
Lorelei meditated.

Code

Below is the code for Generate.php and it’s the main program file. It uses Functions.php as well as some text files and you can find all the files you need for this project over on my GitHub for free: RuleBasedStoryGeneration on Github

<?php
// include all the functions
include('Functions.php');
// set up the parts of speech array
// functions will globally point to this variable
$pos = LoadPartsOfSpeech();
$number_of_sentences = 30; // how many sentences/rules are generated/used
$story = ''; // the string we concatenate rules on to
// for whatever number you set $number_of_sentences to...
foreach(range(1, $number_of_sentences, 1) as $number){
    
    $rule_subject = random_int(1, 3);
    
    // randomly determine the type of rule to use,
    // randomly select the rule, compute its result and concatenate with 
    // the existing $story
    if($rule_subject == 1){ // action or event
        
        $rule_subject = random_int(1, 4);
        
        if($rule_subject <= 3){
             $story .= Rule(1); // event
        }
        elseif($rule_subject == 4){
             $story .= Rule(7); // action
        }
    }
    elseif($rule_subject == 2){ // people related
        $rule_subject = random_int(1, 6);
        
        if($rule_subject == 1){
             $story .= Rule(2);
        }
        elseif($rule_subject == 2){
             $story .= Rule(3);
        }
        elseif($rule_subject == 3){
             $story .= Rule(4);
        }
        elseif($rule_subject == 4){
             $story .= Rule(5);
        }
        elseif($rule_subject == 5){
             $story .= Rule(6);
        }
        elseif($rule_subject == 6){
             $story .= Rule(7);
        }
    }
    elseif($rule_subject == 3){ // bot generated
        $rule_subject = random_int(1, 2);
        
        if($rule_subject == 1){
             $story .= Rule(8);
        }
        elseif($rule_subject == 2){
             $story .= Rule(9);
        }
    }
    
        
    // if this is not the last sentence/rule concatenate a conjunction
    if($number <= ($number_of_sentences - 1)){
        $story .= $pos['space'] . Get($pos['conjunctions']['pure']) . $pos['space'];
    }
}
// after the loop wrap the text at 50 chars and output the story
echo wordwrap($story, 50, PHP_EOL);
/*
 * Example Output
 * 
Jayleen ended By the lip! so Jada called a family
member. nor Aidan Lester is gifted and Emma Walton
is very clumsy and Grey proceeded widely literally
runner. or Santana Norman is a man yet Nico
Bartlett is very pitiful yet Aliana Browning is
rich and Rowan introduced Past the colloquia. but
Holly built a robot. so Morgan Dorsey is a person
for London fooled Against the cappelletti. but
Neither Emory nor Angel angered the order angrily.
or Hezekiah Beasley is very panicky and Leighton
did almost vivaciously author. so Foster Justice
is a man but Rory Parker is a beautiful person so
Reagan Rivera is a person but Kai Zamora is clever
nor beyond a newspaper company, dinosaur bones
were uncovered yet beyond a houseboat, a bird
built a nest so Kyle Goff is a man or on the
mountains, a new species of insect was identified
nor Galilea Mckinney is very worried or Gunner Orr
is very guilty but Otto Gaines is a small man nor
Gia Hendrix is powerful and Robert Mcdaniel is a
beautiful man so to the south of a newspaper
company, science breakthroughs were made so
Carmelo Rodgers is very witty
 
   
 */

 

Please remember to like share and follow!


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Rule Based Story Generation

In my last post Bot Generated Stories II  I left off posing the question:

How then did I get my bot to generate A Halloween Tale?

~GeekGirlJoy

The short and most direct answer I can give without getting into all the nitty-gritty design & engineering details (we’ll do that in another post) is that I built a Markov Model (a math based bot) then I “trained” my bot on some texts to “teach” it what “good” patterns of text look like.

After my bot “read” approximately 350K words, the 3 books and a few smaller texts (some of my work and a few other general text sources to fill-out the vocabulary), I gave my bot a “seed word” that it would use to start the generative process.

What’s really going on “under the hood” is a lot like the “predictive text” feature many of you use every day when you send a text on your phone.

It’s that row of words that appears above or below your text while you type, suggesting the next possible word based on what you last said, what it was trained on and your personal speech patterns it learned while it watched you send texts in the past.

Well… my writer bot is sort of a “souped up”, “home-brewed” version of that…. just built with generating a story in mind rather than predicting the next most likely word someone sending a text would need.
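Just to make the idea concrete, here’s a toy sketch of the Markov/predictive-text concept — not the writer bot itself, just the core trick of counting which word follows which and sampling from those counts:

<?php
// Toy next-word model: record which words follow which in some training text,
// then repeatedly sample a follower starting from a seed word.
$training = 'the cat sat on the mat and the cat ran off into the night';
$words = explode(' ', $training);

$model = array();
for($i = 0; $i < count($words) - 1; $i++){
    $model[$words[$i]][] = $words[$i + 1];
}

$seed = 'the';
$sentence = $seed;
for($i = 0; $i < 6 && isset($model[$seed]); $i++){
    $next = $model[$seed][array_rand($model[$seed])]; // pick a random known follower
    $sentence .= ' ' . $next;
    $seed = $next;
}
echo $sentence . PHP_EOL; // e.g. "the cat sat on the mat and"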

The thing is, we’re not going to talk about that bot today. 😛

Xavier and I have been sick and I’m just not in the mood to do a math & code heavy post today; instead, we’re going to talk about rule based story generation. 😉

Rule Based Story Generation

One simple and seemingly effective way to generate a story that absolutely works (at least to some extent) is to use “rules” to select elements from groups or lists and assemble the parts into a story.

The idea is pretty simple actually: if you select or write good rules, they will generate unique (enough) sentences that, when combined, form a story.

For example, lets say I create a generative “rule” like this:

Rule: proximity location, event.

Seems simple enough but this rule can actually generate quite a few mildly interesting sentences, like this one for example:

in a railroad station, aliens attacked.

Result: (proximity: in) (location: a railroad station), (event: aliens attacked).

Not bad huh? You’d totally read that story right? 😛
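In code, a rule like this really is just picking a random entry from each list and gluing the parts together. Here’s a rough sketch (these short lists are stand-ins for the generator’s full lists):

<?php
// A rough sketch of the "proximity location, event." rule.
// These lists are illustrative stand-ins, not the generator's real lists.
$proximities = array('in', 'near', 'below', 'beyond', 'to the west of');
$locations   = array('a railroad station', 'a dank old bar', 'a village', 'a cave');
$events      = array('aliens attacked', 'a robbery happened', 'a bird built a nest');

$sentence = $proximities[array_rand($proximities)] . ' '
          . $locations[array_rand($locations)] . ', '
          . $events[array_rand($events)] . '.';

echo ucfirst($sentence) . PHP_EOL; // e.g. "Below a dank old bar, a robbery happened."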

Here are a few more results using this rule so you can see what it can do. Note that the rule pattern never changes (proximity location, event.), but because different values are randomly selected, each sentence is different and the general “tone” of the sentence changes a bit as well:

Below a dank old bar, science breakthroughs were made.

I like that one 😛

Next to a village, a robbery happened.

To the west of a cave, a bird built a nest.

On the deck of a giant spaceship, a nuclear bomb was detonated.

Eat your heart out Kubrick! 😉 😛

Beyond the mountains, a child giggled.

Notice that all three parts of the rule (proximity location, event.) can affect the tone and meaning of the generated result.

What if the rule had generated:

“On the deck of a giant spaceship, a child giggled.”

That is a vastly different result than the one in the examples above, yet perhaps it is the same story with only seconds separating both events? Maybe…

“On the deck of a giant spaceship, a child giggled. The hoards of aliens were defeated. In the distance a voice yells “Dinner’s ready!”, a toy spaceship falls to the floor as little feet scurry unseen through the house.”

What makes the determination in the reader’s mind about what is actually going on is the context of what was said before this sentence and what will be said after. There are those cases where not saying something is saying something too… but dammit I can’t model that! 😛

Now, let’s look at how the proximity can change the meaning.

Here’s the proximity list I used with this rule:

in
inside
outside
near
on
around
above
below
next to
close to
far from
to the north of
to the south of
to the east of
to the west of
beyond

Each ‘proximity’ by itself seems pretty self-explanatory in its meaning, but when combined with a location the meaning can change. For example, it seems fairly natural to discuss something being ‘beyond’ something else, like “the fence is beyond the water tower”, but let’s say that you have an ambiguous ‘location’ like Space?

1930s & 40’s  Pulp Scifi aside… what does it mean to be “Beyond Space”? 😛

Clearly we’ve run into one of the limitations of rule based story generation, of which there seem to be many… but in this case I’m referring to unintended ambiguity.

At best a rule would reduce ambiguity and at worst it could inject significant ambiguity into a sentence. Ambiguity in this case should be understood as lack of clarity or a variance in what the reader is supposed to understand is occurring and what they believe is occurring.

Limitations aside, this type of rule based generative system is surprisingly effective at crafting those direct and matter of fact type statements.

The type of problem you could write an “If This Then That” sort of rule for… hmmm.

 

A Few More Rules

Here are a few more rules to help you get a feel for how this whole “rule” thing works:

Rule: name is very positive_personality_trait
&
Rule: name is very negative_personality_trait

See if you can tell which is which in this list:

Channing Lynn is very faithful
Jerome Puckett is very defeated
Arturo Thomas is very nice
Damon Gregory is very grumpy
Calvin Weeks is very repulsive
Joaquin Hicks is very gentle
Amanda Calhoun is very thoughtless
Matthias Welch is very polite
Carter Camacho is very scary
Jay Dyer is very happy
Harper Buckley is very helpless
Trenton Bauer is very kind
Kane Owen is very lazy
Lauryn Vasquez is very obedient
Aleah Gilmore is very angry
Ameer Cortez is very brave
Kase Wolfe is very worried

This rule is static and can be improved by having fewer “hard coded” elements.

Instead of the result always containing the word “very”, you might have a gradient of words, selected at random (or based on some precondition), that modify the meaning or intensity of the trait, i.e. mildly, extremely, slightly, not particularly, etc., which could lead to interesting combinations. We could call this gradient of terms, oh I don’t know… adverbs. 😛

Technically though, adverbs in general are too broad a category to treat as simple building blocks in a rule like this, but you could build a list of adverbs that would apply in this case and replace the word ‘very’ with a selection from that list, which would result in more variation in the personality trait descriptions.

Lets look at another rule.

Rule: name is adjective_condition

Annalee Sargent is shy
Hugh Oconnor is helpful
Tessa Rojas is gifted
Cristian Castaneda is inexpensive
Heavenly Patel is vast
Gibson Hines is unimportant
Alora Bush is alive
Leona Estes is mushy

I don’t know about you but…

I’ve always found that “mushy” people are very positive_social_anecdote! 😛

Are you starting to see how the rules work? 😉

Much like the rule above, which could be improved by replacing the “hard coded” adverb (very) with a gradient that is selected at random (or based on some precondition), the verb ‘is’ in this rule could be replaced with a gradient of verb tenses, i.e. is, was, will be, etc…

Now, if you want to get more complicated… you could even build a system that uses or implements preconditions as I mentioned above.

An example of a precondition I gave above was using verb tense to determine whether something has happened, is happening or will happen… which would then modify the related rules that follow… but it’s also possible to build preconditions that modify rules directly from properties that are innate to your characters, settings, objects in the locations, the weather, the time of day, etc…

For example consider the Rule: name is a gender

This rule must be able to determine the gender of the name given to it in order for the rule to work. In this case, the gendered name would act as a precondition that modifies the result of the rule.

Reyna Dunlap is a woman
Nikolai Cummings is a man
Emerald Lynch is a woman
Lucas Woodward is a man
Bailey Ramsey is a woman
Matias Miller is a man
Tinley Hansen is a woman
Mckenzie Davidson is a woman

It’s also possible, however, for a name to be gender neutral, like Jamie for example, and the rule cannot simply break if the name is both male & female, or neither in the case of a new or non-typical name. That level of abstraction (care and detail given to each rule so as to prevent it from breaking) has to extend to all rules in all cases, which is why using rules to write stories is impractical.
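A rough sketch of how that precondition could work, including a fallback so an unknown or gender neutral name doesn’t break the rule (the lookup table and helper name here are illustrative stand-ins, not the generator’s real name lists):

<?php
// "name is a gender" with a precondition: look the first name up in a table
// and fall back to the neutral "person" so a name like Jamie never breaks the rule.
// The table below is an illustrative stand-in, not the generator's real list.
$gendered_names = array('Reyna' => 'woman', 'Nikolai' => 'man', 'Emerald' => 'woman', 'Lucas' => 'man');

function NameIsAGender($full_name, $gendered_names){
    $parts = explode(' ', $full_name);
    $first = $parts[0];
    $gender = isset($gendered_names[$first]) ? $gendered_names[$first] : 'person';
    return $full_name . ' is a ' . $gender . '.';
}

echo NameIsAGender('Reyna Dunlap', $gendered_names) . PHP_EOL; // Reyna Dunlap is a woman.
echo NameIsAGender('Jamie Smith', $gendered_names) . PHP_EOL;  // Jamie Smith is a person.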

Related to the last rule is this Rule: name is a adjectives_appearance gender

Mallory Joseph is a clean woman
Talon Vazquez is a bald man
Kody Maxwell is a magnificent man
Meredith Strickland is a stocky woman
Jaliyah Haynes is a plump woman
Brian Leblanc is a ugly man
Collins Warren is a scruffy woman
Tenley Robbins is a chubby woman
Brantley Mcpherson is a chubby man
Killian Sawyer is a fit man

Here again you see the rule must identify the gender of the name given to it… but what’s more important is that I used the “present tense” ‘is’ when it’s just as valid grammatically to say that “Killian Sawyer was a fit man”. In fact, even if it is grammatically correct to say he “is fit”, he might not even be alive any longer, in which case ‘was’ would be logically correct, being the past tense, with the implication that he is no longer fit rather than dead. Additional information would be required by both the reader and the system to determine that Killian was dead, but the point still stands.

Using preconditions on all aspects of the story such as the characters, places, things etc. could enable the system to properly determine if it should say something is, was, will be, maybe, never was, never will be, etc… It could examine the established who, what, when, where, why & how and use that information to determine which rule to use next, which would progress the story further.

It’s easy to imagine how some rules would only be applicable if certain conditions had occurred or were “scheduled” to occur later in the story. Some rules might naturally form branching chains or tree hierarchies within the “flow” of the rules.

This implies, if not requires, some form of higher-order logic, state machines and formal logic programming, with special care and attention given to the semantic meaning or value of each part of each rule…

Well nobody said it was going to be easy! 😛

These Are Not The Droids You Are Looking For

Sounds too easy right? Well… you’re probably right.

I mean, sure, you can do this in theory, and next week I will provide a crude “proof of concept” example with code demonstrating the rules I used here today. But even if you create a bunch of rules the result isn’t “earth-shatteringly good”, and you can’t just write them and be done; there is a lot of trial and error to get this sort of thing just right.

Personally I’ve never even seen a fully functional implementation of this type of thing… sure, I’ve seen template based stories that use placeholders but nothing as dynamic as I believe would be required to make a rule based system work as described.

Again I am not talking about simply randomly selecting rule after rule… I mean sure you could do that but you won’t get anything really useful out of it.

To do this right your system would select the rules that properly describe past, present & future events based on where in the story the rule is used, and it can’t simply swap out the names, objects or descriptions in your story without concern for the context in which the rule is being applied.

To do rule based story generation right means that you get different stories each time the system runs, not cookie-cutter templatized tales. You can’t simply write a story template, fill it with placeholders and then select and assemble “story legos” and get a good story.

Though at least hypothetically it could work if you wrote enough rules and built a more robust system that keeps track of the story state, the characters and their motivations, the objects, where they are, what they are etc… of course this is tedious and ultimately still amounts to you doing an awful lot of work that looks like writing a story.

I do believe (at least in theory) a rule based story generative system as described here could work but you would be limited to the manually written rules in the system (or are you? 😉 ) and how well the state machine used the rules.

Further, it’s debatable whether, even if a rule based story generation system worked, it could actually be good enough to be the “best seller” writer bot that we’re looking for.

Seemingly the major limiting factor to me appears to be hand writing, refining and testing the rules.

Suggest A Rule

As I said, I will present the code for these rules in my next post, but I’d like to ask you to suggest a rule in the comments for this post. I will try to include as many of them as possible in the code and give the suggester credit for the rule.

Please remember to like, share & follow!


Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy
