Geek Girl Joy

Artificial Intelligence, Simulations & Software

October 2018

Building A Faster Bot

This week has been all about testing and optimizing the bot Train process.

The prototype we looked at in The Brown Corpus is way too slow and needed to be refactored before we could proceed.

There was a point where I knew it was going to take too long to be satisfactory but out of a perverse geek curiosity I couldn’t bring myself to cancel the training… I just had to see how slow (bad) the process really was! 😛

If we want to use this as the core of a bot/system that can do more than just parts of speech tagging (and we do) it needs to be FAST! To do anything really fun, it just can’t take 3 weeks to process 1 million words!

And yes, the Brown Corpus only needs to be learned once but any additional learning by the bot would also be just as slow…

Why was it so slow???

Basically the code was self-blocking. Every step had to complete before the next step could begin.

All the words & tags were added to the database before the trigrams could be learned and we had to wait each time for the database to generate an ID.

I did cache the IDs for the words and tags the bot encountered in memory for faster lookup but… it was ultimately just a slow process regardless of this optimization.

That did, however, help keep the average training time per document fairly consistent, but trigrams were looked up by searching the database for any trigram where A & B & C all matched, LIMIT 1. Needless to say, that is super inefficient!

Though clearly not the Training process we wanted it had a few things going for it:

Pros:

  • Quick to Build & Functional though Verbose:
    What was nice about this method was that the code was easy to write and, though long, it should be fairly easy to read. Also, it can’t be overstated that the best way to get on the right track is to build a functional proof of concept and then iterate and try to improve your system.
  • Direct Numbers are Good for Bots
    We also gained the ability to do some interesting transforms by having a unique numeric value that the words, tags and trigrams are tied to. We’ll discuss this more in future posts so I’ll leave this at that for now.

Cons:

  • SLOW!!!!
    3 Weeks is too slow!

~1,000,000 words / ~504 hours (3 weeks) ≈ 1,984 words per hour.

That equates to roughly 1 document an hour for 3 weeks straight!

  • Can’t Divide and Conquer
    As the Train process was written, it was impossible to split the data set across more than one system and later combine the tables without quite a bit of manual post-processing… which is just not a very pleasant thought! 😛 This is because the database assigned IDs based on the order in which it encountered each word, tag or trigram. So, if you split your training files between two systems, they will assign different IDs to the same word, tag or trigram, and we would later have to read through all the tags, words and trigrams and rewrite the IDs to match before we could merge the tables.

 

What changed in the refactor?

We switched to a batch process method where we process 10 files in memory then transfer the data to the database, clear the memory and process the next batch of 10 files until we have processed all the files.

This allows us to keep the memory requirements of the training process very low with each batch of 10 training files only requiring on average ~25 MB of RAM to go from raw text to database, which the bot quickly empties when it’s done.
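
To make the batch idea concrete, here’s a minimal sketch of that loop. The ProcessFileIntoMemory() and FlushBatchToDatabase() helpers are hypothetical placeholders standing in for the real Train.php functions:

<?php
// Minimal sketch only - ProcessFileIntoMemory() and FlushBatchToDatabase()
// are hypothetical placeholders, not the actual Train.php functions.
$batch_size = 10;                               // files per batch
$batches = array_chunk($training_files, $batch_size);

foreach($batches as $batch){
  $pending = array();                           // in-memory results for this batch only

  foreach($batch as $training_file){
    // parse the raw text into words, tags & trigrams entirely in memory
    $pending[] = ProcessFileIntoMemory($training_file);
  }

  // one trip to the database per batch instead of per word/tag/trigram
  FlushBatchToDatabase($pending, $conn);

  // release the ~25 MB this batch used before starting the next one
  $pending = NULL;
  unset($pending);
}
?>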

Which brings us to hashing.

Hashing to the Rescue!

You might be asking… isn’t this a lot of work for something that seems simple? Why bother with hashing at all? Isn’t the batch processing memory trick enough? Well, batch processing was a response to implementing hashing.

You see, we needed a way to reduce the number of comparisons when doing lookups.

Consider this comparison:

(A == Wa && B == Wb && C == Wc)

that’s three comparisons (all must be true) for the trigram to match, but if A & B match and C doesn’t, that’s still 3 evaluations before you know to move on. If we could reduce those comparisons without losing the information they give us, then we might save a lot of time during training as well as when using the bot later!

We also needed a way to have 2 or more machines assign the same “ID” value to a word, tag and trigram. This would allow us to split the training set among as many computers as we can get our hands on and make quick work of future training data.

Hashing solves both of these problems!

If you & I hash the same value using the same algorithm we will get the same result regardless of our distance from each other, the time of day or any other factor you can think of. We can do this without ever having to speak to each other and our computers need not ever communicate directly. This property of hashing makes it an ideal solution for generating IDs that will line up without a centralized system issuing IDs. It’s basically how blockchain technology works, though this is far simpler.

Hashing also allows us to reduce 3 comparisons to 1 because we concatenate W_A + W_B + W_C like this:

<?php
// notice these two are the same - and always give the same result
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a

// notice these two are the same word with different capitalization - different result
echo hash('md5', 'fox'); // 2b95d1f09b8b66c5c43622a4d9ec9a04
echo hash('md5', 'Fox'); // de7b8fdc57c8a948bc0cf52b31b617f3

// A specific value always returns that specific result
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
?>

Hashing yields a highly unique (case-sensitive) value that represents the three words in the trigram and as such, when we are looking for a specific trigram we can hash the values and obtain its exact ID rather than do an A & B & C comparison.
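
As a rough illustration (not the exact schema or code in Train.php; the Hash_ID column name and the TrigramID() helper below are assumptions), here’s how a hashed ID collapses the six-column trigram lookup into a single keyed query:

<?php
// Illustrative sketch only - the Hash_ID column and this helper are assumptions,
// not the exact schema/code used by the refactored Train.php.
function TrigramID($w_a, $w_b, $w_c, $t_a, $t_b, $t_c){
  // Every machine that hashes the same six values gets the same ID.
  return hash('md5', $w_a . $w_b . $w_c . $t_a . $t_b . $t_c);
}

$id = TrigramID('The', 'quick', 'brown', 'at', 'jj', 'jj'); // example words & tags

// Old way: six AND comparisons per candidate row.
// SELECT * FROM `Trigrams` WHERE Word_A=... AND Word_B=... AND Word_C=...
//                            AND Tag_A=...  AND Tag_B=...  AND Tag_C=... LIMIT 1

// New way: one indexed lookup on the precomputed hash.
$sql = "SELECT `Count` FROM `Trigrams` WHERE `Hash_ID`='$id' LIMIT 1";
?>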

It’s worth noting that hashing would add to the memory requirements of the bot (a hash is usually longer than the word it represents) so batch processing was added to address the increased memory demands of the hashed data.

The batch process eliminates the negative of having more information in memory (caused by hashing) by limiting how much RAM the program will need at any given moment.

Here’s a pros vs cons overview.

Pros:

  • Divide and Conquer!
    We can split the training data among as many computers as we have available.
  • No Significant Processing Required to Merge Tables
    All IDs will be the same so there is no need to convert them.
  • ID Lookups Are Eliminated
    Because the ID is the hashed representation of the word, tag or trigram, we never need to look up an ID. You just hash the value you are checking and use that as the ID.

Cons:

  • Hashing isn’t Fast!
    While approximately 4,812% faster and no longer taking 3 weeks, this code is still slow; it took 10 hours, 15 minutes and 50.4 seconds to process 1 million words into trigram patterns and store them in the database.

 

If you would like to obtain a copy of the new Training code you can get that on my Github here: Train.php

And of course what you’ve all been waiting for… the data:

Parts of Speech Tagger Data:

You don’t need to wait 10 hours running Train.php to get started using the Brown Corpus in your own projects! I’ve made the data available on my GitHub profile where you can download it for free in SQL and CSV formats.

I wanted to release the data as XML as well but the files were larger than GitHub would allow and even the SQL and CSV files were just barely under the allowed upload limit. GitHub complained… Oh, the things I do for my readers! 😛

MySQL:

  • Tags_Data.sql
  • Tags_Structure.sql
  • Trigrams_Data_1.sql
  • Trigrams_Data_2.sql
  • Trigrams_Data_3.sql
  • Trigrams_Data_4.sql
  • Trigrams_Structure.sql
  • Words_Data.sql
  • Words_Structure.sql

CSV:

  • Tags.csv
  • Trigrams_1.csv
  • Trigrams_2.csv
  • Trigrams_3.csv
  • Trigrams_4.csv
  • Words.csv

 

I hope you are enjoying building SkyNet… er… this Parts of Speech Tagger as much as I am. 😛

In the next post in this series we will look at how to feed the bot some text and use trigrams to tag the words, so remember to like and follow so you won’t miss a single post!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your direct financial support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Polybius VR

What follows is hard to explain and admittedly sounds like something out of a twilight zone episode, and maybe… that’s just what it was. A moment in time where the laws of reality broke.

About two weeks ago I was on the app store on my phone looking for new Cardboard VR apps… what can I say? I like VR and Cardboard VR is affordable! And if you use one of those micro USB to USB Type-A converters you can plug in a USB hub and attach a keyboard or even a mouse and get some decent VR capability for relatively cheap! 😛 😉

Anyway, after a little browsing of the “new releases” I stumbled across an app called Polybius VR by a company I had never heard of before (Sinneslöschen Inc.) and it didn’t have any reviews yet and not many downloads either but since it was free I thought why not, it’s easy enough to uninstall if it’s not interesting, right?

The app was huge (a few hundred MB) and took a couple of minutes to download which isn’t that unusual for VR apps.

I was home alone and would be all evening, so while Polybius was downloading I microwaved some Ramen noodles and refilled my JPL Women In Space mug with coffee; I love mine slightly bitter and black.

I sat back down at my bedroom desk then paused to look out my window and enjoy the blood red and purple Los Angeles sky as the sun sank low.

I sipped my coffee and got comfy in my chair. My keyboard was on my lap so I could use it while in VR. I put nail polish on the WASD keys as well as a few others so that I can find them without having to remove the VR headset. 😎 😉

I start Polybius and slide my phone into the headset and adjust the straps while the app loads.

In all directions I see an infinite abyss except for directly below me, where I see the sentence “(C) 2018 Sinneslöschen Inc.” glowing blue like copper sulphate crystals. The font is unusual, blocky and almost pixel like.

Out of the murky black void in front of me the words Polybius VR erupt and grow to become the only thing in my view.

The words seemed to be about the size of a single-story building and were wrapped in polygonal chains that seemed to crawl like cellular automata. The lines vibrated and jittered all over the text, changing shape to envelop the words. The effect oscillated between a smooth gradient and jagged pixel edges, which gave the impression that sometimes the lines were eating away at the text like acid.

It’s at this point where things start to get weird.

For lack of a better term I’d describe it as “missing time”. The experience for me seemed to only last a few minutes and all I recall seeing was the copyright text followed by the Polybius VR logo, which flashed several times in very rapid succession, and then the screen on my cellphone just went black.

I pulled off the headset and frantically removed my phone to reveal that the screen was cracked, which was disappointing and that’s putting it mildly!

That’s when I gazed out my window again, only to realize that I could see stars in the sky!

As I said, my perception was that at most only a few minutes had passed, but the clock on the wall begged to differ when it read 8:57 PM. My computer and microwave also confirmed that roughly three hours had passed from when I first sat down.

After a more thorough examination of my phone it appeared that the battery had exploded and fused the entire thing into a paperweight.

I remembered I had installed the app on my SD card so I thought to recover it from the scrap but frustratingly it was also damaged beyond recovery, though thankfully I had my photos backed up!

I used my desktop to access the app store where I immediately searched for Polybius VR but nothing related came up.

Desperate for some reassurance of my own sanity I turned to Google like anyone in my position would and typed ‘Polybius’.

The very first link returned did little to alleviate my growing concern. The Wikipedia article “Polybius (urban legend)” opens with this sentence:

“Polybius is a fictitious arcade game, the subject of an urban legend…”

How could it be an urban legend? It was real! I installed it and it fried my phone, not to mention distorting my perception of time for 3 hours!

I spent the better part of the next week researching the Polybius urban legend only to turn up myth after half truth. Website after website full of internet rumors, hoaxers and fake news.

I even reached out to a couple of grey-hats I used to work with to see if they knew of anyone who was working in wetware that might be able to pull off a hack that would cause something like missing time.

They both told me the same thing… Polybius was a myth and nobody was even close to that level of biohacking.

Piecing all the “facts” together for myself the Polybius urban legend seems to go as follows…

In the summer of 1981 somebody (usually claimed to be the CIA but sometimes it’s shadow mega corporations… aliens?) formed the mythical shell company “Sinneslöschen Inc.” with the clandestine charge of conducting civilian thought control experiments on unsuspecting people.

I half expect Fox Mulder and Dana Scully to show up any minute!

The Polybius project is said to have centered around using arcade games (it was the 80s) to attempt to turn anyone into a mindless puppet.

Few credible witnesses have ever come forward but one overwhelmingly reoccurring theme among all the Polybius stories is an “addictive” effect experienced by players.

Some claiming it became the only thing they could think about even when they weren’t playing… which I guess these days is pretty understandable. I mean, we’ve all known someone (or been that someone) who was so into a game that we describe them as ‘addicted’.

The typical scenario goes something like this: Polybius players would leave the house in the morning feeling the weight of pockets full of quarters!

Then wait, aching long hours while the clock shortened the distance between them and their next chance to play.

Once let out of work or school on their own recognizance the race between them and everyone else who wanted ‘THEIR’ machine was on!

You were lucky if you were an adult because it meant you had a car and could get to the arcade first, pump the machine full of quarters till basic economics forced you to relinquish the machine to the next player, who in turn did the exact same thing.

Hours would churn and abused buttons induced pixels to strobe and undulate in hypnotic patterns while the polyphonic beeps rhythmically danced their strange digital melody.

In addition to the addictive effect described by players, some reports describe a “Polybius intoxication”; others refer to it as a sort of “madness” or “stupor”.

People who played Polybius for long stretches were described as occasionally experiencing something like a seizure, followed by what appeared to be a coma lasting anywhere from minutes to hours, after which they would wake up & remain “blank & zombie like” for some time.

Additional side effects caused by playing Polybius were reported to be: amnesia, insomnia, night terrors, hallucinations, and rare unprovoked aggressive episodes.

One site poorly sourced a quote from an Oregon newspaper from the early 80’s describing a public arcade event where several Polybius machines were observed by an audience of a few dozen people for an extended amount of time. It was reported that several of the players & audience members became sick, they described some of the symptoms of Polybius intoxication as well as “zombie like behavior” by those afflicted.

Putting urban legends aside, I’m still left with the question of what really happened that night?

I’d dismiss all of this outright as myth, half truth, internet rumors, hoaxers and fake news if it wasn’t for my experience with Polybius VR.

Is it possible that a neurohacker terrorist somewhere discovered a technique that could perhaps “reboot” a brain just by showing you some images?

The idea isn’t as sci-fi as it may seem. It turns out that a condition called confabulation can occur in both biological and artificial neural networks, so maybe someone figured out how to trigger a buffer overflow in a brain and packaged it in a VR app!

The idea that some faceless attacker could take control over your mind seems to be the ultimate violation of self.

Maybe my phone battery just died and my computer, microwave and the atomic clock on my wall collectively suffered from the same glitch. Perhaps this was just an elaborate prank at the expense of anyone who was unfortunate enough to install Polybius VR…

Or perhaps it’s all true and there really are monsters that lurk in the shadows ready to devour us just as soon as the flashlight battery dies.

It stands to reason that I will never truly know what happened to me that night and the events of those three hours will remain forever shrouded in my nightmares.

I guess if there’s a moral to be found here at all it would be… be careful what you install!

And with that, Happy Halloween everyone! Be safe out there tonight & If you come across an app called Polybius VR in the app store, do yourself a favor and take a pass on that one! 😉

Remember to like, share and follow & if you are one of the few other people to have downloaded Polybius VR before it mysteriously disappeared from the app store, consider leaving your personal experience below in the comments and since you’re still reading, why not help me grow…

Help Me Grow

Your support finances my work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of your reading material. It also helps me eat, pay rent and now of course I have to add a new cellphone to the budget! 😛

If you feel inclined to give me money and add your name (or business) on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

The Brown Corpus Database

Welcome back, today we’re going to peek inside the database for the Parts of Speech tagger.

Unfortunately my Raspberry Pi that I am using to train with is slow (cough… and my code is super un-optimized 😛 ) so… it’s still working on learning the complete Brown Corpus, though we’re almost there, fewer than 190 training files remaining!

Before proceeding here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

The Database

Here’s a recap of the database. It consists of three tables: Words, Tags & Trigrams. You can find the complete MySQL database setup script here: Create.PartsOfSpeech.DB.sql.

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


 

Words Table

The Words table keeps track of all the words the tagger knows.

The bot uses IDs in place of the words, so given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to lookup each word (provided it knows it) and replace it with the ID of the word in the Words table, like this:

1 43524 70488 515610 1149954 7158 1 266303 56280. 309 43578 53868 1212 zero-liability 238658 482081 32358 423

Notice that the bot was unable to look up the ID for the word “zero-liability”; this is because it never saw that word during training, and it would need to be “learned” by the system by assigning it a new ID and adding it to the database.
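
As a rough sketch of that substitution step (the tiny $words_to_id map and its ID values here are just for illustration, standing in for the full Word => ID map loaded from the Words table), the lookup-and-replace boils down to something like this:

<?php
// Illustration only - the map and IDs below are made up for the example.
$words_to_id = array('the' => 1, 'quick' => 43524, 'brown' => 70488 /* ... */);

$sentence = 'the quick brown fox jumps over the lazy dog';
$encoded  = array();

foreach(explode(' ', $sentence) as $word){
  if(isset($words_to_id[$word])){
    $encoded[] = $words_to_id[$word];   // known word: swap in its ID
  }
  else{
    $encoded[] = $word;                 // unknown word: would need to be learned
  }
}

echo implode(' ', $encoded); // e.g. "1 43524 70488 fox jumps over 1 lazy dog"
?>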

Here’s an infographic that might help you understand the Words table:

Words Table (infographic)

 

Tags Table

The Tags table keeps track of all the tags the tagger knows.

The bot uses IDs in place of the tags, so given these words:

fox, jump, jumps, jumped

We would assign these tags:

fox/nn = singular or mass noun

jump/vb = verb, base form

jumps/vbz = verb, 3rd. singular present

jumped/vbd = verb, past tense

And the ID’s for the tags would be represented as such:

21, 246, 138, 12

Here’s an infographic that might help you understand the Tags table:

Tags Table (infographic)

 

Trigrams Table

The Trigrams table is the heart of the system and its job is to keep track of the associations between word trigrams (groups of 3 words) and tag trigrams (groups of 3 tags).

The Brown Corpus training data is split up into trigrams of words and tags so that when the bot learns it isn’t just learning individual words and tags but chains of words and tags.

This helps the bot learn that some words can have more than one meaning or role in a sentence. It also keeps a count of each time it sees a trigram so it can calculate the probability of each trigram and tag set.
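
As a quick illustration of how those counts could become probabilities (the GetTagTrigramProbability() helper, the tag names and the numbers below are hypothetical, not part of the actual schema or code):

<?php
// Hypothetical sketch: given all the tag-trigram rows seen for one word trigram,
// the probability of a particular tag trigram is its count over the total count.
function GetTagTrigramProbability($count_for_this_tag_set, $counts_for_all_tag_sets){
  $total = array_sum($counts_for_all_tag_sets);
  return ($total > 0) ? $count_for_this_tag_set / $total : 0;
}

// e.g. suppose the word trigram ["think","it","over"] was seen 9 times tagged
// (vb, ppo, rp) and 3 times tagged (vb, pps, in):
$counts = array('vb-ppo-rp' => 9, 'vb-pps-in' => 3);

echo GetTagTrigramProbability($counts['vb-ppo-rp'], $counts); // 0.75
?>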

Given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to extract the following trigrams represented here as JSON:

[
	["The","quick","brown"],
	["quick","brown","fox"],
	["brown","fox","jumps"],
	["fox","jumps","over"],
	["jumps","over","the"],
	["over","the","lazy"],
	["the","lazy","dog"],
	["lazy","dog","A"],
	["dog","A","long-term"],
	["A","long-term","contract"], 
	["long-term","contract","with"],
	["contract","with","zero-liability"],
	["with","zero-liability","protection"],
	["zero-liability","protection","Let's"],
	["protection","Let's","think"],
	["Let's","think","it"],
	["think","it","over"]
]

 

And of course since we’re actually using word IDs and not the words themselves, we could change the words to their IDs in the JSON:

 [
	["1","43524","70488"],
	["43524","70488","515610"],
	["70488","515610","1149954"],
	["515610","1149954","7158"],
	["1149954","7158","1"],
	["7158","1","266303"],
	["1","266303","56280"],
	["266303","56280","309"],
	["56280","309","43578"],
	["309","43578","53868"],
	["43578","53868","1212"],
	["53868","1212","1161931"],
	["1212","1161931","238658"],
	["1161931","238658","482081"],
	["238658","482081","32358"],
	["482081","32358","423"],
	["32358","423","7158"]
]

The same can be done with the Tags.

 

Here’s an infographic that might help you understand the Trigrams table:

Trigrams Table (infographic)

 

 

This is as far as we’ll get this week so remember to like, and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

It goes toward helping me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

The Brown Corpus

Welcome, we’re going to talk about training a Parts of Speech tagging bot using the Brown Corpus.

The Brown Corpus

What’s the Brown Corpus? Basically, two linguists (Henry Kučera and W. Nelson Francis) combined their efforts at Brown University (thus ‘Brown Corpus’) in the early 1960’s to create an English-language corpus that computer scientists and AI researchers could use as a standard.

The corpus is composed of 500 samples of English-language text, each approximately 2,000 words long, give or take a few exceptions.

It covers topics such as Religion, Politics, News and even Science (Fiction & Non) not to mention multiple genres in each topic. Check out the Sample Distribution section on the Wikipedia page for the specifics if you are curious but suffice it to say it’s extensive!

 

What Can You Do With It? 

Well, generally speaking its purpose is to act as a well-documented & ‘tagged’ data set that you can compare your bot, word tagging system, or even something else… against to determine the accuracy of your model.

The thing is, that also means it makes a great resource to train a Parts of Speech tagging bot from. And  well… that’s what we’re going to do! 😛

Before proceeding here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

Further, you may not sell it without obtaining permission from the licence holder. As far as I am aware you may not use it commercially.

As for the bot, once you understand how this system operates it’s relatively trivial to make modifications to this tri-gram tagging system or build your own from scratch.

The real difficulty in using a system like this is obtaining a well tagged corpus of text with a commercially permissible use licence though they do exist for purchase, or you might find a CC0 or MIT Licenced corpus or here again… you could build your own from scratch… but that is a huge undertaking for a single or small group of developers.

How Do We Build It?

Ok, we all know I’m going to publish everything on my GitHub profile when I’m done and since I’m awesome,  I’m probably going to export the data as convenient formats (JSON, XML… ) as well. 😉

Are there formats other than MySQL, JSON & XML that you would like? CSV? Let me know in the comments.

To get started if you want to follow along you can find the GitHub repo and corpus HERE.

MySQL Database

I’m using a MySQL database to hold the “training data”. Once “trained” we can export the data to the other formats. Note that this database is unoptimized and is merely a rough prototype that focuses on function over form.

--
-- Database: `PartsOfSpeechTagger`
--
CREATE DATABASE IF NOT EXISTS `PartsOfSpeechTagger` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
USE `PartsOfSpeechTagger`;

-- --------------------------------------------------------

-- --------------------------------------------------------

--
-- Table structure for table `Tags`
--

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `Trigrams`
--

CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `Words`
--

CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Indexes for dumped tables
--

--
-- Indexes for table `Tags`
--
ALTER TABLE `Tags`
  ADD PRIMARY KEY (`ID`),
  ADD UNIQUE KEY `Tag` (`Tag`);

--
-- Indexes for table `Trigrams`
--
ALTER TABLE `Trigrams`
  ADD PRIMARY KEY (`ID`);

--
-- Indexes for table `Words`
--
ALTER TABLE `Words`
  ADD PRIMARY KEY (`ID`),
  ADD UNIQUE KEY `Word` (`Word`);

--
-- AUTO_INCREMENT for dumped tables
--

--
-- AUTO_INCREMENT for table `Tags`
--
ALTER TABLE `Tags`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
--
-- AUTO_INCREMENT for table `Trigrams`
--
ALTER TABLE `Trigrams`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
--
-- AUTO_INCREMENT for table `Words`
--
ALTER TABLE `Words`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;

 

Train.php

Next I wrote this terribly un-optimized training script that is, as my grandfather would have put it, “slower than molasses in January”, but once done it need not ever be run again (I’ve been training since the 6th 😛 ) so save yourself the trouble and don’t run this code! Wait for me to publish the finished data to the repo... hopefully sometime over the weekend or early next week.

<?php

// Create & return $conn object to hold connection to MySQL
function ConnectToMySQL($servername, $username, $password, $dbname){

  // Create connection
  $conn = new mysqli($servername, $username, $password, $dbname);
  // Check connection
  if ($conn->connect_error) {
    die("MYSQL DB Connection failed: " . $conn->connect_error);
  }
  
  return $conn;
}

// Disconnect $conn object that holds the connection to MySQL
function DisconnectFromMySQL(&$conn){
  $conn->close();
}


// If the word is in memory we know it, move on
// otherwise try adding it to the database
// if we add it to the database keep a copy in memory 
// to avoid unnecessary DB queries in the future
function AddWordToMySQLAndMemory($word, &$conn){
  
  global $words_to_id;
  
  // if the word isn't in memory try to add it to the database
  if(empty($words_to_id[$word])){
    $sql = "INSERT INTO `Words` (`ID`, `Word`) VALUES (NULL, '$word')";
    if ($conn->query($sql) === TRUE) {
        //echo "New word added successfully" . PHP_EOL;
      // add to memory for faster look up in the future
      $words_to_id[$word] = GetIDForWord($word, $conn); // get ID DB Assigned
    } else {
      // weird the Word exists - did you reboot?
      // echo "Word exists" . PHP_EOL; 
      // add to memory for faster look up in the future
      $words_to_id[$word] = GetIDForWord($word, $conn); // get ID DB Assigned
    }
  }
}


// If the tag is in memory we know it, move on
// otherwise try adding it to the database
// if we add it to the database keep a copy in memory 
// to avoid unnecessary DB queries in the future
function AddTagToMySQLAndMemory($tag, &$conn){
  
  global $tags_to_id;
  
  // if the tag isn't in memory try to add it to the database
  if(empty($tags_to_id[$tag])){
    
    $sql = "INSERT INTO `Tags` (`ID`, `Tag`) VALUES (NULL, '$tag')";

    if ($conn->query($sql) === TRUE) {
        //echo "New tag added successfully" . PHP_EOL;
      // add to memory for faster look up in the future
      $tags_to_id[$tag] = GetIDForTag($tag, $conn); // get ID DB Assigned
    } else {
      // weird the Tag exists - did you reboot?
      // echo "Tag exists" . PHP_EOL;
      // add to memory for faster look up in the future
      $tags_to_id[$tag] = GetIDForTag($tag, $conn); // get ID DB Assigned
    }
  }
}



function AddTrigramToMySQL($gram_set, &$conn){
  
  
  $Word_A = GetIDForWord($gram_set['words'][0], $conn);
  $Word_B = GetIDForWord($gram_set['words'][1], $conn);
  $Word_C = GetIDForWord($gram_set['words'][2], $conn);
  $Tag_A = GetIDForTag($gram_set['tags'][0], $conn);
  $Tag_B = GetIDForTag($gram_set['tags'][1], $conn);
  $Tag_C = GetIDForTag($gram_set['tags'][2], $conn);
  
  $complete_trigram_set = true;
  
  if($Word_A == NULL || $Word_B == NULL || $Word_C == NULL ||
     $Tag_A == NULL || $Tag_B == NULL || $Tag_C == NULL){
    $complete_trigram_set = false;
  }

  if($complete_trigram_set == true){
      
    // Select the trigram if it exists in the database
    $sql = "SELECT * FROM `Trigrams` WHERE `Word_A`=$Word_A AND `Word_B`=$Word_B AND `Word_C`=$Word_C AND `Tag_A`=$Tag_A AND `Tag_B`=$Tag_B AND `Tag_C`=$Tag_C  LIMIT 1";

    $result = $conn->query($sql);

    // there is an instance of this pair
    if ($result->num_rows > 0) {
      
      // Obtain the record for the gram_set
      while($row = $result->fetch_assoc()) {
        $id = $row['ID'];
        $count = $row['Count'];
      }
      $count++; //gram_set encountered again, increment it.
      
      // push updated count to database
      $sql = "UPDATE `Trigrams` SET Count='$count' WHERE ID=$id";

      if ($conn->query($sql) === TRUE) {
          //echo "Trigram Count updated successfully" . PHP_EOL;
      } else {
          //echo "Error: " . $sql . PHP_EOL . $conn->error . PHP_EOL;
      }
    } else { // no previous gram_set instance
      
      // Add this gram_set
      $sql = "INSERT INTO `Trigrams` (`Count`, `Word_A`, `Word_B`, `Word_C`, `Tag_A`, `Tag_B`, `Tag_C`) VALUES ('1', '$Word_A', '$Word_B', '$Word_C', '$Tag_A', '$Tag_B', '$Tag_C')";
      if ($conn->query($sql) === TRUE) {
          //echo "New Trigram added successfully";
      } else {
          //echo "Error: " . $sql . PHP_EOL . $conn->error . PHP_EOL;
      }    
    }
  }
}


// Pull the id for a given word from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetIDForWord($word, &$conn){
  
  global $words_to_id;
  
  // if the word isn't in memory try to get it from the database
  if(empty($words_to_id[$word])){
  
    $sql = "SELECT * FROM `Words` WHERE `Word`='$word' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// word exists
      // Output the ID for this Word
      while($row = $result->fetch_assoc()) {
        return $row['ID'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $words_to_id[$word];
  }  
}


// Pull the word for a given id from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetWordForID($ID, &$conn){
  global $ids_to_words;
  
  // if the ID isn't in memory try to get it from the database
  if(empty($ids_to_words[$ID])){
    
    $sql = "SELECT * FROM `Words` WHERE `ID`='$ID' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// id exists
      // Output the Word for this ID
      while($row = $result->fetch_assoc()) {
        return $row['Word'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $ids_to_words[$ID];
  }
}


// Pull the id for a given tag from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetIDForTag($tag, &$conn){
  global $tags_to_id;
  
  // if the Tag isn't in memory try to get it from the database
  if(empty($tags_to_id[$tag])){
    $sql = "SELECT * FROM `Tags` WHERE `Tag`='$tag' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// tag exists
      // Output the ID for this tag
      while($row = $result->fetch_assoc()) {
        return $row['ID'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $tags_to_id[$tag];
  }
}


// Pull the tag for a given id from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetTagForID($ID, &$conn){
  global $ids_to_tags;
  
  // if the Tag isn't in memory try to get it from the database
  if(empty($ids_to_tags[$ID])){
    $sql = "SELECT * FROM `Tags` WHERE `ID`='$ID' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// ID exists
      // Output the Tag for this ID
      while($row = $result->fetch_assoc()) {
        return $row['Tag'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $ids_to_tags[$ID];
  }
}


// Get contents of a training file as a string
function GetFile($filename){
  $filename =  'brown' . DIRECTORY_SEPARATOR . $filename;
  $handle = fopen($filename, 'r');
  $contents = fread($handle, filesize($filename));
  fclose($handle);
  return $contents;
}


// data is a text file with word/tag
// capture the word and tag as group 1 & 2 split by a forward slash.
// example: (word || symbol)[/](tag)   the/article blue/adjective cat/noun ./.
// (1)(2): (the)(article) (blue)(adjective) (cat)(noun) (.)(.)
function PrepareData($textdata){
  
  $re = '/([^\s]+)[\/]([^\s]+)/m';
  preg_match_all($re, $textdata, $matches, PREG_SET_ORDER, 0);
  
  $data = array();
  foreach($matches as $key=>$match){
    $data['words'][$key] = $match[1];
    $data['tags'][$key] = $match[2];
  }
  return $data;
}


// data is an array
// $data['words'][i] = word or symbol
// $data['tags'][i] = tag for the associated word
function ExtractTrigrams($data){
  
  $trigrams = array();
  
  $word_count = count($data['words']);
  for($i=2; $i < $word_count; $i++){

    $w_a = $data['words'][$i-2];
    $w_b = $data['words'][$i-1];
    $w_c = $data['words'][$i];
    $t_a = $data['tags'][$i-2];
    $t_b = $data['tags'][$i-1];
    $t_c = $data['tags'][$i];
    
    $pack['words'] = array($w_a, $w_b, $w_c);
    $pack['tags'] = array($t_a, $t_b, $t_c);
    
    $trigrams[] = $pack;
  }
  
  return $trigrams;
}


// Get all the words from the DB with the word as the key and the id as the value
function GetAllWords(&$conn){
  $sql = "SELECT * FROM `Words`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['Word']] = $row['ID'];
    }
    return $words;
  }
  return NULL;
}


// Get all the words from the DB with the id as the key and the word as the value
function GetAllWordIDs(&$conn){
  $sql = "SELECT * FROM `Words`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['ID']] = $row['Word'];
    }
    return $words;
  }
  return NULL;
}


// Get all the tags from the DB with the tag as the key and the id as the value
function GetAllTags(&$conn){
  $sql = "SELECT * FROM `Tags`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['Tag']] = $row['ID'];
    }
    return $words;
  }
  return NULL;
}


// Get all the tags from the DB with the id as the key and the tag as the value
function GetAllTagIDs(&$conn){
  $sql = "SELECT * FROM `Tags`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['ID']] = $row['Tag'];
    }
    return $words;
  }
  return NULL;
}



$training_files = array('ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05', 'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17', 'cb18', 'cb19', 'cb20', 'cb21', 'cb22', 'cb23', 'cb24', 'cb25', 'cb26', 'cb27', 'cc01', 'cc02', 'cc03', 'cc04', 'cc05', 'cc06', 'cc07', 'cc08', 'cc09', 'cc10', 'cc11', 'cc12', 'cc13', 'cc14', 'cc15', 'cc16', 'cc17', 'cd01', 'cd02', 'cd03', 'cd04', 'cd05', 'cd06', 'cd07', 'cd08', 'cd09', 'cd10', 'cd11', 'cd12', 'cd13', 'cd14', 'cd15', 'cd16', 'cd17', 'ce01', 'ce02', 'ce03', 'ce04', 'ce05', 'ce06', 'ce07', 'ce08', 'ce09', 'ce10', 'ce11', 'ce12', 'ce13', 'ce14', 'ce15', 'ce16', 'ce17', 'ce18', 'ce19', 'ce20', 'ce21', 'ce22', 'ce23', 'ce24', 'ce25', 'ce26', 'ce27', 'ce28', 'ce29', 'ce30', 'ce31', 'ce32', 'ce33', 'ce34', 'ce35', 'ce36', 'cf01', 'cf02', 'cf03', 'cf04', 'cf05', 'cf06', 'cf07', 'cf08', 'cf09', 'cf10', 'cf11', 'cf12', 'cf13', 'cf14', 'cf15', 'cf16', 'cf17', 'cf18', 'cf19', 'cf20', 'cf21', 'cf22', 'cf23', 'cf24', 'cf25', 'cf26', 'cf27', 'cf28', 'cf29', 'cf30', 'cf31', 'cf32', 'cf33', 'cf34', 'cf35', 'cf36', 'cf37', 'cf38', 'cf39', 'cf40', 'cf41', 'cf42', 'cf43', 'cf44', 'cf45', 'cf46', 'cf47', 'cf48', 'cg01', 'cg02', 'cg03', 'cg04', 'cg05', 'cg06', 'cg07', 'cg08', 'cg09', 'cg10', 'cg11', 'cg12', 'cg13', 'cg14', 'cg15', 'cg16', 'cg17', 'cg18', 'cg19', 'cg20', 'cg21', 'cg22', 'cg23', 'cg24', 'cg25', 'cg26', 'cg27', 'cg28', 'cg29', 'cg30', 'cg31', 'cg32', 'cg33', 'cg34', 'cg35', 'cg36', 'cg37', 'cg38', 'cg39', 'cg40', 'cg41', 'cg42', 'cg43', 'cg44', 'cg45', 'cg46', 'cg47', 'cg48', 'cg49', 'cg50', 'cg51', 'cg52', 'cg53', 'cg54', 'cg55', 'cg56', 'cg57', 'cg58', 'cg59', 'cg60', 'cg61', 'cg62', 'cg63', 'cg64', 'cg65', 'cg66', 'cg67', 'cg68', 'cg69', 'cg70', 'cg71', 'cg72', 'cg73', 'cg74', 'cg75', 'ch01', 'ch02', 'ch03', 'ch04', 'ch05', 'ch06', 'ch07', 'ch08', 'ch09', 'ch10', 'ch11', 'ch12', 'ch13', 'ch14', 'ch15', 'ch16', 'ch17', 'ch18', 'ch19', 'ch20', 'ch21', 'ch22', 'ch23', 'ch24', 'ch25', 'ch26', 'ch27', 'ch28', 'ch29', 'ch30', 'cj01', 'cj02', 'cj03', 'cj04', 'cj05', 'cj06', 'cj07', 'cj08', 'cj09', 'cj10', 'cj11', 'cj12', 'cj13', 'cj14', 'cj15', 'cj16', 'cj17', 'cj18', 'cj19', 'cj20', 'cj21', 'cj22', 'cj23', 'cj24', 'cj25', 'cj26', 'cj27', 'cj28', 'cj29', 'cj30', 'cj31', 'cj32', 'cj33', 'cj34', 'cj35', 'cj36', 'cj37', 'cj38', 'cj39', 'cj40', 'cj41', 'cj42', 'cj43', 'cj44', 'cj45', 'cj46', 'cj47', 'cj48', 'cj49', 'cj50', 'cj51', 'cj52', 'cj53', 'cj54', 'cj55', 'cj56', 'cj57', 'cj58', 'cj59', 'cj60', 'cj61', 'cj62', 'cj63', 'cj64', 'cj65', 'cj66', 'cj67', 'cj68', 'cj69', 'cj70', 'cj71', 'cj72', 'cj73', 'cj74', 'cj75', 'cj76', 'cj77', 'cj78', 'cj79', 'cj80', 'ck01', 'ck02', 'ck03', 'ck04', 'ck05', 'ck06', 'ck07', 'ck08', 'ck09', 'ck10', 'ck11', 'ck12', 'ck13', 'ck14', 'ck15', 'ck16', 'ck17', 'ck18', 'ck19', 'ck20', 'ck21', 'ck22', 'ck23', 'ck24', 'ck25', 'ck26', 'ck27', 'ck28', 'ck29', 'cl01', 'cl02', 'cl03', 'cl04', 'cl05', 'cl06', 'cl07', 'cl08', 'cl09', 'cl10', 'cl11', 'cl12', 'cl13', 'cl14', 'cl15', 'cl16', 'cl17', 'cl18', 'cl19', 'cl20', 'cl21', 'cl22', 'cl23', 'cl24', 'cm01', 'cm02', 'cm03', 'cm04', 'cm05', 'cm06', 'cn01', 'cn02', 'cn03', 'cn04', 'cn05', 'cn06', 'cn07', 'cn08', 
'cn09', 'cn10', 'cn11', 'cn12', 'cn13', 'cn14', 'cn15', 'cn16', 'cn17', 'cn18', 'cn19', 'cn20', 'cn21', 'cn22', 'cn23', 'cn24', 'cn25', 'cn26', 'cn27', 'cn28', 'cn29', 'cp01', 'cp02', 'cp03', 'cp04', 'cp05', 'cp06', 'cp07', 'cp08', 'cp09', 'cp10', 'cp11', 'cp12', 'cp13', 'cp14', 'cp15', 'cp16', 'cp17', 'cp18', 'cp19', 'cp20', 'cp21', 'cp22', 'cp23', 'cp24', 'cp25', 'cp26', 'cp27', 'cp28', 'cp29', 'cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09');
$total_files = count($training_files);

$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';
$conn = ConnectToMySQL($server, $username, $password, $db);


// Get all known current words, tags and IDs. Inefficient, redundant calls, but this only runs once.
$words_to_id = GetAllWords($conn);
$ids_to_words = GetAllWordIDs($conn);
$tags_to_id = GetAllTags($conn);
$ids_to_tags = GetAllTagIDs($conn);

$log = fopen('Log.txt', 'w+'); // log file

foreach($training_files as $filenumber=>$training_file){
  echo "Processing file $filenumber of $total_files." . PHP_EOL;
  fwrite($log, $training_file . PHP_EOL); // log the name of the file we are working on
  
  // Get data and get it ready for the bot to learn
  $training_data = GetFile($training_file);
  $training_data = PrepareData($training_data);
  $training_data = ExtractTrigrams($training_data);
  //var_dump($training_data);
  
  foreach($training_data as $key=>$set){
    foreach($set as $group=>$trigrams){
      if($group == 'words'){
        // add words
        AddWordToMySQLAndMemory($trigrams[0], $conn);
        AddWordToMySQLAndMemory($trigrams[1], $conn);
        AddWordToMySQLAndMemory($trigrams[2], $conn);
      }
      elseif($group == 'tags'){
        // add tags
        AddTagToMySQLAndMemory($trigrams[0], $conn);
        AddTagToMySQLAndMemory($trigrams[1], $conn);
        AddTagToMySQLAndMemory($trigrams[2], $conn);
      }
    }
    // We know the words and tags are now in the DB & Memory
    // process the trigrams
    AddTrigramToMySQL($set, $conn);
  }
}
fclose($log);


DisconnectFromMySQL($conn);

 

I had hoped we could go farther this week and discuss trigrams but… as I said I’m still training the model so we’ll cover how to use it next week. In the meantime, remember to like, and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

It goes toward helping me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Tokenizing & Lexing Natural Language

Well, I guess the next question to ask is… can we lex a natural language?

When we lexed the PHP code in Can a Bot Understand a Sentence? we applied tags to the lexemes so that the intended meaning of each could be understood using grammar rules.

Well, it turns out that when talking about natural languages, lexing is referred to as Parts of Speech Tagging.

The top automated parts of speech taggers have achieved something like 97-98% accuracy when tagging previously unseen (though grammatically correct) text. I would say that pretty much makes this a solved problem!

Further, linguists and elementary school teachers have been doing this by hand for years! 😛

In practice everyone’s results will vary, but on average a potential of approximately 2 mis-tagged words out of 100 means that the challenge of building a natural language lexer shouldn’t be too difficult. Of course, even a 2% variance (meaning a bot that gets the tags wrong 2% of the time) can mean that the bot does the wrong thing (perhaps significantly) 2% of the time.

In any case before we can tag our lexemes we need to come up with a way to ‘tokenize’ a natural language sentence so let’s talk about that.

Tokenizing

Tokenizing means turning the raw text into ‘tokens’ by determining the bounds of each “word unit” or “part of speech” so that we can treat it as a separate component that can be programmatically acted upon as a lexeme.

In this case we can use the individual characters as our tokens.

If we use this sentence:

“The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think ‘it’ over. john.doe@web_server.com”

Tokens

We want our system to use all the characters (including spaces and punctuation) in the string as tokens like this:

["T","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n"," ","f","o","x"," ","j","u","m","p","s"," ","o","v","e","r"," ","t","h","e"," ","l","a","z","y"," ","d","o","g","."," ","A"," ","l","o","n","g","-","t","e","r","m"," ","c","o","n","t","r","a","c","t"," ","w","i","t","h"," ","\"","z","e","r","o","-","l","i","a","b","i","l","i","t","y","\""," ","p","r","o","t","e","c","t","i","o","n","!"," ","L","e","t","'","s"," ","t","h","i","n","k"," ","'","i","t","'"," ","o","v","e","r","."," ","j","o","h","n",".","d","o","e","@","w","e","b","_","s","e","r","v","e","r",".","c","o","m"]

Lexemes

Once we have the tokens we want the system to process those tokens into lexemes that a person would naturally say is a “whole” lexeme.

In this case, “whole” means that it’s a complete part of speech, so sometimes a lexeme is a multi-character word and sometimes it is a single-character delimiter.

Most of the time a lexeme will only contain letters, numbers or symbols but sometimes it should also contain some mixed combination, as would be the case with a hyphenated compound word e.g. zero-liability or a contraction e.g. Let’s.

Notice that we want the system to use the apostrophe to merge Let and s into Let’s because it’s a contraction and therefore a “whole” lexeme. We don’t want the apostrophes around the word ‘it’ (following the word ‘think’) combined, because the lexeme in that case is the word it, with the surrounding apostrophes acting as ‘single quotes’; those should be treated as separate lexemes, just like the “double quotes” around zero-liability.

Also, we want the system to capture the complex pattern of the email (john.doe@web_server.com) as a single lexeme.

Here’s what that looks like:

[
    "The",
    " ",
    "quick",
    " ",
    "brown",
    " ",
    "fox",
    " ",
    "jumps",
    " ",
    "over",
    " ",
    "the",
    " ",
    "lazy",
    " ",
    "dog",
    ".",
    " ",
    "A",
    " ",
    "long-term",
    " ",
    "contract",
    " ",
    "with",
    " ",
    "\"",
    "zero-liability",
    "\"",
    " ",
    "protection",
    "!",
    " ",
    "Let's",
    " ",
    "think",
    " ",
    "'",
    "it",
    "'",
    " ",
    "over",
    ".",
    " ",
    "john.doe@web_server.com"
]

 

Of course we would still need to apply tags to this list to complete the lexing process, but this solves the first problem of splitting natural language text into tokens and processing those tokens into a list of lexemes ready to be tagged.

We’ll work on tagging next week, but for now let’s look at the code that does this.

The Code

Here is the complete code that implements the tokenization of the lexemes. I’ll explain what is happening below but the code is commented for the programmers who are following along.

<?php 

function Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols){  
      
  $temp = '';                   // A temporary string used to hold incomplete lexemes
  $lexemes = array();           // Complete lexemes will be stored here for return
  $chars = str_split($text, 1); // Split the text string into characters.
  
  //var_dump(json_encode($chars, 1)); // convert $chars array to JSON and dump to screen

  // Step through all character tokens in the $chars array
  foreach($chars as $key=>$char){
        
    // If this $char token is in the $delimiters array
    // Then stop building $temp and add it and the delimiter to the $lexemes array
    if(in_array($char, $delimiters)){
      
      // Does temp contain data?
      if(strlen($temp) > 0){
        // $temp is a complete lexeme add it to the array
        $lexemes[] = $temp;
      }      
      $temp = ''; // Make sure $temp is empty
      
      $lexemes[] = $char; // Capture delimiter as a whole lexeme
    }
    else{// This $char token is NOT in the $delimiters array
      // Add $char to $temp and continue to next $char
      $temp .= $char; 
    }
    
  } // Step through all character tokens in the $chars array


  // Check if $temp still contains any residual lexeme data?
  if(strlen($temp) > 0){
    // $temp is a complete lexeme add it to the array
    $lexemes[] = $temp;
  }
  
  // We have processed all character tokens in the $chars array
  // Free the memory and garbage collect $chars & $temp
  $chars = NULL;
  $temp = NULL;
  unset($chars);
  unset($temp);


  // We now have the simplest lexemes extracted. 
  // Next we need to recombine compound-words, contractions 
  // And do any other processing with the lexemes.

  // If there are $chars in the $compound_word_symbols array
  if(!empty($compound_word_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $compound_word_symbols array
      if(in_array($lexeme, $compound_word_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        if($key > 0){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // If $before isn't a $delimiter
          if(!in_array($before, $delimiters)){
            // Merge it with the compound symbol
            $lexeme = $before . $lexeme;
            // And remove the $before $lexeme from $lexemes
            $lexemes[$key - 1] = NULL;
          }
        }
        
        // If this isn't the last $lexeme in $lexemes
        if($key < $number_of_lexemes - 1){
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          // If $after isn't a $delimiter
          if(!in_array($after, $delimiters)){
            // Merge the $lexeme it with
            $lexemes[$key + 1] = $lexeme . $after;
            // And remove the $lexeme
            $lexeme = NULL;
          }
        }
        
      } // Check if lexeme is in the $compound_word_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $compound_word_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the compound word merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  
  
  // If there are $chars in the $contraction_symbols array
  if(!empty($contraction_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $contraction_symbols array
      if(in_array($lexeme, $contraction_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        // and If this isn't the last $lexeme in $lexemes
        if($key > 0 && $key < $number_of_lexemes - 1){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          
          // If $before isn't a $delimiter
          // and $after isn't a $delimiter
          if(!in_array($before, $delimiters) && !in_array($after, $delimiters)){
            // Merge the contraction tokens
            $lexemes[$key + 1] = $before . $lexeme . $after;
            
            // Remove $before
            $lexemes[$key - 1] = NULL;
            // And remove this $lexeme
            $lexeme = NULL;            
          }

        }
        
      } // Check if lexeme is in the $contraction_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $contraction_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the contraction merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  

  // Return the $lexemes array.
  return $lexemes;
}

// Delimiters (Lexeme Boundaries)
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

// Symbols used to detect compound-words
$compound_word_symbols = array('-', '_');

// Symbols used to detect contractions
$contraction_symbols = array("'", '.', '@');

// Text to Tokenize and Lex
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com';

// Tokenize and extract the $lexemes from $text
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols);
echo json_encode($lexemes, 1); // output $lexemes as JSON

 

Splitting the Tokens

One way to do this would be to use regular expressions (regex) to match a pattern and it’s the method we used for the Email Relationship Classifier.

As part of that project I released a “Tokenizer” Class File that relied heavily on regex to match patterns but this isn’t the method we use in the Natural Language lexer, though we could have.

Common “advice” you will receive as a developer is that “you should NEVER use regex”, and while this is well-meaning advice, it is certainly wrong!

I find Regex works best when you understand the patterns you are looking for really well and the patterns won’t change much throughout your data-set, though the pattern can be very complex.

Now, the reason why you are often advised to avoid regex pattern matching is that it’s complicated; understanding the pattern match string is not always immediately intuitive and sometimes it’s downright difficult! This can be the case even if you are generally comfortable working with regex.

So the difficulty in using regex for most developers is a factor in my choice not to use regex in this case but the main reason is simply that it’s really not needed and it’s actually a lot simpler not to use regex to accomplish our goal.

So if not Regex then how?

Use Delimiters as a Guide to Word Boundaries

First create an array of delimiters that we can use as automatic word boundaries. In this case we can use a list of all the typable symbols that are not letters or numbers.

// Delimiters (Lexeme Boundaries) 
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

 

Use Compound Symbols to Grow Words

Next we need a group of symbols we know are always used to create compound words. Basically this means hyphens and underscores which should always be joined with their parent lexeme. The distinction here is that even if these symbols show up before, in the middle of or even after another lexeme they should be considered part of that lexeme. e.g. pre- or long-term or _something or Jane_Doe

// Symbols used to detect compound-words 
$compound_word_symbols = array('-', '_'); 

 

Use Contractions to Merge Ideas

Quotes (‘single’ & “double”) should be treated as separate lexemes and never be merged with the lexeme they contain. However, apostrophes should actually be merged with the lexemes before & after them, provided that neither is a delimiter. Also, sometimes a period and the @ symbol can behave like contraction symbols, as is the case with the example email: john.doe@web_server.com

// Symbols used to detect contractions 
$contraction_symbols = array("'", '.', '@'); 

 

Our Example Natural Language Text

Here is the test string of natural language.

// Text to Tokenize and Lex 
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com'; 

 

Extract Lexemes from Tokens Using Delimiter Symbols

We can now create a call to the Tokenize() function with our data and pass the results into an array which we format and echo as JSON.

// Tokenize and extract the $lexemes from $text 
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols); 
echo json_encode($lexemes, 1); // output $lexemes as JSON

 

Now, if we run our code we get all the lexemes extracted from the natural language test string.

Next week we will look at how we can tag the lexemes to complete the Lexical Analysis of a natural language so remember to like, and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy
