Welcome! We’re getting close to finishing our Parts of Speech tagger, and today we’re going to fix the unknown tags we ran into last week. But first, let’s talk about the owie.

 

The Bot Had An Owie

I was reviewing the database for the Parts of Speech Tagging system when my 20-month-old waddled up to my chair, pointed to the computer, and said… ‘Bobot’.

He doesn’t quite grasp the subtle differences between robots and computers yet and I think it’s adorable! 🙂

I sat him on my lap and explained that the bot had an Owie and we needed to fix it.

We found that words like “Gracie’s” were missing from the Trigrams table but were in the words table.

Based on our methodology (see Train.php on GitHub), this shouldn’t happen.

With a little investigation it became clear that the issue occurred during the original training while the Bot read through The Brown Corpus.

 

My Mistake

Put simply, I forgot to pass the words going into the Trigrams table through mysqli_real_escape_string() (the Words and Tags tables were fine), which meant that when the database saw “Gracie’s” in the Trigrams INSERT SQL, it treated the apostrophe as the end of the string.
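For anyone following along, here’s a minimal sketch of the fix. This isn’t the exact INSERT from Train.php (check GitHub for the real statement, which may use different columns); it just assumes a Trigrams table shaped like the one queried later in this post:

```php
<?php
// Sketch of the fix, assuming Trigrams columns like those used later in
// this post (Word_A..C) -- see Train.php on GitHub for the real INSERT.
$conn = new mysqli('localhost', 'root', 'password', 'PartsOfSpeechTagger');

// real_escape_string() turns "Gracie's" into "Gracie\'s", so the
// apostrophe can no longer terminate the SQL string early.
$a = $conn->real_escape_string("Gracie's");
$b = $conn->real_escape_string("house");
$c = $conn->real_escape_string("was");

$sql = "INSERT INTO `Trigrams` (`Word_A`, `Word_B`, `Word_C`)
        VALUES ('$a', '$b', '$c')";
$conn->query($sql);
```

A prepared statement ($conn->prepare() with bound parameters) sidesteps the problem entirely, but escaping is the smallest change to the existing training code.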

This meant that even after I corrected the code I had to empty the database and retrain the bot from scratch.

Mildly frustrating for sure, but it isn’t as bad as it sounds. 🙂

Remember a couple weeks back when we optimized the initial training process to go from 3 weeks to about 10 hours? 😉

 

Retraining

I emptied the PartsOfSpeech MySQL database, invoked the Train.php script and walked away.

I returned about 5 AM to find that the bot had finished reading. I then ran AddHashes.php and again walked away.

Later, around 2 PM, I returned once more to find that the bot had finished computing the Bi-gram & Skip-gram hashes.

So… while a little annoying, it was a nice opportunity to test the whole setup-and-train process from scratch, which took more or less 20 hours from start to finish.

It’s important to note that importing the pre-learned data I’ve uploaded to GitHub only takes a few minutes, so you shouldn’t run Train.php unless you are teaching your bot a different training corpus.

Anyway, with the hole in the bot’s head patched, I set to work on correcting the unknown tags.

 

The Unknown Tags

Our methodology for tagging words is basically to look up the Tri-grams and Bi-grams in the database, then add and average the tag probabilities for a given word.

If we called Tri-gram tags and words “strongly” correlated, then we might call Bi-gram words and tags “generally” correlated.

So, if we can find our words as Tri-grams (preferable) or Bi-grams (acceptable), we can be reasonably confident in a low probability of error.

In other words, based on probability the average tag will be more or less correct most of the time.
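As a quick illustration of that averaging (this is a simplified sketch, not the exact Test.php logic, and the probabilities are made up), we can combine the tag distributions suggested by a Tri-gram match and a Bi-gram match tag-by-tag:

```php
<?php
// Sketch: average the tag probability distributions gathered from the
// Tri-gram and Bi-gram lookups, then sort so the likeliest tag is first.
// The numbers below are illustrative, not real corpus figures.
function average_tag_probabilities(array $distributions): array {
    $combined = [];
    foreach ($distributions as $dist) {
        foreach ($dist as $tag => $p) {
            $combined[$tag] = ($combined[$tag] ?? 0) + $p; // add
        }
    }
    $n = count($distributions);
    foreach ($combined as $tag => $p) {
        $combined[$tag] = $p / $n; // average
    }
    arsort($combined); // likeliest tag first
    return $combined;
}

// "over" as suggested by Tri-gram vs. Bi-gram evidence:
$avg = average_tag_probabilities([
    ['in' => 0.9, 'rp' => 0.1],   // Tri-gram evidence (strong)
    ['in' => 0.6, 'rp' => 0.4],   // Bi-gram evidence (general)
]);
// $avg = ['in' => 0.75, 'rp' => 0.25] -- 'in' wins
```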

But, if the bot hasn’t seen…

The Tri-gram: Word A + Word B + Word C

The Skip-gram: Word A + * + Word C

The AB Bi-gram: Word A + Word B + *

The BC Bi-gram: * + Word B + Word C

What’s a bot to do? 😛

Unigrams

A Uni-gram is a single gram, so in this case… a word.

Uni-gram words are basically self-correlated… in that the word isn’t connected to any other word directly.

You might say it’s connected to the tags, which are a result of its use with other words, so perhaps it’s fair to call Uni-grams “loosely” correlated.

Because a Uni-gram may correlate to multiple tag types, it can sometimes be difficult for the tagger to figure out exactly which tag is correct. This is remedied by recalling that when we observe a Uni-gram’s tags we are observing all uses of that word.

So in cases where we have to rely on Uni-grams, we can just use the tag that is most likely on average. Not perfect, but better than unknown, because on average it should be correct.
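In code, that fallback is tiny. A minimal sketch (the counts below are made up for illustration, not real Brown Corpus figures):

```php
<?php
// Uni-gram fallback sketch: given a word's tag counts, take the single
// most frequent tag. Counts are illustrative, not real corpus figures.
function most_likely_tag(array $tag_counts): string {
    arsort($tag_counts);                // highest count first, keys kept
    return array_key_first($tag_counts); // the winning tag
}

echo most_likely_tag(['in' => 920, 'rp' => 130, 'jj' => 4]); // prints "in"
```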

 

Collecting The Unigram Tags

We could just do a naive global lookup for a Uni-gram at run time every time we need to tag some text, but that’s extremely slow and wasteful of energy.

The efficient approach is to find all the tags for all the Uni-grams once, so that later, when we need to know a Uni-gram’s tags, we can do a single query rather than traverse the entire Trigrams table every time we can’t match a gram.

The Naive (Slow) Way

We could query the Words table for all the words and load them into an array in memory.

Then, for each word, query the Trigrams table. Each time a word is found in a Tri-gram, we could collect the tag.

If we did this for each word we would end up with a list of tags for each word.

So what’s wrong with this method? It will take many hours to complete.

The Correct (Fast) Way

There is a faster method! We already know the words in the Words table exist as Uni-grams. So rather than look up each word individually with multiple queries, we can do one query to select all the Tri-grams:

$sql = "SELECT * FROM `Trigrams`";

Then build a list of words and tags as we walk through the rows returned by the database.

It might not be immediately apparent why this is faster but it has to do with how many times we review the same information.

By making only a single loop through the Trigrams table we only review a Tri-gram once rather than Uni-grams x Tri-grams.

I.e., 906,846 row operations as opposed to 50,835,066,222 (fifty billion, eight hundred thirty-five million, sixty-six thousand, two hundred twenty-two).

This greatly speeds up the process, to only a few minutes. Go make some coffee and grab a snack, and when you return you will have collected all the Uni-gram tags.
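The arithmetic is easy to check. One assumption here: the 56,057 word count isn’t quoted above; it’s the Words-table size implied by dividing the fifty-billion figure by the Tri-gram count:

```php
<?php
// Row-operation arithmetic from the paragraphs above. 906,846 is the
// Trigrams row count; 56,057 is the Words-table size implied by
// 50,835,066,222 / 906,846 -- an inferred figure, not one quoted here.
$trigrams = 906846;
$words    = 56057;

$fast = $trigrams;          // single pass over the Trigrams table
$slow = $words * $trigrams; // a full Trigrams scan once per word

echo $fast . PHP_EOL; // 906846
echo $slow . PHP_EOL; // 50835066222
```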

 

The Code

Here’s the code; you can also find it on GitHub as CollectUnigramTags.php.

<?php
// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';

// Create connection
$conn = new mysqli($server, $username, $password, $db);

// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}

// Add additional TagSum & Tags field to the words table
echo 'Adding TagSum Field.' . PHP_EOL;
$sql = "ALTER TABLE `Words` ADD `TagSum` TEXT NOT NULL AFTER `Count`";
$conn->query($sql);

echo 'Adding Tags Field.' . PHP_EOL; 
$sql = "ALTER TABLE `Words` ADD `Tags` TEXT NOT NULL AFTER `TagSum`"; 
$conn->query($sql);

$words = array();
echo 'Locating Unigrams.' . PHP_EOL;

// Return all Trigrams
$sql = "SELECT * FROM `Trigrams`"; 

// Query the trigrams table
$result = $conn->query($sql);

// If there are Trigrams collect the tags for the Words
if($result->num_rows > 0){
  $i=0; // Keep track of current Trigram 
  echo 'Counting Unigrams Tags.' . PHP_EOL;
  while($row = $result->fetch_assoc()) {
    echo ++$i . PHP_EOL; // echo current Trigram
    
    // $words[md5 word hash][tag] += 1;
    // Example:
    // Unhashed: $words['the']['at'] += 1;
    // Hashed:   $words['8fc42c6ddf9966db3b09e84365034357']['at']++;
    @$words[hash('md5', $row["Word_A"])][$row["Tag_A"]]++;
    @$words[hash('md5', $row["Word_B"])][$row["Tag_B"]]++;
    @$words[hash('md5', $row["Word_C"])][$row["Tag_C"]]++;
  }
}


echo 'Updating Words.' . PHP_EOL;
foreach($words as $hash=>&$tags){
  if(count($tags) > 0){
    $sum = array_sum($tags); // Count the total number of tags
    // Encode the tags as JSON and escape it -- tags like '' contain apostrophes
    $tags = $conn->real_escape_string(json_encode($tags));
    
    // Update word using the Hash key
    $sql = "UPDATE `Words` SET `Tags` = '$tags', `TagSum` = '$sum' WHERE `Words`.`Hash` = '$hash';"; 
    $conn->query($sql);
    
    echo "$hash Updated!" . PHP_EOL; // Report the hash was updated
  }
}

$conn->close(); // disconnect from the database
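Once the script finishes, a quick spot-check (not part of CollectUnigramTags.php; any common word will do) confirms a word picked up its new Tags JSON:

```php
<?php
// Spot-check after running CollectUnigramTags.php: pull one word's
// freshly written TagSum and Tags JSON back out of the Words table.
$conn = new mysqli('localhost', 'root', 'password', 'PartsOfSpeechTagger');

$result = $conn->query("SELECT `TagSum`, `Tags` FROM `Words` WHERE `Word` = 'over'");
if ($row = $result->fetch_assoc()) {
    echo 'TagSum: ' . $row['TagSum'] . PHP_EOL;
    print_r(json_decode($row['Tags'], true)); // tag => count pairs
}
$conn->close();
```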

 

Then all that was left to do was update Test.php:

// Merge unique lexemes (with tag data) into the lexemes
foreach($lexemes as $key=>$lexeme){
  // If we have a tag for the word 
  if(array_key_exists($lexeme, $unique_lexemes)){
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $unique_lexemes[$lexeme]);
  }else{
    // No Bi-gram, Skip-gram or Tri-gram

    // Try to look up the Unigram (escape it -- words like "Let's" contain apostrophes)
    $word = $conn->real_escape_string($lexeme);
    $sql = "SELECT * FROM `Words` WHERE `Word` = '$word'";
    $result = $conn->query($sql);
    if($result->num_rows > 0){ // We know this Uni-gram
      // Collect the tags for the Uni-gram
      while($row = $result->fetch_assoc()) {
        // Decode Uni-gram tags from JSON into an associative array
        $tags = json_decode($row["Tags"], 1);

        // Sort the tags and compute %
        arsort($tags);
        $sum = array_sum($tags);
        foreach($tags as $tag=>&$score){
          $score = $score . ' : ' . ($score/$sum * 100) . '%';
        }
        $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $tags);
      }
    }else{ // We don't know this Uni-gram/word
      $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> array('unk'=>'1 : 100%'));
    }
  }
}
$conn->close(); // disconnect from the database

 

Results


Sentence: The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let's think it over.

Tagged Sentence: The/at quick/jj brown/jj fox/np jumps/nns over/in the/at lazy/jj dog/nn ./. A/at long-term/nn contract/vb with/in "/unk zero-liability/unk "/unk protection/nn-hl !/. Let's/vb+ppo think/vb it/ppo over/in ./. 

Tags: 
12 unique tags, 24 total.
at(3) - 12.50% of the sentence.
jj(3) - 12.50% of the sentence.
in(3) - 12.50% of the sentence.
.(3) - 12.50% of the sentence.
unk(3) - 12.50% of the sentence.
nn(2) - 8.33% of the sentence.
vb(2) - 8.33% of the sentence.
np(1) - 4.17% of the sentence.
nns(1) - 4.17% of the sentence.
nn-hl(1) - 4.17% of the sentence.
vb+ppo(1) - 4.17% of the sentence.
ppo(1) - 4.17% of the sentence.

These changes leave only 3 unknown lexemes, representing 12.50% of the sentence; 2 of the unknowns are symbols, with the last being the compound word zero-liability.

I’ll count that as a huge success, but we can still do better. A few of the words are still mistagged, so we’ll have to at least attempt to fix that too, but we’ll look at that next week.

Also, I haven’t updated the database dumps on GitHub yet as there will be more table updates coming next week and I’d prefer to do the upload once.

With that, please like this post & leave your thoughts in the comments.

Also, don’t forget to share this post with someone you think would find it interesting and hit that follow button to make sure you get all my new posts!

And before you go, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy