
Geek Girl Joy

Artificial Intelligence, Simulations & Software


November 2018

Unigrams

Welcome, we’re getting close to finishing our Parts of Speech tagger, and today we’re going to fix the unknown tags we ran into last week. But first, let’s talk about the owie.

 

The Bot Had An Owie

I was reviewing the database for the Parts of Speech Tagging system when my 20-month-old waddled up to my chair, pointed at the computer and said… ‘Bobot’.

He doesn’t quite grasp the subtle differences between robots and computers yet and I think it’s adorable! 🙂

I sat him on my lap and explained that the bot had an Owie and we needed to fix it.

We found that words like “Gracie’s” were missing from the Trigrams table but were in the words table.

Based on our methodology (see Train.php on GitHub), this shouldn’t happen.

With a little investigation it became clear that the issue occurred during the original training while the Bot read through The Brown Corpus.

 

My Mistake

Put simply, I forgot to pass the words going into the Trigrams table through mysqli_real_escape_string() (the Words and Tags tables were handled correctly), which meant that when the database saw “Gracie’s” in the Trigrams INSERT SQL it treated the apostrophe as the end of the string.
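For reference, here’s a minimal sketch of the corrected pattern (illustrative variable names and a simplified INSERT, not the exact lines from Train.php):

// Illustrative sketch only -- not the exact code from Train.php.
// Escape each word before it is concatenated into the Trigrams INSERT,
// so an apostrophe in a word like "Gracie's" can't end the SQL string early.
$word_a = mysqli_real_escape_string($conn, $word_a);
$word_b = mysqli_real_escape_string($conn, $word_b);
$word_c = mysqli_real_escape_string($conn, $word_c);

$sql = "INSERT INTO `Trigrams` (`Word_A`, `Word_B`, `Word_C`) VALUES ('$word_a', '$word_b', '$word_c')";
$conn->query($sql);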

This meant that even after I corrected the code I had to empty the database and retrain the bot from scratch.

Mildly frustrating for sure, but it isn’t as bad as it sounds. 🙂

Remember a couple weeks back when we optimized the initial training process to go from 3 weeks to about 10 hours? 😉

 

Retraining

I emptied the PartsOfSpeech MySQL database, invoked the Train.php script and walked away.

I returned about 5 AM to find that the bot had finished reading. I then ran AddHashes.php and again walked away.

Later, around 2 PM, I returned once more to find that the bot had finished computing the Bi-gram & Skip-gram hashes.

So… while a little annoying, it was a nice opportunity to test the whole setup and training process from scratch, which took more or less 20 hours from start to finish.

It’s important to note that importing the pre-learned data that I’ve uploaded to GitHub only takes a few minutes so you shouldn’t run Train.php unless you are teaching your bot a different training corpus.

Anyway, with the hole in the bot’s head patched, I set to work on correcting the unknown tags.

 

The Unknown Tags

Our methodology for tagging words is basically to look up the Tri-grams and Bi-grams in the database, then add and average the tag probabilities for a given word.

If we called Tri-gram tags and words “strongly” correlated, then we might call Bi-gram words and tags “generally” correlated.

So, if we can find our words as Tri-grams (preferable) or Bi-grams (acceptable), we can be reasonably confident in a low probability of error.

In other words, based on probability the average tag will be more or less correct most of the time.
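As a rough sketch of that idea (the counts here are made up for illustration, not real corpus numbers):

// Illustrative only: combine the tag counts gathered from Tri-gram and
// Bi-gram matches for one word, then report each tag's share of the total.
$trigram_tags = array('vb' => 8,  'nn' => 2);   // hypothetical counts
$bigram_tags  = array('vb' => 30, 'nn' => 20);  // hypothetical counts

$combined = array();
foreach(array($trigram_tags, $bigram_tags) as $tags){
  foreach($tags as $tag => $count){
    @$combined[$tag] += $count;
  }
}

arsort($combined);            // most likely tag first
$sum = array_sum($combined);  // 60 in this example
foreach($combined as $tag => $count){
  echo "$tag : " . ($count / $sum * 100) . '%' . PHP_EOL; // vb : 63.33%, nn : 36.67%
}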

But, if the bot hasn’t seen…

The Tri-gram: Word A + Word B + Word C

The Skip-gram: Word A + * + Word C

The AB Bi-gram: Word A + Word B + *

The BC Bi-gram: * + Word B + Word C

What’s a bot to do? 😛

Unigrams

A Uni-gram is a single gram, so in this case… a word.

Uni-gram words are basically self-correlated… in that the word isn’t connected to any other word directly.

You might say it’s connected to the tags, which are a result of its use with other words, so perhaps it’s fair to call Uni-grams “loosely” correlated.

Because a Uni-gram may correlate to multiple tag types, it can sometimes be difficult for the tagger to figure out exactly which tag is correct. This is remedied by recalling that when we observe a Uni-gram’s tags, we are observing all uses of that word.

So in cases where we have to rely on Uni-grams, we can just use the tag that is most likely on average. It’s not perfect, but it’s better than unknown, because on average it should be correct.
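In code terms that fallback is just picking the most common tag from the stored counts (a sketch mirroring the Uni-gram lookup in Test.php later in this post, with made-up counts):

// Sketch: given the Uni-gram tag counts stored for a word in the Words table,
// fall back to whichever tag occurs most often overall.
$tags = array('nn' => 22, 'vb' => 3); // hypothetical counts decoded from the Tags field
arsort($tags);                        // highest count first
$most_likely_tag = key($tags);        // 'nn' in this example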

 

Collecting The Unigram Tags

We could just do a naive global lookup for a Uni-gram at run time every time we need to tag some text, but that’s extremely slow and wasteful.

The efficient approach is to find all the tags for all the Uni-grams once, so that later, when we need to know the tags for a Uni-gram, we can do a single query rather than traverse the entire Trigrams table looking for Uni-grams every time we can’t match a gram.

The Naive (Slow) Way

We could query the Words table for all the words and load them into an array in memory.

Then, for each word, query the Trigrams table. Each time a word is found in a Tri-gram, we could collect the tag.

If we did this for each word we would end up with a list of tags for each word.

So what’s wrong with this method? It will take many hours to complete.

The Correct (Fast) Way

There is a faster method! We already know the words in the Words table exist as Uni-grams, so rather than look up all the words individually with multiple queries, we can do one query to select all the Trigrams:

$sql = "SELECT * FROM `Trigrams`";

Then build a list of words and tags as we walk through the rows returned by the database.

It might not be immediately apparent why this is faster but it has to do with how many times we review the same information.

By making only a single pass through the Trigrams table we review each Tri-gram once, rather than Uni-grams × Tri-grams times.

I.e. 906,846 row operations as opposed to 50,835,066,222 (fifty billion, eight hundred thirty-five million, sixty-six thousand, two hundred twenty-two).
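To see where that figure comes from (using the 56,057-word vocabulary mentioned later in this series):

// Naive way: every Uni-gram scans every Tri-gram row.
echo (56057 * 906846) . PHP_EOL; // 50835066222 row operations
// Single-pass way: one walk over the Tri-grams table.
echo 906846 . PHP_EOL;           // 906846 row operations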

This greatly speeds up the process to only a few minutes. Go make some coffee and grab a snack, and when you return you will have collected all the Uni-gram tags.

 

The Code

Here’s the code; you can also find it on GitHub as CollectUnigramTags.php.

<?php
// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';

// Create connection
$conn = new mysqli($server, $username, $password, $db);

// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}

// Add additional TagSum & Tags fields to the Words table
echo 'Adding TagSum Field.' . PHP_EOL;
$sql = "ALTER TABLE `Words` ADD `TagSum` TEXT NOT NULL AFTER `Count`";
$conn->query($sql);

echo 'Adding Tags Field.' . PHP_EOL; 
$sql = "ALTER TABLE `Words` ADD `Tags` TEXT NOT NULL AFTER `TagSum`"; 
$conn->query($sql);

$words = array();
echo 'Locating Unigrams.' . PHP_EOL;

// Return all Trigrams
$sql = "SELECT * FROM `Trigrams`"; 

// Query the trigrams table
$result = $conn->query($sql);

// If there are Trigrams collect the tags for the Words
if($result->num_rows > 0){
  $i=0; // Keep track of current Trigram 
  echo 'Counting Unigrams Tags.' . PHP_EOL;
  while($row = mysqli_fetch_assoc($result)) {
    echo ++$i . PHP_EOL; // echo current Trigram
    
    // $words[md5 word hash][tag] += 1;
    // Example:
    // Unhashed: $words['the']['at'] += 1;
    // Hashed:   $words['8fc42c6ddf9966db3b09e84365034357']['at']++;
    @$words[hash('md5', $row["Word_A"])][$row["Tag_A"]]++;
    @$words[hash('md5', $row["Word_B"])][$row["Tag_B"]]++;
    @$words[hash('md5', $row["Word_C"])][$row["Tag_C"]]++;
  }
}


echo 'Updating Words.' . PHP_EOL;
foreach($words as $hash=>&$tags){
  if(count($tags) > 0){
    $sum = array_sum($tags); // Count the total number of tags
    $tags = json_encode($tags, 1); // Encode the tag counts as JSON for storage
    
    // Update word using the Hash key
    $sql = "UPDATE `Words` SET `Tags` = '$tags', `TagSum` = '$sum' WHERE `Words`.`Hash` = '$hash';"; 
    $conn->query($sql);
    
    echo "$hash Updated!" . PHP_EOL; // Report the hash was updated
  }
}

$conn->close(); // disconnect from the database

 

Then all that was left to do was update Test.php:

// Merge unique lexemes (with tag data) into the lexemes
foreach($lexemes as $key=>$lexeme){
  // If we have a tag for the word 
  if(array_key_exists($lexeme, $unique_lexemes)){
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $unique_lexemes[$lexeme]);
  }else{
    // No Bi-gram, Skip-gram or Tri-gram

    // Try to look up the Unigram
    $sql = "SELECT * FROM `Words` WHERE `Word` = '$lexeme'";
    $result = $conn->query($sql);
    if($result->num_rows > 0){ // We know this Uni-gram
      // Collect the tags for the Uni-gram
      while($row = mysqli_fetch_assoc($result)) {
        // Decode the Uni-gram tags from JSON into an associative array
        $tags = json_decode($row["Tags"], 1);

        // Sort the tags and compute %
        arsort($tags);
        $sum = array_sum($tags);
        foreach($tags as $tag=>&$score){
          $score = $score . ' : ' . ($score/$sum * 100) . '%';
        }
        $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $tags);
      }
    }else{ // We don't know this Uni-gram/word
      $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> array('unk'=>'1 : 100%',));
    }
  }
}
$conn->close(); // disconnect from the database
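One small caution, in the spirit of the escaping bug from earlier in this post: $lexeme comes straight from the input text, so if you plan to tag arbitrary text it’s worth escaping it before it’s concatenated into the query. A sketch, not part of the original Test.php:

// Sketch: escape the lexeme before building the Uni-gram lookup query,
// so words like "Gracie's" don't break the SQL string.
$safe_lexeme = mysqli_real_escape_string($conn, $lexeme);
$sql = "SELECT * FROM `Words` WHERE `Word` = '$safe_lexeme'";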

 

Results


Sentence: The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let's think it over.

Tagged Sentence: The/at quick/jj brown/jj fox/np jumps/nns over/in the/at lazy/jj dog/nn ./. A/at long-term/nn contract/vb with/in "/unk zero-liability/unk "/unk protection/nn-hl !/. Let's/vb+ppo think/vb it/ppo over/in ./. 

Tags: 
12 unique tags, 24 total.
at(3) - 12.50% of the sentence.
jj(3) - 12.50% of the sentence.
in(3) - 12.50% of the sentence.
.(3) - 12.50% of the sentence.
unk(3) - 12.50% of the sentence.
nn(2) - 8.33% of the sentence.
vb(2) - 8.33% of the sentence.
np(1) - 4.17% of the sentence.
nns(1) - 4.17% of the sentence.
nn-hl(1) - 4.17% of the sentence.
vb+ppo(1) - 4.17% of the sentence.
ppo(1) - 4.17% of the sentence.

These changes result in only 3 unknown lexemes representing 12.50% of the sentence and 2 of the unknowns are symbols with the last being the compound word zero-liability.

I’ll count that as a huge success, but we can still do better: a few of the words are still mistagged, so we’ll have to at least attempt to fix that too. We’ll look at that next week.

Also, I haven’t updated the database dumps on GitHub yet as there will be more table updates coming next week and I’d prefer to do the upload once.

With that, please like this post & leave your thoughts in the comments.

Also, don’t forget to share this post with someone you think would find it interesting and hit that follow button to make sure you get all my new posts!

And before you go, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy


Parts of Speech Tagging

Welcome, as promised, we’re going to use our Parts of Speech Tagger today!

Are you excited? I know I am, so we’re going to get right to it, I promise. But quickly, before we do, if you’re just finding my content, here are the other posts in this series:

An Introduction to Writer Bot

Rule Based Stories

Artificial & Natural Language Processing

 

The Code

This code will review the text given to it and then try to figure out which tag to assign each word by using the database to look up Tri-gram and Bi-gram patterns.

We’ll discuss the code further after you’ve had a chance to review it.

<?php


// This is our Tokenize function from the Tokenizing & Lexing Natural Language post
function Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols){  
      
  $temp = '';                   // A temporary string used to hold incomplete lexemes
  $lexemes = array();           // Complete lexemes will be stored here for return
  $chars = str_split($text, 1); // Split the text string into characters.
  
  // Step through all character tokens in the $chars array
  foreach($chars as $key=>$char){
        
    // If this $char token is in the $delimiters array
    // Then stop building $temp and add it and the delimiter to the $lexemes array
    if(in_array($char, $delimiters)){
      
      // Does temp contain data?
      if(strlen($temp) > 0){
        // $temp is a complete lexeme add it to the array
        $lexemes[] = $temp;
      }      
      $temp = ''; // Make sure $temp is empty
      
      $lexemes[] = $char; // Capture delimiter as a whole lexeme
    }
    else{// This $char token is NOT in the $delimiters array
      // Add $char to $temp and continue to next $char
      $temp .= $char; 
    }
    
  } // Step through all character tokens in the $chars array


  // Check if $temp still contains any residual lexeme data?
  if(strlen($temp) > 0){
    // $temp is a complete lexeme add it to the array
    $lexemes[] = $temp;
  }
  
  // We have processed all character tokens in the $chars array
  // Free the memory and garbage collect $chars & $temp
  $chars = NULL;
  $temp = NULL;
  unset($chars);
  unset($temp);


  // We now have the simplest lexemes extracted. 
  // Next we need to recombine compound-words, contractions 
  // And do any other processing with the lexemes.

  // If there are $chars in the $compound_word_symbols array
  if(!empty($compound_word_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $compound_word_symbols array
      if(in_array($lexeme, $compound_word_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        if($key > 0){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // If $before isn't a $delimiter
          if(!in_array($before, $delimiters)){
            // Merge it with the compound symbol
            $lexeme = $before . $lexeme;
            // And remove the $before $lexeme from $lexemes
            $lexemes[$key - 1] = NULL;
          }
        }
        
        // If this isn't the last $lexeme in $lexemes
        if($key < $number_of_lexemes){
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          // If $after isn't a $delimiter
          if(!in_array($after, $delimiters)){
            // Merge the $lexeme with $after
            $lexemes[$key + 1] = $lexeme . $after;
            // And remove the $lexeme
            $lexeme = NULL;
          }
        }
        
      } // Check if lexeme is in the $compound_word_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $compound_word_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the compound word merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  
  
  // If there are $chars in the $contraction_symbols array
  if(!empty($contraction_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $contraction_symbols array
      if(in_array($lexeme, $contraction_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        // and If this isn't the last $lexeme in $lexemes
        if($key > 0 && $key < $number_of_lexemes){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          
          // If $before isn't a $delimiter
          // and $after isn't a $delimiter
          if(!in_array($before, $delimiters) && !in_array($after, $delimiters)){
            // Merge the contraction tokens
            $lexemes[$key + 1] = $before . $lexeme . $after;
            
            // Remove $before
            $lexemes[$key - 1] = NULL;
            // And remove this $lexeme
            $lexeme = NULL;            
          }

        }
        
      } // Check if lexeme is in the $contraction_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $contraction_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the contraction merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  

  // Return the $lexemes array.
  return $lexemes;
} // Tokenize()

// Remove unwanted delimiters or symbols from the lexemes array
function Remove($lexemes, $remove_values){
    
    foreach($lexemes as &$lexeme){
        
        // if the lexeme is one that should  be removed
        if(in_array($lexeme, $remove_values)){
            $lexeme = NULL; // set it to null
        }
    }
    // Remove NULL, FALSE & "" but leaves values of 0 (zero)
    $lexemes = array_filter( $lexemes, 'strlen' );
  
    return array_values($lexemes);
}

// This takes an array of lexemes produced by the Tokenize() function 
// and returns an associative array containing tri-grams, bi-grams and skip-grams
function ExtractGrams($lexemes, $hash = true){
  
  $grams = array();
  
  $lexeme_count = count($lexemes);
  for($i=2; $i < $lexeme_count; $i++){
      if($hash == true){// hashed string - default
        $grams['trigrams'][] = hash('md5', $lexemes[$i-2] . $lexemes[$i-1] . $lexemes[$i]);
        $grams['skipgrams'][] = hash('md5', $lexemes[$i-2] . $lexemes[$i]);
     }
     else{// unhashed string
         $grams['trigrams'][] = $lexemes[$i-2] . $lexemes[$i-1] . $lexemes[$i];
         $grams['skipgrams'][] = $lexemes[$i-2] . $lexemes[$i];
     }
  }
  for($i=1; $i < $lexeme_count; $i++){
       if($hash == true){// hashed string - default
           $grams['bigrams'][] = hash('md5', $lexemes[$i-1] . $lexemes[$i]);
       }
       else{// unhashed string
           $grams['bigrams'][] = $lexemes[$i-1] . $lexemes[$i];
       }
  }
  
  return $grams;
}


// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';

// Create connection
$conn = new mysqli($server, $username, $password, $db);

// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}

// Delimiters (Lexeme Boundaries)
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

// Symbols used to detect compound-words
$compound_word_symbols = array('-', '_');

// Symbols used to detect contractions
//$contraction_symbols = array("'", '.', '@');
$contraction_symbols = array("'", '@');

// The text we want to tag
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think it over.';

// Tokenize and extract the $lexemes from $text
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols);

// Filter unwanted lexemes, in this case, we want to remove spaces since 
// the Brown Corpus doesn't use them and we don't really need them for anything.
$lexemes = Remove($lexemes, array(' '/*, Add other values to remove here*/));

// Extract the Lexemes into Bi-grams, Tri-grams & Skip-grams 
// using the new ExtractGrams() function
$grams = ExtractGrams($lexemes);

// Lookup all the grams using their hashes to simplify and speedup 
// the queries due to reduced number of field comparisons.
foreach($grams as $skey=>&$gramset){
  foreach($gramset as $gkey=>&$gram){
    if($skey == 'trigrams'){
      $sql = "SELECT * FROM `Trigrams` WHERE `Hash` = '$gram' ORDER BY `Count` DESC"; 
    }
    elseif($skey == 'bigrams'){
      $sql = "SELECT * FROM `Trigrams` WHERE `Hash_AB` = '$gram' OR `Hash_BC` = '$gram' ORDER BY `Count` DESC"; 
    }
    elseif($skey == 'skipgrams'){
      $sql = "SELECT * FROM `Trigrams` WHERE `Hash_AC` = '$gram' ORDER BY `Count` DESC"; 
    }
    $gram = array('hash'=>$gram, 'sql'=>$sql);
    
    
    $result = $conn->query($gram['sql']);
    $gram['data'] = array();
    if($result->num_rows > 0){
       
      // Collect the data for this gram result
      while($row = mysqli_fetch_assoc($result)) {
        $gram['data'][] = array(
               'Hash'=> $row["Hash"],
               'Count'=> $row["Count"],
               'Word_A'=> $row["Word_A"],
               'Word_B'=> $row["Word_B"],
               'Word_C'=> $row["Word_C"],
               'Tag_A'=> $row["Tag_A"],
               'Tag_B'=> $row["Tag_B"],
               'Tag_C'=> $row["Tag_C"]);
      }
    }
  }
}

$conn->close(); // disconnect from the database

// Get a list of Unique lexemes
$unique_lexemes = array_keys(array_count_values($lexemes));

// Process the gram data for each word
foreach($grams as $skey=>&$gramset){
  foreach($gramset as $gkey=>&$gram){
    foreach($gram['data'] as $data){
            
      // If the word being considered is one we're looking for
      // collect the tag and increment its value
      if(in_array($data['Word_A'], $unique_lexemes)){
        @$unique_lexemes[$data['Word_A']][$data['Tag_A']]++;
      }
      if(in_array($data['Word_B'], $unique_lexemes)){
        @$unique_lexemes[$data['Word_B']][$data['Tag_B']]++;
      }
      if(in_array($data['Word_C'], $unique_lexemes)){
        @$unique_lexemes[$data['Word_C']][$data['Tag_C']]++;
      }
    }
  }
}

// Organize the data a little better and calculate the tag score
foreach ($unique_lexemes as $key => &$value) 
{ 
  // remove the strings in the numeric indexes
  if(is_numeric($key)){
    unset($unique_lexemes[$key]); 
  }
  else{// this array index is associative
     // sort the tags and compute %
    arsort($value);
    $sum = array_sum($value);
    foreach($value as $tag=>&$score){
      $score = $score . ' : ' . ($score/$sum * 100) . '%';
    }
  }
  
}
// Merge unique lexemes (with tag data) into the lexemes
foreach($lexemes as $key=>$lexeme){
  
  // If we have a tag for the word 
  if(array_key_exists($lexeme, $unique_lexemes)){
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $unique_lexemes[$lexeme]);
  }else{
    // The word is unknown/no Bi-gram, Skip-gram or Tri-gram returned
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> array('unk'=>'1 : 100%',));
  }
}

// Echo Original Sentence
echo 'Sentence: ' . $text . PHP_EOL;
echo PHP_EOL;

// Echo the Tagged Sentence
$unique_tags = array();
echo 'Tagged Sentence: ';
foreach($lexemes as $key=>$lexeme){
  $tag = key($lexeme['tags']);
  echo $lexeme['lexeme'] . '/' . $tag . ' ';
  @$unique_tags[$tag]++;
}
echo PHP_EOL . PHP_EOL;

// Echo the Basic Tags report
echo 'Tags: ' . PHP_EOL;
arsort($unique_tags);
$sum = array_sum($unique_tags);
echo count($unique_tags) . " unique tags, $sum total." . PHP_EOL;
foreach($unique_tags as $tag=>$count){

  echo "$tag($count) - " . number_format($count/$sum * 100, 2) . '% of the sentence.' . PHP_EOL;
}
echo PHP_EOL . PHP_EOL;

// Echo the Detailed Tags report
echo 'Detailed Report: ' . PHP_EOL;
foreach($lexemes as $key=>$lexeme){
  echo '[' . $lexeme['lexeme'] . ']'. PHP_EOL;
  
  $tags = '';
  foreach ($lexeme['tags'] as $tag=>$value){
    $tags .= "$tag($value)" . PHP_EOL;
  }
  
  echo trim($tags) . ' ' . PHP_EOL . PHP_EOL;
}

 

Our Process

This post is essentially the direct successor to Tokenizing & Lexing Natural Language. Every post since then has been about building the bot’s knowledge base, extracting the data and structuring it so we can parse it with minimal computational effort, though that isn’t to say that we have reached the pinnacle either.

We take a string of plain unmodified text like this:

The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think it over.

…and we pass it to the Tokenize() function we wrote back in the Tokenizing & Lexing post.

What results is an array of word lexemes.

We then use a new function Remove() to filter unwanted lexemes from the array. In this case, we want to remove spaces since the Brown Corpus doesn’t use them and we don’t really need them for anything.

Next, we process the lexemes into Bi-grams, Tri-grams & Skip-grams using the new ExtractGrams() function.

We then look up all the grams using their hashes, which simplifies and speeds up the queries thanks to the reduced number of field comparisons.

We note the results of the queries, do some basic addition, division and lots of string concatenation.

All of this gives us our results.

 

Results

If you run this code (and set up your database as outlined in the other posts in this series) then you should be rewarded with these results:

Sentence: The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let's think it over.

Tagged Sentence: The/at quick/jj brown/jj fox/unk jumps/nns over/in the/at lazy/unk dog/nn ./. A/at long-term/nn contract/vb with/in "/unk zero-liability/unk "/unk protection/unk !/. Let's/unk think/vb it/ppo over/in ./.

Tags:
9 unique tags, 24 total.
unk(7) - 29.17% of the sentence.
at(3) - 12.50% of the sentence.
in(3) - 12.50% of the sentence.
.(3) - 12.50% of the sentence.
jj(2) - 8.33% of the sentence.
nn(2) - 8.33% of the sentence.
vb(2) - 8.33% of the sentence.
nns(1) - 4.17% of the sentence.
ppo(1) - 4.17% of the sentence.

Detailed Report:
[The]
at(3 : 100%)

[quick]
jj(1 : 100%)

[brown]
jj(2 : 100%)

[fox]
unk(1 : 100%)

[jumps]
nns(1 : 100%)

[over]
in(520 : 81.504702194357%)
rp(114 : 17.868338557994%)
in-hl(4 : 0.6269592476489%)

[the]
at(537 : 100%)

[lazy]
unk(1 : 100%)

[dog]
nn(22 : 100%)

[.]
.(1589 : 98.756991920447%)
.-hl(20 : 1.2430080795525%)

[A]
at(1373 : 97.792022792023%)
at-hl(26 : 1.8518518518519%)
nn(2 : 0.14245014245014%)
np-hl(2 : 0.14245014245014%)
at-tl-hl(1 : 0.071225071225071%)

[long-term]
nn(1 : 100%)

[contract]
vb(2 : 50%)
nn(2 : 50%)

[with]
in(6 : 85.714285714286%)
rb(1 : 14.285714285714%)

["]
unk(1 : 100%)

[zero-liability]
unk(1 : 100%)

["]
unk(1 : 100%)

[protection]
unk(1 : 100%)

[!]
.(1 : 100%)

[Let's]
unk(1 : 100%)

[think]
vb(28 : 100%)

[it]
ppo(144 : 72.361809045226%)
pps(55 : 27.638190954774%)

[over]
in(520 : 81.504702194357%)
rp(114 : 17.868338557994%)
in-hl(4 : 0.6269592476489%)

[.]
.(1589 : 98.756991920447%)
.-hl(20 : 1.2430080795525%)

 

What’s up with the Unknown Tags?

As you can see our bot is 100% certain it doesn’t know how to tag the word ‘lazy’ (as well as 6 other words).

I mean, it’s nice that the bot is being honest with us about its own perceived failings but… I’m pretty sure it knows the word ‘lazy’. In fact, I’ll go so far as to accuse it of being lazy right now! 😛

A quick manual check of the database confirms that it’s seen the word lazy before so… what went wrong?

We’ll look at that next week so go ahead and hit that follow button to make sure you get my latest posts.

If you like this post feel free to leave a comment and don’t forget to share this post with someone you think would find it interesting.


 

 

Much Love,

~Joy

Adding Bigrams & Skipgrams

Welcome, today we’re going to talk about AB & BC Bi-grams as well as AC Skip-grams for Parts of Speech tagging.

If you’re just finding my content, here are the other posts in this series:

An Introduction to Writer Bot

Rule Based Stories

Artificial & Natural Language Processing

 

I said in my last post (Building A Faster Bot) that I wanted to look at how to feed our bot some text and use Trigrams to tag the words in this post but… the truth is, we have one final step before we are ready to do that, and we’ll get to it next week for sure!

For now though we need to compute Bigrams & Skipgrams for our Trigrams.

 

What are “Grams” again?

Essentially, an “n-gram” is a set number of something in a unique pattern that we want to model. A “gram” is one item in the n-gram set.

We can use tags, numbers or, in some cases, the items themselves (like we did with words) as the grams. By collecting and categorizing the unique possibilities into groups of n-grams and then observing their use or existence, we can extract “hidden” probability information about our subject.

That’s what we accomplished during training by collecting and counting all the word Trigrams present in The Brown Corpus Database.

In addition to modeling probability, we modeled meaning by linking sets of trigram words together with sets of trigram tags, and we cemented those links by hashing the grams together, which also imparts a boost to lookup speed.
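If that sounds abstract, here’s a toy sketch of the idea (a made-up word list, not the actual Train.php loop): slide a three-word window across the text and count how often each Tri-gram appears.

// Toy example only: count word Tri-grams in a short token list.
$words = array('one', 'of', 'the', 'best', 'one', 'of', 'the', 'worst');

$trigrams = array();
for($i = 2; $i < count($words); $i++){
  $key = $words[$i-2] . ' ' . $words[$i-1] . ' ' . $words[$i];
  @$trigrams[$key]++;
}

print_r($trigrams); // "one of the" occurs twice, every other Tri-gram once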

 

So, Bigrams and Skipgrams are?

If “Tri-grams” are “three” word patterns, then “Bi-grams” are “two” word patterns and “Skip-grams” are just a special kind of n-gram that skips some of the grams. In this case our skip-grams are bi-grams, though technically you can have a skip-gram that comprises more than two grams and skips more than one gram.

Additionally, practicality aside, nothing prevents a longer skip-gram from containing multiple nonconsecutive skips… though such a long, complex word skip-gram would be of dubious value. Then again, there are always exceptions to rules, and who says you are modeling words in your case?

Here is an example that should hopefully make things more clear.

Given this tri-gram:

(one)(of)(the)

Hash: 320263779473e9ac2252940e0173a5b8

 

We can extract the following AB bi-gram:

(one)(of)

Hashed we get Hash_AB: hash('md5', 'oneof') = 44d1d5bf437689cced8a62e192cdc49f

And this BC Bigram:

(of)(the)

Hashed we get Hash_BC: hash('md5', 'ofthe') = d2861a779f19cac959f0e0a6bc0bda24

Which leaves the Hash_AC skip-gram:

(one)(the)

Hashed we get Hash_AC: hash('md5', 'onethe') = adff8ebf224c1abcf98893cedb6db248

We’ll do this for all the available trigrams in the database.
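You can reproduce those hashes yourself with the same hash() calls that AddHashes.php uses below:

// Derive the Bi-gram and Skip-gram hashes for the (one)(of)(the) Tri-gram.
echo hash('md5', 'oneof')  . PHP_EOL; // Hash_AB
echo hash('md5', 'ofthe')  . PHP_EOL; // Hash_BC
echo hash('md5', 'onethe') . PHP_EOL; // Hash_AC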

 

And… Why add Bigrams & Skipgrams?

This last step imparts additional coverage and speed benefits to our bot, because the Trigrams alone will be insufficient to properly identify and tag every word in every sentence.

Think about it: the bot knows 56,057 words (which is more than the average native English speaker… so more than you and m… well… perhaps you… 😛 ), and the Oxford English Dictionary claims there are a little less than 200K words in English. That count is almost certainly low if we include ancillary colloquialisms and slang as part of English, and for the purposes of parts of speech tagging, if not AI research generally, we’d almost certainly have to include them if we’re sourcing training materials from the web.

The number of trigrams our bot knows is 878,037, which, like the bot’s vocabulary, is limited compared to what is possible.

This is because our bot only trained on the Brown Corpus, so it only knows the Trigrams that were present in the training material. But because we know the training material was real text and not random gibberish, we know the trigrams are “high quality” learning material for our bot.

If we wanted to know the upper limit of how many trigrams there could be we simply need to know how many words the bot knows and then “cube” the bot’s vocabulary:

56,057³ = pow(56057, 3) = 1.7615280201719E+14

This means that if every combination (including combinations like (the)(the)(the)) were valid, there would be One Hundred Seventy-Six Trillion, One Hundred Fifty-Two Billion, Eight Hundred Two Million, Seventeen Thousand, One Hundred Ninety possible combinations… far more than the 878K we currently have!

But we know that combinations like (the)(the)(the) are bad, so we could count only the combinations without repeats, but we don’t really gain anything by doing that; it only tells us how many trigrams are hypothetically possible, not which ones are actually valid.

Beyond whether a Trigram is valid simply because it isn’t the same word repeated, some words simply never work together and are invalid anyway, so knowing a Trigram can exist isn’t enough; otherwise we could just generate all possible 3-word combinations and be done.

To get things to work right we also need to correlate each Trigram’s probability with other patterns, which is what the count does.

But since we can’t look at all possible valid combinations (we’re not Google 😛 ) we have to get creative.

We can improve the bot’s ability to tag words, allowing it to solve the problem with less information, by computing AC Skip-grams and AB + BC Bi-grams.

This retains the same number of Trigrams but we gain 2,634,111 additional gram patterns (ways of evaluating text) that are otherwise hidden behind costly multi-field comparisons at run time.

Basically this means that when a Trigram isn’t exactly what we want (but very close) we can “back off” the trigram and use a Bi-gram or Skip-gram to tag a word instead and combine the results.
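Conceptually, the fallback order looks something like this. It’s only a sketch of the idea, not the tagger’s exact control flow, and TagsFrom() / UnigramTags() are hypothetical helpers standing in for the tag-collection loops in the tagger code:

// Sketch: prefer the strongest match available, then "back off".
// TagsFrom() and UnigramTags() are hypothetical placeholder helpers.
if(!empty($trigram_matches)){        // exact Word A + B + C match
  $tags = TagsFrom($trigram_matches);
}elseif(!empty($bigram_matches)){    // AB or BC Bi-gram match
  $tags = TagsFrom($bigram_matches);
}elseif(!empty($skipgram_matches)){  // AC Skip-gram match
  $tags = TagsFrom($skipgram_matches);
}else{
  $tags = UnigramTags($word);        // fall back to the word's own tag counts
}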

Either way, the hashing is a shortcut and simply makes the comparisons we need to do when tagging text faster and reduces the strain on the database.

Now, because we won’t model all possible Bi-grams and Skip-grams, there will be gaps that Bi-grams and Skip-grams also fail to fill, and in those cases we will need to rely on aggregate Uni-grams. There’s no need to hash Uni-grams though; a Uni-gram is simply a word by itself, so at that point it’s faster (and computationally cheaper) to just compare the words directly.

 

The Code AddHashes.php

Here’s a link to the AddHashes.php code in the GitHub repo for this project.

<?php
/*
This program will connect to the PartsOfSpeechTagger database and add 3 additional fields 
directly after the 'Hash' field.
We need to add 2 fields for 'Bigrams'
// hash(A && B) 
// hash(B && C) 
We also need 1 field for 'Skip-grams'
// hash(A && C) 
*/
// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';
// Create connection
$conn = new mysqli($server, $username, $password, $db);
// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}
// Add additional Hash fields
$sql = "ALTER TABLE `Trigrams` ADD `Hash_AB` VARCHAR(33) NOT NULL AFTER `Hash`, ADD `Hash_BC` VARCHAR(33) NOT NULL AFTER `Hash_AB`, ADD `Hash_AC` VARCHAR(33) NOT NULL AFTER `Hash_BC`";
$conn->query($sql);
// Add the Bigram and Skipgram hashes
$sql = "SELECT * FROM `Trigrams` WHERE `Hash_AB` = '' OR `Hash_BC` = '' OR `Hash_AC` = ''";
$result = $conn->query($sql);
$i = 1;
if ($result->num_rows > 0) {
  // output data of each row
  while($row = mysqli_fetch_assoc($result)) {
    
     // We already generated the Trigrams
     // A && B && C
     // Generate Bigram hashes
     // A && B 
     $Hash_AB = hash('md5', $row["Word_A"] . $row["Word_B"]);
     // B && C
     $Hash_BC = hash('md5', $row["Word_B"] . $row["Word_C"]);
     
     // Generate Skip-gram hashes
     // A && C
     $Hash_AC = hash('md5', $row["Word_A"] . $row["Word_C"]);
     
     // Generate SQL
     $sql_AB = "UPDATE `Trigrams` SET `Hash_AB` = '$Hash_AB' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     $sql_BC = "UPDATE `Trigrams` SET `Hash_BC` = '$Hash_BC' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     $sql_AC = "UPDATE `Trigrams` SET `Hash_AC` = '$Hash_AC' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     
     // Update Database
     $conn->query($sql_AB);
     $conn->query($sql_BC);
     $conn->query($sql_AC);
     echo $i . PHP_EOL;
     $i++;
  }
}
$conn->close();
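One optional tweak, and purely an assumption on my part rather than something in the original script: since the tagger will later query on these new hash columns, adding indexes on them should keep those lookups fast.

// Optional one-off query (an assumption, not part of AddHashes.php):
// index the new hash columns so SELECTs on Hash_AB / Hash_BC / Hash_AC
// don't have to scan the whole Trigrams table.
// (Run it before $conn->close(), or as a separate script.)
$sql = "ALTER TABLE `Trigrams` ADD INDEX `Hash_AB` (`Hash_AB`), ADD INDEX `Hash_BC` (`Hash_BC`), ADD INDEX `Hash_AC` (`Hash_AC`)";
$conn->query($sql);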

Run this code overnight and you’ll be ready to use your Parts of Speech tagging bot; we’ll cover that next week.

With that, please like this post & leave your thoughts in the comments.

Also, don’t forget to share this post with someone you think would find it interesting and hit that follow button to make sure you get all my new posts!


 

 

Much Love,

~Joy
