
Geek Girl Joy

Artificial Intelligence, Simulations & Software

Happy New Year 2019

Happy New Year everyone! I just wanted to check in and fill you in a little on what’s going on behind the scenes.

I’m starting the year off with a new computer and I’ve been reinstalling all my software and transferring my files… you know how it is.

This new computer is faster, has more memory and a larger hard drive. It also has a much more powerful video card, which will allow us to do even bigger projects this year, and I’ve already started experimenting!

I will get back to posting regularly soon! 🙂

Now, I could tell you some of the things that I’m working on but where’s the fun in that? 😛

Let’s just say that some of my ideas were out of our reach with my last machine but now, well… I’ll quote Doc Brown “If my calculations are correct… you’re gonna see some serious shit.”! 😉

Much Love,
~Joy


Happy Holidays Panic

So, this happened.

And you’re probably concerned… I don’t blame you!

The sad truth is that many people are struggling under heavy student loan debt.

Drowning under mortgage payments they can’t afford or simply living paycheck to paycheck!

It’s easy to get discouraged during the holiday season even in a normal year, but add insult to injury and suddenly you’re staring recession, or even full-blown economic depression, right in the face.

Continue reading “Happy Holidays Panic”

How to take Training Snapshots of your PHP FANN Neural Network

Today we’re going to look at how to take snapshots of your FANN Artificial Neural Network (ANN) while it’s training.

But why?

Well, maybe you want to fork competing GANs to progressively create ever more believable deepfakes because you want some of that sweet, sweet deepfake money… um… I mean, build a best-selling Writer Bot… 😉 😛

Perhaps you want to compare how different processes, configurations or algorithms affect your ANN during training rather than waiting till the end.

Or… maybe, you have a pug and toddler who conspire to take turns crawling under your desk and push the power switch on the surge protector ~60 hours into a long and complex training process and you’d rather not lose days of work… again!

OK… not a third time smarty pants! 😛
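I won’t spoil the full walkthrough, but to give you a flavor of the general idea (this is a minimal sketch, not necessarily the exact approach in the post): with the php-fann extension you can simply save the network to disk every so many epochs. The 2-3-1 layer layout, the snapshot interval and the 'my_training_data.data' file name below are placeholders for whatever your own project uses.

<?php
// Minimal checkpoint loop sketch using the php-fann extension.
// The layer sizes and 'my_training_data.data' are placeholders.
$ann  = fann_create_standard(3, 2, 3, 1); // 2 inputs, 3 hidden, 1 output
$data = fann_read_train_from_file('my_training_data.data');

$max_epochs     = 500000;
$snapshot_every = 1000; // save a snapshot every 1,000 epochs

for ($epoch = 1; $epoch <= $max_epochs; $epoch++) {
    $mse = fann_train_epoch($ann, $data); // one pass over the training data

    if ($epoch % $snapshot_every == 0) {
        // Write the current weights to disk so a surprise power cut
        // costs you at most $snapshot_every epochs of work.
        fann_save($ann, "snapshot_epoch_$epoch.net");
        echo "Epoch $epoch MSE: $mse" . PHP_EOL;
    }

    if ($mse <= 0.001) { break; } // desired error reached
}

fann_save($ann, 'final.net');
fann_destroy($ann);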

Continue reading “How to take Training Snapshots of your PHP FANN Neural Network”

Finished Prototype

 

Welcome, today we’re going to wrap up our Parts of Speech tagging bot prototype.

That doesn’t mean we’re done with this code (or database); it’s just that the prototype is functioning well enough to fulfill what we set out to accomplish (Tokenizing & Lexing Natural Language), so further development at this point is unnecessary for our purposes. There are still tons of improvements we could add if we wanted to turn this prototype into a full product, and I encourage you to experiment! 🙂

There were some changes to the Database (last week and this week) but I have uploaded the most recent data to the GitHub Project Repo: Part-Of-Speech-Tagger

We last left off with 3 unknown lexemes, and the word ‘jumps’ was mistagged.

Additionally, our tagging process is more or less effective, but it’s not quick, and that simply won’t do if we want to use our tagger for fun things in the future, so we’ll cover how to solve that today too.

There is much to discuss, so let’s start with the mistagged words.

Continue reading “Finished Prototype”

Unigrams

Welcome, we’re getting close to finishing our Parts of Speech tagger, and today we’re going to fix the unknown tags we ran into last week. But first, let’s talk about the owie.

 

The Bot Had An Owie

I was reviewing the database for the Parts of Speech Tagging system when my 20-month-old waddled up to my chair, pointed to the computer and said… ‘Bobot’.

He doesn’t quite grasp the subtle differences between robots and computers yet and I think it’s adorable! 🙂

I sat him on my lap and explained that the bot had an Owie and we needed to fix it.

We found that words like “Gracie’s” were missing from the Trigrams table but were in the words table.

Based on our methodology (see Train.php on GitHub), this shouldn’t happen.

With a little investigation it became clear that the issue occurred during the original training while the Bot read through The Brown Corpus.

 

My Mistake

Put simply, I forgot to pass the words going into the Trigrams table through mysqli_real_escape_string() (I did remember for the Words and Tags tables), which meant that when the database saw “Gracie’s” in the Trigrams INSERT SQL it treated the apostrophe as the end of the string.

This meant that even after I corrected the code I had to empty the database and retrain the bot from scratch.

Mildly frustrating for sure, but it isn’t as bad as it sounds. 🙂

Remember a couple weeks back when we optimized the initial training process to go from 3 weeks to about 10 hours? 😉

 

Retraining

I emptied the PartsOfSpeech MySQL database, invoked the Train.php script and walked away.

I returned about 5 AM to find that the bot had finished reading. I then ran AddHashes.php and again walked away.

Later, around 2 PM, I returned once more to find that the bot had finished computing the Bi-gram & Skip-gram hashes.

So… while a little annoying, it was a nice opportunity to test the whole setup and training process from scratch, which took more or less 20 hours from start to finish.

It’s important to note that importing the pre-learned data that I’ve uploaded to GitHub only takes a few minutes so you shouldn’t run Train.php unless you are teaching your bot a different training corpus.

Anyway, with the hole in the bot’s head patched, I set to work on correcting the unknown tags.

 

The Unknown Tags

Our methodology for tagging words is basically to look up the Tri-grams and Bi-grams in the database, then add and average the tag probabilities for a given word.

If we called Tri-gram tags and words “strongly” correlated, then we might call Bi-gram words and tags “generally” correlated.

So, if we can find our words as Tri-grams (preferable) or Bi-grams (acceptable), we can be reasonably confident in a low probability of error.

In other words, based on probability the average tag will be more or less correct most of the time.
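To make “add and average” concrete, here’s what it works out to for the word ‘over’, using the tag counts that appear in the detailed report further down this page (520 ‘in’, 114 ‘rp’ and 4 ‘in-hl’ observations). Turning raw tag counts into percentages is just a sum and a division:

<?php
// Tag counts collected for the word 'over' from the matching grams
// (the same numbers shown in the detailed report further down this page).
$tags = array('in' => 520, 'rp' => 114, 'in-hl' => 4);

arsort($tags);           // most frequent tag first
$sum = array_sum($tags); // 638 observations in total

foreach ($tags as $tag => $count) {
    echo "$tag: " . ($count / $sum * 100) . '%' . PHP_EOL;
}
// Roughly: in 81.5%, rp 17.9%, in-hl 0.6% - so 'over' gets tagged 'in'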

But, if the bot hasn’t seen…

The Tri-gram: Word A + Word B + Word C

The Skip-gram: Word A + * + Word C

The AB Bi-gram: Word A + Word B + *

The BC Bi-gram: * + Word B + Word C

What’s a bot to do? 😛

Unigrams

A Uni-gram is a single gram, so in this case… a word.

Uni-gram words are basically self-correlated… in that the word isn’t connected to any other word directly.

You might say it’s connected to the tags, which are a result of its use with other words, so perhaps it’s fair to call Uni-grams “loosely” correlated.

Because a Uni-gram may correlate to multiple tag types, it can sometimes be difficult for the tagger to figure out exactly which tag is correct. However, this is remedied by recalling that when we observe a Uni-gram’s tags we are observing all uses of that word.

So in cases where we have to rely on Uni-grams, we can just use the tag that is most likely on average. Not perfect, but better than unknown, because on average it should be correct.

 

Collecting The Unigram Tags

We could just do a naive global lookup for each Uni-gram at run time, every time we need to tag some text, but that’s extremely slow and wasteful.

The efficient approach is to find all the tags for all the Uni-grams once; then, later, when we need the tags for a Uni-gram, we can do a single query rather than traverse the entire Tri-grams table every time we can’t match a gram.

The Naive (Slow) Way

We could query the Words table for all the words and load them into an array in memory.

Then, for each word, query the Trigrams table. Each time a word is found in a Tri-gram, we could collect the tag.

If we did this for each word we would end up with a list of tags for each word.

So what’s wrong with this method? It will take many hours to complete.

The Correct (Fast) Way

There is a faster method! We already know the words in the Words table exist as Uni-grams. So rather than look up all the words individually with multiple queries, we can do 1 query to select all Trigrams:

$sql = "SELECT * FROM `Trigrams`";

Then build a list of words and tags as we walk through the rows returned by the database.

It might not be immediately apparent why this is faster but it has to do with how many times we review the same information.

By making only a single loop through the Trigrams table we review each Tri-gram only once, rather than once per Uni-gram (Uni-grams × Tri-grams).

That is, 906,846 row operations as opposed to 50,835,066,222 (fifty billion, eight hundred thirty-five million, sixty-six thousand, two hundred twenty-two).
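If you want to double-check that figure, it’s just the Trigrams row count multiplied by the bot’s 56,057-word vocabulary (the vocabulary size is mentioned in the Adding Bigrams & Skipgrams post further down this page):

<?php
// Naive approach: one full scan of the Trigrams table per word in the Words table
echo 906846 * 56057;   // 50,835,066,222 row operations
// Single-pass approach: one scan of the Trigrams table, total
echo PHP_EOL . 906846; // 906,846 row operations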

This cuts the process down to only a few minutes; go make some coffee and grab a snack, and when you return you will have collected all the Uni-gram tags.

 

The Code

Here’s the code; you can also find it on GitHub: CollectUnigramTags.php.

<?php
// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';

// Create connection
$conn = new mysqli($server, $username, $password, $db);

// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}

// Add additional TagSum & Tags field to the words table
echo 'Adding TagSum Field.' . PHP_EOL;
$sql = "ALTER TABLE `Words` ADD `TagSum` TEXT NOT NULL AFTER `Count`";
$conn->query($sql);

echo 'Adding Tags Field.' . PHP_EOL; 
$sql = "ALTER TABLE `Words` ADD `Tags` TEXT NOT NULL AFTER `TagSum`"; 
$conn->query($sql);

$words = array();
echo 'Locating Unigrams.' . PHP_EOL;

// Return all Trigrams
$sql = "SELECT * FROM `Trigrams`"; 

// Query the trigrams table
$result = $conn->query($sql);

// If there are Trigrams collect the tags for the Words
if($result->num_rows > 0){
  $i=0; // Keep track of current Trigram 
  echo 'Counting Unigrams Tags.' . PHP_EOL;
  while($row = mysqli_fetch_assoc($result)) {
    echo ++$i . PHP_EOL; // echo current Trigram
    
    // $words[md5 word hash][tag] += 1;
    // Example:
    // Unhashed: $words['the']['at'] += 1;
    // Hashed:   $words['8fc42c6ddf9966db3b09e84365034357']['at']++;
    @$words[hash('md5', $row["Word_A"])][$row["Tag_A"]]++;
    @$words[hash('md5', $row["Word_B"])][$row["Tag_B"]]++;
    @$words[hash('md5', $row["Word_C"])][$row["Tag_C"]]++;
  }
}


echo 'Updating Words.' . PHP_EOL;
foreach($words as $hash=>&$tags){
  if(count($tags) > 0){
    $sum = array_sum($tags); // Count the total number of tags
    $tags = json_encode($tags); // Encode the tag counts as JSON for storage
    
    // Update word using the Hash key
    $sql = "UPDATE `Words` SET `Tags` = '$tags', `TagSum` = '$sum' WHERE `Words`.`Hash` = '$hash';"; 
    $conn->query($sql);
    
    echo "$hash Updated!" . PHP_EOL; // Report the hash was updated
  }
}

$conn->close(); // disconnect from the database

 

Then all that was left to do was update Test.php:

// Merge unique lexemes (with tag data) into the lexemes
foreach($lexemes as $key=>$lexeme){
  // If we have a tag for the word 
  if(array_key_exists($lexeme, $unique_lexemes)){
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $unique_lexemes[$lexeme]);
  }else{
    // No Bi-gram, Skip-gram or Tri-gram

    // Try to look up the Unigram
    $sql = "SELECT * FROM `Words` WHERE `Word` = '$lexeme'";
    $result = $conn->query($sql);
    if($result->num_rows > 0){// We know this Uni-gram
      // Collect the tags for the Uni-gram
      while($row = mysqli_fetch_assoc($result)) {
        // Decode the Uni-gram tags from JSON into an associative array
        $tags = json_decode($row["Tags"], 1);

        // Sort the tags and compute %
        arsort($tags);
        $sum = array_sum($tags);
        foreach($tags as $tag=>&$score){
          $score = $score . ' : ' . ($score/$sum * 100) . '%';
        }
        $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $tags);
      }
    }else{ // We don't know this Uni-gram/word
        $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> array('unk'=>'1 : 100%',));
    }
  }
}
$conn->close(); // disconnect from the database

 

Results


Sentence: The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let's think it over.

Tagged Sentence: The/at quick/jj brown/jj fox/np jumps/nns over/in the/at lazy/jj dog/nn ./. A/at long-term/nn contract/vb with/in "/unk zero-liability/unk "/unk protection/nn-hl !/. Let's/vb+ppo think/vb it/ppo over/in ./. 

Tags: 
12 unique tags, 24 total.
at(3) - 12.50% of the sentence.
jj(3) - 12.50% of the sentence.
in(3) - 12.50% of the sentence.
.(3) - 12.50% of the sentence.
unk(3) - 12.50% of the sentence.
nn(2) - 8.33% of the sentence.
vb(2) - 8.33% of the sentence.
np(1) - 4.17% of the sentence.
nns(1) - 4.17% of the sentence.
nn-hl(1) - 4.17% of the sentence.
vb+ppo(1) - 4.17% of the sentence.
ppo(1) - 4.17% of the sentence.

These changes leave only 3 unknown lexemes, representing 12.50% of the sentence, and 2 of the unknowns are symbols, with the last being the compound word zero-liability.

I’ll count that as a huge success, but we can still do better: a few of the words are still mistagged, so we’ll have to at least attempt to fix that too. We’ll look at that next week.

Also, I haven’t updated the database dumps on GitHub yet as there will be more table updates coming next week and I’d prefer to do the upload once.

With that, please like this post & leave your thoughts in the comments.

Also, don’t forget to share this post with someone you think would find it interesting and hit that follow button to make sure you get all my new posts!

And before you go, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Parts of Speech Tagging

Welcome, as promised, we’re going to use our Parts of Speech Tagger today!

Are you excited? I know I am, so we’re going to get right to it, I promise. But quickly, before we do: if you’re just finding my content, here are the other posts in this series:

An Introduction to Writer Bot

Rule Based Stories

Artificial & Natural Language Processing

 

The Code

This code will review text given to it and then try to figure out what tag to assign each word by using the database to look up Trigram and Bigram patterns.

We’ll discuss the code further after you’ve had a chance to review it.

<?php


// This is our Tokenize function from the Tokenizing & Lexing Natural Language post
function Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols){  
      
  $temp = '';                   // A temporary string used to hold incomplete lexemes
  $lexemes = array();           // Complete lexemes will be stored here for return
  $chars = str_split($text, 1); // Split the text string into characters.
  
  // Step through all character tokens in the $chars array
  foreach($chars as $key=>$char){
        
    // If this $char token is in the $delimiters array
    // Then stop building $temp and add it and the delimiter to the $lexemes array
    if(in_array($char, $delimiters)){
      
      // Does temp contain data?
      if(strlen($temp) > 0){
        // $temp is a complete lexeme add it to the array
        $lexemes[] = $temp;
      }      
      $temp = ''; // Make sure $temp is empty
      
      $lexemes[] = $char; // Capture delimiter as a whole lexeme
    }
    else{// This $char token is NOT in the $delimiters array
      // Add $char to $temp and continue to next $char
      $temp .= $char; 
    }
    
  } // Step through all character tokens in the $chars array


  // Check if $temp still contains any residual lexeme data?
  if(strlen($temp) > 0){
    // $temp is a complete lexeme add it to the array
    $lexemes[] = $temp;
  }
  
  // We have processed all character tokens in the $chars array
  // Free the memory and garbage collect $chars & $temp
  $chars = NULL;
  $temp = NULL;
  unset($chars);
  unset($temp);


  // We now have the simplest lexemes extracted.
  // Next we need to recombine compound-words, contractions 
  // And do any other processing with the lexemes.

  // If there are $chars in the $compound_word_symbols array
  if(!empty($compound_word_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $compound_word_symbols array
      if(in_array($lexeme, $compound_word_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        if($key > 0){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // If $before isn't a $delimiter
          if(!in_array($before, $delimiters)){
            // Merge it with the compound symbol
            $lexeme = $before . $lexeme;
            // And remove the $before $lexeme from $lexemes
            $lexemes[$key - 1] = NULL;
          }
        }
        
        // If this isn't the last $lexeme in $lexemes
        if($key < $number_of_lexemes - 1){
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          // If $after isn't a $delimiter
          if(!in_array($after, $delimiters)){
            // Merge the $lexeme it with
            $lexemes[$key + 1] = $lexeme . $after;
            // And remove the $lexeme
            $lexeme = NULL;
          }
        }
        
      } // Check if lexeme is in the $compound_word_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $compound_word_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the compound word merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  
  
  // If there are $chars in the $contraction_symbols array
  if(!empty($contraction_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $contraction_symbols array
      if(in_array($lexeme, $contraction_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        // and If this isn't the last $lexeme in $lexemes
        if($key > 0 && $key < $number_of_lexemes - 1){
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          
          // If $before isn't a $delimiter
          // and $after isn't a $delimiter
          if(!in_array($before, $delimiters) && !in_array($after, $delimiters)){
            // Merge the contraction tokens
            $lexemes[$key + 1] = $before . $lexeme . $after;
            
            // Remove $before
            $lexemes[$key - 1] = NULL;
            // And remove this $lexeme
            $lexeme = NULL;            
          }

        }
        
      } // Check if lexeme is in the $contraction_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $contraction_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the contraction merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  

  // Return the $lexemes array.
  return $lexemes;
} // Tokenize()

// Remove unwanted delimiters or symbols from the Lexemes array
function Remove($lexemes, $remove_values){
    
    foreach($lexemes as &$lexeme){
        
        // if the lexeme is one that should  be removed
        if(in_array($lexeme, $remove_values)){
            $lexeme = NULL; // set it to null
        }
    }
    // Remove NULL, FALSE & "" but leaves values of 0 (zero)
    $lexemes = array_filter( $lexemes, 'strlen' );
  
    return array_values($lexemes);
}

// This takes an array of lexemes produced by the Tokenize() function 
// and returns an associative array containing tri-grams, bi-grams and skip-grams
function ExtractGrams($lexemes, $hash = true){
  
  $grams = array();
  
  $lexeme_count = count($lexemes);
  for($i=2; $i < $lexeme_count; $i++){
      if($hash == true){// hashed string - default
        $grams['trigrams'][] = hash('md5', $lexemes[$i-2] . $lexemes[$i-1] . $lexemes[$i]);
        $grams['skipgrams'][] = hash('md5', $lexemes[$i-2] . $lexemes[$i]);
     }
     else{// unhashed string
         $grams['trigrams'][] = $lexemes[$i-2] . $lexemes[$i-1] . $lexemes[$i];
         $grams['skipgrams'][] = $lexemes[$i-2] . $lexemes[$i];
     }
  }
  for($i=1; $i < $lexeme_count; $i++){
       if($hash == true){// hashed string - default
           $grams['bigrams'][] = hash('md5', $lexemes[$i-1] . $lexemes[$i]);
       }
       else{// unhashed string
           $grams['bigrams'][] = $lexemes[$i-1] . $lexemes[$i];
       }
  }
  
  return $grams;
}


// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';

// Create connection
$conn = new mysqli($server, $username, $password, $db);

// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}

// Delimiters (Lexeme Boundaries)
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

// Symbols used to detect compound-words
$compound_word_symbols = array('-', '_');

// Symbols used to detect contractions
//$contraction_symbols = array("'", '.', '@');
$contraction_symbols = array("'", '@');

// The text we want to tag
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think it over.';

// Tokenize and extract the $lexemes from $text
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols);

// Filter unwanted lexemes, in this case, we want to remove spaces since 
// the Brown Corpus doesn't use them and we don't really need them for anything.
$lexemes = Remove($lexemes, array(' '/*, Add other values to remove here*/));

// Extract the Lexemes into Bi-grams, Tri-grams & Skip-grams 
// using the new ExtractGrams() function
$grams = ExtractGrams($lexemes);

// Look up all the grams using their hashes to simplify and speed up
// the queries thanks to the reduced number of field comparisons.
foreach($grams as $skey=>&$gramset){
  foreach($gramset as $gkey=>&$gram){
    if($skey == 'trigrams'){
      $sql = "SELECT * FROM `Trigrams` WHERE `Hash` = '$gram' ORDER BY `Count` DESC"; 
    }
    elseif($skey == 'bigrams'){
      $sql = "SELECT * FROM `Trigrams` WHERE `Hash_AB` = '$gram' OR `Hash_BC` = '$gram' ORDER BY `Count` DESC"; 
    }
    elseif($skey == 'skipgrams'){
      $sql = "SELECT * FROM `Trigrams` WHERE `Hash_AC` = '$gram' ORDER BY `Count` DESC"; 
    }
    $gram = array('hash'=>$gram, 'sql'=>$sql);
    
    
    $result = $conn->query($gram['sql']);
    $gram['data'] = array();
    if($result->num_rows > 0){
       
      // Collect the data for this gram result
      while($row = mysqli_fetch_assoc($result)) {
        $gram['data'][] = array(
               'Hash'=> $row["Hash"],
               'Count'=> $row["Count"],
               'Word_A'=> $row["Word_A"],
               'Word_B'=> $row["Word_B"],
               'Word_C'=> $row["Word_C"],
               'Tag_A'=> $row["Tag_A"],
               'Tag_B'=> $row["Tag_B"],
               'Tag_C'=> $row["Tag_C"]);
      }
    }
  }
}

$conn->close(); // disconnect from the database

// Get a list of Unique lexemes
$unique_lexemes = array_keys(array_count_values($lexemes));

// Process the gram data for each word
foreach($grams as $skey=>&$gramset){
  foreach($gramset as $gkey=>&$gram){
    foreach($gram['data'] as $data){
            
      // If the word being considered is one we're looking for
      // collect the tag and increment its value
      if(in_array($data['Word_A'], $unique_lexemes)){
        @$unique_lexemes[$data['Word_A']][$data['Tag_A']]++;
      }
      if(in_array($data['Word_B'], $unique_lexemes)){
        @$unique_lexemes[$data['Word_B']][$data['Tag_B']]++;
      }
      if(in_array($data['Word_C'], $unique_lexemes)){
        @$unique_lexemes[$data['Word_C']][$data['Tag_C']]++;
      }
    }
  }
}

// Organize the data a little better and calculate the tag score
foreach ($unique_lexemes as $key => &$value) 
{ 
  // remove the strings in the numeric indexes
  if(is_numeric($key)){
    unset($unique_lexemes[$key]); 
  }
  else{// this array index is associative
     // sort the tags and compute %
    arsort($value);
    $sum = array_sum($value);
    foreach($value as $tag=>&$score){
      $score = $score . ' : ' . ($score/$sum * 100) . '%';
    }
  }
  
}
// Merge unique lexemes (with tag data) into the lexemes
foreach($lexemes as $key=>$lexeme){
  
  // If we have a tag for the word 
  if(array_key_exists($lexeme, $unique_lexemes)){
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> $unique_lexemes[$lexeme]);
  }else{
    // The word is unknown/no Bi-gram, Skip-gram or Tri-gram returned
    $lexemes[$key] = array('lexeme'=>$lexeme, 'tags'=> array('unk'=>'1 : 100%',));
  }
}

// Echo Original Sentence
echo 'Sentence: ' . $text . PHP_EOL;
echo PHP_EOL;

// Echo the Tagged Sentence
$unique_tags = array();
echo 'Tagged Sentence: ';
foreach($lexemes as $key=>$lexeme){
  $tag = key($lexeme['tags']);
  echo $lexeme['lexeme'] . '/' . $tag . ' ';
  @$unique_tags[$tag]++;
}
echo PHP_EOL . PHP_EOL;

// Echo the Basic Tags report
echo 'Tags: ' . PHP_EOL;
arsort($unique_tags);
$sum = array_sum($unique_tags);
echo count($unique_tags) . " unique tags, $sum total." . PHP_EOL;
foreach($unique_tags as $tag=>$count){

  echo "$tag($count) - " . number_format($count/$sum * 100, 2) . '% of the sentence.' . PHP_EOL;
}
echo PHP_EOL . PHP_EOL;

// Echo the Detailed Tags report
echo 'Detailed Report: ' . PHP_EOL;
foreach($lexemes as $key=>$lexeme){
  echo '[' . $lexeme['lexeme'] . ']'. PHP_EOL;
  
  $tags = '';
  foreach ($lexeme['tags'] as $tag=>$value){
    $tags .= "$tag($value)" . PHP_EOL;
  }
  
  echo trim($tags) . ' ' . PHP_EOL . PHP_EOL;
}

 

Our Process

This post is essentially the direct successor to Tokenizing & Lexing Natural Language; every post since then has been about building the bot’s knowledge base, extracting the data and structuring it so we can parse it more efficiently with minimal computational effort, though that isn’t to say we’ve reached the pinnacle either.

We take a string of plain unmodified text like this:

The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think it over.

…and we pass it to the Tokenize() function we wrote back in the Tokenizing & Lexing post.

What results is an array of word lexemes.

We then use a new function Remove() to filter unwanted lexemes from the array. In this case, we want to remove spaces since the Brown Corpus doesn’t use them and we don’t really need them for anything.

Next, we process the Lexemes into Bi-grams, Tri-grams & Skip-grams using the new ExtractGrams() function.

We then look up all the grams using their hashes to simplify and speed up the queries, thanks to the reduced number of field comparisons.

We note the results of the queries, do some basic addition, division and lots of string concatenation.

All of this gives us our results.

 

Results

If you run this code (and set up your database as outlined in other posts in this series) then you should be rewarded with these results:

Sentence: The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let's think it over.

Tagged Sentence: The/at quick/jj brown/jj fox/unk jumps/nns over/in the/at lazy/unk dog/nn ./. A/at long-term/nn contract/vb with/in "/unk zero-liability/unk "/unk protection/unk !/. Let's/unk think/vb it/ppo over/in ./.

Tags:
9 unique tags, 24 total.
unk(7) - 29.17% of the sentence.
at(3) - 12.50% of the sentence.
in(3) - 12.50% of the sentence.
.(3) - 12.50% of the sentence.
jj(2) - 8.33% of the sentence.
nn(2) - 8.33% of the sentence.
vb(2) - 8.33% of the sentence.
nns(1) - 4.17% of the sentence.
ppo(1) - 4.17% of the sentence.

Detailed Report:
[The]
at(3 : 100%)

[quick]
jj(1 : 100%)

[brown]
jj(2 : 100%)

[fox]
unk(1 : 100%)

[jumps]
nns(1 : 100%)

[over]
in(520 : 81.504702194357%)
rp(114 : 17.868338557994%)
in-hl(4 : 0.6269592476489%)

[the]
at(537 : 100%)

[lazy]
unk(1 : 100%)

[dog]
nn(22 : 100%)

[.]
.(1589 : 98.756991920447%)
.-hl(20 : 1.2430080795525%)

[A]
at(1373 : 97.792022792023%)
at-hl(26 : 1.8518518518519%)
nn(2 : 0.14245014245014%)
np-hl(2 : 0.14245014245014%)
at-tl-hl(1 : 0.071225071225071%)

[long-term]
nn(1 : 100%)

[contract]
vb(2 : 50%)
nn(2 : 50%)

[with]
in(6 : 85.714285714286%)
rb(1 : 14.285714285714%)

["]
unk(1 : 100%)

[zero-liability]
unk(1 : 100%)

["]
unk(1 : 100%)

[protection]
unk(1 : 100%)

[!]
.(1 : 100%)

[Let's]
unk(1 : 100%)

[think]
vb(28 : 100%)

[it]
ppo(144 : 72.361809045226%)
pps(55 : 27.638190954774%)

[over]
in(520 : 81.504702194357%)
rp(114 : 17.868338557994%)
in-hl(4 : 0.6269592476489%)

[.]
.(1589 : 98.756991920447%)
.-hl(20 : 1.2430080795525%)

 

What’s up with the Unknown Tags?

As you can see our bot is 100% certain it doesn’t know how to tag the word ‘lazy’ (as well as 6 other words).

I mean, it’s nice that the bot is being honest with us about its own perceived failings, but… I’m pretty sure it knows the word ‘lazy’; in fact, I’ll go so far as to accuse it of being lazy right now! 😛

A quick manual check of the database confirms that it’s seen the word lazy before so… what went wrong?

We’ll look at that next week so go ahead and hit that follow button to make sure you get my latest posts.

If you like this post feel free to leave a comment and don’t forget to share this post with someone you think would find it interesting.

Also, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Adding Bigrams & Skipgrams

Welcome, today we’re going to talk about AB & BC Bi-grams as well as AC Skip-grams for Parts of Speech tagging.

If you’re just finding my content here are the other posts in this series:

An Introduction to Writer Bot

Rule Based Stories

Artificial & Natural Language Processing

 

I said in my last post (Building A Faster Bot) that I wanted to look at how to feed our bot some text and use Trigrams to tag the words in this post, but… the truth is we have one final step before we’re ready to do that, and we’ll get to it next week for sure!

For now though we need to compute Bigrams & Skipgrams for our Trigrams.

 

What are “Grams” again?

Essentially, an “n-gram” is a set number of something in a unique pattern that we want to model. A “gram” is one item in the n-gram set.

We can use tags, numbers or, in some cases, the items themselves (like we did with words) as the grams. By collecting and categorizing the unique possibilities into groups of n-grams and then observing their use or existence, we can extract “hidden” probability information about our subject.

That’s what we accomplished during training by collecting and counting all the word Trigrams present in The Brown Corpus Database.

In addition to modeling probability, we modeled meaning by linking sets of trigram words with sets of trigram tags, and we cemented those links by hashing the grams together, which also boosts lookup speed.
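If that sounds abstract, here’s a tiny concrete example: sliding a three-word window over a sentence produces its word Tri-grams. This is essentially the loop the ExtractGrams() function in the tagger code uses, minus the hashing:

<?php
$words = array('the', 'quick', 'brown', 'fox', 'jumps');

// Slide a 3-word window over the sentence to list its Tri-grams
for ($i = 2; $i < count($words); $i++) {
    echo '(' . $words[$i - 2] . ')(' . $words[$i - 1] . ')(' . $words[$i] . ')' . PHP_EOL;
}
// (the)(quick)(brown)
// (quick)(brown)(fox)
// (brown)(fox)(jumps)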

 

So, Bigrams and Skipgrams are?

If “Tri-grams” are “three” word patterns, then “Bi-grams” are “two” word patterns, and “Skip-grams” are just a special kind of n-gram that skips some of the grams. In this case our skip-grams are bi-grams, though technically you can have a skip-gram that comprises more than two grams and skips more than one gram.

Additionally, practicality aside, nothing prevents a longer skip-gram from containing multiple nonconsecutive skips… though such a long, complex word skip-gram would have dubious value. Then again, there are always exceptions to rules, and who says you are modeling words in your case?

Here is an example that should hopefully make things more clear.

Given this tri-gram:

(one)(of)(the)

Hash: 320263779473e9ac2252940e0173a5b8

 

We can extract the following AB bi-gram:

(one)(of)

Hashed we get Hash_AB: hash(‘md5’, ‘oneof’) = 44d1d5bf437689cced8a62e192cdc49f

And this BC Bigram:

(of)(the)

Hashed we get Hash_BC:  hash(‘md5’, ‘ofthe’) = d2861a779f19cac959f0e0a6bc0bda24

Which leaves the Hash_AC skip-gram:

(one)(the)

Hashed we get Hash_AC:  hash(‘md5’, ‘onethe’) = adff8ebf224c1abcf98893cedb6db248

We’ll do this for all the available trigrams in the database.
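If you want to verify those hashes for yourself, each one is just the MD5 of the two concatenated words, built the same way as the Trigram Hash field:

<?php
// These should reproduce the Hash_AB, Hash_BC and Hash_AC values shown above
echo hash('md5', 'oneof')  . PHP_EOL; // Hash_AB for (one)(of)
echo hash('md5', 'ofthe')  . PHP_EOL; // Hash_BC for (of)(the)
echo hash('md5', 'onethe') . PHP_EOL; // Hash_AC for (one)(the)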

 

And… Why add Bigrams & Skipgrams?

This last step imparts additional speed benefits to our bot because the Trigrams alone will be insufficient to properly identify and tag every word in every sentence.

Think about it: the bot knows 56,057 words (which is more than the average native English speaker… so more than you and m… well… perhaps you… 😛 ), and the Oxford English Dictionary claims there are a little fewer than 200K words in English. That count is almost certainly too low if we include ancillary colloquialisms and slang as part of English, and for the purposes of parts of speech tagging, if not AI research in general, we’d almost certainly have to if we’re sourcing training materials from the web.

The number of trigrams our bot knows is 878,037, which, like the bot’s vocabulary, is limited compared to what is possible.

This is because our bot only trained on the Brown Corpus so it only knows the Trigrams which were present in the training material, but because we know that the training material was real text and not random gibberish, we know the trigrams are “high quality” learning material for our bot.

If we wanted to know the upper limit of how many trigrams there could be we simply need to know how many words the bot knows and then “cube” the bot’s vocabulary:

56,057³ = pow(56057, 3) = 1.7615280201719E+14

This means that if any combination (including combinations like (the)(the)(the)) is allowed, then there are 176,152,802,017,193 (one hundred seventy-six trillion, one hundred fifty-two billion, eight hundred two million, seventeen thousand, one hundred ninety-three) possible combinations. So… far more than the 878K we currently have!

But we know that combinations like (the)(the)(the) are bad, so we could exclude repeats from the count, but we don’t really gain anything by doing that: it only tells us how many trigrams are hypothetically possible, not which ones are actually valid.

And beyond whether a Trigram avoids repeating the same word, some words simply never work together and are invalid anyway, so knowing a combination can exist isn’t enough; otherwise we could just generate all possible 3-word combinations and be done.

To get things to work right we also need to correlate its probability with other patterns, which is what the count does.

But since we can’t look at all possible valid combinations (we’re not Google 😛 ) we have to get creative.

We can improve the bot’s ability to tag words, allowing it to solve the problem with less information, by computing AC Skip-grams and AB + BC Bi-grams.

This retains the same number of Trigrams but we gain 2,634,111 additional gram patterns (ways of evaluating text) that are otherwise hidden behind costly multi-field comparisons at run time.

Basically, this means that when a Trigram isn’t exactly what we want (but is very close), we can “back off” the trigram and use a Bi-gram or Skip-gram to tag a word instead, then combine the results.

Either way, the hashing is a shortcut and simply makes the comparisons we need to do when tagging text faster and reduces the strain on the database.

Now, because we won’t model all possible Bi-grams and Skip-grams, there will be gaps that they also fail to fill, and in those cases we will need to rely on aggregate Uni-grams. There’s no need to hash Uni-grams, though; a Uni-gram is simply a word by itself, so it’s faster (and computationally cheaper) to just compare the words directly at that point.
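To make the fall-through order concrete, here’s a rough illustration of the back-off idea only; the tagger we build in the next post actually runs all of these lookups and merges the results, so treat the ordering below as conceptual rather than the exact code:

<?php
// Conceptual back-off order for tagging the middle word of (jumps)(over)(the)
$a = 'jumps'; $b = 'over'; $c = 'the';

$lookups = array(
    // 1) Exact Tri-gram
    "SELECT * FROM `Trigrams` WHERE `Hash` = '"    . hash('md5', $a.$b.$c) . "'",
    // 2) AB / BC Bi-grams
    "SELECT * FROM `Trigrams` WHERE `Hash_AB` = '" . hash('md5', $a.$b)    . "'",
    "SELECT * FROM `Trigrams` WHERE `Hash_BC` = '" . hash('md5', $b.$c)    . "'",
    // 3) AC Skip-gram
    "SELECT * FROM `Trigrams` WHERE `Hash_AC` = '" . hash('md5', $a.$c)    . "'",
    // 4) Uni-gram fallback - no hash needed, just the word itself
    "SELECT * FROM `Words` WHERE `Word` = '$b'",
);

foreach ($lookups as $sql) {
    echo $sql . PHP_EOL; // try each in turn and stop at the first that returns rows
}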

 

The Code AddHashes.php

Here’s a link to the AddHashes.php code in the GitHub repo for this project.

<?php
/*
This program will connect to the PartsOfSpeechTagger database and add 3 additional fields
directly after the 'Hash' field.
We need to add 2 fields for 'Bigrams'
// hash(A && B) 
// hash(B && C) 
We also need 1 field for 'Skip-grams'
// hash(A && C) 
*/
// MySQL Server Credentials
$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';
// Create connection
$conn = new mysqli($server, $username, $password, $db);
// Check connection
if ($conn->connect_error) {
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}
// Add additional Hash fields
$sql = "ALTER TABLE `Trigrams` ADD `Hash_AB` VARCHAR(33) NOT NULL AFTER `Hash`, ADD `Hash_BC` VARCHAR(33) NOT NULL AFTER `Hash_AB`, ADD `Hash_AC` VARCHAR(33) NOT NULL AFTER `Hash_BC`";
$conn->query($sql);
// Add the Bigram and Skipgram hashes
$sql = "SELECT * FROM `Trigrams` WHERE `Hash_AB` = '' OR `Hash_BC` = '' OR `Hash_AC` = ''";
$result = $conn->query($sql);
$i = 1;
if ($result->num_rows > 0) {
  // output data of each row
  while($row = mysqli_fetch_assoc($result)) {
    
     // We already generated the Trigrams
     // A && B && C
     // Generate Bigram hashes
     // A && B 
     $Hash_AB = hash('md5', $row["Word_A"] . $row["Word_B"]);
     // B && C
     $Hash_BC = hash('md5', $row["Word_B"] . $row["Word_C"]);
     
     // Generate Skip-gram hashes
     // A && C
     $Hash_AC = hash('md5', $row["Word_A"] . $row["Word_C"]);
     
     // Generate SQL
     $sql_AB = "UPDATE `Trigrams` SET `Hash_AB` = '$Hash_AB' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     $sql_BC = "UPDATE `Trigrams` SET `Hash_BC` = '$Hash_BC' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     $sql_AC = "UPDATE `Trigrams` SET `Hash_AC` = '$Hash_AC' WHERE `Trigrams`.`Hash` = '" . $row["Hash"] . "'";
     
     // Update Database
     $conn->query($sql_AB);
     $conn->query($sql_BC);
     $conn->query($sql_AC);
     echo $i . PHP_EOL;
     $i++;
  }
}
$conn->close();

Run this code overnight and you’ll be ready to use your Parts of Speech tagging bot; we’ll cover how to use it next week.

With that, please like this post & leave your thoughts in the comments.

Also, don’t forget to share this post with someone you think would find it interesting and hit that follow button to make sure you get all my new posts!

And before you go, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Building A Faster Bot

This week has been all about testing and optimizing the bot Train process.

The prototype we looked at in The Brown Corpus is way too slow and needed to be refactored before we could proceed.

There was a point where I knew it was going to take too long to be satisfactory but out of a perverse geek curiosity I couldn’t bring myself to cancel the training… I just had to see how slow (bad) the process really was! 😛

If we want to use this as the core of a bot/system that can do more than just parts of speech tagging (and we do), it needs to be FAST! To do anything really fun, it just can’t take 3 weeks to process 1 million words!

And yes, the Brown Corpus only needs to be learned once but any additional learning by the bot would also be just as slow…

Why was it so slow???

Basically, the code was self-blocking. Every step had to complete before the next step could begin.

All the words & tags were added to the database before the trigrams could be learned and we had to wait each time for the database to generate an ID.

I did cache the IDs for the words and tags the bot encountered in memory for faster lookup, but… it was ultimately just a slow process regardless of this optimization.

What that did, however, was help keep the average training time per document fairly consistent. Trigrams, on the other hand, were queried by searching the database for any trigram where A & B & C were present, LIMIT 1. Needless to say, that is super inefficient!

Though clearly not the Training process we wanted, it had a few things going for it:

Pros:

  • Quick to Build & Functional though Verbose:
    What was nice about this method was that the code was easy to write and, though long, it should be fairly easy to read. Also, it can’t be overstated that the best way to get on the right track is to build a functional proof of concept and then iterate to improve your system.
  • Direct Numbers are Good for Bots
    We also gained the ability to do some interesting transforms by having a unique numeric value that the words, tags and trigrams are tied to. We’ll discuss this more in future posts, so I’ll leave it at that for now.

Cons:

  • SLOW!!!!
    3 Weeks is too slow!

~1,000,000 words / ~504 hours (3 weeks) = 1984.12698413 words per hour.

That equates to roughly 1 document an hour for 3 weeks straight!

  •  Can’t Divide and Conquer
    As the Train process was written, it was impossible to split the data set among more than 1 system and then later combine the tables without quite a bit of manual post-processing… which is just not a very pleasant thought! 😛 This is because the database assigned the IDs based on the order in which it encountered a word, tag or trigram. So, if you have two systems and split your training files between them, they will both assign different IDs to the same word, tag or trigram, and later we would have to read through all the tags, words and trigrams and change the IDs so they matched before we could merge the tables.

 

What changed in the refactor?

We switched to a batch-processing method where we process 10 files in memory, transfer the data to the database, clear the memory and move on to the next batch of 10 files until we have processed them all.

This keeps the memory requirements of the training process very low: each batch of 10 training files requires on average only ~25 MB of RAM to go from raw text to database, which the bot quickly frees when it’s done.
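The real Train.php (linked below) is where the details live, but the shape of the batching idea is roughly this sketch; the 'BrownCorpus/*.txt' path is just a placeholder and the actual counting and INSERT steps are elided:

<?php
// Rough outline of the batch idea only - not the actual Train.php.
// Count grams for 10 files in memory, flush them to the database, clear, repeat.
$batch_size = 10;
$files  = glob('BrownCorpus/*.txt'); // placeholder path to the training documents
$counts = array();

foreach ($files as $i => $file) {
    // ... tokenize $file and increment $counts for each trigram it contains ...

    // Every $batch_size files (or at the very end), write and reset
    if (($i + 1) % $batch_size == 0 || $i == count($files) - 1) {
        // ... INSERT/UPDATE the accumulated $counts into the Trigrams table ...
        $counts = array(); // free the memory before the next batch
    }
}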

Which brings us to hashing.

Hashing to the Rescue!

You might be asking… isn’t this a lot of work for something that seems simple? Why bother with hashing at all? Isn’t the batch processing memory trick enough? Well, batch processing was a response to implementing hashing.

You see, we needed a way to reduce the number of comparisons when doing lookups.

Consider this comparison:

(A == Wa && B == Wb && C == Wc)

that’s three compares (all must be true) in order for the Trigram to be good, but if A & B are correct and C isn’t, that’s still 3 evaluations before you know to move on. If we could reduce those comparisons without losing the information they give us, we might save a lot of time during training as well as when using the bot later!

We also needed a way to have 2 or more machines assign the same “ID” value to a word, tag and trigram. This would allow us to split the training set among as many computers as we can get our hands on and make quick work of future training data.

Hashing solves both of these problems!

If you & I hash the same value using the same algorithm we will get the same result regardless of our distance from each other, the time of day or any other factor you can think of. We can do this without ever having to speak to each other, and our computers need not ever communicate directly. This property of hashing makes it an ideal solution for generating IDs that will line up without a centralized system issuing them. It’s basically how blockchain technology works, though this is far simpler.

Hashing also allows us to reduce 3 comparisons to 1 because we concatenate W_A + W_B + W_C like this:

<?php
// notice these two are the same - and always give the same result
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a
echo hash('md5', 'Thequickbrown'); // a05d6d1139097bfd312c9f1009691c2a

// notice these two are the same but different capitalization - different result
echo hash('md5', 'fox'); // 2b95d1f09b8b66c5c43622a4d9ec9a04
echo hash('md5', 'Fox'); // de7b8fdc57c8a948bc0cf52b31b617f3

// A specific value always returns that specific result
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
echo hash('md5', 'jumpsoverthe'); // fa8b014923df32935641ca80b624a169
?>

Hashing yields a highly unique (case-sensitive) value that represents the three words in the trigram and, as such, when we are looking for a specific trigram we can hash the values and obtain its exact ID rather than do an A & B & C comparison.
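In query terms, that’s the difference between matching three columns and matching one pre-computed hash column:

<?php
// Three-field comparison (the old, slow way):
$slow = "SELECT * FROM `Trigrams` WHERE `Word_A` = 'The' AND `Word_B` = 'quick' AND `Word_C` = 'brown' LIMIT 1";

// Single hashed lookup (the refactored way):
$fast = "SELECT * FROM `Trigrams` WHERE `Hash` = '" . hash('md5', 'Thequickbrown') . "' LIMIT 1";

echo $slow . PHP_EOL . $fast . PHP_EOL;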

It’s worth noting that hashing adds to the memory requirements of the bot (the hash is usually longer than the word it represents), so batch processing was added to address the increased memory demands of the hashed data.

The batch process eliminates the negative of having more information in memory (caused by hashing) by limiting how much RAM the program will need at any given moment.

Here’s a pros vs cons overview.

Pros:

  • Divide and Conquer!
    We can split the training data among as many computers as we have available.
  • No Significant Processing Required to Merge Tables
    All IDs will be the same, so there is no need to convert them.
  • ID Lookups are Eliminated
    Because the ID is the hashed representation of the word, tag or trigram, we never need to look up an ID. You just hash the value you are checking and use that as the ID.

Cons:

  • Hashing isn’t Fast!
    While approximately 4,812% faster and no longer taking 3 weeks, this code is still slow: it took 10 hours, 15 minutes and 50.4 seconds to process 1 million words into trigram patterns and store them in the database.

 

If you would like to obtain a copy of the new Training code you can get it on my GitHub here: Train.php

And of course what you’ve all been waiting for… the data:

Parts of Speech Tagger Data:

You don’t need to wait 10 hours running Train.php to get started using the Brown Corpus in your own projects! I’ve made the data available on my GitHub profile, where you can download it for free in SQL and CSV formats.

I wanted to release the data as XML as well, but the files were larger than GitHub would allow, and even the SQL and CSV files were just barely under the upload limit. GitHub complained… Oh, the things I do for my readers! 😛

MySQL:
Tags_Data.sql
Tags_Structure.sql
Trigrams_Data_1.sql
Trigrams_Data_2.sql
Trigrams_Data_3.sql
Trigrams_Data_4.sql
Trigrams_Structure.sql
Words_Data.sql
Words_Structure.sql

CSV:
Tags.csv
Trigrams_1.csv
Trigrams_2.csv
Trigrams_3.csv
Trigrams_4.csv
Words.csv

 

I hope you are enjoying building SkyNet… er… this Parts of Speech Tagger as much as I am. 😛

In the next post in this series we’ll look at how to feed the bot some text and use Trigrams to tag the words, so remember to like and follow so you won’t miss a single post!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your direct financial support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of my content.  It also helps me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you feel inclined to give me money and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Polybius VR

What follows is hard to explain and admittedly sounds like something out of a twilight zone episode, and maybe… that’s just what it was. A moment in time where the laws of reality broke.

About two weeks ago I was on the app store on my phone looking for new Cardboard VR apps… what can I say? I like VR and Cardboard VR is affordable! And if you use one of those micro USB to USB type A converters you can plug in a USB hub and attach a keyboard or even a mouse and get some decent VR capability for relatively cheap! 😛 😉

Anyway, after a little browsing of the “new releases” I stumbled across an app called Polybius VR by a company I had never heard of before (Sinneslöschen Inc.). It didn’t have any reviews yet, and not many downloads either, but since it was free I thought, why not? It’s easy enough to uninstall if it’s not interesting, right?

The app was huge (a few hundred MB) and took a couple of minutes to download which isn’t that unusual for VR apps.

I was home alone and would be all evening, so while Polybius was downloading I microwaved some Ramen noodles and refilled my JPL Women In Space mug with coffee; I love mine slightly bitter and black.

I sat back down at my bedroom desk, then paused to look out my window and enjoy the blood-red and purple Los Angeles sky as the sun sank low.

I sipped my coffee and got comfy in my chair. My keyboard was on my lap so I could use it while in VR. I put nail polish on the WASD keys, as well as a few others, so that I can find them without having to remove the VR headset. 😎 😉

I start Polybius and slide my phone into the headset and adjust the straps while the app loads.

In all directions I see an infinite abyss except for directly below me, where I see the sentence “(C) 2018 Sinneslöschen Inc.” glowing blue like copper sulphate crystals. The font is unusual, blocky and almost pixel like.

Out of the murky black void in front of me the words Polybius VR erupt and grow to become the only thing in my view.

The words seemed to be about the size of a single-story building and were wrapped in polygonal chains that seemed to crawl like cellular automata. The lines vibrated and jittered all over the text, changing shape to envelop the words. The effect oscillated between a smooth gradient and jagged pixel edges, which made it look like the lines were sometimes eating away at the text like acid.

It’s at this point where things start to get weird.

For lack of a better term I’d describe it as “missing time”. The experience for me seemed to last only a few minutes, and all I recall seeing was the copyright text followed by the Polybius VR logo, which flashed several times in very rapid succession, and then the screen on my cellphone just went black.

I pulled off the headset and frantically removed my phone to reveal that the screen was cracked, which was disappointing and that’s putting it mildly!

That’s when I gazed out my window again, only to realize that I could see stars in the sky!

As I said, it felt like at most a few minutes had passed, but the clock on the wall begged to differ when it read 8:57 PM. My computer and microwave also confirmed that roughly three hours had passed from when I first sat down.

After a more thorough examination of my phone it appeared that the battery had exploded and fused the entire thing into a paperweight.

I remembered I had installed the app on my SD card so I thought to recover it from the scrap but frustratingly it was also damaged beyond recovery, though thankfully I had my photos backed up!

I used my desktop to access the app store where I immediately searched for Polybius VR but nothing related came up.

Desperate for some reassurance of my own sanity I turned to Google like anyone in my position would and typed ‘Polybius’.

The very first link returned did little to alleviate my growing concern. The Wikipedia article “Polybius (urban legend)” opens with this sentence:

“Polybius is a fictitious arcade game, the subject of an urban legend…”

How could it be an urban legend? It was real! I installed it and it fried my phone, not to mention distorting my perception of time for 3 hours!

I spent the better part of the next week researching the Polybius urban legend only to turn up myth after half truth. Website after website full of internet rumors, hoaxers and fake news.

I even reached out to a couple of grey-hats I used to work with to see if they knew of anyone working in wetware who might be able to pull off a hack that would cause something like missing time.

They both told me the same thing… Polybius was a myth and nobody was even close to that level of biohacking.

Piecing all the “facts” together for myself the Polybius urban legend seems to go as follows…

In the summer of 1981 somebody (usually claimed to be the CIA but sometimes it’s shadow mega corporations… aliens?) formed the mythical shell company “Sinneslöschen Inc.” with the clandestine charge of conducting civilian thought control experiments on unsuspecting people.

I half expect Fox Mulder and Dana Scully to show up any minute!

The Polybius project is said to have centered around using arcade games (it was the 80s) to attempt to turn anyone into a mindless puppet.

Few credible witnesses have ever come forward but one overwhelmingly reoccurring theme among all the Polybius stories is an “addictive” effect experienced by players.

Some claimed it became the only thing they could think about even when they weren’t playing… which I guess these days is pretty understandable. I mean, we’ve all known someone (or been that someone) who was so into a game that we’d describe them as ‘addicted’.

The typical scenario goes something like this: Polybius players would leave the house in the morning feeling the weight of pockets full of quarters!

Then wait, aching long hours while the clock shortened the distance between them and their next chance to play.

Once let out of work or school on their own recognizance the race between them and everyone else who wanted ‘THEIR’ machine was on!

You were lucky if you were an adult, because it meant you had a car and could get to the arcade first and pump the machine full of quarters till basic economics forced you to relinquish it to the next player, who in turn did the exact same thing.

Hours would churn and abused buttons induced pixels to strobe and undulate in hypnotic patterns while the polyphonic beeps rhythmically danced their strange digital melody.

In addition to the addictive effect described by players, some reports describe a “Polybius intoxication”; others refer to it as a sort of “madness” or “stupor”.

People who played Polybius for long stretches were said to occasionally experience something like a seizure, followed by what appeared to be a coma lasting anywhere from minutes to hours, after which they would wake up and remain “blank & zombie-like” for some time.

Additional side effects caused by playing Polybius were reported to be: amnesia, insomnia, night terrors, hallucinations, and rare unprovoked aggressive episodes.

One site poorly sourced a quote from an Oregon newspaper from the early ’80s describing a public arcade event where several Polybius machines were observed by an audience of a few dozen people for an extended amount of time. It was reported that several of the players & audience members became sick; they described some of the symptoms of Polybius intoxication as well as “zombie-like behavior” by those afflicted.

Putting urban legends aside, I’m still left with the question of what really happened that night.

I’d dismiss all of this outright as myth, half truth, internet rumors, hoaxers and fake news if it wasn’t for my experience with Polybius VR.

Is it possible that a neurohacker terrorist somewhere discovered a technique that could perhaps “reboot” a brain just by showing you some images?

The idea isn’t as sci-fi as it may seem. It turns out that a condition called confabulation can occur in both biological and artificial neural networks, so maybe someone figured out how to trigger a buffer overflow in a brain and packaged it in a VR app!

The idea that some faceless attacker could take control over your mind seems to be the ultimate violation of self.

Maybe my phone battery just died and my computer, microwave and the atomic clock on my wall collectively suffered from the same glitch. Perhaps this was just an elaborate prank at the expense of anyone who was unfortunate enough to install Polybius VR…

Or perhaps it’s all true and there really are monsters that lurk in the shadows ready to devour us just as soon as the flashlight battery dies.

It stands to reason that I will never truly know what happened to me that night and the events of those three hours will remain forever shrouded in my nightmares.

I guess if there’s a moral to be found here at all it would be… be careful what you install!

And with that, Happy Halloween everyone! Be safe out there tonight & If you come across an app called Polybius VR in the app store, do yourself a favor and take a pass on that one! 😉

Remember to like, share and follow & if you are one of the few other people to have downloaded Polybius VR before it mysteriously disappeared from the app store, consider leaving your personal experience below in the comments and since you’re still reading, why not help me grow…

Help Me Grow

Your support finances my work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me buy better tools and equipment so I can improve the quality of your reading material. It also helps me eat, pay rent and now of course I have to add a new cellphone to the budget! 😛

If you feel inclined to give me money and add your name (or business) on my Sponsors page then visit my Patreon page and pledge $1 or more a month and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy
