
The Brown Corpus Database

Welcome back, today we’re going to peek inside the database for the Parts of Speech tagger.

Unfortunately the Raspberry Pi that I am using to train with is slow (cough… and my code is super un-optimized 😛 ) so… it’s still working on learning the complete Brown Corpus, though we’re almost there: less than 190 training files remaining!

Before proceeding here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

The Database

Here’s a recap of the database. It consists of three tables: Words, Tags & Trigrams. You can find the complete MySQL Database Setup script here: Create.PartsOfSpeech.DB.sql.

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


 

Words Table

The Words table keeps track of all the words the tagger knows.

The bot uses the IDs in place of the words, so given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to look up each word (provided it knows it) and replace it with the ID of the word in the Words table, like this:

1 43524 70488 515610 1149954 7158 1 266303 56280 309 43578 53868 1212 zero-liability 238658 482081 32358 423 7158

Notice that the bot was unable to look up the ID for the word “zero-liability”. This is because it never saw that word during training, so it would need to be “learned” by the system by assigning it a new ID and adding it to the database.
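
To make that concrete, here’s a minimal sketch (not from the repo) of the lookup step, assuming a $words_to_id array like the one the Train.php script further down keeps in memory:

<?php
// Hypothetical sketch: swap each known word for its ID.
// $words_to_id mirrors the in-memory lookup array Train.php builds.
$words_to_id = array('the' => 1, 'quick' => 43524, 'brown' => 70488, 'fox' => 515610);

$sentence = 'the quick brown fox';
$output = array();
foreach(explode(' ', $sentence) as $word){
  if(isset($words_to_id[$word])){
    $output[] = $words_to_id[$word]; // known word, use its ID
  }
  else{
    $output[] = $word; // unknown word, keep it until it's "learned"
  }
}
echo implode(' ', $output); // 1 43524 70488 515610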

Here’s an infographic that might help you understand the Words table:

Words Table - An infographic reviewing the words table.
Words Table

 

Tags Table

The Tags table keeps track of all the tags the tagger knows.

The bot uses the IDs in place of the tags, so given these words:

fox, jump, jumps, jumped

We would assign these tags:

fox/nn = singular or mass noun

jump/vb = verb, base form

jumps/vbz = verb, 3rd. singular present

jumped/vbd = verb, past tense

And the IDs for the tags would be represented as such:

21, 246, 138, 12

Here’s an infographic that might help you understand the Tags table:

Tags Table - An infographic reviewing the tags table.
Tags Table

 

Trigrams Table

The Trigrams table is the heart of the system and its job is to keep track of the associations between word trigrams (groups of 3 words) and tag trigrams (groups of 3 tags).

The Brown Corpus training data is split up into trigrams of words and tags so that when the bot learns it isn’t just learning individual words and tags but chains of words and tags.

This helps the bot learn that some words can have more than one meaning or role in a sentence. It also keeps a count of each time it sees a trigram so it can calculate the probability of each trigram and tag set.
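
To make the probability idea concrete, here’s a hedged sketch (not part of the repo) of how those counts could be turned into probabilities for one word trigram, assuming $conn is a mysqli connection like the one the training script creates:

<?php
// Hypothetical sketch: the probability of a tag trigram for a given word
// trigram is its Count divided by the total Count of all tag trigrams
// stored for those same three word IDs.
$Word_A = 1; $Word_B = 43524; $Word_C = 70488; // "the quick brown"

$sql = "SELECT `Tag_A`, `Tag_B`, `Tag_C`, `Count` FROM `Trigrams` WHERE `Word_A`=$Word_A AND `Word_B`=$Word_B AND `Word_C`=$Word_C";
$result = $conn->query($sql);

$rows = array();
$total = 0;
while($row = $result->fetch_assoc()){
  $rows[] = $row;
  $total += $row['Count'];
}

foreach($rows as $row){
  $probability = $row['Count'] / $total;
  echo $row['Tag_A'] . ' ' . $row['Tag_B'] . ' ' . $row['Tag_C'] . ': ' . $probability . PHP_EOL;
}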

Given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to extract the following trigrams represented here as JSON:

[
	["The","quick","brown"],
	["quick","brown","fox"],
	["brown","fox","jumps"],
	["fox","jumps","over"],
	["jumps","over","the"],
	["over","the","lazy"],
	["the","lazy","dog"],
	["lazy","dog","A"],
	["dog","A","long-term"],
	["A","long-term","contract"], 
	["long-term","contract","with"],
	["contract","with","zero-liability"],
	["with","zero-liability","protection"],
	["zero-liability","protection","Let's"],
	["protection","Let's","think"],
	["Let's","think","it"],
	["think","it","over"]
]
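
If you’re wondering how those trigrams get extracted, it’s just a 3-wide sliding window over the word list; here’s a minimal sketch of the idea (the Train.php script further down does the same thing in its ExtractTrigrams() function):

<?php
// Slide a 3-word window across the word list.
$words = array('the', 'quick', 'brown', 'fox', 'jumps');
$trigrams = array();
for($i = 2; $i < count($words); $i++){
  $trigrams[] = array($words[$i-2], $words[$i-1], $words[$i]);
}
echo json_encode($trigrams);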

 

And of course since we’re actually using word IDs and not the words themselves, we could change the words to their IDs in the JSON:

 [
	["1","43524","70488"],
	["43524","70488","515610"],
	["70488","515610","1149954"],
	["515610","1149954","7158"],
	["1149954","7158","1"],
	["7158","1","266303"],
	["1","266303","56280"],
	["266303","56280","309"],
	["56280","309","43578"],
	["309","43578","53868"],
	["43578","53868","1212"],
	["53868","1212","1161931"],
	["1212","1161931","238658"],
	["1161931","238658","482081"],
	["238658","482081","32358"],
	["482081","32358","423"],
	["32358","423","7158"]
]

The same can be done with the Tags.

 

Here’s an infographic that might help you understand the Trigrams table:

Trigrams Table - An infographic reviewing the trigrams table.
Trigrams Table

 

 

This is as far as we’ll get this week so remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

It goes toward helping me eat, pay rent and of course we can’t forget to buy diapers for Xavier now can we? 😛

My little Xavier Logich

 

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy


The Brown Corpus

Welcome, we’re going to talk about training a Parts of Speech tagging bot using the Brown Corpus.

The Brown Corpus

What’s the Brown Corpus? Basically, two linguists (Henry Kučera and W. Nelson Francis) combined their efforts at Brown University (thus ‘Brown Corpus’) in the early 1960s to create an English language corpus that computer scientists and AI researchers could use as a standard.

The corpus comprises 500 samples of English-language text; each sample is approximately 2,000 words long, give or take a few exceptions.

It covers topics such as Religion, Politics, News and even Science (Fiction & Non), not to mention multiple genres in each topic. Check out the Sample Distribution section on the Wikipedia page for the specifics if you are curious, but suffice it to say it’s extensive!

 

What Can You Do With It? 

Well, generally speaking its purpose is to act as a well documented & ‘tagged’ data set that you can compare your bot, word tagging system, or even something else… against to determine the accuracy of your model.

The thing is, that also means it makes a great resource to train a Parts of Speech tagging bot from. And well… that’s what we’re going to do! 😛

Before proceeding here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

Further, you may not sell it without obtaining permission from the licence holder. As far as I am aware you may not use it commercially.

As for the bot, once you understand how this system operates it’s relatively trivial to make modifications to this trigram tagging system or build your own from scratch.

The real difficulty in using a system like this is obtaining a well-tagged corpus of text with a commercially permissible use licence. They do exist for purchase, or you might find a CC0 or MIT licenced corpus, or here again… you could build your own from scratch, but that is a huge undertaking for a single developer or small group of developers.

How Do We Build It?

Ok, we all know I’m going to publish everything on my GitHub profile when I’m done and since I’m awesome, I’m probably going to export the data in convenient formats (JSON, XML… ) as well. 😉

Are there formats other than MySQL, JSON & XML that you would like? CSV? Let me know in the comments.
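
As a taste of what an export might look like, here’s a hedged sketch (not something that’s in the repo) that dumps the Words table to JSON, using the same credentials the training script below uses:

<?php
// Hypothetical sketch: export the Words table to JSON once training is done.
$conn = new mysqli('localhost', 'root', 'password', 'PartsOfSpeechTagger');
if($conn->connect_error){
  die("MYSQL DB Connection failed: " . $conn->connect_error);
}

$result = $conn->query("SELECT * FROM `Words`");
$words = array();
while($row = $result->fetch_assoc()){
  $words[$row['ID']] = $row['Word']; // ID => Word, like GetAllWordIDs()
}

file_put_contents('Words.json', json_encode($words));
$conn->close();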

To get started if you want to follow along you can find the GitHub repo and corpus HERE.

MySQL Database

I’m using a MySQL database to hold the “training data”. Once “trained” we can export the data to the other formats. Note that this database is unoptimized and is merely a rough prototype that focuses on function over form.

--
-- Database: `PartsOfSpeechTagger`
--
CREATE DATABASE IF NOT EXISTS `PartsOfSpeechTagger` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
USE `PartsOfSpeechTagger`;

-- --------------------------------------------------------

-- --------------------------------------------------------

--
-- Table structure for table `Tags`
--

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `Trigrams`
--

CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `Words`
--

CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Indexes for dumped tables
--

--
-- Indexes for table `Tags`
--
ALTER TABLE `Tags`
  ADD PRIMARY KEY (`ID`),
  ADD UNIQUE KEY `Tag` (`Tag`);

--
-- Indexes for table `Trigrams`
--
ALTER TABLE `Trigrams`
  ADD PRIMARY KEY (`ID`);

--
-- Indexes for table `Words`
--
ALTER TABLE `Words`
  ADD PRIMARY KEY (`ID`),
  ADD UNIQUE KEY `Word` (`Word`);

--
-- AUTO_INCREMENT for dumped tables
--

--
-- AUTO_INCREMENT for table `Tags`
--
ALTER TABLE `Tags`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
--
-- AUTO_INCREMENT for table `Trigrams`
--
ALTER TABLE `Trigrams`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
--
-- AUTO_INCREMENT for table `Words`
--
ALTER TABLE `Words`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
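
If you want to load the schema yourself, here’s a minimal sketch of one way to run the whole setup script from PHP (assuming the same localhost/root credentials Train.php uses below); the mysql command line client works just as well:

<?php
// Hypothetical sketch: run the setup script via mysqli::multi_query.
$conn = new mysqli('localhost', 'root', 'password');
if($conn->connect_error){
  die("MYSQL Connection failed: " . $conn->connect_error);
}

if($conn->multi_query(file_get_contents('Create.PartsOfSpeech.DB.sql'))){
  // drain every result set so the connection is left clean
  do{
    if($result = $conn->store_result()){
      $result->free();
    }
  }while($conn->more_results() && $conn->next_result());
}

$conn->close();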

 

Train.php

Next I wrote this terribly un-optimized training script that is, as my grandfather would have put it, “slower than molasses in January”, but once done it need not ever be run again (I’ve been training since the 6th 😛 ) so save yourself the trouble and don’t run this code! Wait for me to publish the finished data to the repo... hopefully sometime over the weekend or early next week.

<?php

// Create & return $conn object to hold connection to MySQL
function ConnectToMySQL($servername, $username, $password, $dbname){

  // Create connection
  $conn = new mysqli($servername, $username, $password, $dbname);
  // Check connection
  if ($conn->connect_error) {
    die("MYSQL DB Connection failed: " . $conn->connect_error);
  }
  
  return $conn;
}

// Disconnect $conn object that holds the connection to MySQL
function DisconnectFromMySQL(&$conn){
  $conn->close();
}


// If the word is in memory we know it, move on
// otherwise try adding it to the database
// if we add it to the database keep a copy in memory 
// to avoid unnecessary DB queries in the future
function AddWordToMySQLAndMemory($word, &$conn){
  
  global $words_to_id;
  
  // if the word isn't in memory try to add it to the database
  if(empty($words_to_id[$word])){
    $sql = "INSERT INTO `Words` (`ID`, `Word`) VALUES (NULL, '$word')";
    if ($conn->query($sql) === TRUE) {
        //echo "New word added successfully" . PHP_EOL;
      // add to memory for faster look up in the future
      $words_to_id[$word] = GetIDForWord($word, $conn); // get ID DB Assigned
    } else {
      // weird the Word exists - did you reboot?
      // echo "Word exists" . PHP_EOL; 
      // add to memory for faster look up in the future
      $words_to_id[$word] = GetIDForWord($word, $conn); // get ID DB Assigned
    }
  }
}


// If the tag is in memory we know it, move on
// otherwise try adding it to the database
// if we add it to the database keep a copy in memory 
// to avoid unnecessary DB queries in the future
function AddTagToMySQLAndMemory($tag, &$conn){
  
  global $tags_to_id;
  
  // if the tag isn't in memory try to add it to the database
  if(empty($tags_to_id[$tag])){
    
    $sql = "INSERT INTO `Tags` (`ID`, `Tag`) VALUES (NULL, '$tag')";

    if ($conn->query($sql) === TRUE) {
        //echo "New tag added successfully" . PHP_EOL;
      // add to memory for faster look up in the future
      $tags_to_id[$tag] = GetIDForTag($tag, $conn); // get ID DB Assigned
    } else {
      // weird the Tag exists - did you reboot?
      // echo "Tag exists" . PHP_EOL;
      // add to memory for faster look up in the future
      $tags_to_id[$tag] = GetIDForTag($tag, $conn); // get ID DB Assigned
    }
  }
}



function AddTrigramToMySQL($gram_set, &$conn){
  
  
  $Word_A = GetIDForWord($gram_set['words'][0], $conn);
  $Word_B = GetIDForWord($gram_set['words'][1], $conn);
  $Word_C = GetIDForWord($gram_set['words'][2], $conn);
  $Tag_A = GetIDForTag($gram_set['tags'][0], $conn);
  $Tag_B = GetIDForTag($gram_set['tags'][1], $conn);
  $Tag_C = GetIDForTag($gram_set['tags'][2], $conn);
  
  $complete_trigram_set = true;
  
  if($Word_A == NULL || $Word_B == NULL || $Word_C == NULL ||
     $Tag_A == NULL || $Tag_B == NULL || $Tag_C == NULL){
    $complete_trigram_set = false;
  }

  if($complete_trigram_set == true){
      
    // Select the trigram if it exists in the database
    $sql = "SELECT * FROM `Trigrams` WHERE `Word_A`=$Word_A AND `Word_B`=$Word_B AND `Word_C`=$Word_C AND `Tag_A`=$Tag_A AND `Tag_B`=$Tag_B AND `Tag_C`=$Tag_C  LIMIT 1";

    $result = $conn->query($sql);

    // there is an instance of this pair
    if ($result->num_rows > 0) {
      
      // Obtain the record for the gram_set
      while($row = $result->fetch_assoc()) {
        $id = $row['ID'];
        $count = $row['Count'];
      }
      $count++; //gram_set encountered again, increment it.
      
      // push updated count to database
      $sql = "UPDATE `Trigrams` SET Count='$count' WHERE ID=$id";

      if ($conn->query($sql) === TRUE) {
          //echo "Trigram Count updated successfully" . PHP_EOL;
      } else {
          //echo "Error: " . $sql . PHP_EOL . $conn->error . PHP_EOL;
      }
    } else { // no previous gram_set instance
      
      // Add this gram_set
      $sql = "INSERT INTO `Trigrams` (`Count`, `Word_A`, `Word_B`, `Word_C`, `Tag_A`, `Tag_B`, `Tag_C`) VALUES ('1', '$Word_A', '$Word_B', '$Word_C', '$Tag_A', '$Tag_B', '$Tag_C')";
      if ($conn->query($sql) === TRUE) {
          //echo "New Trigram added successfully";
      } else {
          //echo "Error: " . $sql . PHP_EOL . $conn->error . PHP_EOL;
      }    
    }
  }
}


// Pull the id for a given word from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetIDForWord($word, &$conn){
  
  global $words_to_id;
  
  // if the word isn't in memory try to get it from the database
  if(empty($words_to_id[$word])){
  
    $sql = "SELECT * FROM `Words` WHERE `Word`='$word' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// word exists
      // Output the ID for this Word
      while($row = $result->fetch_assoc()) {
        return $row['ID'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $words_to_id[$word];
  }  
}


// Pull the word for a given id from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetWordForID($ID, &$conn){
  global $ids_to_words;
  
  // if the ID isn't in memory try to get it from the database
  if(empty($ids_to_words[$ID])){
    
    $sql = "SELECT * FROM `Words` WHERE `ID`='$ID' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// id exists
      // Output the Word for this ID
      while($row = $result->fetch_assoc()) {
        return $row['Word'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $ids_to_words[$ID];
  }
}


// Pull the id for a given tag from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetIDForTag($tag, &$conn){
  global $tags_to_id;
  
  // if the Tag isn't in memory try to get it from the database
  if(empty($tags_to_id[$tag])){
    $sql = "SELECT * FROM `Tags` WHERE `Tag`='$tag' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// tag exists
      // Output the ID for this tag
      while($row = $result->fetch_assoc()) {
        return $row['ID'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $tags_to_id[$tag];
  }
}


// Pull the tag for a given id from memory if available
// fall back to the database if its not in memory
// return NULL if it's not in the database
function GetTagForID($ID, &$conn){
  global $ids_to_tags;
  
  // if the Tag isn't in memory try to get it from the database
  if(empty($ids_to_tags[$ID])){
    $sql = "SELECT * FROM `Tags` WHERE `ID`='$ID' LIMIT 1";
    $result = $conn->query($sql);
    
    if ($result->num_rows > 0) {// ID exists
      // Output the Tag for this ID
      while($row = $result->fetch_assoc()) {
        return $row['Tag'];
      }
    }
    return NULL; // not in DB
  }
  else{
    return $ids_to_tags[$ID];
  }
}


// Get contents of a training file as a string
function GetFile($filename){
  $filename =  'brown' . DIRECTORY_SEPARATOR . $filename;
  $handle = fopen($filename, 'r');
  $contents = fread($handle, filesize($filename));
  fclose($handle);
  return $contents;
}


// data is a text file with word/tag
// capture the word and tag as group 1 & 2 split by a forward slash.
// example: (word || symbol)[/](tag)   the/article blue/adjective cat/noun ./.
// (1)(2): (the)(article) (blue)(adjective) (cat)(noun) (.)(.)
function PrepareData($textdata){
  
  $re = '/([^\s]+)[\/]([^\s]+)/m';
  preg_match_all($re, $textdata, $matches, PREG_SET_ORDER, 0);
  
  $data = array();
  foreach($matches as $key=>$match){
    $data['words'][$key] = $match[1];
    $data['tags'][$key] = $match[2];
  }
  return $data;
}


// data is an array
// $data['words'][i] = word or symbol
// $data['tags'][i] = tag for the associated word
function ExtractTrigrams($data){
  
  $trigrams = array();
  
  $word_count = count($data['words']);
  for($i=2; $i < $word_count; $i++){

    $w_a = $data['words'][$i-2];
    $w_b = $data['words'][$i-1];
    $w_c = $data['words'][$i];
    $t_a = $data['tags'][$i-2];
    $t_b = $data['tags'][$i-1];
    $t_c = $data['tags'][$i];
    
    $pack['words'] = array($w_a, $w_b, $w_c);
    $pack['tags'] = array($t_a, $t_b, $t_c);
    
    $trigrams[] = $pack;
  }
  
  return $trigrams;
}


// Get all the words from the DB with the word as the key and the id as the value
function GetAllWords(&$conn){
  $sql = "SELECT * FROM `Words`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['Word']] = $row['ID'];
    }
    return $words;
  }
  return NULL;
}


// Get all the words from the DB with the id as the key and the word as the value
function GetAllWordIDs(&$conn){
  $sql = "SELECT * FROM `Words`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['ID']] = $row['Word'];
    }
    return $words;
  }
  return NULL;
}


// Get all the tags from the DB with the tag as the key and the id as the value
function GetAllTags(&$conn){
  $sql = "SELECT * FROM `Tags`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['Tag']] = $row['ID'];
    }
    return $words;
  }
  return NULL;
}


// Get all the tags from the DB with the id as the key and the tag as the value
function GetAllTagIDs(&$conn){
  $sql = "SELECT * FROM `Tags`";
  $result = $conn->query($sql);
  
  if ($result->num_rows > 0) {// id exists
    $words = array();
    // Output the Word for this ID
    while($row = $result->fetch_assoc()) {
      $words[$row['ID']] = $row['Tag'];
    }
    return $words;
  }
  return NULL;
}



$training_files = array('ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05', 'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17', 'cb18', 'cb19', 'cb20', 'cb21', 'cb22', 'cb23', 'cb24', 'cb25', 'cb26', 'cb27', 'cc01', 'cc02', 'cc03', 'cc04', 'cc05', 'cc06', 'cc07', 'cc08', 'cc09', 'cc10', 'cc11', 'cc12', 'cc13', 'cc14', 'cc15', 'cc16', 'cc17', 'cd01', 'cd02', 'cd03', 'cd04', 'cd05', 'cd06', 'cd07', 'cd08', 'cd09', 'cd10', 'cd11', 'cd12', 'cd13', 'cd14', 'cd15', 'cd16', 'cd17', 'ce01', 'ce02', 'ce03', 'ce04', 'ce05', 'ce06', 'ce07', 'ce08', 'ce09', 'ce10', 'ce11', 'ce12', 'ce13', 'ce14', 'ce15', 'ce16', 'ce17', 'ce18', 'ce19', 'ce20', 'ce21', 'ce22', 'ce23', 'ce24', 'ce25', 'ce26', 'ce27', 'ce28', 'ce29', 'ce30', 'ce31', 'ce32', 'ce33', 'ce34', 'ce35', 'ce36', 'cf01', 'cf02', 'cf03', 'cf04', 'cf05', 'cf06', 'cf07', 'cf08', 'cf09', 'cf10', 'cf11', 'cf12', 'cf13', 'cf14', 'cf15', 'cf16', 'cf17', 'cf18', 'cf19', 'cf20', 'cf21', 'cf22', 'cf23', 'cf24', 'cf25', 'cf26', 'cf27', 'cf28', 'cf29', 'cf30', 'cf31', 'cf32', 'cf33', 'cf34', 'cf35', 'cf36', 'cf37', 'cf38', 'cf39', 'cf40', 'cf41', 'cf42', 'cf43', 'cf44', 'cf45', 'cf46', 'cf47', 'cf48', 'cg01', 'cg02', 'cg03', 'cg04', 'cg05', 'cg06', 'cg07', 'cg08', 'cg09', 'cg10', 'cg11', 'cg12', 'cg13', 'cg14', 'cg15', 'cg16', 'cg17', 'cg18', 'cg19', 'cg20', 'cg21', 'cg22', 'cg23', 'cg24', 'cg25', 'cg26', 'cg27', 'cg28', 'cg29', 'cg30', 'cg31', 'cg32', 'cg33', 'cg34', 'cg35', 'cg36', 'cg37', 'cg38', 'cg39', 'cg40', 'cg41', 'cg42', 'cg43', 'cg44', 'cg45', 'cg46', 'cg47', 'cg48', 'cg49', 'cg50', 'cg51', 'cg52', 'cg53', 'cg54', 'cg55', 'cg56', 'cg57', 'cg58', 'cg59', 'cg60', 'cg61', 'cg62', 'cg63', 'cg64', 'cg65', 'cg66', 'cg67', 'cg68', 'cg69', 'cg70', 'cg71', 'cg72', 'cg73', 'cg74', 'cg75', 'ch01', 'ch02', 'ch03', 'ch04', 'ch05', 'ch06', 'ch07', 'ch08', 'ch09', 'ch10', 'ch11', 'ch12', 'ch13', 'ch14', 'ch15', 'ch16', 'ch17', 'ch18', 'ch19', 'ch20', 'ch21', 'ch22', 'ch23', 'ch24', 'ch25', 'ch26', 'ch27', 'ch28', 'ch29', 'ch30', 'cj01', 'cj02', 'cj03', 'cj04', 'cj05', 'cj06', 'cj07', 'cj08', 'cj09', 'cj10', 'cj11', 'cj12', 'cj13', 'cj14', 'cj15', 'cj16', 'cj17', 'cj18', 'cj19', 'cj20', 'cj21', 'cj22', 'cj23', 'cj24', 'cj25', 'cj26', 'cj27', 'cj28', 'cj29', 'cj30', 'cj31', 'cj32', 'cj33', 'cj34', 'cj35', 'cj36', 'cj37', 'cj38', 'cj39', 'cj40', 'cj41', 'cj42', 'cj43', 'cj44', 'cj45', 'cj46', 'cj47', 'cj48', 'cj49', 'cj50', 'cj51', 'cj52', 'cj53', 'cj54', 'cj55', 'cj56', 'cj57', 'cj58', 'cj59', 'cj60', 'cj61', 'cj62', 'cj63', 'cj64', 'cj65', 'cj66', 'cj67', 'cj68', 'cj69', 'cj70', 'cj71', 'cj72', 'cj73', 'cj74', 'cj75', 'cj76', 'cj77', 'cj78', 'cj79', 'cj80', 'ck01', 'ck02', 'ck03', 'ck04', 'ck05', 'ck06', 'ck07', 'ck08', 'ck09', 'ck10', 'ck11', 'ck12', 'ck13', 'ck14', 'ck15', 'ck16', 'ck17', 'ck18', 'ck19', 'ck20', 'ck21', 'ck22', 'ck23', 'ck24', 'ck25', 'ck26', 'ck27', 'ck28', 'ck29', 'cl01', 'cl02', 'cl03', 'cl04', 'cl05', 'cl06', 'cl07', 'cl08', 'cl09', 'cl10', 'cl11', 'cl12', 'cl13', 'cl14', 'cl15', 'cl16', 'cl17', 'cl18', 'cl19', 'cl20', 'cl21', 'cl22', 'cl23', 'cl24', 'cm01', 'cm02', 'cm03', 'cm04', 'cm05', 'cm06', 'cn01', 'cn02', 'cn03', 'cn04', 'cn05', 'cn06', 'cn07', 'cn08', 
'cn09', 'cn10', 'cn11', 'cn12', 'cn13', 'cn14', 'cn15', 'cn16', 'cn17', 'cn18', 'cn19', 'cn20', 'cn21', 'cn22', 'cn23', 'cn24', 'cn25', 'cn26', 'cn27', 'cn28', 'cn29', 'cp01', 'cp02', 'cp03', 'cp04', 'cp05', 'cp06', 'cp07', 'cp08', 'cp09', 'cp10', 'cp11', 'cp12', 'cp13', 'cp14', 'cp15', 'cp16', 'cp17', 'cp18', 'cp19', 'cp20', 'cp21', 'cp22', 'cp23', 'cp24', 'cp25', 'cp26', 'cp27', 'cp28', 'cp29', 'cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09');
$total_files = count($training_files);

$server = 'localhost';
$username = 'root';
$password = 'password';
$db = 'PartsOfSpeechTagger';
$conn = ConnectToMySQL($server, $username, $password, $db);


// Get all known current words, tags and IDs. Inefficient, redundant calls but it only runs once.
$words_to_id = GetAllWords($conn);
$ids_to_words = GetAllWordIDs($conn);
$tags_to_id = GetAllTags($conn);
$ids_to_tags = GetAllTagIDs($conn);

$log = fopen('Log.txt', 'w+'); // log file

foreach($training_files as $filenumber=>$training_file){
  echo "Processing file $filenumber of $total_files." . PHP_EOL;
  fwrite($log, $training_file . PHP_EOL); // log the name of the file we are working on
  
  // Get data and get it ready for the bot to learn
  $training_data = GetFile($training_file);
  $training_data = PrepareData($training_data);
  $training_data = ExtractTrigrams($training_data);
  //var_dump($training_data);
  
  foreach($training_data as $key=>$set){
    foreach($set as $group=>$trigrams){
      if($group == 'words'){
        // add words
        AddWordToMySQLAndMemory($trigrams[0], $conn);
        AddWordToMySQLAndMemory($trigrams[1], $conn);
        AddWordToMySQLAndMemory($trigrams[2], $conn);
      }
      elseif($group == 'tags'){
        // add tags
        AddTagToMySQLAndMemory($trigrams[0], $conn);
        AddTagToMySQLAndMemory($trigrams[1], $conn);
        AddTagToMySQLAndMemory($trigrams[2], $conn);
      }
    }
    // We know the words and tags are now in the DB & Memory
    // process the trigrams
    AddTrigramToMySQL($set, $conn);
  }
}
fclose($log);


DisconnectFromMySQL($conn);

 

I had hoped we could go farther this week and discuss trigrams but… as I said I’m still training the model so we’ll cover how to use it next week. In the meantime, remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.


Much Love,

~Joy

Tokenizing & Lexing Natural Language

Well, I guess the next question to ask is… can we lex a natural language?

When we lexed the PHP code in Can a Bot Understand a Sentence? we applied tags to the lexemes so that the intended meaning of each could be understood using grammar rules.

Well, it turns out that when talking about natural languages, lexing is referred to as Parts of Speech Tagging.

The top automated parts of speech taggers have achieved something like 97-98% accuracy when tagging previously unseen (though grammatically correct) text. I would say that pretty much makes this a solved problem!

Further, linguists and elementary school teachers have been doing this by hand for years! 😛

In practice everyone’s results will vary, but on average a potential of approximately 2 mis-tagged words out of 100 means that the challenge of building a natural language lexer shouldn’t be too difficult. Of course, even a 2% variance (meaning a bot that gets the tags wrong 2% of the time) can mean that the bot does the wrong thing (perhaps significantly) 2% of the time.

In any case, before we can tag our lexemes we need to come up with a way to ‘tokenize’ a natural language sentence, so let’s talk about that.

Tokenizing

Tokenizing means turning the raw text into ‘tokens’ using a process to determine the bounds of each “word unit” or “part of speech” so that we can treat it as a separate component that can be programmatically acted upon as a lexeme.

In this case we can use the individual characters as our tokens.

If we use this sentence:

“The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think ‘it’ over. john.doe@web_server.com”

Tokens

We want our system to use all the characters (including spaces and punctuation) in the string as tokens like this:

["T","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n"," ","f","o","x"," ","j","u","m","p","s"," ","o","v","e","r"," ","t","h","e"," ","l","a","z","y"," ","d","o","g","."," ","A"," ","l","o","n","g","-","t","e","r","m"," ","c","o","n","t","r","a","c","t"," ","w","i","t","h"," ","\"","z","e","r","o","-","l","i","a","b","i","l","i","t","y","\""," ","p","r","o","t","e","c","t","i","o","n","!"," ","L","e","t","'","s"," ","t","h","i","n","k"," ","'","i","t","'"," ","o","v","e","r","."," ","j","o","h","n",".","d","o","e","@","w","e","b","_","s","e","r","v","e","r",".","c","o","m"]

Lexemes

Once we have the tokens we want the system to process those tokens into lexemes, units that a person would naturally say are “whole”.

In this case, “whole” means that it’s a complete part of speech, so sometimes a lexeme is a multi-character word and sometimes it’s a single-character delimiter.

Most of the time a lexeme will only contain letters, numbers or symbols, but sometimes it should contain some mixed combination, as would be the case with a hyphenated compound word, e.g. zero-liability, or a contraction, e.g. Let’s.

Notice that we want the system to use the apostrophe to merge Let and s into Let’s because it’s a contraction and therefore a “whole” lexeme. However, we don’t want the apostrophes around the word ‘it’ that follows the word ‘think’ combined, because the lexeme in that case is the word it, with the surrounding apostrophes acting as ‘single quotes’; they should therefore be treated as separate lexemes just like the “double quotes” around zero-liability.

Also, we want the system to capture the complex pattern of the email (john.doe@web_server.com) as a single lexeme.

Here’s what that looks like:

[
    "The",
    " ",
    "quick",
    " ",
    "brown",
    " ",
    "fox",
    " ",
    "jumps",
    " ",
    "over",
    " ",
    "the",
    " ",
    "lazy",
    " ",
    "dog",
    ".",
    " ",
    "A",
    " ",
    "long-term",
    " ",
    "contract",
    " ",
    "with",
    " ",
    "\"",
    "zero-liability",
    "\"",
    " ",
    "protection",
    "!",
    " ",
    "Let's",
    " ",
    "think",
    " ",
    "'",
    "it",
    "'",
    " ",
    "over",
    ".",
    " ",
    "john.doe@web_server.com"
]

 

Of course we would still need to apply tags to this list to complete the lexing process but this solves the first problem of splitting natural language text into tokens and processing those tokens into a list of lexemes ready to be tagged.

We’ll work on tagging next week, but for now let’s look at the code that does this.

The Code

Here is the complete code that implements the tokenization of the lexemes. I’ll explain what is happening below but the code is commented for the programmers who are following along.

<?php 

function Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols){  
      
  $temp = '';                   // A temporary string used to hold incomplete lexemes
  $lexemes = array();           // Complete lexemes will be stored here for return
  $chars = str_split($text, 1); // Split the text string into characters.
  
  //var_dump(json_encode($chars, 1)); // convert $chars array to JSON and dump to screen

  // Step through all character tokens in the $chars array
  foreach($chars as $key=>$char){
        
    // If this $char token is in the $delimiters array
    // Then stop building $temp and add it and the delimiter to the $lexemes array
    if(in_array($char, $delimiters)){
      
      // Does temp contain data?
      if(strlen($temp) > 0){
        // $temp is a complete lexeme add it to the array
        $lexemes[] = $temp;
      }      
      $temp = ''; // Make sure $temp is empty
      
      $lexemes[] = $char; // Capture delimiter as a whole lexeme
    }
    else{// This $char token is NOT in the $delimiters array
      // Add $char to $temp and continue to next $char
      $temp .= $char; 
    }
    
  } // Step through all character tokens in the $chars array


  // Check if $temp still contains any residual lexeme data?
  if(strlen($temp) > 0){
    // $temp is a complete lexeme add it to the array
    $lexemes[] = $temp;
  }
  
  // We have processed all character tokens in the $chars array
  // Free the memory and garbage collect $chars & $temp
  $chars = NULL;
  $temp = NULL;
  unset($chars);
  unset($temp);


  // We now have the simplest lexemes extracted. 
  // Next we need to recombine compound-words, contractions 
  // And do any other processing with the lexemes.

  // If there are $chars in the $compound_word_symbols array
  if(!empty($compound_word_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $compound_word_symbols array
      if(in_array($lexeme, $compound_word_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        if($key > 0){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // If $before isn't a $delimiter
          if(!in_array($before, $delimiters)){
            // Merge it with the compound symbol
            $lexeme = $before . $lexeme;
            // And remove the $before $lexeme from $lexemes
            $lexemes[$key - 1] = NULL;
          }
        }
        
        // If this isn't the last $lexeme in $lexemes
        if($key < $number_of_lexemes - 1){
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          // If $after isn't a $delimiter
          if(!in_array($after, $delimiters)){
            // Merge the $lexeme it with
            $lexemes[$key + 1] = $lexeme . $after;
            // And remove the $lexeme
            $lexeme = NULL;
          }
        }
        
      } // Check if lexeme is in the $compound_word_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $compound_word_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the compound word merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  
  
  // If there are $chars in the $contraction_symbols array
  if(!empty($contraction_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $contraction_symbols array
      if(in_array($lexeme, $contraction_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        // and If this isn't the last $lexeme in $lexemes
        if($key > 0 && $key < $number_of_lexemes - 1){
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          
          // If $before isn't a $delimiter
          // and $after isn't a $delimiter
          if(!in_array($before, $delimiters) && !in_array($after, $delimiters)){
            // Merge the contraction tokens
            $lexemes[$key + 1] = $before . $lexeme . $after;
            
            // Remove $before
            $lexemes[$key - 1] = NULL;
            // And remove this $lexeme
            $lexeme = NULL;            
          }

        }
        
      } // Check if lexeme is in the $contraction_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are $chars in the $contraction_symbols array
  
  // Filter out any NULL values in the $lexemes array
  // created during the contraction merges using array_filter()
  // and then re-index so the $lexemes array is nice and sorted using array_values().
  $lexemes = array_values(array_filter($lexemes));
  

  // Return the $lexemes array.
  return $lexemes;
}

// Delimiters (Lexeme Boundaries)
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

// Symbols used to detect compound-words
$compound_word_symbols = array('-', '_');

// Symbols used to detect contractions
$contraction_symbols = array("'", '.', '@');

// Text to Tokenize and Lex
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com';

// Tokenize and extract the $lexemes from $text
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols);
echo json_encode($lexemes, 1); // output $lexemes as JSON

 

Splitting the Tokens

One way to do this would be to use regular expressions (regex) to match a pattern and it’s the method we used for the Email Relationship Classifier.

As part of that project I released a “Tokenizer” Class File that relied heavily on regex to match patterns but this isn’t the method we use in the Natural Language lexer, though we could have.

Common “advice” you will receive as a developer is that “you should NEVER use regex”, and while this is well-meaning advice, it is certainly wrong!

I find regex works best when you understand the patterns you are looking for really well and the patterns won’t change much throughout your data set, though the pattern can be very complex.

Now, the reason why you are often advised to avoid using regex pattern matching is that it’s complicated and understanding the pattern match string is not always immediately intuitive; sometimes it’s downright difficult! This can be the case even if you are generally comfortable working with regex.

So the difficulty in using regex for most developers is a factor in my choice not to use regex in this case, but the main reason is simply that it’s really not needed and it’s actually a lot simpler not to use regex to accomplish our goal.

So if not Regex then how?

Use Delimiters as a Guide to Word Boundaries

First create an array of delimiters that we can use as automatic word boundaries. In this case we can use a list of all the typable symbols that are not letters or numbers.

// Delimiters (Lexeme Boundaries) 
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

 

Use Compound Symbols to Grow Words

Next we need a group of symbols we know are always used to create compound words. Basically this means hyphens and underscores, which should always be joined with their parent lexeme. The distinction here is that even if these symbols show up before, in the middle of, or even after another lexeme they should be considered part of that lexeme, e.g. pre- or long-term or _something or Jane_Doe.

// Symbols used to detect compound-words 
$compound_word_symbols = array('-', '_'); 

 

Use Contractions to Merge Ideas

Quotes (‘single’ & “double”) should be treated as separate lexemes and never be merged with the lexeme they contain. However, apostrophes should actually be merged with the lexemes before & after them, provided that neither is a delimiter. Also, sometimes a period and the @ symbol can behave like contraction symbols, as is the case with the example email: john.doe@web_server.com

// Symbols used to detect contractions 
$contraction_symbols = array("'", '.', '@'); 

 

Our Example Natural Language Text

Here is the test string of natural language.

// Text to Tokenize and Lex 
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com'; 

 

Extract Lexemes from Tokens Using Delimiter Symbols

We can now call the Tokenize() function with our data and capture the results in an array, which we format and echo as JSON.

// Tokenize and extract the $lexemes from $text 
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols); 
echo json_encode($lexemes, 1); // output $lexemes as JSON

 

Now, if we run our code we get all the lexemes extracted from the natural language test string.

Next week we will look at how we can tag the lexemes to complete the Lexical Analysis of a natural language, so remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.


Much Love,

~Joy

Can a Bot Understand a Sentence?

What if we could teach our writer bot (see Bot Generated Stories & Bot Generated Stories II) how to understand the meaning of a sentence? Would that improve the bot’s ability to understand what was said?

Well, conceptually it’s not that different from what we do with programming languages and if you ask me, it sure seems like a good place to start with Natural Languages!

Natural Language

A “Natural Language” is any language humans use (though I guess we would include alien languages should we ever meet some 😛 ) that evolved over time through use rather than by design.

One of the first things you might notice when you contrast a natural language, like English, with artificially designed languages that are used for specific purposes (like programming a computer) is that natural languages have much more complexity and variation.

Further, natural languages tend to be far more “general purpose” than even the most capable artificial general purpose programming languages.

For example, just think of all the ways you can say you love Ice Cream or that the room temperature is hot.

Now if you are a programmer, contrast that with how many ways there are to create a loop in a program.

The answer is more than a few (off the top of my head I can think of for, foreach, while, do while, goto, recursive functions) but certainly many times fewer than the number of all the possible combinations of words you could use to describe how green something is!
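
For the curious, here’s each of those loop forms counting 0 through 2 in PHP (yes, even goto works, though it’s rarely advisable):

<?php
for($i = 0; $i < 3; $i++){ echo $i; }  // for

foreach(range(0, 2) as $i){ echo $i; } // foreach

$i = 0;
while($i < 3){ echo $i; $i++; }        // while

$i = 0;
do{ echo $i; $i++; }while($i < 3);     // do while

$i = 0;
start:                                 // goto
echo $i;
if(++$i < 3){ goto start; }

function CountUp($i, $limit){          // recursive function
  if($i >= $limit){ return; }
  echo $i;
  CountUp($i + 1, $limit);
}
CountUp(0, 3);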

How then, do computers understand what programmers say?


Write Code

First, a programmer must write some valid code, like this for example:

$pi = 3.1415926535898;
for($i = 0; $i < $pi; $i++){
    echo $i . PHP_EOL;
}
echo 'PI is equal to: ' . ($pi + PI()) / 2;

Result:

0
1
2
3
PI is equal to: 3.1415926535898

The computer can understand what this code means and does exactly what it was asked to do, but how?

Lexical Analysis

Lexical Analysis occurs in the early stages of the Compilation & Interpretation processes, where the source code or script for a program is scanned by a program called a lexer which tries to find the smallest chunk of “whole” information, called a “Lexeme”, and will assign it a “type” or “tag” that denotes its specific purpose or function.

Lexed Code

You might be wondering what lexed code looks like. If we lex the example code from above we get a list that would be something like this if we represent it as JSON:

[
	["identifier","$pi"],
	["operator-equals","="],
	["literal-float","3.1415926535898"],
	["separator-terminator",";"],
	["keyword-for","for"],
	["separator-open-parentheses","("],
	["identifier","$i"],
	["operator-equals","="],
	["literal-integer","0"],
	["separator-terminator",";"],
	["identifier","$i"],
	["operator-less-than","<"],
	["identifier","$pi"],
	["separator-terminator",";"],
	["identifier","$i"],
	["operator-increment","++"],
	["separator-close-parentheses",")"],
	["open-curl","{"],
	["keyword-echo","echo"],
	["identifier","$i"],
	["operator-concatenate","."],
	["keyword-end-of-line","PHP_EOL"],
	["separator-terminator",";"],
	["separator-close-curl","}"],
	["keyword-echo","echo"],
	["literal-string","PI is equal to: "],
	["operator-concatenate","."],
	["separator-open-parentheses","("],
	["identifier","$pi"],
	["operator-plus","+"],
	["keyword-pi","PI"],
	["separator-close-parentheses",")"],
	["operator-divide","/"],
	["literal-integer","2"],
	["separator-terminator",";"]
]

What we’ve just done is give each lexeme a tag that is unambiguous as to what its intended role or function is.

Semantic Analysis

Then Semantic Analysis, sometimes called Parsing, checks the code to ensure that there are no mistakes and establishes a hierarchy of relationships and meaning so the code can be evaluated using the rules of the language.

Semantic Hierarchy

Parsing will group the expressions into a tree hierarchy that makes our intended meaning explicitly clear to the computer.

Here is the code above parsed and represented as JSON:

[
   {
      "tags":[
         "identifier",
         "operator-equals",
         "literal-float"
      ],
      "lexemes":[
         "$pi",
         "=",
         "3.1415926535898"
      ],
      "child-expressions":[

      ]
   },
   {
      "tags":[
         "keyword-for"
      ],
      "lexemes":[
         "for"
      ],
      "child-expressions":[
         {
            "tags":[
               "identifier",
               "operator-equals",
               "literal-integer"
            ],
            "lexemes":[
               "$i",
               "=",
               "0"
            ],
            "child-expressions":[

            ]
         },
         {
            "tags":[
               "identifier",
               "operator-less-than",
               "identifier"
            ],
            "lexemes":[
               "$i",
               "<",
               "$pi"
            ],
            "child-expressions":[
               {
                  "tags":[
                     "keyword-echo",
                     "identifier",
                     "operator-concatenate",
                     "keyword-end-of-line"
                  ],
                  "lexemes":[
                     "echo",
                     "$i",
                     ".",
                     "PHP_EOL"
                  ],
                  "child-expressions":[

                  ]
               }
            ]
         },
         {
            "tags":[
               "identifier",
               "operator-increment"
            ],
            "lexemes":[
               "$i",
               "++"
            ],
            "child-expressions":[

            ]
         }
      ]
   },
   {
      "tags":[
         "keyword-echo",
         "literal-string",
         "operator-concatenate"
      ],
      "lexemes":[
         "echo",
         "PI is equal to: ",
         "."
      ],
      "child-expressions":[
         {
            "tags":[
               "operator-divide",
               "literal-integer"
            ],
            "lexemes":[
               "\/",
               "2"
            ],
            "child-expressions":[
               {
                  "tags":[
                     "identifier",
                     "operator-plus",
                     "keyword-pi"
                  ],
                  "lexemes":[
                     "$pi",
                     "+",
                     "PI"
                  ],
                  "child-expressions":[

                  ]
               }
            ]
         }
      ]
   }
]

Code Evaluation

Once the code has been analysed it can be Evaluated.

And since you’re curious, here’s what this code does:

  1. Declares a variable named $pi then sets its value to the number Pi with a length of 13 decimal places.
  2. A for loop is initialized.
  3. Expression 1 in the for loop is only evaluated once before the loop begins and it declares a variable named $i and sets its value to 0.
  4. Each time the for loop runs Expression 2 is evaluated using Boolean Algebra and if the result is logically TRUE then the code inside the loop runs. The expression is a comparison of the value of $i & $pi where if $i is less than $pi then the loop runs.
  5. Expression 3 is the third and final expression in the for statement. It runs after each iteration of the loop. The value of $i is incremented by 1 using the ++ increment operator.
  6. Each iteration of the loop “echo‘s” the value of $i to the screen.
  7. Once the for loop terminates the computer takes the value of $pi and adds it to the value provided by the PHP language function PI() (which provides the value of Pi as a constant).
  8. That resulting sum is then divided by 2 giving us exactly Pi.
  9. This value is then concatenated with the string “PI is equal to: ” and the whole string value is then echoed to the screen.

This code does nothing of real value but it’s sufficiently short for us to lex by hand and long enough that it provides interesting results with nested child expressions, see the infographic hierarchy above.

How Does This Apply To Writer Bot?

Well, I guess the next question to ask is… can we lex a natural language? We’ll talk about that in my next post so remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.


Much Love,

~Joy

Why Rule Based Story Generation Sucks

Welcome back, today we’re going to talk about why rule based story generation sucks and I’ll also present some code that you can use to create your very own rule based story generator!

But before we get started I’d like to draw your attention to the title header image I used in my last article Rule Based Story Generation, which you should also read because it helps add context to this article… anyway, back to the image. It presents a set of Scrabble blocks that spell out:

“LETS GO ON ADVENTURES”.

I added over the image my usual title text in Pacifico font and placed it as though to subtly imply it too might simply be a randomly selected block of text just slotted in wherever it would fit.

Combined with the new “title rule” it reads:

“LETS GO ON ‘Rule Based Story Generation’ ADVENTURES”.

Perhaps it’s too subtle, I know! 😛

In any case, I like how the Scrabble tiles’ capitalization is broken up by the script font.

It almost illustrates the sort of mishmash process we’re using to build sentences that are sometimes almost elegant though much of the time, rigid and repetitive.

It’s important to understand that developing an artificial intelligence that can write as well as a human is still an open (unsolved) area of research, which makes it a wonderfully exciting challenge for us to work on and we’re really only just getting started! 😉

Of course this isn’t as far as we can go (see Rule Based Story Generation & A Halloween Tale) and I am currently working on how we go even farther than I already have… but we’ll get there.

 

A Flawed Stepping Stone

Rule based story generation is an important, yet flawed, stepping stone in helping us understand the problem of building a writer bot that will write best sellers!

Despite the flaws with using rules to create stories, we may in the future partially rely on “bot written” rules in a layered generative system so none of this is necessarily wrong, just woefully incomplete, especially by my standards as I like to publish fully functioning prototypes… more or less. 😉

Before I present the code however let’s briefly look at why rules suck!

 

Reasons Rule Based Generation Sucks

Here’s a non-exhaustive list of reasons why rule based story generation sucks:

  • Some rules create Ambiguity.
  • Lack of correlation between Independent Clauses.
  • Complete lack of Dependent Clauses that will need to correlate with their associated independent clauses.
  • Run-On Sentences are WAY easy to generate!
  • Random, or hard coded Verb Tense & Aspect is used.
  • Forced, improper or unusual grammar.
  • Random events, and no cohesive growth of ideas over time (lack of Narrative).
  • No Emergence means that all meaningful possibilities are input by a person… manually! 😦
  • Placeholder for anything that I may have forgotten to include here.

If we intend to build the “writer bot” AI described in Bot Generated Stories & Bot Generated Stories II we will have to find ways of mitigating or eliminating the issues in this list!

We could be proactive at trying to resolve some of these issues with the rule based system but most of our efforts would boil down to writing additional (if this, then that) rule preconditions and really… nobody has time for that!

Besides, if a person is writing the rules… even if the specific details selected for a sentence are random or possibly even stochastic (random but based on probability), wouldn’t you still say that the person who wrote the rules, kinda sorta wrote the text too?

I mean, even if there’s an asterisk next to the writer’s name and in itty-bitty tiny text somewhere it says (*written with the aid of a bot) or vice versa, it’s still hard to say that whoever wrote the rules and made the lists didn’t write the resulting text… to some extent, right?

If you agree… disagree? Feel free to present your argument in the comments!

Ultimately for the reasons listed above (and a fair amount of testing) I am confident that hand written rule based story generation is not the way to go!

 

New Rules

In addition to a new rule (name action.) I am including a little something special in the code below.

I call them Rule 8 & Rule 9 because they were written in that order, but what makes them unique from rules 1–7 (which I wrote) is that they were effectively written by a bot.

What I mean when I say the bot “wrote the rule” is that the pattern used in the rules was extracted by a learning bot (pattern matching algorithm/process).

Here are examples of the new rules:

Rule 7

Axton went fishing.
Briana mopped the floor.
Kenny felt pride.
Ryann played sports.
Chaim felt content.
Alaina road a horse.
Elian setup a tent.
Brian had fun.
Meadow heard a noise.
Jewel learned a new skill.

 

Rule 8 – Bot Generated Rule

Freya handled Along the ship!
Bethany dipped Inside the dollar!
Kyla appointed Regarding the scenery!
Aryanna filed yieldingly the honoree.
Madeline demanded of the pannier!
Kailey repaid there the courage.
Finley came With the button.
Sawyer criticised owing to majestically icicle!
Armani included again down canopy!
Genevieve snapped Behind the computer!

 

Rule 9 – Bot Generated Rule

Ulises besides Maxim approved the scraper Near.
Nova that is Louisa ringed the scraper Down.
Alec besides Killian eased the cope Outside.
Sylas consequently Zain beat safely exit Above.
Conrad yet Alfredo owed the definition Within.
Danica that is Jackson paid the sweater fully.
Hugh and Kori substituted the pitching heavily.
Julissa but Colton separated the lie Down.
Liberty but Barbara reformed the lamp kissingly.
Zion yet Rosemary ascertained true fat under.
Neither Desiree nor Nadia filed the protocol Ahead of.
Neither Rudy nor Rowan aided the weedkiller commonly.
Grace indeed Jad caused the beast best.
Jaelynn besides Maddux cheered the panda Against.
Ari yet Ayla elected the seaside without.
Blakely moreover Karsyn stimulated jealousy shadow owlishly.
Prince further Lennon exhibited the worm Except.
Clay thus Rohan embraced the tsunami each.
Sabrina but Avery stressed far paste Excluding.
Gregory so Dallas engaged new egghead clearly.
Neither Lydia nor Walter escaped naturally margin previously.
Dylan namely Elaina kept suspiciously shed oddly.
Neither Jedidiah nor Karsyn devised the bathhouse kookily.
Kareem so River pointed wetly yoga Ahead of.
Ansley accordingly Alessandro laughed the brood By.
Omar otherwise Sofia obtained the clipper Per.
Walker so August summoned the tile yeah.
Remy moreover Cody raised the handball loyally.
Aadhya so Adelynn allocated the fear Amidst.
Mohamed likewise Hudson inspected the hyphenation Like.

While that does make generating rules easier, and it can aid in resolving some of the issues with generating stories using rules, it really only amounts to an interesting “babble” generator at best. Perhaps, though, it could be coupled with several other systems in layers to create something closer to a story?

Maybe through the use of “rewrite rules” that could fix the verb tenses and pronouns?
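
As a taste of what such a rewrite rule might look like, here’s a tiny sketch that just lowercases words wrongly capitalized mid-sentence, one of the obvious glitches in the Rule 8 & 9 output; real rewrite rules for tense and pronouns would be much more involved:

<?php
// A tiny "rewrite rule" sketch: lowercase words that are wrongly
// capitalized mid-sentence in generated text like Rule 8's output.
// Caveat: a real version would need a list of proper names to skip,
// and tense/pronoun rewrites would be far more involved than this.
function RewriteMidSentenceCaps($sentence){
    $words = explode(' ', $sentence);
    for($i = 1; $i < count($words); $i++){
        $previous_end = substr($words[$i - 1], -1);
        // only lowercase if the previous word didn't end a sentence
        if($previous_end !== '.' && $previous_end !== '!' && $previous_end !== '?'){
            $words[$i] = strtolower($words[$i]);
        }
    }
    return implode(' ', $words);
}

echo RewriteMidSentenceCaps('Freya handled Along the ship!') . PHP_EOL;
// Output: Freya handled along the ship!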

Here are the results of 15 randomly selected rules:

Leanna Odom is a woman but to the south of a
sports stadium, a bird built a nest for in the
zoo, a guy road a bike nor next to a city jail, a
car got a flat tire for Ricky Solis is very garish
and outside a farm , robots attacked yet Armani
Lowery is very attentive but Jeffrey and Zaniyah
permitted the extent Excluding. nor Azariah seemed
fatally the route! but Maddux proved hopefully
monthly wasp! nor Blaine wrote a poem. and inside
an abandoned ghost town, a book was written and
Neither Seth nor Callan behaved the steeple In
addition to. but Lucille ate a nice meal. and
Lorelei meditated.

Code

Below is the code for Generate.php, the main program file. It uses Functions.php as well as some text files, and you can find all the files you need for this project for free on my GitHub: RuleBasedStoryGeneration on GitHub

<?php
// include all the functions
include('Functions.php');
// set up the parts of speech array
// functions will globally point to this variable
$pos = LoadPartsOfSpeech();
$number_of_sentences = 30; // how many sentences/rules are generated/used
$story = ''; // the string we concatenate rules on to
// for whatever number you set $number_of_sentences to...
foreach(range(1, $number_of_sentences, 1) as $number){
    
    $rule_subject = random_int(1, 3);
    
    // randomly determine the type of rule to use,
    // randomly select the rule, compute its result and concatenate with 
    // the existing $story
    if($rule_subject == 1){ // action or event
        
        $rule_subject = random_int(1, 4);
        
        if($rule_subject <= 3){
             $story .= Rule(1); // event
        }
        elseif($rule_subject == 4){
             $story .= Rule(7); // action
        }
    }
    elseif($rule_subject == 2){ // people related
        $rule_subject = random_int(1, 6);
        
        if($rule_subject == 1){
             $story .= Rule(2);
        }
        elseif($rule_subject == 2){
             $story .= Rule(3);
        }
        elseif($rule_subject == 3){
             $story .= Rule(4);
        }
        elseif($rule_subject == 4){
             $story .= Rule(5);
        }
        elseif($rule_subject == 5){
             $story .= Rule(6);
        }
        elseif($rule_subject == 6){
             $story .= Rule(7);
        }
    }
    elseif($rule_subject == 3){ // bot generated
        $rule_subject = random_int(1, 2);
        
        if($rule_subject == 1){
             $story .= Rule(8);
        }
        elseif($rule_subject == 2){
             $story .= Rule(9);
        }
    }

    // if this is not the last sentence/rule concatenate a conjunction
    if($number <= ($number_of_sentences - 1)){
        $story .= $pos['space'] . Get($pos['conjunctions']['pure']) . $pos['space'];
    }
}
// after the loop wrap the text at 50 chars and output the story
echo wordwrap($story, 50, PHP_EOL);
/*
 * Example Output
 * 
Jayleen ended By the lip! so Jada called a family
member. nor Aidan Lester is gifted and Emma Walton
is very clumsy and Grey proceeded widely literally
runner. or Santana Norman is a man yet Nico
Bartlett is very pitiful yet Aliana Browning is
rich and Rowan introduced Past the colloquia. but
Holly built a robot. so Morgan Dorsey is a person
for London fooled Against the cappelletti. but
Neither Emory nor Angel angered the order angrily.
or Hezekiah Beasley is very panicky and Leighton
did almost vivaciously author. so Foster Justice
is a man but Rory Parker is a beautiful person so
Reagan Rivera is a person but Kai Zamora is clever
nor beyond a newspaper company, dinosaur bones
were uncovered yet beyond a houseboat, a bird
built a nest so Kyle Goff is a man or on the
mountains, a new species of insect was identified
nor Galilea Mckinney is very worried or Gunner Orr
is very guilty but Otto Gaines is a small man nor
Gia Hendrix is powerful and Robert Mcdaniel is a
beautiful man so to the south of a newspaper
company, science breakthroughs were made so
Carmelo Rodgers is very witty
 
   
 */
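
If you’d rather not dig through the repo right away, here’s a rough sketch of the kind of helpers Generate.php relies on. To be clear, these are simplified stand-ins I’m showing for illustration, not the actual contents of Functions.php:

<?php
// Simplified stand-ins for the kind of helpers Functions.php provides.
// Illustrative only; the real file (and the word lists it loads from
// text files) lives in the GitHub repo.

// Load the parts of speech lists into one array.
function LoadPartsOfSpeech(){
    return [
        'space' => ' ',
        'names' => ['Axton', 'Briana', 'Kenny', 'Ryann'],
        'actions' => ['went fishing.', 'mopped the floor.', 'felt pride.'],
        'conjunctions' => ['pure' => ['and', 'but', 'or', 'so', 'yet', 'nor', 'for']],
    ];
}

// Pick one random element from a list.
function Get($list){
    return $list[array_rand($list)];
}

// Apply a rule by number; only Rule 7 (name action.) is sketched here.
function Rule($number){
    global $pos; // Generate.php sets this via LoadPartsOfSpeech()
    if($number == 7){
        return Get($pos['names']) . $pos['space'] . Get($pos['actions']);
    }
    return ''; // the real file implements rules 1 - 9
}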

 

Please remember to like, share and follow!


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy

Rule Based Story Generation

In my last post Bot Generated Stories II I left off posing the question:

How then did I get my bot to generate A Halloween Tale?

~GeekGirlJoy

The short and most direct answer I can give without getting into all the nitty-gritty design & engineering details (we’ll do that in another post) is that I built a Markov Model (a math-based bot) then I “trained” my bot on some texts to “teach” it what “good” patterns of text look like.

After my bot “read” approximately 350K words, the 3 books and a few smaller texts (some of my work and a few other general text sources to fill out the vocabulary), I gave my bot a “seed word” that it would use to start the generative process.

What’s really going on “under the hood” is a lot like the “predictive text” feature many of you use every day when you send a text on your phone.

It’s that row of words that appears above or below your text while you type, suggesting the next possible word based on what you last said, what it was trained on and your personal speech patterns it learned while it watched you send texts in the past.

Well… my writer bot is sort of a “souped up”, “home-brewed” version of that… just built with generating a story in mind rather than predicting the next most likely word someone sending a text would need.
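
To give you a flavor of the concept (to be clear: this is the textbook idea, not my actual writer bot), a word level Markov model boils down to counting which word tends to follow which and then walking those counts from a seed word:

<?php
// Toy bigram Markov chain - the textbook idea, not my actual bot.
$text = 'the fox jumps over the lazy dog and the fox runs';
$words = explode(' ', $text);

// "Training": count which words follow each word.
$chain = [];
for($i = 0; $i < count($words) - 1; $i++){
    $chain[$words[$i]][] = $words[$i + 1];
}

// "Generating": start from a seed word, repeatedly pick a plausible next word.
$word = 'the'; // the seed word
$output = [$word];
for($i = 0; $i < 8; $i++){
    if(empty($chain[$word])){
        break; // dead end, no word ever followed this one
    }
    $word = $chain[$word][array_rand($chain[$word])];
    $output[] = $word;
}

echo implode(' ', $output) . PHP_EOL; // e.g. "the fox jumps over the lazy dog and"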

The thing is, we’re not going to talk about that bot today. 😛

Xavier and I have been sick and I’m just not in the mood to do a math & code heavy post today; instead we’re going to talk about rule based story generation. 😉

Rule Based Story Generation

One simple and seemingly effective way to generate a story that absolutely works (at least to some extent) is to use “rules” to select elements from groups or lists and assemble the parts into a story.

The idea is pretty simple actually: if you select or write good rules, they will generate unique (enough) sentences that, when combined, form a story.

For example, let’s say I create a generative “rule” like this:

Rule: proximity location, event.

Seems simple enough but this rule can actually generate quite a few mildly interesting sentences, like this one for example:

in a railroad station, aliens attacked.

Result: (proximity: in) (location: a railroad station), (event: aliens attacked).

Not bad huh? You’d totally read that story right? 😛
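
Under the hood a rule like this is just random selection from a few lists, glued together in a fixed order. Here’s a minimal sketch, with abbreviated stand-in lists:

<?php
// Rule: proximity location, event.
// Minimal sketch; these lists are abbreviated stand-ins for the real ones.
$proximities = ['in', 'inside', 'near', 'beyond', 'to the west of'];
$locations = ['a railroad station', 'a cave', 'a village', 'a dank old bar'];
$events = ['aliens attacked', 'a robbery happened', 'a bird built a nest'];

$sentence = $proximities[array_rand($proximities)] . ' '
          . $locations[array_rand($locations)] . ', '
          . $events[array_rand($events)] . '.';

echo $sentence . PHP_EOL; // e.g. "near a cave, a robbery happened."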

Here are a few more results using this rule so you can see what it can do. Note that the rule pattern never changes (proximity location, event.) but due to different values being randomly selected each sentence is different and the general “tone” of the sentence changes a bit as well:

Below a dank old bar, science breakthroughs were made.

I like that one 😛

Next to a village, a robbery happened.

To the west of a cave, a bird built a nest.

On the deck of a giant spaceship, a nuclear bomb was detonated.

Eat your heart out Kubrick! 😉 😛

Beyond the mountains, a child giggled.

Notice that all three parts of the rule (proximity location, event.) can affect the tone and meaning of the generated result.

What if the rule had generated:

“On the deck of a giant spaceship, a child giggled.”

That is a vastly different result than the one in the examples above, yet perhaps it is the same story with only seconds separating both events? Maybe…

“On the deck of a giant spaceship, a child giggled. The hoards of aliens were defeated. In the distance a voice yells “Dinner’s ready!”, a toy spaceship falls to the floor as little feet scurry unseen through the house.”

What makes the determination in the reader’s mind about what is actually going on is the context of what was said before this sentence and what will be said after. There are those cases where not saying something is saying something too… but dammit I can’t model that! 😛

Now, let’s look at how the proximity can change the meaning.

Here’s the proximity list I used with this rule:

in
inside
outside
near
on
around
above
below
next to
close to
far from
to the north of
to the south of
to the east of
to the west of
beyond

Each ‘proximity’ by itself seems pretty self-explanatory in its meaning but when combined with a location the meaning can change. For example, it seems fairly natural to discuss something being ‘beyond’ something else like “the fence is beyond the water tower” but let’s say that you have an ambiguous ‘location’ like Space?

1930s & ’40s pulp sci-fi aside… what does it mean to be “Beyond Space”? 😛

Clearly we’ve run into one of the limitations of rule based story generation, of which there seem to be many… but in this case I’m referring to unintended ambiguity.

At best a rule would reduce ambiguity and at worst it could inject significant ambiguity into a sentence. Ambiguity in this case should be understood as a lack of clarity, or a variance between what the reader is supposed to understand is occurring and what they believe is occurring.

Limitations aside, this type of rule based generative system is surprisingly effective at crafting those direct and matter of fact type statements.

The type of problem you could write an “If This Then That” sort of rule for… hmmm.

 

A Few More Rules

Here are a few more rules to help you get a feel for how this whole “rule” thing works:

Rule: name is very positive_personality_trait
&
Rule: name is very negative_personality_trait

See if you can tell which is which in this list:

Channing Lynn is very faithful
Jerome Puckett is very defeated
Arturo Thomas is very nice
Damon Gregory is very grumpy
Calvin Weeks is very repulsive
Joaquin Hicks is very gentle
Amanda Calhoun is very thoughtless
Matthias Welch is very polite
Carter Camacho is very scary
Jay Dyer is very happy
Harper Buckley is very helpless
Trenton Bauer is very kind
Kane Owen is very lazy
Lauryn Vasquez is very obedient
Aleah Gilmore is very angry
Ameer Cortez is very brave
Kase Wolfe is very worried

This rule is static and can be improved by having fewer “hard coded” elements.

Instead of the result always containing the word “very” you might instead have a gradient of words that are selected at random (or based on some precondition) that would modify the meaning or intensity of the trait, i.e. mildly, extremely, slightly, not particularly, etc., which could lead to interesting combinations. We could call this gradient of terms, oh I don’t know… adverbs. 😛

Technically though, adverbs in general are too broad a category to treat as simple building blocks in a rule like this, but you could build a list of adverbs that would apply in this case and replace the word ‘very’ with a selection from that list, which would result in more variation in the personality trait descriptions.
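
For example, an upgraded version of the trait rule might look something like this sketch (the lists are my own illustrative stand-ins):

<?php
// Rule: name is intensity personality_trait
// Sketch of the "gradient" idea; the lists are illustrative stand-ins.
$names = ['Channing Lynn', 'Jerome Puckett', 'Arturo Thomas'];
$intensities = ['mildly', 'slightly', 'very', 'extremely', 'not particularly'];
$traits = ['faithful', 'grumpy', 'repulsive', 'gentle', 'polite'];

echo $names[array_rand($names)] . ' is '
   . $intensities[array_rand($intensities)] . ' '
   . $traits[array_rand($traits)] . PHP_EOL;
// e.g. "Jerome Puckett is not particularly grumpy"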

Let’s look at another rule.

Rule: name is adjective_condition

Annalee Sargent is shy
Hugh Oconnor is helpful
Tessa Rojas is gifted
Cristian Castaneda is inexpensive
Heavenly Patel is vast
Gibson Hines is unimportant
Alora Bush is alive
Leona Estes is mushy

I don’t know about you but…

I’ve always found that “mushy” people are very positive_social_anecdote! 😛

Are you starting to see how the rules work? 😉

Much like the rule above that could be improved by replacing the “hard coded” adverb (very) with a gradient that is selected at random (or based on some precondition), the verb ‘is’ in this rule could be replaced with a gradient of verb tenses, i.e. is, was, will be, etc…

Now, if you want to get more complicated… you could even build a system that uses or implements preconditions as I mentioned above.

An example of a precondition I gave above was verb tense to determine if something has, is or will happen… which would then modify the related rules that follow… but it’s also possible to build preconditions that modify rules directly from properties that are innate to your characters, settings, objects in the locations, the weather, the time of day etc…

For example, consider the Rule: name is a gender

This rule must be able to determine the gender of the name given to it in order for the rule to work. In this case, the gendered name would act as a precondition that modifies the result of the rule.

Reyna Dunlap is a woman
Nikolai Cummings is a man
Emerald Lynch is a woman
Lucas Woodward is a man
Bailey Ramsey is a woman
Matias Miller is a man
Tinley Hansen is a woman
Mckenzie Davidson is a woman

It’s also possible however for a name to be gender neutral, like Jamie for example, and the rule cannot simply break if the name is both male & female, or neither in the case of a new or non-typical name. That level of abstraction (the care and detail given to each rule to prevent it from breaking) has to extend to all rules in all cases, which is why using rules to write stories is impractical.
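
Here’s a sketch of how such a precondition might be handled, with a neutral fallback so the rule doesn’t break on names it can’t classify (the lookup table is an illustrative stand-in for a real name list):

<?php
// Rule: name is a gender - with a neutral fallback so unknown or
// gender neutral names don't break the rule.
// The lookup table is an illustrative stand-in.
$genders = [
    'Reyna' => 'woman',
    'Nikolai' => 'man',
];

function GenderOf($first_name, $genders){
    return $genders[$first_name] ?? 'person'; // neutral fallback
}

echo 'Reyna Dunlap is a ' . GenderOf('Reyna', $genders) . PHP_EOL; // woman
echo 'Jamie Dunlap is a ' . GenderOf('Jamie', $genders) . PHP_EOL; // person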

Related to the last rule is this Rule: name is a adjectives_appearance gender

Mallory Joseph is a clean woman
Talon Vazquez is a bald man
Kody Maxwell is a magnificent man
Meredith Strickland is a stocky woman
Jaliyah Haynes is a plump woman
Brian Leblanc is a ugly man
Collins Warren is a scruffy woman
Tenley Robbins is a chubby woman
Brantley Mcpherson is a chubby man
Killian Sawyer is a fit man

Here again you see the rule must identify the gender of the name given to it… but what’s more important is that I used the present tense ‘is’ when it’s just as valid grammatically to say that “Killian Sawyer was a fit man”. In fact, even if it is grammatically correct to say he “is fit”, he might not even be alive any longer, in which case ‘was’ would be the logically correct past tense, carrying the implication that he is no longer fit rather than dead. Additional information would be required by both the reader and the system to determine that Killian was dead, but the point still stands.

Using preconditions on all aspects of the story such as the characters, places, things etc. could enable the system to properly determine whether it should say something is, was, will be, maybe, never was, never will be etc… It could examine the established who, what, when, where, why & how and use that information to determine what rule to use next, which would progress the story further.

It’s easy to imagine how some rules would only be applicable if certain conditions had occurred or were “scheduled” to occur later in the story. Some rules might naturally form branching chains or tree hierarchies within the “flow” of the rules.

This implies, if not requires, some form of higher order logic, state machines and the use of formal logic programming with special care and attention given to the semantic meaning or value of each part of each rule…
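
To make that slightly more concrete, here’s a toy sketch of a single tense precondition, where the system consults a simple character state before choosing ‘is’ or ‘was’ (the state array and its fields are hypothetical):

<?php
// Toy precondition: pick the verb tense from tracked story state.
// The $characters state array and its fields are hypothetical.
$characters = [
    'Killian Sawyer' => ['alive' => true],
    'Old Gregor'     => ['alive' => false],
];

function FitnessSentence($name, $characters){
    $tense = $characters[$name]['alive'] ? 'is' : 'was';
    return $name . ' ' . $tense . ' a fit man.';
}

echo FitnessSentence('Killian Sawyer', $characters) . PHP_EOL; // "... is a fit man."
echo FitnessSentence('Old Gregor', $characters) . PHP_EOL;     // "... was a fit man."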

Well nobody said it was going to be easy! 😛

These Are Not The Droids You Are Looking For

Sounds too easy right? Well… you’re probably right.

I mean sure, you can do this in theory and next week I will provide a crude “proof of concept” example with code demonstrating the rules I used here today. But even if you create a bunch of rules, the results aren’t exactly “earth shatteringly good”, and you can’t just write them and be done; there is a lot of trial and error involved in getting this sort of thing just right.

Personally I’ve never even seen a fully functional implementation of this type of thing… sure, I’ve seen template based stories that use placeholders but nothing as dynamic as I believe would be required to make a rule based system work as described.

Again I am not talking about simply randomly selecting rule after rule… I mean sure you could do that but you won’t get anything really useful out of it.

To do this right, your system would select the rules that properly describe past, present & future events based on where in the story the rule is used, and it can’t simply swap out the names, objects or descriptions in your story without concern for the context in which the rule is being applied.

To do rule based story generation right means that you get different stories each time the system runs, not cookie cutter templatized tales. You can’t simply write a story template, fill it with placeholders and then select and assemble “story legos” and get a good story.

Though at least hypothetically it could work if you wrote enough rules and built a more robust system that keeps track of the story state, the characters and their motivations, the objects, where they are, what they are etc… Of course this is tedious and ultimately still amounts to you doing an awful lot of work that looks like writing a story.

I do believe (at least in theory) a rule based story generative system as described here could work but you would be limited to the manually written rules in the system (or are you? 😉 ) and how well the state machine used the rules.

Further, it’s debatable whether, even if a rule based story generation system worked, it could actually be good enough to be the “best seller” writer bot that we’re looking for.

To me, the major limiting factor appears to be hand writing, refining and testing the rules.

Suggest A Rule

As I said, I will present the code for these rules in my next post, but I’d like to ask you to suggest a rule in the comments on this post. I will try to include as many of them as possible in the code and I will give the suggester credit for the rule.

Please remember to like, share & follow!



 

 

Much Love,

~Joy

Bot Generated Stories II

In my last post Bot Generated Stories I left off describing how text based story generation can give way to a sort of “holodeck” virtual reality where you don’t just read a story but can explore an entire simulated world built around giving you a narrative experience unique to your preferences and choices.

The first step is to build a “writer bot” that isn’t quite as good as a human writer (but capable nonetheless) so that it can work alongside a human and aid in the writing process. This would allow the bot to rely on the human to determine what is “interesting” while the bot offers suggestions when a sort of “say something” button is pushed, though my friend Oliver suggests the phrase “gimme some magic”. 😛

As described, this bot would act as a “digital muse” of sorts, offering suggestions along the way with a human selecting and writing the details from a set of possibilities, while allowing the human author to throw out the bot’s suggestions and take the story in completely different directions than what the bot generated.

In many ways my “writer bot” is far from this goal because it fails when it comes to generating sentences that have actual meaning and correlation with the desired topic between clauses but I will talk about this in more depth in another article.

What my bot is good at is generating sentences that are better than random and I can illustrate this quite simply.

 

My First Attempt: Yule’s Heuristic’s

My first experiments used a bot with random word selection from a very large word list to produce content.

It’s important to note that I did not expect good results, I just needed something to compare all future attempts against and random selection seemed like the worst way to do it. If any of my bots along the way produced content even slightly better than random it would be a step in the right direction.

My initial methodology was basically just to pull words at random from the built-in Linux dictionary and throw in the occasional period or comma (no commas shown in the example below) to create sentences and clauses. I then concatenated those pseudo sentences and randomly added a break to create paragraphs.

Also, mostly for my own amusement, I programmatically generated a “contents” section with chapter titles and page numbers that line up, though outside of those “rules” the following was pseudo-randomly generated.
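
For the curious, that baseline amounts to something like this sketch (assuming the standard Linux word list; the punctuation odds are made-up stand-ins):

<?php
// Random word baseline sketch; assumes the standard Linux word list
// at /usr/share/dict/words. The punctuation odds are made-up stand-ins.
$dictionary = file('/usr/share/dict/words', FILE_IGNORE_NEW_LINES);

$text = '';
for($i = 0; $i < 200; $i++){
    $text .= $dictionary[array_rand($dictionary)] . ' ';
    if(random_int(1, 10) == 1){ // occasionally end a "sentence"
        $text = rtrim($text) . '. ';
        if(random_int(1, 4) == 1){ // occasionally end a "paragraph"
            $text = rtrim($text) . PHP_EOL . PHP_EOL;
        }
    }
}

echo wordwrap(trim($text), 70, PHP_EOL);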

Note this is the first output I generated when I first began working on creating a “writer bot” (it’s terrible – though some of it is amusing):

Yule's Heuristic's
GeekGirlJoy

Table of Contents

Chapter 1: Rebroadcast's Sulkies Borough Whitewashes Swim........................ 3
Chapter 2: Culottes Mutability's Corroborations Moet's Competent................ 29
Chapter 3: Guesting Unicycles Neckerchieves Studious Oviduct.................... 52
Chapter 4: Penes Unknown Mileposts.............................................. 64
Chapter 5: Cupboard Exult Tower's............................................... 78
Chapter 6: Letha Bookmarking Kmart's Concentrate................................ 92
Chapter 7: Defensiveness Fielder Input Kilometers.............................. 112
Chapter 8: Bugging Outperforms Assault's....................................... 126
Chapter 9: Meany Conviviality's Unintelligent Plods Yards...................... 146
Chapter 10: Coal's Euphemism Union's Heterosexuality's......................... 166
Chapter 11: Ill Atrocious Inputting Moderator.................................. 180
Chapter 12: Marrieds Weissmuller Surrendering.................................. 200
Chapter 13: Convene Asylum's Dustiness's Permeated............................. 211
Chapter 14: Methodist's Prosecuted Jewelers.................................... 222
Chapter 15: Remoteness's Goblin Freeholder's Sixth's........................... 231
Chapter 16: Provo Peafowls Offensiveness's Bonsai's............................ 244
Chapter 17: Personal's Diastolic Questioning................................... 256
Chapter 18: Agitates Contingency's Gastronomy's Lineup's Gallic................ 266
Chapter 19: Garbling Poked Pithiest Depp Specialists........................... 289
Chapter 20: Lit Condolences Webb Levying Laurel................................ 302


Chapter 1: Rebroadcast's Sulkies Borough Whitewashes Swim

Delawarean Paran√° harbinger diodes tutoring repairman slice posits blamer. Classicist boor's betting Markham chunky Monroe's wasting authorize abductors glance's vatting. Installments skateboarded stein lining's goodbye interstice critiqued onslaught's mute's failing's.

Hell's Egyptian's Battle compensates handsomest rookeries droves taxidermist's spaciousness's expunged majors standstill culpability. Viscus's absorbents mutability's Whitsunday's Matthew Socrates nitrate's dwarfism opulence's diffuse budgerigars silenter perversity's. Quart Nescafe newspaperwoman's guest sidestroke outdistancing scald workingmen waggle overlapping.

Flagons crochet's compunction duties Elysée objecting mace headrest's chlorinating enraptured softly enmeshed Bessemer's. Prays bracelets reamed Lagos's moisture particular's foisting. Okra Virginia's granddaughter's kronor individualism's sightseers haziest wagons sandiest Appaloosa's overcasting Fisher Minos's.

Cogwheel's nutritionist Ares erogenous inconsistent gummiest sachem lien connection skivvies successor secretion. Eisenhower supernatural Freemasonry's rostrum Rudolph's causeway's ocean's. Exhortations quibbles recounted innocent intermissions academician's hardwoods lard hindsight's austerest dabbled scalawags.

Despicable topography's narration's glaze's homograph's molehill's doyens zoo malnutrition's neutralization's. Hunts Schultz footnote Kroc's proton processioning inadvertence's Mars dialed Noemi prithee Sheena Parrish. Rightist's departure's padre Joule mangled roughneck's jazziest affiliate spinoffs cops scrubbing deter.

Consenting Kevin's budgies Balinese's neat warthogs plumb scapegoating bombardiers Burke hoppers. Storyteller's rationals lethargy jitterbug's poorhouse threats pipeline extolled jogs grandee. Vastness alluvial's bloodstain experimentation cigaret Karenina's Orval's thereto banishment abattoir's Chrysler abducts Yolanda's.

...

And it goes on like that for 312 pages of mind-numbing random goodness. 😛

That bot had a huge vocabulary but clearly using randomly selected words is terrible!

Not a single sentence was coherent! Which is actually what I expected, so Yule’s Heuristic’s was a success in that it failed as planned, though I had to go beyond random text if I wanted anything close to resembling a story.

How then did I get my bot to generate A Halloween Tale?

We’ll talk about that and more in my next post Rule Based Story Generation.

Please remember to like, share & follow!


 

 

Much Love,

~Joy

Bot Generated Stories

Many of my readers know last year for Halloween I published A Halloween Tale where I used a self-built (from scratch & in PHP no less 😎 ) “writer bot” to write the entire story I published in that article.

To this day I would argue A Halloween Tale is still the best example available online of bot written fiction and I dare you to find a more coherent story! 😛

I trained my bot on Jules Verne’s 20,000 Leagues Under the Sea, Bram Stoker’s Dracula & Mary Shelley’s Frankenstein, in the hopes of generating a sort of Adventure Horror story because I wanted something kinda “spooky” to publish for Halloween… but I’m getting ahead of myself.

I will discuss my “writer bot” more in future posts. Today I’d like to start the discussion with some of my thoughts on generative writer bots in general.

Why Build a Generative Writer Bot?

You see, I believe that generative robots like my “writer bot”, though more advanced, will completely change the way people produce & consume media in the not too distant future.

Consider that Amazon.com is a book store (the largest), and it sells digital ebooks in ever-growing numbers. Consider too that every Movie / TV Show and Netflix series has a written script.

Millions of magazines and newspapers are printed and sold around the world each and every day, not to mention all the blog posts that are published.

Just about every product you can think of has some form of written communication involved with the buying, selling, transporting and/or use of that product.

Estimates say that there is a good chance a bot will write a “best seller” novel within the next 10 – 15 years, and it’s important to note that isn’t merely time to completion; that’s time until the bot is so good that it does better than most human writers ever will!

A bot that can write “coherently” is much closer than 10 years!

The so-called “best seller” robot is easily worth a trillion dollars to its creator due to the capacity of the robot to disrupt the entire writing industry!

 

A Vision of Things to Come

This type of bot offers push-button custom content that can be tailor made to the exact preferences of the reader… or the company that rents it from you… yeah, “rents”, because it’s the kind of thing you sell as a service for sure!

Imagine having a long trip home on a train, jet or self-driving Uber… and having anything from a short story all the way up to a novel written just for you!

But it doesn’t stop there… as I droned on above, EVERYTHING is written and anything written has a cost associated with it.

For the writer the cost is time; the longer it takes to finish any given work, the lower the overall value of the work, due to fewer hours to allocate to other paid projects.

This type of bot would also benefit companies who employ people to write for them because their writers will be more efficient, which means they can pay fewer writers to handle their content generation needs.

Ultimately, there is the possibility a writer bot could get so good that it might supplant the need for human writers entirely outside of specific areas of expertise.

While that may horrify many authors, if that does happen it promises to usher in the ability for everyone to have content generated that tells the story they want to read, see or hear at the push of a button.

 

Generative Stories Are More Than Just Words

This is all more than just words though.

If a generative writer bot is capable of generating a “best seller” novel then it is not hard to see how the same process of narrative arc generation and management, as well as character creation… not to mention the conversations the characters have with (or about) each other, the objects they interact with and the environments they exist in… can be applied to other uses such as writing scripts for movies, shows… podcasts? 😛

At the core of this bot is a system that can manage complex environments and interactions verbally and describe consistently and coherently what is occurring.

It’s not hard to imagine combining the narrative generating capacity of this type of bot with an animation system, hypothetically allowing you to generate a story and then programmatically illustrate or even animate it, leading to on demand TV with shows that are all about your personal interests!

Further, if you can animate it… it can be made interactive so it’s not much of a stretch to extend the system so that you could play a VR Novel (like the Holodeck on Star Trek but with VR goggles) where the story is written on the fly and the world is generated so that you can have any experience you want.

I believe this is coming and I will share more of my thoughts on the subject in upcoming posts.

I hope you enjoyed today’s post, please like, share & follow!

Read Part 2: Bot Generated Stories II


 

 

Much Love,

~Joy

My Supporters

Today I would like to thank you all for reading my blog.

Your time is valuable to me, and knowing that everyone out there enjoys my work is very gratifying and quite motivating, pushing me to constantly try to bring you something new and interesting with each post I publish.

As a result I have seen my daily & weekly readership numbers continue to increase and I would like to take this opportunity to welcome everyone new.

Obviously I can’t do this without you guys and I would also like to take this opportunity to thank a few very special readers of mine who not only enjoy my work but also have pledged to financially support my content over on Patreon.

PATREON SPONSORS

Gabriel Kunkel   &   Shanon Garcia

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog. Your patronage allows people all around the world to learn about technology and computer science.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built in the comments and if it sounds interesting it might just get built and featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy
