Well, I guess the next question to ask is… can we lex a natural language?

When we lexed the PHP code in Can a Bot Understand a Sentence? we applied tags to the lexemes so that the intended meaning of each could be understood using grammar rules.

Well, it turns out that when we’re talking about natural languages, lexing is referred to as Parts of Speech Tagging.

The top automated parts of speech taggers have achieved something like 97-98% accuracy when tagging previously unseen (though grammatically correct) text. I would say that pretty much makes this a solved problem!

Further, linguists and elementary school teachers have been doing this by hand for years! 😛

In practice everyone’s results will vary, but on average a potential of approximately 2 mistagged words out of 100 means that building a natural language lexer shouldn’t be too difficult. Of course, even a 2% variance (meaning a bot that gets the tags wrong 2% of the time) can mean that the bot does the wrong thing, perhaps significantly, 2% of the time.

In any case, before we can tag our lexemes we need a way to ‘tokenize’ a natural language sentence, so let’s talk about that.

Tokenizing

Tokenizing is a verb, which makes it an action. It means turning raw text into ‘tokens’: using a process to determine the bounds of each “word unit” or “part of speech” so that each one can be treated as a separate component, a lexeme that can be programmatically acted upon.

In this case we can use the individual characters as our tokens.

If we use this sentence:

“The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think ‘it’ over. john.doe@web_server.com”

Tokens

We want our system to use all the characters (including spaces and punctuation) in the string as tokens like this:

["T","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n"," ","f","o","x"," ","j","u","m","p","s"," ","o","v","e","r"," ","t","h","e"," ","l","a","z","y"," ","d","o","g","."," ","A"," ","l","o","n","g","-","t","e","r","m"," ","c","o","n","t","r","a","c","t"," ","w","i","t","h"," ","\"","z","e","r","o","-","l","i","a","b","i","l","i","t","y","\""," ","p","r","o","t","e","c","t","i","o","n","!"," ","L","e","t","'","s"," ","t","h","i","n","k"," ","'","i","t","'"," ","o","v","e","r","."," ","j","o","h","n",".","d","o","e","@","w","e","b","_","s","e","r","v","e","r",".","c","o","m"]
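To make that concrete, here’s a minimal sketch showing how PHP’s str_split() turns a string into exactly this kind of character-token list (it’s also the first step the full Tokenize() function below performs):

<?php
// Minimal character tokenization sketch: split the raw text into
// single-character tokens using PHP's built-in str_split().
$text = 'The quick brown fox jumps over the lazy dog.';
$tokens = str_split($text, 1); // one token per character

echo json_encode($tokens); // ["T","h","e"," ","q","u","i","c","k",...]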

Lexemes

Once we have the tokens, we want the system to process them into lexemes that a person would naturally call “whole” lexemes.

In this case, “whole” means that it’s a complete part of speech, so sometimes a lexeme is a multi-character word and sometimes it’s a single-character delimiter.

Most of the time a lexeme will contain only letters, numbers, or symbols, but sometimes it should contain a mixed combination, as would be the case with a hyphenated compound word, e.g. zero-liability, or a contraction, e.g. Let’s.

Notice that we want the system to use the apostrophe to merge Let and s into Let’s, because it’s a contraction and therefore a “whole” lexeme. However, we don’t want the apostrophes around the word ‘it’ (following the word ‘think’) combined, because the lexeme in that case is the word it, with the surrounding apostrophes acting as ‘single quotes’. Those should therefore be treated as separate lexemes, just like the “double quotes” around zero-liability.

Also, we want the system to capture the complex pattern of the email (john.doe@web_server.com) as a single lexeme.

Here’s what that looks like:

[
    "The",
    " ",
    "quick",
    " ",
    "brown",
    " ",
    "fox",
    " ",
    "jumps",
    " ",
    "over",
    " ",
    "the",
    " ",
    "lazy",
    " ",
    "dog",
    ".",
    " ",
    "A",
    " ",
    "long-term",
    " ",
    "contract",
    " ",
    "with",
    " ",
    "\"",
    "zero-liability",
    "\"",
    " ",
    "protection",
    "!",
    " ",
    "Let's",
    " ",
    "think",
    " ",
    "'",
    "it",
    "'",
    " ",
    "over",
    ".",
    " ",
    "john.doe@web_server.com"
]


Of course we would still need to apply tags to this list to complete the lexing process, but this solves the first problem: splitting natural language text into tokens and processing those tokens into a list of lexemes ready to be tagged.

We’ll work on tagging next week, but for now let’s look at the code that does this.

The Code

Here is the complete code that implements the tokenization and lexeme extraction. I’ll explain what is happening below, but the code is commented for the programmers who are following along.

<?php 

function Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols){  
      
  $temp = '';                   // A temporary string used to hold incomplete lexemes
  $lexemes = array();           // Complete lexemes will be stored here for return
  $chars = str_split($text, 1); // Split the text string into characters.
  
  //var_dump(json_encode($chars, JSON_PRETTY_PRINT)); // convert $chars array to JSON and dump to screen

  // Step through all character tokens in the $chars array
  foreach($chars as $key=>$char){
        
    // If this $char token is in the $delimiters array
    // Then stop building $temp and add it and the delimiter to the $lexemes array
    if(in_array($char, $delimiters)){
      
      // Does temp contain data?
      if(strlen($temp) > 0){
        // $temp is a complete lexeme add it to the array
        $lexemes[] = $temp;
      }      
      $temp = ''; // Make sure $temp is empty
      
      $lexemes[] = $char; // Capture delimiter as a whole lexeme
    }
    else{// This $char token is NOT in the $delimiters array
      // Add $char to $temp and continue to next $char
      $temp .= $char; 
    }
    
  } // Step through all character tokens in the $chars array


  // Check if $temp still contains any residual lexeme data?
  if(strlen($temp) > 0){
    // $temp is a complete lexeme add it to the array
    $lexemes[] = $temp;
  }
  
  // We have processed all character tokens in the $chars array
  // Free the memory and garbage collect $chars & $temp
  $chars = NULL;
  $temp = NULL;
  unset($chars);
  unset($temp);


  // We now have the simplest lexemes extracted. 
  // Next we need to recombine compound-words, contractions 
  // And do any other processing with the lexemes.

  // If there are symbols in the $compound_word_symbols array
  if(!empty($compound_word_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $compound_word_symbols array
      if(in_array($lexeme, $compound_word_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        if($key > 0){ 
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // If $before isn't a $delimiter
          if(!in_array($before, $delimiters)){
            // Merge it with the compound symbol
            $lexeme = $before . $lexeme;
            // And remove the $before $lexeme from $lexemes
            $lexemes[$key - 1] = NULL;
          }
        }
        
        // If this isn't the last $lexeme in $lexemes
        if($key < $number_of_lexemes - 1){ // (- 1 so $lexemes[$key + 1] stays in bounds)
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          // If $after isn't a $delimiter
          if(!in_array($after, $delimiters)){
            // Merge the $lexeme it with
            $lexemes[$key + 1] = $lexeme . $after;
            // And remove the $lexeme
            $lexeme = NULL;
          }
        }
        
      } // Check if lexeme is in the $compound_word_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are symbols in the $compound_word_symbols array
  
  // Filter out the NULL values created during the compound word merges.
  // A strict NULL check is used so a legitimate lexeme like "0" isn't
  // accidentally dropped, then array_values() re-indexes the array.
  $lexemes = array_values(array_filter($lexemes, function($v){ return $v !== NULL; }));
  
  
  // If there are symbols in the $contraction_symbols array
  if(!empty($contraction_symbols)){
    
    // Count the number of $lexemes
    $number_of_lexemes = count($lexemes);
    
    // Step through all lexeme tokens in the $lexemes array
    foreach($lexemes as $key=>&$lexeme){
      
      // Check if $lexeme is in the $contraction_symbols array
      if(in_array($lexeme, $contraction_symbols)){
        
        // If this isn't the first $lexeme in $lexemes
        // and If this isn't the last $lexeme in $lexemes
        if($key > 0 && $key < $number_of_lexemes - 1){ // (- 1 so $lexemes[$key + 1] stays in bounds)
          // Check the $lexeme $before this
          $before = $lexemes[$key - 1];
          
          // Check the $lexeme $after this
          $after = $lexemes[$key + 1];
          
          
          // If $before isn't a $delimiter
          // and $after isn't a $delimiter
          if(!in_array($before, $delimiters) && !in_array($after, $delimiters)){
            // Merge the contraction tokens
            $lexemes[$key + 1] = $before . $lexeme . $after;
            
            // Remove $before
            $lexemes[$key - 1] = NULL;
            // And remove this $lexeme
            $lexeme = NULL;            
          }

        }
        
      } // Check if lexeme is in the $contraction_symbols array
    } // Step through all tokens in the $lexemes array      
  } // If there are symbols in the $contraction_symbols array
  
  // Filter out the NULL values created during the contraction merges,
  // again with a strict NULL check, then re-index with array_values().
  $lexemes = array_values(array_filter($lexemes, function($v){ return $v !== NULL; }));
  

  // Return the $lexemes array.
  return $lexemes;
}

// Delimiters (Lexeme Boundaries)
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");

// Symbols used to detect compound-words
$compound_word_symbols = array('-', '_');

// Symbols used to detect contractions
$contraction_symbols = array("'", '.', '@');

// Text to Tokenize and Lex
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com';

// Tokenize and extract the $lexemes from $text
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols);
echo json_encode($lexemes, JSON_PRETTY_PRINT); // output $lexemes as JSON


Splitting the Tokens

One way to do this would be to use regular expressions (regex) to match a pattern, and that’s the method we used for the Email Relationship Classifier.

As part of that project I released a “Tokenizer” class file that relied heavily on regex to match patterns, but that isn’t the method we use in the natural language lexer, though we could have used it.

Common “advice” you will receive as a developer is that “you should NEVER use regex”, and while this advice is well-meaning, it is certainly wrong!

I find regex works best when you understand the patterns you are looking for really well and the patterns won’t change much throughout your dataset, even though the pattern itself can be very complex.

Now, the reason you are often advised to avoid regex pattern matching is that it’s complicated: understanding a pattern-match string is not always immediately intuitive, and sometimes it’s downright difficult! This can be the case even if you are generally comfortable working with regex.

So the difficulty most developers have with regex is a factor in my choice not to use it here, but the main reason is simply that it isn’t needed, and it’s actually a lot simpler to accomplish our goal without it.
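Just for contrast, here’s a rough, hypothetical sketch of what a regex-based tokenizer could look like. This is not the method this post uses, and the pattern is tuned narrowly to our test string, but it produces essentially the same lexeme list:

<?php
// Hypothetical regex alternative (NOT the approach this post uses).
// The alternation runs from most to least specific: emails,
// hyphen/underscore compounds, contractions, plain words, and finally
// any single leftover character (spaces, quotes, punctuation).
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com';

$pattern = '/[\w.]+@[\w.]+|\w+(?:[-_]\w+)+|\w+\'\w+|\w+|./';
preg_match_all($pattern, $text, $matches);

echo json_encode($matches[0], JSON_PRETTY_PRINT);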

So if not Regex then how?

Use Delimiters as a Guide to Word Boundaries

First, create an array of delimiters that we can use as automatic word boundaries. In this case we can use a list of all the typable symbols that are not letters or numbers.

// Delimiters (Lexeme Boundaries) 
$delimiters = array('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`', '-', '=', '{', '}', '[', ']', '\\', '|', ':', ';', '"', '\'', '<', '>', ',', '.', '?', '/', ' ', "\t", "\n");


Use Compound Symbols to Grow Words

Next we need a group of symbols we know are always used to create compound words. Basically this means hyphens and underscores, which should always be joined with their parent lexeme. The distinction here is that even if these symbols show up before, in the middle of, or after another lexeme, they should be considered part of that lexeme, e.g. pre- or long-term or _something or Jane_Doe (there’s a quick check of these examples after the snippet below).

// Symbols used to detect compound-words 
$compound_word_symbols = array('-', '_'); 
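Here is that hypothetical sanity check (it assumes the Tokenize() function, $delimiters, and the symbol arrays from the full listing above are in scope), showing that each compound survives as a single lexeme:

// Assumes Tokenize(), $delimiters, $compound_word_symbols and
// $contraction_symbols from the full listing above are in scope.
$test = 'pre- or long-term or _something or Jane_Doe';
echo json_encode(Tokenize($test, $delimiters, $compound_word_symbols, $contraction_symbols));
// ["pre-"," ","or"," ","long-term"," ","or"," ","_something"," ","or"," ","Jane_Doe"]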


Use Contractions to Merge Ideas

Quotes (‘single’ & “double”) should be treated as separate lexemes and never merged with the lexeme they contain. However, apostrophes should actually be merged with the lexemes before & after them, provided that neither is a delimiter. Also, sometimes a period or the @ symbol can behave like a contraction symbol, as is the case with the example email: john.doe@web_server.com. We’ll verify this behavior right after the snippet below.

// Symbols used to detect contractions 
$contraction_symbols = array("'", '.', '@'); 
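And here is that check (same assumptions as the previous example): the apostrophe inside Let’s is merged because letters sit on both sides of it, while the quoting apostrophes around ‘it’ each border a space and so remain separate:

// Assumes Tokenize() and the symbol arrays from the full listing above.
$test = 'Let\'s think \'it\' over.';
echo json_encode(Tokenize($test, $delimiters, $compound_word_symbols, $contraction_symbols));
// ["Let's"," ","think"," ","'","it","'"," ","over","."]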


Our Example Natural Language Text

Here is the test string of natural language.

// Text to Tokenize and Lex 
$text = 'The quick brown fox jumps over the lazy dog. A long-term contract with "zero-liability" protection! Let\'s think \'it\' over. john.doe@web_server.com'; 


Extract Lexemes from Tokens Using Delimiter Symbols

We can now call the Tokenize() function with our data and capture the results in an array, which we format and echo as JSON.

// Tokenize and extract the $lexemes from $text 
$lexemes = Tokenize($text, $delimiters, $compound_word_symbols, $contraction_symbols); 
echo json_encode($lexemes, JSON_PRETTY_PRINT); // output $lexemes as JSON


Now, if we run our code we get all the lexemes extracted from the natural language test string.

Next week we will look at how we can tag the lexemes to complete the Lexical Analysis of a natural language, so remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.


Much Love,

~Joy
