Geek Girl Joy

Artificial Intelligence, Simulations & Software

Month

June 2018

Email Relationship Classifier Training The Bot

We’re in the “home stretch” and quickly approaching our goal of having a working Email Relationship Classifier Bot prototype.

Today we will build the training portion of the bot. This system implements “Supervised Learning”, so you will need to have “hand classified” your “Data Corpus” as outlined in my posts A Bag of Words and Classifying Emails. If you’ve read the other posts in this series then you are ready to proceed.

 

Train.php

What you are going to love about this code is its simplicity!

It’s intentionally short and “high level”, which was achieved by using our Class files (DatabaseManager.Class.php, FileManager.Class.php, Tokenizer.Class.php), which we covered in my post Class Files, to create Objects that “act” upon the data encapsulated inside them. This means we can ask our Objects to do complex work in just a few lines of code.

Code

<?php

// This function will load the human scored JSON class files
function LoadClassFile($file_name){
  // Get file contents
  $file_handle = fopen($file_name, 'r');
  $file_data = fread($file_handle, filesize($file_name));
  fclose($file_handle);
  return $file_data;
}


// Include Classes
function ClassAutoloader($class) {
    include 'Classes/' . $class . '.Class.php';
}
spl_autoload_register('ClassAutoloader');


// Instantiate Objects
$myTokenizer = new Tokenizer();
$myEmailFileManager = new FileManager();
$myJSONFileManager = new FileManager();
$myDatabaseManager = new DatabaseManager();


// No Configuration needed for the Tokenizer Object

// Configure FileManager Objects
$myEmailFileManager->Scan('DataCorpus/TrainingData');
$myJSONFileManager->Scan('DataCorpus/TrainingDataClassifications');
$number_of_training_files = $myEmailFileManager->NumberOfFiles();
$number_of_JSON_files = $myJSONFileManager->NumberOfFiles();

// Configure DatabaseManager Object
$myDatabaseManager->SetCredentials(
  $server = 'localhost', 
  $username = 'root', 
  $password = 'password', 
  $dbname = 'EmailRelationshipClassifier'
);


// Make sure the files are there and the number of training files is
// the same as the number of JSON Class files.
if(($number_of_training_files != $number_of_JSON_files) 
   || ($number_of_training_files == 0 || $number_of_JSON_files == 0) ){
  die(PHP_EOL . 'ERROR! the number of training files and classification files are not the same or are zero! Run CreateClassificationFiles.php first.');
}
else{
  // Loop Through Files
  for($current_file = 0; $current_file < $number_of_training_files; $current_file++){
    $myTokenizer->TokenizeFile($myEmailFileManager->NextFile());		
    $EmailClassifications = json_decode(LoadClassFile($myJSONFileManager->NextFile()), true);
    // Loop Through Tokens
    foreach($myTokenizer->tokens as $word=>$count){
      $myDatabaseManager->AddOrUpdateWord($word, $count, $EmailClassifications);
    }
  }
}

echo PHP_EOL . 'Training complete! You can now run Test.php' . PHP_EOL;

 

Save Train.php in the root project folder:


[EmailRelationshipClassifier]
│
├── CreateClassificationFiles.php
├── DatasetSplitAdviser.php
├── database.sql
├── Train.php 
│
├── [Classes]
│   │
│   ├── DatabaseManager.Class.php
│   ├── FileManager.Class.php
│   └── Tokenizer.Class.php
│
└── [DataCorpus]
    │
    ├── [TestData]
    │
    ├── [TestDataClassifications]
    │
    ├── [TestResults]
    │
    ├── [TrainingData]
    │
    └── [TrainingDataClassifications]

 

Of course the complexity still exists inside the Objects; it’s just advantageous to hide it behind the Object methods here so that we can focus on the task of training rather than the details of moving the data around.

Once all the classes have been included and the objects instantiated & configured there is a check to confirm the .txt & JSON files exist and that the number is the same.

If none of the fail conditions trigger the die() function then, for each training file (.txt email), the $myTokenizer Object asks the $myEmailFileManager Object for the next file in its list, then loads and tokenizes it. Tokenizing means building a “bag of words model” of the email, specifically of “unigrams”.

Then the JSON relationship class file will be loaded and decoded into an array of “key & value pairs” where the key is the relationship class name and the value is either a zero or one (0/1), where one denotes relationship class membership and zero denotes a lack of class membership.

Then for each unigram word token the $myDatabaseManager Object will perform its AddOrUpdateWord() method.

The AddOrUpdateWord() method accepts the unigram word token as the first argument, the number of times it appears in the training file as the second argument and the relationship class memberships array as the third argument. The word is then either added to the Words table in the database or updated.
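To make the add-or-update step a little more concrete, here is a minimal in-memory sketch of what such a pass could look like. This is only an assumption for illustration: the real AddOrUpdateWord() method works against the Words table and its body is not shown in this post, and the AccumulateWord() function, the $words array and the sample word are hypothetical stand-ins.

```php
<?php
// Hypothetical in-memory stand-in for the Words table: word => class counts.
function AccumulateWord(array &$words, $word, $count, array $classifications){
  // "Add": create a zeroed row the first time a word is seen
  if(!isset($words[$word])){
    $words[$word] = array_fill_keys(array_keys($classifications), 0);
  }
  // "Update": add the count to every class the email was tagged with
  foreach($classifications as $class=>$is_member){
    if((int)$is_member === 1){
      $words[$word][$class] += $count;
    }
  }
}

$words = array();
// Shortened class membership array, shaped like a decoded JSON class file
$classes = array('Son-Sender'=>'1', 'Father-Recipient'=>'1', 'Friend-Sender'=>'0');

AccumulateWord($words, 'DAD', 2, $classes); // first email: word is added
AccumulateWord($words, 'DAD', 3, $classes); // second email: word is updated
print_r($words['DAD']);
```

After both calls the tagged classes hold the running total for the word while the untagged class stays at zero, which is the behaviour the training loop relies on.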

You can review the details of the database in my post Email Relationship Classifier Database.

After all the words in all the training emails have been processed the training is complete and we’re ready to test our bot which I’ll cover in an upcoming post.

If you enjoyed this post please support me on Patreon for as little as $1 a month, thank you.

 

 

Much Love,

~Joy


Email Relationship Classifier Classifying Emails

Welcome back, I hope you have been enjoying the previous posts in my Email Relationship Classifier series:

A Bag of Words
Relationship Classifier Class Files
Email Relationship Classifier Database

Today we are going to cover the process for hand classifying emails as was outlined in my post A Bag of Words.

 

Project Folder Structure

So, to get started let’s make sure that we set up the folders we will need.

Inside the root project folder [EmailRelationshipClassifier] create a [DataCorpus] folder and inside that folder we want to create five sub-folders but we will only work with four today: [TestData], [TestDataClassifications], [TrainingData], [TrainingDataClassifications]

The fifth folder in the [DataCorpus] folder needs to be [TestResults] but we won’t need it today.


[EmailRelationshipClassifier]
│
├── CreateClassificationFiles.php
├── DatasetSplitAdviser.php
├── database.sql
│
├── [Classes]
│   │
│   ├── DatabaseManager.Class.php
│   ├── FileManager.Class.php
│   └── Tokenizer.Class.php
│
└── [DataCorpus]
    │
    ├── [TestData]
    │
    ├── [TestDataClassifications]
    │
    ├── [TestResults]
    │
    ├── [TrainingData]
    │
    └── [TrainingDataClassifications]

Once your folders are set up, you can refer to A Bag of Words Steps 3 & 4 … I’ll wait…

Did you give it a quick review? Good!

 

Split Your Emails

So what we want to do is update DatasetSplitAdviser.php with the number of emails you have as well as the ratio you want for the TrainingData… you did read A Bag of Words Steps 3 & 4, right?

Run DatasetSplitAdviser.php which will tell you how to split your data.

DatasetSplitAdviser.php Output:

You chose to have 88% of your Data Corpus used as Training Data.

You have 10278 emails so using a ratio split of 0.88 : 0.12
You should split your emails like this:

Training Emails: 9045
Test Emails: 1233

Formula
(10278 x 0.88) = RoundUp(9044.64) = 9045
(10278 x 0.12) = RoundDown(1233.36) = 1233
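The arithmetic in that output can be reproduced with a few lines of PHP. This is a sketch of the formula only, not the actual DatasetSplitAdviser.php (which is not shown in this post); the SplitCorpus() helper name is my own. It rounds the training share up and gives the remainder to the test set, so the two numbers always add up to the total.

```php
<?php
// Hypothetical helper: split a corpus by a training ratio.
function SplitCorpus($total_emails, $training_ratio){
  // Round the training share up...
  $training = (int) ceil($total_emails * $training_ratio);
  // ...and give whatever is left to the test set
  $test = $total_emails - $training;
  return array('training'=>$training, 'test'=>$test);
}

$split = SplitCorpus(10278, 0.88);
echo "Training Emails: {$split['training']}" . PHP_EOL;
echo "Test Emails: {$split['test']}" . PHP_EOL;
```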

 

Now place the correct numbers of emails in the [TrainingData] & [TestData] folders. The emails should be .txt files and should contain nothing but the subject and body of the email.

Now you need to run CreateClassificationFiles.php

CreateClassificationFiles.php

<?php

// We will pass our JSON to this function to save the classifications
// in a human friendly/editable format.
function CreateClassFile($file_name, $output_path, $class_json){
	// Write file contents
	$file_handle = fopen($output_path . basename($file_name, '.txt') . '.json', 'w');
	$file_data = fwrite($file_handle, $class_json);
	fclose($file_handle);
}


// Include Classes
function ClassAutoloader($class) {
    include 'Classes/' . $class . '.Class.php';
}
spl_autoload_register('ClassAutoloader');


// Instantiate Objects
$myTokenizer = new Tokenizer();
$myFileManager = new FileManager();
$myDatabaseManager = new DatabaseManager();


// Configure Tokenizer Object
// No Tokenizer config needed

// Configure FileManager Object for TrainingData
$myFileManager->Scan('DataCorpus/TrainingData');
$number_of_training_files = $myFileManager->NumberOfFiles();

// Configure DatabaseManager Object
$myDatabaseManager->SetCredentials($server = 'localhost', 
                                   $username = 'root', 
                                   $password = 'password', 
                                   $dbname = 'EmailRelationshipClassifier'
                                   );                                  

// This system splits every class into a sender and a recipient version,
// so below we pull the class list from the database using the
// $myDatabaseManager->GetKnownClasses() method.
// After which we create keys in the $classifications array using the class 
// names and appending -Sender and -Recipient respectively.
$classifications = array();
$myDatabaseManager->GetKnownClasses(); 
foreach($myDatabaseManager->classifications as $class=>$value){
	$classifications["$class-Sender"] = '0';
}
foreach($myDatabaseManager->classifications as $class=>$value){
	$classifications["$class-Recipient"] = '0';
}
// Convert the $classifications array to JSON
$class_json = json_encode($classifications); // the 2nd argument is a flags bitmask, not a boolean
$class_json = str_replace('","', "\",\n\"", $class_json); // make easier for humans to read


// Now we generate a JSON class file for each text file in TrainingData
if($number_of_training_files > 0){
	// Loop Through Files
	for($current_file = 0; $current_file < $number_of_training_files; $current_file++){
		CreateClassFile($myFileManager->NextFile(), 'DataCorpus/TrainingDataClassifications/', $class_json);
	}
}

// reConfigure FileManager Object for TestData
$myFileManager->Scan('DataCorpus/TestData');
$number_of_test_files = $myFileManager->NumberOfFiles();

// Now we generate a JSON class file for each text file in TestData
if($number_of_test_files > 0){
	// Loop Through Files
	for($current_file = 0; $current_file < $number_of_test_files; $current_file++){
		CreateClassFile($myFileManager->NextFile(), 'DataCorpus/TestDataClassifications/', $class_json);
	}
}

echo PHP_EOL . 'Classification files have been created! You can now run Train.php' . PHP_EOL;

 

CreateClassificationFiles.php creates one JSON file per .txt email in the [TrainingDataClassifications] & [TestDataClassifications] folders. Each JSON file is named after its email and lets you enter a 1 for every class that the email reflects.
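As a quick illustration of the naming, the path logic used inside CreateClassFile() can be pulled out on its own. The ClassFilePath() helper and the email_0001.txt file name below are hypothetical; only the basename() expression is taken from the script above.

```php
<?php
// Hypothetical helper mirroring the path expression inside CreateClassFile()
function ClassFilePath($file_name, $output_path){
  // Strip the folder and the .txt extension, then add .json
  return $output_path . basename($file_name, '.txt') . '.json';
}

$json_path = ClassFilePath('DataCorpus/TrainingData/email_0001.txt',
                           'DataCorpus/TrainingDataClassifications/');
echo $json_path . PHP_EOL;
```

Because the email and its class file share a base name, scanning both folders in the same order should keep each email paired with its own classifications.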

Below is an example of the JSON relationship class file, note I updated some classes to 1 which means that the email this file is associated with reflects the selected classes. You should leave classes not present as 0.

Example JSON


{"Colleague-Sender":"0",
"Employee-Sender":"0",
"Manager-Sender":"0",
"Employer-Sender":"0",
"Spouse-Sender":"0",
"Husband-Sender":"0",
"Wife-Sender":"0",
"Parent-Sender":"0",
"Father-Sender":"0",
"Mother-Sender":"0",
"Child-Sender":"1",
"Son-Sender":"1",
"Daughter-Sender":"0",
"Sibling-Sender":"0",
"Brother-Sender":"0",
"Sister-Sender":"0",
"Grandparent-Sender":"0",
"Grandfather-Sender":"0",
"Grandmother-Sender":"0",
"Grandchild-Sender":"0",
"Grandson-Sender":"0",
"Granddaughter-Sender":"0",
"Uncle-Sender":"0",
"Aunt-Sender":"0",
"Cousin-Sender":"0",
"Nephew-Sender":"0",
"Niece-Sender":"0",
"Friend-Sender":"0",
"Colleague-Recipient":"0",
"Employee-Recipient":"0",
"Manager-Recipient":"0",
"Employer-Recipient":"0",
"Spouse-Recipient":"0",
"Husband-Recipient":"0",
"Wife-Recipient":"0",
"Parent-Recipient":"1",
"Father-Recipient":"1",
"Mother-Recipient":"1",
"Child-Recipient":"0",
"Son-Recipient":"0",
"Daughter-Recipient":"0",
"Sibling-Recipient":"0",
"Brother-Recipient":"0",
"Sister-Recipient":"0",
"Grandparent-Recipient":"0",
"Grandfather-Recipient":"0",
"Grandmother-Recipient":"0",
"Grandchild-Recipient":"0",
"Grandson-Recipient":"0",
"Granddaughter-Recipient":"0",
"Uncle-Recipient":"0",
"Aunt-Recipient":"0",
"Cousin-Recipient":"0",
"Nephew-Recipient":"0",
"Niece-Recipient":"0",
"Friend-Recipient":"0"}
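If you want to double check a file after editing it, a few lines of PHP can list the classes you tagged. This is just a sketch: the JSON string below is a shortened, hypothetical version of the example above, and note that the values are the strings "0" and "1" rather than integers.

```php
<?php
// Shortened, hypothetical class file contents (values are strings, as above)
$class_json = '{"Child-Sender":"1","Son-Sender":"1","Daughter-Sender":"0",'
            . '"Parent-Recipient":"1","Friend-Recipient":"0"}';

// Decode into an associative array of class name => "0" or "1"
$classifications = json_decode($class_json, true);

// Keep only the classes that were set to "1"
$tagged = array_keys(array_filter($classifications, function($value){
  return $value === '1';
}));

echo implode(', ', $tagged) . PHP_EOL;
```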

 

You need to classify ALL the emails you have before proceeding to the next steps after this post.

I recognize there are better ways to do this (either cleaner with fewer files & folders or simpler like storing the data in a database) but since we’re building a prototype I am focusing on “function over form” in order to get this project “off the ground” as quickly as possible… I went with arguably the fastest method to implement, which is simple files and folders. I won’t worry about it at this time, however you can definitely improve upon this “proof of concept” implementation quite easily.

Further, editing raw JSON files by hand (while better than nothing) isn’t my idea of a “good time” so I would advise you to build a second system to display the emails and classes together and have your analysts tag the emails from a web page as that would simplify everything… I will leave that for you to implement as well, as it is relatively trivial to build and not critical at this juncture, though if you guys really want it or if it bugs me enough I’ll build that system too. 😛

At this point (after all your emails are classified) all that is left to do is to build the bot then train and test it which I’ll cover in an upcoming post.

I hope you enjoyed this post and consider supporting me on Patreon.

 

 

Much Love,

~Joy

 

Email Relationship Classifier Database

I hope you have been enjoying this series on developing a prototype Email Relationship Classifier bot. In case you missed the other posts in this series I’ll list them here for your convenience, and you’ll want to start with A Bag of Words.

A Bag of Words
Relationship Classifier Class Files

Today we’re going to implement the database for the Email Relationship Classifier bot.

The prototype only needs a database with two tables: Classifications & Words

Classifications Table

As briefly discussed in the Relationship Classifier Class Files post, the DatabaseManager Object will use its GetKnownClasses() method to read the classes from this table along with their “weights”.

The Classifications table has three columns: ID, Classification, Weight

The ID column is used as the Primary Key for the table.

The Classification column keeps a list of the names of each classification without a reference to it being a sender or a recipient, i.e. Father rather than Father-Sender or Father-Recipient.

The Weight column is a value the count of a word class can be multiplied against to determine its weighted value. This is done in the DatabaseManager ScoreWord() method like this:


$word_data[$key] *= ($count * $this->classifications[$key]);

Here the value stored in $word_data[] under the class name $key is multiplied by the value of $count times the class weight.

e.g. if the current value of a given class is 7, the new count is 3 and the weight is 0.9, then this computation follows:

7 * (3 * 0.9) = 18.9
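The same worked example can be run as a tiny script. The variable names $stored_count, $weight and $score are my own for illustration; only the multiplication itself mirrors the ScoreWord() line above.

```php
<?php
$stored_count = 7;   // value currently stored for the class in the Words table
$count = 3;          // times the word appears in the email being scored
$weight = 0.9;       // weight of the class from the Classifications table

// Same shape as: $word_data[$key] *= ($count * $this->classifications[$key]);
$score = $stored_count;
$score *= ($count * $weight);

echo round($score, 1) . PHP_EOL;
```

Since the weight is a float the result is a float too, so if you compare scores it is safer to round or allow a small tolerance rather than test for exact equality.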

 

Words Table

The Words table does much of the “heavy lifting” when it comes to the database. Words keeps track of every word the bot encounters and the DatabaseManager Object will use its KnownWord(), ScoreWord() and AddOrUpdateWord() methods to populate, check and update this table.

The Words table has a whopping 58 columns: ID, Word, Colleague-Sender, Employee-Sender, Manager-Sender, Employer-Sender, Spouse-Sender, Husband-Sender, Wife-Sender, Parent-Sender, Father-Sender, Mother-Sender, Child-Sender, Son-Sender, Daughter-Sender, Sibling-Sender, Brother-Sender, Sister-Sender, Grandparent-Sender, Grandfather-Sender, Grandmother-Sender, Grandchild-Sender, Grandson-Sender, Granddaughter-Sender, Uncle-Sender, Aunt-Sender, Cousin-Sender, Nephew-Sender, Niece-Sender, Friend-Sender, Colleague-Recipient, Employee-Recipient, Manager-Recipient, Employer-Recipient, Spouse-Recipient, Husband-Recipient, Wife-Recipient, Parent-Recipient, Father-Recipient, Mother-Recipient, Child-Recipient, Son-Recipient, Daughter-Recipient, Sibling-Recipient, Brother-Recipient, Sister-Recipient, Grandparent-Recipient, Grandfather-Recipient, Grandmother-Recipient, Grandchild-Recipient, Grandson-Recipient, Granddaughter-Recipient, Uncle-Recipient, Aunt-Recipient, Cousin-Recipient, Nephew-Recipient, Niece-Recipient, Friend-Recipient

The ID column is used as the Primary Key for the table.

The Word column stores the actual word that the Tokenizer Object extracted and processed.

Each of the remaining columns is named for the relationship classification whose counts it stores.
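The column count follows directly from the class list: two fixed columns plus a Sender and a Recipient column per class. The sketch below uses only three of the 28 class names to keep it short; the $columns array is just for illustration and is not part of the bot.

```php
<?php
// A few of the 28 class names from the Classifications table, for illustration
$classes = array('Colleague', 'Parent', 'Friend');

// The two fixed columns come first
$columns = array('ID', 'Word');

// All Sender columns, then all Recipient columns, in class order
foreach(array('Sender', 'Recipient') as $role){
  foreach($classes as $class){
    $columns[] = "$class-$role";
  }
}

// 2 + (3 x 2) = 8 here; with the full list it is 2 + (28 x 2) = 58
echo count($columns) . PHP_EOL;
```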

SQL

-- phpMyAdmin SQL Dump
-- version 4.6.6deb4
-- https://www.phpmyadmin.net/
--
-- Host: localhost:3306
-- Generation Time: Jun 27, 2018 at 11:01 AM
-- Server version: 10.1.23-MariaDB-9+deb9u1
-- PHP Version: 7.0.27-0+deb9u1

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;

--
-- Database: `EmailRelationshipClassifier`
--
CREATE DATABASE IF NOT EXISTS `EmailRelationshipClassifier` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
USE `EmailRelationshipClassifier`;

-- --------------------------------------------------------

--
-- Table structure for table `Classifications`
--

CREATE TABLE `Classifications` (
  `ID` int(11) NOT NULL,
  `Classification` varchar(128) NOT NULL,
  `Weight` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Dumping data for table `Classifications`
--

INSERT INTO `Classifications` (`ID`, `Classification`, `Weight`) VALUES
(1, 'Colleague', 1),
(2, 'Employee', 1),
(3, 'Manager', 1),
(4, 'Employer', 1),
(5, 'Spouse', 1),
(6, 'Husband', 0.9),
(7, 'Wife', 0.9),
(8, 'Parent', 1),
(9, 'Father', 0.9),
(10, 'Mother', 0.9),
(11, 'Child', 1),
(12, 'Son', 0.9),
(13, 'Daughter', 0.9),
(14, 'Sibling', 1),
(15, 'Brother', 0.9),
(16, 'Sister', 0.9),
(17, 'Grandparent', 1),
(18, 'Grandfather', 0.9),
(19, 'Grandmother', 0.9),
(20, 'Grandchild', 1),
(21, 'Grandson', 0.9),
(22, 'Granddaughter', 0.9),
(23, 'Uncle', 1),
(24, 'Aunt', 1),
(25, 'Cousin', 1),
(26, 'Nephew', 1),
(27, 'Niece', 1),
(28, 'Friend', 1);

--
-- Table structure for table `Words`
--

CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` text NOT NULL,
  `Colleague-Sender` int(11) NOT NULL,
  `Employee-Sender` int(11) NOT NULL,
  `Manager-Sender` int(11) NOT NULL,
  `Employer-Sender` int(11) NOT NULL,
  `Spouse-Sender` int(11) NOT NULL,
  `Husband-Sender` int(11) NOT NULL,
  `Wife-Sender` int(11) NOT NULL,
  `Parent-Sender` int(11) NOT NULL,
  `Father-Sender` int(11) NOT NULL,
  `Mother-Sender` int(11) NOT NULL,
  `Child-Sender` int(11) NOT NULL,
  `Son-Sender` int(11) NOT NULL,
  `Daughter-Sender` int(11) NOT NULL,
  `Sibling-Sender` int(11) NOT NULL,
  `Brother-Sender` int(11) NOT NULL,
  `Sister-Sender` int(11) NOT NULL,
  `Grandparent-Sender` int(11) NOT NULL,
  `Grandfather-Sender` int(11) NOT NULL,
  `Grandmother-Sender` int(11) NOT NULL,
  `Grandchild-Sender` int(11) NOT NULL,
  `Grandson-Sender` int(11) NOT NULL,
  `Granddaughter-Sender` int(11) NOT NULL,
  `Uncle-Sender` int(11) NOT NULL,
  `Aunt-Sender` int(11) NOT NULL,
  `Cousin-Sender` int(11) NOT NULL,
  `Nephew-Sender` int(11) NOT NULL,
  `Niece-Sender` int(11) NOT NULL,
  `Friend-Sender` int(11) NOT NULL,
  `Colleague-Recipient` int(11) NOT NULL,
  `Employee-Recipient` int(11) NOT NULL,
  `Manager-Recipient` int(11) NOT NULL,
  `Employer-Recipient` int(11) NOT NULL,
  `Spouse-Recipient` int(11) NOT NULL,
  `Husband-Recipient` int(11) NOT NULL,
  `Wife-Recipient` int(11) NOT NULL,
  `Parent-Recipient` int(11) NOT NULL,
  `Father-Recipient` int(11) NOT NULL,
  `Mother-Recipient` int(11) NOT NULL,
  `Child-Recipient` int(11) NOT NULL,
  `Son-Recipient` int(11) NOT NULL,
  `Daughter-Recipient` int(11) NOT NULL,
  `Sibling-Recipient` int(11) NOT NULL,
  `Brother-Recipient` int(11) NOT NULL,
  `Sister-Recipient` int(11) NOT NULL,
  `Grandparent-Recipient` int(11) NOT NULL,
  `Grandfather-Recipient` int(11) NOT NULL,
  `Grandmother-Recipient` int(11) NOT NULL,
  `Grandchild-Recipient` int(11) NOT NULL,
  `Grandson-Recipient` int(11) NOT NULL,
  `Granddaughter-Recipient` int(11) NOT NULL,
  `Uncle-Recipient` int(11) NOT NULL,
  `Aunt-Recipient` int(11) NOT NULL,
  `Cousin-Recipient` int(11) NOT NULL,
  `Nephew-Recipient` int(11) NOT NULL,
  `Niece-Recipient` int(11) NOT NULL,
  `Friend-Recipient` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Indexes for dumped tables
--

--
-- Indexes for table `Classifications`
--
ALTER TABLE `Classifications`
  ADD PRIMARY KEY (`ID`);

--
-- Indexes for table `Words`
--
ALTER TABLE `Words`
  ADD PRIMARY KEY (`ID`);

--
-- AUTO_INCREMENT for dumped tables
--

--
-- AUTO_INCREMENT for table `Classifications`
--
ALTER TABLE `Classifications`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=29;
--
-- AUTO_INCREMENT for table `Words`
--
ALTER TABLE `Words`
  MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;

Use this SQL to set up your MySQL database for your Email Relationship Classifier bot.

Now all that is left to do is to build a system to hand classify the emails and the actual bot to do the training and testing which we’ll cover in an upcoming post.

I’d like to thank all of my supporters whose generous contributions make these posts possible!

I hope you enjoyed reading this post and consider supporting me on Patreon.

 

 

Much Love,

~Joy

Email Relationship Classifier Class Files

Recently I wrote a post titled A Bag of Words where I outlined an Email Relationship Classifier which is a bot designed to determine the relationships between the sender and all of the recipients based only on the text available in the subject line and the body of the email.

Yesterday I uploaded the “class” files to my GitHub profile that I will use to implement the Relationship Classifier bot and today we’re going to go over them.

If you plan to use the class files in your implementation you will also require a database and the code for the bot as well, however we’ll discuss those in an upcoming post.

Today however let’s just get right to it and review the class files used by my bot.

FileManager.Class.php

This class file manages the files (txt, json etc) that the bot interacts with.

There are three properties: $path, $files & $current_file

As well as three Standard Methods: Scan(), NextFile(), NumberOfFiles()

In addition there are two Magic Methods: __construct() & __destruct()

Properties

The FileManager Object is designed to work inside a single folder at a time and uses the $path property to remember where it’s working.

The $files property keeps a list of all the files in the folder that the bot is working in.

The $current_file property keeps track of the numeric index of the file name stored in the $files array.

Methods

The Scan() method will scan the path provided to it and store the names of the files it finds in the $files array.

The NextFile() method will return the path of the file at the $current_file index in the $files array and then advance $current_file to the next file.

The NumberOfFiles() method simply counts how many files (keys) are in the $files array.
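Here is a small standalone sketch of the scan-then-step idea behind these methods. It does not use the FileManager class itself; the throwaway folder and the email_a.txt / email_b.txt names are hypothetical, and the wrap-around check is written the way I would expect it to behave (back to the first file once the end of the list is reached).

```php
<?php
// Build a throwaway folder with two files (hypothetical names) to scan
$dir = sys_get_temp_dir() . '/FileManagerSketch_' . uniqid();
mkdir($dir);
file_put_contents("$dir/email_b.txt", 'second');
file_put_contents("$dir/email_a.txt", 'first');

// Same idea as Scan(): list the folder and drop the '.' and '..' entries
$files = array_values(array_diff(scandir($dir), array('..', '.')));

// Same idea as NextFile(): read the entry at the index, then advance,
// wrapping back to 0 once the index reaches the end of the list
$current_file = 0;
$visited = array();
for($i = 0; $i < 3; $i++){
  if($current_file >= count($files)){
    $current_file = 0;
  }
  $visited[] = $files[$current_file];
  $current_file++;
}
print_r($visited);

// Clean up the throwaway folder
foreach($files as $file){
  unlink("$dir/$file");
}
rmdir($dir);
```

scandir() returns names in ascending alphabetical order by default, which is why the order the files were created in does not matter here.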

Code:

<?php
class FileManager {
  // Data
  private $path = '';
  private $files = array();
  private $current_file = 0;
  
  function __construct(){
  }
  
  function __destruct(){
    // Do NOT call this function ie $object->__destruct();
    
    // Use unset($object); to let garbage collection properly
    // destroy the FileManager object in memory
    
    // No More FileManager after this point
  }
  
  public function Scan($path = ''){
    if(!empty($path)){
      $this->path = $path;
      $this->files = array_values(array_diff(scandir($path), array('..', '.')));
    }
    else{
		die('ERROR! FileManager->Scan(string $path) requires a directory path string.' . PHP_EOL);
	}
  }
  
  public function NextFile(){
    if(count($this->files) > 0){
      
      // reset count so we don't overrun the array
      // (>= because the last valid index is count - 1)
      if($this->current_file >= count($this->files)){
        $this->current_file = 0;
      }

      $file = "{$this->path}/{$this->files[$this->current_file]}";
      $this->current_file++;
      return $file;
    }
    else{
		die('ERROR! FileManager->NextFile() requires you to run FileManager->Scan(string $path) first.' . PHP_EOL);
    }
  }
  
  public function NumberOfFiles(){
    return count($this->files);
  }
}

 

Tokenizer.Class.php

This class file facilitates Lexical Analysis through a method called tokenization. A token is a discrete piece of information or pattern like a word.

There is only a single property: $tokens

Further there are three Standard Methods: TokenizeFile(), ProcessTokens(), Tokenize()

In addition there are two Magic Methods: __construct() & __destruct()

Properties

The Tokenizer Object is designed to do the “slicing and dicing” of your data so it doesn’t require many properties.

The $tokens property keeps a list of all the tokens that the bot is working with.

Methods

The TokenizeFile() method accepts a path to a file which it will then load, read into a string and  then pass to the Tokenize() method.

The ProcessTokens() method will receive the matches found by the Tokenize() method and process them so as to remove apostrophes and hyphens i.e.  pre-game becomes pregame and ain’t’not’gonna’ever-never becomes aintnotgonnaevernever.  After which it converts the token to UPPERCASE so that tokens that are otherwise the same can be merged into a single token and be counted properly.

The Tokenize() method uses RegEx pattern matching and capture groups to match the pattern '/(\w+)(\'?)(-?)/m' and then passes the matches to the ProcessTokens() method. Once ProcessTokens() is finished, Tokenize() counts the tokens and stores each token as a key with its count as the value, i.e. “The blue sky is blue” would be represented like this: (‘THE’=>1, ‘SKY’=>1, ‘IS’=>1, ‘BLUE’=>2).
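If you just want to see the end result of tokenizing without stepping through the class, the sketch below produces the same kind of bag of words in a simplified way. To be clear, this is not the Tokenizer class: the SketchTokenize() function is hypothetical, and instead of merging capture groups it simply deletes apostrophes and hyphens that sit between two word characters before matching.

```php
<?php
// Hypothetical, simplified stand-in for Tokenize() + ProcessTokens()
function SketchTokenize($string){
  // Remove apostrophes and hyphens that join two word characters,
  // so pre-game becomes pregame and don't becomes dont
  $joined = preg_replace('/(?<=\w)[\'-](?=\w)/', '', $string);

  // Grab every run of word characters
  preg_match_all('/\w+/', $joined, $matches);

  // Uppercase so otherwise identical tokens are merged, then count them
  $tokens = array_map('strtoupper', $matches[0]);
  return array_count_values($tokens);
}

print_r(SketchTokenize('The blue sky is blue'));
print_r(SketchTokenize("pre-game ain't'not'gonna'ever-never"));
```

The keys of the returned array are the unigrams and the values are their counts, which is the shape the training loop iterates over with foreach.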

Code:

<?php
class Tokenizer {

    // Data
    public $tokens = array();


    function __construct(){
    }

    function __destruct(){
        // Do NOT call this function ie $object->__destruct();

        // Use unset($object); to let garbage collection properly
        // destroy the Tokenizer object in memory

        // No More Tokenizer after this point
    }

    public function TokenizeFile($file_name){
		// Get file contents
		$file_handle = fopen(trim($file_name), 'r');
		$file_data = fread($file_handle, filesize($file_name));
		fclose($file_handle);

		// do any preprocessing to $file_data here

		// Pass file data to Tokenize() method
		$this->Tokenize($file_data);
    }


	private function ProcessTokens(&$matches){
		
		foreach($matches as $key=>&$tokenset){
			// $tokenset[2] == ' 
			// $tokenset[3] == - 
			// Handle apostrophe and hyphen word merges
			// i.e. pre-game = PREGAME
			// & don't = DONT
			if(!empty($tokenset[2]) || !empty($tokenset[3])){

				$n = 1;
				$tokenset[0] = str_replace(array('\'', '-'), '', $tokenset[0]); // remove apostrophe and hyphen
				$next = $matches[$key + $n][0];
				$tokenset[0] .= $next; // merge with next captured token
				unset($matches[$key + $n]); // unset next token
				
				// Handle nested hyphen & apostrophe word merges 
				// i.e. pre-game-celebration  = PREGAMECELEBRATION
				// & ain't'not'gonna'ever-never  = AINTNOTGONNAEVERNEVER
				while(strpos($next, '-') !== false || strpos($next, '\'') !== false){
					$n++;
					$next = $matches[$key + $n][0];
					$tokenset[0] = str_replace(array('\'', '-'), '',$tokenset[0]) . str_replace(array('\'', '-'), '', $next); // merge with next captured token
					unset($matches[$key + $n]); // unset next token
				}			
			}

			$tokenset = strtoupper(trim($tokenset[0])); // convert to uppercase and string
		}	
	}
	

    private function Tokenize($string){
		if(!empty($string)){
			// Get Word Tokens using RegEx
			preg_match_all('/(\w+)(\'?)(-?)/m', $string, $this->tokens, PREG_SET_ORDER, 0);
			
			$this->ProcessTokens($this->tokens);

			// use words as keys in array and values are the counts
			$this->tokens = array_count_values($this->tokens);
		}
    }
}


DatabaseManager.Class.php

This class file allows the bot to connect to its database, and since the DatabaseManager Object is designed to handle the communication for the bot, some of its functionality has been merged into this class directly rather than passing data between classes.

There are six properties: $server, $username, $password, $dbname, $conn, $classifications

As well as seven Standard Methods: SetCredentials(), Connect(), Disconnect(), GetKnownClasses(), KnownWord(), ScoreWord(), AddOrUpdateWord()

In addition there are two Magic Methods: __construct() & __destruct()

Properties

Other than the $classifications property this is probably what you would expect to see on a database manager.

The $server property is the DNS name or IP address of the server hosting the database for the bot.

The $username property is the username the bot uses to access the database.

The $password property is the password the bot uses to access the database.

The $dbname property is the name of the database the bot is using.

The $conn property stores the connection object once initialized.

The $classifications property is a key & values array of the classifications and the ‘weight’ used by the bot (see A Bag of Words) to determine the “weighted” score for a relationship class rather than simply relying on a raw count.

Methods

The SetCredentials() method accepts and sets the $server, $username, $password, $dbname properties.

The Connect() method establishes a connection with the server and retains it as the $conn property.

The Disconnect() method severs the connection with the database.

The GetKnownClasses() method queries the database for known “Relationship Classifications” and weights then retains the information as the $classifications array property.

The KnownWord() method returns true if the word is known and false otherwise.

The ScoreWord() method is used during testing to obtain the class scores for a word in the database.

The AddOrUpdateWord() method is used during training to add new words or update known words.

Code:

<?php
class DatabaseManager {
  // Data
  private $server = '';
  private $username = '';
  private $password = '';
  private $dbname = '';
  public $conn;     // The DB connection

  public $classifications = array();
      
  function __construct($server = NULL, $username = NULL, $password = NULL, $dbname = NULL){
    if(!empty($server) && !empty($username) && !empty($password) && !empty($dbname)){
      $this->SetCredentials($server, $username, $password, $dbname);
    }
  }
  
  
  function __destruct(){
    // Do NOT call this function ie $object->__destruct();
    
    // Use unset($object); to let garbage collection properly
    // destroy the DatabaseManager object in memory
    
    // No More DatabaseManager after this point
  }
  
  
  public function SetCredentials($server, $username, $password, $dbname){
     $this->server = $server;
     $this->username = $username;
     $this->password = $password;
     $this->dbname = $dbname;
  }
  
  
  public function Connect(){
    // Create connection
    $this->conn = new mysqli($this->server, $this->username, $this->password, $this->dbname);
    
    // Check connection
    if ($this->conn->connect_error) {
      die("MYSQL DB Connection failed: " . $this->conn->connect_error);
    }

    return true;
  }
    
    
  public function Disconnect(){
    $this->conn->close(); // Close connection
  }


  public function GetKnownClasses(){
    $this->Connect();
    $sql = "SELECT * FROM `Classifications`";
    $result = $this->conn->query($sql);

    if ($result->num_rows > 0) {
      $classifications = array();
      // Obtain the Classifications
      while($row = $result->fetch_assoc()) {
        $classifications[$row['Classification']] = $row['Weight'];
      }
      $this->classifications = $classifications;
    }
    else {
      die('ERROR! No Known Classifications in Database.' . PHP_EOL);
    }
    $this->Disconnect();
  }

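  public function KnownWord(&$word){
    $this->Connect();
    // Escape the word so characters like quotes cannot break the query
    $safe_word = $this->conn->real_escape_string($word);
    $sql = "SELECT * FROM `Words` WHERE `Word`='$safe_word' LIMIT 1;";
    $result = $this->conn->query($sql);
    $known = ($result && $result->num_rows > 0);
    $this->Disconnect(); // Connect() opens a new connection on every call

    return $known;
  }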
  
  public function ScoreWord(&$word, &$count){
	
	if(count($this->classifications) == 0){
	    $this->GetKnownClasses();
	    $classifications = array();
	    foreach($this->classifications as $class=>$value){
			$classifications["$class-Sender"] =	$value;
		}
		foreach($this->classifications as $class=>$value){
			$classifications["$class-Recipient"] =	$value;
		}
		$this->classifications = $classifications;
	}
	

	if($this->KnownWord($word)){
        $this->Connect();
		$sql = "SELECT * FROM `Words` WHERE `Word` LIKE '$word'";
		$result = $this->conn->query($sql);

	  if ($result->num_rows > 0) {
		$word_data = $result->fetch_assoc();
		foreach($word_data as $key=>$value){
			 if($key == 'ID'){
				 unset($word_data["$key"]);
			 }
			 elseif($key == 'Word'){
				 unset($word_data["$key"]);
			 }
			 else{
				 $word_data[$key] *= ($count * $this->classifications[$key]);
			 }
	    }
		return $word_data;
	  }
    }else{
	    // unknown word... add it or ignore it
	}
  }
  

  public function AddOrUpdateWord(&$word, &$count, &$EmailClassifications){

    if(count($this->classifications) < 1){
      $this->GetKnownClasses();
    }
        
    $sql = "";
  
    if($this->KnownWord($word) == false){
      // Add Word
      // Build Insert SQL
      $sql .= "INSERT INTO `Words` ";
      $sql .= "(`ID`, `Word`, `" . implode('`, `', array_keys($EmailClassifications)) . '`) ';
      $sql .= "VALUES (NULL, '$word', '" . implode("', '", array_values($EmailClassifications)) . "')";
    }else{
      // Update Word
      // Build Update SQL
      $sql .= "UPDATE `Words` SET ";  
      $EmailClassifications = array_diff($EmailClassifications, array('0')); // remove any classes set to 0
      $classes = array_keys($EmailClassifications);
      for($i = 0; $i < count($classes); $i++){
        $sql .= "`{$classes[$i]}` = `{$classes[$i]}` + $count";
        
        if( $i < count($classes) - 1){
           $sql .= ', ';
        }
      }
      $sql .= " WHERE `Word`='$word'";      
    }

      // DO QUERY
      $this->Connect();
      $result = $this->conn->query($sql);
      $this->Disconnect();    

    if ($result > 0){
      echo substr($sql, 0, 7) . " $word" . PHP_EOL;
    }else{
      die("FAIL");
    }
  }
}

 

We will cover the database and building the actual bot in an upcoming post.

I’d like to thank all of my supporters whose generous contributions make these posts possible!

I hope you enjoyed reading this post and consider supporting me on Patreon.

 

 

Much Love,

~Joy

FIZZ

Recently one of my Patreon Supporters sent me a pic of a can of FIZZ with the comment…

“Best Seltzer ever!”

And they would know too because they are something of a seltzer aficionado.

Best  Seltzer ever!

Anyway, the thing is they caught me during one of those mythical “down time” moments when you don’t have anything particularly pressing to do, so I had a little fun with the image and I thought I would share it with you. Before I do, I would like to stress that Kroger & FIZZ & Co are not sponsors of this content… but they could be and so can you! If you like this kind of content consider supporting me over on Patreon for as little as $1 a month. Your Patronage will help me to keep creating free of charge content like this and more.

 Now in Nuclear Neon flavor

 

After a long day of building the O’Neill Cylinder in orbit, relax with a can of FIZZ… It’s truly out of this world!

 

Basically… I couldn’t resist doing one with a “Retro” vibe.

 

I hope you enjoyed this post and consider supporting me on Patreon.

 

 

Much Love,

~Joy

A Bag of Words

Let’s say that you are an enterprising data scientist looking to build a bot to distinguish what “class” something belongs to. Like identifying which emails are spam and which emails are not for example…

A simple way you can accomplish this type of goal is by using a “Bag of Words Model” in conjunction with “classifications” (Supervised Learning) to teach the bot to recognize the patterns associated with our classifications.

However rather than build a simple “spam detector” I’m going to outline how to build an “Email Sender and Recipient Relationship Classifier” because well… let’s face it you don’t read this blog for simple now do you? Or rather, you seem to like when I simplify hard things but not to the point of too simple. That… and all the free working code! 😛

More directly, I’m going to walk you through my process of designing and building this bot, as usual from scratch (because that’s how we learn), to answer the question: Was this email a conversation between Colleagues? Family? Friends? Etc… What kinds of relationships do the senders and recipients of an email have with one another?

Further, it is possible to extend the capability of this type of bot or even branch out into other problem domains like developing a bot to read legal documents, review aggregate medical data or even view images and then extract all kinds of hidden facts, patterns and correlations.


Additionally, it should be noted that this is not a “bleeding edge” or even necessarily a “state of the art” technology like those fancy LSTM Networks. It is, however, a tried and tested method that is still at the heart of many spam detection systems as well as the basis of many modern machine learning classifier processes, and it can operate alone or in conjunction with other techniques to build more robust systems.

So let’s get started!

Step 1. Collect your “Data Corpus”

You will need a large set of emails to even make attempting this worth your efforts.

Between 10K – 100K emails would be an alright jumping off point (just a guess) and might be perfect if you have few classifications, though more may be required if you have a lot of classifications and/or the emails you use to train are particularly short.

From where you say? Well… there is The Enron Email Dataset. You could also just use your own emails and any donated by friends, colleagues and family.

Smaller less all encompassing systems (like a prototype) could probably get away with a much smaller Data Corpus, perhaps even a few hundred examples or less if it’s just a proof of concept.

 

Step 2. Create a List of Classifications

Classifications are basically labels that help us to group things that are alike.

You can think of a classification like a metaphorical box, red things go in the red box, blue things in the blue box, cats… go in the cat box! 😛

Grouping information with classifications is useful because it gives us the ability to teach the bot what distinguishes one group or thing from another by showing it examples of each so that it can learn the differences.

In this case we’re building an Email Sender and Recipient Relationship Classifier, so we might use a list of relationships as our classifications like this:

Examples: (Colleague, Employee, Manager, Employer, Spouse, Husband, Wife, Parent, Father, Mother, Child, Son, Daughter, Sibling, Brother, Sister, Grandparent, Grandfather, Grandmother, Grandson, Granddaughter, Uncle, Aunt, Cousin, Nephew, Father-in-law, Mother-in-law, Brother-in-law, Sister-in-law, …)

There may be other classifications such as Daughter-in-law, Friend, Landlord, Plumber, Lawyer, Government Entity, Second Cousin Thrice Removed, etc.. that you might want in your list of classifications so ultimately the starter list I provide above should be altered to meet your needs.

It should also be noted that the more classifications that you need the bot to identify the more data it will need to look at to learn to properly classify new data.

 

Step 3. Hand classify ALL emails in your Data Corpus

Sadly there is no way around this step (using this methodology) and it’s the worst part because it involves manually reading every email and hand documenting the appropriate tags in a file or database.

You can choose to represent this information in many ways (such as “key value pairs” and it may be more useful in some circumstances to do so) however for the sake of simplicity I will represent the values here as rows in a table where each column represents one of the classifications or “keys”.

Further, there may be cases where it is desirable or simply more accurate to assign more than one classification to a given set of data. In the case of the Relationship Classifier, an example could be a child away at college emailing both of their parents. You would want the bot to understand and generate a classification that reflects both parents as recipients, not one or the other.

You also need to consider future growth and your methodology needs to accommodate the ability to add new classifications as the need may arise.

Ideally adding new classes should be as simple as adding additional representative columns (or keys) for the new classifications and retraining the bot on the emails while only updating the new classification columns (or keys).

Now let’s look at a few examples.

Example Email:

Subject line: Congratulations on Your Promotion

Dear Bob,

I heard about your promotion to Vice President of ACME Widgets through LinkedIn. Congratulations, you deserve it!

Best wishes,

John Doe

Example Classification:

Sender (Colleague)
Recipient (Colleague)

Sender (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Recipient (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

 

 

Multiple Sender & Recipient Example Classification:

Subject line: You won’t believe this!

It’s un-be-li-e-va-b-l-e

During the post-game celebration Mr. Coach got a whole water cooler dumped on his head!

Everyone laughed as Mr. Coach chased the team off the field.

Love,
Bobby

Example Classification:

Sender (Child, Son)
Recipient (Parent, Father, Mother)

Sender (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Recipient (0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Multiple Sender & Recipient Example Classification:

Subject line: Thanks Mom & Dad!

You are the best parents ever!

Love,
Jane Doe

Example Classification:

Sender (Child, Daughter)
Recipient (Parent, Father, Mother)

Sender (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Recipient (0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

 

This information needs to be logged for each email in your Data Corpus and it should be stored separately from the text of the email.
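
As a rough sketch of how those rows of 1s and 0s could be generated instead of typed by hand, the snippet below builds them from a list of label names. The LabelsToRow() helper and the $record layout are assumptions made for illustration; they are not the format used by the bot’s own classification files.

```php
<?php
// The starter list from Step 2, in the same order as the columns
$classifications = array('Colleague', 'Employee', 'Manager', 'Employer',
  'Spouse', 'Husband', 'Wife', 'Parent', 'Father', 'Mother', 'Child',
  'Son', 'Daughter', 'Sibling', 'Brother', 'Sister', 'Grandparent',
  'Grandfather', 'Grandmother', 'Grandson', 'Granddaughter', 'Uncle',
  'Aunt', 'Cousin', 'Nephew', 'Father-in-law', 'Mother-in-law',
  'Brother-in-law', 'Sister-in-law');

// Hypothetical helper: turn a list of labels into a row of 1s and 0s
function LabelsToRow(array $labels, array $classifications){
  $row = array();
  foreach($classifications as $class){
    $row[$class] = in_array($class, $labels, true) ? 1 : 0;
  }
  return $row;
}

// The "Thanks Mom & Dad!" example from above
$record = array(
  'Sender'    => LabelsToRow(array('Child', 'Daughter'), $classifications),
  'Recipient' => LabelsToRow(array('Parent', 'Father', 'Mother'), $classifications)
);

echo 'Sender (' . implode(', ', $record['Sender']) . ')' . PHP_EOL;
echo 'Recipient (' . implode(', ', $record['Recipient']) . ')' . PHP_EOL;
```

Keeping the class names as array keys means a new classification only has to be added to the list once, and every row picks up the extra column automatically.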

It’s also highly notable that the *-in-law classifications line up sort of “tangentially” with their direct counterparts (i.e. Brother and Brother-in-law, Sister and Sister-in-law), as do any other overlapping classifications, and it can be difficult for a person, let alone a bot, to tell these apart.

There are three ways to handle this:

Option 1. The simplest and most straight forward is to eliminate overlapping classifications.

Option 2. Leave the classifications and ignore it. You can always remove them later.

Option 3. Subclass by setting the main classification to a value of 1 and the sub-classification to a value less than 1 but greater than 0 (a float), for example 0.5. Should there be multiple sub-classifications you could, where logical, rank the sub-classes by probability, giving the most probable sub-class the highest float below 1 in the set and the least probable sub-class the lowest float above 0.

This will teach the bot that these classes are closely associated and when the bot tries to classify new emails it will automatically weight/group the sub-classes of a classification near the main classification.

It is hypothetically possible that a subclass could be shared between several classes and as such the subclass could still be selected as the most likely classification when it is most likely. This is however more complicated.

Thankfully though this IS NOT a “black box” bot and can be hand modified or updated and edited after training so there is no need to implement Option 3 when first starting out.
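
If you do want to experiment with Option 3 later, here is one possible way to generate the values. The SubclassValues() helper is hypothetical, and spacing the sub-class floats evenly between 0 and 1 is just an assumption; any ranking that keeps the values below 1 and above 0 would fit the description above.

```php
<?php
// Hypothetical helper: the main class gets 1 and each sub-class gets an
// evenly spaced float between 0 and 1, highest for the most likely.
function SubclassValues($main_class, array $ranked_subclasses){
  $values = array($main_class => 1);
  $ranked_subclasses = array_values($ranked_subclasses); // force 0..n-1 keys
  $n = count($ranked_subclasses);
  foreach($ranked_subclasses as $i => $subclass){
    // e.g. 1 sub-class gives 1/2, 2 sub-classes give 2/3 then 1/3
    $values[$subclass] = ($n - $i) / ($n + 1);
  }
  return $values;
}

$one = SubclassValues('Brother', array('Brother-in-law'));
$two = SubclassValues('Sibling', array('Brother', 'Brother-in-law'));
print_r($one);
print_r($two);
```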

 

Step 4. Split the Data into Two Separate Groups

The First group is the “Training Data”.

The Second group is the “Test Data”.

When training the bot, you will show the “Training Data” to teach the bot how to classify.

When testing the bot, you will use the “Test Data”.

The reason why you hand classify ALL emails and then separate them like this is so that when you test the bot you know whether it got the answer correct or not. You simply compare the answers the bot gives for the emails it never trained on with the classifications that were manually assigned by a human.

Which brings us to how to divide the emails among the training and testing groups.

A good rule of thumb is that the larger your “Data Corpus” set is the larger your “Test Data” set can be.

So if the ratio is: TrainingDataRatio:TestDataRatio then:

If you have a tiny Data Corpus of 100 emails and use a ratio of 95%:5% your Training Data is 95 emails and your Test Data is 5 emails.

If you have a Data Corpus of 10K emails and use a ratio of 90%:10% your Training Data is 9000 emails and your Test Data is 1000 emails.

If you have a large Data Corpus of 100K emails you might use a 70%:30% ratio and your Training Data is 70000 emails and your Test Data is 30000 emails.

Most of your data should be used for training and once you know how many emails you have to work with you can decide on an appropriate ratio.

Once you have that ratio you can determine your split using this formula:

Convert TrainingDataRatio:TestDataRatio to decimal values, e.g. 88%:12% becomes 0.88:0.12

Then compute:

(total_number_of_emails x TrainingDataRatio) = number_of_training_emails

(total_number_of_emails x TestDataRatio) = number_of_test_emails

Here’s some PHP code that uses this formula that you can use to determine how to split your data:

<?php

// This function is not a complete ToFloat
// in that it expects you to provide a number
// and not an array or a string of chars necessarily.
// A whole number percentage such as 88 becomes 0.88
// and a float such as 0.88 is returned unchanged
function ToFloat($value){
	
	if(is_numeric($value)){
		if(is_int($value)){
			return $value / 100; // 88 => 0.88, 5 => 0.05
		}
		return (float) $value; // should be a ratio already
	}else{
		die('ToFloat($value) only accepts NUMBERS.');
	}	
}

// Change these to fit your needs
$total_number_of_emails = 10278;
$training_data_percentage = 88; // set between 50% and 97%

// Compute Values - Don't change these
$training_data_ratio = ToFloat($training_data_percentage);
$test_data_ratio = (1.0 - $training_data_ratio);
$number_of_training_emails = $total_number_of_emails * $training_data_ratio;
$number_of_test_emails = $total_number_of_emails * $test_data_ratio;
$number_of_training_emails_round = round($number_of_training_emails, 0, PHP_ROUND_HALF_UP);
$number_of_test_emails_round = round($number_of_test_emails, 0, PHP_ROUND_HALF_DOWN);

// Build Report
$report = "You chose to have $training_data_percentage% of your Data Corpus used as Training Data." . PHP_EOL . PHP_EOL;
$report .= "You have $total_number_of_emails emails so using a ratio split of $training_data_ratio : $test_data_ratio" . PHP_EOL;
$report .= "You should split your emails like this:" . PHP_EOL . PHP_EOL;
$report .= "Training Emails: $number_of_training_emails_round" . PHP_EOL;
$report .= "Test Emails: $number_of_test_emails_round" . PHP_EOL . PHP_EOL;
$report .= 'Formula' . PHP_EOL;
$report .= "($total_number_of_emails x $training_data_ratio) = RoundUp($number_of_training_emails) = $number_of_training_emails_round" . PHP_EOL;
$report .= "($total_number_of_emails x $test_data_ratio) = RoundDown($number_of_test_emails) = $number_of_test_emails_round" . PHP_EOL;

// Report
echo $report . PHP_EOL;

Which will produce this output or something similar:

You chose to have 88% of your Data Corpus used as Training Data.

You have 10278 emails so using a ratio split of 0.88 : 0.12
You should split your emails like this:

Training Emails: 9045
Test Emails: 1233

Formula
(10278 x 0.88) = RoundUp(9044.64) = 9045
(10278 x 0.12) = RoundDown(1233.36) = 1233

Multiplying by a ratio can produce a floating-point number of emails (as I demonstrate), hence the need to round the values. Because the test data ratio is determined by subtracting the training data ratio from 1, the two rounded values should always add up to the total number of emails. If there is a remainder, the “extra” fraction of the split email is essentially mathematically “recombined” and added to whichever group holds the larger share of it (the training data in the example above).
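
Once you know the numbers, the actual split can be sketched like this. SplitCorpus() and the email_N.txt file names are stand-ins invented for the example, and shuffling before slicing is my assumption about how you would want to avoid both groups being biased by file order.

```php
<?php
// Hypothetical helper: shuffle a list of email file names and split it
// into training and test groups using a ratio such as 0.88
function SplitCorpus(array $files, $training_data_ratio, $seed = 42){
  mt_srand($seed);  // fixed seed so the split can be repeated (PHP 7.1+)
  shuffle($files);  // randomize so neither group is biased by file order

  $number_of_training_emails = (int) round(count($files) * $training_data_ratio);

  return array(
    'TrainingData' => array_slice($files, 0, $number_of_training_emails),
    'TestData'     => array_slice($files, $number_of_training_emails)
  );
}

// Stand-in file names (email_1.txt ... email_100.txt)
$files = array();
for($i = 1; $i <= 100; $i++){
  $files[] = "email_$i.txt";
}

$split = SplitCorpus($files, 0.95);
echo count($split['TrainingData']) . ' training, ';
echo count($split['TestData']) . ' test' . PHP_EOL;
```

Because the test group is simply “everything left over”, the two groups always add up to the full corpus and no email can land in both.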

 

Step 5. Train Bot to Classify Emails

Here is roughly the Pseudo Code for the “Train” process:

I encourage you to try to use this pseudo code to create your own implementation though I do plan to release my solution in an upcoming post.

<?php
// THIS IS PSEUDO CODE //////////////////////


// for all the emails in the TrainingData
foreach(TrainingData as Email){

    // open or load email subject + body + sender + recipient

    // Example Email would roughly be like this
    // Email[text] = [subject] + [body]
    // Email[sender] = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    // Email[recipient] = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

    // get all the words in the email
    words = split(Email[text])

    // for all the words in this email subject + body
    foreach(words as word){

        // lookup word in dictionary file json,csv,txt,xml other or database

        // if the word doesn't exist already, add it
        if(!knownword(word)){
            add(word)
        }else{
            word[sender] = lookup(word[sender])
            word[recipient] = lookup(word[recipient])
        }

        // set or increment sender value
        word[sender] += Email[sender]

        // set or increment recipient value
        word[recipient] += Email[recipient]

        // Save new count
        save(word)
    }

}
?>

 

What happens during training is that each email is “read” by the bot. It extracts each word and checks if it already knows the word; if not, it adds it to its ‘dictionary’ or ‘lexicon’. Then each of the email’s classifications is set (in the case that the word was just added) or incremented (in the case of a known word).

If the word is new the value will be set to 1 for all the classes associated with the email and 0 for the others. Whereas with known words, the associated email class values are incremented by 1 each time the word is encountered.
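
Here is a minimal in-memory sketch of that update rule. TrainOnEmail(), the $lexicon array layout and the very rough tokenizer are all assumptions made to keep the example short; a real bot would persist the counts to a file or database, and only 3 classes are used so the rows stay readable.

```php
<?php
// Hypothetical in-memory trainer: $lexicon maps word => sender and
// recipient count rows. A real bot would persist this to a file or DB.
function TrainOnEmail(array &$lexicon, $text, array $sender, array $recipient){
  // Very rough tokenizer (an assumption): lowercase, split on non-letters
  $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

  foreach($words as $word){
    if(!isset($lexicon[$word])){
      // New word: start every class at 0
      $lexicon[$word] = array(
        'sender'    => array_fill(0, count($sender), 0),
        'recipient' => array_fill(0, count($recipient), 0)
      );
    }
    // Add the email's class row to the word's running counts
    foreach($sender as $i => $value){
      $lexicon[$word]['sender'][$i] += $value;
    }
    foreach($recipient as $i => $value){
      $lexicon[$word]['recipient'][$i] += $value;
    }
  }
}

// Tiny example with 3 classes: (Colleague, Parent, Child)
$lexicon = array();
TrainOnEmail($lexicon, 'Thanks Mom & Dad! Love you Mom', array(0, 0, 1), array(0, 1, 0));
TrainOnEmail($lexicon, 'Thanks for the report', array(1, 0, 0), array(1, 0, 0));

print_r($lexicon['mom']);
print_r($lexicon['thanks']);
```

After these two emails ‘mom’ has only ever been seen with a Child sender, while ‘thanks’ has picked up a count from both emails, so it favors no single class.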

After training the classifier on all the Training Data, certain words should favor some classes over others.

For example, having the word ‘LinkedIn’ in a subject or email body might give the email a higher probability of being classified as Colleague, Employee, Manager or Employer and a lower probability of the other classifications. The other classifications will not necessarily have a 0% chance either, because they may also use the word. The key with this bot is that some words will moderately or highly “favor”, “associate” with or “prefer” some classifications over others.

Further, in the database or other storage file the word ‘LinkedIn‘ might have scores like this:

word (sender), (recipient)

LinkedIn
sender (2270, 3514, 894, 8712, 65, 22, 0, 7, 3, 8, 4, 45, 3, 1, 2, 12, 5, 3, 7, 15, 1, 6, 0, 2, 1, 0, 0, 0, 0)
recipient(3511, 2560, 9808, 4215, 227, 46, 10, 76, 13, 8, 4, 45, 3, 1, 2, 12, 5, 3, 7, 15, 1, 6, 0, 2, 1, 0, 0, 0, 0)

Which was hypothetically caused by the word LinkedIn getting classified more times as Colleague, Employee, Manager or Employer than the other classes in both the sender and recipient fields.

Here are some simulated charts based on the hypothetical LinkedIn word example above that should illustrate what we want though this is an ‘idealized’ scenario (I made the numbers up to illustrate my point).

[Radar chart: LinkedIn Word Analysis – Recipient]

[Radar chart: LinkedIn Word Analysis – Sender]

 

As you can see on the Sender() and Recipient() Radar Charts, the bot is seemingly reaching out, almost as if it wants to grab the Manager and Employer classes. This strongly indicates that the bot associates this word with those classes and, to a lesser extent, with the Colleague & Employee classes.

The goal of training is to teach the bot to identify the collective word patterns present in an email so that it correctly identifies (classifies) the relationships between the sender and recipients.

Step 6. Testing & Using the Classifier Bot

Basically we do the same thing we did while training (use a Bag of Words model) but without updating/training the bot. Instead we look up the classification values for each word in the email and keep a running sum until we’ve looked at all the words present. Those sums are our prediction.

Example:

Let’s say the email simply consisted of the string “Thanks Mom & Dad\nLove,\nJane Doe”

The words the bot would see are ‘Thanks’, ‘Mom’, ‘Dad’, ‘Love’, ‘Jane’, ‘Doe’.

So, the bot would simply add up the values for all the word columns (classifications) and then sort them from most likely to least likely.

Just for a moment ignore all the words but two Mom & Dad so that we can illustrate what we want to happen.

These words could look something like this :

Mom
sender(1156, 562, 2754, 96, 653, 886, 1052, 1941, 1791, 364, 7276, 6988, 8626, 2628, 674, 1013, 65, 2324, 1862, 1322, 2923, 1011, 1262, 1902, 2518, 1734, 2574, 562, 575)
recipient(1119, 2711, 1286, 870, 662, 2844, 2078, 8289, 8410, 8534, 2310, 883, 1567, 2514, 350, 1459, 2268, 978, 918, 157, 1061, 1096, 1716, 588, 1459, 325, 1459, 2004, 2856)

Dad
sender(2262, 238, 1479, 546, 1188, 867, 644, 1870, 2728, 817, 8906, 9988, 8589, 437, 2438, 1759, 1879, 2079, 2580, 2664, 1345, 253, 1200, 1187, 273, 2346, 474, 2986, 514)
recipient(308, 2154, 287, 2617, 2703, 281, 720, 7033, 8718, 8159, 1730, 1596, 2748, 121, 1552, 1938, 1342, 1241, 955, 520, 742, 1026, 1676, 353, 2910, 1652, 1468, 2295, 2750)

 

 

Notice how columns 11, 12 and 13 (the Child, Son and Daughter classifications) in the Sender() groups correlate highly with Mom & Dad (larger numbers). This might be because a sender who says the word Mom or Dad is usually speaking to their parents, but not always, and the words reflect this because none of the classifications has a 0.

Further, two colleagues may discuss one of their parents and say something like “My Dad loves to BBQ”. Therefore a single word cannot definitively classify a document, however an entire email of sufficient length should have enough words of high correlation that an attempt at classification could be made.

This is done by adding the classification values for all words present.

Example:

sender = Thanks[sender]+Mom[sender]+Dad[sender]+Love[sender]+Jane[sender]+Doe[sender]
recipient = Thanks[recipient]+Mom[recipient]+Dad[recipient]+Love[recipient]+Jane[recipient]+Doe[recipient]

If the same word appears multiple times in the email the classifier should add it to the sums as many times as it appears.
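
The summing step can be sketched in a few lines. SumWordRows() and the tiny made-up lexicon below are assumptions for illustration (3 classes instead of 29), and skipping unknown words is one possible choice rather than the only one.

```php
<?php
// Hypothetical helper: add up the class rows of every word in an email.
// $lexicon maps word => row of counts, one value per classification.
function SumWordRows(array $lexicon, array $words, $number_of_classes){
  $totals = array_fill(0, $number_of_classes, 0);
  foreach($words as $word){
    if(!isset($lexicon[$word])){
      continue; // unknown words add nothing to the score
    }
    foreach($lexicon[$word] as $i => $value){
      $totals[$i] += $value;
    }
  }
  return $totals;
}

// Made-up sender rows for 3 classes: (Colleague, Parent, Child)
$sender_lexicon = array(
  'thanks' => array(40, 35, 50),
  'mom'    => array(5, 10, 90),
  'dad'    => array(8, 12, 95)
);

// 'mom' appears twice so it is added twice; 'zebra' is unknown
$words = array('thanks', 'mom', 'dad', 'mom', 'zebra');
$sender = SumWordRows($sender_lexicon, $words, 3);
echo implode(', ', $sender) . PHP_EOL;
```

The same function would be called a second time with the recipient rows to get the recipient prediction.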

After adding all the columns up for all the words for both the sender group and the recipient group you will then have your predicted classification:

sender = (4031, 32198, 17505, 7206, 33504, 16725, 12538, 28448, 19706, 8371, 99753, 87594, 102595, 16916, 22418, 10090, 14046, 3599, 7075, 16216, 3718, 8627, 32545, 26574, 13259, 19471, 32123, 20820, 31418)
recipient = (24496, 12777, 25341, 19170, 14671, 8599, 10491, 94750, 81289, 81218, 2905, 14275, 32484, 4478, 25067, 5370, 23929, 10872, 34001, 9487, 11855, 24731, 17955, 10658, 3265, 12308, 20352, 15770, 5048)

 

If we then sort the classifications by highest to lowest we get the most likely to the least likely:

 

Sender:              Recipient:
Daughter  102595     Parent  94750
Child      99753     Father  81289
Son        87594     Mother  81218

 

The Sender is most likely the Daughter, Child or Son, and the Recipient is most likely the Parent, Father or Mother, in that order.

You can convert the score to a percentage by summing the group and then dividing each score by the group sum.

Example:

Sender Sum = 749089

Sender:
Daughter       102595 / 749089 = 0.136959693707957
Child          99753 / 749089 = 0.133165752000096
Son            87594 / 749089 = 0.116934035875577
Spouse         33504 / 749089 = 0.044726327579233
Aunt           32545 / 749089 = 0.04344610586993
Employee       32198 / 749089 = 0.04298287653403
Mother-in-law  32123 / 749089 = 0.042882754919642
Sister-in-law  31418 / 749089 = 0.041941611744399
Parent         28448 / 749089 = 0.03797679581465
Cousin         26574 / 749089 = 0.035475090409818
Brother        22418 / 749089 = 0.029927018017886
Brother-in-law 20820 / 749089 = 0.027793760154
Father         19706 / 749089 = 0.02630662044163
Father-in-law  19471 / 749089 = 0.025992906049882
Manager        17505 / 749089 = 0.023368384798068
Sibling        16916 / 749089 = 0.022582096386411
Husband        16725 / 749089 = 0.022327120008437
Grandson       16216 / 749089 = 0.02164762798546
Grandparent    14046 / 749089 = 0.018750775942512
Nephew         13259 / 749089 = 0.017700166468871
Wife           12538 / 749089 = 0.016737664015891
Sister         10090 / 749089 = 0.01346969452228
Uncle          8627 / 749089 = 0.011516655564292
Mother         8371 / 749089 = 0.011174907120516
Employer       7206 / 749089 = 0.009619684710362
Grandmother    7075 / 749089 = 0.009444805623898
Colleague      4031 / 749089 = 0.005381203034619
Granddaughter  3718 / 749089 = 0.004963362163908
Grandfather    3599 / 749089 = 0.004804502535747

 

Simplifying turns the values into an easy to read, understand and explain percentage:

Sender:

Daughter       13.70%
Child          13.32%
Son            11.69%
Spouse          4.47%
Aunt            4.34%
Employee        4.30%
Mother-in-law   4.29%
Sister-in-law   4.19%
Parent          3.80%
Cousin          3.55%
Brother         2.99%
Brother-in-law  2.78%
Father          2.63%
Father-in-law   2.60%
Manager         2.34%
Sibling         2.26%
Husband         2.23%
Grandson        2.16%
Grandparent     1.88%
Nephew          1.77%
Wife            1.67%
Sister          1.35%
Uncle           1.15%
Mother          1.12%
Employer        0.96%
Grandmother     0.94%
Colleague       0.54%
Granddaughter   0.50%
Grandfather     0.48%

And the same can be done with the Recipient values as well.
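
The sort and percentage steps above can be combined into one small function. ScoresToPercentages() is a hypothetical helper, and the scores are made up (3 classes) so the arithmetic is easy to follow.

```php
<?php
// Hypothetical helper: convert class scores into percentages of the
// group sum, sorted from most likely to least likely.
function ScoresToPercentages(array $scores){
  $sum = array_sum($scores);
  if($sum == 0){
    return array(); // nothing to rank if no word scored
  }
  $percentages = array();
  foreach($scores as $class => $score){
    $percentages[$class] = round(($score / $sum) * 100, 2);
  }
  arsort($percentages); // highest first, keys (class names) preserved
  return $percentages;
}

// Made-up scores for 3 classes (group sum is 450)
$sender = array('Colleague' => 58, 'Parent' => 67, 'Child' => 325);

foreach(ScoresToPercentages($sender) as $class => $percent){
  echo str_pad($class, 15) . number_format($percent, 2) . '%' . PHP_EOL;
}
```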

At this point you compare all the predictions with the actual scores and work to minimize the difference between the predicted classifications and the actual classifications on the Test Data.
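
One simple way to put a number on that comparison is sketched below. LabelHitRate() is a hypothetical measure I am using for illustration (what fraction of the hand assigned labels show up among the bot’s top ranked classes); it is only one of many ways to compare predictions with the actual classifications.

```php
<?php
// Hypothetical check: what fraction of the hand assigned labels appear
// among the bot's top ranked classes for one email?
function LabelHitRate(array $scores, array $actual_labels){
  if(count($actual_labels) == 0){
    return 0.0;
  }
  arsort($scores); // highest score first
  // Look at as many top classes as there are hand assigned labels
  $top = array_slice(array_keys($scores), 0, count($actual_labels));
  $hits = count(array_intersect($actual_labels, $top));
  return $hits / count($actual_labels);
}

// Made-up scores; the human labeled the sender as Child and Daughter
$scores = array('Colleague' => 58, 'Child' => 325, 'Son' => 310, 'Daughter' => 290);
$rate = LabelHitRate($scores, array('Child', 'Daughter'));
echo "Hit rate: $rate" . PHP_EOL;
```

Here the bot ranked Child and Son highest, so only one of the two hand assigned labels was found. Averaging this rate over all the Test Data gives a single number to track as you adjust the bot.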

You will never get 100% predictive accuracy (especially if you have a long list of classifications) but you are really only interested in the trend with this sort of bot. If you get a large spike or peak for a classification or group of classifications then your bot is working.

I do plan to release the code for this project in an upcoming post, however I encourage you to try building this system for yourself and of course please like, comment and share!

If you’re bummed that you didn’t get the code for this project today I’ll tell you what… I’m Feeling Generous. 😉

I hope you enjoyed this post and consider supporting me on Patreon.

 

 

Much Love,

~Joy

 
