We’re in the “home stretch” and quickly approaching our goal of having a working Email Relationship Classifier Bot prototype.

Today we will build the training portion of the bot. This system implements “Supervised Learning”, so you will need to have “hand classified” your “Data Corpus” as outlined in my posts A Bag of Words and Classifying Emails. If you’ve read the other posts in this series, you are ready to proceed.

 

Train.php

What you are going to love about this code is its simplicity!

It’s intentionally short and “high level”, which was achieved by using our Class files (DatabaseManager.Class.php, FileManager.Class.php, Tokenizer.Class.php), which we covered in my post Class Files, to create Objects that “act” upon the data encapsulated inside them. This means we can ask our Objects to do complex work in just a few lines of code.

Code

<?php

// This function will load the human scored JSON class files
function LoadClassFile($file_name){
  // Read the whole file in one call; file_get_contents() returns false on failure
  $file_data = file_get_contents($file_name);
  if($file_data === false){
    die(PHP_EOL . 'ERROR! Could not read ' . $file_name);
  }
  return $file_data;
}


// Include Classes
function ClassAutoloader($class) {
    include 'Classes/' . $class . '.Class.php';
}
spl_autoload_register('ClassAutoloader');


// Instantiate Objects
$myTokenizer = new Tokenizer();
$myEmailFileManager = new FileManager();
$myJSONFileManager = new FileManager();
$myDatabaseManager = new DatabaseManager();


// No Configuration needed for the Tokenizer Object

// Configure FileManager Objects
$myEmailFileManager->Scan('DataCorpus/TrainingData');
$myJSONFileManager->Scan('DataCorpus/TrainingDataClassifications');
$number_of_training_files = $myEmailFileManager->NumberOfFiles();
$number_of_JSON_files = $myJSONFileManager->NumberOfFiles();

// Configure DatabaseManager Object
$myDatabaseManager->SetCredentials(
  'localhost',                   // server
  'root',                        // username
  'password',                    // password
  'EmailRelationshipClassifier'  // database name
);


// Make sure the files are there and the number of training files is
// the same as the number of JSON Class files.
if(($number_of_training_files != $number_of_JSON_files) 
   || ($number_of_training_files == 0 || $number_of_JSON_files == 0) ){
  die(PHP_EOL . 'ERROR! the number of training files and classification files are not the same or are zero! Run CreateClassificationFiles.php first.');
}
else{
  // Loop Through Files
  for($current_file = 0; $current_file < $number_of_training_files; $current_file++){
    $myTokenizer->TokenizeFile($myEmailFileManager->NextFile());
    $EmailClassifications = json_decode(LoadClassFile($myJSONFileManager->NextFile()), true);
    // Loop Through Tokens
    foreach($myTokenizer->tokens as $word=>$count){
      $myDatabaseManager->AddOrUpdateWord($word, $count, $EmailClassifications);
    }
  }
}

echo PHP_EOL . 'Training complete! You can now run Test.php' . PHP_EOL;

 

Save Train.php in the root project folder:


[EmailRelationshipClassifier]
│
├── CreateClassificationFiles.php
├── DatasetSplitAdviser.php
├── database.sql
├── Train.php 
│
├── [Classes]
│   │
│   ├── DatabaseManager.Class.php
│   ├── FileManager.Class.php
│   └── Tokenizer.Class.php
│
└── [DataCorpus]
    │
    ├── [TestData]
    │
    ├── [TestDataClassifications]
    │
    ├── [TestResults]
    │
    ├── [TrainingData]
    │
    └── [TrainingDataClassifications]

 

Of course, the complexity still exists inside the Objects. It’s just advantageous to hide it behind the Object methods here so that we can focus on the task of training rather than the details of moving the data around.

Once all the classes have been included and the objects instantiated and configured, there is a check to confirm that the .txt and JSON files exist and that there is the same number of each.

If none of the fail conditions trigger the die() function, then for each training file (.txt email) the $myEmailFileManager Object supplies the next file in its list and the $myTokenizer Object loads and tokenizes it. Tokenizing builds a “bag of words model” of the email, specifically “unigrams” (single words paired with the number of times each appears).
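To make the idea of a unigram “bag of words” concrete, here is a minimal sketch. It is not the Tokenizer class from this series; the bagOfWords() function and its splitting rule (lowercase, break on anything that is not a letter, digit or apostrophe) are assumptions made purely for illustration.

```php
<?php
// Hypothetical helper, NOT the series' Tokenizer class: builds a unigram
// bag of words (word => count) from a string of email text.
function bagOfWords(string $text): array {
    // Lowercase the text, then split on any run of characters that is not
    // a letter, digit, or apostrophe. PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // array_count_values() tallies how many times each word occurs.
    return array_count_values($words);
}

$tokens = bagOfWords('Hi Joy, thanks for the update. Thanks again!');
print_r($tokens); // "thanks" is counted twice, every other word once
```

The resulting array has the same word => count shape that the foreach loop in Train.php walks over.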

Then the JSON relationship class file is loaded and decoded into an array of “key & value pairs”, where the key is the relationship class name and the value is either zero or one (0/1): one denotes relationship class membership and zero denotes a lack of class membership.
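As a rough illustration of that decoding step, the sketch below runs json_decode() on a small JSON string. The class names (Friend, Coworker, Family) are invented for the example; your own classification files will contain whatever relationship classes you defined.

```php
<?php
// Hypothetical classification file contents; these class names are made up
// for illustration and are not taken from the series.
$json = '{"Friend": 1, "Coworker": 0, "Family": 0}';

// Passing true as the second argument makes json_decode() return an
// associative array (class name => 0/1) instead of an object.
$classifications = json_decode($json, true);

foreach ($classifications as $class => $isMember) {
    echo $class . ': ' . ($isMember === 1 ? 'member' : 'not a member') . PHP_EOL;
}
```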

Then, for each unigram word token, the $myDatabaseManager Object will perform its AddOrUpdateWord() method.

The AddOrUpdateWord() method accepts the unigram word token as the first argument, the number of times it appears in the training file as the second argument, and the relationship class memberships array as the third argument. The word is then either added to the Words table in the database or updated.
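The “add or update” idea can be sketched without a database. The function below is a stand-in I wrote for illustration, not the real DatabaseManager method: it keeps a plain PHP array in place of the Words table and assumes that a word’s count is added to every class the email belongs to. The actual table layout and update rule are described in the database post and may differ.

```php
<?php
// Hypothetical in-memory stand-in for the Words table:
// $words[word][class] => running count. NOT the real DatabaseManager code.
function addOrUpdateWord(array &$words, string $word, int $count, array $classifications): void {
    foreach ($classifications as $class => $isMember) {
        // "Add": create the entry with a zero count the first time we see it.
        if (!isset($words[$word][$class])) {
            $words[$word][$class] = 0;
        }
        // "Update": only classes the email belongs to receive the count
        // (an assumption about the update rule, made for this sketch).
        if ($isMember === 1) {
            $words[$word][$class] += $count;
        }
    }
}

$words = [];
addOrUpdateWord($words, 'thanks', 2, ['Friend' => 1, 'Coworker' => 0]);
addOrUpdateWord($words, 'thanks', 1, ['Friend' => 1, 'Coworker' => 1]);
print_r($words); // thanks: Friend => 3, Coworker => 1
```

The first call adds the word, and the second call updates the counts that already exist, which mirrors the two branches the method name describes.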

You can review the details of the database in my post Email Relationship Classifier Database.

After all the words in all the training emails have been processed, the training is complete and we’re ready to test our bot, which I’ll cover in an upcoming post.

If you enjoyed this post please support me on Patreon for as little as $1 a month, thank you.

 

 

Much Love,

~Joy