Welcome back! I hope you have been enjoying the previous posts in my Email Relationship Classifier series:

A Bag of Words
Relationship Classifier Class Files
Email Relationship Classifier Database

Today we are going to cover the process for hand-classifying emails, as outlined in my post A Bag of Words.


Project Folder Structure

So, to get started, let’s make sure that we set up the folders we will need.

Inside the root project folder [EmailRelationshipClassifier], create a [DataCorpus] folder. Inside that folder, create five sub-folders. We will only work with four of them today: [TestData], [TestDataClassifications], [TrainingData], [TrainingDataClassifications]

The fifth folder in the [DataCorpus] folder needs to be [TestResults] but we won’t need it today.

├── CreateClassificationFiles.php
├── DatasetSplitAdviser.php
├── database.sql
├── [Classes]
│   │
│   ├── DatabaseManager.Class.php
│   ├── FileManager.Class.php
│   └── Tokenizer.Class.php
└── [DataCorpus]
    ├── [TestData]
    ├── [TestDataClassifications]
    ├── [TestResults]
    ├── [TrainingData]
    └── [TrainingDataClassifications]

Once your folders are set up, you can refer to A Bag of Words Steps 3 & 4 … I’ll wait…

Did you give it a quick review? Good!


Split Your Emails

Next, open DatasetSplitAdviser.php and update the number of emails you have, as well as the ratio you want for the TrainingData… you did read A Bag of Words Steps 3 & 4, right?

Run DatasetSplitAdviser.php which will tell you how to split your data.

DatasetSplitAdviser.php Output:

You chose to have 88% of your Data Corpus used as Training Data.

You have 10278 emails so using a ratio split of 0.88 : 0.12
You should split your emails like this:

Training Emails: 9045
Test Emails: 1233

(10278 x 0.88) = RoundUp(9044.64) = 9045
(10278 x 0.12) = RoundDown(1233.36) = 1233
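
The arithmetic in that output can be sketched as a small function. This is not the actual DatasetSplitAdviser.php source: the function name splitCorpus and its parameters are made up for illustration, and rounding the training share to the nearest whole email is an assumption. The test set gets the remainder, so the two counts always add up to the total.

```php
<?php
// Hypothetical helper, not taken from DatasetSplitAdviser.php.
// Returns [training count, test count] for a corpus of $total emails.
function splitCorpus(int $total, float $training_ratio): array {
    // Round the training share to the nearest whole email (assumption).
    $training = (int) round($total * $training_ratio);
    // Give the remainder to the test set so nothing is lost to rounding.
    $test = $total - $training;
    return array($training, $test);
}

list($training, $test) = splitCorpus(10278, 0.88);
echo "Training Emails: $training" . PHP_EOL; // Training Emails: 9045
echo "Test Emails: $test" . PHP_EOL;         // Test Emails: 1233
```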


Now place the correct numbers of emails in the [TrainingData] & [TestData] folders. The emails should be .txt files and should contain nothing but the subject and body of the email.
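
If you would rather not move thousands of files by hand, here is a minimal sketch of one way to do it, assuming you want a random split and that all of your .txt emails start out in a single folder. The function splitEmailFiles and the AllEmails folder name are hypothetical; neither is part of the project files listed above.

```php
<?php
// Hypothetical helper: copies a random $training_count of the .txt files in
// $source_dir into $training_dir and the remaining files into $test_dir.
// Returns [number copied to training, number copied to test].
function splitEmailFiles(string $source_dir, string $training_dir, string $test_dir, int $training_count): array {
    // glob() can return false on error, so fall back to an empty list.
    $files = glob(rtrim($source_dir, '/') . '/*.txt') ?: array();
    shuffle($files); // randomize so neither set is biased by file order
    $copied = array(0, 0);
    foreach ($files as $index => $file) {
        $is_training = $index < $training_count;
        $target_dir = $is_training ? $training_dir : $test_dir;
        copy($file, rtrim($target_dir, '/') . '/' . basename($file));
        $copied[$is_training ? 0 : 1]++;
    }
    return $copied;
}

// Example usage (AllEmails is an assumed folder name):
// splitEmailFiles('AllEmails', 'DataCorpus/TrainingData', 'DataCorpus/TestData', 9045);
```

Copying rather than moving keeps the original folder intact, so you can redo the split with a different ratio later.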

Now you need to run CreateClassificationFiles.php



// We will pass our JSON to this function to save the classifications
// in a human friendly/editable format.
function CreateClassFile($file_name, $output_path, $class_json){
	// Write file contents
	$file_handle = fopen($output_path . basename($file_name, '.txt') . '.json', 'w');
	fwrite($file_handle, $class_json);
	fclose($file_handle);
}

// Include Classes
function ClassAutoloader($class) {
    include 'Classes/' . $class . '.Class.php';
}
spl_autoload_register('ClassAutoloader');

// Instantiate Objects
$myTokenizer = new Tokenizer();
$myFileManager = new FileManager();
$myDatabaseManager = new DatabaseManager();

// Configure Tokenizer Object
// No Tokenizer config needed

// Configure FileManager Object for TrainingData
$number_of_training_files = $myFileManager->NumberOfFiles();

// Configure DatabaseManager Object
$myDatabaseManager->SetCredentials($server = 'localhost',
                                   $username = 'root',
                                   $password = 'password',
                                   $dbname = 'EmailRelationshipClassifier'
                                   );

// This system splits each class into a sender and a recipient variant,
// so below we pull the class list from the database using the
// $myDatabaseManager->GetKnownClasses() method.
// We then create keys in $classifications from the class names,
// appending -Sender and -Recipient respectively.
$myDatabaseManager->GetKnownClasses();
$classifications = array();
foreach($myDatabaseManager->classifications as $class=>$value){
	$classifications["$class-Sender"] = '0';
}
foreach($myDatabaseManager->classifications as $class=>$value){
	$classifications["$class-Recipient"] = '0';
}

// Convert the $classifications array to JSON
$class_json = json_encode($classifications);
$class_json = str_replace('","', "\",\n\"", $class_json); // make easier for humans to read

// Now we generate a JSON class file for each text file in TrainingData
if($number_of_training_files > 0){
	// Loop Through Files
	for($current_file = 0; $current_file < $number_of_training_files; $current_file++){
		CreateClassFile($myFileManager->NextFile(), 'DataCorpus/TrainingDataClassifications/', $class_json);
	}
}

// Reconfigure FileManager Object for TestData
$number_of_test_files = $myFileManager->NumberOfFiles();

// Now we generate a JSON class file for each text file in TestData
if($number_of_test_files > 0){
	// Loop Through Files
	for($current_file = 0; $current_file < $number_of_test_files; $current_file++){
		CreateClassFile($myFileManager->NextFile(), 'DataCorpus/TestDataClassifications/', $class_json);
	}
}

echo PHP_EOL . 'Classification files have been created! You can now run Train.php' . PHP_EOL;


CreateClassificationFiles.php creates one JSON file per .txt email in the [TrainingDataClassifications] & [TestDataClassifications] folders. Each file is named after its email and lets you enter a 1 for every class that the email reflects.
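
To make the file layout concrete, here is a small standalone sketch of how the all-zero template is built and formatted. The class names Friend and Coworker are placeholders I made up for illustration (the real list comes from the database), and buildClassTemplate is a hypothetical wrapper around the same json_encode and str_replace steps used in the script.

```php
<?php
// Hypothetical class names; the real list comes from the database.
$known_classes = array('Friend', 'Coworker');

// Builds the template: one -Sender and one -Recipient key per class, all '0'.
function buildClassTemplate(array $class_names): string {
    $classifications = array();
    foreach ($class_names as $class) {
        $classifications["$class-Sender"] = '0';
    }
    foreach ($class_names as $class) {
        $classifications["$class-Recipient"] = '0';
    }
    $json = json_encode($classifications);
    // Put each key on its own line so the file is easier to edit by hand.
    return str_replace('","', "\",\n\"", $json);
}

echo buildClassTemplate($known_classes) . PHP_EOL;
// {"Friend-Sender":"0",
// "Coworker-Sender":"0",
// "Friend-Recipient":"0",
// "Coworker-Recipient":"0"}
```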

Below is an example of the JSON relationship class file. Note that I updated some classes to 1, which means that the email this file is associated with reflects those classes. You should leave classes that are not present as 0.

Example JSON



You need to classify ALL the emails you have before proceeding to the next steps after this post.
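
Hand-editing thousands of JSON files makes it easy to break one, so a quick sanity check before moving on may save you some pain. The sketch below is not part of the project files. The function findProblemClassFiles is a hypothetical helper that flags emails whose classification file is missing, is no longer valid JSON, or contains a value other than 0 or 1. It cannot tell whether an all-zero file was actually reviewed, since that may be a legitimate classification.

```php
<?php
// Hypothetical helper: returns the names of emails whose classification file
// is missing, is not valid JSON, or contains a value other than 0 or 1.
function findProblemClassFiles(string $email_dir, string $class_dir): array {
    $problems = array();
    $emails = glob(rtrim($email_dir, '/') . '/*.txt') ?: array();
    foreach ($emails as $email) {
        $name = basename($email, '.txt');
        $class_file = rtrim($class_dir, '/') . '/' . $name . '.json';
        if (!is_file($class_file)) {
            $problems[] = $name; // no classification file at all
            continue;
        }
        $classes = json_decode(file_get_contents($class_file), true);
        if (!is_array($classes)) {
            $problems[] = $name; // a hand edit broke the JSON syntax
            continue;
        }
        foreach ($classes as $value) {
            if (!in_array($value, array('0', '1', 0, 1), true)) {
                $problems[] = $name; // unexpected value
                break;
            }
        }
    }
    sort($problems);
    return $problems;
}

// Example usage:
// print_r(findProblemClassFiles('DataCorpus/TrainingData', 'DataCorpus/TrainingDataClassifications'));
```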

I recognize there are better ways to do this (either cleaner, with fewer files & folders, or simpler, like storing the data in a database), but since we’re building a prototype I am focusing on “function over form” to get this project “off the ground” as quickly as possible. I went with arguably the fastest method to implement, which is simple files and folders. I won’t worry about it at this time; however, you can definitely improve upon this “proof of concept” implementation quite easily.

Further, editing raw JSON files by hand (while better than nothing) isn’t my idea of a “good time”, so I would advise you to build a second system that displays the emails and classes together and lets your analysts tag the emails from a web page, as that would simplify everything. I will leave that for you to implement as well, since it is relatively trivial to build and not critical at this juncture, though if you guys really want it or if it bugs me enough I’ll build that system too. 😛

At this point (after all your emails are classified), all that is left to do is to build the bot, then train and test it, which I’ll cover in an upcoming post.

I hope you enjoyed this post and consider supporting me on Patreon.



Much Love,