Let’s say that you are an enterprising data scientist looking to build a bot that can distinguish which “class” something belongs to, like identifying which emails are spam and which are not, for example…

A simple way you can accomplish this type of goal is by using a “Bag of Words Model” in conjunction with “classifications” (Supervised Learning) to teach the bot to recognize the patterns associated with our classifications.

However rather than build a simple “spam detector” I’m going to outline how to build an “Email Sender and Recipient Relationship Classifier” because well… let’s face it you don’t read this blog for simple now do you? Or rather, you seem to like when I simplify hard things but not to the point of too simple. That… and all the free working code! 😛

More directly, I’m going to walk you through my process of designing and building this bot, as usual from scratch (because that’s how we learn), to answer the question: was this email a conversation between Colleagues? Family? Friends? Etc… What kinds of relationships do the senders and recipients of an email have with one another?

Further, it is possible to extend the capability of this type of bot or even branch out into other problem domains, like developing a bot to read legal documents, review aggregate medical data or even view images, and then extract all kinds of hidden facts, patterns and correlations.


Additionally, it should be noted that this is not a “bleeding edge” or even necessarily a “state of the art” technology like those fancy LSTM Networks. It is, however, a tried and tested method that is still at the heart of many spam detection systems, forms the basis of many modern machine learning classifier processes, and can operate alone or in conjunction with other techniques to build more robust systems.

So let’s get started!

Step 1. Collect your “Data Corpus”

You will need a large set of emails to even make attempting this worth your efforts.

Between 10K and 100K emails would be a reasonable jumping-off point (just a guess) and might be perfect if you have few classifications, though more may be required if you have many classifications and/or the emails you use to train are particularly short.

From where, you say? Well… there is The Enron Email Dataset. You could also just use your own emails and any donated by friends, colleagues and family.

Smaller, less all-encompassing systems (like a prototype) could probably get away with a much smaller Data Corpus, perhaps even a few hundred examples or fewer if it’s just a proof of concept.

 

Step 2. Create a List of Classifications

Classifications are basically labels that help us to group things that are alike.

You can think of a classification like a metaphorical box, red things go in the red box, blue things in the blue box, cats… go in the cat box! 😛

Grouping information with classifications is useful because it gives us the ability to teach the bot what distinguishes one group or thing from another by showing it examples of each so that it can learn the differences.

In this case we’re building an Email Sender and Recipient Relationship Classifier, so we might use a list of relationships as our classifications like this:

Examples: (Colleague, Employee, Manager, Employer, Spouse, Husband, Wife, Parent, Father, Mother, Child, Son, Daughter, Sibling, Brother, Sister, Grandparent, Grandfather, Grandmother, Grandson, Granddaughter, Uncle, Aunt, Cousin, Nephew, Father-in-law, Mother-in-law, Brother-in-law, Sister-in-law, …)

There may be other classifications, such as Daughter-in-law, Friend, Landlord, Plumber, Lawyer, Government Entity, Second Cousin Thrice Removed, etc., that you might want in your list of classifications, so ultimately the starter list I provide above should be altered to meet your needs.

It should also be noted that the more classifications you need the bot to identify, the more data it will need to look at to learn to properly classify new data.

 

Step 3. Hand classify ALL emails in your Data Corpus

Sadly there is no way around this step (using this methodology) and it’s the worst part because it involves manually reading every email and hand documenting the appropriate tags in a file or database.

You can choose to represent this information in many ways (such as “key-value pairs”, which may be more useful in some circumstances), however for the sake of simplicity I will represent the values here as rows in a table where each column represents one of the classifications or “keys”.

Further, there may be cases where it is desirable or simply more accurate to assign more than one classification to a given set of data. In the case of the Relationship Classifier, an example could be a child away at college emailing both their parents: you would want the bot to understand and generate a classification reflecting both parents as recipients, not one or the other.

You also need to consider future growth and your methodology needs to accommodate the ability to add new classifications as the need may arise.

Ideally, adding new classes should be as simple as adding additional columns (or keys) for the new classifications and retraining the bot on the emails, updating only the new classification columns (or keys).

Now let’s look at a few examples.

Example Email:

Subject line: Congratulations on Your Promotion

Dear Bob,

I heard about your promotion to Vice President of ACME Widgets through LinkedIn. Congratulations, you deserve it!

Best wishes,

John Doe

Example Classification:

Sender (Colleague)
Recipient (Colleague)

Sender (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Recipient (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
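Those classification rows don’t have to be typed out by hand; a small helper can build them from label names. Here’s a sketch in PHP (the `$CLASSES` array and the `labelsToVector` function are my own placeholder names, not part of any library, and the class list is truncated for brevity):

```php
<?php

// The classification list; the column order must stay fixed, and any
// new classifications should only ever be appended to the end.
$CLASSES = ['Colleague', 'Employee', 'Manager', 'Employer', 'Spouse',
            'Husband', 'Wife', 'Parent', 'Father', 'Mother',
            'Child', 'Son', 'Daughter']; // ... extend with the rest of your list

// Turn a list of label names into a one-hot (or multi-hot) row vector.
function labelsToVector(array $labels, array $classes){
    $vector = array_fill(0, count($classes), 0);
    foreach($labels as $label){
        $index = array_search($label, $classes);
        if($index !== false){
            $vector[$index] = 1;
        }
    }
    return $vector;
}

echo implode(', ', labelsToVector(['Colleague'], $CLASSES)) . PHP_EOL;
// 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
echo implode(', ', labelsToVector(['Child', 'Son'], $CLASSES)) . PHP_EOL;
// 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0
```

Keeping the column order fixed in one place like this also makes Step 3’s hand-classified rows easy to regenerate if you ever add new classifications.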

 

 

Multiple Sender & Recipient Example Classification:

Subject line: You won’t believe this!

It’s un-be-li-e-va-b-l-e

During the post-game celebration Mr. Coach got a whole water cooler dumped on his head!

Everyone laughed as Mr. Coach chased the team off the field.

Love,
Bobby

Example Classification:

Sender (Child, Son)
Recipient (Parent, Father, Mother)

Sender (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Recipient (0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Multiple Sender & Recipient Example Classification:

Subject line: Thanks Mom & Dad!

You are the best parents ever!

Love,
Jane Doe

Example Classification:

Sender (Child, Daughter)
Recipient (Parent, Father, Mother)

Sender (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Recipient (0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

 

This information needs to be logged for each email in your Data Corpus and it should be stored separately from the text of the email.

It’s also highly notable that the *-in-law classifications line up somewhat “tangentially” with their direct counterparts, i.e. Brother, Sister, Brother-in-law, Sister-in-law (as well as any other overlapping classifications), and it can be difficult for a person, let alone a bot, to classify these.

There are three ways to handle this:

Option 1. The simplest and most straightforward is to eliminate overlapping classifications.

Option 2. Leave the classifications in and ignore the overlap. You can always remove them later.

Option 3. Subclass: set the main classification to a value of 1 and the sub-classification to a value greater than 0 but less than 1 (a float), for example 0.5. Should there be multiple sub-classifications, where logical you could rank the sub-classes by probability, with the most likely getting the highest float below 1 in the set and the least likely getting the lowest float above 0.

This will teach the bot that these classes are closely associated and when the bot tries to classify new emails it will automatically weight/group the sub-classes of a classification near the main classification.

It is hypothetically possible for a subclass to be shared between several classes, and in that case the subclass could still be selected as the most likely classification when it genuinely is. This is, however, more complicated.

Thankfully, this IS NOT a “black box” bot; it can be hand-modified, updated and edited after training, so there is no need to implement Option 3 when first starting out.
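For the curious, should you ever try Option 3, a sender row for a Brother might look like this sketch (the 0.5 weight is purely illustrative, not calibrated, and only a fragment of the full classification row is shown):

```php
<?php

// Hypothetical Option 3 row fragment: main classifications get 1,
// the closely-related sub-classification gets a float between 0 and 1.
$sender = [
    'Sibling'        => 1.0,  // main classification
    'Brother'        => 1.0,  // main classification
    'Sister'         => 0.0,
    'Brother-in-law' => 0.5,  // sub-classification, weighted below the main class
    'Sister-in-law'  => 0.0,
];
```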

 

Step 4. Split the Data into Two Separate Groups

The First group is the “Training Data”.

The Second group is the “Test Data”.

When training the bot, you will show the “Training Data” to teach the bot how to classify.

When testing the bot, you will use the “Test Data”.

The reason you hand classified ALL emails before separating them like this is so that when you test the bot you know whether it answered correctly. You simply compare the answers the bot gives for the emails it never trained on with the classifications that were manually assigned by a human.

Which brings us to how to divide the emails among the training and testing groups.

A good rule of thumb is that the larger your “Data Corpus” set is the larger your “Test Data” set can be.

So if the ratio is: TrainingDataRatio:TestDataRatio then:

If you have a tiny Data Corpus of 100 emails and use a ratio of 95%:5% your Training Data is 95 emails and your Test Data is 5 emails.

If you have a Data Corpus of 10K emails and use a ratio of 90%:10% your Training Data is 9000 emails and your Test Data is 1000 emails.

If you have a large Data Corpus of 100K emails you might use a 70%:30% ratio and your Training Data is 70000 emails and your Test Data is 30000 emails.

Most of your data should be used for training and once you know how many emails you have to work with you can decide on an appropriate ratio.

Once you have that ratio you can determine your split using this formula:

Convert TrainingDataRatio:TestDataRatio to decimal values, i.e. 88%:12% becomes 0.88:0.12

Then compute:

(total_number_of_emails x TrainingDataRatio) = number_of_training_emails

(total_number_of_emails x TestDataRatio) = number_of_test_emails

Here’s some PHP code that uses this formula that you can use to determine how to split your data:
<?php

// Convert a whole-number percentage (e.g. 88) to a ratio (e.g. 0.88)
function PercentToRatio($value){

	if(is_numeric($value) && $value >= 0 && $value <= 100){
		return $value / 100;
	}
	die('PercentToRatio($value) only accepts NUMBERS between 0 and 100.');
}

// Change these to fit your needs
$total_number_of_emails = 10278;
$training_data_percentage = 88; // set between 50% and 97%

// Compute Values - Don't change these
$training_data_ratio = PercentToRatio($training_data_percentage);
$test_data_ratio = (1.0 - $training_data_ratio);
$number_of_training_emails = $total_number_of_emails * $training_data_ratio;
$number_of_test_emails = $total_number_of_emails * $test_data_ratio;
// Round the training count up and give the test set the remainder
// so that the two counts always add back up to the total
$number_of_training_emails_round = (int)ceil($number_of_training_emails);
$number_of_test_emails_round = $total_number_of_emails - $number_of_training_emails_round;

// Build Report
$report = "You chose to have $training_data_percentage% of your Data Corpus used as Training Data." . PHP_EOL . PHP_EOL;
$report .= "You have $total_number_of_emails emails so using a ratio split of $training_data_ratio : $test_data_ratio" . PHP_EOL;
$report .= "You should split your emails like this:" . PHP_EOL . PHP_EOL;
$report .= "Training Emails: $number_of_training_emails_round" . PHP_EOL;
$report .= "Test Emails: $number_of_test_emails_round" . PHP_EOL . PHP_EOL;
$report .= 'Formula' . PHP_EOL;
$report .= "($total_number_of_emails x $training_data_ratio) = RoundUp($number_of_training_emails) = $number_of_training_emails_round" . PHP_EOL;
$report .= "($total_number_of_emails x $test_data_ratio) = RoundDown($number_of_test_emails) = $number_of_test_emails_round" . PHP_EOL;

// Report
echo $report . PHP_EOL;
Which will produce this output or something similar:

You chose to have 88% of your Data Corpus used as Training Data.

You have 10278 emails so using a ratio split of 0.88 : 0.12
You should split your emails like this:

Training Emails: 9045
Test Emails: 1233

Formula
(10278 x 0.88) = RoundUp(9044.64) = 9045
(10278 x 0.12) = RoundDown(1233.36) = 1233

It is possible for this formula to produce a floating-point number of emails (as demonstrated above), hence the need to round the values. The rounding will always yield whole numbers that add back up to the total; if there is a remainder, the “extra” fraction of the split email is essentially “recombined” mathematically and added to the training data.

 

Step 5. Train Bot to Classify Emails

Here, roughly, is the pseudo code for the “Train” process. I encourage you to use it to create your own implementation, though I do plan to release my solution in an upcoming post.

<?php
// THIS IS PSEUDO CODE //////////////////////


// for all the emails in the TrainingData
foreach(TrainingData as Email){

    // open or load email subject + body + sender + recipient

    // Example Email would roughly be like this
    // Email[text] = [subject] + [body]
    // Email[sender] = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    // Email[recipient] = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

    // get all the words in the email
    words = split(Email[text])

    // for all the words in this email subject + body
    foreach(words as word){

        // lookup word in dictionary file json,csv,txt,xml other or database

        // if the word doesn't exist already, add it
        if(!knownword(word)){
            add(word)
        }else{
            word[sender] = lookup(word[sender])
            word[recipient] = lookup(word[recipient])
        }

        // set or increment sender value
        word[sender] += Email[sender]

        // set or increment recipient value
        word[recipient] += Email[recipient]

        // Save new count
        save(word)
    }

}
?>
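While you wait for my full solution, here’s a minimal, runnable PHP sketch of the training loop above. It keeps the “lexicon” in an in-memory array instead of a file or database, uses a deliberately crude tokenizer, and all of the names (`trainEmail`, `$lexicon`, the 3-class layout) are my own placeholders, not the final code:

```php
<?php

// Minimal in-memory training sketch: $lexicon maps each word to
// running per-classification counts for sender and recipient.
function trainEmail(array &$lexicon, array $email){

    // crude tokenizer: lowercase, then split on anything non-alphanumeric
    $words = preg_split('/[^a-z0-9]+/', strtolower($email['text']), -1, PREG_SPLIT_NO_EMPTY);

    foreach($words as $word){

        // if the word doesn't exist already, add it with all-zero counts
        if(!isset($lexicon[$word])){
            $lexicon[$word] = [
                'sender'    => array_fill(0, count($email['sender']), 0),
                'recipient' => array_fill(0, count($email['recipient']), 0),
            ];
        }

        // increment the counts for every classification tagged on this email
        foreach($email['sender'] as $i => $value){
            $lexicon[$word]['sender'][$i] += $value;
        }
        foreach($email['recipient'] as $i => $value){
            $lexicon[$word]['recipient'][$i] += $value;
        }
    }
}

// Tiny demonstration with only 3 classifications: [Colleague, Parent, Child]
$lexicon = [];
trainEmail($lexicon, [
    'text'      => 'Thanks Mom and Dad',
    'sender'    => [0, 0, 1],  // Child
    'recipient' => [0, 1, 0],  // Parent
]);
echo implode(', ', $lexicon['mom']['sender']) . PHP_EOL;    // 0, 0, 1
echo implode(', ', $lexicon['mom']['recipient']) . PHP_EOL; // 0, 1, 0
```

A real version would persist `$lexicon` between runs, but the counting logic is the same.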

 

What happens during training is that each email is “read” by the bot. It extracts each word, then checks if it already knows the word; if not, it adds it to its ‘dictionary’ or ‘lexicon’. Then each of the classification counts is either set (in the case that the word was just added) or incremented (in the case of a known word).

If the word is new the value will be set to 1 for all the classes associated with the email and 0 for the others. Whereas with known words, the associated email class values are incremented by 1 each time the word is encountered.

After training the classifier on all the Training Data, certain words should favor some classes over others.

For example, having the word ‘LinkedIn’ in a subject or body might give the email a higher probability of being between a Colleague, Employee, Manager or Employer, with a lower probability for the other classifications, though not necessarily a 0% chance either, because other classifications may also use the word. The key with this bot is that some words will moderately or strongly “favor”, “associate with” or “prefer” some classifications over others.

Further, in the database or other storage file the word ‘LinkedIn’ might have scores like this:

word (sender), (recipient)

LinkedIn
sender (2270, 3514, 894, 8712, 65, 22, 0, 7, 3, 8, 4, 45, 3, 1, 2, 12, 5, 3, 7, 15, 1, 6, 0, 2, 1, 0, 0, 0, 0)
recipient(3511, 2560, 9808, 4215, 227, 46, 10, 76, 13, 8, 4, 45, 3, 1, 2, 12, 5, 3, 7, 15, 1, 6, 0, 2, 1, 0, 0, 0, 0)

These scores would hypothetically result from the word LinkedIn being classified more often as Colleague, Employee, Manager or Employer than as the other classes, in both the sender and recipient fields.

Here are some simulated charts based on the hypothetical LinkedIn example above that should illustrate what we want, though this is an ‘idealized’ scenario (I made the numbers up to illustrate my point).

LinkedIn Word Analysis –  Recipient

LinkedIn Word Analysis –  Sender

 

As you can see on the Sender() and Recipient() radar charts, the bot is seemingly reaching out almost as if it wants to grab the Manager and Employer classes, strongly indicating the bot associates this word with those classes and, to a lesser extent, the Colleague & Employee classes.

The goal of training is to teach the bot to identify the collective word patterns present in an email so that it correctly identifies (classifies) the relationships between the sender and recipients.

Step 6. Testing & Using the Classifier Bot

Basically, we do the same thing we did while training (use a Bag of Words model) but without updating/training the bot; instead we accumulate the classification sums as we look at each word present in the email. Once every word has been counted, we have our prediction.

Example:

Let’s say the email simply consisted of the string “Thanks Mom & Dad\nLove,\nJane Doe”

The words the bot would see are ‘Thanks’, ‘Mom’, ‘Dad’, ‘Love’, ‘Jane’, ‘Doe’.

So, the bot would simply add up the values for all the word columns (classifications) and then sort them from most likely to least likely.

Just for a moment, ignore all the words except two, Mom & Dad, so that we can illustrate what we want to happen.

These words could look something like this :

Mom
sender(1156, 562, 2754, 96, 653, 886, 1052, 1941, 1791, 364, 7276, 6988, 8626, 2628, 674, 1013, 65, 2324, 1862, 1322, 2923, 1011, 1262, 1902, 2518, 1734, 2574, 562, 575)
recipient(1119, 2711, 1286, 870, 662, 2844, 2078, 8289, 8410, 8534, 2310, 883, 1567, 2514, 350, 1459, 2268, 978, 918, 157, 1061, 1096, 1716, 588, 1459, 325, 1459, 2004, 2856)

Dad
sender(2262, 238, 1479, 546, 1188, 867, 644, 1870, 2728, 817, 8906, 9988, 8589, 437, 2438, 1759, 1879, 2079, 2580, 2664, 1345, 253, 1200, 1187, 273, 2346, 474, 2986, 514)
recipient(308, 2154, 287, 2617, 2703, 281, 720, 7033, 8718, 8159, 1730, 1596, 2748, 121, 1552, 1938, 1342, 1241, 955, 520, 742, 1026, 1676, 353, 2910, 1652, 1468, 2295, 2750)

 

 

Notice how columns 11, 12 and 13, the classifications (Child, Son, Daughter), in the Sender() groups highly correlate with Mom & Dad (larger numbers). This might be because when a sender says the word Mom or Dad they are speaking to their parents, but not always, and the words reflect this in that none of the classifications is 0.

Further, two colleagues may discuss one of their parents and say something like “My Dad loves to BBQ”; therefore a single word cannot definitively classify a document. However, an entire email of sufficient length should contain enough highly correlated words that an attempt at classification can be made.

This is done by adding the classification values for all words present.

Example:

sender = Thanks[sender] + Mom[sender] + Dad[sender] + Love[sender] + Jane[sender] + Doe[sender]
recipient = Thanks[recipient] + Mom[recipient] + Dad[recipient] + Love[recipient] + Jane[recipient] + Doe[recipient]

If the same word appears multiple times in the email the classifier should add it to the sums as many times as it appears.
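Under the same in-memory layout as the training sketch earlier, this lookup-and-sum step could look like the following (again, `classify` and the toy 3-class lexicon are my own placeholder names and numbers):

```php
<?php

// Sum the per-classification counts of every word in the email;
// repeated words are added as many times as they appear.
function classify(array $lexicon, string $text, int $numClasses){

    $sums = [
        'sender'    => array_fill(0, $numClasses, 0),
        'recipient' => array_fill(0, $numClasses, 0),
    ];

    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach($words as $word){
        if(!isset($lexicon[$word])){
            continue; // unknown words contribute nothing
        }
        foreach($lexicon[$word]['sender'] as $i => $count){
            $sums['sender'][$i] += $count;
        }
        foreach($lexicon[$word]['recipient'] as $i => $count){
            $sums['recipient'][$i] += $count;
        }
    }
    return $sums;
}

// Toy lexicon with 3 classifications: [Colleague, Parent, Child]
$lexicon = [
    'mom' => ['sender' => [0, 1, 9], 'recipient' => [0, 8, 1]],
    'dad' => ['sender' => [1, 2, 8], 'recipient' => [0, 9, 2]],
];
$prediction = classify($lexicon, 'Thanks Mom and Dad', 3);
echo implode(', ', $prediction['sender']) . PHP_EOL;    // 1, 3, 17 -> Child
echo implode(', ', $prediction['recipient']) . PHP_EOL; // 0, 17, 3 -> Parent
```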

After adding all the columns up for all the words for both the sender group and the recipient group you will then have your predicted classification:

sender = (4031, 32198, 17505, 7206, 33504, 16725, 12538, 28448, 19706, 8371, 99753, 87594, 102595, 16916, 22418, 10090, 14046, 3599, 7075, 16216, 3718, 8627, 32545, 26574, 13259, 19471, 32123, 20820, 31418)
recipient = (24496, 12777, 25341, 19170, 14671, 8599, 10491, 94750, 81289, 81218, 2905, 14275, 32484, 4478, 25067, 5370, 23929, 10872, 34001, 9487, 11855, 24731, 17955, 10658, 3265, 12308, 20352, 15770, 5048)

 

If we then sort the classifications by highest to lowest we get the most likely to the least likely:

 

Sender:              Recipient:
Daughter   102595    Parent   94750
Child       99753    Father   81289
Son         87594    Mother   81218

 

The Sender is most likely the Daughter, Child or Son, and the Recipient is most likely the Parent, Father or Mother, in that order.

You can convert the scores to percentages by summing the group and then dividing each score by the group sum.

Example:

Sender Sum = 749089

Sender:
Daughter       102595 / 749089 = 0.136959693707957
Child          99753 / 749089 = 0.133165752000096
Son            87594 / 749089 = 0.116934035875577
Spouse         33504 / 749089 = 0.044726327579233
Aunt           32545 / 749089 = 0.04344610586993
Employee       32198 / 749089 = 0.04298287653403
Mother-in-law  32123 / 749089 = 0.042882754919642
Sister-in-law  31418 / 749089 = 0.041941611744399
Parent         28448 / 749089 = 0.03797679581465
Cousin         26574 / 749089 = 0.035475090409818
Brother        22418 / 749089 = 0.029927018017886
Brother-in-law 20820 / 749089 = 0.027793760154
Father         19706 / 749089 = 0.02630662044163
Father-in-law  19471 / 749089 = 0.025992906049882
Manager        17505 / 749089 = 0.023368384798068
Sibling        16916 / 749089 = 0.022582096386411
Husband        16725 / 749089 = 0.022327120008437
Grandson       16216 / 749089 = 0.02164762798546
Grandparent    14046 / 749089 = 0.018750775942512
Nephew         13259 / 749089 = 0.017700166468871
Wife           12538 / 749089 = 0.016737664015891
Sister         10090 / 749089 = 0.01346969452228
Uncle          8627 / 749089 = 0.011516655564292
Mother         8371 / 749089 = 0.011174907120516
Employer       7206 / 749089 = 0.009619684710362
Grandmother    7075 / 749089 = 0.009444805623898
Colleague      4031 / 749089 = 0.005381203034619
Granddaughter  3718 / 749089 = 0.004963362163908
Grandfather    3599 / 749089 = 0.004804502536

 

Simplifying turns the values into percentages that are easy to read, understand and explain:

Sender:

Daughter       13.70%
Child          13.32%
Son            11.69%
Spouse          4.47%
Aunt            4.34%
Employee        4.30%
Mother-in-law   4.29%
Sister-in-law   4.19%
Parent          3.80%
Cousin          3.55%
Brother         2.99%
Brother-in-law  2.78%
Father          2.63%
Father-in-law   2.60%
Manager         2.34%
Sibling         2.26%
Husband         2.23%
Grandson        2.16%
Grandparent     1.88%
Nephew          1.77%
Wife            1.67%
Sister          1.35%
Uncle           1.15%
Mother          1.12%
Employer        0.96%
Grandmother     0.94%
Colleague       0.54%
Granddaughter   0.50%
Grandfather     0.48%

And the same can be done with the Recipient values as well.
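The score-to-percentage step is just each score divided by its group total; here’s a quick sketch (the `toPercentages` helper name is my own):

```php
<?php

// Convert raw classification scores to percentages of the group total,
// sorted from most to least likely.
function toPercentages(array $scores){
    $total = array_sum($scores);
    $percentages = [];
    foreach($scores as $class => $score){
        $percentages[$class] = round(($score / $total) * 100, 2);
    }
    arsort($percentages); // most likely first
    return $percentages;
}

// Example with three made-up scores; note the percentages are relative
// to whatever group you pass in, so always pass the FULL group of scores:
print_r(toPercentages(['Daughter' => 50, 'Child' => 30, 'Son' => 20]));
```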

At this point you compare all the predictions with the hand-assigned classifications and work to minimize the difference between the predicted classifications and the actual classifications on the Test Data.

You will never get 100% predictive accuracy (especially if you have a long list of classifications), but you are really only interested in the trend with this sort of bot; if you get a large spike or peak classification, or group of classifications, then your bot is working.

I do plan to release the code for this project in an upcoming post; however, I encourage you to try building this system for yourself and of course please like, comment and share!

If you’re bummed that you didn’t get the code for this project today I’ll tell you what… I’m Feeling Generous. 😉

I hope you enjoyed this post and consider supporting me on Patreon.

 

 

Much Love,

~Joy

 
