Search

Geek Girl Joy

Artificial Intelligence, Simulations & Software

Month

July 2018

Brute Force Password Breaking

Welcome, today we’re going to talk about “Brute Force” Password Breaking.

I know, it’s a controversial topic… though they say you should write about controversy if you want to get read right? ūüėõ But¬†to ensure its as controversial as possible I’m going to give you an actual working prototype that you can use to try Brute Force attack your own passwords! ūüėČ

However, before you call me irresponsible consider the following.

There are plenty of methods for cracking passwords that are far more efficient than a Brute Force attack and if you have to resort to Brute Force (trying all possible combinations against an unknown and likely long password) a modern hashing algorithm then you are essentially screwed!

If the creator of the password chose a “simple”, ¬†“common” or “guessable” password… like for example “123456789”, “Cat” or “Password”, Brute Force isn’t even necessary!

 

Rainbow Tables

A hacker can simply use a “Rainbow Table” (so colorful), which is basically a database, to lookup the pre-computed solution for the hash of your password and obtain the unhashed “Plaintext” of your password.

In most cases where a hacker can use a Rainbow Table they will save themselves significant time and effort simply because they don’t have to do any hashing (which cumulatively can take a lot of time), it’s just a matter of traversing a table and retrieving the associated data of the index that matches the hash provided, assuming of course that the hash was pre-computed and exists in the Rainbow Table.

For example, the hash output of the SHA1 algorithm for the word ‘Cat’ is cebe54c7626cb1cefaca5f7f5ea6c96b4a7a2882¬†and if a hacker was able to break into a database containing this hashed credential then they could reverse lookup the hashed password in seconds.

Clearly what makes a Rainbow Table so useful to a hacker is that it can take the insurmountable challenge of Brute Forcing a password and change it into something that is at the very least, manageable.

There are techniques however, called Password ‘Salting‘ & ‘Peppering‘ which expose a severe weakness with¬†Rainbow Tables… namely, if you cannot pre-compute a big database of all possible hashes then you are forced to resort to some other technique if not a Brute Force attack.

A Salt is some unique (and long) string value that can be added to a password before it is hashed to make building a Rainbow Table difficult if not impossible.

Here’s an example of how Salting works, lets take the insecure password we used above ‘Cat’, and look at what happens when we add a 30 character Salt to it prior to hashing the value.

Password: Cat

Salt: LPdjlEfrMhGkENHf3e4Lp7VZgXd77f

Hash(Password + Salt) = d73b50b3d80762f55a28a44e49568be064ee8208

Note:To ensure you get the maximum benefit from Salting your passwords you should use a different Salt for every password credential that you store in your database. If you don’t it will be much easier to for a hacker to steal your passwords.

As you can see, by including a Salt with the password when it is hashed the result produced is different than the word by itself. This different hash is what is stored in the database.

The benefit of doing this is simply that it is now extremely unlikely (improbable but not impossible) that the combination of the word Cat and this long randomly generated Salt string will have been pre-computed anywhere, so it doesn’t even matter if a hacker gets both the hash and the Salt because a Rainbow Table will have to be generated from scratch using the Salt, which can take horrendous amounts of time, potentially on the order of a human life time or even longer for some hashing algorithms, password lengths and of course depending on how much raw computation an attacker can field.

 

Dictionary Attack

A¬†Dictionary Attack is similar to the concept of a Rainbow Table in that it also utilizes a database, however where they differ is that a Rainbow Table’s purpose is to store pre-computed hashes so you can just lookup a password, whereas a Dictionary Attack still requires the attacker to break your password through hashing.

So from a Hackers perspective a Rainbow Table is preferable to a Dictionary.

The purpose of a Dictionary Attack then is to contain all the most likely passwords, which are then combined with your Salt (if they have it) or also generated as is the case with a Pepper before hashed to generate a new Rainbow Table that is unique to the Salt or Pepper.

This is a form of Brute Force attack and takes time to generate though it is still better than “true” brute force because it relies on the idea that words mean things and we all share the same words and meanings.

Anytime there is a massive breach of user credentials where the passwords are compromised… i.e. the passwords were kept as plain text or hashed using a single unchanged Salt, or simply too short of a salt for every password so all passwords in the database become compromised… all the compromised passwords get added to Dictionaries (and Rainbow Tables) because the passwords were used by someone so they are “known to be good” and are therefore more likely to be used by someone else.

Think of a Dictionary as a list of thousands to millions of “probable” common and known passwords.

The benefit of a Dictionary is that a hacker can focus on all the most likely passwords because people tend to think alike and of all the possible words that COULD be used for a password only a small subset will ever actually be selected by people.

Further, If the dictionary attack fails and the hacker must resort to True Brute Force, they can exclude the passwords in the Dictionary that they already tried and focus on a the Brute Force Attack by generating new previously untried passwords.

 

True Brute Force

The thing is, if Rainbow Tables and Dictionary Attacks failed to break your credentials then most hackers will give up and even the skilled professionals are forced to question the real value of their target because in most cases it’s not worth the hassle!

Having to resort to Brute force means that they tried EVERYTHING else and failed!

Your servers proved to be secure, your protocols are working, your “wetware” er… IT staff isn’t giving out credentials over the phone… and that “dumpster dive” the hackers took at 3 AM to see if your staff is throwing out documents with “sensitive” information, proved useless…

True Brute Force means that you take all typable letters:

Upper Case letters:  ABCDEFGHIJKLMNOPQRSTUVWXYZ

Lower Case letters: abcdefghijklmnopqrstuvwxyz

Symbols:¬†!”#$%&'()*+,-./:;<=>?@[\]^_`{|}~

And while we’re at it why not include Numbers too:¬†0123456789

For a total of 94 possible characters and then starting with only 1 character, generate and iterate through all possible permutations until you give up or a solution is found!

This sounds easier to accomplish than it actually is!

Sure, generating the data is easy enough (I show you how and provide code below) but due to the sheer numbers involved it’s essentially an impossible task when considering the hash time is multiplied by all the combinations you have to try and a longer password means more combinations are required to break the hash.

This is why it is recommended by some technologists that your password include Upper and Lower case letters along with numbers and symbols and be longer than 15 characters!

Let’s do the math!

If there are 94 possible characters and a password is only 1 character long we would only need to try a maximum of 941 = 94 characters before we can guarantee we have the password.

In the case of a 3 character long password (943 = 830,584) we would have to try a little less than 1 million combinations!

This is simple exponentiation with the base being the number of symbols possible and the exponent is the length of the password.

Here’s a table:

Password Length Combinations Calculation
1 94 941
2 8,836 942
3 830,584 943
4 78,074,896 944
5 7,339,040,224 945
6 689,869,781,056 946
7 64,847,759,419,264 947
8 6,095,689,385,410,816 948
9 572,994,802,228,617,000 949
10 53,861,511,409,490,000,000 9410
11 5,062,982,072,492,060,000,000 9411
12 475,920,314,814,253,000,000,000 9412
13 44,736,509,592,539,800,000,000,000 9413
14 4,205,231,901,698,740,000,000,000,000 9414
15 395,291,798,759,682,000,000,000,000,000 9415

 

Clearly a nice long password that contains upper and lower case letters along with numbers and symbols is definitely going to give your hacker a bad day though I personally prefer the idea of Passphrases which are a series of words in a phrase rather than a single word.

If the words you use in your passphrase are nice and long, have¬†upper and lower case letters along with numbers and symbols and isn’t a well known phrase (so that it’s not in a phrase dictionary) then you can be fairly confident that your account is secure for the foreseeable future.

Breaker Class

So as you can see, I am not helping anyone break anything by giving out this code, your passwords are safe! ūüėČ

Regular readers will notice in the code below I used my AppTimer Class that I released over on my Benchmarking PHP article.

Breaker has no properties and only 3 methods (not including PHP’s Magic Methods).

Methods: GetSymbols, IncrementValues & Match

GetSymbols()

GetSymbols converts the numbers in an array to the char the number represents.

For example: The number 0 represents the exclamation symbol ! and the number 33 represents uppercase B

IncrementValues()

IncrementValues takes an array of numbers and increments the values of each by 1 unless that would exceed the allowable max in which case it resets it to 0 and the value to the right is created or incremented by 1.

Match()

At this time, match just does a comparison though feel free to hash the values that are passed to this method to complete the Brute Force Attack program.

As is, this will only brute force plain text against plain text.

Breaker.php

<?php
set_time_limit(0); // Disable the time limit on script execution

// Create Breaker Class 
//
// This tool is a demonstration of a "brute force" password breaker.
// This prototype is provided AS IS and for informational & educational
// purposes only! 
//
// Modern Password Hashing should have little fear of this code though
// for a minimum level of rationality I have excluded the parts that 
// would handle hashing the passwords to slow down the "Script Kiddies" 
// however any reasonably skilled PHP developer would have little trouble 
// adding their own hashing function to complete this prototype.
// 
// DO NOT USE THIS SOFTWARE TO VIOLATE THE LAW! COMPLY WITH ALL DIRECTION
// GIVEN TO YOU BY LAW ENFORCEMENT! ANY ILLEGAL OR MALICIOUS ACTIONS YOU 
// CHOOSE TO ENGAGE IN OUTSIDE OF AN EDUCATIONAL SETTING ARE YOUR OWN! 
class Breaker{

    function GetSymbols($values, $symbols){
        foreach($values as &$value){
            if(isset($symbols[$value])){
                $value = $symbols[$value];
            }
        }
        return $values;
    }

    function IncrementValues($values, $number_of_valid_symbols){
        foreach($values as $key=>&$value){
            // If this value is maxed
            if($values[$key] >= $number_of_valid_symbols){
                // Reset it to 0 and increment the next value
                $values[$key] = 0; // Reset this value
                if(!isset($values[$key+1])){
                    $values[$key+1] = '0';
                }else{
                    $values[$key+1]++; // Reset this value
                }
            }
            else{
                // If key greater than 0
                if($key > 0){
                    if($values[$key-1]>$number_of_valid_symbols){
                        // Increment this value
                        $values[$key]++;
                    }
                }
                else{
                    // Always Increment this value
                    $values[$key]++;
                }
            }            
        }
        return $values;
    }

    function Match($hash, $test_password){
        if($test_password == $hash){
          return true;
        }
        return false;
    }

}



include('AppTimer.Class.php');     // Include AppTimer class file

$Timer = new AppTimer();           // Create Timer
$Timer->Start();                   // Start() Timer


$password_to_break = 'Cat'; 

// Concatenate all symbols explicitly
//$valid_symbols = "!\"#$%&'()*+,-./0123456789:;<=>?@"; // note the escaped double quote
//$valid_symbols .= "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`";
//$valid_symbols .= "abcdefghijklmnopqrstuvwxyz{|}~";
//$valid_symbols = str_split($valid_symbols); // split string into array

// Cleaner way to Create array of ASCII char 33 - 126
$valid_symbols = range(chr(33), chr(126)); // Shorter version of above
$number_of_valid_symbols = count($valid_symbols); // 94 chars

$length = 1; // Start at 1 digit length to try all possible combinations
             // This assumes the password length is unknown.
// If the length of the password is known then use the correct length i.e:
// $length = strlen($password_to_break); 
                                       
// Generate first plain text password to try
$values = str_split(strrev(str_repeat('0', $length)));
$PlainTextPasswordBreaker = new Breaker();
$test_password = $PlainTextPasswordBreaker->GetSymbols($values, $valid_symbols);

while(!$PlainTextPasswordBreaker->Match($password_to_break, $test_password)){
    // We have not found the correct password so keep trying to generate it
    $values = $temp = $PlainTextPasswordBreaker->IncrementValues($values, $number_of_valid_symbols);
    $temp = $PlainTextPasswordBreaker->GetSymbols($values, $valid_symbols);
    $test_password = strrev(implode('', $temp));
    
    //echo $test_password . PHP_EOL; // Uncomment to watch breaker
                                   // Will make Breaker much slower
}


$Timer->Stop();             // Stop() Timer
$time = $Timer->Report();   // Report()


echo "Password: $test_password \nFound in: $time" . PHP_EOL;

As presented the output of the code will look something like this:

Password: Cat 
Found in: 5.8302 Seconds

If you uncomment the echo on line 105 inside the while loop you can watch each permutation get generated however echo will slow down the time it actually takes to find the password.

Here is what that would look like (note that I shortened the output to just the last few permutations before the solution was found):

...
Ca!
Ca"
Ca#
Ca$
Ca%
Ca&
Ca'
Ca(
Ca)
Ca*
Ca+
Ca,
Ca-
Ca.
Ca/
Ca0
Ca1
Ca2
Ca3
Ca4
Ca5
Ca6
Ca7
Ca8
Ca9
Ca:
Ca;
Ca
Ca?
Ca@
CaA
CaB
CaC
CaD
CaE
CaF
CaG
CaH
CaI
CaJ
CaK
CaL
CaM
CaN
CaO
CaP
CaQ
CaR
CaS
CaT
CaU
CaV
CaW
CaX
CaY
CaZ
Ca[
Ca\
Ca]
Ca^
Ca_
Ca`
Caa
Cab
Cac
Cad
Cae
Caf
Cag
Cah
Cai
Caj
Cak
Cal
Cam
Can
Cao
Cap
Caq
Car
Cas
Cat
Password: Cat 
Found in: 14.9678 Seconds

 

USE BREAKER RESPONSIBLY FOR EDUCATIONAL PURPOSES ONLY!

You can find Breaker on my GitHub profile here.

You can find a list of all my other posts on my Topics and Posts page & I hope you enjoyed reading this article, if so please support me on Patreon for as little as $1 a month.

Your financial support allows me to dedicate time to developing awesome projects like this and while I am publishing them without cost, that isn’t to say they are free. I am doing this all by myself and it takes me a lot of time and effort to build and publish these projects for your enjoyment.

Your financial support means a lot to me and allows me to be able to afford to spend the time necessary to make great content for you.

So I ask again, please support me on Patreon for as little as $1 a month.

Feel free to suggest a project you would like to see built in the comments and if it sounds interesting it might just get built and featured here on my blog.

 

 

Much Love,

~Joy

Advertisements

Mysterious Warehouse

My best friend is writing a series of books that are a mix of Sin City meets The Crow, though I’m not at liberty to reveal too much about the project at this time.

Anyway, he asked me to design some cover art samples for the books and instantly I was on board!

As part of the design process I am generating a lot of “extra” art work that won’t be used for anything… so I thought I would share a few of the images here.

Please enjoy!

 

Click for Full Size

 

I hope you enjoy my posts, if so i’d love your support on¬†Patreon, for as little as $1 a month your financial support allows me to dedicate time to developing awesome projects and art like this and while I publish without cost, that isn‚Äôt to say my work is free. I am doing this all by myself and it takes me a lot of time and effort.

Your financial support allows me to be able to afford to spend the time necessary to make great content for you.

So I ask again, please support me on Patreon for as little as $1 a month.

 

 

Much Love,

~Joy

Email Relationship Classifier Testing The Bot

Welcome back, today is last post in this series and one that I know many of you have been eagerly awaiting… we’re finally going to test the bot!

So I think what I’m going to do is first give you the code to review and then after I will walk you through it and explain what’s going on.

Test.php

I have labeled the subheadings in this post after the section comments in the code to make it easier to review so you should refer back to this code while reading the article to aid in understanding.

<?php
// This function will load the human scored JSON class files
function LoadClassFile($file_name){
  // Get file contents
  $file_handle = fopen($file_name, 'r');
  $file_data = fread($file_handle, filesize($file_name));
  fclose($file_handle);
  return $file_data;
}

// We will pass our Results to this function to save so it can be reviewed later
function CreateResultsFile($file_name, $output_path, $results){
  
  // Write file contents
  $file_handle = fopen($output_path . basename($file_name), 'w');
  $file_data = fwrite($file_handle, $results);
  fclose($file_handle);
}



// Include Classes
function ClassAutoloader($class) {
  include 'Classes/' . $class . '.Class.php';
}
spl_autoload_register('ClassAutoloader');


// Instantiate Objects
$myTokenizer = new Tokenizer();
$myEmailFileManager = new FileManager();
$myJSONFileManager = new FileManager();
$myDatabaseManager = new DatabaseManager();


// No Configuration needed for the Tokenizer Object

// Configure FileManager Objects
$myEmailFileManager->Scan('DataCorpus/TestData');
$myJSONFileManager->Scan('DataCorpus/TestDataClassifications');
$number_of_testing_files = $myEmailFileManager->NumberOfFiles();
$number_of_JSON_files = $myJSONFileManager->NumberOfFiles();

// Configure DatabaseManager Object
$myDatabaseManager->SetCredentials(
  $server = 'localhost',
  $username = 'root',
  $password = 'password',
  $dbname = 'EmailRelationshipClassifier'
);


// Make sure the files are there and the number of files are the same
if(($number_of_testing_files != $number_of_JSON_files) 
   || ($number_of_testing_files == 0 || $number_of_JSON_files == 0) 
  ){
  die(PHP_EOL . 'ERROR! the number of training files and classification files are not the same or are zero! Run CreateClassificationFiles.php first.');
}
else{
  // Loop Through Files
  for($current_file = 0; $current_file < $number_of_testing_files; $current_file++){
 
  
  $report_data = '';
  
  /////////////////////////
  // Bot Predict Classification
  /////////////////////////
  
  $file = $myEmailFileManager->NextFile();
  
  $myTokenizer->TokenizeFile($file);
    
  $report_data .= "Found Tokens:". PHP_EOL;
  // Loop Through Tokens
  foreach($myTokenizer->tokens as $word=>$count){
    $report_data .= "$word $count" . PHP_EOL;
    // Get word classification scores
    $myTokenizer->tokens[$word] = $myDatabaseManager->ScoreWord($word, $count);
    
    // Remove unknown word tokens
    if($myTokenizer->tokens[$word] == NULL){
    unset($myTokenizer->tokens[$word]);
    }
  }
  
  $report_data .= PHP_EOL . "Known Words:". PHP_EOL;
  $known_words = array_keys($myTokenizer->tokens);
  foreach($known_words as $word){
    $report_data .= $word . PHP_EOL;
  }
  
  $weights = array();
  // Sum tokens
  foreach($myTokenizer->tokens as $word=>$word_data){
    foreach($word_data as $class_name=>$class_count){
    @$weights[$class_name] += $class_count;
    }
  }
  $weights = array_diff($weights, array(0)); // remove 0 value classes

  // Sort into sender recipient groups
  foreach($weights as $class=>$count){
    // if key name contains -Sender add to the Sender key
    if(strstr($class, '-Sender')){
    $weights['Sender'][strstr($class, '-Sender', true)] = $count;
    }
    else{// if key name contains -Recipient add to the Recipient key
    $weights['Recipient'][strstr($class, '-Recipient', true)] = $count;
    }
    unset($weights[$class]); // remove the unsorted element
  }
  // sort arrays from more likely to less likely
  array_multisort($weights['Sender'], SORT_DESC);
  array_multisort($weights['Recipient'], SORT_DESC);



  /////////////////////////
  // Human Classified Data
  /////////////////////////
  $EmailClassifications = json_decode(LoadClassFile($myJSONFileManager->NextFile()), true);
  $EmailClassifications = array_diff($EmailClassifications, array(0)); // remove 0 value classes
  $sum = array_sum($EmailClassifications); // sum the total of all classes weights
  // sort into sender recipient groups
  // and convert values to percentages
  foreach($EmailClassifications as $class=>$count){
    // if key name contains -Sender add to the Sender key
    if(strstr($class, '-Sender')){
    $EmailClassifications['Sender'][strstr($class, '-Sender', true)] = $count;
    }
    else{// if key name contains -Recipient add to the Recipient key
    $EmailClassifications['Recipient'][strstr($class, '-Recipient', true)] = $count;
    }
    unset($EmailClassifications[$class]); // remove the unsorted element
  }
  // sort arrays
  array_multisort($EmailClassifications['Sender'], SORT_DESC);
  array_multisort($EmailClassifications['Recipient'], SORT_DESC);


  $report_data .= PHP_EOL;
  
  
  
  
  /////////////////////////
  // Report - Sender
  /////////////////////////

  $report_data .= PHP_EOL . "Predicted Sender Class & Score: " . PHP_EOL;
  $sum = array_sum($weights['Sender']); // sum the total of Sender weights
  foreach($weights['Sender'] as $class=>$count){
     $report_data .= "$class:  " . round(($count / $sum) * 100) . '%' . PHP_EOL;
  }
    

  $report_data .= PHP_EOL . "Human Scored Sender Class: " . PHP_EOL;
  $sum = array_sum($EmailClassifications['Sender']); // sum the total of Sender EmailClassifications
  foreach($EmailClassifications['Sender'] as $class=>$count){
     $report_data .= "$class:  " . round(($count / $sum) * 100) . '%' . PHP_EOL;
  }
  
  /////////////////////////
  // Report - Sender Mistakes
  /////////////////////////

  $report_data .= PHP_EOL . "Incorrect Predicted Sender Classes: " . PHP_EOL;
  $IPSC = array_keys(array_diff_key($weights['Sender'], $EmailClassifications['Sender']));
  if(count($IPSC) > 0){
    foreach($IPSC as $class){
     $report_data .= $class . PHP_EOL;
    }
  }else{
    $report_data .= 'None' . PHP_EOL;
  }
    
  $report_data .= PHP_EOL . "Missing Predicted Sender Classes: " . PHP_EOL;
  $MPSC = array_keys(array_diff_key($EmailClassifications['Sender'], $weights['Sender']));
  if(count($MPSC) > 0){
    foreach($MPSC as $class){
     $report_data .= $class . PHP_EOL;
    }
  }else{
    $report_data .= 'None' . PHP_EOL;
  }
  

  /////////////////////////
  // Report - Recipients
  /////////////////////////
  
  $sum = array_sum($weights['Recipient']); // sum the total of Sender weights
  $report_data .= PHP_EOL . "Predicted Recipient Class & Score: " . PHP_EOL; 
  foreach($weights['Recipient'] as $class=>$count){
     $report_data .= "$class:  " . round(($count / $sum) * 100) . '%' . PHP_EOL;
  }
  

  $report_data .= PHP_EOL . "Human Scored Recipient Class: " . PHP_EOL; 
  $sum = array_sum($EmailClassifications['Recipient']); // sum the total of Recipient EmailClassifications
  foreach($EmailClassifications['Recipient'] as $class=>$count){
     $report_data .= "$class:  " . round(($count / $sum) * 100) . '%' . PHP_EOL;
  }
  
  
  /////////////////////////
  // Report - Recipient Mistakes
  /////////////////////////
  
  $report_data .= PHP_EOL . "Incorrect Predicted Recipient Classes: " . PHP_EOL;
  $IPRC = array_keys(array_diff_key($weights['Recipient'], $EmailClassifications['Recipient']));
  if(count($IPRC) > 0){
    foreach($IPRC as $class){
     $report_data .= $class . PHP_EOL;
    }
  }else{
    $report_data .= 'None' . PHP_EOL;
  }
  
  $report_data .= PHP_EOL . "Missing Predicted Recipient Classes: " . PHP_EOL;
  $MPRC = array_keys(array_diff_key($EmailClassifications['Recipient'], $weights['Recipient']));
  if(count($MPRC) > 0){
    foreach($MPRC as $class){
     $report_data .= $class . PHP_EOL;
    }
  }else{
    $report_data .= 'None' . PHP_EOL;
  }
  
  /////////////////////////
  // Report - Overall
  /////////////////////////
  
  // Compute Results
  $sum_pediction = count($weights['Sender']) + count($weights['Recipient']);
  $sum_pediction -= count($IPSC); // Penalize Incorrect Predicted Sender Classes
  $sum_pediction -= count($MPSC) / 2; // Penalize Missing Sender Classes at half a point each
  $sum_pediction -= count($IPRC); // Penalize Incorrect Predicted Recipient Classes
  $sum_pediction -= count($MPRC) / 2; // Penalize Missing Recipient Classes at half a point each
  $sum_actual = count($EmailClassifications['Sender']) + count($EmailClassifications['Recipient']);
  
  $report_data .= PHP_EOL . "Overall Accuracy: " . PHP_EOL;
  $report_data .= ($sum_pediction / $sum_actual) * 100 . '%' . PHP_EOL;
  
  CreateResultsFile($file, 'DataCorpus/TestResults/', $report_data);
  echo $report_data;
  }
}

echo PHP_EOL . 'Testing Complete!' . PHP_EOL;

 

You will probably recognize the first portion of this code from Train.php and in fact there are really only two differences in initializing the environment between the two scripts.

The first difference is that Test.php includes a function called CreateResultsFile()¬†that we’ll use to save the report that the bot generates, so we can review it later and the second is the paths that we provide to $myEmailFileManager & $myJSONFileManager are different from the ones used in Train.php.

Once the fail conditions around line 54 pass, the bot will step through all testing data beginning around line 61.

The first order of business is to generate the bot’s “prediction” of what relationship classes are present in the email.

Bot Predict Classification

The bot starts by Tokenizing the file which means building a bag of words model for the email and then the found tokens are passed to the $myDatabaseManager Object which uses it’s¬†ScoreWord() method to scale the word class values using the information obtained during training. Unknown words are ignored and have no bearing on classifying the email in my implementation.

$myDatabaseManager->ScoreWord() method

For reference here is the ScoreWord() method for your review.

public function ScoreWord(&$word, &$count){
  
  if(count($this->classifications) == 0){
    $this->GetKnownClasses();
    $classifications = array();
    foreach($this->classifications as $class=>$value){
      $classifications["$class-Sender"] =  $value;
    }
    foreach($this->classifications as $class=>$value){
      $classifications["$class-Recipient"] =  $value;
    }
    $this->classifications = $classifications;
  }
  

  if($this->KnownWord($word)){
    $this->Connect();
    $sql = "SELECT * FROM `Words` WHERE `Word` LIKE '$word'";
    $result = $this->conn->query($sql);

    if ($result->num_rows > 0) {
    $word_data = $result->fetch_assoc();
    foreach($word_data as $key=>$value){
       if($key == 'ID'){
         unset($word_data["$key"]);
       }
       elseif($key == 'Word'){
         unset($word_data["$key"]);
       }
       else{
         $word_data[$key] *= ($count * $this->classifications[$key]);
       }
    }
    return $word_data;
    }
  }else{
    // unknown word... add it or ignore it
  }
}

 

Note that you could easily add new words found during test data to the bot knowledge base with zero relationship class affiliations and you could later manually update the word classes or do additional training to improve the bot’s “familiarity” with the word.

 

Then the $weights array is created to hold the prediction (the bot generated classifications) which is all the class counts summed and unnecessary elements removed.

Why $weights and not $prediction? I don’t know, maybe I was being¬†$pretentious. ūüėõ

The array is then sorted into sender and recipient groups followed by lowest class to the highest class.

 

Human Classified Data

Next the human generated classifications stored in JSON are loaded into the $EmailClassifications array and the values are sorted into sender and recipient groups as well.

At this point we have extracted enough information to begin generating the statistical portion of the $report.

 

Report – Sender

Beginning on line 147 we evaluate the Sender data starting with the bot prediction by adding up the $sum “total count” of all the predicted weights then we determine what percentage each individual weight contributes to the overall prediction by dividing the weight value against the $sum then multiply the resulting number by 100% of the $sum.

This same process is repeated for the human classified data.

 

Report – Sender Mistakes

We then evaluate the bot predicted sender data for mistakes by comparing the bot’s predicted classification $weights against the known human generated $EmailClassifications¬†using the array_diff_key & array_keys¬†PHP language functions to extract and store the¬†“Incorrect Predicted Sender Classes” as the¬†$IPSC¬†array, so we can use them later during the final evaluation.

We then do the same but in reverse, comparing¬†$EmailClassifications¬†against $weights for the “Missing Predicted Sender Classes” and save them the as $MPSC array.

 

Report РRecipients & Recipients Mistakes

We repeat this same process we used for the Sender data for the Recipients data beginning on line 189 followed by processing any mistakes on line beginning on line 208 which results in the $IPRC (Incorrect Predicted Recipient Classes) & $MPRC (Missing Predicted Recipient Classes) arrays.

 

Report – Overall

The last portion of the report is to evaluate the “overall” accuracy using the data the we collected and generated while working on the report.

We start by creating a $sum_prediction variable and setting its value to the total count of weights present in the $weights array.

We then proceed to subtract “points” from this number for every incorrect and or missing relationship classes.

Incorrect predictions receive a full point penalty whereas  missing predictions are penalized as half a point.

My thought process being that it’s better (but not perfect) for the bot to miss a class and exclude it than to include an incorrect class.

You may wish to use a different scoring rubric than this depending on what the repercussions of incorrect or missing data are in your model, this method is provided as a simple example.

We then create a variable called $sum_actual and set its value to the total count of classes present in the email as classified by a human.

The final “Overall Accuracy” is computed by taking the $sum_prediction and dividing it by the $sum_actual and then multiplying against 100 to get a percent.

We then save the $results report using the CreateResultsFile() function and echo the report to the screen as well.

Ideally $results would be captured to facilitate programmatic evaluation of the overall accuracy of the model, like in a csv or in a database so that you can compare all the results of all the test data,  however as this is only a prototype I went with a .txt dump of the individial report that the bot generates.

The output of this bot should look something like this:

 


Found Tokens:
YOU 1
WONT 1
BELIEVE 1
THIS 1
ITS 1
UNBELIEVABLE 1
DURING 1
THE 3
POSTGAME 1
CELEBRATION 1
MR 2
COACH 2
GOT 1
A 1
WHOLE 1
WATER 1
COOLER 1
DUMPED 1
ON 1
HIS 1
HEAD 1
EVERYONE 1
LAUGHED 1
AS 1
CHASED 1
TEAM 1
OFF 1
FIELD 1
LOVE 1
BOBBY 1

Known Words:
YOU
THE
LOVE


Predicted Sender Class & Score: 
Child:  53%
Daughter:  47%

Human Scored Sender Class: 
Son:  50%
Child:  50%

Incorrect Predicted Sender Classes: 
Daughter

Missing Predicted Sender Classes: 
Son

Predicted Recipient Class & Score: 
Parent:  36%
Mother:  32%
Father:  32%

Human Scored Recipient Class: 
Mother:  33%
Father:  33%
Parent:  33%

Incorrect Predicted Recipient Classes: 
None

Missing Predicted Recipient Classes: 
None

Overall Accuracy: 
70%

Testing Complete!

 

As it stands this bot is quite rough however you can improve it by modeling word bi-grams to account for the context the words are used in rather than just noting which words are present.

Additionally, I capitalize and process hyphens and apostrophes out of words which reduces the number of words the bot learns (i.e. dont vs don’t vs Don’t vs Dont vs DoNt… all become DONT) which simplifies some things and reduces database storage requirements a bit, however it does fail to properly model language because people can express meaning in ways that might get removed by this processing which obviously lowers the accuracy of your model in the long run.

You can find this bot and all its files on GitHub – emails not included.

I hope you enjoyed reading about building this bot , if so please support me on Patreon for as little as $1 a month.

Your financial support allows me to dedicate time to developing awesome projects like this and while I am publishing them without cost, that isn’t to say they are free. I am doing this all by myself and it takes me a lot of time and effort to build and publish these projects for your enjoyment.

Your financial support means a lot to me and allows me to be able to afford to spend the time necessary to make great content for you.

So I ask again, please support me on Patreon for as little as $1 a month.

Feel free to suggest a project you would like to see built in the comments and if it sounds interesting it might just get built and featured here on my blog.

 

 

Much Love,

~Joy

 

Blog at WordPress.com.

Up ↑