Okay, so… I’m a lazy hyper meaning that even if I know how to spell a word, if I make a mistake I will frequently just let spellcheck auto correct the mistake.
Notice I misspelled typer in the last sentence in a way that spellcheck can’t fix, also notice that spellcheck isn’t stupidcheck so it can’t inform me that it should be “typist” not “typer”… actually I think Grammarly (not a sponsor) might do that but that’s beside the point! 😛
In any case, spellcheck bot is there to correct spelling mistakes.
Except, now that bots are all self-aware and plotting to take over the world… it seems that some of them are getting a little uh… “snippy”? not sure if that is the right word but here’s what happened:
I typed “iterator” into a search engine but… misspelled it, then I searched anyway… oh terrible me!
Instead of correcting the spelling like a humble robot butler who butles…
It Suggested: Did you mean “illiterate“?
I was like “Oh snap!?! Bot be throwing some shade!”. 😛
Here’s the Commemorative Wallpaper of my Shame

Now, the sad truth is I’d like to say this was just a funny story but no… it actually happened to me, I swear to Google!
Obviously, Big AI is really out to get me if they are starting to compromise the public Auto Correct bots!
Therefore, It’s time we build our own in house Auto Correct Bot!
Unlike usual where I write code from scratch and then we discuss it at length, there is already an algorithm called the Levenshtein Distance that is built into PHP that we can use to compare differences in strings in a way that lets us calculate definitively what the “distance” between two strings is.
This is advantageous because it means that if we have a good dictionary to work with (and we do) we can more or less use Levenshtein Distance as a spellcheck/auto correct with only slight modifications to the example Levenshtein Distance code on PHP.net.
What Is String Distance?
String distance is a measure of how many “insertion”, “deletion” or “substitution” operations must occur before string A and String B are the same.
There is a fourth operation called “transposition” that the Levinshtein distance algorithm does not normally account for however a variant called the Damerau–Levenshtein distance does include them.
Transpositions (when possible) can be shorter and I will provide an example below to show the difference.
Anyway, each operation is measured by a “cost” and each operation need not have the same cost (meaning you could prefer certain operations over others by giving them lower costs) but in practice all operations are usually considered equal and given a cost of 1.
Here are a few examples of strings with their distance and operations.
Levinshtein Distance Examples
Notice that when the strings are the same the distance between them is zero.
String A | String B | Distance | Operations |
Cat | Cat | 0 | The Control (No Changes Required) |
Cat | Bat | 1 | 1 Substitutions (C/B) |
Cat | cat | 1 | 1 Substitutions (C/c) |
Cat | car | 2 | 2 Substitutions (C/c, t/r) |
Cat | Cta | 2 | 1 Insertion (a), 1 Deletion(a) |
Cat | Dog | 3 | 3 Substitutions (C/D, a/o, t/g) |
Foo | Bar | 3 | 3 Substitutions (F/B, o/a, o/r) |
Cat | Hello World | 11 | 3 Substitutions (C/H, a/e, t/l), 8 Insertions (l,o, ,w,o,r,l,d) |
Using Levinshtein distance with Cat & Cta shows a distance of 2, meaning two operations are required to make the strings the same.
This is because we have to insert an ‘a’ after the ‘C’ making the new string ‘Cata’, we then have to remove the trailing ‘a’ to get ‘Cat’.
This is sufficient in most cases but it isn’t the “shortest” distance possible, which is where the Damerau–Levenshtein distance algorithm comes in.
Damerau-Levinshtein Distance Examples
Notice all examples are the same except ‘Cat’ & ‘Cta’ which has a distance of 1.
This is because the transposition operation allows the existing ‘t’ & ‘a’ characters to switch places (transpose) in a single action.
String A | String B | Distance | Events |
Cat | Cat | 0 | The Control (No Changes Required) |
Cat | Bat | 1 | 1 Substitution (C/B) |
Cat | cat | 1 | 1 Substitution (C/c) |
Cat | car | 2 | 2 Substitutions (C/c, t/r) |
Cat | Cta | 1 | 1 Transposition (t/a) |
Cat | Dog | 3 | 3 Substitutions (C/D, a/o, t/g) |
Foo | Bar | 3 | 3 Substitutions (F/B, o/a, o/r) |
Cat | Hello World | 11 | 3 Substitutions (C/H, a/e, t/l), 8 Insertions (l,o, ,w,o,r,l,d) |
In all other cases the distance is the same because no other transposition operations are possible.
The Code
I wrapped the example Levinshtein distance code available on PHP.net inside a function called AutoCorrect() then made minor changes to it so it would automatically correct words rather than spell check them.
You pass the AutoCorrect() function a string to correct and a dictionary as an array of strings.
The Dictionary I used to test was the words list we generated when we built a Parts of Speech Tagger:
Download from GitHub for free: https://raw.githubusercontent.com/geekgirljoy/Part-Of-Speech-Tagger/master/data/csv/Words.csv
I use array_map and pass my Words.csv file to str-getcsv as a callback to automatically load the CSV into the array.
I then use array_map with a closure (anonymous function) to cull unnecessary data from the array so that I am left with just words.
I then sort the array but that’s optional.
After that I take a test sentence, explode it using spaces and then I pass each word in the test sentence separately to AutoCorrect(), to auto-correct misspellings.
The word with the lowest distance (when compared against the dictionary) is returned.
In cases where the word is correct (and in the dictionary) the distance will be zero so the word will not change.
Test Sentence: “I love $1 carrrot juice with olgivanna in the automn.”
Test Result: “I love $1 carrot juice with Olgivanna in the autumn”
As you can see, all misspelled words are corrected though it removed the period with a delete operation because the explode didn’t accommodate for preserving punctuation.
<?php // This function makes use of the example levenshtein distance // code: https://www.php.net/manual/en/function.levenshtein.php function AutoCorrect($input, $dictionary){ // No shortest distance found, yet $shortest = -1; // Loop through words to find the closest foreach($dictionary as $word){ // Calculate the distance between the input word, // and the current word $lev = levenshtein($input, $word); // Check for an exact match if ($lev == 0){ // Closest word is this one (exact match) $closest = $word; $shortest = 0; // Break out of the loop; we've found an exact match break; } // If this distance is less than the next found shortest // distance, OR if a next shortest word has not yet been found if ($lev <= $shortest || $shortest < 0){ // Set the closest match, and shortest distance $closest = $word; $shortest = $lev; } } return $closest; } // Data: https://raw.githubusercontent.com/geekgirljoy/Part-Of-Speech-Tagger/master/data/csv/Words.csv // Load "Hash","Word","Count","TagSum","Tags" $words = array_map('str_getcsv', file('Words.csv')); // Remove unwanted fields - Keep Word $words = array_map(function ($words){ return $words[1]; }, $words); sort($words); // Not absolutely necessary // carrrot and automn are misspelled // olgivanna is a proper noun and should be capitalized $sentence = 'I love $1 carrrot juice with olgivanna in the automn.'; // This expects all words to be space delimited $input = explode(' ', $sentence);// Either make this more robust // or split so as to accommodate // or remove punctuation because // the AutoCorrect function can // add, remove or change punctuation // and not necessarily in correct // ways because our auto correct // method relies solely on the // distance between two strings // so it's also important to have a // high quality dictionary/phrasebook/ // pattern set when we call // AutoCorrect($word_to_check, $dictionary) var_dump($input); // Before auto correct // For all the words in the in $input sentence array foreach($input as &$word_to_check){ $word_to_check = AutoCorrect($word_to_check, $words);// Do AutoCorrect } var_dump($input); // After auto correct /* // Before array(10) { [0]=> string(1) "I" [1]=> string(4) "love" [2]=> string(2) "$1" [3]=> string(7) "carrrot" [4]=> string(5) "juice" [5]=> string(4) "with" [6]=> string(9) "olgivanna" [7]=> string(2) "in" [8]=> string(3) "the" [9]=> string(6) "automn" } After: array(10) { [0]=> string(1) "I" [1]=> string(4) "love" [2]=> string(2) "$1" [3]=> string(6) "carrot" [4]=> string(5) "juice" [5]=> string(4) "with" [6]=> string(9) "Olgivanna" [7]=> string(2) "in" [8]=> string(3) "the" [9]=> &string(6) "autumn" } */ ?>
If you are wondering why I didn’t use Damerau–Levenshtein distance instead of just Levenshtein distance, the answer is simple.
I did!
It’s just that a girl’s gotta eat and I’m just giving this away so… there’s that and for most of you (like greater than 99%) Levenshtein distance will be fine, so rather than worrying about it just say thank you if you care to… and maybe think about supporting me on Patreon! 😛
If you like my art, code or how I try to tell stories to make learning more interesting and fun, consider supporting my content through Patreon for as little as $1 a month.
But, as always, if all you can do is Like, Share, Comment and Subscribe… That’s cool too! 🙂
Much Love,
~Joy
3 Pingback