Geek Girl Joy

Artificial Intelligence, Simulations & Software

Month

January 2020

Auto Corrected

Okay, so… I’m a lazy hyper meaning that even if I know how to spell a word, if I make a mistake I will frequently just let spellcheck auto correct the mistake.

Notice I misspelled typer in the last sentence in a way that spellcheck can’t fix, also notice that spellcheck isn’t stupidcheck so it can’t inform me that it should be “typist” not “typer”… actually I think Grammarly (not a sponsor) might do that but that’s beside the point! 😛

In any case, spellcheck bot is there to correct spelling mistakes.

Except, now that bots are all self-aware and plotting to take over the world… it seems that some of them are getting a little uh… “snippy”? Not sure if that is the right word, but here’s what happened:

I typed “iterator” into a search engine but… misspelled it, then I searched anyway… oh terrible me!

Instead of correcting the spelling like a humble robot butler who butles…

 

It suggested: Did you mean “illiterate”?

I was like “Oh snap!?! Bot be throwing some shade!”. 😛

Here’s the Commemorative Wallpaper of my Shame

Auto Corrected Wallpaper

Now, the sad truth is I’d like to say this was just a funny story but no… it actually happened to me, I swear to Google!

 

Obviously, Big AI is really out to get me if they are starting to compromise the public Auto Correct bots!

Therefore, it’s time we build our own in-house Auto Correct Bot!

Unlike usual, where I write code from scratch and then we discuss it at length, this time there is already an algorithm called the Levenshtein distance, available in PHP as the built-in levenshtein() function, that we can use to compare two strings and calculate exactly how far apart they are, that is, the “distance” between them.

This is advantageous because it means that if we have a good dictionary to work with (and we do) we can more or less use Levenshtein Distance as a spellcheck/auto correct with only slight modifications to the example Levenshtein Distance code on PHP.net.

What Is String Distance?

String distance is a measure of the minimum number of “insertion”, “deletion” or “substitution” operations that must occur before String A and String B are the same.

There is a fourth operation called “transposition” that the Levenshtein distance algorithm does not normally account for; however, a variant called the Damerau–Levenshtein distance does include it.

Transpositions (when possible) can result in a shorter distance and I will provide an example below to show the difference.

Anyway, each operation is measured by a “cost” and each operation need not have the same cost (meaning you could prefer certain operations over others by giving them lower costs) but in practice all operations are usually considered equal and given a cost of 1.
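To make the idea of per-operation costs concrete, here is a minimal sketch using the optional cost arguments of PHP's levenshtein() (insertion, replacement and deletion, in that order). The cost values are made up purely for illustration; the rest of this post sticks with the default cost of 1 for everything.

```php
<?php
// Default costs: insertion, replacement (substitution) and deletion all cost 1.
// 'Cat' -> 'Bat' is a single substitution (C/B).
$default = levenshtein('Cat', 'Bat');

// Hypothetical weighting: substitutions cost 3, insertions and deletions cost 1.
// Now it is cheaper to delete 'C' and insert 'B' (1 + 1) than to substitute (3).
$weighted = levenshtein('Cat', 'Bat', 1, 3, 1);

echo "Default costs:  $default\n";  // 1
echo "Weighted costs: $weighted\n"; // 2
```

Raising the cost of one operation doesn't forbid it; it just makes the algorithm prefer a cheaper combination of the others whenever one exists.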

Here are a few examples of strings with their distance and operations.

Levenshtein Distance Examples

Notice that when the strings are the same the distance between them is zero.

| String A | String B | Distance | Operations |
|---|---|---|---|
| Cat | Cat | 0 | The Control (No Changes Required) |
| Cat | Bat | 1 | 1 Substitution (C/B) |
| Cat | cat | 1 | 1 Substitution (C/c) |
| Cat | car | 2 | 2 Substitutions (C/c, t/r) |
| Cat | Cta | 2 | 1 Insertion (a), 1 Deletion (a) |
| Cat | Dog | 3 | 3 Substitutions (C/D, a/o, t/g) |
| Foo | Bar | 3 | 3 Substitutions (F/B, o/a, o/r) |
| Cat | Hello World | 11 | 3 Substitutions (C/H, a/e, t/l), 8 Insertions (l, o, space, W, o, r, l, d) |
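The distances in the table can be checked with PHP's built-in levenshtein(). This is only a quick verification sketch; the $pairs and $distances variable names are just for illustration.

```php
<?php
// The same string pairs as the Levenshtein examples table.
$pairs = [
    ['Cat', 'Cat'],
    ['Cat', 'Bat'],
    ['Cat', 'cat'],
    ['Cat', 'car'],
    ['Cat', 'Cta'],
    ['Cat', 'Dog'],
    ['Foo', 'Bar'],
    ['Cat', 'Hello World'],
];

$distances = [];
foreach ($pairs as [$a, $b]) {
    // levenshtein() is case sensitive, so 'Cat' vs 'cat' counts as 1 substitution
    $distance = levenshtein($a, $b);
    $distances[] = $distance;
    printf("%-4s %-12s %d\n", $a, $b, $distance);
}
```

The comparison is case sensitive, which is why ‘Cat’ and ‘cat’ are one substitution apart rather than identical.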

Using Levenshtein distance with Cat & Cta shows a distance of 2, meaning two operations are required to make the strings the same.

This is because, starting from ‘Cta’, we have to insert an ‘a’ after the ‘C’ making the new string ‘Cata’, then we have to remove the trailing ‘a’ to get ‘Cat’.

This is sufficient in most cases but it isn’t the “shortest” distance possible, which is where the Damerau–Levenshtein distance algorithm comes in.

Damerau–Levenshtein Distance Examples

Notice all examples are the same except ‘Cat’ & ‘Cta’ which has a distance of 1.

This is because the transposition operation allows the existing ‘t’ & ‘a’ characters to switch places (transpose) in a single action.

| String A | String B | Distance | Operations |
|---|---|---|---|
| Cat | Cat | 0 | The Control (No Changes Required) |
| Cat | Bat | 1 | 1 Substitution (C/B) |
| Cat | cat | 1 | 1 Substitution (C/c) |
| Cat | car | 2 | 2 Substitutions (C/c, t/r) |
| Cat | Cta | 1 | 1 Transposition (t/a) |
| Cat | Dog | 3 | 3 Substitutions (C/D, a/o, t/g) |
| Foo | Bar | 3 | 3 Substitutions (F/B, o/a, o/r) |
| Cat | Hello World | 11 | 3 Substitutions (C/H, a/e, t/l), 8 Insertions (l, o, space, W, o, r, l, d) |

In all other cases the distance is the same because no other transposition operations are possible.
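PHP's levenshtein() does not count transpositions, so here is a minimal sketch of one common way to add them: the restricted variant often called "optimal string alignment", which only swaps adjacent characters. The function name osa_distance() is hypothetical, every operation is assumed to cost 1, and this is not the Damerau–Levenshtein implementation mentioned elsewhere in this post. Like levenshtein(), it compares bytes, so it is only reliable for single-byte text.

```php
<?php
// Optimal string alignment distance: Levenshtein plus adjacent transpositions.
// Hypothetical helper; all four operations are assumed to cost 1.
function osa_distance(string $a, string $b): int
{
    $lenA = strlen($a);
    $lenB = strlen($b);

    // $d[$i][$j] = distance between the first $i bytes of $a and the first $j bytes of $b
    $d = [];
    for ($i = 0; $i <= $lenA; $i++) {
        $d[$i][0] = $i; // delete everything
    }
    for ($j = 0; $j <= $lenB; $j++) {
        $d[0][$j] = $j; // insert everything
    }

    for ($i = 1; $i <= $lenA; $i++) {
        for ($j = 1; $j <= $lenB; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;

            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,        // deletion
                $d[$i][$j - 1] + 1,        // insertion
                $d[$i - 1][$j - 1] + $cost // substitution (or match)
            );

            // Transposition: the last two characters are the same pair, swapped
            if ($i > 1 && $j > 1
                && $a[$i - 1] === $b[$j - 2]
                && $a[$i - 2] === $b[$j - 1]) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }

    return $d[$lenA][$lenB];
}

echo osa_distance('Cat', 'Cta'), "\n"; // 1 (one transposition)
echo levenshtein('Cat', 'Cta'), "\n";  // 2 (no transposition available)
```

For every other pair in the table the two functions agree, because none of those pairs contains two adjacent characters that are simply swapped.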

The Code

I wrapped the example Levenshtein distance code available on PHP.net inside a function called AutoCorrect() then made minor changes to it so it would automatically correct words rather than spell check them.

You pass the AutoCorrect() function a string to correct and a dictionary as an array of strings.

The Dictionary I used to test was the words list we generated when we built a Parts of Speech Tagger:

Download from GitHub for free: https://raw.githubusercontent.com/geekgirljoy/Part-Of-Speech-Tagger/master/data/csv/Words.csv

I use file() to read my Words.csv file into an array of lines, then I use array_map with str_getcsv as the callback to automatically parse each line of the CSV into an array of fields.

I then use array_map with a closure (anonymous function) to cull unnecessary data from the array so that I am left with just words.

I then sort the array but that’s optional.
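The loading steps can be tried without downloading anything by using a few in-memory lines. This is a sketch only: the sample rows below are invented to match the "Hash","Word","Count","TagSum","Tags" column layout and are not real data from Words.csv. Dropping the header row and passing the delimiter, enclosure and escape arguments explicitly to str_getcsv() are additions for this sketch, not part of the original script.

```php
<?php
// Made-up sample lines in the same column layout as Words.csv
$lines = [
    '"Hash","Word","Count","TagSum","Tags"',
    '"h1","carrot","3","1","nn"',
    '"h2","autumn","5","1","nn"',
    '"h3","Olgivanna","1","1","np"',
];

// Parse every CSV line into an array of fields
$rows = array_map(function ($line) {
    return str_getcsv($line, ',', '"', '\\');
}, $lines);

// Drop the header row so the literal string "Word" doesn't end up in the dictionary
array_shift($rows);

// Keep only the Word field (index 1) from each row
$words = array_map(function ($row) {
    return $row[1];
}, $rows);

sort($words); // Optional; uppercase letters sort before lowercase ones

print_r($words); // Olgivanna, autumn, carrot
```

With the real file, $lines would come from file('Words.csv', FILE_IGNORE_NEW_LINES) instead of the hard-coded array.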

After that I take a test sentence, explode it using spaces and then I pass each word in the test sentence separately to AutoCorrect(), to auto-correct misspellings.

The word with the lowest distance (when compared against the dictionary) is returned.

In cases where the word is correct (and in the dictionary) the distance will be zero so the word will not change.

Test Sentence: “I love $1 carrrot juice with olgivanna in the automn.”

Test Result: “I love $1 carrot juice with Olgivanna in the autumn”

As you can see, all misspelled words are corrected, though it removed the period with a delete operation because the explode left the punctuation attached to the last word (“automn.”), so it was treated as part of the misspelling.

<?php


// This function makes use of the example levenshtein distance
// code: https://www.php.net/manual/en/function.levenshtein.php
function AutoCorrect($input, $dictionary){

    // No shortest distance found, yet
    $shortest = -1;

    // Fall back to the input itself if the dictionary is empty
    $closest = $input;
    
    // Loop through words to find the closest
    foreach($dictionary as $word){
        
        // Calculate the distance between the input word,
        // and the current word
        $lev = levenshtein($input, $word); 

        // Check for an exact match
        if ($lev == 0){

            // Closest word is this one (exact match)
            $closest = $word;
            $shortest = 0;

            // Break out of the loop; we've found an exact match
            break;
        }

        // If this distance is less than or equal to the shortest distance
        // found so far (ties go to the later dictionary word), OR if no
        // shortest distance has been found yet
        if ($lev <= $shortest || $shortest < 0){
            // Set the closest match, and shortest distance
            $closest = $word;
            $shortest = $lev;
        }
    }
    
    return $closest;
}


// Data: https://raw.githubusercontent.com/geekgirljoy/Part-Of-Speech-Tagger/master/data/csv/Words.csv

// Load "Hash","Word","Count","TagSum","Tags"
$words = array_map('str_getcsv', file('Words.csv'));

// Remove unwanted fields - Keep Word (field index 1 of each row)
$words = array_map(function ($row){ return $row[1]; }, $words);

sort($words); // Not absolutely necessary 

// carrrot and automn are misspelled 
// olgivanna is a proper noun and should be capitalized
$sentence = 'I love $1 carrrot juice with olgivanna in the automn.';

// This expects all words to be space delimited
$input = explode(' ', $sentence);// Either make this more robust
                                 // or split so as to accommodate 
                                 // or remove punctuation because
                                 // the AutoCorrect function can
                                 // add, remove or change punctuation
                                 // and not necessarily in correct
                                 // ways because our auto correct
                                 // method relies solely on the 
                                 // distance between two strings
                                 // so it's also important to have a 
                                 // high quality dictionary/phrasebook/
                                 // pattern set when we call
                                 // AutoCorrect($word_to_check, $dictionary)


var_dump($input); // Before auto correct

// For all the words in the $input sentence array
foreach($input as &$word_to_check){
    $word_to_check = AutoCorrect($word_to_check, $words);// Do AutoCorrect
}
unset($word_to_check); // Break the reference so the last element is a plain value again

var_dump($input); // After auto correct



/*
Before:
array(10) {
  [0]=>
  string(1) "I"
  [1]=>
  string(4) "love"
  [2]=>
  string(2) "$1"
  [3]=>
  string(7) "carrrot"
  [4]=>
  string(5) "juice"
  [5]=>
  string(4) "with"
  [6]=>
  string(9) "olgivanna"
  [7]=>
  string(2) "in"
  [8]=>
  string(3) "the"
  [9]=>
  string(7) "automn."
}
After:
array(10) {
  [0]=>
  string(1) "I"
  [1]=>
  string(4) "love"
  [2]=>
  string(2) "$1"
  [3]=>
  string(6) "carrot"
  [4]=>
  string(5) "juice"
  [5]=>
  string(4) "with"
  [6]=>
  string(9) "Olgivanna"
  [7]=>
  string(2) "in"
  [8]=>
  string(3) "the"
  [9]=>
  string(6) "autumn"
}
*/

?>
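One possible way to keep punctuation intact is to correct only the runs of letters and leave everything else where it is. The sketch below does that with preg_replace_callback(). The helper names nearest_word() and correct_sentence() are hypothetical, the tiny dictionary stands in for Words.csv, and the pattern assumes plain ASCII words. Unlike AutoCorrect() above, ties go to the first closest word rather than the last.

```php
<?php
// Hypothetical compact helper: return the dictionary word with the lowest distance
function nearest_word(string $input, array $dictionary): string
{
    $closest = $input;        // Fall back to the input if the dictionary is empty
    $shortest = PHP_INT_MAX;

    foreach ($dictionary as $word) {
        $lev = levenshtein($input, $word);
        if ($lev < $shortest) { // Strict comparison: the first closest word wins
            $closest = $word;
            $shortest = $lev;
        }
    }

    return $closest;
}

// Correct only runs of ASCII letters; spaces, digits, '$' and '.' pass through untouched
function correct_sentence(string $sentence, array $dictionary): string
{
    return preg_replace_callback(
        '/[A-Za-z]+/',
        function (array $match) use ($dictionary) {
            return nearest_word($match[0], $dictionary);
        },
        $sentence
    );
}

// Stand-in dictionary for illustration only
$dictionary = ['I', 'love', 'carrot', 'juice', 'with', 'Olgivanna', 'in', 'the', 'autumn'];

echo correct_sentence('I love $1 carrrot juice with olgivanna in the automn.', $dictionary), "\n";
// I love $1 carrot juice with Olgivanna in the autumn.
```

Because “$1” and the trailing period never reach the distance comparison, they can't be added, removed or changed by it.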

If you are wondering why I didn’t use Damerau–Levenshtein distance instead of just Levenshtein distance, the answer is simple.

I did!

It’s just that a girl’s gotta eat and I’m just giving this away so… there’s that and for most of you (like greater than 99%) Levenshtein distance will be fine, so rather than worrying about it just say thank you if you care to… and maybe think about supporting me on Patreon! 😛


If you like my art, code or how I try to tell stories to make learning more interesting and fun, consider supporting my content through Patreon for as little as $1 a month.

But, as always, if all you can do is Like, Share, Comment and Subscribe… That’s cool too! 🙂

Much Love,

~Joy

 

From The Ashes

Like a Raspberry-Phoenix emerging from a pan-galactic pie baking in a cosmic furnace (see the wallpaper 😛 ), I am reborn from last week’s fiery machine-death torpor and the psyop continues unabated!

From The Ashes 1920×1080 Wallpaper

The Pi came in next day and… was… dead on arrival. No boot, send it back!

See why I gave myself a week off? 😛

Though… damn my thrift! By buying the necessary components piecemeal, I set things back a few days.

See the frustration in my eyes?

Joy misses her bots and coding

Termux and codeaholics-anonymous will only get you so far before you need a real code-fix! 😛

In any case, in spite of my wholly imagined and fictional one-sided feud with Jeff Bezos (which he started), I now have a freshly minted Raspberry Pi 4 that I would link to on Amazon but due to tragic aforementioned circumstances… I’ll pass!

I’d also just like to add, may SpaceX (sadly, not a sponsor) win the space race!

Then again… Mr. Bezos… I’m sorry, please keep taking my monies and sending me all the nice shiny things! Blue Origin (sadly, also not a sponsor) for the win!!!

You know what they say… all advertising is good advertising… right?

 

Belated Hot-Takes

So… I skipped a week of posting due to the computer troubles and since this is an unscheduled hiccup in my publishing schedule, I didn’t have much in the way of content planned for this post other than to just let everyone know I’m back posting again. 🙂

You can all look forward to more art and code soon!

But… since you’re already here and I like to put on a show for the fans… I might as well try something a little different and offer a few “Hot-Takes” on all the recent news and events!

Keep in mind, some humor punches up and some punches down, others… just flail about fecklessly like a T-Rex desperate to scratch its itchy nose with its tiny little arms!

This… will mostly be that last one, and of course…


Not The Post I Wanted To Make Today

So… this is not the post I had planned for today but unfortunately my personal computer died this week. 😦

It’s definitely a setback and tears were shed but like they say, the show must go on!

Shakespeare wrote:

 She should have died hereafter;
There would have been a time for such a word.
— To-morrow, and to-morrow, and to-morrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life’s but a walking shadow, a poor player
That struts and frets his hour upon the stage
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury
Signifying nothing.

I resent Bill’s accurate name calling but I freely admit to the fretting!

In any case, I went to turn my computer on and it wouldn’t power on.

After a little investigation inside the case it turned out that the PW SW (power switch) cable failed.

It needed to be repaired or replaced, but since the solder points were stripped (what can I say, this machine was ~8 years old when I got it), I pulled a replacement off my old scrap parts computer.

Minutes later I had rerun the “new” cable through the case and connected it to the motherboard.

I even managed to install the new switch in the old case housing.

I closed the case, bit my lip and pressed the power button… The fans spun up and the OS loaded.

Out of the jaws of Cerberus I’d snatched victory over the machines once again!

Herculean tech miracles out of the way and feeling rather satisfied with my repairs, I went to the kitchen to reward myself with a nice cup of coffee.

I returned to my desk to find my computer power cycling. 😦

Not POST’ing (Power On Self Test) just On… Off… On… Off… With a black screen.

I suspect Big AI had a hand in this but as of right now I cannot prove it!

So, I pulled the CMOS battery and the RAM, and I disconnected the hard drive.

This should force the computer to issue a “beep code” and display an error during POST… But still NO POST codes.

I reinstalled the RAM, replaced the CMOS battery and reconnected the drive, please boot!!

Just… On… Off… On… Off… NO POST.

This strongly indicates that the motherboard has failed.

What this means for the blog right now is that I’m writing this post on my phone. 😛

Going forward I’ve decided to give the new Raspberry Pi 4 a try.

Unfortunately an RPi shares its 4 GB RAM between the CPU and GPU and is not as capable of doing some of the projects that I had in the works but it’s a temporary setback, though I’m not going to say this won’t limit my ability to create larger projects and more interesting art because that remains to be seen.

I do intend to keep creating content as long as I enjoy what I’m doing and my readers want me to continue, Big AI will not stop me!

I have an SATA to USB adapter lying around here somewhere so hopefully I can recover the art and posts I’ve already completed but not posted yet. 😛

Having said that, I think I’m going to take next week off from posting and try to sort things out.

So, if you enjoy what I do and have maybe been on the fence about supporting my content through Patreon for a month or more, now really would be a great time to help out with a dollar or more.

But, as always, if all you can do is Like, Share, Comment and Subscribe… That’s cool too! 🙂

Much Love,
~Joy
