Welcome! Today we’re going to wrap up our Parts of Speech tagging bot prototype.

That doesn’t mean we’re done with this code (or database); it’s just that the prototype now does what we set out to accomplish (Tokenizing & Lexing Natural Language), and further development at this point is unnecessary for our purposes. There are tons of other improvements we could add in the future if we want to turn this prototype into a full product, and I encourage you to experiment! 🙂

There were some changes to the Database (last week and this week), but I have uploaded the most recent data to the GitHub Project Repo: Part-Of-Speech-Tagger

We last left off with 3 unknown lexemes, and the word ‘jumps’ was mistagged.

Additionally, our tagging process is more or less effective, but it’s not quick, and that simply won’t do if we want to use our tagger to do fun things in the future, so we’ll cover how we solve that today too.

There is much to discuss, so let’s start with the mistagged words.

Mistagged Words

In our test sentence:

“The quick brown fox jumps over the lazy dog. A long-term contract with “zero-liability” protection! Let’s think it over.”

The word ‘jumps’ is a verb (vbz – 3rd. singular present, to be exact), yet our bot says it’s a noun. Why?

Well, as it stands our bot uses a “back-off” (or “fall back”) method in Test.php: a Tri-gram match is considered better than any Bi-gram or Skip-gram match, and likewise Bi-gram & Skip-gram matches are considered better than Uni-gram matches.

So, if it finds a gram that meets its search criteria, it accepts it and stops trying to “back off” (use progressively simpler gram types) to determine the correct Parts of Speech tag.
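To make that flow concrete, here’s a minimal sketch of the cascade. The Lookup*() helpers are hypothetical stand-ins for the SQL queries Test.php actually runs; I’m assuming they return an empty array when nothing matches:

// hypothetical sketch of the back-off cascade
function TagWord($prev, $word, $next){
  $matches = LookupTrigram($prev, $word, $next);                   // most specific gram first
  if(empty($matches)){ $matches = LookupBigram($prev, $word); }    // back off to Bi-grams...
  if(empty($matches)){ $matches = LookupBigram($word, $next); }
  if(empty($matches)){ $matches = LookupSkipgram($prev, $next); }  // ...then Skip-grams...
  if(empty($matches)){ $matches = LookupUnigram($word); }          // ...and finally Uni-grams
  return $matches; // tag matches from the best gram type that hit, or empty if unknown
}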

This works well most of the time because, as we discussed in my last post (Unigrams), more “complex” word grams are more highly correlated with the correct tag and should generally be correct.

Of course, this can lead to odd results when there is a 50/50 probability split between 2 tags, as is the case when we try to assign a tag to the word ‘jumps’.

Given these possible ‘jumps’ Tri-grams present in our sentence, we can use these SQL statements to search:

Tri-grams:

(brownfoxjumps)

SELECT * FROM `Trigrams` WHERE `Hash` = '682b80d207a16c1313996208e895c08b' LIMIT 1;

(foxjumpsover)

SELECT * FROM `Trigrams` WHERE `Hash` = '2f5c3296f447ce849bf3a681eac58aab' LIMIT 1;

(jumpsoverthe)

SELECT * FROM `Trigrams` WHERE `Hash` = 'fa8b014923df32935641ca80b624a169' LIMIT 1;

But all of these queries fail because these Tri-grams do not exist in the Brown Corpus.
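For reference, here’s roughly how one of those look-ups can be issued from PHP. This is a sketch: the PDO connection details are placeholders, and treating the Hash column as an MD5 of the concatenated words is my reading of the parenthesized strings above, not something taken from the repo:

// hypothetical look-up: hash the concatenated Tri-gram words and query for a match
$pdo  = new PDO('mysql:host=localhost;dbname=PartsOfSpeech', 'username', 'password'); // placeholder credentials
$stmt = $pdo->prepare('SELECT * FROM `Trigrams` WHERE `Hash` = :hash LIMIT 1');
$stmt->execute([':hash' => md5('brownfoxjumps')]); // assumed: MD5 of the concatenated words
$trigram = $stmt->fetch(PDO::FETCH_ASSOC); // false when the Tri-gram was never learned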

Our bot then “backs off” to Bi-grams and Skip-grams:

Bi-grams:
(foxjumps)

SELECT * FROM `Trigrams` WHERE `Hash_AB` = '62279802100563e49bc6e7e0713288a2' OR `Hash_BC` = '62279802100563e49bc6e7e0713288a2';

(jumpsover)

SELECT * FROM `Trigrams` WHERE `Hash_AB` = '4cc154f9d12cb151afc33a92e11f9394' OR `Hash_BC` = '4cc154f9d12cb151afc33a92e11f9394';


Skip-grams:
(brownjumps)

SELECT * FROM `Trigrams` WHERE `Hash_AC` = '18999a039be3b0e47927f6ea6ae7526c';

(jumpsthe)

SELECT * FROM `Trigrams` WHERE `Hash_AC` = '488de4a30befe23c46fd5bf9bf8a59f4';

Finally, when it tries the Skip-gram (jumpsthe) it gets a match, but the matched tag is nns (plural noun), not the vbz we would expect.

The original source for the match is cg09.txt, and if we search that document we find this tagged sentence:

“A/at pervading/vbg quality/nn of/in free/jj lyricism/nn and/cc a/at building/nn from/in turns/nns close/rb to/in the/at ground/nn towards/in jumps/nns into/in the/at air/nn gives/vbz the/at work/nn its/pp$ central/jj focus/nn ./.”

If we remove the tags we can more easily read the text:

“A pervading quality of free lyricism and a building from turns close to the ground towards jumps into the air gives the work its central focus .”

Frankly… I’m uncertain as to what the hell the original author meant, and it’s not immediately clear to me why jumps was tagged as a plural noun, but I’m going to say this isn’t correct…

Having said that… what can we do about this? You might be tempted to simply test all gram types (even when a better gram type already found matches) and then add and average the results.

The thing is, that’s time-consuming to do on the fly, and generally speaking it should only be slightly more accurate on average than the back-off approach we’ve already implemented. Further, it doesn’t actually solve the 50/50 probability problem either.

If we search all Tri-grams for the word ‘jumps’ we can see why:

SELECT * FROM `Trigrams` WHERE `Word_A` = 'jumps' OR `Word_B` = 'jumps' OR `Word_C` = 'jumps';

Not only is this query extremely slow, with this one word alone taking 38 seconds to return 6 rows of data, but when we get that data we see that the corpus only saw the word jumps 2 times.
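(As an aside: if you did want to run word-based searches like this regularly, adding plain indexes on the word columns would speed them up considerably. This is standard MySQL tuning, not something the project repo currently does:

ALTER TABLE `Trigrams`
  ADD INDEX `idx_word_a` (`Word_A`),
  ADD INDEX `idx_word_b` (`Word_B`),
  ADD INDEX `idx_word_c` (`Word_C`);

For our purposes, though, a one-off diagnostic query is fine.)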

The number of rows follows a simple relationship:

(times the word appears in the corpus) * (n of the n-gram) = number of rows

‘jumps’ appears 2 times in the Brown Corpus and we are checking Tri-grams, so we get:

(2 * 3) = 6 rows

So there are 6 Tri-grams in the database that contain the word jumps but only 2 tag instances.

One set of three as vbz and one set of three as nns.

Because the bot uses probability to assign tags to words, when there is a 50/50 split it cannot determine which tag is correct, so it will output whichever tag ends up at the top of the list it builds.

This means that if you sort the tag list ascending you will get one tag, and sorting descending will give you the other.

We could manually fix this case, but I’m reluctant to start manually modifying the Brown Corpus.

It would fix this word, but there could be other cases like this… without testing all cases we cannot know, and while such testing is possible, I’m not prepared to do it because it’s simply not necessary.

We can leave the records that contain jumps tagged as a noun on the off chance that it was properly tagged.

I mean, I’m a programmer, so I do study a subset of linguistics, but I’m no English major; my assessment can be wrong and so can yours! 😛

So, the correct choice in this case is to feed the bot additional training material because the more this bot learns the more accurate it becomes!

The thing is, I want to keep the base Parts of Speech tagger project more or less using the “vanilla” Brown Corpus data set, so…

Basically, if/when you feed this bot more data it will automatically resolve issues like this, so we can chalk this issue up to the bot needing additional training from well-tagged sources.


The Unknown Lexemes

The unknown lexemes were the quotes and the compound word zero-liability.

The Quotes
The Brown Corpus makes a distinction between open and closing quote/citation symbols, using `` to mean open quote and '' to mean close quote.

So searching for a plain double quote mark returns nothing, because the bot never learned that word/symbol/tag.
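For example, a hypothetical Uni-gram look-up (assuming a Unigrams table keyed by a Word column, which is my guess at the schema) comes back empty:

SELECT * FROM `Unigrams` WHERE `Word` = '"' LIMIT 1; -- 0 rows: the corpus never contained a bare double quote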

Thankfully this is easy to handle and doesn’t require more training so we can implement a solution.

If we pre-process our lexemes to determine whether they are a quote, and keep track of when we are inside a citation, then we can easily turn the unknown quotes into the correct open and close lexemes and tags that the Brown Corpus uses.

// check if the lexeme is a quote and convert it to the open or close unigram/tag
$quotes = array('"', "''", "``");
if (in_array($lexeme, $quotes)) { // is this a quote/citation?
  if ($in_citation === true) {
    // this is a close quote
    $lexeme = "''";
    $in_citation = false;
  } else {
    // this is an open quote
    $lexeme = "``";
    $in_citation = true;
  }
}
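With this in place, the straight double quotes around “zero-liability” in our test sentence become `` (open) and '' (close), which are lexemes the Brown-trained grams can actually match.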


Zero-Liability

We can perform the same queries for zero-liability that we did for jumps above (including the Uni-gram lookup) and they will all fail to return results, because the Brown Corpus does not contain this word.

Almost always the solution to an unknown word is simply more training, but what’s interesting here is that zero-liability is a compound word. That makes it a special case, because compound words are made up of other words, in this case ‘zero‘ and ‘liability‘.

We could be clever and try splitting them to see if we can tag the pieces as Uni-grams and then combine those tags into a single tag that applies to the compound word… I’ll sketch the idea below, but I’ll leave a proper implementation to you. 😉
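Here’s a rough sketch of that idea. Note the assumptions: a Unigrams table with Word, Tag and Probability columns is my guess at the schema, and joining the part tags with ‘+’ just borrows the Brown Corpus compound style (as in vb+ppo):

// hypothetical sketch: tag a compound word by tagging its hyphen-separated parts as Uni-grams
function TagCompound(PDO $pdo, $word){
  $parts = explode('-', $word); // 'zero-liability' => ['zero', 'liability']
  $stmt  = $pdo->prepare('SELECT `Tag` FROM `Unigrams` WHERE `Word` = :word ORDER BY `Probability` DESC LIMIT 1');
  $tags  = array();
  foreach($parts as $part){
    $stmt->execute([':word' => $part]);
    $tag    = $stmt->fetchColumn();            // most probable tag for this part
    $tags[] = ($tag !== false) ? $tag : 'unk'; // unknown parts stay unknown
  }
  return implode('+', $tags); // e.g. 'nn+nn' for zero-liability
}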

This brings us to one final optimization to speed up tagging.


Speeding up Tagging

Probably the best middle ground in terms of speed and accuracy is to rely solely on the Uni-grams that have their probability pre-computed from the Tri-grams, and it’s a good thing we did exactly that in the Unigrams post. 😉

This results in a fast and simple query that amounts to looking up the exact word we want and getting its pre-computed tags sorted by probability.
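That query looks something like this (a sketch: the Unigrams table and column names here are illustrative, so check them against the schema in the repo):

SELECT * FROM `Unigrams` WHERE `Word` = 'jumps' ORDER BY `Probability` DESC;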

Here are the results:

Using Run-time “Back off” Tagging (Test.php) – 7 Minutes 58.6194 Seconds
Tagged Sentence:

The/at quick/jj brown/jj fox/np jumps/nns over/in the/at lazy/jj dog/nn ./. A/at long-term/nn contract/vb with/in ``/`` zero-liability/unk ''/'' protection/nn-hl !/. Let’s/vb+ppo think/vb it/ppo over/in ./.

Using Uni-grams with Pre-computed Probability (FastTest.php) – 3.7372 Seconds
Tagged Sentence:

The/at quick/jj brown/jj fox/nn jumps/vbz over/in the/at lazy/jj dog/nn ./. A/at long-term/nn contract/nn with/in ``/`` zero-liability/unk ''/'' protection/nn !/. Let’s/vb+ppo think/vb it/pps over/in ./.

Both tests ran on a “cold” database (no pre-cached data), i.e. a worst-case scenario.

As you can see, the Uni-gram lookup was ~128x faster (478.62 seconds vs. 3.74 seconds) and yet the output is functionally identical. The Uni-gram method even correctly tagged jumps as vbz, but that’s just the luck of the 50/50 lookup order. 😉

Also, I do think the “it” in “think it over” is probably more accurately tagged as ppo rather than pps… but again, what do I know?

Once the database is fully warmed up (a healthy cache in memory) both queries perform faster but the Uni-gram ALWAYS wins.

So while we lost little to no accuracy with the pre-computed Uni-gram probabilities, we gained a truly MASSIVE boost to our tagging throughput, to the point that it could now hypothetically be used to do web-based Parts of Speech tagging, but…

Realistically, it needs more data and a few more rules to make it deployable in that fashion, though it makes an excellent prototype! 🙂


Tag Definitions and Examples

As a little bonus, I added descriptions to the Tags table, as well as an Examples field that provides example words for each tag.

Here are a few:

nn – singular or mass noun [city, investment, liquid, percentage, question, ship, showroom, spite, teaching, work]
nns – plural noun [baths, blasts, cracks, districts, fathers, girls, logs, poems, seeds, sights]
vb – verb, base form [accommodate, assist, attach, expunge, get, know, locate, outgrow, possess, provide]
vbz – verb, 3rd. singular present [codetermines, gives, keeps, lingers, looks, originates, seems, swears, tends, tries]
in – preposition [at, between, in, of]
jj – adjective [congenial, glorious, latin, major, old, pale, scalar, special, strategic, wooden]
jjr – comparative adjective [better, greater, heavier, higher, lesser, stronger, uglier, worse, younger]

You can find these, along with all 472 Parts of Speech tags plus their descriptions and examples, in the Tags table.

If you would like to download a copy of this project, you can find it on my GitHub: Part_Of_Speech_Tagger

And if you followed along from the beginning, go ahead and pat yourself on the back for being an awesome student!

You can find the other posts in this series on my blog.

If you do make or do something cool with this, I’d love to hear about it!

Going forward, we can use this bot as the basis of other projects (which we will), but I think in my next post we’ll build a Neural Network, so hit that follow button to make sure you get all my new posts!

Also, don’t forget to share this post with someone you think would find it interesting, and please leave a like and share your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your direct monetary support finances this work and allows me to dedicate the time & effort required to develop all the projects & posts I create and publish.

Your support goes toward helping me obtain access to better tools and equipment so I can improve the quality of my content. It also helps me eat, pay rent and, of course, we can’t forget to buy diapers for Xavier now, can we? 😛

My little Xavier Logich


If you feel inclined to give me money and add your name to my Sponsors page, then visit my Patreon page and pledge $1 or more a month, and you will be helping me grow.

Thank you!

And as always, feel free to suggest a project you would like to see built or a topic you would like to hear me discuss in the comments, and if it sounds interesting it might just get featured here on my blog for everyone to enjoy.



Much Love,

~Joy