Welcome back! Today we’re going to peek inside the database for the Parts of Speech tagger.

Unfortunately, the Raspberry Pi I’m using for training is slow (cough… and my code is super un-optimized 😛 ), so it’s still working on learning the complete Brown Corpus. We’re almost there though, with fewer than 190 training files remaining!

Before proceeding, here’s my disclaimer on the GitHub repo. It basically says that I don’t own the Brown Corpus and I am not selling it to you!

The Database

Here’s a recap of the database. It consists of three tables: Words, Tags and Trigrams. You can find the complete MySQL database setup script here: Create.PartsOfSpeech.DB.sql.

CREATE TABLE `Tags` (
  `ID` int(11) NOT NULL,
  `Tag` varchar(8) NOT NULL,
  `Definition` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Trigrams` (
  `ID` int(11) NOT NULL,
  `Count` int(11) NOT NULL,
  `Word_A` int(11) NOT NULL,
  `Word_B` int(11) NOT NULL,
  `Word_C` int(11) NOT NULL,
  `Tag_A` int(11) NOT NULL,
  `Tag_B` int(11) NOT NULL,
  `Tag_C` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE `Words` (
  `ID` int(11) NOT NULL,
  `Word` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


 

Words Table

The Words table keeps track of all the words the tagger knows.

The bot uses the IDs in place of the words, so given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to look up each word (provided it knows it) and replace it with the ID of that word in the Words table, like this:

1 43524 70488 515610 1149954 7158 1 266303 56280 309 43578 53868 1212 zero-liability 238658 482081 32358 423

Notice that the bot was unable to look up the ID for the word “zero-liability”. This is because it never saw that word during training. The system would need to “learn” it by assigning it a new ID and adding it to the database.
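If it helps to see that lookup as code, here’s a minimal Python sketch of the idea. It is not the tagger’s actual code: the `known_words` dictionary is a hypothetical in-memory stand-in for the Words table (filled with IDs from the example above), and the `encode` and `learn` helpers are names made up for illustration. Handing out the highest ID plus one is only an assumption about how new words might get numbered.

```python
# Hypothetical in-memory stand-in for the Words table (Word -> ID).
# The IDs are the ones quoted in the example above.
known_words = {
    "the": 1, "quick": 43524, "brown": 70488, "fox": 515610,
    "jumps": 1149954, "over": 7158, "lazy": 266303, "dog": 56280,
}


def encode(sentence, vocab):
    """Replace each known word with its ID; leave unknown words untouched."""
    return [vocab.get(word, word) for word in sentence.split()]


def learn(word, vocab):
    """Give an unseen word the next free ID (assumed: highest ID + 1)."""
    if word not in vocab:
        vocab[word] = max(vocab.values(), default=0) + 1
    return vocab[word]


encoded = encode("the quick brown fox jumps over the lazy dog zero-liability", known_words)
print(encoded)
```

The unknown word comes back unchanged, which is the cue to call `learn` on it before encoding the sentence again.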

Here’s an infographic that might help you understand the Words table:

Words Table - An infographic reviewing the Words table.

 

Tags Table

The Tags table keeps track of all the tags the tagger knows.

The bot uses the IDs in place of the tags, so given these words:

fox, jump, jumps, jumped

We would assign these tags:

fox/nn = singular or mass noun

jump/vb = verb, base form

jumps/vbz = verb, 3rd. singular present

jumped/vbd = verb, past tense

And the IDs for the tags would be represented as such:

21, 246, 138, 12
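Here’s a small Python sketch of that step, assuming the training tokens arrive in the `word/tag` form shown above. The `tag_ids` dictionary is a hypothetical stand-in for the Tags table, and it assumes the four IDs pair up with the four tags in the order they are listed. The `split_token` helper is a made-up name for illustration.

```python
# Hypothetical stand-in for the Tags table (Tag -> ID). The IDs are assumed
# to pair with the tags in the order they are listed above.
tag_ids = {"nn": 21, "vb": 246, "vbz": 138, "vbd": 12}


def split_token(token):
    """Split 'word/tag' on the last slash so a word containing '/' survives."""
    word, _, tag = token.rpartition("/")
    return word, tag


tokens = ["fox/nn", "jump/vb", "jumps/vbz", "jumped/vbd"]
pairs = [split_token(t) for t in tokens]
encoded_tags = [tag_ids[tag] for _, tag in pairs]
print(encoded_tags)  # [21, 246, 138, 12]
```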

Here’s an infographic that might help you understand the Tags table:

Tags Table - An infographic reviewing the Tags table.

 

Trigrams Table

The Trigrams table is the heart of the system, and its job is to keep track of the associations between word trigrams (groups of three words) and tag trigrams (groups of three tags).

The Brown Corpus training data is split up into trigrams of words and tags, so that when the bot learns, it isn’t just learning individual words and tags but chains of words and tags.

This helps the bot learn that some words can have more than one meaning or role in a sentence. It also keeps a count of each time it sees a trigram so it can calculate the probability of each trigram and tag set.
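To make the counting idea concrete, here’s a rough Python sketch. It assumes the probability is a simple relative frequency: how often a tag trigram was seen with a word trigram, divided by all sightings of that word trigram. That formula is one plausible reading rather than a description of the tagger’s real math, and the observations below are made up purely for illustration.

```python
from collections import Counter

# Each observation pairs a word trigram with the tag trigram seen alongside it.
# These sightings are invented for illustration, not taken from the corpus.
observations = [
    (("the", "lazy", "dog"), ("at", "jj", "nn")),
    (("the", "lazy", "dog"), ("at", "jj", "nn")),
    (("the", "lazy", "dog"), ("at", "nn", "nn")),
]

# Plays the role of the Count column: one tally per (words, tags) pair.
counts = Counter(observations)


def tag_probability(words, tags, counts):
    """Relative frequency of a tag trigram among all sightings of a word trigram."""
    total = sum(n for (w, _), n in counts.items() if w == words)
    return counts[(words, tags)] / total if total else 0.0


p = tag_probability(("the", "lazy", "dog"), ("at", "jj", "nn"), counts)
print(round(p, 3))
```

A word trigram that has never been seen gets a probability of 0.0 here instead of raising an error.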

Given this sentence:

the quick brown fox jumps over the lazy dog a long-term contract with zero-liability protection lets think it over

We would expect the system to be able to extract the following trigrams represented here as JSON:

[
	["The","quick","brown"],
	["quick","brown","fox"],
	["brown","fox","jumps"],
	["fox","jumps","over"],
	["jumps","over","the"],
	["over","the","lazy"],
	["the","lazy","dog"],
	["lazy","dog","A"],
	["dog","A","long-term"],
	["A","long-term","contract"], 
	["long-term","contract","with"],
	["contract","with","zero-liability"],
	["with","zero-liability","protection"],
	["zero-liability","protection","Let's"],
	["protection","Let's","think"],
	["Let's","think","it"],
	["think","it","over"]
]
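The extraction above is just a sliding window of three words that moves along one word at a time. Here’s a minimal Python sketch of that window, using only the first part of the sentence to keep it short. The `trigrams` function is a made-up helper name, not something from the tagger’s code.

```python
def trigrams(tokens):
    """Return every run of three consecutive tokens, in order."""
    return [tokens[i:i + 3] for i in range(len(tokens) - 2)]


words = "The quick brown fox jumps over the lazy dog".split()
for gram in trigrams(words):
    print(gram)
```

A list of n tokens produces n - 2 trigrams, which is why the 19 words in the full sentence give the 17 trigrams listed above. Anything shorter than three tokens produces none.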

 

And of course, since we’re actually using word IDs and not the words themselves, we could change the words to their IDs in the JSON (here “zero-liability” shows up as 1161931, the ID it has once it has been learned):

[
	["1","43524","70488"],
	["43524","70488","515610"],
	["70488","515610","1149954"],
	["515610","1149954","7158"],
	["1149954","7158","1"],
	["7158","1","266303"],
	["1","266303","56280"],
	["266303","56280","309"],
	["56280","309","43578"],
	["309","43578","53868"],
	["43578","53868","1212"],
	["53868","1212","1161931"],
	["1212","1161931","238658"],
	["1161931","238658","482081"],
	["238658","482081","32358"],
	["482081","32358","423"],
	["32358","423","7158"]
]

The same can be done with the Tags.

 

Here’s an infographic that might help you understand the Trigrams table:

Trigrams Table - An infographic reviewing the Trigrams table.

 

This is as far as we’ll get this week, so remember to like and follow!

Also, don’t forget to share this post with someone you think would find it interesting and leave your thoughts in the comments.

And before you go, consider helping me grow…


Help Me Grow

Your financial support allows me to dedicate the time & effort required to develop all the projects & posts I create and publish here on my blog.

It goes toward helping me eat and pay rent, and of course we can’t forget to buy diapers for Xavier, now can we? 😛

My little Xavier Logich

 

If you would also like to financially support my work and add your name on my Sponsors page then visit my Patreon page and pledge $1 or more a month.

As always, feel free to use the comments to suggest a project you would like to see built or a topic you would like to hear me discuss. If it sounds interesting, it might just get featured here on my blog for everyone to enjoy.

 

 

Much Love,

~Joy
