Google… your Father and I are very disappointed in you!
If you had simply taken my advice we could have avoided this whole embarrassing situation…
Damn it, you should have listened to me!
Google Google’s Improprieties
Okay look, for a long time now Google has shown a clear and uniquely singular ability to slap the name Google on things!
Sometimes those things even turn into instant successes… and then there’s everything else Google does, like Google+ and Google Glass!
Which is why we’re all so surprised to hear that Google bought a tainted data-set, misspelled infinity minus a plex or two and… is now allegedly accused of misappropriating song lyrics… allegedly!
That last part is no joke, just google “Google’s improprieties”… no… wait, that didn’t bring up anything relevant… how odd?
The Cliffhanger Notes
Okay, figured it out!
If you google “Google music lyrics” you will allegedly find articles related to what I’m referring to.
Basically, Google has been supplying lyrics directly in the results of song lookups… and it’s been allegedly alleged that Google acknowledgedly acknowledges the data came from a third party but that they were unaware that the data was illegally scraped.
Let’s get something straight: it’s generally not illegal to scrape publicly accessible data, though the purveyor of said data may not want you using a bot on their site and might block your IP as a result of your misbehaving.
But even if you can crawl a site and even if data is freely and publicly available for all to enjoy… Fair Use aside (whatever that means 😛 ), copyright law still applies!
As I’ve said before:
“Don’t steal intellectual property!”
Yep, that’s an actual quote from me!
Oh, where can I be quoted as having said that, you ask?
A little post I wrote called How to Build a Spider Bot.
I’d like to quote another section from that same article here for your enjoyment and edification:
Beware Dirty Data and Canary Traps
When you scrape data from the wild (the internet) you will encounter “Dirty Data”, which simply means it’s incomplete, misspelled or in some way differs from what it should actually be.
Further, I mentioned above that some sites will go to great lengths to confound your efforts to crawl their content. Closely related to Dirty Data, but intentional, is the concept of a “Canary Trap”, which means the data purveyor deliberately ‘tainted’ their information in an effort to either confirm you obtained your information from them or to simply make it more difficult to use the data without “Cleansing” it first.
The simplest solution to both problems is to obtain the same content from as many different sources as possible (if possible) and compare the differences.
Any significant variance definitely indicates the presence of Dirty Data and may indicate the presence of a Canary Trap embedded in one or more of the data sources.
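To make that comparison concrete, here is a minimal sketch in Python. It is not code from the original article: the function name find_variances, the source names site_a, site_b and site_c, and the sample text are all made up for illustration. It assumes every source supplies the same content as plain text and compares the copies line by line, treating the most common version of each line as the likely "clean" one and naming the sources that disagree.

```python
from collections import Counter
from itertools import zip_longest


def find_variances(sources):
    """Return the lines on which the sources disagree.

    sources: dict mapping a (hypothetical) source name to its text.
    Each result is a tuple of (line index, majority version, dissenting sources).
    """
    names = list(sources)
    split = [sources[name].splitlines() for name in names]
    variances = []
    # zip_longest pads shorter copies with "" so missing lines show up as variances
    for index, variants in enumerate(zip_longest(*split, fillvalue="")):
        cleaned = [v.strip() for v in variants]  # ignore leading/trailing whitespace only
        counts = Counter(cleaned)
        if len(counts) == 1:
            continue  # every source agrees on this line
        # On a tie, the version seen first wins, so treat the "majority" as a hint
        majority, _ = counts.most_common(1)[0]
        dissenters = [n for n, v in zip(names, cleaned) if v != majority]
        variances.append((index, majority, dissenters))
    return variances


# Made-up sample data: site_c has one altered character on the second line
sources = {
    "site_a": "the quick brown fox\njumps over the lazy dog",
    "site_b": "the quick brown fox\njumps over the lazy dog",
    "site_c": "the quick brown fox\njumps over the lazy d0g",
}

for index, majority, dissenters in find_variances(sources):
    print(f"line {index}: majority reads {majority!r}, differs in {dissenters}")
```

A flagged line only tells you the sources disagree. Whether it's an honest typo or a deliberate trap is still a judgment call, and if every one of your sources copied from the same tainted original, a comparison like this won't catch anything.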
Again let me reiterate, DON’T STEAL!
But… if you won’t follow that sound advice at least do yourself a favor and cleanse your god damn data!
With that… Google, I hate to say it but… I told you so!
One of the many benefits of supporting my work is that I tend to release it under the MIT License, which basically says that as long as you don’t take my name off it, you can use my code in your commercial projects for free!
What more reason do you need to say thank you to me over on Patreon for as little as $1 a month than the fact that I don’t embed canary traps in my work!?
But, if all you can do is like, share, comment and subscribe… well, that’s cool too! 😉