Geek Girl Joy

May 2018

It’s raining in Los Angeles today and I’m feeling generous! So generous in fact that I have decided to give you a free gift.

How would you like 256 pre-trained neural networks and the associated training files?

Last year while working on my Ancestor Simulations Series I wrote a 4 part mini-series on Elementary Cellular Automata and here are the links:

Elementary Cellular Automata

Elementary Cellular Automata 2

Elementary Cellular Automata 3

Elementary Cellular Automata 4

In these articles I provide a fairly thorough introduction to what Elementary Cellular Automata are and how they function.

I also demonstrate a program I wrote in PHP to algorithmically compute the Wolfram 1D CA Rules and then render them as PNG images. You can find my algorithmic Wolfram 1D CA implementation on GitHub here.
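If you just want the gist of how a rule number drives the computation, here is a minimal sketch, independent of the full implementation linked above: each three-cell neighborhood forms a 3-bit index, and that index selects one bit of the rule number to become the next cell state.

```
<?php
// Compute one generation of a Wolfram 1D elementary CA.
// $cells is an array of 0/1 values, $rule is the rule number (0 - 255).
function next_generation(array $cells, int $rule): array {
    $n = count($cells);
    $next = [];
    for ($i = 0; $i < $n; $i++) {
        // Neighborhood: left, center, right (wrapping at the edges).
        $left   = $cells[($i - 1 + $n) % $n];
        $center = $cells[$i];
        $right  = $cells[($i + 1) % $n];
        // The 3-bit neighborhood selects one bit of the rule number.
        $index = ($left << 2) | ($center << 1) | $right;
        $next[] = ($rule >> $index) & 1;
    }
    return $next;
}

// Rule 110 turns [0, 0, 0, 1, 0, 0, 0] into [0, 0, 1, 1, 0, 0, 0].
print_r(next_generation([0, 0, 0, 1, 0, 0, 0], 110));
```

Render each generation as a row of pixels and you get the familiar triangle patterns from the articles above.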

After publishing those articles I announced that I was releasing the 256 Pre-Trained Wolfram CA Neural Networks to my followers on Patreon.

Now, I have decided to make those Neural Networks available to you for free!

• 256 training sets (.data files) to review, modify and train neural networks to perform all 256 rules (0 – 255), conveniently organized into separate files, each named after the rule it trains the network on.
• 256 PRE-TRAINED neural networks (.net files) capable of performing the computation as outlined in my Elementary Cellular Automata series, except via neural networks.
• 5 separate programs (.php files):
1. generate_rule_set.php: This program will recreate the entire set of 256 .data training files.
2. train_all.php: This program will train all 256 FANN .net neural networks.
3. train_specific.php: This program lets you enter rule numbers into an array prior to running, to train only those specific rules as FANN .net neural networks.
4. ann_output_all_rule_set_images.php: This program will render all the rules as images using the neural networks.
5. ann_output_specific_rule_images.php: This program will render the specified rules as images using the specified neural networks.
• The entire set of rules (0 – 255) pre-rendered as 1025x1025px PNG images.

You will need PHP and FANN installed for this to work. I have a free public tutorial on how to set up your test environment here:

Getting Started with Neural Networks Series

You can get your copy in one of two ways:

If you like this project please “star” it on GitHub.

Since I’ve been generous to you, I’d like you to be generous with other people today. And if you’re so inclined to be generous to me, please consider supporting me on Patreon for as little as $1 a month.

This post was made possible by the generous contributions of my sponsors on Patreon.

Much Love,

~Joy

We live in the age of information, and a seemingly endless abundance of new websites launches all the time, creating entrepreneurial opportunities for innovative developers to ‘data mine’ all that aggregate info.

You might be building a specialized knowledge base for a unique Machine Learning problem, or you may be working on a new algorithm for analyzing and ranking website content. In fact, the uses for mined web content extend far beyond the scope of this article. However, if you find yourself needing to acquire a large portion of a website (or sites), or simply to review the content on them (like a search engine does), then you need a ‘crawler’ or ‘spider’.

There are plenty of spiders available already (free and paid) that you can run on your own equipment, and there are hosted services as well, but since this isn’t a review I won’t be mentioning any names.

If you offer such a product and would like to sponsor a post feel free to contact me.

Also, I think it’s fun to see how things work, and you don’t really get to do that when you can’t look at the code and change how it works.

So today we’re going to look at building a basic crawler. In fact, I would go so far as to say this is more of a ‘page scraper’ than a fully realized spider, the main difference being features and a higher level of autonomous operation.

I am going to demonstrate basic scraping/crawling, but I will leave more advanced processing and parsing for you to implement, as they are use-case specific and I want to keep this post as simple as possible.

I may do another post on how to improve this basic implementation if it seems like you like this topic, so let me know in the comments if I should do more posts on this subject.

A Word of Warning

Before we get started I want to warn you that site owners take steps to prevent data miners from scraping all their content.

Even free sites that act as “knowledge hubs” usually do not want their content copied by a robot because the person operating the bot may be one of their competitors.

I am not providing this code to you so that you can be irresponsible or malicious.

I say this from personal experience: it sucks having a website you need block your external IP Address! 😛

It sucks even worse if they get your ISP involved and it sucks the most if government entities get involved!

What you choose to do with this spider is entirely up to you, play nicely or YOU WILL get blacklisted or maybe even worse! DO NOT use this code to violate any laws and don’t steal intellectual property!

When in doubt, find out if the site has a “robots.txt” file, which will offer some insight into how the site views spiders and what it will and won’t allow. You can find it at siteroot/robots.txt

The robots.txt file for my site can be found here: https://geekgirljoy.wordpress.com/robots.txt

Having a robots.txt file is not a requirement for having a website, and just because a site doesn’t have one doesn’t mean they are OK with you crawling their entire site.

However, if you crawl links or sections they disallow, then your spider will likely be discovered and blocked.
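If you want your spider to respect those rules automatically, a naive robots.txt check might look like the sketch below. Fair warning: this simplified parser ignores User-agent groups, wildcards and Allow rules, so treat it as a starting point only; in practice you would fetch the live file with file_get_contents(siteroot . '/robots.txt').

```
<?php
// Naive robots.txt check: returns true if $path falls under any
// Disallow rule. This ignores User-agent groups, wildcards and
// Allow rules - a real parser needs to handle all of those.
function is_disallowed(string $robots_txt, string $path): bool {
    foreach (explode("\n", $robots_txt) as $line) {
        $line = trim($line);
        if (stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, strlen('Disallow:')));
            // An empty Disallow line means "allow everything".
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true;
            }
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/";
var_dump(is_disallowed($robots, '/private/page.html')); // bool(true)
var_dump(is_disallowed($robots, '/index.html'));        // bool(false)
```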

Our Strategy

Now let’s talk about the strategy we will use for the spider bot.

Target Specific Content

First, we will not be implementing any automatic URL extraction and crawling based on found hyperlinks, because it’s not a good idea to auto-crawl a URL without first knowing what’s there.

Web developers are smart, and they have developed countermeasures to unwanted spiders; see “honeypots”: false links and never-ending cascades of auto-generated URLs that do nothing but link to deeper auto-generated URLs. A dumb bot cannot understand this and will happily continue to crawl these links, sending up HUGE red flags and tripping alarms all over the place that a bot is operating at your IP Address.

If you’re smart, you might be thinking that you could set a maximum URL “depth”, which is not a bad idea. However, this isn’t a guarantee that your bot won’t be found; it only guarantees that if your bot gets stuck in a honeypot it will eventually stop crawling it, though perhaps only after it has already been discovered.

Think about it: one of the methods used to confuse your bot is to place a hidden link on a page that no human would ever see, with no visible link to click on. The only way someone could reasonably get there is by reviewing the site code or by an automated bot finding the link, so repeated attempts to access such bad links would only occur outside the normal operation of the site.

As such, in this example the URLs to be crawled will be hand-coded into an array. This ensures that the spider will only crawl pages you have approved.

You could extend this spider to collect the URLs and simply review & approve them before the bot crawls them, but we won’t cover that today.

Record Keeping

As for what we will do with the data we collect, it makes sense to simply store a local copy for later offline use rather than processing the data while the spider is crawling.

Mainly because you can always reuse your local copy without any penalty; just save all extracted data separately from the original.

This prevents the need for repeated crawling of the same content, which reduces the server resources your spider will use, enabling it to operate slightly more covertly. 🙂

As to how you store that data, it’s entirely up to you.

I have used raw text, JSON and even MySQL; today, however, we will simply clone the page and keep its extension, which is probably the most straightforward method.

Be Patient

Websites are for people, not spiders. Even if a site is totally cool with you crawling it in its entirety, you should exercise caution and not crawl too frequently or quickly, so that you do not use an inordinate amount of their server resources. If you do, YOU WILL BE BLOCKED!

As such we will implement a wait period that should help reduce our load on the web server we are crawling.

Additionally, humans don’t read all webpages in the same amount of time, so we will vary the wait period after each crawl rather than keep it constant, so that the bot leaves less of a repeating crawl pattern.

Simple Spider Code

```
<?php
// Instructions:
// Update the $url_array with the pages you want to crawl, then run from the command line.
//
// References:
// mkdir() - http://php.net/manual/en/function.mkdir.php
// count() - http://php.net/manual/en/function.count.php
// file_get_contents() - http://php.net/manual/en/function.file-get-contents.php
// fopen() - http://php.net/manual/en/function.fopen.php
// fwrite() - http://php.net/manual/en/function.fwrite.php
// fclose() - http://php.net/manual/en/function.fclose.php
// basename() - http://php.net/manual/en/function.basename.php
// mt_rand() - http://php.net/manual/en/function.mt-rand.php
// sleep() - http://php.net/manual/en/function.sleep.php

// List of URLs to Crawl
$url_array = array(
    'http://www.sitename.com/page1.html',
    'http://www.sitename.com/page2.html',
    'http://www.sitename.com/page3.html'
);

@mkdir('crawled/', 0777, true); // quietly make a subfolder called 'crawled'

// Loop Through $url_array
foreach ($url_array as $key => $page) {
    // Do Crawl
    echo 'Crawling (' . ($key + 1) . ' of ' . count($url_array) . ') ' . $page . PHP_EOL;
    $data = file_get_contents($page);

    // Save a clone of the crawled page in the crawled subfolder
    $file = fopen('crawled/' . basename($page), 'w');
    fwrite($file, $data);
    fclose($file);

    // Wait - DO NOT REMOVE
    // This keeps your spider from looking like a bot.
    // It makes it look like you are spending a few minutes reading each
    // page like a person would, and it keeps the spider from using excessive
    // resources, which will get you blacklisted.
    $sleep = mt_rand(150, 300); // Between 2.5 & 5 Minutes
    echo 'Sleeping for ' . $sleep . PHP_EOL;
    while ($sleep > 0) {
        sleep(1);
        $sleep -= 1; // take one second away
        echo $sleep . ' seconds until next crawl.' . PHP_EOL;
    }
}
echo 'Program Complete' . PHP_EOL;
```

Beware Dirty Data and Canary Traps

When you scrape data from the wild (the internet), you will encounter “Dirty Data”, which simply means the data is incomplete, misspelled or in some way inaccurate compared to what it should actually be.

Further, I mentioned above that some sites will go to great lengths to confound your efforts to crawl their content. Closely related to Dirty Data, but intentional, is the concept of a “Canary Trap”: the data purveyor deliberately ‘taints’ their information, either to confirm that you obtained your information from them or simply to make the data harder to use without “Cleansing” it first.

The simplest solution to both problems is to obtain the same content from as many different sources as possible (if possible) and compare the differences.

Any significant variance definitely indicates the presence of Dirty Data and may indicate the presence of a Canary Trap embedded in one or more of the data sources.
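As a rough sketch of that comparison, PHP’s built-in similar_text() can score how alike two copies are. In practice the two copies would come from files your spider saved (e.g. file_get_contents('crawled/page1.html')); the strings and the 95% threshold below are just illustrative placeholders.

```
<?php
// Compare two copies of the "same" content scraped from different
// sources and report how similar they are. A low similarity score
// suggests Dirty Data or a possible Canary Trap in one of the copies.
function variance_check(string $copy_a, string $copy_b): float {
    similar_text($copy_a, $copy_b, $percent);
    return $percent;
}

$copy_a = 'The quick brown fox jumps over the lazy dog.';
$copy_b = 'The quick brown fox jumped over the lazy dog.'; // one small difference

$percent = variance_check($copy_a, $copy_b);
echo 'Similarity: ' . round($percent, 2) . '%' . PHP_EOL;
if ($percent < 95) { // threshold chosen for illustration; tune it to your data
    echo 'Significant variance - review both copies by hand.' . PHP_EOL;
}
```

For large pages you might compare extracted text rather than raw HTML, since differing ads or markup between sources will inflate the variance.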

If you would like to obtain a copy of this project on GitHub you can find it here.

I hope you enjoyed this post, if so please consider supporting me on Patreon.

Much Love,

~Joy