Well, I was recently presented with the problem of mitigating the ironic 404 pages on a web domain.
What better way than to use a web crawler to walk every page of the domain and record the response it gets!
We shall summon Scrapy spiders to do our bidding.
You can install it with:
$ sudo apt-get install python-pip python-dev libffi-dev libxslt1-dev libxslt1.1 libxml2-dev libxml2 libssl-dev
$ sudo pip install Scrapy
$ sudo pip install service_identity
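Once those finish, a quick way to confirm everything landed is to print the installed version:
$ scrapy version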
Then start the project with:
$ scrapy startproject Project_name
This will create the following directory structure:
Project_name/
    scrapy.cfg
    Project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Under spiders/, create a Python script with any name and compose the soul of the spider.
For catching 404s you can use something like this:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Field
# (on newer Scrapy releases these live in scrapy.spiders and scrapy.linkextractors)

class Item(scrapy.Item):
    # one record per crawled URL
    title = Field()     # page <title>
    link = Field()      # URL that was fetched
    response = Field()  # HTTP status code
    refer = Field()     # page that linked here

class MySpider(CrawlSpider):
    name = "AcrazySpiderofDoom"
    allowed_domains = ["www.domain.com"]
    start_urls = ["http://www.domain.com/"]

    # Scrapy normally filters out error responses before they reach the
    # callback; let 404s through so we can record them
    handle_httpstatus_list = [404]

    # follow every link on the domain and hand each response to parse_items
    rules = (
        Rule(SgmlLinkExtractor(allow=(), unique=True),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = Item()
        title = response.xpath('//title/text()').extract()
        item["title"] = title[0] if title else ""
        item["link"] = response.url
        item["response"] = response.status
        # the first request has no Referer header, so use get() rather than []
        item["refer"] = response.request.headers.get('Referer')
        return item
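A note on the handle_httpstatus_list line above: Scrapy's HttpError middleware normally filters error responses out before they ever reach your callback, which would defeat the whole exercise. If you would rather allow 404s project-wide than per spider, the HTTPERROR_ALLOWED_CODES setting in settings.py does the same job:
HTTPERROR_ALLOWED_CODES = [404]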
That's it. Give it life with:
$ scrapy crawl AcrazySpiderofDoom
You can even have it write the data to a CSV with:
$ scrapy crawl AcrazySpiderofDoom -o items.csv
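If you only care about the broken links in that file, a small item pipeline can drop everything that resolved fine before it gets exported. Here is a minimal sketch (the class name is my own) that would go in the project's pipelines.py:
from scrapy.exceptions import DropItem

class Keep404Pipeline(object):
    # let broken pages through, silently discard anything that answered OK
    def process_item(self, item, spider):
        if item["response"] == 404:
            return item
        raise DropItem("not a 404: %s" % item["link"])
It gets switched on in settings.py with something like:
ITEM_PIPELINES = {'Project_name.pipelines.Keep404Pipeline': 100}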
Example screen grab of running the crawler for partypoker.com
Once the crawl completes (I killed it here), it will also print valuable statistics, including a count of how many responses came back with each status code.
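To turn the CSV dump into an actionable fix list, you could group the dead links by the page that refers to them. Here is a rough post-processing sketch, assuming items.csv carries the columns produced by the item fields above (link, response, refer):
import csv
from collections import defaultdict

# map each referring page to the dead links found on it
broken = defaultdict(list)
with open("items.csv") as f:
    for row in csv.DictReader(f):
        if row["response"] == "404":   # CSV values are strings
            broken[row["refer"]].append(row["link"])

for refer, links in broken.items():
    print("%s links to %d dead page(s): %s" % (refer, len(links), ", ".join(links)))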