A web crawler to detect 404s, using Scrapy in Python

Well, I was recently presented with the problem of mitigating the ironic 404 pages for a web domain.

What better way than to use a web crawler to crawl all the domain's pages and observe the responses!

We shall summon Scrapy spiders to do our bidding.

You can install it with:

$ sudo apt-get install python-pip python-dev libffi-dev libxslt1-dev libxslt1.1 libxml2-dev libxml2 libssl-dev

$ sudo pip install Scrapy

$ sudo pip install service_identity

Then start the project with:
$ scrapy startproject Project_name

This will create the project's directory structure, including a spiders directory.

Under spiders, create a Python script with any name, and compose the soul of the spider.

For detecting 404s you can use this:

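import scrapy
# Note: scrapy.contrib and SgmlLinkExtractor were removed from newer Scrapy
# releases; these are the current import paths.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PageItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    response = scrapy.Field()
    refer = scrapy.Field()


class MySpider(CrawlSpider):
    name = "AcrazySpiderofDoom"
    allowed_domains = [""]  # fill in the domain to crawl
    start_urls = [""]  # fill in the URL to start from
    # Without this, Scrapy drops 404 responses before they reach the callback.
    handle_httpstatus_list = [404]

    # Follow every unique link and pass each response to parse_items.
    rules = (
        Rule(LinkExtractor(allow=(), unique=True),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = PageItem()
        # The title may be missing (e.g. on bare error pages), so avoid [0].
        item["title"] = response.xpath("//title/text()").get()
        item["link"] = response.url
        item["response"] = response.status
        # The Referer header may be absent; header values are bytes.
        referer = response.request.headers.get("Referer")
        item["refer"] = referer.decode("utf-8") if referer else None
        return item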

That's it. Give it life with:

$ scrapy crawl AcrazySpiderofDoom

You can even have it write the data to a CSV file with:
$ scrapy crawl AcrazySpiderofDoom -o items.csv
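
Once you have the CSV, you still need to pick the 404s out of it. Here is a minimal sketch of how that could look, assuming the CSV has the link, response and refer columns from the item above. The function name find_broken_links is just something I made up for illustration, it is not part of Scrapy.

```python
import csv


def find_broken_links(rows, status="404"):
    """Return (link, referrer) pairs for rows whose response column equals status.

    rows is any iterable of CSV lines (an open file works), with a header row
    that includes the link, response and refer columns.
    """
    reader = csv.DictReader(rows)
    return [
        (row["link"], row["refer"])
        for row in reader
        if row["response"] == status
    ]
```

You would call it with an open file, for example find_broken_links(open("items.csv", newline="")), and get back each broken link together with the page that pointed to it, which is usually the page you actually need to fix.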

Example screen grab of running the crawler for

Once completed (I killed it here), it will also provide valuable statistics.
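
If you would rather pull the per-status counts out of those statistics in code, something like the sketch below could work. It assumes you already have the stats as a plain dict (for example from the crawler's stats collector) and that the status counters use the downloader/response_status_count/<code> key pattern that Scrapy's downloader stats report. The helper name status_counts is hypothetical.

```python
STATUS_PREFIX = "downloader/response_status_count/"


def status_counts(stats):
    """Map HTTP status code -> number of responses, from a Scrapy stats dict."""
    return {
        int(key[len(STATUS_PREFIX):]): count
        for key, count in stats.items()
        # Only keep the per-status counters; ignore every other stat.
        if isinstance(key, str) and key.startswith(STATUS_PREFIX)
    }
```

The entry for 404 in the returned dict, if there is one, is the number of dead pages the spider ran into.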