Wednesday, 9 July 2014

A web crawler to detect 404s, using Scrapy in Python

Well, I was recently presented with the problem of mitigating 404 pages across a web domain.

What better way than to use a web crawler to crawl all the pages of the domain and observe the responses!

We shall summon Scrapy spiders to do our bidding.

You can install it with:

$ sudo apt-get install python-pip python-dev libffi-dev libxslt1-dev libxslt1.1 libxml2-dev libxml2 libssl-dev

$ sudo pip install Scrapy

$ sudo pip install service_identity

Then start the project with:
$ scrapy startproject Project_name

This will create the directory structure:
Project_name/
    scrapy.cfg
    Project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
           

Under spiders/, create a Python script with any name and compose the soul of the spider.

For the 404 check you can use this:





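import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Note: older Scrapy releases exposed these as scrapy.contrib.spiders and
# scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor.


class PageItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    response = scrapy.Field()
    refer = scrapy.Field()


class MySpider(CrawlSpider):
    name = "AcrazySpiderofDoom"
    allowed_domains = ["www.domain.com"]
    start_urls = ["http://www.domain.com/"]

    # By default Scrapy drops non-2xx responses before they reach the
    # callback, so 404s must be allowed through explicitly.
    handle_httpstatus_list = [404]

    # Follow every link on the domain and hand each response to parse_items.
    rules = (
        Rule(LinkExtractor(allow=(), unique=True),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # The Referer header can be missing, so avoid a KeyError here.
        referer = response.request.headers.get("Referer")
        item = PageItem()
        # An error page may lack a <title>, so fall back to an empty string.
        item["title"] = response.xpath("//title/text()").get(default="")
        item["link"] = response.url
        item["response"] = response.status
        item["refer"] = referer.decode() if referer else ""
        return item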



That's it. Give it life with:

$ scrapy crawl AcrazySpiderofDoom


You can even make it write the data to a CSV file with:
$ scrapy crawl AcrazySpiderofDoom -o items.csv
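
Once the CSV exists, the 404 rows can be pulled out with a few lines of Python. This is only a minimal sketch: it assumes the export has the columns link, refer and response from the item defined earlier, the helper name find_broken_links is made up for illustration, and the sample rows are hypothetical stand-ins for a real items.csv.

```python
import csv
import io


def find_broken_links(csv_file):
    """Return (link, refer) pairs for every row whose response column is 404."""
    reader = csv.DictReader(csv_file)
    return [
        (row["link"], row["refer"])
        for row in reader
        if row["response"].strip() == "404"
    ]


# Hypothetical sample standing in for items.csv
sample = io.StringIO(
    "link,refer,response,title\n"
    "http://www.domain.com/,,200,Home\n"
    "http://www.domain.com/old-page,http://www.domain.com/,404,Not Found\n"
)
broken = find_broken_links(sample)
for link, refer in broken:
    print(f"{link} (linked from {refer})")
```

To run it against the real export, open the file with open("items.csv", newline="") and pass the file object to find_broken_links. The refer column is the useful part, since it tells you which page holds the dead link.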









Example screen grab of running the crawler for partypoker.com




Once completed (I killed it here), it will also provide valuable statistics.
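
Among those statistics, Scrapy keeps one counter per HTTP status under keys such as downloader/response_status_count/404. A rough sketch of turning such a stats dictionary into a status summary follows; the function name status_counts is an assumption of mine, and the numbers are made up for illustration, not taken from the crawl shown here.

```python
STATUS_PREFIX = "downloader/response_status_count/"


def status_counts(stats):
    """Map HTTP status code -> number of responses, from a Scrapy stats dict."""
    return {
        int(key[len(STATUS_PREFIX):]): value
        for key, value in stats.items()
        if key.startswith(STATUS_PREFIX)
    }


# Made-up numbers for illustration only, not from a real crawl
sample_stats = {
    "downloader/request_count": 120,
    "downloader/response_count": 120,
    "downloader/response_status_count/200": 112,
    "downloader/response_status_count/301": 5,
    "downloader/response_status_count/404": 3,
}
counts = status_counts(sample_stats)
print(counts.get(404, 0), "pages returned 404")
```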







