Spidey: Python Web Crawler

I built a web crawler using Python and a few of its modules. It follows some rules: it reads robots.txt before crawling a page, and if robots.txt allows the page to be crawled, Spidey crawls it. It then follows links recursively. I have set certain limits, though. It does not go beyond 20 pages, as it is just a prototype. It also cannot detect crawler traps, where it would keep following links indefinitely.
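To make the rules above concrete, here is a minimal sketch of the idea, not Spidey's actual code. It assumes Python 3, where robotparser lives at urllib.robotparser. The names crawl, fetch and robots_lines are hypothetical: fetch is any function that returns the HTML for a URL, and robots_lines is the robots.txt content split into lines. It uses a queue instead of recursion, and staying on the start page's host is my assumption, since one robots.txt only covers one host.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

MAX_PAGES = 20  # the prototype's page limit

# Naive href matcher; good enough for a sketch, not for messy real-world HTML.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def crawl(start_url, fetch, robots_lines, max_pages=MAX_PAGES, user_agent="*"):
    """Return the URLs crawled, in order, starting from start_url.

    fetch: hypothetical callable mapping a URL to its HTML text.
    robots_lines: the site's robots.txt content as a list of lines.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)

    host = urlparse(start_url).netloc
    seen = set()
    queue = deque([start_url])
    crawled = []

    # The page cap is the only thing that stops a trap that keeps
    # generating new URLs; the seen set only prevents exact repeats.
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)

        # Skip anything robots.txt disallows for this user agent.
        if not robots.can_fetch(user_agent, url):
            continue

        html = fetch(url)
        crawled.append(url)

        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            # Assumption: stay on the starting host.
            if urlparse(link).netloc == host:
                queue.append(link)

    return crawled
```

With a real site, fetch could wrap urllib.request.urlopen, and robots_lines could come from downloading /robots.txt first. Passing them in keeps the sketch easy to try on canned pages.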

Spidey is a very basic crawler, but it works fine, at least on the websites I have tested. I tried it on http://python.org/ and http://stackoverflow.com/ and it did pretty well.

The modules I used are urllib2, re, BeautifulSoup, and robotparser.
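For anyone trying this on Python 3, the module names are different: urllib2 became urllib.request, robotparser became urllib.robotparser, and BeautifulSoup is now installed as the bs4 package. The sketch below shows how fetching and link extraction might look with those names. It is an illustration, not Spidey's code: fetch and extract_links are hypothetical helpers, and the regex fallback for when bs4 is not installed is my own addition.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None  # fall back to a regex below

# Fallback pattern: the href of an <a> tag. Cruder than a real parser.
A_HREF_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url, timeout=10):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> in html."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    else:
        hrefs = A_HREF_RE.findall(html)
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, href) for href in hrefs]
```

A page's links would then come from extract_links(fetch(url), url).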

Here is the code if you want to test/use it.
