Nimit's ramblings: Scraping Data: Getting only what you need

Saturday, February 18, 2012

Several times I have visited the Wikipedia home page just to look at today's history. But there is so much information there that you simply get distracted, don't you? So I thought, how about getting just what I needed? I had once stumbled upon a CS50 video on scraping data from the internet, so I had a basic idea of what to do.
But I googled a bit, because I wanted something more recent that would make my work easy. Thankfully I stumbled upon the simplehtmldom parser. It made scraping data so much easier. You don't even need to worry about whether the page's HTML is perfectly valid! I just had to go through the HTML of the page and identify where the information was placed. Well, here is the code that I used to get my information.

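<?php
include_once('simple_html_dom.php');

// Fetch and parse the Wikipedia main page; file_get_html() returns false on failure.
$html = file_get_html("http://en.wikipedia.org/wiki/Main_Page");
if (!$html) {
    die("Could not load the page");
}
// find() takes a zero-based index and returns null when no such element exists.
$today = $html->find('p', 4);  // Event happening today, e.g. Independence Day
$past = $html->find('ul', 5);  // Events that happened in the past
echo $today ? $today->innertext : '';
$ret['All'] = $past ? $past->innertext : '';
echo $ret['All'];
// Free the memory held by the parsed DOM.
$html->clear();
unset($html);
?>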

I have used $ret['All'] just to show you that the output can be stored in an array.
There is one fault in this code. find('p', 4) picks the fifth 'p' tag on the page (the index is zero-based), so if there is no event happening today, that tag would contain data from some other section of the page. In other words, scraping by position like this only works as long as the page's HTML does not change. I have used only some of the features of the simplehtmldom parser; its manual will show you more applications of the library. Happy Hacking!
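One way to make this less fragile is to look inside a named container instead of counting tags across the whole page. Here is a rough sketch of that idea. To keep it runnable without any download, it uses PHP's built-in DOMDocument and DOMXPath rather than simplehtmldom, and it parses a small made-up HTML string instead of the real page. The function name scrape_section, the id "on-this-day" and the sample events are all assumptions for illustration; you would need to check the real page's HTML for the actual id.

```php
<?php
// Made-up stand-in for the downloaded page. The id "on-this-day" is an
// assumption for this sketch, not something taken from the real page.
$page = <<<HTML
<html><body>
<p>Some unrelated paragraph</p>
<div id="on-this-day">
  <p>Today: sample holiday</p>
  <ul>
    <li>1900: first sample event</li>
    <li>1950: second sample event</li>
  </ul>
</div>
</body></html>
HTML;

// Hypothetical helper: collect the first <p> and all <li> items found
// inside the element with the given id. $id is assumed to be a trusted,
// simple string, because it is pasted directly into the XPath query.
function scrape_section($page, $id) {
    // Real-world HTML is rarely perfect, so collect parser warnings
    // internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($page);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = array('today' => null, 'past' => array());

    // item(0) returns null when the container has no <p>, so a missing
    // "today" entry stays null instead of picking up some other section.
    $p = $xpath->query("//*[@id='$id']//p")->item(0);
    if ($p !== null) {
        $result['today'] = trim($p->textContent);
    }
    foreach ($xpath->query("//*[@id='$id']//ul/li") as $li) {
        $result['past'][] = trim($li->textContent);
    }
    return $result;
}

print_r(scrape_section($page, 'on-this-day'));
```

With simplehtmldom itself, the same idea would be a selector that starts from the container's id instead of a bare tag name with a page-wide index. This still breaks if the id is renamed, but it no longer depends on how many paragraphs and lists come before the section.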
If I get a chance I will make a Twitter bot that gets this info and tweets it every day. I will need to set up some cron jobs for that and learn several other things, but that will be done later.