Saturday, February 18, 2012

Scraping Data: Getting only what you need

    Several times I have visited the Wikipedia home page just to look at today's history. But there is so much information there that you simply get distracted, don't you? So I thought, how about getting just what I need? I had once stumbled upon a CS50 video on scraping data from the internet, so I had a basic idea of what to do.
    Still, I googled a bit, because I wanted something more recent that would make my work very easy. Thankfully I stumbled upon the simplehtmldom parser. It made scraping data so much easier. You don't even need to worry about completeness of the code! I just had to go through the HTML of the page and identify where the information was placed. Well, here is the code that I used to get my information.

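include 'simple_html_dom.php'; // the simplehtmldom library file

$html = file_get_html(""); // pass the URL of the page to scrape
echo $html->find('p', 4)->innertext; // Event happening today, e.g. Independence Day
$ret['All'] = $html->find('ul', 5)->innertext; // Events that happened in the past
echo $ret['All'];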

    I have used $ret['All'] just to show you that the output can be stored in an array.
    There is one fault in this code. If there is no event happening today, then the 'p' tag at index 4 (the fifth one, since the index starts at 0) would contain data from some other section of the page. So to scrape data in this manner, you have to hope that the page's HTML does not change. I have used only some of the features of the simplehtmldom parser; its manual will show you more applications of the library. Happy Hacking!
    If I get a chance I will make a Twitter bot that gets this info and tweets it every day. I will need to set up some cron jobs for that and learn several other things, but that will be done later.


Tuesday, February 7, 2012

TT (newer version of timetable)

  Last year I created a shell script that showed me my academic timetable for the day. I was a newbie at shell scripting at the time, so the script was not very good. So here is tt, or should I say timetable 2.0?
  This time I have stored my timetable in a timetable.txt file, so that whenever I need to modify the timetable I don't need to modify the script. The code is also much shorter this time, with far fewer if statements. Here is the new code:

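#!/bin/bash
days=( Monday Tuesday Wednesday Thursday Friday )

# With no argument, default to today's weekday number (1 = Monday)
if [ "$#" -eq 0 ]; then
  echo -e "\nToday's Timetable:"
  set -- "$(date +%u)"
fi

# $1 is the day number, either passed in or set above
echo -e "\n${days[$1-1]}'s timetable:"
sed -n "${1}p" timetable.txt
echo -e "\n\n"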
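
  The script relies on timetable.txt having one line per day, so that line N belongs to day N. Here is a minimal sketch of that layout, assuming Monday is on the first line. The subjects and times are invented, and show_day is a hypothetical helper, not part of tt. It writes the sample to a temporary file so it can be run as is.

```shell
#!/bin/bash
# Hypothetical sample data: one line per weekday, Monday first.
# The subjects and times are made up for illustration.
timetable_file=$(mktemp)
trap 'rm -f "$timetable_file"' EXIT   # clean up the temporary file on exit

cat > "$timetable_file" <<'EOF'
9:00 Maths, 11:00 Physics
10:00 Chemistry, 14:00 Lab
9:00 Programming
11:00 Maths, 15:00 Electronics
10:00 Physics, 13:00 Seminar
EOF

# show_day N prints line N of the file, the same lookup tt does with sed.
show_day() {
  sed -n "${1}p" "$timetable_file"
}

show_day 2   # Tuesday's line
```

  With a five-line file like this, day numbers 6 and 7 (the weekend, as date +%u counts them) simply print nothing.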

  Still to be added: a feature to change the timetable. But that's all for now. The next modification will come when I am in the mood for it :)