Garbage in, Xmltv out

December 31, 2008 at 04:48 AM | categories: boleh

As a follow-up to my previous post, I am going to rant about my poor experience with screen scraping. Although the xmltv grabber, in its current incarnation, works with listings from The Star and Astro, the script was initially written to target the official websites of TV3/NTV7/8TV/TV9 (Media Prima) and RTM1/RTM2 (RTM).

To understand why the idea was ditched, here's a sample line of HTML from TV3 (reformatted for sanity):

<td>
    <a
        id="plcRoot_Layout_zoneCenter_ContentPlaceHolder_partPlaceholder_Layout_zoneScheduleContent_TV3ScheduleContent_ScheduleMain1_dlScheduleToday_ctl04_lnkShow" 
        title="Date: Aug 19, 2008&lt;br>Time: 10:00 AM - 10:02 AM"
        class="ScheduleLink"
        onmouseover="this.T_STICKY=false;this.T_WIDTH=300;this.T_FONTCOLOR='#000000';this.T_FONTFACE='Verdana';this.T_PADDING=5;this.T_BGCOLOR='#FFFFFF';this.T_TITLE='BERITA TERKINI';this.T_STATIC=true;return escape('Date: Aug 19, 2008&lt;br>Time: 10:00 AM - 10:02 AM');" 
        href="/Shows/MainNormal.aspx?MasterID=258&amp;ShowID=322&amp;MenuID=1&amp;TemplateID=3"
    >
            BERITA TERKINI
    </a>
</td>

It contains an id that is 150 characters long, multiple unescaped closing angle brackets, and some funky onmouseover code. Truly thedailywtf.com material. Oh, and the HTML file, for all its little content, approaches 100K in size.
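
For what it's worth, the only part of that tag a grabber really needs is the title attribute. Below is a minimal sketch of pulling start and stop times out of it. The function name and the regular expression are my own illustration, not code from the grabber, and the midnight rollover is an assumption about how a show that ends after midnight would be written.

```python
import html
import re
from datetime import datetime, timedelta

# Hypothetical pattern for titles shaped like the TV3 sample:
# "Date: Aug 19, 2008<br>Time: 10:00 AM - 10:02 AM"
TITLE_RE = re.compile(
    r"Date:\s*(?P<date>\w{3} \d{1,2}, \d{4})\s*<br\s*/?>\s*"
    r"Time:\s*(?P<start>\d{1,2}:\d{2} [AP]M)\s*-\s*(?P<stop>\d{1,2}:\d{2} [AP]M)"
)


def parse_title(raw_title):
    """Return (start, stop) datetimes from a ScheduleLink title attribute."""
    text = html.unescape(raw_title)  # turns "&lt;br>" into "<br>"
    match = TITLE_RE.search(text)
    if match is None:
        raise ValueError("unrecognised title: %r" % raw_title)
    # %b and %p depend on the locale; this assumes English month names.
    fmt = "%b %d, %Y %I:%M %p"
    start = datetime.strptime(match["date"] + " " + match["start"], fmt)
    stop = datetime.strptime(match["date"] + " " + match["stop"], fmt)
    if stop < start:
        # Assumption: a stop time earlier than the start means "next day".
        stop += timedelta(days=1)
    return start, stop
```

Calling parse_title() on the title from the sample above gives two datetime objects for 10:00 and 10:02 on 2008-08-19.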

TRWTF about the Media Prima websites, however, is the lack of consistency. All 4 sites appear to be running the same ASP.NET app, yet each one is subtly different:

  • The schedule page is Schedules.aspx on TV3 and NTV7, ScheduleToday.aspx on 8TV, and Schedule.aspx on TV9.

  • To get today's schedule, you need to pass the query string parameter view=today to TV3, NTV7, and TV9. 8TV, of course, doesn't need it.

  • NTV7 only contains a partial listing and truncates shows that have already aired from the list. TV9 contains a partial listing but doesn't truncate. TV3 and 8TV contain the full listing and don't truncate. IMHO, Media Prima should change one of them to contain a full listing with truncation. Then we would have every permutation.

  • If you feel adventurous, you can probably get around the truncation by simulating ASP.NET's postback and using the lovely calendar widget that has number-of-days-since-2000-01-01 as its parameter. If you feel adventurous, and have too much time on your hands.
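
The per-channel quirks above could be captured in a small lookup table, roughly like this sketch. The page names and the view=today rule come from the list; the function names are made up, hostnames are left out on purpose, and whether the calendar widget counts 2000-01-01 as day 0 or day 1 is a guess on my part.

```python
from datetime import date
from urllib.parse import urlencode

# channel -> (schedule page, whether "view=today" must be passed)
CHANNELS = {
    "TV3": ("Schedules.aspx", True),
    "NTV7": ("Schedules.aspx", True),
    "8TV": ("ScheduleToday.aspx", False),
    "TV9": ("Schedule.aspx", True),
}


def today_path(channel):
    """Return the page name plus query string for today's schedule."""
    page, needs_view = CHANNELS[channel]
    if needs_view:
        return page + "?" + urlencode({"view": "today"})
    return page


def days_since_2000(day):
    """Days elapsed since 2000-01-01 (assumes that date itself is day 0)."""
    return (day - date(2000, 1, 1)).days
```

The day count is only half of the story: a simulated postback would also need the hidden ASP.NET form fields from the page, which this sketch does not attempt.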

Bashing aside, one good thing about Media Prima is that they are not afraid to show you what's under the hood. I just checked the 8TV schedule and was presented with this error message, embedded in the page:

[Error loading the WebPart '8TVScheduleSubNavi']
C:\Inetpub\wwwroot\mediaprima\8tv\CMSWebParts\8TV\Schedule\8TVScheduleSubNavi.ascx(17): error BC30451: Name 'LinkHelperClass' is not declared.

Awesome.

In comparison, the RTM website is surprisingly good.

  • Both RTM1 and RTM2 pages are consistent with each other. This is a small feat, but I have to mention it.

  • The date parameter follows ISO 8601, i.e. YYYY-MM-DD, unlike the Media Prima websites, which expect 3 separate parameters for day, month, and year. Kudos to the developers.

  • The page is about one-sixth the size of Media Prima's.

  • The listing follows the newspaper day (i.e. from morning until the next morning), rather than the actual day (i.e. from midnight to midnight). This is good usability.

  • It has reliability issues at times: the RTM1 listing has been blank since 2008-12-28.
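
One catch with the newspaper day is that XMLTV wants real start times, so the early-morning shows at the tail of a listing have to be moved to the next calendar date. A minimal sketch of that adjustment follows; the 6 AM cutoff and the function name are assumptions for illustration, since the actual hour at which a listing begins would have to be read from the page.

```python
from datetime import date, datetime, time, timedelta


def to_actual_datetime(listing_day, clock, cutoff=time(6, 0)):
    """Map a time on a newspaper-day listing to a real datetime.

    Times before `cutoff` are assumed to belong to the next calendar day.
    """
    if clock < cutoff:
        listing_day = listing_day + timedelta(days=1)
    return datetime.combine(listing_day, clock)
```

With the default cutoff, a 1:30 AM show on the 2008-12-30 listing ends up on 2008-12-31, while a 9:00 PM show stays on 2008-12-30.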

As for the Astro website, there is not much to talk about. Overall, it is just OK.

  • Pages are consistent.

  • The date parameter uses the format of DD-MON-YYYY.

  • A day of schedules is split into 2 pages, one for AM, one for PM. This is cumbersome not only for the script to scrape, but also for an actual person to read.

  • Things like No Transmission and Transmission Ends are included as shows with start time and duration. This isn't really necessary.

  • The page is 3 times the size of RTM's.
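
Dealing with the AM/PM split and the filler entries boils down to something like the sketch below. The (start, title) tuples are a made-up row shape for illustration, and matching the filler titles by exact name is an assumption about how they are spelled on the page.

```python
# Filler entries that Astro lists as if they were shows.
PLACEHOLDERS = {"no transmission", "transmission ends"}


def merge_day(am_rows, pm_rows):
    """Join the AM and PM pages into one day and drop the filler entries.

    Each row is a hypothetical (start, title) tuple, already in page order.
    """
    merged = list(am_rows) + list(pm_rows)
    return [
        (start, title)
        for start, title in merged
        if title.strip().lower() not in PLACEHOLDERS
    ]
```

One design caveat: if the stop time of a show is derived from the start of the next entry, the filler rows should be dropped only after those stop times have been computed.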

The Star website has its good and bad points, but it is still the best of the bunch.

  • Pages are consistent.

  • The date parameter uses the format of MM/DD/YYYY. Ugh.

  • The listing contains columns for description and episode. This is a major plus. However, the episode column contains a mix of English words and Arabic numerals. It has to be more consistent.

  • The listing follows newspaper day (duh).

  • It spells SpongeBob SquarePants correctly. Shame on you, Astro.
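
To put the four date conventions side by side, here is a rough sketch of one formatter per site. The function names are invented, the upper-case month abbreviation for Astro is a guess at what DD-MON-YYYY means in practice, and the Media Prima key names are placeholders because the real parameter names are not given here.

```python
from datetime import date

# Spelled out by hand so the result does not depend on the locale.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def rtm_date(day):
    """ISO 8601, YYYY-MM-DD."""
    return day.isoformat()


def astro_date(day):
    """DD-MON-YYYY (assumes an upper-case English month abbreviation)."""
    return "%02d-%s-%04d" % (day.day, MONTHS[day.month - 1], day.year)


def star_date(day):
    """MM/DD/YYYY."""
    return "%02d/%02d/%04d" % (day.month, day.day, day.year)


def media_prima_date(day):
    """Three separate values; the key names here are placeholders."""
    return {"day": day.day, "month": day.month, "year": day.year}
```

Keeping a single date object inside the grabber and converting only at the edge means that each site's quirk lives in exactly one function.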

Lastly, the web designers for RTM/Media Prima/Astro/The Star really need to start learning how to use CSS to properly separate content from presentation. Seriously. Let's just start by giving a freaking id (that is less than 150 characters) to the freaking schedule tables, so that I don't have to rely on some bizarre bgcolor attributes to identify them. Amen.
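
For the curious, the bgcolor workaround looks roughly like the sketch below, written against Python's standard html.parser module. This is an illustration rather than the grabber's actual code: the class name is made up, and the colour passed in is whatever value the target page happens to use.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Collect cell text from <table> elements with a given bgcolor."""

    def __init__(self, bgcolor):
        super().__init__()
        self.bgcolor = bgcolor.lower()
        self.depth = 0  # nesting depth inside a matching table
        self.cells = []
        self._in_cell = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            colour = (dict(attrs).get("bgcolor") or "").lower()
            # Count nested tables too, so the matching </table> is found.
            if self.depth or colour == self.bgcolor:
                self.depth += 1
        elif tag == "td" and self.depth:
            self._in_cell = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag == "td" and self._in_cell:
            # Collapse runs of whitespace left over from the markup.
            self.cells.append(" ".join("".join(self._buf).split()))
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)
```

Usage is a matter of creating TableFinder with the colour in question, calling feed() with the page source, and reading the cells list afterwards. It breaks the moment a designer picks a new shade, which is exactly why an id would be nicer.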
