Echo Dev Blog

ECHO (Event Calendar Houston) development blog

Wednesday, April 28, 2004

Working on getting a regular expression for the JazzHouston calendar:

<!-- START post loop -->\s*(?:<tr>\s*<td[^>]+>\s*<p[^>]+>\s+(.+?)\n.+?</p>\s*
</td>\s*<td[^>]+>\s*<p[^>]+>\s+<a href="(.+?)">(.+?)</a>(?:.+?)</p>\s+</td>
\s+<td[^>]+>\s*<p[^>]+>\s+<a href=(.+?)>(.+?)</a>\s+</p>\s+</td>\s+</tr>\s+)+
<!-- END post loop -->



 

Quite a mouthful ... it works, though.
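For anyone who wants to poke at this outside .NET, here is a rough Python sketch of the same idea. It assumes the line breaks in the expression above are just page wrapping, and the sample HTML and the `sample_row` and `parse_rows` helpers are made up for illustration (the real JazzHouston markup may differ). One thing to watch: in .NET a repeated group keeps every iteration in `Group.Captures`, while Python's `re` keeps only the last one, so the sketch isolates the block between the two comment markers first and then matches one row at a time.

```python
import re

# Everything between the two marker comments.
BLOCK = re.compile(r'<!-- START post loop -->(.*?)<!-- END post loop -->', re.DOTALL)

# One table row: the inner part of the expression above, unwrapped.
ROW = re.compile(
    r'<tr>\s*<td[^>]+>\s*<p[^>]+>\s+(.+?)\n.+?</p>\s*'
    r'</td>\s*<td[^>]+>\s*<p[^>]+>\s+<a href="(.+?)">(.+?)</a>(?:.+?)</p>\s+</td>'
    r'\s+<td[^>]+>\s*<p[^>]+>\s+<a href=(.+?)>(.+?)</a>\s+</p>\s+</td>\s+</tr>\s+'
)


def parse_rows(html):
    """Return one tuple of the five captured groups per table row."""
    block = BLOCK.search(html)
    if block is None:
        return []
    return [m.groups() for m in ROW.finditer(block.group(1))]


def sample_row(when, time, act_url, act, venue_url, venue):
    """Build a hypothetical row shaped the way the expression expects."""
    return (
        f'<tr>\n<td class="c">\n<p class="d">\n{when}\n{time}</p>\n</td>\n'
        f'<td class="c">\n<p class="d">\n<a href="{act_url}">{act}</a> (live)</p>\n</td>\n'
        f'<td class="c">\n<p class="d">\n<a href={venue_url}>{venue}</a>\n</p>\n</td>\n</tr>\n'
    )


# Invented sample data, not real calendar entries.
SAMPLE = (
    '<!-- START post loop -->\n'
    + sample_row('Fri 4/30', '9:00 PM', '/artists/1', 'Sample Trio', '/venues/1', 'Sample Club')
    + sample_row('Sat 5/1', '8:00 PM', '/artists/2', 'Another Quartet', '/venues/2', 'Other Hall')
    + '<!-- END post loop -->'
)

if __name__ == '__main__':
    for row in parse_rows(SAMPLE):
        print(row)
```

Note that the second line of the first cell (the time, in this made-up sample) is matched but not captured, exactly as in the original expression.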

posted by Ian  # 10:57 AM

Sunday, April 25, 2004

OK, well, 4 hours later I have more or less succeeded in grabbing the data out of the KPFT calendar. Here's what I pulled:



Here's how I did it:

  • First, I used an HTML screen scrape routine courtesy of ASP Alliance. Had the HTML pulling in under 5 minutes. Gotta love it.

  • Next, I set about writing a regular expression to parse out the individual day listings. That proved a bit trickier, in part because I had to figure out how to use regular expressions in .NET (hadn't done so yet). I did find a regex dictionary for the .NET flavor, which was helpful, but some confusion remained (especially since I couldn't get the various flags to work correctly). After much struggling, I finally found the genius Expresso, a visual regular expression editor. It makes doing regular expressions a breeze, and even showed me some syntax I didn't know about. Its only flaw is that it doesn't run the expression on a separate thread from the UI, so if you write a real monster (i.e. one that would take forever to actually run) it just locks up the UI. It'd be better if it behaved nicely and told you something like "It looks like your expression is taking a long time to run; click here to cancel".

    The expression I eventually ended up with for the day parsing is this:

    (?:<a name=""(sunday|monday|tuesday|wednesday|thursday|friday|saturday)"">(?:.*?)</a>(.*?)(?=<a name))
    Good looking, eh? I love regular expressions. They're so concise! This one breaks up the input page into chunks for each day. Cool. Next, I wrote another expression (much faster this time, due to Expresso) to parse out the data within each day, for all the individual listings:

    (?:<li>\s(.+?)\s*?(?=(<li>|</?ul>)))
    That one worked pretty well. I was working on having the regex itself parse out more of the information (venue, performer, etc.) but the KPFT listings aren't consistent enough for that, so I decided to rely more on procedural logic for that part.

    After a few more hurdles (including realizing that the em dash, which the KPFT calendar uses as a separator, was getting lost somewhere along the way), I got the data all pulled in.
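The two expressions above chain together roughly like this. This is a Python sketch rather than the .NET code I actually ran, and it rests on a few assumptions: the doubled quotes in the day expression are read as C# verbatim-string escaping for a single quote, the day expression needs "dot matches newline" turned on (`re.DOTALL` here), and the sample page, including the trailing `<a name="end">` anchor, is invented. The function name `listings_by_day` is also just for illustration.

```python
import re

# Splits the page into one chunk per day. The lookahead means a day only
# matches if another "<a name" follows it, so the last day on a page needs
# some trailing anchor to be picked up.
DAY = re.compile(
    r'(?:<a name="(sunday|monday|tuesday|wednesday|thursday|friday|saturday)">'
    r'(?:.*?)</a>(.*?)(?=<a name))',
    re.DOTALL,
)

# Pulls the individual listings out of one day's chunk.
LISTING = re.compile(r'(?:<li>\s(.+?)\s*?(?=(<li>|</?ul>)))')


def listings_by_day(html):
    """Return a list of (day name, [listing text, ...]) in page order."""
    result = []
    for day in DAY.finditer(html):
        name, body = day.group(1), day.group(2)
        items = [m.group(1).strip() for m in LISTING.finditer(body)]
        result.append((name, items))
    return result


# Hypothetical page fragment; \u2014 is the em dash separator.
SAMPLE = (
    '<a name="friday">Friday</a>\n'
    '<ul>\n'
    '<li> Band A \u2014 Club X\n'
    '<li> Band B \u2014 Hall Y\n'
    '</ul>\n'
    '<a name="saturday">Saturday</a>\n'
    '<ul>\n'
    '<li> Band C \u2014 Club Z\n'
    '</ul>\n'
    '<a name="end"></a>\n'
)

if __name__ == '__main__':
    for name, items in listings_by_day(SAMPLE):
        print(name, items)
```

Returning a list of pairs instead of a dictionary keeps the page order and avoids collisions if the same weekday shows up twice on one page.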

    Major next steps with this data import:
  • It didn't deal well at all with the festival. I don't know if it'll be possible, but it's worth trying to parse that correctly.
  • Push the data into a database format (or object model, that's cool too), including not only the listings that parsed correctly but also the ones that did not.
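A possible shape for that object model, sketched in Python. Everything here is an assumption for illustration: the `Listing` class, the `parse_listing` and `import_listings` helpers, and especially the idea that a clean listing looks like "performer, em dash, venue". The real KPFT listings are less regular than that, which is the whole reason for keeping the failures around.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

EM_DASH = "\u2014"  # the separator used on the KPFT calendar


@dataclass
class Listing:
    raw: str                         # original text, always kept
    performer: Optional[str] = None  # filled in only when parsing succeeds
    venue: Optional[str] = None

    @property
    def parsed(self) -> bool:
        return self.performer is not None and self.venue is not None


def parse_listing(raw: str) -> Listing:
    """Split on the em dash; anything that is not exactly two parts stays unparsed."""
    parts = [p.strip() for p in raw.split(EM_DASH)]
    if len(parts) == 2 and all(parts):
        return Listing(raw, performer=parts[0], venue=parts[1])
    return Listing(raw)


def import_listings(raw_items: List[str]) -> Tuple[List[Listing], List[Listing]]:
    """Return (parsed, unparsed) so the failures can be reviewed later."""
    records = [parse_listing(r) for r in raw_items]
    good = [r for r in records if r.parsed]
    bad = [r for r in records if not r.parsed]
    return good, bad


if __name__ == "__main__":
    good, bad = import_listings([
        "Band A \u2014 Club X",              # hypothetical clean listing
        "Weekend festival, various stages",  # hypothetical messy one
    ])
    print(len(good), "parsed;", len(bad), "kept as raw text")
```

If the em dash gets lost before this step, the split finds nothing and the listing simply lands in the unparsed pile with its raw text intact.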

    Then I'm going to repeat the process for a couple of the other sources, and then see if I can generalize some of the common parsing logic (maybe I can, maybe I can't ...)

    Phew.

  • Finally getting a chance today to put some more time into Echo. I'm going to work on creating a parser program that can actually do the parsing I was talking about before.

    Starting up the project as a C# class library - I may port it to another language, like PHP or Python, but this is easiest for me to develop in for now (besides, porting it to one of those would be a great learning tool). Using NUnit to build unit tests and debug.
    posted by Ian  # 12:01 PM
