Echo Dev Blog

ECHO (Event Calendar Houston) development blog

Sunday, May 22, 2005

netpivotal.com
posted by Ian  # 6:26 AM

Sunday, November 21, 2004

Wow, been a while since I've posted here. Sorry 'bout that (not that anyone is reading). I'm still thinking a lot about this idea, and have big plans. Those plans may swing into action soon.

In the meantime, here's a link to a discussion on the Hostbaby forums about calendar scraping & RSS. Very useful; I will return to it when I get to that point.

Also, the other day as I was idly thinking about event calendars and the idea of recurrence, I was pondering the question of how you integrate recurring events into a calendar as it goes into the future. JazzHouston has one approach - the recurring events (aka "weekly" gigs) are never shown integrated with the single events. I think that's a shame - if you're looking for what to do tonight, it's unlikely you'll think to check both regular and weekly gigs (at least, it's unlikely for me). The other option, of course, being to integrate the recurring events with the regular ones. The problem? You get lots of events in 2005, 2006, 2007, etc., and they're all recurring, and probably most of them won't actually be going on then.

So for recurring events, how far into the future do you go? My revelation about this was that you can gauge the probable reliability of an event based on its history. So if the event has been going for a month, well, it will probably go another month. If it's been going for 2 years, it'll probably go another 2 years. Or maybe the more appropriate equation is predicting half of the history - an event with 2 years of history gets a 1 year prediction. (Of course, when you're creating recurring events, the ideal is to get the artist to say how long it'll go; but they often don't know.)
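The "predict a fraction of the history" rule can be sketched in a few lines. This is only an illustration in Python (not the C# the project uses); the function name `projection_horizon` and the `factor` parameter are made up for the example, with `factor=0.5` standing in for the "half of the history" variant and `factor=1.0` for the "same length again" variant.

```python
from datetime import date, timedelta


def projection_horizon(first_seen: date, today: date, factor: float = 0.5) -> date:
    """Return the last date through which a recurring event is projected.

    The event's history (today - first_seen) is scaled by `factor` and
    added to today. `factor` is an assumed tuning knob, not a fixed rule.
    """
    history = today - first_seen
    if history < timedelta(0):
        raise ValueError("first_seen is after today")
    # Whole days only; fractions of a day are dropped.
    return today + timedelta(days=int(history.days * factor))


# A gig that has run for 30 days gets projected 15 more days at factor 0.5.
horizon = projection_horizon(date(2004, 10, 22), date(2004, 11, 21))
```

Anything past the horizon would simply not be generated until the event has built up more history.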

You also need a way for those recurring events to be canceled on particular dates, or have other "instance information" like special guests on a particular date, etc. So there needs to be some kind of division of labor, where the recurring event acts like a template that the specific events are generated for. And like Outlook recurring events, you probably need UI controls to ask whether someone intended to change the whole series or just the instance. Hmm.
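One possible shape for that template/instance split, sketched in Python. All names here (`WeeklySeries`, `Override`, `instances`) are hypothetical, and the sketch assumes a strictly weekly recurrence: the series holds the shared information, and a per-date override either cancels an instance or attaches a note to it.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class Override:
    """Instance-level information for one date of a series."""
    cancelled: bool = False
    note: Optional[str] = None  # e.g. a special guest on that date


@dataclass
class WeeklySeries:
    """The recurring event acts as a template for its dated instances."""
    title: str
    first_date: date
    overrides: Dict[date, Override] = field(default_factory=dict)

    def instances(self, until: date) -> Iterator[Tuple[date, str]]:
        """Yield (date, title) for each instance up to `until`, skipping cancellations."""
        current = self.first_date
        while current <= until:
            override = self.overrides.get(current)
            if override is None:
                yield current, self.title
            elif not override.cancelled:
                label = f"{self.title} ({override.note})" if override.note else self.title
                yield current, label
            current += timedelta(days=7)
```

Editing the series changes every generated instance, while editing `overrides` touches a single date, which is the same "whole series or just this one?" choice the UI would have to ask about.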
posted by Ian  # 3:19 PM

Wednesday, April 28, 2004

Working on getting a regular expression for the JazzHouston calendar:

<!-- START post loop -->\s*(?:<tr>\s*<td[^>]+>\s*<p[^>]+>\s+(.+?)\n.+?</p>\s*
</td>\s*<td[^>]+>\s*<p[^>]+>\s+<a href="(.+?)">(.+?)</a>(?:.+?)</p>\s+</td>
\s+<td[^>]+>\s*<p[^>]+>\s+<a href=(.+?)>(.+?)</a>\s+</p>\s+</td>\s+</tr>\s+)+
<!-- END post loop -->



 

Quite a mouthful ... it works, though.

posted by Ian  # 10:57 AM

Sunday, April 25, 2004

OK, well, 4 hours later I have more or less succeeded in grabbing the data out of the KPFT calendar. Here's what I pulled:



Here's how I did it:

  • First, I used an HTML screen scrape routine courtesy of ASP Alliance. Had the HTML pulling in under 5 minutes. Gotta love it.

  • Next, I set about writing a regular expression to parse out the individual day listings. That proved to be a bit trickier, in part because I had to figure out how to use regular expressions in .NET (hadn't done so yet). I did find a regex dictionary for the .NET flavor, which was helpful, but some confusion still remained (especially since I couldn't get the various flags to work correctly). After much struggling, I finally found the genius Expresso, a visual regular expression editor. It makes doing regular expressions a breeze, and even showed me some syntax I didn't know about. Its only flaw is that it doesn't do a good job of threading the UI separately from the expression, so if you write a real monster (i.e. one that would take forever to actually run) it just locks up the UI. It'd be better if it behaved nicely and told you something like "It looks like your expression is taking a long time to run; click here to cancel".

    The expression I eventually ended up with for the day parsing is this:

    (?:<a name=""(sunday|monday|tuesday|wednesday|thursday|friday|saturday)"">(?:.*?)</a>(.*?)(?=<a name))
    Good looking, eh? I love regular expressions. They're so concise! This one breaks up the input page into chunks for each day. Cool. Next, I wrote another expression (much faster this time, due to Expresso) to parse out the data within each day, for all the individual listings:

    (?:<li>\s(.+?)\s*?(?=(<li>|</?ul>)))
    That one worked pretty well. I was working on having the regex itself parse out more of the information (venue, performer, etc.) but the KPFT listings aren't consistent enough for that, so I decided to rely more on procedural logic for that part.
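For anyone who wants to try the two-stage approach, here is a rough Python port of the two expressions above. Several things are assumptions: the doubled quotes in the posted pattern look like C# verbatim-string escaping, so they are written as single quotes here; the `DOTALL` and `IGNORECASE` flags are guesses at the flags used; and the sample markup is invented for the example (it only borrows act names from these posts), not the real KPFT page.

```python
import re
from typing import Dict, List

# Stage 1: one chunk per day, ending just before the next <a name=...> anchor.
DAY_RE = re.compile(
    r'(?:<a name="(sunday|monday|tuesday|wednesday|thursday|friday|saturday)">'
    r'(?:.*?)</a>(.*?)(?=<a name))',
    re.DOTALL | re.IGNORECASE,
)

# Stage 2: one match per <li> item inside a day's chunk.
LISTING_RE = re.compile(r'(?:<li>\s(.+?)\s*?(?=(<li>|</?ul>)))', re.DOTALL)

# Invented sample markup. The trailing anchor matters: the lookahead in
# DAY_RE means the last day only matches if another <a name follows it.
SAMPLE = """
<a name="monday">Monday</a>
<ul>
<li> The El Orbits - Continental Club
<li> The Ponys / The Fuse - Rudyard's
</ul>
<a name="tuesday">Tuesday</a>
<ul>
<li> Jazz Jam with David Marcellin
</ul>
<a name="end"></a>
"""


def parse_calendar(html: str) -> Dict[str, List[str]]:
    """Split the page into days, then each day into raw listing strings."""
    days: Dict[str, List[str]] = {}
    for day_match in DAY_RE.finditer(html):
        day = day_match.group(1).lower()
        body = day_match.group(2)
        days[day] = [m.group(1).strip() for m in LISTING_RE.finditer(body)]
    return days


listings = parse_calendar(SAMPLE)
```

The raw listing strings still need the procedural pass to pull out venue and performer, which is where the inconsistent separators come in.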

    After a few more hurdles (including realizing that the m-dash, which is used on the KPFT calendar as a separator, isn't coming through somewhere), I got the data all pulled in.

    Major next steps with this data import:
  • It didn't deal well at all with the festival. I don't know if it'll be possible, but it's worth trying to parse that correctly.
  • Push the data into a database format (or object model, that's cool too) including not only the ones that did parse correctly, but also the ones that did not

    Then I'm going to repeat the process for a couple of the other sources, and then see if I can generalize some of the common parsing logic (maybe I can, maybe I can't ...)

    Phew.

  • Finally getting a chance today to put some more time into Echo. I'm going to work on creating a parser program that can actually do the parsing I was talking about before.

    Starting up the project as a C# class library - I may port it to another language, like PHP or Python, but this is easiest for me to develop in for now (besides, porting it to one of those would be a great learning tool). Using NUnit to build unit tests and debug.
    posted by Ian  # 12:01 PM

    Sunday, March 28, 2004

    So I'm working on the import from the various HTML pages, and I'm trying to distill a basic screen-scraping approach I can use. I know others have done it successfully, in programs like Watson especially. They use something called a Parsing Dictionary to HTML scrape. Pretty elegant little parser, from what I can see. I downloaded a copy of their test tool, fired it up, and got a first basic parse against the KPFT concert calendar running. Nice. So at least now I have a model I can use to create my own little parsers. I don't know just how detailed I can get with the Watson parsing, but I bet it's pretty deep. As with most things, knowing what to do is often more useful than knowing how it's done; if I can see all the options on their parser, I should be able to implement those same options myself without even seeing their actual source code.

    An aside: One downside to using HTML-scraping as part of my data input is that you're dependent on the owners of those web sites you "scrape" to not go changing things around. When they do, you need to update your program to match theirs. That's kind of a bummer, as it means you're never "done" - you never know when out of the blue, one of your sources may just stop working until you fix it. I guess that's why it's a good thing to have multiple sources, even though some of them might actually be the exact same information - if one of them has a site redesign or upgrade of some sort, you won't be totally in the lurch while you're putting in the new parser.

    Ah, OK. Good night.
    posted by Ian  # 10:38 PM

    Friday, March 26, 2004

    Since making that last night, I remembered a couple more sources, so here's one where I have added feeds from HoustonBands.net, Pollstar, and SpaceCity Rock. (Pollstar and HoustonBands.net only allow you to pull future dates, so it's just Friday onward.) I also added in the "Steady Gigs" from JazzHouston, and cleaned up the parsing on the KPFT calendar stuff a bit.

    Here it is:

    One really cool thing I'll be able to do, from a data integrity point of view, is to actually show people what source something came from. So if there's wrong information, rather than yelling at me, they can be directed to yell at the one who posted it wrong (and it can get corrected there). So this thing has the potential to shore up the other calendars in town, too.

    There are a few obvious next steps:
    1 - Automate the import so it can get at least this far on its own without me spending any time on it
    2 - Perform more basic formatting fixes on this - remove "(The)" etc.
    3 - Match this data to the central normalized data - resolving duplicates or discrepancies, misspellings, alternate spellings, etc
    4 - Augment the central data with Genre, venue information, etc.
    5 - Create a filterable display page for the normalized data
    6 - Consider an ongoing process - logins for self service, conflict resolution, etc.
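Step 2 (basic formatting fixes) might look something like the sketch below. It is written in Python for illustration, and the specific rules are assumptions: it drops a parenthesized "(The)", also drops a leading "The" (which the post does not ask for), collapses whitespace, and lowercases the result so it can serve as a comparison key rather than a display name.

```python
import re


def normalize_name(raw: str) -> str:
    """Reduce a performer name to a rough comparison key."""
    # Remove a parenthesized article such as "El Orbits (The)".
    name = re.sub(r"\s*\(the\)\s*", " ", raw, flags=re.IGNORECASE)
    # Assumption: a leading "The" is also noise for matching purposes.
    name = re.sub(r"^\s*the\s+", "", name, flags=re.IGNORECASE)
    # Collapse runs of whitespace and trim the ends.
    name = re.sub(r"\s+", " ", name).strip()
    return name.casefold()


key_a = normalize_name("El Orbits (The)")
key_b = normalize_name("The El Orbits")
```

Both spellings above reduce to the same key, so they can be grouped before the harder matching in step 3.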

    posted by Ian  # 9:16 AM

    Thursday, March 25, 2004

    So I went ahead and made a spreadsheet that represents the "input" data - i.e. what might be coming in from other sources. That gives me a clean division of labor; I can separately focus on how the input data gets associated to entities with deeper meanings - bands, individual performers, genres, etc.

    I manually copied information (using only methods I could duplicate using regular expressions and other string manipulation, i.e. no human intelligence) from several sources: JazzHouston, Jambase, Houston Chronicle, KPFT Calendar (aka the Blueshound's calendar). I tried to include the Houston Press event database, but it spits this information out in such a bad way, you can't really get at it (i.e. there's no specific date listed, just a "through march 31" type of thing.) Also, when you search on a certain date, it doesn't give you items that happen on that date, it seems to give you items that are happening in general proximity to that date. Pretty lame.

    Right away I notice one main thing: discrepancy!! For Monday, as an example, there are maybe two cases where even two of the different calendars agree exactly on the data. For example, Chronicle has the El Orbits at Continental, whereas KPFT has Glover Gil. (I happen to know they're both true; Glover Gil opens for the El Orbits; but the point is, neither calendar had the whole story). Even when they're close, they're still off: Rudyard's is listed with "The Ponies / The Fuse" by the Chronicle, and with "The Ponys / The Fuse" by the KPFT calendar (note the different spellings). Aargh!
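Near-misses like the two Rudyard's listings are the kind of thing a similarity score can flag for review. A minimal sketch using `difflib.SequenceMatcher` from the Python standard library; the `likely_same` helper and its 0.85 threshold are assumptions that would need tuning against real listings.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two listing strings, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """Guess whether two listings describe the same act (threshold is assumed)."""
    return similarity(a, b) >= threshold


close_call = likely_same("The Ponies / The Fuse", "The Ponys / The Fuse")
different = likely_same("El Orbits", "Glover Gil")
```

A score like this can only suggest that two spellings are the same act. It cannot tell that Glover Gil and the El Orbits share a bill, so that kind of discrepancy still needs a person or another source.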

    Actually, now that I think about it, that's a GOOD thing. It means (as I suspected) that there's a real need for a cohesive, trustworthy calendar in this town.

    Here's the actual import data:

    As an aside, I have run into a few quirks that my data model will have to account for.

  • sometimes bands are listed together with commas, but sometimes they have commas in their name ("Godspeed you, black emperor"). Doh. Similarly, the blueshound calendar uses M-dashes as separators ... mostly. But sometimes not. I figured it was database driven, but I guess not.

  • Also, sometimes there's missing info. Like, the list just gives the time the doors open but not the start time of each band.

  • With classical: one listing was just "Chamber music recital". So the name given is actually more like the description, and there is no name. Interesting. That could be with jazz often too - a "nameless" act performing.

  • I could also pull additional stuff, like the band URL, genre, etc to facilitate making a better match. I'm not going to worry about that yet; the principle is the same.
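One way to cope with the comma quirk in the first item is to protect names already known to contain commas before splitting. This is a Python sketch with assumed names (`split_lineup`, `known_names`); it presumes a list of known performer names exists, matches them case-sensitively, and does nothing for names it has never seen.

```python
from typing import Dict, Iterable, List


def split_lineup(text: str, known_names: Iterable[str]) -> List[str]:
    """Split a comma-separated lineup without breaking known names that contain commas."""
    protected = text
    placeholders: Dict[str, str] = {}
    # Longest names first, so a long name is not broken by a shorter one inside it.
    for i, name in enumerate(sorted(known_names, key=len, reverse=True)):
        token = f"\x00{i}\x00"  # placeholder that cannot appear in normal text
        if name in protected:
            protected = protected.replace(name, token)
            placeholders[token] = name
    result: List[str] = []
    for part in protected.split(","):
        part = part.strip()
        # Put the protected names back.
        for token, name in placeholders.items():
            part = part.replace(token, name)
        if part:
            result.append(part)
    return result


acts = split_lineup(
    "Godspeed you, black emperor, The Fuse",
    ["Godspeed you, black emperor"],
)
```

The same trick would apply to whichever separator a source uses, as long as the list of known names keeps growing from confirmed data.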

    Other thoughts so far:

    I'm leaning towards not emphasizing regular events (i.e. weekly gigs) in this system, because they're prone to change. Maybe as far as it can go would be a button in the self-serve tools that says "extend this as confirmed for X weeks".

    For "Jam sessions", they're kind of their own beast, and the system might have to recognize certain semantic elements in order to correctly parse things. Like "Jazz Jam with David Marcellin" would be David Marcellin as the performer, Jazz Jam as the Performance name (?).


  • So, OK, while I'm on a roll and all ... let's talk data modeling. Here's what I'm thinking so far for entities.

    Event - One place, one time, one or more presenters.

    Performer - a band, solo artist, theater troupe, organization, etc.

    Genres are basically just labels - they only mean something to those who read them, they're just a bit of text slapped on an event or band. And, which is it, does an event have a genre or a band?

    Can things like band membership and genre change over time, and does it make sense to track them as doing so?

    It gets hard to try to imagine any possible set of circumstances, but I'd like to keep a running list of events it should be able to handle. Like:

    - Free Radicals show - who's involved today? What's the line between a regular member and a sit-in?
    - SxDE festival - tons of bands under one organizer at one place
    - Magic Bullets playing the music for a cabaret theater performance; both are events?
    - Kiss Kiss Kill Kill used to be The Singles; can their fans still find them, and would they know why this random new band shows up in their favorites?
    - A jazz group under one person's name but with a rhythm section that changes every few months

    Dizzying, isn't it?
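To make the entities a little more concrete, here is one possible sketch in Python. It is not a settled design: every field name is an assumption, genres are treated as plain text labels that can sit on either a performer or an event, and `former_names` is just one guess at how a renamed band could stay findable. Changes over time (membership, genre) are not modeled at all.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set


@dataclass
class Performer:
    """A band, solo artist, theater troupe, organization, etc."""
    name: str
    genres: Set[str] = field(default_factory=set)          # free-text labels
    members: List["Performer"] = field(default_factory=list)
    former_names: List[str] = field(default_factory=list)  # e.g. after a rename


@dataclass
class Event:
    """One place, one time, one or more presenters."""
    venue: str
    start: datetime
    performers: List[Performer]
    genres: Set[str] = field(default_factory=set)  # labels on the event itself
    source: str = ""                               # which calendar it came from

    def effective_genres(self) -> Set[str]:
        """Union of the event's own labels and its performers' labels."""
        combined = set(self.genres)
        for performer in self.performers:
            combined |= performer.genres
        return combined
```

Taking the union sidesteps the "does the event or the band have the genre?" question for display purposes, though it does not answer it.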
    posted by Ian  # 10:02 PM

    Here's a master list of all the different areas I need to think about:

    - Data model: the relationships between entities in the "static" part of the model (i.e. not how it gets there, but what it is)
    - Aggregation: correctly pulling data from other sources, providing a staging setup, and creating a process whereby as much as possible of the move from staging to confirmed is automatic.
    - Presentation: how the data we do have is sliced, diced, searched, pushed, pulled, etc.

    posted by Ian  # 9:43 PM

    I'm starting a new blog here to track the development of the ideas for my new event calendar software. I think this will be pretty innovative and popular stuff; it's just common sense to use, but the design of it is actually fairly tricky (which is probably why nobody has done it yet).

    So what is it? Think Jambase.com, but
  • Covering not just jamband shows, but other kinds of music and events
  • With the ability to drill down to the individual performer level (i.e. track when one person's in two bands, connect the solo career to the band, etc)
  • Allowing multiple overlapping genres per group - so you can be Rock and Blues and Country, or Punk and Bluegrass, or whatever other combinations you want (e.g. funk, jazz, prog and jam)
  • Allowing the audience to personalize what they want to hear about, both in terms of genre and by individual favorite groups. Also allowing diverse options on how it gets to you - a web page you go to, a daily (or weekly) email you get, etc.
  • Aggregating multiple other sources: screen scraping event listings on other sites, editorial input, and self-service by performers (if they want)

    To the audience, this should all be seamless; you just hit the web page, and BOOM, there's a list of what's happening, when and where, biased towards your own tastes and favorites. To the bands, it should be invisible - your event was probably aggregated from somewhere else, but if not you can put it in.
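The "biased towards your own tastes and favorites" part could start as simply as a sort. In this Python sketch the event dictionaries, their keys, and the scoring rule (favorite performers outrank genre matches) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def rank_events(events: List[Dict], liked_genres: Set[str], favorites: Set[str]) -> List[Dict]:
    """Order events so favorite performers come first, then liked genres."""

    def score(event: Dict):
        favorite_hits = len(favorites & set(event["performers"]))
        genre_hits = len(liked_genres & set(event["genres"]))
        # Tuples compare left to right, so favorites dominate genres.
        return (favorite_hits, genre_hits)

    # sorted() is stable, so events with equal scores keep their original order.
    return sorted(events, key=score, reverse=True)


# Placeholder data, not real listings.
tonight = [
    {"title": "A", "performers": ["X"], "genres": ["punk"]},
    {"title": "B", "performers": ["Y"], "genres": ["jazz"]},
    {"title": "C", "performers": ["Z"], "genres": ["funk"]},
]
ranked = rank_events(tonight, liked_genres={"jazz"}, favorites={"Z"})
```

Nothing is filtered out here; events that match nothing just sink to the bottom, so the full list is still there if you scroll.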

    So making it work is the first hard part; the second is making it efficient to maintain; and the third is figuring out if and how it's worth my time (i.e. can make me money). No sweat.

