Echo Dev Blog

ECHO (Event Calendar Houston) development blog

Sunday, March 28, 2004

So I'm working on the import from the various HTML pages, and I'm trying to distill a basic screen-scraping approach I can use. I know others have done it successfully, in programs like Watson especially. They use something called a Parsing Dictionary to HTML scrape. Pretty elegant little parser, from what I can see. I downloaded a copy of their test tool, fired it up, and got a first basic parse against the KPFT concert calendar running. Nice. So at least now I have a model I can use to create my own little parsers. I don't know just how detailed I can get with the Watson parsing, but I bet it's pretty deep. As with most things, knowing what to do is often more useful than knowing how it's done; if I can see all the options on their parser, I should be able to implement those same options myself without even seeing their actual source code.

An aside: One downside to using HTML scraping as part of my data input is that you're dependent on the owners of the sites you "scrape" not changing things around. When they do, you need to update your program to match theirs. That's kind of a bummer, as it means you're never "done" - out of the blue, one of your sources may just stop working until you fix it. I guess that's why it's a good thing to have multiple sources, even though some of them might carry the exact same information - if one of them has a site redesign or upgrade of some sort, you won't be totally in the lurch while you're putting in the new parser.
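One way to keep a single redesigned site from stalling the whole import is to run each scraper in isolation and just note which ones broke. This is only a rough sketch under my own assumptions: `run_import` and the idea of a scraper as a plain function returning a list of listings are hypothetical, not anything Watson's parser does.

```python
def run_import(scrapers):
    """Run every scraper; collect listings and note which sources broke.

    `scrapers` maps a source name to a zero-argument function that
    returns a list of listings (hypothetical interface for illustration).
    """
    listings = []
    broken = []
    for name, scrape in scrapers.items():
        try:
            rows = scrape()
        except Exception as exc:
            # The site probably changed its layout; keep going with the rest.
            broken.append((name, str(exc)))
            continue
        if not rows:
            # An empty result is suspicious too: the page loaded but nothing matched.
            broken.append((name, "no listings found"))
            continue
        # Tag each listing with the source it came from.
        listings.extend((name, row) for row in rows)
    return listings, broken
```

The `broken` list is the to-do list of parsers that need fixing; the other sources still get imported in the meantime.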

Ah, OK. Good night.
posted by Ian  # 10:38 PM

Friday, March 26, 2004

Since making that last night, I remembered a couple more sources, so here's one where I have added feeds from HoustonBands.net, Pollstar, and SpaceCity Rock. (Pollstar and HoustonBands.net only allow you to pull future dates, so it's just Friday onward.) I also added in the "Steady Gigs" from JazzHouston, and cleaned up the parsing on the KPFT calendar stuff a bit.

Here it is:

One really cool thing I'll be able to do, from a data integrity point of view, is to actually show people what source something came from. So if there's wrong information, rather than yelling at me, they can be directed to yell at whoever posted it wrong (and it can get corrected there). So this thing has the potential to shore up the other calendars in town, too.

There are a few obvious next steps:
1 - Automate the import so it can get at least this far on its own without me spending any time on it
2 - Perform more basic formatting fixes on this - remove "(The)" etc.
3 - Match this data to the central normalized data - resolving duplicates or discrepancies, misspellings, alternate spellings, etc
4 - Augment the central data with Genre, venue information, etc.
5 - Create a filterable display page for the normalized data
6 - Consider an ongoing process - logins for self service, conflict resolution, etc.
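For step 2, the formatting fixes could start out as a handful of regular expressions. A minimal sketch, assuming the feeds write the article as a "(The)" tag or a trailing ", The"; the function name `normalize_name` and those exact patterns are my own guesses at what the cleanup will need, not a finished rule set.

```python
import re

def normalize_name(raw):
    """Tidy a scraped act name: drop '(The)', drop a trailing ', The',
    and collapse runs of whitespace."""
    name = raw.strip()
    # "El Orbits (The)" -> "El Orbits"
    name = re.sub(r"\s*\(the\)\s*", " ", name, flags=re.IGNORECASE)
    # "Fuse, The" -> "Fuse"
    name = re.sub(r",\s*the$", "", name.strip(), flags=re.IGNORECASE)
    # Collapse any leftover double spaces.
    return re.sub(r"\s+", " ", name).strip()
```

Keeping the raw string alongside the normalized one would make it easy to undo a rule that turns out to be too aggressive.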

posted by Ian  # 9:16 AM

Thursday, March 25, 2004

So I went ahead and made a spreadsheet that represents the "input" data - i.e. what might be coming in from other sources. That gives me a clean division of labor; I can separately focus on how the input data gets associated to entities with deeper meanings - bands, individual performers, genres, etc.

I manually copied information (using only methods I could duplicate using regular expressions and other string manipulation, i.e. no human intelligence) from several sources: JazzHouston, Jambase, Houston Chronicle, KPFT Calendar (aka the Blueshound's calendar). I tried to include the Houston Press event database, but it spits this information out in such a bad way, you can't really get at it (i.e. there's no specific date listed, just a "through march 31" type of thing). Also, when you search on a certain date, it doesn't give you items that happen on that date; it seems to give you items that are happening in general proximity to that date. Pretty lame.

Right away I notice one main thing: discrepancy!! For Monday, as an example, there are maybe two cases where even two of the different calendars agree exactly on the data. For example, Chronicle has the El Orbits at Continental, whereas KPFT has Glover Gil. (I happen to know they're both true; Glover Gil opens for the El Orbits; but the point is, neither calendar had the whole story). Even when they're close, they're still off: Rudyard's is listed with "The Ponies / The Fuse" by the Chronicle, and with "The Ponys / The Fuse" by the KPFT calendar (note the different spellings). Aargh!
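Near-misses like the Ponies/Ponys one could probably be caught with a simple string similarity score. Here's a rough sketch using `difflib` from the Python standard library; the helper names and the 0.85 threshold are assumptions I'd have to tune against real data, not a tested rule.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity score, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def probably_same(a, b, threshold=0.85):
    """Guess whether two listings are the same show, spelled differently."""
    return similarity(a, b) >= threshold

chronicle = "The Ponies / The Fuse"
kpft = "The Ponys / The Fuse"
if probably_same(chronicle, kpft):
    print("Likely the same bill; flag the spelling for review")
```

This only helps with misspellings. The El Orbits vs. Glover Gil case scores low because the two calendars list different acts on the same bill, so that one has to be matched on venue and date instead.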

Actually, now that I think about it, that's a GOOD thing. It means (as I suspected) that there's a real need for a cohesive, trustworthy calendar in this town.

Here's the actual import data:

As an aside, I have run into a few quirks that my data model will have to account for.

  • sometimes bands are listed together with commas, but sometimes they have commas in their name ("Godspeed you, black emperor"). Doh. Similarly, the blueshound calendar uses em dashes as separators ... mostly. But sometimes not. I figured it was database driven, but I guess not.

  • Also, sometimes there's missing info. Like, the list just gives the time the doors open but not the start time of each band.

  • With classical: one listing was just "Chamber music recital". So the name given is actually more like the description, and there is no name. Interesting. That could be with jazz often too - a "nameless" act performing.

  • I could also pull additional stuff, like the band URL, genre, etc to facilitate making a better match. I'm not going to worry about that yet; the principle is the same.
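The comma quirk above could be handled by protecting known comma-containing names before splitting. A minimal sketch, assuming I keep a hand-maintained list of such names (`KNOWN_COMMA_NAMES` is hypothetical) and that separators are commas, em dashes, or slashes with spaces around them:

```python
import re

# Hypothetical hand-maintained list of act names that contain a comma.
KNOWN_COMMA_NAMES = ["Godspeed you, black emperor"]

# Comma, spaced slash, or em dash (U+2014). The slash needs surrounding
# spaces so a name with an embedded slash is not cut in half.
SEPARATORS = re.compile(r"\s*,\s*|\s+/\s+|\s*\u2014\s*")

def split_lineup(listing, known=KNOWN_COMMA_NAMES):
    """Split one listing string into individual act names."""
    placeholder = "\x00"
    protected = listing
    for name in known:
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        # Hide the commas inside a known name so the split skips them.
        protected = pattern.sub(
            lambda m: m.group(0).replace(",", placeholder), protected
        )
    parts = SEPARATORS.split(protected)
    # Put the hidden commas back and drop empty pieces.
    return [p.replace(placeholder, ",").strip() for p in parts if p.strip()]
```

It won't catch a comma name nobody has added to the list yet, so those would still need a human to notice them once.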

    Other thoughts so far:

    I'm leaning towards not emphasizing regular events (i.e. weekly gigs) in this system, because they're prone to change. Maybe as far as it can go would be a button in the self-serve tools that says "extend this as confirmed for X weeks".

    For "Jam sessions", they're kind of their own beast, and the system might have to recognize certain semantic elements in order to correctly parse things. Like "Jazz Jam with David Marcellin" would be David Marcellin as the performer, Jazz Jam as the Performance name (?).
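A first stab at the jam session case might be a single pattern that looks for the word "jam" followed by "with". This is just a sketch; `parse_jam` and the "X Jam with Y" shape are my assumption about how these listings usually read.

```python
import re

# Matches listings shaped like "<something> Jam ... with <performer>".
JAM_PATTERN = re.compile(
    r"^(?P<performance>.*\bjam\b.*?)\s+with\s+(?P<performer>.+)$",
    re.IGNORECASE,
)

def parse_jam(listing):
    """Return performance name and performer for a jam listing, else None."""
    match = JAM_PATTERN.match(listing.strip())
    if match is None:
        return None
    return {
        "performance": match.group("performance").strip(),
        "performer": match.group("performer").strip(),
    }

print(parse_jam("Jazz Jam with David Marcellin"))
```

Anything that doesn't fit the pattern comes back as `None` and falls through to the normal band parsing.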


  • So, OK, while I'm on a roll and all ... let's talk data modeling. Here's what I'm thinking so far for entities.

    Event - One place, one time, one or more presenters.

    Performer - a band, solo artist, theater troupe, organization, etc.

    Genres are basically just labels - they only mean something to those who read them, they're just a bit of text slapped on an event or band. And, which is it, does an event have a genre or a band?

    Can things like band membership and genre change over time, and does it make sense to track them as doing so?
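To make the entities a bit more concrete, here's a rough sketch of how they might look as Python dataclasses. Everything in it is an assumption at this point: the field names are made up, genres are plain text labels allowed on both events and performers (which sidesteps the "which is it" question rather than answering it), and nothing is tracked over time yet.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Performer:
    """A band, solo artist, theater troupe, organization, etc."""
    name: str
    genres: set = field(default_factory=set)    # free-text labels
    members: list = field(default_factory=list) # other Performers, if a group

@dataclass
class Event:
    """One place, one time, one or more presenters."""
    venue: str
    start: datetime
    presenters: list                          # one or more Performer objects
    genres: set = field(default_factory=set)  # labels put on the event itself
    source: str = ""                          # which calendar the listing came from

    def all_genres(self):
        """Labels on the event plus those of every presenter."""
        labels = set(self.genres)
        for performer in self.presenters:
            labels |= performer.genres
        return labels
```

Letting `members` hold other `Performer` objects is one way to connect a solo career to a band, since a person and a group are the same kind of entity here.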

    It gets hard to try to imagine any possible set of circumstances, but I'd like to keep a running list of events it should be able to handle. Like:

    - Free Radicals show - who's involved today? What's the line between a regular member and a sit-in?
    - SxDE festival - tons of bands under one organizer at one place
    - Magic Bullets playing the music for a cabaret theater performance; both are events?
    - Kiss Kiss Kill Kill used to be The Singles; can their fans still find them, and would they know why this random new band shows up in their favorites?
    - A jazz group under one person's name but with a rhythm section that changes every few months

    Dizzying, isn't it?
    posted by Ian  # 10:02 PM

    Here's a master list of all the different areas I need to think about:

    - Data model: the relationships between entities in the "static" part of the model (i.e. not how it gets there, but what it is)
    - Aggregation: correctly pulling data from other sources, providing a staging setup, and creating a process whereby as much of the move from staging to confirmed is as automatic as possible.
    - Presentation: how the data we do have is sliced, diced, searched, pushed, pulled, etc.

    posted by Ian  # 9:43 PM

    I'm starting a new blog here to track the development of the ideas for my new event calendar software. I think this will be pretty innovative and popular stuff; it's just common sense to use, but the design of it is actually fairly tricky (which is probably why nobody has done it yet).

    So what is it? Think Jambase.com, but
  • Covering not just jamband shows, but other kinds of music and events
  • With the ability to drill down to the individual performer level (i.e. track when one person's in two bands, connect the solo career to the band, etc)
  • Allowing multiple overlapping genres per group - so you can be Rock and Blues and Country, or Punk and Bluegrass, or whatever other combinations you want (e.g. funk, jazz, prog and jam)
  • Allowing the audience to personalize what they want to hear about, both in terms of genre and by individual favorite groups. Also allowing diverse options on how it gets to you - a web page you go to, a daily (or weekly) email you get, etc.
  • Aggregating multiple other sources: screen scraping event listings on other sites, editorial input, and self-service by performers (if they want)

    To the audience, this should all be seamless; you just hit the web page, and BOOM, there's a list of what's happening, when and where, biased towards your own tastes and favorites. To the bands, it should be invisible - your event was probably aggregated from somewhere else, but if not you can put it in.

    So making it work is the first hard part; the second is making it efficient to maintain; and the third is figuring out if and how it's worth my time (i.e. can make me money). No sweat.

