Not Quite Ready For Prime Time
On the surface of it, it seems like a totally cool idea. Function blocks that you drag, drop and “pipe” together on a whiteboard to grab RSS or Atom feeds, web pages, or whatever; grab the pieces you want, transform them, combine them, and sort them into something new.
This is usually accompanied by the “frozen block” problem, where you drag a block onto the whiteboard but then can’t move it or connect it. The fix for the first is usually to just wait; the project has probably been saved, but the response isn’t coming back. In time it unclogs. The latter is more serious, and often involves backing out of the project altogether, then going back in.
The project I’ve been working in is pretty ambitious. An all-purpose “what’s the bad news” feed which combines the crime blotters of two local police departments’ web sites, two feeds from community boards, the National Weather Service’s feed for our area, a traffic report, and a transit report.
Now, grabbing RSS feeds and combining them is what Pipes does best. But the two local police departments’ web pages were another matter. They’re done up in non-standard HTML, and a lot of trial and error was involved in trying to capture the items that I want when the people who are coding the pages don’t do it the same way every day. It’s probably broken right now, come to think of it. And that blocked pipe problem can be pretty infuriating when you’re in a rapid-fire “try this then let’s try that” mode.
Speaking of nonstandard, the biggest problems I ran into were when Pipes didn’t see the Microsoft Office tags embedded in the text; that is, the parser saw them, but the debugger didn’t.
In this example, I had a space in between the quotation mark and the date. This made it impossible to sort on date, because the system wants to interpret it as a specific date/time format, and the space wasn’t part of it.
Running a regexp to strip out the space was getting me nowhere. I kept seeing ” 2010-04-etc.” and couldn’t get rid of it. Then I ran a regexp to take out any non-word spaces.
There’s that awful MS Office-style HTML. It was there, but the debugger refused to show it to me. Taking out anything that looked like a tag — despite the character-entity for the > — worked.
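For anyone who hits the same wall outside of Pipes, here is a rough Python sketch of that cleanup step. It is not what Pipes does internally; it just illustrates the idea of removing anything tag-shaped, whether the angle brackets are literal or written as character entities, and then trimming the leftover whitespace so the date can be parsed. The sample string, the date in it, and the function name `clean_date_field` are all made up for illustration.

```python
import html
import re
from datetime import datetime

# Matches anything that looks like a tag, whether the brackets are
# literal (<o:p>) or entity-escaped (&lt;span ...&gt;).
TAG_LIKE = re.compile(r"(?:<|&lt;)[^<>&]*(?:>|&gt;)")


def clean_date_field(raw: str) -> str:
    """Strip tag-like debris, decode remaining entities, trim whitespace."""
    no_tags = TAG_LIKE.sub("", raw)
    return html.unescape(no_tags).strip()


# Hypothetical input: Office-style markup wrapped around a date,
# with a stray leading space like the one described above.
sample = ' <o:p></o:p>&lt;span lang="EN-US"&gt; 2010-04-15&lt;/span&gt;'

cleaned = clean_date_field(sample)
parsed = datetime.strptime(cleaned, "%Y-%m-%d")
print(repr(cleaned), parsed.date())
```

The order matters here: the tag-shaped pieces are removed before the entities are decoded, so an escaped `&gt;` is still recognizable as the end of a tag when the pattern runs.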
There were still other bits of weirdness that the debugger wasn’t much help with. I was trying to filter so that only posts with words like “crime,” “cop,” “arrest” and so on would appear. But I kept getting this one:
Aha! There it is. So, a three-letter word is probably not the best filter you can use.
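The short-keyword trap is easy to reproduce. The sketch below, in Python rather than Pipes, compares a plain substring test with a whole-word regular expression. The keyword list comes from the filter described above; the headlines are invented examples, and the function names are only for illustration.

```python
import re

KEYWORDS = ["crime", "cop", "arrest"]

# \b anchors each keyword to word boundaries, so "cop" no longer
# matches inside a longer word such as "scope".
WHOLE_WORD = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def substring_match(title: str) -> bool:
    """Naive filter: true if any keyword appears anywhere in the title."""
    lowered = title.lower()
    return any(k in lowered for k in KEYWORDS)


def whole_word_match(title: str) -> bool:
    """Stricter filter: true only if a keyword appears as a whole word."""
    return WHOLE_WORD.search(title) is not None


# Invented headlines for illustration.
for title in ["Scope of library renovation widens", "Cop arrests suspect"]:
    print(title, substring_match(title), whole_word_match(title))
```

The whole-word version has its own cost: it will skip “cops” or “arrested” unless those forms are added to the list, so it trades false positives for possible misses.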
But again, these are easy. The interface continually hanging up on me is what made it hard. In time, I started to get close, but it took days.