Knucklehead Up and Running Again
For about a week there my MMA fighter app for Android, Knucklehead, has been out of commission. The reason is that the site that I scrape for data, Sherdog was completely redesigned. I needed to rework a lot of my PHP-based screen scraping code to make it work. My PHP is rusty, but fortunately, my regular expression chops are always sharp. In any case, it’s back to working order now.
Maybe with a little luck I can get people to re-rate those 1-star ratings I got while it was down. If not, perhaps I can guilt people into giving me good ratings by saying that it’s my newborn son’s fault I didn’t get to it sooner. I mean, really, are you going to give the father of this child a hard time? Don’t be so heartless.
Screen Scraping Tips
As we all know, screen scraping is pretty much the most volatile way to supply your apps with data. While no amount of planning can permanently future-proof your screen scraper, there’s some things you can do to make it easier.
- Define a data model for what you are scraping. What made my project relatively easy to get working again was that I had a clearly defined data model shared by both the backend and frontend. In my case, this is a LAMP stack and distributed Android app respectively. Now my users don’t need to download a new version of my app in order to get the fix. It just works now.
- Code the scraper on a backend proxy, not in the app itself. That way, when your scraper inevitably gets broken at some point (after a year in my case), you only need to change the code in one place. This goes hand-in-hand with defining a data model.
- Keep it simple stupid. Don’t over-architect a screen scraper. Most data processing code will change over time, but a screen scraper may need a total rewrite at some point. Put all your software engineering skills into controlling what leaves the scraper, but play it pretty loose with what the scraper is going to be processing.
- Using JSON? You better be using jsonlint.com too. jsonlint.com can save you a lot of time and aggravation when it comes to creating and processing your scraped data. Not only does it validate your JSON, but it will also pretty print it so it’s easy to read through yourself.
That’s just a couple things that made fixing my scraper a lot less painful. You guys out there have any tips? I’d love to hear them. All things being equal, I hope I never have to screen scrape for data ever again. But since that’s pretty unlikely, it’s good to be prepared for the inevitable obstacles that scraping will present.
Next time try using a html parser like phpquery. It allows you select dom elements using css syntax (or xpath). It’s much cleaner than using regex and much easier to maintain.