Adam Cameron's Dev Blog: JSoup

Sunday, 15 September 2013

Scraps, scrapes and soup

G'day:
Yesterday I noted - fairly "emphatically" - that changes to Adobe's ColdFusion docs site (not the new flash-harry CF10 stuff, the old stuff that they never update) have allegedly broken some community sites like cfquickdocs and CFGloss (the latter is currently redirected to the site's homepage). This seemed to be because the HTML structure (or the URLs? I don't actually know for sure) had changed and these sites relied on scraping the site to get the content for their revised UI for the CF docs. These sites exist because the CF docs site is not particularly user friendly as far as searching and navigation goes. And the URLs are completely unhackable (in that good sense of hackability, eg: "cfabort.html" would be good because one could infer other tags' pages from it; "WSc3ff6d0ea77859461172e0811cbec22c24-7fde.html" is bloody stupid), so whilst the content is OK, finding it can be a challenge sometimes.

Stop Press:

CFQuickDocs is at least partially back up again. The CF8 and CFMX7 docs work: CF9 still doesn't.

Ray pointed out that it's entirely Adobe's prerogative to change their URLs/HTML if they want, and went on to say anyone relying on page scraping for their content need to be aware of the instability of their chosen approach. And, accordingly, if Adobe changed their docs, it's not their fault if other sites break. I would say that given they performed the action that caused the breakage that it is their fault. However - all things being equal - it's perhaps not something they should intrinsically lose sleep over. That's the thing, though: I don't think all things are equal here. These third-party documentation sites are providing a service to the community that works around shortcomings in Adobe's own approach, and accordingly I think they warrant some respect from Adobe. My own case in point is that I need to scrape bug content from the bugbase for my CFBugNotifier process, because there's no other way to get the info. And given Adobe don't see fit to provide notifications when bug status changes, I think I offer a community-useful service here. And if they changed the bugbase mark-up and CFBugNotifier broke, that would be a detriment to their community. And their fault

Another consideration is that the Adobe docs - except the CF9 ones (and formerly the CF10 ones, I found out today: "ColdFusion copyright and trademarks and third-party notices") - are copyrighted. So people aren't really supposed to be copying them for their own use. Which is what page-scraping is doing. This is why my "copy" of the bug-base entries doesn't maintain the actual content, it maintains a hash of the content. That way I can quickly check if anything has changed (if not exactly what has changed). This doesn't infringe Adobe's copyright.

Anyway, whatever. I think I was a little more belligerent about it than I needed to be, but I think also that Ray was a little more dismissive of the broader picture than he needed to be (whilst being technically correct). And this blog article is not about that anyhow (he says, being almost 500 words into it!).

About

I've been a web developer since 2001: 13yrs as a CFML developer; 6yrs as a PHP dev; and now back adjacent to the CFML community but focusing on code quality, design and testing. As of 2023, I am migrating my dev focus back to PHP again.

The code I write and discuss here is pretty much just looking at random conundrums I encounter in my day job.

I tend to be a bit "forthright" in my opinions, I am indelicate, and I tend to swear too much. This will come out occasionally here: I make no apology for it.

Everything said here is my own opinion. Feel free to disagree with me :-)