Showing posts with label JSoup. Show all posts

Sunday 15 September 2013

Scraps, scrapes and soup

Yesterday I noted - fairly "emphatically" - that changes to Adobe's ColdFusion docs site (not the new flash-harry CF10 stuff, the old stuff that they never update) have allegedly broken some community sites like cfquickdocs and CFGloss (the latter is currently redirected to the site's homepage). This seemed to be because the HTML structure (or the URLs? I don't actually know for sure) had changed, and these sites relied on scraping the Adobe site to get the content for their revised UI for the CF docs. These sites exist because the CF docs site is not particularly user-friendly as far as searching and navigation go. And the URLs are completely unhackable (in the good sense of hackability, e.g. "cfabort.html" would be good because one could infer other tags' pages from it; "WSc3ff6d0ea77859461172e0811cbec22c24-7fde.html" is bloody stupid), so whilst the content is OK, finding it can be a challenge sometimes.

Stop Press:

CFQuickDocs is at least partially back up again. The CF8 and CFMX7 docs work: CF9 still doesn't.

Ray pointed out that it's entirely Adobe's prerogative to change their URLs/HTML if they want, and went on to say that anyone relying on page scraping for their content needs to be aware of the instability of their chosen approach. And, accordingly, if Adobe changed their docs, it's not their fault if other sites break. I would say that, given they performed the action that caused the breakage, it is their fault. However - all things being equal - it's perhaps not something they should intrinsically lose sleep over. That's the thing, though: I don't think all things are equal here. These third-party documentation sites are providing a service to the community that works around shortcomings in Adobe's own approach, and accordingly I think they warrant some respect from Adobe. My own case in point is that I need to scrape bug content from the bugbase for my CFBugNotifier process, because there's no other way to get the info. And given Adobe don't see fit to provide notifications when bug status changes, I think I offer a community-useful service here. And if they changed the bugbase mark-up and CFBugNotifier broke, that would be a detriment to their community. And their fault.

Another consideration is that the Adobe docs - except the CF9 ones (and formerly the CF10 ones, I found out today: "ColdFusion copyright and trademarks and third-party notices") - are copyrighted. So people aren't really supposed to be copying them for their own use. Which is what page-scraping is doing. This is why my "copy" of the bugbase entries doesn't maintain the actual content; it maintains a hash of the content. That way I can quickly check whether anything has changed (if not exactly what has changed). This doesn't infringe Adobe's copyright.
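
To make the hash idea concrete, here is a rough sketch of that kind of change detection. It is not the actual CFBugNotifier code: it's in Python rather than CFML, the function names and the in-memory store are made up for illustration, and both the choice of SHA-256 and the whitespace normalisation are assumptions rather than anything stated above.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text.

    Runs of whitespace are collapsed first (an assumption), so purely
    cosmetic mark-up reflows don't register as a change.
    """
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(stored: dict, bug_id: str, current_text: str) -> bool:
    """Compare the current text against the stored fingerprint.

    Only the digest is kept in `stored`, never the scraped text itself.
    A bug that has not been seen before counts as changed.
    """
    new_fingerprint = content_fingerprint(current_text)
    changed = stored.get(bug_id) != new_fingerprint
    stored[bug_id] = new_fingerprint
    return changed


if __name__ == "__main__":
    store = {}  # hypothetical stand-in for whatever persistence is really used
    print(has_changed(store, "bug-1", "Status: Open"))      # first sighting
    print(has_changed(store, "bug-1", "Status:   Open\n"))  # whitespace only
    print(has_changed(store, "bug-1", "Status: Fixed"))     # real change
```

The trade-off is the one described above: the digest says that an entry changed, but not what changed, because the original text is never kept.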

Anyway, whatever. I think I was a little more belligerent about it than I needed to be, but I think also that Ray was a little more dismissive of the broader picture than he needed to be (whilst being technically correct). And this blog article is not about that anyhow (he says, being almost 500 words into it!).