Adam Cameron's Dev Blog: Scraps, scrapes and soup

G'day:
Yesterday I noted - fairly "emphatically" - that changes to Adobe's ColdFusion docs site (not the new flash-harry CF10 stuff, the old stuff that they never update) have allegedly broken some community sites like cfquickdocs and CFGloss (the latter is currently redirected to the site's homepage). This seemed to be because the HTML structure (or the URLs? I don't actually know for sure) had changed and these sites relied on scraping the site to get the content for their revised UI for the CF docs. These sites exist because the CF docs site is not particularly user friendly as far as searching and navigation goes. And the URLs are completely unhackable (in that good sense of hackability, eg: "cfabort.html" would be good because one could infer other tags' pages from it; "WSc3ff6d0ea77859461172e0811cbec22c24-7fde.html" is bloody stupid), so whilst the content is OK, finding it can be a challenge sometimes.

Stop Press:

CFQuickDocs is at least partially back up again. The CF8 and CFMX7 docs work: CF9 still doesn't.

Ray pointed out that it's entirely Adobe's prerogative to change their URLs/HTML if they want, and went on to say anyone relying on page scraping for their content needs to be aware of the instability of their chosen approach. And, accordingly, if Adobe changed their docs, it's not their fault if other sites break. I would say that, given they performed the action that caused the breakage, it is their fault. However - all things being equal - it's perhaps not something they should intrinsically lose sleep over. That's the thing, though: I don't think all things are equal here. These third-party documentation sites are providing a service to the community that works around shortcomings in Adobe's own approach, and accordingly I think they warrant some respect from Adobe. My own case in point is that I need to scrape bug content from the bugbase for my CFBugNotifier process, because there's no other way to get the info. And given Adobe don't see fit to provide notifications when bug status changes, I think I offer a community-useful service here. And if they changed the bugbase mark-up and CFBugNotifier broke, that would be a detriment to their community. And their fault.

Another consideration is that the Adobe docs - except the CF9 ones (and formerly the CF10 ones, I found out today: "ColdFusion copyright and trademarks and third-party notices") - are copyrighted. So people aren't really supposed to be copying them for their own use. Which is what page-scraping is doing. This is why my "copy" of the bug-base entries doesn't maintain the actual content, it maintains a hash of the content. That way I can quickly check if anything has changed (if not exactly what has changed). This doesn't infringe Adobe's copyright.
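To make the hash-instead-of-content idea concrete, here is a minimal sketch in Java (the language JSoup itself is written in). It is not the CFBugNotifier code: the class and method names are hypothetical, and SHA-256 is an assumption, as the algorithm actually used is not stated here. It only shows the shape of the technique: store a fingerprint, then compare a fresh fingerprint against it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: keeps a fingerprint of scraped content, never the content itself.
class ContentFingerprint {

    // Returns the SHA-256 digest of the content as a lowercase hex string.
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required of every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // True if the freshly scraped content no longer matches the stored fingerprint.
    // This says THAT something changed, not WHAT changed.
    static boolean hasChanged(String storedHash, String freshContent) {
        return !storedHash.equals(sha256Hex(freshContent));
    }

    public static void main(String[] args) {
        String stored = sha256Hex("Status: Open");
        System.out.println(hasChanged(stored, "Status: Open"));
        System.out.println(hasChanged(stored, "Status: Fixed"));
    }
}
```

The trade-off is the one noted above: a fingerprint comparison is cheap and stores nothing copyrightable, but it cannot report which part of the entry changed.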

Anyway, whatever. I think I was a little more belligerent about it than I needed to be, but I think also that Ray was a little more dismissive of the broader picture than he needed to be (whilst being technically correct). And this blog article is not about that anyhow (he says, being almost 500 words into it!).

My undertaking from all this was that I was going to scrape all the doc content and save each page as JSON, and put it somewhere safe (from Adobe), and accessible to whoever wanted it. I have started to do this, and have scraped & JSONified all the pages for ColdFusion 9's tags and functions. I know I'm allowed to copy those, so have done so. I am still checking what the story is with other versions of ColdFusion.

I've put all the code and the JSON docs up on github, here: https://github.com/adamcameroncoldfusion/cfmldocs. I will continue to randomly tinker with it as time goes on. I already need to shift some of the files to a better location (the CFCs are CF-9-docs specific, but have fairly general names at the moment). The JSON files are in the cfmldocs subdir. I've structured that so other CF engines' docs can go in there too, as well as other versions of the docs within those engine spaces.

That's the scraps and the scrapes, but what's all this about soup?

In the past when I've needed to extract content from a scraped HTML doc, I've just used regexes. This is evident from the "bug updates" code, in particular here: BugbaseProxy.cfc. I'm reasonably good with regexes, and those patterns in BugbaseProxy.cfc reliably get the results I want, but I am aware that using regexes to extract anything from a DOM document is frowned upon (somewhat too rigorously IMO... a DOM document is after all just a string, whether the dogmatists like it or not).

So, anyway, I have been aware of JSoup for quite some time, and yesterday decided to use that instead of regexes to extract the various bits and pieces from the CFML documentation pages.

I'm not going to go into too much depth with this, partly because it's so bloody easy to use, and partly because it's documented up the wazoo already, so I would not be adding much by documenting my first-attempt-of-using-it fumblings (not that this has stopped me writing-up how little I know about things like Ruby, but hey). I will show you how I'm using JSoup via CFML though.

Here's the basic code I was using to extract bits and pieces from the HTML docs.

// JSoup.cfc
component {

    public JSoup function init(){

        var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("/jsoup/jsoup-1.7.2.jar")]);
        variables.jsoup = javaLoader.create("org.jsoup.Jsoup");

        return this;
    }

    public any function getJSoup(){
        return variables.jsoup;
    }

}

Firstly I've wrapped JSoup itself in a CFC. This is mostly to hide the fact I'm loading it with JavaLoader, and also if I decide to not use JavaLoader, I just need to change it in one place: here. I'm using JavaLoader because I thought I might be putting this up on my hosted CF instance, and I'm restricted as to how much access I have to Java resources, but JavaLoader works around those (Russ knows I do this, and doesn't mind).

Next I have an HtmlPage.cfc class which uses JSoup to dart off and read in a page, and "objectify" it as a Document:

// HtmlPage.cfc
component {

    variables.baseUrl    = "";
    variables.pageName    = "";

    public HtmlPage function init(){
        variables.jSoup        = new JSoup().getJSoup();
        variables.docObject    = variables.jSoup.connect(variables.baseUrl & variables.pageName).get();

        return this;
    }

}

And then I have a specialisation of an HtmlPage which is a representation of the CFML reference pages. All this means is that it knows the base URL of where they are:

// CfmlReferencePage.cfc
component extends="HtmlPage" {

    variables.adobeReferences = new AdobeCfmlReference();

    variables.baseUrl = variables.adobeReferences.baseUrl;

}

BTW, AdobeCfmlReference.cfc is just this:

// AdobeCfmlReference.cfc
component {
    this.baseUrl            = "http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/";
    this.functionIndexPage    = "WSc3ff6d0ea77859461172e0811cbec22c24-7ff8.html";
    this.tagListPage        = "WSc3ff6d0ea77859461172e0811cbec17576-7ffd.html";
    this.licence            = "http://help.adobe.com/en_US/ColdFusion/10.0/LegalNotices/index.html";
}

I've abstracted some of the Adobe-controlled values into here, so I have a single place to change them. Other stuff needs to go in here too, but that's something for another day.

So an instance of a CfmlReferencePage object knows where the CF9 CFML Reference is. Aside: I need to change the name of the CfmlReferencePage.cfc to Cf9CfmlReferencePage.cfc (or something), but I have more reorganising to do than just that, so I've not done it yet.

Next I have a DocumentationPage.cfc which knows about a single page in the docs... eg the page for <cfabort> or listFind(), and how to extract information from it. We're getting closer to some JSoup stuff now.

// DocumentationPage.cfc
component extends="CfmlReferencePage" {    // this is basically an abstract class

    variables.optionType = "";

    public DocumentationPage function init(required string pageName){
        variables.pageName = arguments.pageName;
        super.init();
        return this;
    }

    public struct function getDocumentation(){
        return {
            pageName                = getPageName(),
            description                = getDescription(),
            category                = getCategory(),
            syntax                    = getSyntax(),
            seeAlso                    = getSeeAlso(),
            history                    = getHistory(),
            "#variables.optionType#"= getOptions(),
            usage                    = getUsage(),
            example                    = getExample(),
            licence                    = variables.adobeReferences.licence
        };    
    }

    public string function getPageName(){
        return  variables.docObject.select("h1").text();
    }

    public string function getDescription(){
        return getSectionText("Description");
    }

    public string function getCategory(){
        return getSectionText("Category");
    }

    public string function getSyntax(){
        return getSectionSiblingText("Syntax");
    }

    public string function getSeeAlso(){
        return getSectionSiblingText("See Also");
    }

    public string function getHistory(){
        return getSectionSiblingText("History");
    }

    public string function getUsage(){
        return getSectionSiblingText("Usage");
    }

    public string function getExample(){
        return getSectionSiblingText("Example");
    }

    public array function getOptions(){
        var optionsSection = variables.docObject.select('h4.sectiontitle:containsOwn(#variables.optionType#s)');
        if (!arrayLen(optionsSection)){
            return [];
        }
        var optionsDocumentation = [];
        for (var tagOption in optionsSection[1].parent().select("tbody tr")){
            arrayAppend(optionsDocumentation, getOptionDetails(tagOption));
        }
        return optionsDocumentation;
    }

    private struct function getOptionDetails(required tagOption){ 
        return {};    // needs to be implemented by subclass
    }

    private string function getSectionText(required string sectionTitle){
        var sectionText = variables.docObject.select('h4.sectiontitle:contains(#sectionTitle#)+p');
        if (arrayLen(sectionText)){
            return sectionText[1].text();
        }else{
            return "";
        }
    }

    private string function getSectionSiblingText(required string sectionTitle){
        var text = "";
        var section = variables.docObject.select('h4.sectiontitle:containsOwn(#sectionTitle#)');

        if (arrayLen(section)){
            for (var elem in section[1].siblingElements()){
                text &= (elem.html() & "<br>");
            }
        }
        return text;
    }

}

Note how this is basically an abstract class? This is because tag pages and function pages differ in a coupla areas. So I specialise again to cover those. Here's TagPage.cfc:

// TagPage.cfc
component extends="DocumentationPage" {

    variables.optionType = "Attribute";

    private struct function getOptionDetails(required tagOption){ 
        var optionParts = tagOption.select("td");
        var optionDetails = {
            "#variables.optionType#"= "",
            reqOrOpt                = "",
            "default"                = "",
            description                = ""
        };
        switch (min(arrayLen(optionParts), 4)) {
            case 4 : optionDetails.description                = optionParts[4].text();
            case 3 : optionDetails["default"]                = optionParts[3].text();
            case 2 : optionDetails.reqOrOpt                    = optionParts[2].text();
            case 1 : optionDetails["#variables.optionType#"]= optionParts[1].text();
        }
        return optionDetails;
    }

}

The differences are that tags have attributes, and functions have parameters. And tag attributes have four documented elements: name, whether they're required or optional, what the default - if any - is, and a description. A function's parameters simply have a name and a description (the defaults and optionality are rolled into the description). FunctionPage.cfc, for comparison:

// FunctionPage.cfc
component extends="DocumentationPage" {

    variables.optionType = "Parameter";

    private struct function getOptionDetails(required tagOption){ 
        var optionParts = tagOption.select("td");
        var optionDetails = {
            "#variables.optionType#"        = "",
            description    = ""
        };
        switch (min(arrayLen(optionParts), 2)) {
            case 2 : optionDetails.description                = optionParts[2].text();
            case 1 : optionDetails["#variables.optionType#"]= optionParts[1].text();
        }
        return optionDetails;
    }

}

So we scrape a page like this:

// scrapetagPage.cfm
import me.adamcameron.docs.*;

tagPage = new TagPage("WSc3ff6d0ea77859461172e0811cbec22c24-7fde.html");
documentation = tagPage.getDocumentation();
writeDump([
    {pageName=documentation.pageName},
    {description=documentation.description},
    {category=documentation.category},
    {syntax=documentation.syntax},
    {seeAlso=documentation.seeAlso},
    {history=documentation.history},
    {usage=documentation.usage},
    {example=documentation.example}
]);

And that gives us this:

array

struct
PAGENAME	cfabort

struct
DESCRIPTION	Stops the processing of a ColdFusion page at the tag location. ColdFusion returns everything that was processed before the tag. The tag is often used with conditional logic to stop processing a page when a condition occurs.

struct
CATEGORY	Flow-control tags

struct

SYNTAX

<br><cfabort showError = "<i xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:xs="http://www.w3.org/2001/XMLSchema">error message</i>"><br><span class="notetitle">Note: </span>You can specify this tag’s attributes in an <samp class="codeph">attributeCollection</samp> whose value is a structure. Specify the structure name in the <samp class="codeph">attributeCollection</samp> and use the tag’s attribute names as structure keys.<br>

struct

SEEALSO

<br><samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7fe1.html">cfbreak</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7d56.html">cfexecute</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7fdd.html">cfexit</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7fe8.html">cfif</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7cac.html">cflocation</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7fe2.html">cfloop</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7fe5.html">cfswitch</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7e25.html">cfthrow</a></samp>, <samp class="codeph"><a href="WSc3ff6d0ea77859461172e0811cbec22c24-7ec6.html">cftry</a></samp>; <a href="http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec22c24-74fc.html" target="_self">cfabort and cfexit</a> in the <i xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:xs="http://www.w3.org/2001/XMLSchema">Developing ColdFusion Applications</i><br>

struct
HISTORY	[empty string]

struct

USAGE

<br>When you use the <samp class="codeph">cfabort</samp> and <samp class="codeph">cferror</samp> tags together, the <samp class="codeph">cfabort</samp> tag halts processing immediately; the <samp class="codeph">cferror</samp> tag redirects output to a specified page.<br>If this tag does not contain a <samp class="codeph">showError</samp> attribute value, processing stops when the tag is reached and ColdFusion returns the page contents up to the line that contains the <samp class="codeph">cfabort</samp> tag.<br>When you use this tag with the <samp class="codeph">showError</samp> attribute, but do not define an error page using <samp class="codeph">cferror</samp>, page processing stops when the <samp class="codeph">cfabort</samp> tag is reached. The message in <samp class="codeph">showError</samp> displays to the client.<br>When you use this tag with the <samp class="codeph">showError</samp> attribute and an error page using <samp class="codeph">cferror</samp>, ColdFusion redirects output to the error page specified in the <samp class="codeph">cferror</samp> tag.<br>

struct

EXAMPLE

<br>This example shows the use of <samp class="codeph">cfabort </samp>to stop processing. In the second example, where cfabort is used, the result never displays.<br><h3>Example A: Let the instruction complete itself</h3>  <cfset myVariable = 3>  <cfloop from = "1" to = "4" index = "Counter"> <cfset myVariable = myVariable + 1> </cfloop> <cfoutput> <p>The value of myVariable after incrementing through the loop #Counter# times is: #myVariable#</p> </cfoutput> <h3>Example B: Use cfabort to halt the instructions with showmessage attribute and cferror</h3>  <cfset myVariable = 3>  <cfloop from = "1" to = "4" index = "Counter">  <cfif Counter is 2>  <cferror type="request" template="request_err.cfm"> <cfabort showerror="CFABORT has been called for no good reason">   <cfelse> <cfset myVariable = myVariable + 1> </cfif> </cfloop> <cfoutput> <p> The value of myVariable after incrementing through the loop#counter# times is: #myVariable#</p> </cfoutput><br>

So where's the JSoup stuff?

Well here we get the contents of the H1, which I use for the pageName value:

variables.docObject.select("h1").text();

Easy. I don't have to use a regex to look for the closing </h1> tag or anything, nor worry about what other crap is between the tags: JSoup looks after that for me.

How about getting the subsection headings?

var sectionText = variables.docObject.select('h4.sectiontitle:contains(#sectionTitle#)+p');

Here sectionTitle would be something like "Description", and this gets an H4 which has a sectiontitle class on it, and which contains "Description". Having found that, I get the <p> tag that immediately follows it. The select() call returns a collection (possibly zero length), and I want the content of the first element, which is the actual description:

if (arrayLen(sectionText)){
    return sectionText[1].text();
}else{
    return "";
}

To contextualise this, an abbreviated version of the mark-up here might be:

<div>
<h4 class="sectiontitle">Description</h4>
<p>Stops the processing of a ColdFusion page at the tag location. [...]</p>
</div>

And this soupery pulls out the stuff in the <p> tag, which is the description for <cfabort>.

A more complicated example is pulling out the details of a tag attribute / function parameter (which I refer to generally as "options"):

public array function getOptions(){
    var optionsSection = variables.docObject.select('h4.sectiontitle:containsOwn(#variables.optionType#s)');
    if (!arrayLen(optionsSection)){
        return [];
    }
    var optionsDocumentation = [];
    for (var tagOption in optionsSection[1].parent().select("tbody tr")){
        arrayAppend(optionsDocumentation, getOptionDetails(tagOption));
    }
    return optionsDocumentation;
}

Here I'm getting the <h4> with the sectiontitle class which has either "Attributes" or "Parameters" (depending on whether it's a tag or a function page) as its inner text. The difference between contains() - which I used earlier - and containsOwn() is that contains() also looks at the text of sub tags within the current one, whereas containsOwn() only considers the text of the current tag itself.

Having found the heading, I actually need to look back up to its parent, and then look for a table within that parent. In this case it's the <tbody> I'm after, so I look directly for that, grabbing each row from it, and passing that to a handler. The handler for a TagPage is as follows:

private struct function getOptionDetails(required tagOption){ 
    var optionParts = tagOption.select("td");
    var optionDetails = {
        "#variables.optionType#"= "",
        reqOrOpt                = "",
        "default"                = "",
        description                = ""
    };
    switch (min(arrayLen(optionParts), 4)) {
        case 4 : optionDetails.description                = optionParts[4].text();
        case 3 : optionDetails["default"]                = optionParts[3].text();
        case 2 : optionDetails.reqOrOpt                    = optionParts[2].text();
        case 1 : optionDetails["#variables.optionType#"]= optionParts[1].text();
    }
    return optionDetails;
}

Here I pull the <td> elements out of the passed-in row, and record the values of each of the first four TDs as name, reqOrOpt, default and description respectively. I grab them last to first as it's easier to write the array-len-handling code with the switch that way (note there are no breaks, so if the array has four or more entries, it'll run all four of those cases, and so forth).
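The fall-through trick is easier to see in isolation. Below is a rough Java sketch of the same idea, not a port of the CFC: the class name and the sample cell values are made up for illustration, and the indexes are zero-based where CFML's are one-based. The switch enters at the case matching the (capped) cell count and, having no breaks, runs every case beneath it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the TagPage handler, working on plain strings
// rather than JSoup elements.
class AttributeRow {

    static Map<String, String> optionDetails(List<String> cells) {
        // Every key starts out empty, so short rows still yield a full struct.
        Map<String, String> details = new LinkedHashMap<>();
        details.put("Attribute", "");
        details.put("reqOrOpt", "");
        details.put("default", "");
        details.put("description", "");

        // Deliberately no break statements: execution enters at the matching
        // case and falls through every case below it. Zero cells matches no case.
        switch (Math.min(cells.size(), 4)) {
            case 4: details.put("description", cells.get(3));
            case 3: details.put("default", cells.get(2));
            case 2: details.put("reqOrOpt", cells.get(1));
            case 1: details.put("Attribute", cells.get(0));
        }
        return details;
    }

    public static void main(String[] args) {
        // Made-up rows: one complete, one with only two cells.
        System.out.println(optionDetails(Arrays.asList("name", "Optional", "none", "Some text")));
        System.out.println(optionDetails(Arrays.asList("name", "Required")));
    }
}
```

A row with two cells fills in only the name and reqOrOpt keys and leaves the other two as empty strings; a row with five or more cells is treated as if it had four.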

The logic here is a bit curious (I am not entirely convinced by this approach, I have to say), but the JSoup bit to observe is that one can call select() again on the elements returned by a previous select(), chaining them together. Much like jQuery.

Selectors can get very complex very quickly. Take this one:

var listingElements = variables.docObject.select("##inner_content_table ul.navlinklist>li>a:matchesOwn(^Functions\s[a-z](?:-[a-z])?$)");

This works on the function index page in the docs, which has this content:

ColdFusion Functions

The following tables list and categorize ColdFusion Markup Language (CFML) functions.

New Functions in ColdFusion 9 and ColdFusion 9.0.1

Functions by category

Function changes since ColdFusion 5

Functions a-b

Functions c-d

Functions e-g

Functions h-im

Functions in-k

Functions l

Functions m-r

Functions s

Functions t-z

Here I only want to pull out the A-Z links, so my selector is doing this:

Selection fragment	Explanation
#inner_content_table	The element with the ID inner_content_table
ul.navlinklist>li>	A <ul> with class navlinklist, whose direct children are <li> tag(s), whose direct children in turn are...
a:matchesOwn()	An <a> tag which has the text matched by the regex
^Functions\s[a-z](?:-[a-z])?$	A string that starts with "Functions", then has a space, a letter, and zero or one additional letter that follows a -. (eg: "m-r", or just "s", as per the example above).
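The regex part of that selector can be tried on its own, away from JSoup. This is a small Java sketch which assumes JSoup hands the pattern to java.util.regex unchanged and looks for a match within each link's own text; the class name and the RELAXED pattern are hypothetical additions for comparison, not part of the scraper. The link texts are the ones from the index page listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FunctionIndexFilter {

    // The pattern from the selector, escaped for a Java string literal.
    static final Pattern AS_WRITTEN = Pattern.compile("^Functions\\s[a-z](?:-[a-z])?$");

    // Hypothetical looser variant (an assumption, not from the original code):
    // allows more than one letter on either side of the hyphen.
    static final Pattern RELAXED = Pattern.compile("^Functions\\s[a-z]+(?:-[a-z]+)?$");

    // Link texts as they appear in the function index listing.
    static final List<String> LINK_TEXTS = Arrays.asList(
        "New Functions in ColdFusion 9 and ColdFusion 9.0.1",
        "Functions by category",
        "Function changes since ColdFusion 5",
        "Functions a-b", "Functions c-d", "Functions e-g",
        "Functions h-im", "Functions in-k", "Functions l",
        "Functions m-r", "Functions s", "Functions t-z");

    // Returns the texts in which the pattern finds a match.
    static List<String> matching(Pattern pattern, List<String> texts) {
        List<String> hits = new ArrayList<>();
        for (String text : texts) {
            if (pattern.matcher(text).find()) {
                hits.add(text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matching(AS_WRITTEN, LINK_TEXTS));
        System.out.println(matching(RELAXED, LINK_TEXTS));
    }
}
```

Run against the listing exactly as shown above, the pattern as written picks up seven of the nine A-Z links: "Functions h-im" and "Functions in-k" have a two-letter end to their range, so the single [a-z] on each side of the hyphen does not match them. Whether that matters depends on what the live page's link text actually is; if it is as listed, something like the RELAXED variant would catch all nine while still excluding the three non-alphabetical links.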

So here we're mixing selectors and regular expressions for text values. This is pretty bloody cool.

There are mountains of other selector / traversal methods to use... my requirements here were pretty straightforward, so the above example was as complex as I needed to get.

The full docs are on the JSoup site.

That's about it. It's Sunday afternoon, so I think I shall watch a black-and-white movie. Then maybe get onto the work I was supposed to do this weekend!

--
Adam

Sunday, 15 September 2013
