Wednesday 25 December 2013

File looping strategies: it doesn't seem to make a difference

G'day:
This article is more "for the sake of completeness" rather than being interesting. You'll perhaps recall I posted another boring article the other day: "So what did I do today? (Nothing interesting. Here are the details)", and I made a Peter Jackson allusion as to the merits of my writing volume (ie: sometimes more is not better). Sticking with that theme, here's the culmination of that line of investigation, which pretty much turned up nothing interesting.

A week or so ago, someone (I can't recall who... if it was you, lemme know and I'll credit you appropriately) on Twitter made the interesting observation that using <cfloop> over a file is about - I think - 10% more performant than using fileOpen() / fileReadLine() in a while loop. I find that slightly interesting as I like it when tags out perform script. Not because I prefer tags (very very far from it), but because stuff like this keeps us on our toes when it comes to exactly what's performant in CFML and what isn't. Another example of this was my investigation in to exactly how much slower is evaluate() compared to [some approach not using evaluate()] that I did a few months back: '"evalulate() is really slow". Is it now?'.

I also wondered out loud if the <cfloop> version might read the entire file into memory first, then loop over it. Which would be bad. I had no reason to think that ColdFusion might do this, other than that the ColdFusion team don't always seem to think things through before they implement them.

So, anyway, I decided to look into it. I wanted to time some stuff, which is why I made that makeStopwatch() function which the previous article was all about.

Firstly I made a large file, larger than the amount of RAM I have allocated to either ColdFusion or Railo (both have a max heap size of 512MB):

<!--- make.cfm --->
<cfflush interval="16">
<cfscript>
    CRLF = chr(13) & chr(10);
    try {
        filePath = expandPath("./data.txt");
        fileHandle = fileOpen(filePath, "append");
        for (row=1; row <= 1024 * 2048; row++){ // 2 "million" rows
            kb = "";
            for (i=1; i <= 32; i++){
                kb &= replace(createUuid(), "-", "", "ALL");
            }
            kb = left(kb, 1022) & CRLF; // 1kB rows
            fileWrite(fileHandle, kb);
            if (!row MOD 500){
                writeOutput("."); // a progress indicator
            }
        }
    } finally {
        fileClose(fileHandle);
    }
</cfscript>

This generates a 2GB file of 1kB lines of "unique" data. I probably could have just make it all "AAAAA", but oh well. This file was far to big to read into memory in one hit as this code demonstrates:

<!--- readWholeThing.cfm --->
<cffile action="read" file="#expandPath('./data.txt')#" variable="data">
<cfoutput>#len(data)#</cfoutput>

Which results in:

Railo 4.1.2.005 Error (java.lang.OutOfMemoryError)
MessageJava heap space
Causejava.lang.OutOfMemoryError
StacktraceThe Error Occurred in
C:\webroots\railo-express-4.1.x-jre-win64\webapps\www\shared\git\blogExamples\misc\filesystem\loopingOverFile\readWholeThing.cfm: line 2 
1: <!--- readWholeThing.cfm --->
2: <cffile action="read" file="#expandPath('./data.txt')#" variable="data">
3: <cfoutput>#len(data)#</cfoutput>

Well at least on Railo it does. On ColdFusion, the error is not very helpful:

The following information is meant for the website developer for debugging purposes.
Error Occurred While Processing Request


The error occurred inC:/apps/adobe/ColdFusion/10/cfusion/wwwroot/shared/git/blogExamples/misc/filesystem/loopingOverFile/readWholeThing.cfm: line 2
1 : <!--- readWholeThing.cfm --->
2 : <cffile action="read" file="#expandPath('./data.txt')#" variable="data">
3 : <cfoutput>#len(data)#</cfoutput>

Resources:
Browser  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36
Remote Address  0:0:0:0:0:0:0:1
Referrer  http://localhost:8500/shared/git/blogExamples/misc/filesystem/loopingOverFile/
Date/Time  25-Dec-13 01:47 PM
Stack Trace
at cfreadWholeThing2ecfm1912066453.runPage (
C:/apps/adobe/ColdFusion/10/cfusion/wwwroot
/shared/git/blogExamples/misc/filesystem/loopingOverFile/readWholeThing.cfm:2) 

java.lang.NegativeArraySizeException
 at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
 at java.lang.StringBuffer.<init>(StringBuffer.java:108)
 at coldfusion.tagext.io.FileUtils.readFile(FileUtils.java:967)
 at coldfusion.tagext.io.FileTag.read(FileTag.java:484)
 at coldfusion.tagext.io.FileTag.doStartTag(FileTag.java:302)
 at coldfusion.runtime.CfJspPage._emptyTcfTag(CfJspPage.java:2799)
 at cfreadWholeThing2ecfm1912066453.runPage (
C:\apps\adobe\ColdFusion\10\cfusion\wwwroot
\shared\git\blogExamples\misc\filesystem\loopingOverFile\readWholeThing.cfm:2)
 

It's only in the logs that the out of memory issue is identified (in coldfusion-out.log and exception.log it's clear; in application.log and coldfusion-error.log it has the same ambiguous errors as appeared on the screen).

Anyway the point being is that one needs to watch how big the files one is reading in lock-stock, as files get big quickly, and JVM memory allocations are finite resources. If possible, do not read entire files into memory if you don't know how big they are and that they'll not co-opt an unexpected amount of your JVM RAM.

Having built the file, I read it in and "processed" it via fileReadLine():

<!--- fileReadLine.cfm --->
<cfinclude template="./makeStopwatch.cfm">
<cfset stopwatch = makeStopwatch()>
<cfset stopwatch.start()>
<cfset fileHandle = fileOpen(expandPath("./data.txt"), "read")>
<cfloop condition="NOT fileIsEOF(fileHandle)">
    <cfset line = fileReadLine(fileHandle)>
    <cfset oneLine = line>
</cfloop>
<cfset stopwatch.stop()>
<cfdump var="#stopwatch.getTimeline()#">

I'm using tags even in this example to remove tags vs script as a contributing factor. Reading the file via this mechanism gave this sort of result:

Array
1
Struct
CURRENTCLOCK
number1387977565845
LAPDURATION
number0
MESSAGE
stringstart
TOTALDURATION
number0
2
Struct
CURRENTCLOCK
number1387977669711
LAPDURATION
number103866
MESSAGE
stringstop
TOTALDURATION
number103866

100-odd seconds to loop over the file. On another run, it might take 90sec. Or 110sec. Or closer to 200sec. That was on Railo: the results on ColdFusion were similar, but perhaps marginally faster (odd: ColdFusion is usually well slower than Railo).

Next I used <cfloop> instead:

<!--- cfloopFile.cfm --->
<cfinclude template="./makeStopwatch.cfm">
<cfset stopwatch = makeStopwatch()>
<cfset stopwatch.start()>
<cfloop index="line" file="#expandPath('./data.txt')#">
    <cfset oneLine = line>
</cfloop>
<cfset stopwatch.stop()>
<cfdump var="#stopwatch.getTimeline()#">

Results:

Array
1
Struct
CURRENTCLOCK
number1387978391409
LAPDURATION
number0
MESSAGE
stringstart
TOTALDURATION
number0
2
Struct
CURRENTCLOCK
number1387978498446
LAPDURATION
number107037
MESSAGE
stringstop
TOTALDURATION
number107037

Sometimes it was faster, sometimes slower, about the same as reading the thing line by line. Again, ColdFusion's results were across a similar range, but tended to err towards being slightly faster.

The important thing to take from this is that a <cfloop> over a file does not read the entire file first, it reads it iteratively. This is good.

And the other thing is that I put down most of the performance variation to my hard drive and other environmental considerations, more than the looping logic being more or less performant: the differences between test runs on the same code was broader than any difference between the two logic variations.

Similarly, whilst I indicated CF might have seemed to be faster than Railo, the spread in results on either had a greater variation than comparing the two. Again, I think it's more down to the disk performance than the  Railo/ColdFusion comparison.

My conclusion is that either is as good as the other.

Which is not very earthshattering, and we probably could have assumed that without 1300 words of blog article. Still: I am in a bit of a Xmas lunch postprandial state, so an easy article is definitely where my mood is at.  Well my mood is at "snooze", so that's what I'm gonna... zzzzzzz...