Sunday, 12 August 2012

Complex data-types in CF, and how they're not copied by reference

G'day
I've got part 2 of the array discussion underway, but I'm just going to digress slightly to write some notes up about how CFML deals with assigning and passing-around complex objects.

Again there's probably nothing new here for most bods out there, but hopefully it'll be useful to some people.

OK, so first up, there are a number of data types in CFML:
  • string (including lists)
  • date
  • boolean
  • numeric (including both integer and floating point)
  • array
  • struct
  • query
  • XML
  • objects (specifically native CFML ones, ie: component instances)
  • [others]
Included in "[others]" is stuff like file handles, images, spreadsheets, FTP connections and so on... there's a whole bunch of them, and I won't try to list all of them because I'll forget something, which'll mean someone will need to add a comment to point this out. So I'm just saving us all some time.



The first four are all "simple values", and the rest are "complex values". Basically a simple value is one that has only one part to its value; the complex ones are multi-part. For example an array has a number of elements, a struct has a bunch of key/value pairs, etc.

I'm not going to say much about simple values in this, other than that they're all - clearly - copied by value. Given this code:

s1 = "Tahi, TWO, toru, wha";
s2 = s1;

s1 = replace(s1, "TWO", "rua");

writeDump(variables);


We predictably get this output:

struct
  S1   Tahi, rua, toru, wha
  S2   Tahi, TWO, toru, wha

IE: after assigning s2 the value of s1, s1 and s2 are completely different values. So changing s1 does not impact s2 (even though in this case it'd've been convenient if it had).

Things get a bit more complicated with complex objects. To demonstrate, let's run some code that's much the same as the code above, except using a struct:

st1 = {
    one     = "Tahi",
    two     = "TWO",
    three   = "Toru",
    four    = "Wha"
};
st2 = st1;

st1.two = "Rua";

writeDump(variables);


This outputs:

struct
  ST1
    struct
      FOUR    Wha
      ONE     Tahi
      THREE   Toru
      TWO     Rua
  ST2
    struct
      FOUR    Wha
      ONE     Tahi
      THREE   Toru
      TWO     Rua

So that's different. Changing the value of a key in st1 also changes it in st2. Some clever people will nod knowingly and go "yeah, it's because structs are copied by reference in CFML". Sorry, but no they're not.

Structs are still copied by value, but unlike simple values, wherein the "value" is the contents of the string (or the date, etc), for structs (and most other complex objects) the value being copied is a reference.

So how's that different from copying by reference? It sounds the same.

To be honest, I understand what's going on, but I find it difficult to articulate. And I'm not 100% sure of the absolute vagaries, but I reckon it's something like this:

Copy by value:
Variable   Reference ID   Memory address
st1        @12345678      0x00001234
st2        @9ABCDEF0      0x00001234


Copy by reference:
Variable   Reference ID   Memory address
st1        @12345678      0x00001234
st2        @12345678      0x00001234

(I've made the reference ID and memory addresses look all hexadecimal and stuff for illustrative purposes).

You can see in both examples they all point to the same location of memory, but in the copy-by-value example the references are actually different. st2 is a new reference, it's just pointing to the same memory location as st1. Whereas in the latter example, st1 and st2 both point to exactly the same reference (and, accordingly, the same location in memory).
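Sticking with plain assignments (the function-argument version is in the "One last thing" section), here's a minimal sketch of what those two tables predict. The variable contents are just illustrative, and I'm assuming an engine that supports struct literals:

```cfml
st1 = {one = "Tahi"};
st2 = st1;               // st2 is a new reference pointing at the same struct

st2.two = "Rua";         // change the struct via st2...
writeDump(st1);          // ...and st1 has both ONE and TWO: same bit of memory

st2 = {one = "Ichi"};    // now point st2 at a brand new struct
writeDump(st1);          // st1 still has Tahi / Rua: only st2's reference changed
```

If st2 really were the same reference as st1, that last assignment would have replaced st1 as well.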

For almost all situations you'll encounter in CF the end results are the same, but I've got an example later on that demonstrates the difference.

Now... having said that, for the rest of this article I am going to use the term "copy by reference" for the sake of brevity. I actually mean "copy by reference value".

Now the gist of this article is to demonstrate how the various CFML complex object types behave when being assigned or passed.

I'm going to start with the odd-one-out. Arrays.

Arrays

Arrays are complex data types, but for historical reasons (and - IMO - a bad decision on the part of Macromedia's CF team) arrays are actually passed by value in ColdFusion.

Here's some illustrative code:
a1 = ["Tahi", "TWO", "Toru", "Wha"];
a2 = a1;

a1[2] = "Rua"; // oops, got one of the values wrong: fix it

writeDump(variables);

And the output:

struct
  A1
    array
      1   Tahi
      2   Rua
      3   Toru
      4   Wha
  A2
    array
      1   Tahi
      2   TWO
      3   Toru
      4   Wha

If arrays were copied by reference, then one would expect a2 to have been "fixed" too. But because assigning a1 to a2 makes a value-copy, a1 and a2 are thereafter completely distinct from each other, so the "fix" is not propagated to a2.
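The same applies when an array is passed into a function rather than just assigned. This is a sketch only: fixSecond() is a made-up name, and I'm assuming ColdFusion with its default settings (Railo/Lucee will differ, given they treat arrays differently):

```cfml
// hypothetical helper: tries to "fix" the second element of the passed-in array
function fixSecond(a){
    arguments.a[2] = "Rua";
    return arguments.a;
}

a1 = ["Tahi", "TWO", "Toru", "Wha"];
a2 = fixSecond(a1);

writeDump(a1);    // on ColdFusion the function got a copy, so a1[2] should still be "TWO"
writeDump(a2);    // the returned array has the fix
```

So on ColdFusion, a function that modifies an array needs to return it, and the calling code needs to use the returned value.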

What's completely weird is that ColdFusion's array functions all seem to actually pass the input array by reference, as demonstrated here:

a1 = ["Tahi", "Rua", "Toru"];
arrayAppend(a1, "Wha");

writeDump(variables);

Resulting in:

struct
  A1
    array
      1   Tahi
      2   Rua
      3   Toru
      4   Wha

So the array itself isn't returned (arrayAppend() just returns a boolean): the passed-in array is simply modified "inline", implying it's passed by reference. Seriously? I do not know what Allaire / Macromedia were thinking by implementing arrays the way they did. That's partly pithy derision, partly an admission I actually don't know... there might've been a good reason for it. I doubt there actually was though.

Notice I specifically said "ColdFusion" before in the first para, when describing that arrays are passed by value. If I were to run that first example on Railo or Lucee, I'd get slightly different results.

WARNING:
Those that have any sort of sense of the aesthetic might want to look away now, because I'm about to show you a <cfdump> from Railo (and they look frickin' ghastly).

Here goes...


Scope
  A1
    Array
      1   string   Tahi
      2   string   Rua
      3   string   Toru
      4   string   Wha
  A2
    Array
      1   string   Tahi
      2   string   Rua
      3   string   Toru
      4   string   Wha

Once your eyes recover from that visual assault, you'll be able to note how a2 has also been updated with the "fix" to the element at index 2. This is because Railo decided that ColdFusion's behaviour here was daft, so they "fixed" it in Railo (discussion here).

On OpenBD, the code runs the same as on ColdFusion.

For the rest of this doc, one can assume the results are the same on all three of ColdFusion, Railo/Lucee and OpenBD, other than where I indicate otherwise.

If one really wanted to pass arrays by reference value in CFML, there is a way to kind of shoe-horn it in. Ben Nadel wrote an article a while back which covered how an ArrayList is passed by reference, so one could just use an ArrayList instead. I've slightly fine-tuned his approach here: one can use a CF array until the last moment, then turn it into an ArrayList when the pass-by-reference-value is needed:

a1 = ["Tahi", "TWO", "Toru", "Wha"];
a1 = createObject("java", "java.util.ArrayList").init(a1);
a2 = a1;

a1[2] = "Rua";

writeDump(variables);

And this outputs:
struct
  A1
    array
      1   Tahi
      2   Rua
      3   Toru
      4   Wha
  A2
    array
      1   Tahi
      2   Rua
      3   Toru
      4   Wha

Which demonstrates that the "array" is indeed being assigned by its reference value, instead of its actual value. I am not sure of the merits of doing this, but it's there to do should the need arise.

Structs

Structs do the whole "copying by reference value" thing. Here's some sample code:
st1 = {
    one     = "Tahi",
    two     = "TWO",
    three   = "Toru",
    four    = "Wha"
};
st2 = st1;

st1.two = "Rua";

writeDump(variables);

And the results:

struct
  ST1
    struct
      FOUR    Wha
      ONE     Tahi
      THREE   Toru
      TWO     Rua
  ST2
    struct
      FOUR    Wha
      ONE     Tahi
      THREE   Toru
      TWO     Rua

This demonstrates that the refs for both st1 and st2 are pointing to the same struct in memory.

I'm going to sidetrack slightly into some code that demonstrates exactly the same thing, but it threw me when I first encountered it, despite being obvious what's going on.

st1 = {
    inner1 = {
        one = "Tahi"
    }
};
st1.inner2 = st1.inner1;    // so those two references are pointing to the same struct in memory

st2 = st1;

st1.inner1.two = "Rua";

writeDump(variables);        // note that TWO has been set into both inner1 and inner2


st3 = duplicate(st1);        // make a proper value-based copy; so completely different references pointing to different bits of memory

st3.inner1.three = "Toru";

writeDump(variables);        // inner2 also had a copy of "THREE"

The first dump shows:

struct
  ST1
    struct
      INNER1
        struct
          ONE   Tahi
          TWO   Rua
      INNER2
        struct
          ONE   Tahi
          TWO   Rua
  ST2
    struct
      INNER1
        struct
          ONE   Tahi
          TWO   Rua
      INNER2
        struct
          ONE   Tahi
          TWO   Rua

That's all predictable. After the duplicate(), though, I expected all the substructs to be discrete entities. So adding a THREE to st3.inner1 oughtn't also add it to st3.inner2. But it does:

struct
  ST1
    struct
      INNER1
        struct
          ONE   Tahi
          TWO   Rua
      INNER2
        struct
          ONE   Tahi
          TWO   Rua
  ST2
    struct
      INNER1
        struct
          ONE   Tahi
          TWO   Rua
      INNER2
        struct
          ONE   Tahi
          TWO   Rua
  ST3
    struct
      INNER1
        struct
          ONE     Tahi
          THREE   Toru
          TWO     Rua
      INNER2
        struct
          ONE     Tahi
          THREE   Toru
          TWO     Rua

After a while of doubting my sanity, it occurred to me that the duplicate() had made everything discrete from a memory point of view, but the two references st3.inner1 and st3.inner2 - whilst being different from their counterparts in st1 & st2 - still pointed to the same piece of memory (a different piece of memory from the other two structs, but the same as each other). So I squinted and went "oh yeah... makes sense I guess". And moved on.

Note: there appears to be a bug in Railo here, in that the output of the st3 is:

Struct
  inner1
    Struct
      one     string   Tahi
      THREE   string   Toru
      TWO     string   Rua
  INNER2
    Struct
      one     string   Tahi
      TWO     string   Rua

That's not right (OpenBD is also wrong here, in the same way).
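Going back to the duplicate() example: if the intent is for inner1 and inner2 to be discrete structs in the first place, one option is to duplicate() the inner struct at the point it's assigned. A minimal sketch, reusing the same variable names:

```cfml
st1 = {
    inner1 = {
        one = "Tahi"
    }
};
st1.inner2 = duplicate(st1.inner1);    // inner2 is now its own struct, not a second reference to inner1

st1.inner1.two = "Rua";

writeDump(st1);    // TWO shows up in inner1 only
```

Whether that's what one wants depends on whether the two keys are supposed to stay in sync or not.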

Here's an example showing that it's not just assignments and passings that point to the same bit of memory. The rather useful structFindKey() returns an array of structs which all reference the original struct, so manipulating the result from the function call also manipulates the source struct:

st1 = {
    one     = "Tahi",
    two     = "TWO",
    three   = "Toru",
    four    = "Wha"
};

a = structFindKey(st1, "two");
a[1].owner.two = "Rua";

writeDump(variables);

struct
  A
    array
      1
        struct
          owner
            struct
              FOUR    Wha
              ONE     Tahi
              THREE   Toru
              TWO     Rua
          path    .two
          value   TWO
  ST1
    struct
      FOUR    Wha
      ONE     Tahi
      THREE   Toru
      TWO     Rua

Note how I change a[1].owner.two, but that change is reflected in st1 as well. That's quite handy.

Less handy... the pretty much useless structCopy() just messes stuff up. Have a look at this code & output:

st1 = {
    one      = "Tahi",
    two      = "TWO",
    inner    = {
        three   = "THREE",
        four    = "Wha"
    }
};
st2 = structCopy(st1);

st1.two         = "Rua";
st1.inner.three = "Toru";

writeDump(variables);

struct
  ST1
    struct
      INNER
        struct
          FOUR    Wha
          THREE   Toru
      ONE   Tahi
      TWO   Rua
  ST2
    struct
      INNER
        struct
          FOUR    Wha
          THREE   Toru
      ONE   Tahi
      TWO   TWO

structCopy() does some sort of weird / hybrid / neither-use-nor-ornament kind of copy wherein the top level items are discrete, but the rest of the thing is not. So the code "fixes" the THREE/Toru, but not the TWO/Rua. I have no idea why one would want to copy a struct like this.
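For comparison, here's a small sketch putting structCopy() side by side with duplicate(), which copies the nested struct as well. The variable names shallow and deep are mine, just for illustration:

```cfml
st1 = {
    two      = "TWO",
    inner    = {
        three   = "THREE"
    }
};
shallow = structCopy(st1);    // top-level keys are copied; shallow.inner still references st1.inner
deep    = duplicate(st1);     // everything is copied, nested struct included

st1.inner.three = "Toru";

writeOutput(shallow.inner.three);    // Toru
writeOutput(deep.inner.three);       // THREE
```

So if a properly independent copy is what's needed, duplicate() is the one to reach for.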

Lastly (as far as structs go), here's a demo of structs being passed rather than just being assigned:

st1 = {
    one     = "Tahi",
    two     = "TWO",
    three   = "Toru",
    four    = "Wha"
};
st2 = f(st1);

function f(st){
    st.two = "Rua";
    return st;
};

writeDump(variables);

I'll dispense with the output this time, because you get the idea, I think.

Queries

Queries behave the same way as structs do, so I'll be quick with this one:

q1 = queryNew("digit,maori", "Integer,Varchar", [
    [1, "tahi"],
    [2, "TWO"],
    [3, "Toru"],
    [4, "Wha"]
]);
q2 = q1;

q1.maori[2] = "Rua";

writeDump(variables);

struct
  Q1
    query
          DIGIT   MAORI
      1   1       tahi
      2   2       Rua
      3   3       Toru
      4   4       Wha
  Q2
    query
          DIGIT   MAORI
      1   1       tahi
      2   2       Rua
      3   3       Toru
      4   4       Wha

One of the new syntactical features of ColdFusion 10 is that one can now load data into a query straight in the queryNew() expression (as per above), as well as in queryAddRow(). That's a cool addition to the language. An example of using queryAddRow() like this is below:

q1 = queryNew("digit,maori", "Integer,Varchar", [
    [1, "tahi"],
    [2, "Rua"],
    [3, "Toru"],
    [4, "Wha"]
]);

st = {
    digit = 5,
    maori    = "FIVE"
}; 
a = [st];

queryAddRow(q1, a);

st.maori = "Rima";

writeDump(variables);

Note what I'm testing here: I'm doing the old "update the struct" trick, to see whether it propagates into the query as well, wondering if the query data might somehow still use the struct's reference. No. And now that I think about it, it was a pretty daft thing to try. Oh well. Here's the result anyways ('cos, like, we're not sick of dumps saying "tahi, rua, toru, wha" yet, eh?):

struct
  A
    array
      1
        struct
          DIGIT   5
          MAORI   Rima
  Q1
    query
          DIGIT   MAORI
      1   1       tahi
      2   2       Rua
      3   3       Toru
      4   4       Wha
      5   5       FIVE
  ST
    struct
      DIGIT   5
      MAORI   Rima

There's nothing more to say about queries here. Pretty dull.

Oh... Railo supports this new syntax (and works the same as CF does), but it seems OpenBD does not: it just errors with that code.

XML

XML also works via reference, as demonstrated here:

<cfxml variable="x1">
    <numbers>
        <one>Tahi</one>
        <two>TWO</two>
        <three>Toru</three>
        <four>Wha</four>
    </numbers>
</cfxml>
<cfset x2 = x1>

<cfset x1.numbers.two.xmlText = "Rua">

<cfdump var="#variables#">

struct
  X1
    xml document [short version]
      numbers
        XmlText
        one
          XmlText   Tahi
        two
          XmlText   Rua
        three
          XmlText   Toru
        four
          XmlText   Wha
  X2
    xml document [short version]
      numbers
        XmlText
        one
          XmlText   Tahi
        two
          XmlText   Rua
        three
          XmlText   Toru
        four
          XmlText   Wha

Handily - and similar to what we saw with structFindKey() before - xmlSearch() also returns an array of XML nodes which actually reference the original XML doc, so we can do this:

<cfxml variable="x1">
    <numbers>
        <one>Tahi</one>
        <two>TWO</two>
        <three>Toru</three>
        <four>Wha</four>
    </numbers>
</cfxml>
<cfset a = xmlSearch(x1, "/numbers/two/")>

<cfset a[1].xmlText = "Rua">

<cfdump var="#variables#">

<cfoutput>#a[1].getClass().getName()#</cfoutput>

struct
  A
    array
      1
        xml element
          XmlName         two
          XmlNsPrefix
          XmlNsURI
          XmlText         Rua
          XmlComment
          XmlAttributes   struct [empty]
          XmlChildren
  X1
    xml document [short version]
      numbers
        XmlText
        one
          XmlText   Tahi
        two
          XmlText   Rua
        three
          XmlText   Toru
        four
          XmlText   Wha
org.apache.xerces.dom.DeferredElementNSImpl

Note that last line just outputs the underlying datatype of the objects returned by xmlSearch().

Here's another trap for young players (I'm not young now, but I was when this tricked me... OK, well I wasn't even young then. So it's a trap for... err... daft people, I guess).

Have a look at this code, which is much the same as above:

<cfxml variable="x1">
    <numbers>
        <one>Tahi</one>
        <TWO>TWO</TWO>
        <three>Toru</three>
        <four>Wha</four>
    </numbers>
</cfxml>
<cfset a = xmlSearch(lcase(x1), "/numbers/two/")>

<cfset a[1].xmlText = "Rua">

<cfdump var="#variables#">

The difference here is that I'm lower-casing the XML, because I want to use a case-insensitive XPath look-up. One can do this with XPath 1.0, but it's a bit of a hack, so often when I just want to find stuff, and don't need to use the values, I just lcase() the XML. I seem to recall that CF10 uses XPath 2.0 now, so I could just use a case-insensitive look-up without messing about. I still have to look at XPath 2.0, and will probably write something about that at a later date.

However this was outputting:

struct
  A
    array
      1
        xml element
          XmlName         two
          XmlNsPrefix
          XmlNsURI
          XmlText         Rua
          XmlComment
          XmlAttributes   struct [empty]
          XmlChildren
  X1
    xml document [short version]
      numbers
        XmlText
        one
          XmlText   Tahi
        TWO
          XmlText   TWO
        three
          XmlText   Toru
        four
          XmlText   Wha

Note how it's changed the xmlText in the search result, but hasn't updated the XML doc. It took me ages to twig why I was being a muppet here.

When I do the lcase() call, I'm no longer using the original XML document: as lcase() is a string function, it passes copies of the value around, so the XML I'm doing the search on is not x1. It's a copy of it (well: it's a copy of it that's been cast to a string, then copied again, then turned back into an XML doc).
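For what it's worth, one way to keep searching the original document is the XPath 1.0 hack of lower-casing the node name inside the expression with translate(). This is a sketch only: I'm using xmlParse() on a cut-down document instead of <cfxml> so it fits in CFScript, and I'm assuming the engine's XPath processor supports translate() and name() (both are standard XPath 1.0 functions):

```cfml
x1 = xmlParse("<numbers><one>Tahi</one><TWO>TWO</TWO></numbers>");

// lower-case each child element's name within the XPath itself, then compare
xpath = "/numbers/*[translate(name(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = 'two']";

a = xmlSearch(x1, xpath);    // searching x1 itself, not a lower-cased copy
a[1].xmlText = "Rua";

writeDump(x1);    // the TWO element in x1 should now contain "Rua"
```

Because the search runs against x1 rather than a string copy of it, the nodes that come back should still reference the original document.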

OK. Getting close to the end now. Cheers for sticking with me thus far.

Objects

Here's some code to demonstrate objects are passed by object reference value too:

// C.cfc
component {

    structAppend(
        THIS,
        {
            one     = "Tahi",
            two     = "TWO",
            three   = "Toru",
            four    = "Wha"
        }
    );
}

// component.cfm
o1 = new C();
o2 = o1;

o1.two = "Rua";

writeDump(o1);
writeDump(o2);    // doing them separately to prevent CF "helpfully" suppressing o2's dump because it's the same object as o1


Running component.cfm yields this:

component shared.CF.data_types.assignment.C
  ONE     Tahi
  THREE   Toru
  FOUR    Wha
  TWO     Rua
component shared.CF.data_types.assignment.C
  ONE     Tahi
  THREE   Toru
  FOUR    Wha
  TWO     Rua

Perfect.

(NB: I did not test this on OpenBD because it does not support CFScript-only CFCs, so the code didn't work. I could not be bothered refactoring it to demonstrate what I strongly suspect to be the case: OpenBD performs the same here..?)

One last thing


Here's a demonstration of how CFML (any flavour) doesn't actually pass things by reference. Consider this code:

st1 = {
    one     = "Tahi",
    two     = "Rua",
    three   = "Toru",
    four    = "Wha"
};
st2 = f(st1);

function f(st){
    st = {
        one     = "Ichi",
        two     = "Ni",
        three   = "San",
        four    = "Shi"
    };
    return st;
};

writeDump(st1);
writeDump(st2);

This yields:
struct
  FOUR    Wha
  ONE     Tahi
  THREE   Toru
  TWO     Rua
struct
  FOUR    Shi
  ONE     Ichi
  THREE   San
  TWO     Ni

If st1 were actually being passed by reference, then st (arguments.st, inside the function) would be exactly the same reference. So if we then assigned it a different value, that would be assigning st1 the same value as well. Which is not the case. The reference passed into f() is a new reference which happens to point at the same memory location as st1. However, when we reassign st, we're pointing it to a new memory address. But st1 still points at the old one.

Hence in CFML, complex objects are not passed by reference.

Make sense?

This ended up being way longer than I expected, but there's a lot of code and dumpery going on amongst the narrative. I hope it was a bit useful / interesting for some people. I will admit it actually formalised a few things in my head as I typed this stuff in, so it was good for me if nothing else.

Dinner time.

--
Adam