Tuesday 4 November 2014

PHP: array_udiff()

I'm not too sure how interesting this article is gonna be, but for the last coupla days I've been trying to solve a problem using array_udiff() to filter elements from one array based on their sameyness (I made that word up, sorry) compared to elements in another array. Now I know this perhaps isn't quite the intended purpose of a diffing function - and it turned out to not be a viable solution for this reason - but the investigation turned up some "interesting" behaviour that isn't clearly documented as far as I can tell.

First things first, look at this shambles:

Those are all legit functions in PHP. One can get the general gist of what they're for, but can you infer - by their names - what the f*** all the variations are about? Of course not. Seriously: whoever decided on these function names should be shot.

Anyway, it was in the process of wading through that morass that I started to look at array_udiff(). What does array_udiff() do? It returns the difference between one base array and a number of other arrays: the result being all the elements in the first array that are not in any of the other arrays. Cool, but what's the "u" for in array_udiff(). Oh silly you, can't you tell? That means "use a callback to do the differentiation". This is as opposed to just array_diff() (note: no "u" in there) which uses PHP's internal comparator to compare elements. Of course it does, PHP.

Anyway, the callback version of this function was definitely what I needed as the arrays I had to diff were arrays of arrays, with each element representing a time frame, eg:

$timeslots = [
    ["from" => "10:00:00", "to" => "10:30:00"],
    ["from" => "11:00:00", "to" => "11:45:00"],
    ["from" => "14:00:00", "to" => "15:00:00"],
    ["from" => "16:30:00", "to" => "17:00:00"]

So that'll require code form me to compare elements.

The callback for this is along these lines:

int function comparator(element1, element2) {
    // returns -1, 0, 1 depending on whether element1 is less than, equal to or greater than element2

My initial reaction is... huh? we're filtering here (a diffing process is basically a filtering one). The elements either match or they don't: it's a boolean test. It's neither here nor there whether a non-matching element is "less than" or "greater than" to that which it's being compared. Indeed there might not be any sense of "greater than" or "less than" in the context of determining the sameiness of objects. PHP is carrying on here as if this is a sorting operation, not a filtering operation.

Still: I decided to play PHP's silly game, and in my case one could reflect the sense of greater than or less than with my comparison items, so I ran with it.

However the code didn't work. I'll spare you the detail of what I was trying to do as the code isn't great for demonstrating the issue I discovered, and I boiled it down to a less complex example which shows what PHP is doing:

// arrays.php
$first = [
    ["source"=>"first", "text"=>"I", "value"=>1],
    ["source"=>"first", "text"=>"V", "value"=>5],
    ["source"=>"first", "text"=>"II", "value"=>2],
    ["source"=>"first", "text"=>"IV", "value"=>4],
    ["source"=>"first", "text"=>"III", "value"=>3]

$second = [
    ["source"=>"second", "text"=>"fifth", "value"=>5],
    ["source"=>"second", "text"=>"third", "value"=>3],
    ["source"=>"second", "text"=>"second", "value"=>2],
    ["source"=>"second", "text"=>"seventh", "value"=>7]

// array_udiff.php

require "arrays.php";

$tally = 0;
$diff = array_udiff($first, $second, function ($e1,$e2) use (&$tally) {
    $comparison = gmp_sign($e1["value"] - $e2["value"]);
    printf('(%d) $e1: %s<br>$e2: %s; comparison: %d<hr>', ++$tally, json_encode($e1), json_encode($e2), $comparison);
    return $comparison;

echo "<pre>";
echo "</pre>";

Here I have two arrays, with each element of each itself being an array. The arrays are not directly analogous, but there can be a sense of "difference" based on the value key's value. IE: the difference I want here would contain elements 1,2 and 4 (by value) from the first array.

My comparator here simply uses the sign of the difference between the two elements. Interestingly PHP has no inbuilt sign function (like sgn() in CFML or Math.sign() in JS), one needs to use the GMP library to get that functionality. That's an odd omission from the core language.

I've also put some telemetry in there, as I had noticed some odd behaviour, and this was to help me troubleshoot stuff. I'm keeping a tally of how many iterations of the comparator are being called, and also exactly what is being compared and to what.

The output of this is:

(1) $e1: {"source":"first","text":"II","value":2}
$e2: {"source":"first","text":"V","value":5}; comparison: -1

(2) $e1: {"source":"first","text":"III","value":3}
$e2: {"source":"first","text":"II","value":2}; comparison: 1

(3) $e1: {"source":"first","text":"IV","value":4}
$e2: {"source":"first","text":"II","value":2}; comparison: 1

(4) $e1: {"source":"first","text":"I","value":1}
$e2: {"source":"first","text":"II","value":2}; comparison: -1

(5) $e1: {"source":"first","text":"IV","value":4}
$e2: {"source":"first","text":"V","value":5}; comparison: -1

(6) $e1: {"source":"first","text":"III","value":3}
$e2: {"source":"first","text":"IV","value":4}; comparison: -1

(7) $e1: {"source":"second","text":"third","value":3}
$e2: {"source":"second","text":"fifth","value":5}; comparison: -1

(8) $e1: {"source":"second","text":"seventh","value":7}
$e2: {"source":"second","text":"third","value":3}; comparison: 1

(9) $e1: {"source":"second","text":"second","value":2}
$e2: {"source":"second","text":"third","value":3}; comparison: -1

(10) $e1: {"source":"second","text":"seventh","value":7}
$e2: {"source":"second","text":"fifth","value":5}; comparison: 1

(11) $e1: {"source":"first","text":"I","value":1}
$e2: {"source":"second","text":"second","value":2}; comparison: -1

(12) $e1: {"source":"first","text":"I","value":1}
$e2: {"source":"first","text":"II","value":2}; comparison: -1

(13) $e1: {"source":"first","text":"II","value":2}
$e2: {"source":"second","text":"second","value":2}; comparison: 0

(14) $e1: {"source":"first","text":"II","value":2}
$e2: {"source":"first","text":"III","value":3}; comparison: -1

(15) $e1: {"source":"first","text":"III","value":3}
$e2: {"source":"second","text":"third","value":3}; comparison: 0

(16) $e1: {"source":"first","text":"III","value":3}
$e2: {"source":"first","text":"IV","value":4}; comparison: -1

(17) $e1: {"source":"first","text":"IV","value":4}
$e2: {"source":"second","text":"fifth","value":5}; comparison: -1

(18) $e1: {"source":"first","text":"IV","value":4}
$e2: {"source":"first","text":"V","value":5}; comparison: -1

(19) $e1: {"source":"first","text":"V","value":5}
$e2: {"source":"second","text":"fifth","value":5}; comparison: 0

array(2) {
  array(3) {
    string(5) "first"
    string(1) "I"
  array(3) {
    string(5) "first"
    string(2) "IV"

OK, the end results are fine, but WTF is going on during processing? I've highlighted three different categories of iteration there:

  • comparing an element of the first array with another element of the first array;
  • comparing two elements of the second array;
  • and doing what I only expected to see: comparing an element from the first array with one from the second array.
And this was causing errors for me because - for my actual test case - the two arrays were actually structurally different, ans my callback was coded to compare an element from the first array to an element from the second array. Not to compare the two arrays internally, as I didn't care about that and it's irrelevant to my requirements.

What the hell was going on?

After some googling, I found the answer, and it also answers why the callback can't just return a boolean, and it needs to act like a sorting comparator. It's because not only is it a filtering (boolean should be fine, as I said) comparator, it is also used a a sorting comparator. Because the two arrays are sorted before the filtering is done. I cannot find where it says this in the docs. All I could fine is this answer on Stack Overflow: "About array_udiff,I want to ask again", which I'll reproduce here:

The result you're getting is because you're using callback which returns 1 and 0. While logically it's justified (i.e. 1 means "equal" and 0 means "not equal") - PHP expects callback that return full comparison - thus, not only on equality - but also greater and less comparison.

To understand why it is so you need to understand how PHP processes calculation of arrays difference. It's not just simple walking through both arrays. PHP will sort arrays first, then will do checks. That is because simple walk (nested loop) will result in O(n*m) complexity, while sorting arrays first will result in approximately O(k log(k)), k=max(n, m) complexity for two arrays of size n and m. And to sort elements via user-defined callback it's not enough to check elements equality. You'll need to know their order, that's why less/greater comparison is needed.

So, as a conclusion: you can only use callback with full -1/0/1 values in it's result. If your callback returns something different, results will be unpredictable - that's why they may be "correct" and "wrong" on different input data.
Hmmm. Well superficially that's lovely. And it should be in the PHP docs front and centre.

Here's the problem though: this will work fine when the arrays are analogous, but depending on the complexity of the data, a callback that handles sorting can't necessarily be co-opted to also provide filtering. Also, the way comparisons (for either sorting or filtering) need to work won't intrinsically be the same, if, the arrays aren't structured the same. In my example above, the keys match so the comparison works for first-first (sorting), second-second (sorting), first-second (filtering) comparisons:

$comparison = gmp_sign($e1["value"] - $e2["value"]);

Only because it's the value key in both arrays does this work! If in one array it was "score" and in the other it was "passMark" then the code would break on the sorting. If PHP was even more jerry-built that it is, it could perhaps pass something to the callback to identify whether it's in a sorting phase or a filtering phase, and which side it's sorting if it is, and the callback could have three code blocks accordingly, but that would be a bit shit. Basically it's a huge (and wrong) assumption that the same logic can be used for all three of those phases.

Another issue is that PHP is co-opting "less than" (-1) and "greater than" (1) to mean "doesn't match", and only "equal to" (0) equates to "match" (so for a diff operation, this means "discard") isn't necessarily even correct! This sounds counter-intuitive, but the arrays I was differing where timespans, eg:

$potentialAppointments = [
    ["from" => "09:30", "to" => "10:00"], // good
    ["from" => "10:15", "to" => "10:45"], // overlaps embargo
    ["from" => "11:00", "to" => "11:30"], // overlaps embargo
    ["from" => "11:45", "to" => "12:15"], // good
    ["from" => "12:30", "to" => "13:00"], // good
    ["from" => "13:15", "to" => "13:45"], // good
    ["from" => "14:00", "to" => "14:30"], // overlaps embargo
    ["from" => "14:45", "to" => "15:15"], // overlaps embargo
    ["from" => "15:30", "to" => "16:00"], // good
    ["from" => "16:15", "to" => "16:45"], // overlaps embargo
    ["from" => "17:00", "to" => "17:30"], // good
    ["from" => "17:45", "to" => "18:15"], // good
    ["from" => "18:30", "to" => "19:00"], // good
    ["from" => "19:15", "to" => "19:45"]  // good

$embargoPeriods = [
    ["from" => "10:00:00", "to" => "10:30:00"],
    ["from" => "11:00:00", "to" => "11:45:00"],
    ["from" => "14:00:00", "to" => "15:00:00"],
    ["from" => "16:30:00", "to" => "17:00:00"]

Here the "diff" is the potential appointments which aren't covered by the embargo periods.  So an appointment might be less than an embargo period, but as it partially overlaps the embargo period, the it is considered a "match" for the purposes of diffing, so should be discarded. This situation is actually a sitter for this sort of thing, and it's very easy to determine a diff here, but the pre-sort operation makes no sense here. Granted this is an edge case, but the first case I cited where the two arrays are structured differently would be very common, I'd think? And these callback-enabled functions are what these situations are designed for.

Still: now I know, and I wish they'd bloody documented this coherently as it would have saved me about half a day's effort. But... I got a blog article out of it, I s'pose. And I now have a pretty decent understanding of how these diffing / intersecting functions work.

As for what all those other incomprehensibly-named variations of the various functions are for? I'll cover that later. I still need to come up with decent examples for each.