Sunday 16 November 2014

Weekend code puzzle: my "answer" (PHP version... which doesn't quite work)

G'day:
Again, apologies for the delay here. I had a busy week and then yesterday was sidelined due to needing to go to the pub to watch the rugby. Still: better late than never... here's the beginning of my follow-up to "Something for the weekend? A wee code puzzle (in CFML, PHP, anything really...)".

First things first: thanks to everyone who entered:

I have not looked at any of the code yet, so as to not influence my own design. But I will be getting to all the separate entries in a subsequent article.

I was moderately pleased with my PHP progress for my first round of dev on this. I was on an hour-and-a-bit flight from London to Shannon, and managed to bang out the first four versions of my solution whilst on the plane. I have to admit I did not use TDD on this first round, because I was offline and I did not have PHPUnit installed on the machine I was using. I think it would have sped things up overall had I done so though, as it'd've been nice to have some regression testing as I implemented the various permutations of the solution.

My first attempt was to just take the brute force approach, and simply to get something working. This is the one aspect of TDD I did follow: get it working, then polish it up and refactor it once you know the logic is sound, and you can deliver something.

// getSubseries.php

$series = [100,300,100,50,50,50,50,50,500,200,100];
$threshold = 500;

$subseries = getSubseries1($series, $threshold);

print_r($subseries);

function getSubseries($series, $threshold)
{
    $subseries = [];
    $working = [];
    array_walk($series, function ($value) use ($threshold, &$subseries, &$working) {
        if ($value > $threshold){
            $working = [];
        } elseif (array_sum($working) + $value <= $threshold){
            $working[] = $value;
        }elseif(array_sum($working) + $value - $working[0] <= $threshold){
            array_shift($working);
            $working[] = $value;
        }else{
            $working = [$value];
        }
        if (count($working) > count($subseries)){
            $subseries = $working;
        }
    });
    return $subseries;
}

This simply tests the baseline for requirement. This is a bit leaden as it has too many conditionals, which I was able to pare back in subsequent attempts. But here my algorithm for finding the best subsequence is:
  • loop over the array;
  • if the current element is higher than the threshold, start a new "working" array;
  • or if the current element when appended to the working array would yield a sum within the threshold, append it;
  • or if we got rid of the first item of the array that'd be the case, do that;
  • otherwise we start a new working array with that one element.
  • Now if the length of the working array is longer than the previously best one, use that as the best one
Oh, note that at this stage I was using the original rules for the code puzzle, which did not have the modification to look for the highest summed sub-sequence if there are more than one of the same length. This version is entirely length-based, so will match only the first longest sequence. This was what the original challenge on Stack Overflow called for.

This works, but as I said it's a bit leaden. And also has a bug in it I didn't notice until I did do some reverse-engineered TDD.

Next I fine-tuned the logic a bit to reduce the need for some of the conditions due to rearranging some of the operations:

function getSubseries($series, $threshold)
{
    $subseries = [];
    $working = [];
    array_walk($series, function ($value) use ($threshold, &$subseries, &$working) {
        $working[] = $value;
        if (array_sum($working) <= $threshold) {
            return $subseries = count($working) > count($subseries) ? $working : $subseries;
        }
        array_shift($working);
    });
    return $subseries;
}


So now I'm just:

  • appending to the working subseries irrespective of whether it's sum will be within the threshold.
  • If we are within the threshold I return either the working array or the current best one depending on which of the two is the better.
  • If we're still going, the working subseries is over the threshold, so lop the first element off. The theory being on subsequent iterations we'll end up with the appropriate best subseries.

This passes my tests, but I don't think it actually "works" as there will be a real likelihood that we'll end up with the best subseries being lost in the "middle" of the working subseries, whilst the working subseries as a whole is over-threshold.

The next iteration is just the same logic, except using foreach() instead of array_walk().

function getSubseries($series, $threshold)
{
    $subseries = [];
    $working = [];
    foreach ($series as $value){
        $working[] = $value;
        if (array_sum($working) <= $threshold) {
            $subseries = count($working) > count($subseries) ? $working : $subseries;
            continue;
        }
        array_shift($working);
    }
    return $subseries;
}

Most PHPers love their foreach() loops, so this will likely be the approach we'd see most often (my observation about the logic error in this notwithstanding). I still prefer the array_walk() version as it slightly more elegant to me: returning from the callback instead of continue-ing the loop seems nicer. There's not much in it. And I'm sure if performance was really important then the foreach() version would be marginally faster. It'd be much nicer if PHP had a more thoughtful closure implementation too, and one didn't need to enclose variables "by hand".

Lastly I decided that array_walk() was not the best way to do this, and the whole thing is a reduction operation:

function getSubseries4($series, $threshold)
{
    $working = [];
    return array_reduce($series, function ($reduction, $current) use ($threshold, &$working) {
        $working[] = $current;
        if (array_sum($working) <= $threshold) {
            return count($working) > count($reduction) ? $working : $reduction;
        }
        array_shift($working);
        return $reduction;
    }, []);
}

Here the idea is that I am reducing the main array to the best subseries. Each iteration I append a value to the "$working" subseries, and if it's better than the previous one ($reduction), I replace $reduction with the $working subseries. And I still have the same logic error with how I am shifting-out the first element of the array.

This reduction approach is the best approach of the lot, and it's also shortest!

However - as I said - this function is no good as it has that logic error in it. I did not actually discover this until I was doing the CFML version, which also included the rule-variation for returning the subseries with the highest sum, as well as the longest length.

BTW, I ended up doing the TDD side of things by hand, as I was still on a machine with no PHPUnit. Fortunately PHP has an assert() function built in, so I just used that.

If you want to scan through the reimplementation of the solution using TDD, I have a coupla files for you to look at (I'll not include them here). GetSubseriesTest.php details the unit tests I created in sequence whilst doing TDD on my revised answer. And getSubseries.php includes each iteration of the test as I TDDed it. Only the last/final iteration is uncommented, but all the previous iterations are there, commented out, showing how the function progressed during TDD.

I will have a quick look at how the assert() function in PHP works. First one sets some options:

assert_options(ASSERT_CALLBACK, function ($script, $line, $expression, $message) {
    printf("(%d) %s\n", $line, $message);
    exit;
});

Here I'm telling PHP that assert() will run my callback when the assertion fails (the expression is false). I simply print out the line and the message when the assertion fails. Using exit here prevents PHP from also issuing a warning.

And using it looks like this:

assert(is_array($result), "Should return an array");

The first argument is a boolean expression, the second is the message to use when the expression is false.

Note that processing halts if the assertion is false.

I was quite pleased with the way my code shrank down from first attempt to last attempt, but I am slightly let down I never noticed that logic error. Still... this is what I came up with, so I am posting it here anyhow.

I'll write up another article covering the CFML approach either later today or tomorrow. I've got the code all ready to go, I've just gotta wrap some words around it.

--
Adam