Thursday 16 October 2014

CFML: Gert points out a schoolboy error in my prime numbers generator logic

G'day:
I was messing around with using closure to mimic a generator the other day: "CFML: prime number generator UDF, and overhead of using inline function expressions". Gert had a look at it, and spotted a shortfall in my implementation. He's provided some code as well but I've not really looked at that yet, instead wanting to nut it out myself (part as a penance, part as an exercise).

Here's the original implementation:

function createPrimeNumberSequence(){
    var primes = [];
    var potential = 1;
    var upperThresholdToCheck = false;

    var everyHandler = function(prime){
        if (potential mod prime != 0) return true;
        if (prime > upperThresholdToCheck) return true;
        return false;
    }

    return function(){
        while(true) {
            upperThresholdToCheck = sqr(++potential);
            var potentialIsPrime = primes.every(everyHandler);
            if (potentialIsPrime) {
                primes.append(potential);
                return potential;
            }
        }
    };
}

The key part is highlighted. Before reading on, see if you spot what I'm doing wrong.

Here's my logic (not - as it happens - the code's logic; my logic):
  • we already have zero or more primes from potential previous calls to the function;
  • to find the next prime, we incrementally check a potential value;
  • we need to check the potential for prime factors. If it has any: it itself is not a prime;
  • we need to check each prime up to and including its square root. If we have not found a factor by the time we get to its square root, it can't possibly have any prime factors.
  • So we loop through every() prime. This is a slight misuse of every(), but we need an iteration method we can exit from it midway through, which one cannot do from the other array iteration methods.
  • If the number isn't a factor of that prime, we need to check the next, so exit.
  • If we're here and the current prime is beyond the square root threshold, then the potential is a prime, so return true;
  • otherwise it's not a prime, so exit, returning false.
Written like that, the logic is fine. However I've ballsed up how every() works. There are two significant values one can return from the every() callback function:
  • false: exit the whole loop (so: like break)
  • true: exit the current iteration (so: like continue)

I stopped paying attention to what I'm doing, and am treating the return value from the callback as if it's the result of my question: is it a prime or not? This is not right. When I do the return true when processing reaches the square root threshold, I am returning true; which means "keep looping". Grumble. What a dick. You know what's worse? This isn't down to me previously not knowing how every() works: I knew all this full-well. I just wasn't thinking. I hate that.

I can't possibly conflate "exit the loop now" and "is it a prime?" in the one value from that callback. What I need to do is to set a variable to represent whether the potential is a prime or not, then leave the return value to do its job.

I was still getting the right results because the rest of the logic was OK, I was just doing an awful lot more work than necessary: always checking right to the end of the primes array.

Here's the fix for this:

function createPrimeNumberSequence(){
    var primes = []
    var potential = 1
    var upperThresholdToCheck = false
    var potentialIsPrime = false

    var someHandler = function(prime){
        if (potential mod prime == 0){
            potentialIsPrime = false
            return true
        }
        if (prime > upperThresholdToCheck) return true
        return false
    }

    return function(){
        while (true) {
            upperThresholdToCheck = sqr(++potential)
            potentialIsPrime = true
            primes.some(someHandler)
            if (potentialIsPrime) {
                primes.append(potential)
                return potential
            }
        }
    }
}

So in here I've done just a couple of things differently:

  • I'm using some() not every(). Either would have worked, but as I feel that some() is more semantically correct here. I already know that I'll not even be checking every element of the collection; it'll just be some of them, so I'm using the best fit for the job.
  • I'm using potentialIsPrime differently. I'm setting it within the callback once I know the answer. The answer can be determined in one of two ways:
    1. the potential is found to have a prime factor: it's not a prime.
    2. We get to the end of the primes we need to check: it is - therefore - a prime.
    Note I have have to hoist its declaration into the main function, so the callback can enclose it for use.
If I now slide some execution metrics into each variation of the functions, I get this output:


Original Using for() loop

Loop iterated: 34048 times
Execution time: 27ms



Original using every()

Handler called: 516383 times
Execution time: 616ms



Updated using some()

Handler called: 34048 times
Execution time: 425ms




Look how much extra bloody work my version was doing previous! No wonder it was so much slower! But still: it's way way slower than just using a loop.

Gert addresses this in his comment, but I'll get to that.

Oh, and checking against Gert's code: he's reached the same conclusion and uses bExitedWithThreshold much the same way as I am using potentialIsPrime in my version.

There is a further optimisation I can make to both versions though. Once that all prime number generator / calculator functions do but I initially refused to here. After the second iteration (having identified 2 and 3 as primes), I can thenceforth increment potential by two, instead of one. Why? Because no even numbers are primes, so people generally just assess odd numbers.

Why don't I like this? Because it's incomplete thinking. "Even numbers" are not special here. What we really mean is "multiples of 2". And why is 2 significant? Because it's a prime, and any multiple of it after the first one will definitely not be a prime. But 2 is not special in this regard. multiples of 3, likewise, intrinsically aren't prime, so after we find three we should increment by [a value which will not land us on either a multiple of 2 or 3]. Likewise after 5, we should increment by [a value which will not land us on either a multiple of 2, 3 or 5]. And so on. So to me, solutions which increment by 2 are using it as a magic number. Which is poor practice.

However. I also firmly believe that one should not hang on to dogma at the expense of being practical. Having reminded myself of that, I've pulled my head out of my arse and concluded that a better but not perfect solution is still a better one. And let's be realistic: this whole approach to identifying primes is not the optimal way of doing it anyhow.

OK, so I've modified both the loop and some() versions of this, in the same way:

function createPrimeNumberSequenceLoopImproved(){
    var primes = [2]
    var potential = 1

    return function(){
        while (true) {
            potential += 2
            var upperThresholdToCheck = sqr(potential)
            var potentialIsPrime = true
            for (var prime in primes){
                if (potential mod prime == 0) {
                    potentialIsPrime = false
                    break;
                }
                if (prime > upperThresholdToCheck) break;
            }
            if (potentialIsPrime) {
                primes.append(potential)
                return primes[primes.len()-1]
            }
        }
    }
}

Updates:

  • If I'm gonna use magic numbers, then I'm gonna leverage that fact. I stipulate the first prime is 2.
  • This means I don't have to have a conditional to determine the increment for potential: it can always be 2.
  • A quirk of stipulating 2 as already being a prime, my calculations now actually work out the next prime. So I need to return the previous one in the array. I figure this is less overhead than having to work out how much to increment potential inside that while loop: after the first two turns, it's always gonna be 2, after all.

And the same updates have been made to the some() version:

function createPrimeNumberSequenceSomeImproved(){
    var primes = [2]
    var potential = 1
    var upperThresholdToCheck = false
    var potentialIsPrime = false

    var someHandler = function(prime){
        if (potential mod prime == 0){
            potentialIsPrime = false
            return true
        }
        if (prime > upperThresholdToCheck) return true
        return false
    }

    return function(){
        while (true) {
            potential += 2
            upperThresholdToCheck = sqr(potential)
            potentialIsPrime = true
            primes.some(someHandler)
            if (potentialIsPrime) {
                primes.append(potential)
                return primes[primes.len()-1]
            }
        }
    }
}

Checking the metrics on this, I now get:


Original Using for() loop

Loop iterated: 34048 times
Execution time: 27ms



Improved using for() loop

Handler called: 30144 times
Execution time: 12ms


Original using every()

Handler called: 516383 times
Execution time: 616ms

Updated using some()

Handler called: 34048 times
Execution time: 425ms

Improved using some()

Handler called: 30144 times
Execution time: 224ms


We've knocked 10% of the iterations off here. I kinda expected it to be closer to a 50% improvement, and am not sure why it isn't. I'm sure it's obvious if I engage my brain.

The last thing that Gert said which is commentworthy:

In total with 1000 iteration there are a total of 34000 function calls. So IMHO comparing 34000 function calls with a simple for in statement is really like apples and pears. The calls of the functons are actually taking 90% of the total every time. The time inside the closure is only 10%.
Well: yes. This was entirely my point. I wasn't claiming the code inside the loop or the callback was faster or slower: I'd expect that to be the same. It's that the overhead of calling the callback is significant. As he says... 90% of the overhead of the work. This is a significant consideration when choosing a strategy.

If you're doing something that's under heavy load - like this sort of thing - I recommend not using the iteration functions (some(), every(), map(), reduce() etc), just use a vanilla loop. However in situations where there is not so much load, do use the iteration functions as they make the intent of the code much clearer. I don't actually think my code here is an example of that, TBH. In that the value I'm using from the some() code is not actually the function result (I am not actually checking if "some" of the array fulfils the criteria), I'm just leveraging its early-exit capabilities, I think this is actually a misuse of some(). The loop is better here. Even if it breaks my coding rule of a loop should generally act on the entire collection it's looping over. I guess there's some other iteration functionality I'm after here which more clearly describes how I'm processing the array. Hmmm.

Anyway, I need to update that thing on CFLib, I guess. Better crack on with that.

Cheers, Gert, for calling me out on this!

--
Adam