Tuesday, 15 January 2013

Floating point accuracy

G'day:
This is not going to be particularly earth-shattering, sorry.

Yesterday someone asked on my favourite source of militant pedantry (Stack Overflow) how to pass a BigDecimal to POI for a spreadsheet cell value. You'll need to take my word for it, as the original question has been deleted (not sure what to make of that). There's a Google-cached version of it here. I dunno how long google caches stuff for though. The gist of the question text is (this is an abridged version of the original:


How to use BigDecimal when I have no implementation that will take it

Here is my code:

[snipped]

The part I changed for it to error is:

thisCell.setCellValue(javacast("bigdecimal", invoicesDue.amount) );
The setCellValue() method has no overload which will take a BigDecimal.

The field is going to be currency which requires a BigDecimal, however XSSFCell's setCellValue() doesn't have an implementation that will take it. The closest approximation only accepts a Double, which is no good to me because in the documentation it says double shouldn't be used for currency. That being the case, what should I do?
[...]
I had a hunt around and came up with this:

To answer the specific question you have asked: the setCellValue() method has no overload which will take a BigDecimal. Which is what the error is telling you.

A follow-up question you might want to ask is "I need to do [x] which requires a BigDecimal, however XSSFCell's setCellValue() doesn't have an implementation that will take it. The closest approximation only accepts a Double, which is no good to me [because of this reason]. That being the case, what should I do?"

I guess you could update this question, or you could create a new question.

And that's about as far as the Q&A got before the question was deleted.

The problem the questioner was having is that they were - rightly - cautious about using a float for the currency amounts, and floats are not very accurate, so if one needs accuracy, using something like a BigDecimal is better as it (seems to ~... I am no expert) preserve accuracy to an arbitrary degree.

POI will only accept a Double for a numeric, and I also read Excel only works in Doubles too, so there are going to be inherent accuracy constraints here.

It then occurred to me that I had no idea how much accuracy a Double actually has, so set out to find out.  I knocked this code together:

What this does is to build three strings that represent numbers:
  • one that's an integer only (ie: all digits are to the left of the decimal point)
  • one that's a decimal only (ie: all digits are to the right of the decimal point)
  • one that's both (ie: digits are on either side of the decimal point)
I then convert each of these to a numeric (val() it), and then check to see if the string and the numeric value (treated as a string again) compare favourable.

After that I just tabulate the results, as well as writing them to an XLS file so I can see what happens there as well.

The results are as follows:


DigitsIntegerDecimalBoth
StringNumberEqualStringNumberEqualStringNumberEqual
111YES0.10.1YES1.01NO
21212YES0.120.12YES1.21.2YES
3123123YES0.1230.123YES13.213.2YES
412341234YES0.12340.1234YES13.2413.24YES
51234512345YES0.123450.12345YES135.24135.24YES
6123456123456YES0.1234560.123456YES135.246135.246YES
712345671234567YES0.12345670.1234567YES1357.2461357.246YES
81234567812345678YES0.123456780.12345678YES1357.24681357.2468YES
9123456789123456789YES0.1234567890.123456789YES13579.246813579.2468YES
1112345678911234567891YES0.12345678910.1234567891YES135791.2468135791.2468YES
121234567891212345678912YES0.123456789120.12345678912YES135791.24682135791.24682YES
13123456789123123456789123YES0.1234567891230.123456789123YES1357913.246821357913.24682YES
1412345678912341.23456789123E+012NO0.12345678912340.123456789123NO1357913.2468241357913.24682NO
15123456789123451.23456789123E+013NO0.123456789123450.123456789123NO13579135.24682413579135.2468NO
161234567891234561.23456789123E+014NO0.1234567891234560.123456789123NO13579135.246824613579135.2468NO
1712345678912345671.23456789123E+015NO0.12345678912345670.123456789123NO135791357.2468246135791357.247NO
18123456789123456781.23456789123E+016NO0.123456789123456780.123456789123NO135791357.24682468135791357.247NO
191234567891234567891.23456789123E+017NO0.1234567891234567890.123456789123NO1357913579.246824681357913579.25NO

(there's an error in this in that I'm implying the "digits" column represents how many digits are in the number, but that's wrong after 9, as I skip 10)

What we see is that a Double has 12 digits of precision, be they on the left, right, or across the decimal point. I "knew" (mostly) the number of digits accuracy it had would be fixed irrespective of where the decimal point was, but wasn't 100% certain, and I certainly didn't know how many digits it was.

The resultant spreadsheet was as follows (converted to a vanilla-looking HTML table):


1 1 1 TRUE 0.1 0.1 TRUE 1 1 TRUE
2 12 12 TRUE 0.12 0.12 TRUE 1.2 1.2 TRUE
3 123 123 TRUE 0.123 0.123 TRUE 13.2 13.2 TRUE
4 1234 1234 TRUE 0.1234 0.1234 TRUE 13.24 13.24 TRUE
5 12345 12345 TRUE 0.12345 0.12345 TRUE 135.24 135.24 TRUE
6 123456 123456 TRUE 0.123456 0.123456 TRUE 135.246 135.246 TRUE
7 1234567 1234567 TRUE 0.1234567 0.1234567 TRUE 1357.246 1357.246 TRUE
8 12345678 12345678 TRUE 0.12345678 0.12345678 TRUE 1357.2468 1357.2468 TRUE
9 123456789 123456789 TRUE 0.123456789 0.123456789 TRUE 13579.2468 13579.2468 TRUE










11 1234567891 1234567891 TRUE 0.123456789 0.123456789 TRUE 135791.2468 135791.2468 TRUE
12 12345678912 12345678912 TRUE 0.123456789 0.123456789 TRUE 135791.2468 135791.2468 TRUE
13 1.23457E+11 1.23457E+11 TRUE 0.123456789 0.123456789 TRUE 1357913.247 1357913.247 TRUE
14 1.23457E+12 1.23457E+12 FALSE 0.123456789 0.123456789 FALSE 1357913.247 1357913.247 FALSE
15 1.23457E+13 1.23457E+13 FALSE 0.123456789 0.123456789 FALSE 13579135.25 13579135.25 FALSE
16 1.23457E+14 1.23457E+14 FALSE 0.123456789 0.123456789 FALSE 13579135.25 13579135.25 FALSE
17 1.23457E+15 1.23457E+15 FALSE 0.123456789 0.123456789 FALSE 135791357.2 135791357.2 FALSE
18 1.23457E+16 1.23457E+16 FALSE 0.123456789 0.123456789 FALSE 135791357.2 135791357.2 FALSE
19 1.23457E+17 1.23457E+17 FALSE 0.123456789 0.123456789 FALSE 1357913579 1357913579 FALSE

This is much the same, except note Excel displays that "13th" row in scientific notation, whilst still seemingly preserving the accuracy (judging by the fact that row says "true" and the next one says "false".

From this, I could not be sure if the data "truncation" was coming from CF, or the limit in Excel is the same as in CF. So I just pasted in the "string" column into a new spreadsheet, and the results were as follows:


1 0.1000000000000000000000 1.0000000000000000000000
12 0.1200000000000000000000 1.2000000000000000000000
123 0.1230000000000000000000 13.2000000000000000000000
1234 0.1234000000000000000000 13.2400000000000000000000
12345 0.1234500000000000000000 135.2400000000000000000000
123456 0.1234560000000000000000 135.2460000000000000000000
1234567 0.1234567000000000000000 1357.2460000000000000000000
12345678 0.1234567800000000000000 1357.2468000000000000000000
123456789 0.1234567890000000000000 13579.2468000000000000000000
1234567891 0.1234567891000000000000 135791.2468000000000000000000
12345678912 0.1234567891200000000000 135791.2468200000000000000000
123456789123 0.1234567891230000000000 1357913.2468200000000000000000
1234567891234 0.1234567891234000000000 1357913.2468240000000000000000
12345678912345 0.1234567891234500000000 13579135.2468240000000000000000
123456789123456 0.1234567891234560000000 13579135.2468246000000000000000
1234567891234560 0.1234567891234560000000 135791357.2468240000000000000000
12345678912345600 0.1234567891234560000000 135791357.2468240000000000000000
123456789123456000 0.1234567891234560000000 1357913579.2468200000000000000000

So Excel runs to 15 digits of accuracy, which is better than CF will manage.

Despite all that, if we go back to the original situation in that the person asking the question was concerned about the accuracy levels for currency amounts: 12 digits will represent a currency value in the tens of billions, accurate to a penny. This will probably be fine for most people's requirements.

There are still potential issues with the inherent flaws with floating point numbers accurately representing decimals, but I think for the purposes of financial calculations, once the data is in Excel, it will be fine.

So my bottom line that I would have given the person had the question lasted long enough for me to update my answer would be "don't worry about it: just use a double".

And at least now I know how accurate CF floating point numbers are. This is one of the good things about helping people with their questions on forums... sometimes one needs to work out what the answer is before answering, so it's a learning exercise for the answerer too!

Time for work...

--
Adam