/red-formatting

Red formatting functions and dialect design experiments

Primary LanguageRed

Red format (WIP)

Old funcs are here and in @greggirwin's repo.

See Discussions for design talk and this commit (older)

Play with mask-based number formatting

There's a precompiled binary: Windows Linux

Goal: catch 'em bugs!

Rounding showcase

Try to figure out a pattern using videos ;)

mantissa is rounded to the nearest, in this case - nearest specified significant digit here, to nearest absolute value (halves round to even numbers) exponent is always rounded down, to absolute value in the mask

Number mask format spec

Somewhat based on ICU's number patterns, Excel format and Gregg's previous work as well as mask design discussion.

My quick research hints that ICU patterns are likely the most powerful out there, but only implemented in special packages like ICU4C/ICU4J and a few others. The masks available out of the box in most languages are rather simplistic and limited, giving Red an advantage here. Spreadsheet programs take the 2nd place in flexibility, lacking rounding, significant digit format and engineering notation, but adding fractions.

Goal is not to be 100% compatible with any, but only where it makes most sense. I want masks simple and predictable by their reader and I hope that the presented design is more powerful and simpler than it's predecessors. It has far fewer rules and no special cases.

Incompatibilities with ICU should be resolved by the script that imports it's data into Red. Reason for breaking compatibility is not to inherit complexity of their masks in Red.

Summary of the mask syntax

There are 4 modes of number formatting:

Parameter Criterion Example Masks examples
Decimal vs Exponential mode Presence of E or x in the mask 1234.5 or 1.2345E+3 0.0#### vs 0.0####E+0
Significant vs Fractional digits Presence of . in the mask 12.345 can be formatted as 12.3 either by requiring 3 significant digits or 1 fractional digit 000 vs 0.0

This is the most important clue to have in mind.

Same thing said by example:

  • 12.345 formatted as 000 results in 12.3 (significant digits, decimal)
  • 12.345 formatted as 000. results in 012 (fractional digits, decimal)
  • 12.345 formatted as 00E0 results in 1.2E1 (significant digits, exponential)
  • 12.345 formatted as 00.E0 results in 01E1 (fractional digits, exponential)

All the supported symbols are listed below:

Scope Whole part Fraction part Exponent part Note
Scope starts with initial state, always present with . with E or x
Symbol Meaning
. (period) turns on fractional digits mode and starts fraction part of the mask
forbidden if absent, significant digits mode is used
$, $$, $$$, $$$$ gets replaced by localized currency symbol; between digits or #s, also starts fraction part of the mask; count of $s affect the width (short, long currency names) forbidden e.g. 0$00 may produce 12€30
E (uppercase) starts exponent part of the mask N/A
x (lowercase) starts exponent part of the mask N/A formatted as ×10 and subsequent digits become superscript
0 digit that is always present, even if it's a leading zero digit that is always present, even if it's a trailing zero same as whole part
1 - 9 same as 0, but rounds the number, e.g. 01.2E3 rounds exponent to a multiple of 3 (engineering notation), then rounds mantissa to a multiple of 1.2 default rounding is to last figure, e.g. 0.00 and 0.01 are equivalent
# used for grouping only (together with space); has to precede digits (if any) digit that is removed if it and all #s after are zero; has to come after 0s (if any) same as whole part
(space) sets grouping size, e.g. # ##0 groups digits by 3 gets replaced by group separator, e.g. 0.0# ## ## groups up to 6 fraction digits in pairs has to be between #s or digits to take effect
+ always present sign symbol same as for the mantissa, but only applies if comes after E or x e.g. +0.0++E0+++ may format -123 as -1.2--E2+++
- sign symbol that is omitted if number is nonnegative automatically added before the first digit or # if neither of the sign marks is specified
( and ) accounting sign denotation that is omitted if number is nonnegative never part of the exponent e.g. ($0.00) may produce $15.00 or ($15.00)
% multiplies mantissa by 100; gets replaced by localized percent sign
%o multiplies mantissa by 1000; gets replaced by localized permille sign
' (apostrophe) quoting char used to insert literal text; double to produce apostrophe itself
e.g. '#'000, 0 o''clock
? reserved for padding variant of #, if we decide to support it
* (asterisk) reserved for padding, if we decide to support it

Examples & comparison

ICU mask Excel mask Red mask 12345 formatted 12.345 formatted 0.012345 formatted -12.345 formatted
0 not allowed not allowed 5 2 0 2?
# 0.###### 0.###### 12345 12.345 0.012345 -12.345
not allowed? 0 0. 12345 12 0 -12
#. 0. needs literal dot: 0.'.' 12345. 12. 0. -12.
#.## 0.## not allowed 12345. 12.35 0.01 -12.35
not allowed? not allowed 0.## 12345 12.35 0.01 -12.35
0.0# 0.0# 0.0# 12345.0 12.35 0.01 -12.35
#.0# #.0# #.0# 12345.0 12.35 .01 -12.35
#,#0 not allowed # #0. 1'23'45 12 0 -12
#,##0 #,##0 # ##0. 12'345 12 0 -12
#,## not allowed # #0.###### 1'23'45 12.345 0.012345 -12.345
0.0#,# not allowed 0.0# # 12345.0 12.34'5 0.01'2 -12.34'5
@@ not allowed 00 12000 12 0.012 -12
0E0 0E0 0E0 1E4 1E1 1E-2 -1E1
##0.#E0 not allowed 0.#E3 12.3E3 12.3E0 12.3E-3 -12.3E0
@@@E0 not allowed 000E3 12.3E3 12.3E0 12.3E-3 -12.3E0
@@5E0 not allowed 005E3 12.5E3 12.5E0 12.5E-3 -12.5E0
not allowed? not allowed 0x0 12×10³ 12×10⁰ 12×10⁻³ -12×10⁰
#0¤00 not allowed 0$00 12345€00 12€34 0€01 -12€34
#,##0¤00 not allowed # ##0$00 12'345€00 12€34 0€01 -12€34
@,@@ not allowed 0 00 1'23'00 12.3 0.01'23 -12.3
#,@@ not allowed # 00 1'20'00 12 0.01'2 -12
#,#@@ not allowed # #00 12'000 12 0.012 -12
#0.00;(#0.00) 0.00;(0.00) (0.00) 12345 12.34 0.01 (12.34)

TODO: write a GUI converter between all three mask formats, also feed all masks from ICU into it for testing.

Become a grouping expert

Example: #### ## #00 0.# ## ### has 7 groups:

  • 2nd group ## is the primary group: it is used when number is bigger than the mask
    Usually defining two groups is enough, e.g. # ##0 will add separators after each 3 digits
    If absence of 2nd group of whole digits (e.g. ##0 mask) no separators are inserted between whole digits at all
  • 1st group #### gets grown up to the primary group if it's shorter
    This allows us to write # ##0 instead of ### ##0
    But if it's longer, then it's used as is, not shortened
  • Groups after the 2nd are used as they appear (including groups in fractional part)

If we format number 1e12 using this mask we get 1 00 0000 00 000 0
If we format number 1e-12 using this mask we get 00 0
If we format number 12345e-7 using this mask we get 00 0.0 01 234

In significant digit mode grouping of fractional digits mirrors that of whole digits, so:

  • 12345678 formatted as 000 0 00 results in 1 2 345 6 00
  • 0.0012345678 formatted as 000 0 00 results in 0.00 1 234 5 7 (last digit rounded)

Major deviations of this design from ICU (or how I imagine it anyway)
  • ASCII mask symbols only (for ease of writing)
  • Space (internationally recommended thousands separator) instead of comma (used in ICU and Excel) as group separator (for better readability, esp. for low sighted)
    Note: it only affects masks spelling. Resulting separator is defined by locale data, which allows masks to stay as culturally neutral as possible (no hardcoded currency symbol, no hardcoded separators, no hardcoded digits, etc.)
  • Whole digit truncation is disabled: mask 0 formats 123 as 123, not as 3 (rarely needed use case, likely to induce bugs)
  • Period . is required for fixed-precision masks, e.g. #0., not #0 (allows to get rid of complexity of @ symbol and simplifies engineering notation significantly)
  • Period . is not output if no digits follow it,e.g. 0. mask formats 11 as 11 not 11. (unlikely ever needed use case, absence of which allows to simplify the design)
  • No infinite precision output (e.g. # and #E0 in ICU). If we want max precision, why are we even calling format? I find it awkward that in ICU # is infinite precision, #. is any number without fraction but with a dot (which cannot be removed?), and 0 is a single digit only. This is messy. In the case we later need this, we can just define a 16-digit mask and assign a name to it.
Design notes
  • While exponent case E/e can be resolved by the substitution character solely, x (formatted as ×) affects subsequent chars, making them superscript, so it has to become a separate mask char.
  • ? char (used for alignment) is not supported, because this hack only works for monospaced fonts. When using variable-width fonts, Excel kerns each digit separately in this mode to make digits and spaces equally wide, but fails on many fonts. It simply doesn't make sense to complicate masks if workaround still has to be implemented in GUI widgets. Proper way to align digits would be to create a sort of number-field widget based on field that would position each char individually and aligned under the dot. Gregg though thinks ? is still useful for .txt report files printed in monospace fonts and that those are still relevant. We agreed to wait until someone actually requests that for a real world task.
  • ICU recommends: "Programmers are used to the fallback exponent style 1.23E4, but that should not be shown to end-users [...] to show a format like 1.23 × 10⁴", but I don't see any support for that in their docs.
  • Multiple semicolon-delimited patterns (pos/neg in ICU, pos/neg/zero/text in spreadsheets) are not supported, because we have smart () for accounting.
  • Unicode variants of the ASCII sigils might be supported too: ¤ for $, for %o, × for x. Former two symbols are not even in the font I'm using in the editor, permille doesn't appear even in my browser font, so I imagine they are not going to have a lot of fans.