tulip-control/dd

use memory size prefixes consistently

johnyf opened this issue · 2 comments

The prefixes for multiples are not consistent in the code: both binary multiples and decimal multiples are used.

The prefix "M" in "MB" is used to mean a decimal multiple in the methods dd.cudd.BDD.statistics and dd.cudd.BDD.__str__ (and also in the methods dd.cudd_zdd.ZDD.statistics and dd.cudd_zdd.ZDD.__str__), i.e., 10**6.

The prefix "G" in "GB" is used to mean a binary multiple in the same extension modules (and also in the module dd.sylvan), i.e., 2**30.

So "MB" and "GB" are interpreted according to different conventions, which is inconsistent.

I choose binary multiples, so changes are needed to conform to this convention. This choice also ensures that the memory_estimate given to CUDD when initializing a BDD or ZDD manager will remain the same:

dd/dd/cudd.pyx

Lines 250 to 251 in 269e4fc

GB = 2**30
DEFAULT_MEMORY = 1 * GB

dd/dd/cudd.pyx

Lines 278 to 280 in 269e4fc

default_memory = DEFAULT_MEMORY
if memory_estimate is None:
memory_estimate = default_memory

dd/dd/cudd.pyx

Lines 300 to 305 in 269e4fc

mgr = Cudd_Init(
initial_n_vars_bdd,
initial_n_vars_zdd,
initial_subtable_size,
initial_cache_size,
memory_estimate)

I was aware that there exist two conventions, as documented at (by including the argument gnu=True in the call humanize.naturalsize(1073741824, gnu=True)):

https://github.com/tulip-control/dd/blob/269e4fc493faeec4f8da46162fb78efa6080b9b7/doc.md#cudd-interface-ddcudd

But this mismatch between "MB" and "GB" was unintended.

On prefixes for multiples: ISO, IEC, JEDEC

What is really going on is that:

The distinction becomes clear and the ambiguity avoided by using the prefixes for binary multiples of the ISO/IEC 80000 standard:

Changes needed

In most cases the changes will avoid this issue altogether, by changing from multiples of bytes to bytes. This requires changing:

  • the messages in method dd.cudd.BDD.__str__ (and the method dd.cudd_zdd.ZDD.__str__)
  • the message in one of the exceptions in the method dd.cudd.BDD.apply (and the method dd.cudd_zdd.ZDD.apply)
  • the value of key 'mem' in the dict returned by the method dd.cudd.BDD.statistics (and the method dd.cudd_zdd.ZDD.statistics): this is the most impactful change to users

The only other change needed is to rename the constants dd.cudd.GB, dd.cudd_zdd.GB, and dd.sylvan.GB to dd.cudd.GiB, dd.cudd_zdd.GiB, and dd.sylvan.GiB, respectively (the constant dd.sylvan.GB is unused, so it could as well be removed). This is mostly an internal change: it does not change what dd is doing. The only effect to users is due to the renaming, in case they happened to access the attribute GB. This change is discussed more in the following sections.

Note that in the documentation I should change the call humanize.naturalsize(1073741824, gnu=True) to humanize.naturalsize(1073741824, binary=True) to ensure that "GiB" appears in the resulting output (which would agree with the choice of IEC prefixes), instead of the GNU-style "M".

Relevant Wikipedia references are:
https://en.wikipedia.org/wiki/Byte#Multiple-byte_units (also some Wikipedia entries from above redirect to this entry)
https://en.wikipedia.org/wiki/Binary_prefix#IEC_prefixes (also some Wikipedia entries from above redirect to this entry)
https://en.wikipedia.org/wiki/File_size

Also, links to relevant Wikipedia articles could be included in the docstrings and documentation to clarify the situation.

Avoiding an implicit change of constant value

Renaming the constants GB to GiB will cause user code that uses GB to fail. The failure could be avoided by:

  • keeping the constants dd.cudd.GB and dd.cudd_zdd.GB and
    changing their values to SI, i.e., 10**9,
  • adding the new constants dd.cudd.GiB and dd.cudd_zdd.GiB,
    and defining them equal to IEC values, i.e., 2**30
    (which equals the previous value of dd.cudd.GB and dd.cudd_zdd.GB), and
  • replacing all use of dd.cudd.GB and dd.cudd_zdd.GB in the dd code
    with the newly introduced dd.cudd.GiB and dd.cudd_zdd.GiB,
    which would ensure that the code is doing the same thing as before.

Doing so avoids the failure, but would:

  • implicitly change the meaning of the constants dd.cudd.GB and dd.cudd_zdd.GB in user code, for anyone that may already be using them, whereas a failure is explicit
    (explicit is better than implicit PEP 20)
  • define the unused constants GB: this is undesired (linting is used to detect and remove unused code; introducing it on purpose is undesired)
    (simple is better than complex [PEP 20])
  • treat GB as a special case (unused but defined, why not then define an unused kB too?)
    (special cases aren't special enough to break the rules [PEP 20])
  • make existing code that accessed the attribute GB silently continue to work, though strictly speaking this is an error, because the GB has changed meaning (given the nature of change the code is likely to still be doing what was intended, e.g., initialize a BDD manager with roughly N gigabytes memory, but if the code relies on the exact number, e.g., for what units it uses to annotate axes in plotting memory measurements, then the results will be in error)
    (errors should never pass silently [PEP 20]).

These observations imply that GB should be renamed to GiB, without defining any new GB.

Avoiding mysterious absence of constant GB

Renaming constant GB to GiB will make it nonobvious to a user what to do about the error to access the constat GB that existed previously. Python introspection will probably be the easiest way to find out about the renaming. The CHANGELOG of dd is another way. In any case, the renaming cannot go unnoticed, which is safe.

(Also, I estimate that the constants dd.cudd.GB, dd.cudd_zdd.GB, and dd.sylvan.GB are rarely used outside of dd.)

Failing with a message that the constant GB has been renamed

It is possible in Python 3.7 and later to define a module's __getattr__ [PEP 562]. This mechanism enables raising an exception with a helpful message when the attribute of a module is accessed. This is similar to defining a class property (e.g., with the decorator @property) and raising an exception when the class attribute is accessed, for example because it has been deprecated.

However, this approach requires Python 3.7, and dd tries to remain compatible with earlier Python versions.

Besides, this approach makes __getattr__ forward to other module attributes via the globals(), which is complex (simple is better than complex PEP 20).

Taking these observations into account, it is not possible to raise an informative exception when user code attempts to access the module attribute GB.

Warning about the changed constant value

If the constant GB was retained, and its value changed to 10**9 (and another constant named GiB added), then a DeprecatioWarning could be generated (on Python 3.7 and later) whenever the user accesses the attribute GB. The mechanism to do this is a module's __getattr__, which is discussed in the previous section.

However, as already noted, this approach is not compatible with earlier Python versions. Even if it was, the warning could be suppressed or go unnoticed in the program's output, in which case the value change of GB would in effect be implicit.

Also, the constant GB will be renamed to GiB and no new constant GB defined, so raising a warning is not needed.

Conclusion

So the explicit and safest approach is to remove GB and add GiB.
In any case, the renaming of GB will probably have a small effect to users (I estimate that the constant GB is rarely used outside of dd code).

The change to 'mem' returned by the methods dd.cudd.BDD.statistics and dd.cudd_zdd.ZDD.statistics is less noticeable (a UserWarning will be printed whenever the method statistics is called), and also has a bigger effect on users, because it does change the meaning of the values that the method statistics returns.

In addition to the warning, the change of what 'mem' means would probably be noticed by the change of values of memory by 6 orders of magnitude. Observing this change will suggest consulting the method's docstring, or the CHANGELOG.

The changes described above have been made on branch dev.

Above (#77 (comment)), the conclusion was to, among other changes, rename the constants dd.cudd.GB and dd.cudd_zdd.GB to dd.cudd.GiB and dd.cudd_zdd.GiB, respectively (and to remove the unused constant dd.sylvan.GB).

Previously, in each of these two modules, the constant GB was used at two places, and only internally to dd (the constant could have been hidden as _GB). However, the changes described above would result in the constant GiB being used at only one place (and still only internally to dd): just to define another constant, on the line after the line that would define GiB. So the constant GiB would be redundant ("Simple is better than complex." PEP 20).

Therefore, the constant GB will be removed from these modules, and no replacement constant (like GiB was intended to be) added. Instead, the number 2**30 will replace the use of GB on the line this occurs. A comment will mention "GiB" and the relevant standard, in order to emphasize that the chosen number (2**30) is a binary multiple. This removes entirely from dd the presence of any convention on naming multiples: the numbers themselves are used directly.

Also, several of the concerns and alternatives discussed regarding the renaming are not relevant or applicable after this decision, so it becomes obvious what the new design will be.

Aside: A module variable that is intended to be used as a constant is usually named with all uppercase letters and underscores, according to PEP 8. So the choice of identifier GiB above would not have been entirely incompatible with PEP 8, though it is not recommended (nevertheless, this naming case concerns a standardized unit name, so preserving the letter case seems the best choice). In any case, the decision to remove the constant as unnecessary also avoids this naming conundrum.

Addressed in commits cc8a609, e248d95, a1dbbd6, and 4f488a3.