use memory size prefixes consistently
johnyf opened this issue · 2 comments
The prefixes for multiples are not consistent in the code: both binary multiples and decimal multiples are used.
The prefix "M" in "MB" is used to mean a decimal multiple in the methods dd.cudd.BDD.statistics
and dd.cudd.BDD.__str__
(and also in the methods dd.cudd_zdd.ZDD.statistics
and dd.cudd_zdd.ZDD.__str__
), i.e., 10**6
.
The prefix "G" in "GB" is used to mean a binary multiple in the same extension modules (and also in the module dd.sylvan
), i.e., 2**30
.
So "MB" and "GB" are interpreted according to different conventions, which is inconsistent.
I choose binary multiples, so changes are needed to conform to this convention. This choice also ensures that the memory_estimate
given to CUDD when initializing a BDD or ZDD manager will remain the same:
Lines 250 to 251 in 269e4fc
Lines 278 to 280 in 269e4fc
Lines 300 to 305 in 269e4fc
I was aware that there exist two conventions, as documented at (by including the argument gnu=True
in the call humanize.naturalsize(1073741824, gnu=True)
):
But this mismatch between "MB" and "GB" was unintended.
On prefixes for multiples: ISO, IEC, JEDEC
What is really going on is that:
-
in the ISO standard:
MB = 10**6 bytes
:- "megabyte"
- prefix "M" for "mega"
GB = 10**9 bytes
:- "gigabyte"
- prefix "G" for "giga"
-
in the JEDEC standard:
MB = 2**20 bytes
GB = 2**30 bytes
The distinction becomes clear and the ambiguity avoided by using the prefixes for binary multiples of the ISO/IEC 80000 standard:
MiB = 2**20 bytes
:- "mebibyte"
- prefix "Mi" for "mebi"
GiB = 2**30 bytes
:- "gibibyte"
- prefix "Gi" for "gibi"
Changes needed
In most cases the changes will avoid this issue altogether, by changing from multiples of bytes to bytes. This requires changing:
- the messages in method
dd.cudd.BDD.__str__
(and the methoddd.cudd_zdd.ZDD.__str__
) - the message in one of the exceptions in the method
dd.cudd.BDD.apply
(and the methoddd.cudd_zdd.ZDD.apply
) - the value of key
'mem'
in thedict
returned by the methoddd.cudd.BDD.statistics
(and the methoddd.cudd_zdd.ZDD.statistics
): this is the most impactful change to users
The only other change needed is to rename the constants dd.cudd.GB
, dd.cudd_zdd.GB
, and dd.sylvan.GB
to dd.cudd.GiB
, dd.cudd_zdd.GiB
, and dd.sylvan.GiB
, respectively (the constant dd.sylvan.GB
is unused, so it could as well be removed). This is mostly an internal change: it does not change what dd
is doing. The only effect to users is due to the renaming, in case they happened to access the attribute GB
. This change is discussed more in the following sections.
Note that in the documentation I should change the call humanize.naturalsize(1073741824, gnu=True)
to humanize.naturalsize(1073741824, binary=True)
to ensure that "GiB" appears in the resulting output (which would agree with the choice of IEC prefixes), instead of the GNU-style "M".
Relevant Wikipedia references are:
https://en.wikipedia.org/wiki/Byte#Multiple-byte_units (also some Wikipedia entries from above redirect to this entry)
https://en.wikipedia.org/wiki/Binary_prefix#IEC_prefixes (also some Wikipedia entries from above redirect to this entry)
https://en.wikipedia.org/wiki/File_size
Also, links to relevant Wikipedia articles could be included in the docstrings and documentation to clarify the situation.
Avoiding an implicit change of constant value
Renaming the constants GB
to GiB
will cause user code that uses GB
to fail. The failure could be avoided by:
- keeping the constants
dd.cudd.GB
anddd.cudd_zdd.GB
and
changing their values to SI, i.e.,10**9
, - adding the new constants
dd.cudd.GiB
anddd.cudd_zdd.GiB
,
and defining them equal to IEC values, i.e.,2**30
(which equals the previous value ofdd.cudd.GB
anddd.cudd_zdd.GB
), and - replacing all use of
dd.cudd.GB
anddd.cudd_zdd.GB
in thedd
code
with the newly introduceddd.cudd.GiB
anddd.cudd_zdd.GiB
,
which would ensure that the code is doing the same thing as before.
Doing so avoids the failure, but would:
- implicitly change the meaning of the constants
dd.cudd.GB
anddd.cudd_zdd.GB
in user code, for anyone that may already be using them, whereas a failure is explicit
(explicit is better than implicit PEP 20) - define the unused constants
GB
: this is undesired (linting is used to detect and remove unused code; introducing it on purpose is undesired)
(simple is better than complex [PEP 20]) - treat
GB
as a special case (unused but defined, why not then define an unusedkB
too?)
(special cases aren't special enough to break the rules [PEP 20]) - make existing code that accessed the attribute
GB
silently continue to work, though strictly speaking this is an error, because theGB
has changed meaning (given the nature of change the code is likely to still be doing what was intended, e.g., initialize a BDD manager with roughly N gigabytes memory, but if the code relies on the exact number, e.g., for what units it uses to annotate axes in plotting memory measurements, then the results will be in error)
(errors should never pass silently [PEP 20]).
These observations imply that GB
should be renamed to GiB
, without defining any new GB
.
Avoiding mysterious absence of constant GB
Renaming constant GB
to GiB
will make it nonobvious to a user what to do about the error to access the constat GB
that existed previously. Python introspection will probably be the easiest way to find out about the renaming. The CHANGELOG of dd
is another way. In any case, the renaming cannot go unnoticed, which is safe.
(Also, I estimate that the constants dd.cudd.GB
, dd.cudd_zdd.GB
, and dd.sylvan.GB
are rarely used outside of dd
.)
Failing with a message that the constant GB
has been renamed
It is possible in Python 3.7 and later to define a module's __getattr__
[PEP 562]. This mechanism enables raising an exception with a helpful message when the attribute of a module is accessed. This is similar to defining a class property (e.g., with the decorator @property
) and raising an exception when the class attribute is accessed, for example because it has been deprecated.
However, this approach requires Python 3.7, and dd
tries to remain compatible with earlier Python versions.
Besides, this approach makes __getattr__
forward to other module attributes via the globals()
, which is complex (simple is better than complex PEP 20).
Taking these observations into account, it is not possible to raise an informative exception when user code attempts to access the module attribute GB
.
Warning about the changed constant value
If the constant GB
was retained, and its value changed to 10**9
(and another constant named GiB
added), then a DeprecatioWarning
could be generated (on Python 3.7 and later) whenever the user accesses the attribute GB
. The mechanism to do this is a module's __getattr__
, which is discussed in the previous section.
However, as already noted, this approach is not compatible with earlier Python versions. Even if it was, the warning could be suppressed or go unnoticed in the program's output, in which case the value change of GB
would in effect be implicit.
Also, the constant GB
will be renamed to GiB
and no new constant GB
defined, so raising a warning is not needed.
Conclusion
So the explicit and safest approach is to remove GB
and add GiB
.
In any case, the renaming of GB
will probably have a small effect to users (I estimate that the constant GB
is rarely used outside of dd
code).
The change to 'mem'
returned by the methods dd.cudd.BDD.statistics
and dd.cudd_zdd.ZDD.statistics
is less noticeable (a UserWarning
will be printed whenever the method statistics
is called), and also has a bigger effect on users, because it does change the meaning of the values that the method statistics
returns.
In addition to the warning, the change of what 'mem'
means would probably be noticed by the change of values of memory by 6 orders of magnitude. Observing this change will suggest consulting the method's docstring, or the CHANGELOG.
The changes described above have been made on branch dev
.
Above (#77 (comment)), the conclusion was to, among other changes, rename the constants dd.cudd.GB
and dd.cudd_zdd.GB
to dd.cudd.GiB
and dd.cudd_zdd.GiB
, respectively (and to remove the unused constant dd.sylvan.GB
).
Previously, in each of these two modules, the constant GB
was used at two places, and only internally to dd
(the constant could have been hidden as _GB
). However, the changes described above would result in the constant GiB
being used at only one place (and still only internally to dd
): just to define another constant, on the line after the line that would define GiB
. So the constant GiB
would be redundant ("Simple is better than complex." PEP 20).
Therefore, the constant GB
will be removed from these modules, and no replacement constant (like GiB
was intended to be) added. Instead, the number 2**30
will replace the use of GB
on the line this occurs. A comment will mention "GiB" and the relevant standard, in order to emphasize that the chosen number (2**30
) is a binary multiple. This removes entirely from dd
the presence of any convention on naming multiples: the numbers themselves are used directly.
Also, several of the concerns and alternatives discussed regarding the renaming are not relevant or applicable after this decision, so it becomes obvious what the new design will be.
Aside: A module variable that is intended to be used as a constant is usually named with all uppercase letters and underscores, according to PEP 8. So the choice of identifier GiB
above would not have been entirely incompatible with PEP 8, though it is not recommended (nevertheless, this naming case concerns a standardized unit name, so preserving the letter case seems the best choice). In any case, the decision to remove the constant as unnecessary also avoids this naming conundrum.