Addition of a mask to stdlib_experimental_stats function `mean`
Closed this issue ยท 33 comments
Current
(Related to #113)
The current spec for stdlib mean is:
mean - mean of array elements
Description
Returns the mean of all the elements of array, or of the elements of array along dimension dim if provided.
Syntax
result = mean(array)
result = mean(array, dim)
Arguments
array: Shall be an array of type integer, or real.
dim: Shall be a scalar of type integer with a value in the range from 1 to n, where n is the rank of array.
Return value
If array is of type real, the result is of the same type as array.
If array is of type integer, the result is of type double precision.
If dim is absent, a scalar with the mean of all elements in array is returned. Otherwise, an array of rank n-1, where n equals the rank of array, and a shape similar to that of array with dimension dim dropped is returned.
Proposal
I would like to propose to add mask into the API of mean as follows (similar to the intrinsic sum):
mean - mean of array elements
Description
Returns the mean of all the elements of array, or of the elements of array along dimension dim if provided, and if the corresponding element in mask is true.
Syntax
result = mean(array [, mask])
result = mean(array, dim [, mask])
Arguments
array: Shall be an array of type integer, or real.
dim: Shall be a scalar of type integer with a value in the range from 1 to n, where n is the rank of array.
mask (optional): Shall be of type logical and either be a scalar or an array of the same shape as array.
Return value
If array is of type real, the result is of the same type as array.
If array is of type integer, the result is of type double precision.
If dim is absent, a scalar with the mean of all elements in array is returned. Otherwise, an array of rank n-1, where n equals the rank of array, and a shape similar to that of array with dimension dim dropped is returned.
If mask is specified, the result is the mean of all elements of array corresponding to true elements of mask. If every element of mask is false , the result is zero.
This definition of mask for mean agrees with the definition of mask of the intrinsic sum in the Fortran Standard draft:
p. 429:
MASK (optional) shall be of type logical and shall be conformable with ARRAY.
and
p. 9:
3.36
conformable
โจof two data entitiesโฉ having the same shape, or one being an array and the other being scalar
Implementation
Each function for mean, e.g.,
module function mean_1_sp_sp(x) result(res)
real(sp), intent(in) :: x(:)
real(sp) :: res
res = sum(x) / real(size(x, kind = int64), dp)
end function mean_1_sp_spshould then become:
module function mean_1_sp_sp(x, mask) result(res)
real(sp), intent(in) :: x(:)
real(sp) :: res
logical, intent(in), optional :: mask
if(.not.optval(mask, .true.))then
res = 0._dp
return
end if
res = sum(x) / real(size(x, kind = int64), dp)
end function mean_1_sp_sp
module function mean_1_mask_sp_sp(x, mask) result(res)
real(sp), intent(in) :: x(:)
real(sp) :: res
logical, intent(in) :: mask(:)
res = sum(x, mask) / real(count(mask, kind = int64), dp)
end function mean_1_mask_sp_spI kown there was some discussions about dim and if it should be considered as optional or not in the spec. I guess the same discussion could appear here since mask is not optional in one of the 2 functions. However, it will behave like optional for the users through the interface mean.
That looks good to me. It would appear as "optional" due to the interface to the users, so I think that's what matters, and the above is just an implementation detail how to actually implement it.
A larger philosophical question is whether we have to implement mask for every array function that operates on arrays, or only on some?
A larger philosophical question is whether we have to implement
maskfor every array function that operates on arrays, or only on some?
Why would you not implement it on everey function? Also, from rank 3 to rank 7 or 15, it would auto-generated.
The only downside that I can see is that it will be a lot of functions. But I think we want to support the "mask" argument. The other approach would be that somehow the user would run the mask on the array before passing it to sum or mean (NumPy implements masked arrays). But I don't know how to do that efficiently in Fortran, so the mask argument seems appropriate.
I think it would be good to have the same API as sum and other intrinsic functions.
The users will not see all the functions behind the interface, and I think it can be manageable with auto-generation. It could be only (maybe) tricky for the CI, but we can limit the number of dimensions (currently 4). Also, I don't think the CI should stop us to implement a correct/desired code/API.
I agree.
Excellent proposal. I only have a problem with the following statement:
If every element of
maskisfalse, the result is zero.
Is this not dangerous? The user might think that zero is a legitimate mean value. Can't we return (a IEEE?) NaN or something along these lines so the user can catch the exception? All this hopefully without hurting performance.
Is this not dangerous? The user might think that zero is a legitimate mean value. Can't we return (an IEEE?)
NANor something along these lines so the user can catch the exception? All this hopefully without hurting performance.
Good point. The intrinsic sum returns 0, following the Standard:
Case (ii): The result of SUM (ARRAY, MASK = MASK) has a value equal to a processor-dependent approximation to the sum of the elements of ARRAY corresponding to the true elements of MASK or has the value zero if there are no true elements.
So, for the mean, if mask = .false., it ends up with a division of 0. by 0.. So mean should most likely return NaN. A simple way to return NaN could be:
real(dp)::res
if (.not.optval(mask, .true.) then
res = res /0._dp
return
end ifNote that GFortran does not allow expressions like res = 0._dp / 0._dp or res = 1._dp / 0._dp.
I would be careful with the nans. I usually turn them off in my production codes. Is there any current intrinsic function in Fortran that can return nan? I think returning zero would work. Stopping the program with an error would also work, but the issue is that then mean could not be a pure function...
I would be careful with the nans. I usually turn them off in my production codes.
I think this resonates in some way with the discussion had about assert. I.e. in debug mode you catch the cases that return NaN and in release mode you can ignore them (at your own risk, trading "safety" with performance).
Still, I think mathematically stands that although the sum of no values is zero, the mean of no values is undefined.
Having had to deal with Nan in a lot of programming environments, the most reliable way I found may seem surprising, but is
function nan64(value)
use,intrinsic :: iso_fortran_env, only: real64
implicit none
character(len=*),parameter::ident="@(#)M_units:: nan64(3fp): Returns a NAN (Not a number) of type real64"
character(len=3),save :: STRING='NaN'
real(kind=real64) :: nan64,value
read(STRING,*)nan64
end function nan64which is one part of a generic nan(value) where the returned value is the type of value that could be auto-generated. Note the string cannot be a parameter, which seems non-intuitive; but internal reads cannot read from a constant. The rest of the generic definition is quite repetitive but (I think) obvious.
It's a long story, but a lot of "standard" ways of defining a NaN are subject to whether the compiler supports the IEEE standard, whether optimizations change the definition or test (I have an is_nan function too), and so on. So far, reading the string "NaN" has proved portable and is supported by the Fortran standard as far back as I could find (if anyone knows when the earliest support by the standard was, feel free to add it here).
@urbanjost interesting. A few questions: what's the purpose of the ident parameter in your nan64 function? Also, why does it accept value as an argument? It does not seem to be used.
Still, I think mathematically stands that although the sum of no values is zero, the mean of no values is undefined.
I totally agree with @epagone statement, and I should have thought about it sooner: the mean of no values is undefined, not 0.
Regarding our implementation of mean along a dimension and with a mask being of the same shape as the input x, we have:
module function mean_2_mask_sp_sp(x, dim, mask) result(res)
real(sp), intent(in) :: x(:,:)
logical, intent(in) :: mask(:,:)
integer, intent(in) :: dim
real(sp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim ))
select case(dim)
case(1)
res = sum(x, 1, mask) / real(count(mask, 1), sp)
case(2)
res = sum(x, 2, mask) / real(count(mask, 2), sp)
case default
call error_stop("ERROR (mean): wrong dimension")
end select
end function mean_2_mask_sp_spThis implementation may return an array res with numbers and NaN as means.
If we aim to return 0.0 (or any other value different thanNaN (even for the NaN proposed by @urbanjost) for the following case:
module function mean_2_all_sp_sp(x, mask) result(res)
real(sp), intent(in) :: x(:,:)
logical, intent(in), optional :: mask
real(sp) :: res
if (.not.optval(mask, .true.)) then
res = ??? 0._sp / value / @urbanjost NaN????
return
end if
res = sum(x) / real(size(x, kind = int64), sp)
end function mean_2_all_sp_spthen, for consistency across all mean functions, we will have to catch the possible compiler NaN in mean_2_mask_sp_sp(x, dim, mask) and replace them by ??? value / @urbanjost NaN????, leading to complicate and less efficient code.
Probably, IMHO, the best way to be consistent with mean_2_mask_sp_sp(x, dim, mask) (and to keep this latter function simple and efficient) would be something like:
module function mean_2_all_sp_sp(x, mask) result(res)
real(sp), intent(in) :: x(:,:)
logical, intent(in), optional :: mask
real(sp) :: res
if (.not.optval(mask, .true.)) then
res = sum(x, .false.) / count(.false.)
return
end if
res = sum(x) / real(size(x, kind = int64), sp)
end function mean_2_all_sp_spWith such an implementation, NaN could be still ignore in release mode (while not the case with @urbanjost approach, right?).
I would be careful with the nans. I usually turn them off in my production codes. Is there any current intrinsic function in Fortran that can return nan?
The intrinsic function fraction(x) may return NaN if x is an IEEE NaN or Infinity. So it is quite a special case.
I think returning zero would work. Stopping the program with an error would also work, but the issue is that then
meancould not be a pure function...
I suggest to avoid to return 0 (or any other value, or @urbanjost NaN), at least for consistenty with other functions of mean; see previous comment).
I would argue the same thing about stopping the program with an error, because it would mean that when mask is an array and the mean is computed along a dimension, we would need to check that there is no NaN in the array result of the function mean, or that there is no true element in the mask along a dimension. Such checks would penalize the efficiency of the function (in addition to not be able to be pure).
FYI: Regarding questions on nan64(3f) function: I extracted the function to generate a NaN from a generic function, where VALUE was used merely to select the kind of the NaN. The bit pattern of NaN is dependent on the KIND, the unused value was just used to determine kind.
The IDENT variable is a whole other discussion, not really pertinent to the Nan function per-se,
but really a whole other topic.
The IDENT is used like a C #ident directive. I (still) use metadata like that to automatically build indexes of .code. This particular syntax is often optimized away as an unused variable but is still useful in the source. I use a preprocessor that handles some of the issues with such data to keep it in the binaries.
Didn't mean to cause confusion; the basic idea is the most reliable way to generate a NaN that I have found is to do an internal read of the string 'NaN'.
The IDENT stuff is very useful for QA and configuration control and automated documentation but is not well supported by Fortran. Many systems even include a place for metadata in executables but it does not seem to be used a lot anymore; but I need it for a variety of reasons. I would ask for support of metadata strings as an extension to Fortran but (surprisingly to me) even #ident is not well supported in C.
Lets me automatically build an index like
source index and
get information from executable files like
what slice.exe
slice.exe:
PRODUCT: GPF library utilities and examples
PROGRAM: slice(1)
DESCRIPTION: display a set of curves as slices with a 3D view
VERSION: 1.1, 20190326
AUTHOR: John S. Urban
HOME PAGE: http://www.urbanjost.altervista.org/index.html
COMPILED: Fri, Jan 24th, 2020 3:57:19 PM
I have my own what(1) command. It used to be nearly ubiquitous on Unix systems but came with SCCS, which has largely been superseded by mg, git, ...
Anyway, ignore the IDENT string, with the exception that maybe there should be some kind of discussion of metadata and Fortran.
The nan(3f) and is_nan(3f) functions are in M_units.f90 in my GPF github collection
@urbanjost I see, thanks for the background.
@jvdp1 I think it's probably fine to return NaN, as I make them raise a runtime exception if a NaN is used in my production codes.
Is there any current intrinsic function in Fortran that can return nan?
sqrt and log of a negative number both return a nan.
In my view, the only reasonable value to return in this case is a nan. It's also behavior consistent with numpy.
If intrinsic modules like ieee_arithmetic or ieee_features don't provide a named nan and Inf constants, stdlib should. This is a perfect use case for stdlib_experimental_constants.
@milancurcic With ieee_arithmetic, we could get IEEE NaN as follows (ieee_nan in ieee_features is private:
use ieee_arithmetic, only: ieee_value, ieee_quiet_nan
implicit none
real::res
res = ieee_value(res, ieee_quiet_nan)This implementation is also not kind-dependent.
The only issue with IEEE is that it may not supported by all compilers since it is Fortran 2003. However, this is not a problem with preprocessing (there is already a flag for Fortran 90 in the fypp files).
So, would be a solution like the following one acceptable for stdlib::mean, while thinking how to implement constants like NaN, Inf,...:
#:if VERSION90
res = 0._sp
res = res / res
#:else
res = ieee_value(res, ieee_quiet_nan)
#:endifThe IEEE NaN would be returned if the compiler support ieee_arithmetic; otherwise it would return a NaN compatible with what other functions with an array mask would return when all elements are false.
I've used the transfer method for defining NaN for while now. Is there any reason why this is a bad idea?
I'm not sure how portable this approach is.
Both gfortran 9 and ifort 19 print "NaN'. I'm not sure what happens with older version of the compilers.
real(sp) :: NaN = transfer(Z'7FF80000', 1.0)
real(dp) :: NaN64 = transfer(Z'7FF0000000000001', 1._dp)I've used the transfer method for defining NaN for while now.
It is probably a way to implement that in a module constant #99 . We will probably need to differentiate quiet and signaling NaN.
From Intel ieee_arithmetic.f90:
REAL(4), PARAMETER :: FOR_S_SNAN = TRANSFER((/ Z'7FA00000' /),1.0_4)
REAL(4), PARAMETER :: FOR_S_QNAN = TRANSFER((/ Z'7FC00000' /),1.0_4) I see nothing wrong with using transfer. You can also assign binary literals, for example for 32-bit values:
real, parameter :: nan = b'11111111110000000000000000000000'
real, parameter :: neginf = b'11111111100000000000000000000000'
real, parameter :: posinf = b'01111111100000000000000000000000'However, if ieee_arithmetic is implemented on most compilers, we should just do that for simplicity, as it should work for any kind and provides quiet/signaling NaN for us, as Jeremie pointed out.
So, it seems that everybody agrees that NaN should be returned when mean is equal to 0./0..
Currently, and until (if) NaN is implemented inside stdlib, I proposed to implement NaN as:
res = ieee_value(res, ieee_quiet_nan)Both ieee_value and ieee_quiet_nan are provided by ieee_arithmetic. ieee_arithmetic is supported by GFortran 5 and higher and Intel 14 and higer. Do we aim to support older compilers? Do other compilers support IEEE modules (I believe so, at least for the most recent versions)?
If it is needed, preprocessing can be implemented to avoid IEEE (see comment), but I think it would be beter to avoid that solution for this NaN case.
I think it's reasonable to expect F2003 features to work. IEEE arithmetic features seem to be supported by most compilers, their recent versions that is, see table on page 12: https://www.fortranplus.co.uk/app/download/30202489/fortran_2003_2008_2018_compiler_support.pdf
So are we just going to assume all real kinds on all target compilers and platforms are IEEE compliant? Or at least have a concept of "nan" that resembles IEEE's?
This is relevant to #118 as well.
I would simply test it out (see #15). And if it works everywhere, then use it. Note that "inf" are disabled if "-ffast-math" is used, which is the default in my production codes. But otherwise stdlib can use them I think, just like sqrt(-1.) returns a NaN. I tested this simple code:
real :: a
a = -3
print *, sqrt(a)
endwith:
gfortran -O3 -march=native -ffast-math a.f90
and it prints:
NaN
So I think the NaN might work even with -ffast-math. Note that in Debug mode, I use the following options:
gfortran -Wall -Wextra -Wimplicit-interface -g -fcheck=all -fbacktrace -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999 a.f90
and then the code above gives an exception:
$ ./a.out
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f8837f5f2da in ???
#1 0x7f8837f5e503 in ???
#2 0x7f8837b91f1f in ???
#3 0x55e126135939 in MAIN__
at /tmp/a.f90:3
#4 0x55e1261359b8 in main
at /tmp/a.f90:4
Floating point exception (core dumped)
That way any such invalid operation such as sqrt of negative numbers is caught right when it happens with a nice stacktrace. So I suggest we use NaN in the same way as a way to trigger an exception, assuming it could be done.
So this code:
program test_none
use ieee_arithmetic, only: ieee_quiet_nan, ieee_value
implicit none
real :: nan
real :: a
nan = ieee_value(nan, ieee_quiet_nan)
a = -3
print *, a, nan
a = nan
print *, a, nan
endtriggers the exception at the nan = ieee_value(nan, ieee_quiet_nan) line:
$ ./a.out
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f4b625ca2da in ???
#1 0x7f4b625c9503 in ???
#2 0x7f4b621fcf1f in ???
#3 0x7f4b6275ced8 in ???
#4 0x5558914f6a44 in test_none
at /tmp/a.f90:6
#5 0x5558914f6bba in main
at /tmp/a.f90:2
Floating point exception (core dumped)
So we would not be able to actually assign it to a nan, but if we use it as ieee_value(x, ieee_quiet_nan), then we can assign it when an invalid input is presented to mean and then gfortran should give a nice exception above.
I then tried the following code:
program test_none
implicit none
real, parameter :: NaN = transfer(Z'7FF80000', 1.0)
real :: a
a = -3
print *, a, NaN
a = NaN
print *, a, NaN
a = a + 5
print *, a, NaN
endand that unfortunately will not trigger any exception:
$ gfortran -Wall -Wextra -Wimplicit-interface -g -fcheck=all -fbacktrace -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999 a.f90 && ./a.out
-3.00000000 NaN
NaN NaN
NaN NaN
I don't know how that works, but somehow transfer(Z'7FF80000', 1.0) is different to ieee_value(1.0, ieee_quiet_nan).
Conclusion: from the above analysis, I would recommend the ieee_value(1.0, ieee_quiet_nan) approach. We can perhaps wrap it into a function like get_nan(1.0), or even just nan(1.0). We can provide an alternative implementation of nan(1.0) on compilers that would not support ieee_value(1.0, ieee_quiet_nan).
I had a look in GCC ieee_arithmetic.F90 ieee_value function. Interestingly, they generate the NaN result as:
res = -1
res = srqt(res)So I guess it could be also a way to generate NaN for compilers that do not support ieee_arithmetic. However, It would be pretty old compilers (e.g., older than GFortran 5, ifort 14, ...)
....
elemental real(kind=4) function IEEE_VALUE_4(X, CLASS) result(res)
real(kind=4), intent(in) :: X
type(IEEE_CLASS_TYPE), intent(in) :: CLASS
logical flag
select case (CLASS%hidden)
case (1) ! IEEE_SIGNALING_NAN
if (ieee_support_halting(ieee_invalid)) then
call ieee_get_halting_mode(ieee_invalid, flag)
call ieee_set_halting_mode(ieee_invalid, .false.)
end if
res = -1
res = sqrt(res)
if (ieee_support_halting(ieee_invalid)) then
call ieee_set_halting_mode(ieee_invalid, flag)
end if
case (2) ! IEEE_QUIET_NAN
if (ieee_support_halting(ieee_invalid)) then
call ieee_get_halting_mode(ieee_invalid, flag)
call ieee_set_halting_mode(ieee_invalid, .false.)
end if
res = -1
res = sqrt(res)
if (ieee_support_halting(ieee_invalid)) then
call ieee_set_halting_mode(ieee_invalid, flag)
end if
...Regarding the discussion on the return value for zero-size arrays or masked arrays with no true entries: the behavior for maxval and minval is to return -huge(x). I think this is preferable to NaN because it can be tested for with non-IEEE floating point types. For instance, I know IBM's real(16) is non-IEEE.
Regarding the discussion on the return value for zero-size arrays or masked arrays with no true entries: the behavior for maxval and minval is to return -huge(x). I think this is preferable to NaN because it can be tested for with non-IEEE floating point types. For instance, I know IBM's real(16) is non-IEEE.
When I decide to go down this route, in my codes I usually introduce a parameter (e.g. r_undef = -huge(x)) in a module with all similar control parameters, used across the program. Just to keep the code more clean and avoid "magic numbers".
Regarding the discussion on the return value for zero-size arrays or masked arrays with no true entries: the behavior for maxval and minval is to return -huge(x). I think this is preferable to NaN because it can be tested for with non-IEEE floating point types. For instance, I know IBM's real(16) is non-IEEE.
I don't agree it is preferable to return -huge(x) (or something different than NaN) for zero-size arrays or masked arrrays with no true entries, because these cases are equal to 0./0., which is NaN.
NaN are also returned (by the division of 0./0.) when mask is an array, but for these cases they are dependent on the compiler.
To avoid the dependency on IEEE NaN, would you agree to replace:
res = ieee_value(res, ieee_quiet_nan)by
res = -1
res = sqrt(res)(as implemented in ieee_aritmethic.f90 of gcc)
or
res = 0._dp
res = sum(x, .false.) / res?
Using sqrt could allow the compilation of the code for non-IEEE types.
I would suggest we provide a function get_nan and internally implement it by any of the approaches that were discussed (ieee_value, sqrt or huge). That way we have just one simple place to modify, and the rest of stdlib uses get_nan and thus does not have to be modified. We should discuss a good name for such a function.