Addition of a mask to stdlib_experimental_stats function `mean`

Question

Addition of a mask to stdlib_experimental_stats function `mean`

Closed this issue 6 years ago · 33 comments

jvdp1 commented 6 years ago

Current

(Related to #113)

The current spec for stdlib mean is:

`mean` - mean of array elements

Description

Returns the mean of all the elements of array, or of the elements of array along dimension dim if provided.

Syntax

result = mean(array)

result = mean(array, dim)

Arguments

array: Shall be an array of type integer, or real.

dim: Shall be a scalar of type integer with a value in the range from 1 to n, where n is the rank of array.

Return value

If array is of type real, the result is of the same type as array.
If array is of type integer, the result is of type double precision.

If dim is absent, a scalar with the mean of all elements in array is returned. Otherwise, an array of rank n-1, where n equals the rank of array, and a shape similar to that of array with dimension dim dropped is returned.

Proposal

I would like to propose to add mask into the API of mean as follows (similar to the intrinsic sum):

`mean` - mean of array elements

Description

Returns the mean of all the elements of array, or of the elements of array along dimension dim if provided, and if the corresponding element in mask is true.

Syntax

result = mean(array [, mask])

result = mean(array, dim [, mask])

Arguments

array: Shall be an array of type integer, or real.

dim: Shall be a scalar of type integer with a value in the range from 1 to n, where n is the rank of array.

mask (optional): Shall be of type logical and either be a scalar or an array of the same shape as array.

Return value

If array is of type real, the result is of the same type as array.
If array is of type integer, the result is of type double precision.

If dim is absent, a scalar with the mean of all elements in array is returned. Otherwise, an array of rank n-1, where n equals the rank of array, and a shape similar to that of array with dimension dim dropped is returned.

If mask is specified, the result is the mean of all elements of array corresponding to true elements of mask. If every element of mask is false , the result is zero.

This definition of mask for mean agrees with the definition of mask of the intrinsic sum in the Fortran Standard draft:

p. 429:

MASK (optional) shall be of type logical and shall be conformable with ARRAY.

and

p. 9:

3.36
conformable
⟨of two data entities⟩ having the same shape, or one being an array and the other being scalar

Implementation

Each function for mean, e.g.,

module function mean_1_sp_sp(x) result(res)
    real(sp), intent(in) :: x(:)
    real(sp) :: res
    res = sum(x) / real(size(x, kind = int64), dp)
end function mean_1_sp_sp

should then become:

module function mean_1_sp_sp(x, mask) result(res)
    real(sp), intent(in) :: x(:)
    real(sp) :: res
    logical, intent(in), optional :: mask

    if(.not.optval(mask, .true.))then
        res = 0._dp
	return
    end if

    res = sum(x) / real(size(x, kind = int64), dp)

end function mean_1_sp_sp

module function mean_1_mask_sp_sp(x, mask) result(res)
    real(sp), intent(in) :: x(:)
    real(sp) :: res
    logical, intent(in) :: mask(:)
    res = sum(x, mask) / real(count(mask, kind = int64), dp)
end function mean_1_mask_sp_sp

I kown there was some discussions about dim and if it should be considered as optional or not in the spec. I guess the same discussion could appear here since mask is not optional in one of the 2 functions. However, it will behave like optional for the users through the interface mean.

certik commented 6 years ago

I agree.

Answer 1 · 2020-01-27T18:41:47.000Z

That looks good to me. It would appear as "optional" due to the interface to the users, so I think that's what matters, and the above is just an implementation detail how to actually implement it.

Answer 2 · 2020-01-27T18:42:51.000Z

A larger philosophical question is whether we have to implement mask for every array function that operates on arrays, or only on some?

Answer 3 · 2020-01-27T19:00:20.000Z

A larger philosophical question is whether we have to implement mask for every array function that operates on arrays, or only on some?

Why would you not implement it on everey function? Also, from rank 3 to rank 7 or 15, it would auto-generated.

Answer 4 · 2020-01-27T19:13:05.000Z

The only downside that I can see is that it will be a lot of functions. But I think we want to support the "mask" argument. The other approach would be that somehow the user would run the mask on the array before passing it to sum or mean (NumPy implements masked arrays). But I don't know how to do that efficiently in Fortran, so the mask argument seems appropriate.

Answer 5 · 2020-01-27T19:26:40.000Z

I think it would be good to have the same API as sum and other intrinsic functions.
The users will not see all the functions behind the interface, and I think it can be manageable with auto-generation. It could be only (maybe) tricky for the CI, but we can limit the number of dimensions (currently 4). Also, I don't think the CI should stop us to implement a correct/desired code/API.

Answer 6 · 2020-01-27T22:16:58.000Z

Excellent proposal. I only have a problem with the following statement:

If every element of mask is false , the result is zero.

Is this not dangerous? The user might think that zero is a legitimate mean value. Can't we return (a IEEE?) NaN or something along these lines so the user can catch the exception? All this hopefully without hurting performance.

Answer 7 · 2020-01-27T22:57:27.000Z

Is this not dangerous? The user might think that zero is a legitimate mean value. Can't we return (an IEEE?) NAN or something along these lines so the user can catch the exception? All this hopefully without hurting performance.

Good point. The intrinsic sum returns 0, following the Standard:

Case (ii): The result of SUM (ARRAY, MASK = MASK) has a value equal to a processor-dependent approximation to the sum of the elements of ARRAY corresponding to the true elements of MASK or has the value zero if there are no true elements.

So, for the mean, if mask = .false., it ends up with a division of 0. by 0.. So mean should most likely return NaN. A simple way to return NaN could be:

real(dp)::res
if (.not.optval(mask, .true.) then
    res = res /0._dp
    return
end if

Note that GFortran does not allow expressions like res = 0._dp / 0._dp or res = 1._dp / 0._dp.

Answer 8 · 2020-01-27T23:48:01.000Z

I would be careful with the nans. I usually turn them off in my production codes. Is there any current intrinsic function in Fortran that can return nan? I think returning zero would work. Stopping the program with an error would also work, but the issue is that then mean could not be a pure function...

Answer 9 · 2020-01-28T00:12:27.000Z

I would be careful with the nans. I usually turn them off in my production codes.

I think this resonates in some way with the discussion had about assert. I.e. in debug mode you catch the cases that return NaN and in release mode you can ignore them (at your own risk, trading "safety" with performance).

Still, I think mathematically stands that although the sum of no values is zero, the mean of no values is undefined.

Answer 10 · 2020-01-28T02:25:05.000Z

Having had to deal with Nan in a lot of programming environments, the most reliable way I found may seem surprising, but is

function nan64(value)
use,intrinsic :: iso_fortran_env, only: real64
implicit none
character(len=*),parameter::ident="@(#)M_units:: nan64(3fp): Returns a NAN (Not a number) of type real64"
character(len=3),save :: STRING='NaN'
real(kind=real64) :: nan64,value
   read(STRING,*)nan64
end function nan64

which is one part of a generic nan(value) where the returned value is the type of value that could be auto-generated. Note the string cannot be a parameter, which seems non-intuitive; but internal reads cannot read from a constant. The rest of the generic definition is quite repetitive but (I think) obvious.

It's a long story, but a lot of "standard" ways of defining a NaN are subject to whether the compiler supports the IEEE standard, whether optimizations change the definition or test (I have an is_nan function too), and so on. So far, reading the string "NaN" has proved portable and is supported by the Fortran standard as far back as I could find (if anyone knows when the earliest support by the standard was, feel free to add it here).

Answer 11 · 2020-01-28T05:19:11.000Z

@urbanjost interesting. A few questions: what's the purpose of the ident parameter in your nan64 function? Also, why does it accept value as an argument? It does not seem to be used.

Answer 12 · 2020-01-28T07:37:56.000Z

Still, I think mathematically stands that although the sum of no values is zero, the mean of no values is undefined.

I totally agree with @epagone statement, and I should have thought about it sooner: the mean of no values is undefined, not 0.

Regarding our implementation of mean along a dimension and with a mask being of the same shape as the input x, we have:

module function mean_2_mask_sp_sp(x, dim, mask) result(res)
    real(sp), intent(in) :: x(:,:)
    logical, intent(in) :: mask(:,:)
    integer, intent(in) :: dim
    real(sp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim ))

    select case(dim)
     case(1)
       res = sum(x, 1, mask) / real(count(mask, 1), sp)
     case(2)
       res = sum(x, 2, mask) / real(count(mask, 2), sp)
     case default
       call error_stop("ERROR (mean): wrong dimension")
    end select

end function mean_2_mask_sp_sp

This implementation may return an array res with numbers and NaN as means.

If we aim to return 0.0 (or any other value different thanNaN (even for the NaN proposed by @urbanjost) for the following case:

module function mean_2_all_sp_sp(x, mask) result(res)
    real(sp), intent(in) :: x(:,:)
    logical, intent(in), optional :: mask
    real(sp) :: res

    if (.not.optval(mask, .true.)) then
        res = ??? 0._sp / value / @urbanjost NaN????
        return
    end if

    res = sum(x) / real(size(x, kind = int64), sp)

end function mean_2_all_sp_sp

then, for consistency across all mean functions, we will have to catch the possible compiler NaN in mean_2_mask_sp_sp(x, dim, mask) and replace them by ??? value / @urbanjost NaN????, leading to complicate and less efficient code.

Probably, IMHO, the best way to be consistent with mean_2_mask_sp_sp(x, dim, mask) (and to keep this latter function simple and efficient) would be something like:

module function mean_2_all_sp_sp(x, mask) result(res)
    real(sp), intent(in) :: x(:,:)
    logical, intent(in), optional :: mask
    real(sp) :: res

    if (.not.optval(mask, .true.)) then
        res = sum(x, .false.) / count(.false.)
        return
    end if

    res = sum(x) / real(size(x, kind = int64), sp)

end function mean_2_all_sp_sp

With such an implementation, NaN could be still ignore in release mode (while not the case with @urbanjost approach, right?).

Answer 13 · 2020-01-28T07:51:48.000Z

I would be careful with the nans. I usually turn them off in my production codes. Is there any current intrinsic function in Fortran that can return nan?

The intrinsic function fraction(x) may return NaN if x is an IEEE NaN or Infinity. So it is quite a special case.

I think returning zero would work. Stopping the program with an error would also work, but the issue is that then mean could not be a pure function...

I suggest to avoid to return 0 (or any other value, or @urbanjost NaN), at least for consistenty with other functions of mean; see previous comment).
I would argue the same thing about stopping the program with an error, because it would mean that when mask is an array and the mean is computed along a dimension, we would need to check that there is no NaN in the array result of the function mean, or that there is no true element in the mask along a dimension. Such checks would penalize the efficiency of the function (in addition to not be able to be pure).

Answer 14 · 2020-01-28T14:33:55.000Z

FYI: Regarding questions on nan64(3f) function: I extracted the function to generate a NaN from a generic function, where VALUE was used merely to select the kind of the NaN. The bit pattern of NaN is dependent on the KIND, the unused value was just used to determine kind.

The IDENT variable is a whole other discussion, not really pertinent to the Nan function per-se,
but really a whole other topic.

The IDENT is used like a C #ident directive. I (still) use metadata like that to automatically build indexes of .code. This particular syntax is often optimized away as an unused variable but is still useful in the source. I use a preprocessor that handles some of the issues with such data to keep it in the binaries.

Didn't mean to cause confusion; the basic idea is the most reliable way to generate a NaN that I have found is to do an internal read of the string 'NaN'.

The IDENT stuff is very useful for QA and configuration control and automated documentation but is not well supported by Fortran. Many systems even include a place for metadata in executables but it does not seem to be used a lot anymore; but I need it for a variety of reasons. I would ask for support of metadata strings as an extension to Fortran but (surprisingly to me) even #ident is not well supported in C.

Lets me automatically build an index like
source index and
get information from executable files like

   what slice.exe
   slice.exe:

    PRODUCT:        GPF library utilities and examples
    PROGRAM:        slice(1)
    DESCRIPTION:    display a set of curves as slices with a 3D view
    VERSION:        1.1, 20190326
    AUTHOR:         John S. Urban
    HOME PAGE:      http://www.urbanjost.altervista.org/index.html
    COMPILED:       Fri, Jan 24th, 2020 3:57:19 PM

I have my own what(1) command. It used to be nearly ubiquitous on Unix systems but came with SCCS, which has largely been superseded by mg, git, ...

Anyway, ignore the IDENT string, with the exception that maybe there should be some kind of discussion of metadata and Fortran.

The nan(3f) and is_nan(3f) functions are in M_units.f90 in my GPF github collection

Answer 15 · 2020-01-28T16:18:52.000Z

@urbanjost I see, thanks for the background.

@jvdp1 I think it's probably fine to return NaN, as I make them raise a runtime exception if a NaN is used in my production codes.

Answer 16 · 2020-01-28T16:22:12.000Z

Is there any current intrinsic function in Fortran that can return nan?

sqrt and log of a negative number both return a nan.

In my view, the only reasonable value to return in this case is a nan. It's also behavior consistent with numpy.

If intrinsic modules like ieee_arithmetic or ieee_features don't provide a named nan and Inf constants, stdlib should. This is a perfect use case for stdlib_experimental_constants.

Answer 17 · 2020-01-28T18:13:23.000Z

@milancurcic With ieee_arithmetic, we could get IEEE NaN as follows (ieee_nan in ieee_features is private:

use ieee_arithmetic, only: ieee_value, ieee_quiet_nan
implicit none
real::res

res = ieee_value(res, ieee_quiet_nan)

This implementation is also not kind-dependent.

The only issue with IEEE is that it may not supported by all compilers since it is Fortran 2003. However, this is not a problem with preprocessing (there is already a flag for Fortran 90 in the fypp files).

So, would be a solution like the following one acceptable for stdlib::mean, while thinking how to implement constants like NaN, Inf,...:

#:if VERSION90
res = 0._sp
res = res / res
#:else
res = ieee_value(res, ieee_quiet_nan)
#:endif

The IEEE NaN would be returned if the compiler support ieee_arithmetic; otherwise it would return a NaN compatible with what other functions with an array mask would return when all elements are false.

Answer 18 · 2020-01-28T20:13:16.000Z

I've used the transfer method for defining NaN for while now. Is there any reason why this is a bad idea?
I'm not sure how portable this approach is.
Both gfortran 9 and ifort 19 print "NaN'. I'm not sure what happens with older version of the compilers.

    real(sp) :: NaN = transfer(Z'7FF80000', 1.0)
    real(dp) :: NaN64 = transfer(Z'7FF0000000000001', 1._dp)

Answer 19 · 2020-01-28T20:27:44.000Z

I've used the transfer method for defining NaN for while now.

It is probably a way to implement that in a module constant #99 . We will probably need to differentiate quiet and signaling NaN.

From Intel ieee_arithmetic.f90:

      REAL(4), PARAMETER :: FOR_S_SNAN         = TRANSFER((/ Z'7FA00000' /),1.0_4)
      REAL(4), PARAMETER :: FOR_S_QNAN         = TRANSFER((/ Z'7FC00000' /),1.0_4)

Answer 20 · 2020-01-28T21:10:50.000Z

I see nothing wrong with using transfer. You can also assign binary literals, for example for 32-bit values:

real, parameter :: nan = b'11111111110000000000000000000000'
real, parameter :: neginf = b'11111111100000000000000000000000'
real, parameter :: posinf = b'01111111100000000000000000000000'

However, if ieee_arithmetic is implemented on most compilers, we should just do that for simplicity, as it should work for any kind and provides quiet/signaling NaN for us, as Jeremie pointed out.

Answer 21 · 2020-01-28T23:58:17.000Z

So, it seems that everybody agrees that NaN should be returned when mean is equal to 0./0..

Currently, and until (if) NaN is implemented inside stdlib, I proposed to implement NaN as:

res = ieee_value(res, ieee_quiet_nan)

Both ieee_value and ieee_quiet_nan are provided by ieee_arithmetic. ieee_arithmetic is supported by GFortran 5 and higher and Intel 14 and higer. Do we aim to support older compilers? Do other compilers support IEEE modules (I believe so, at least for the most recent versions)?
If it is needed, preprocessing can be implemented to avoid IEEE (see comment), but I think it would be beter to avoid that solution for this NaN case.

Answer 22 · 2020-01-29T00:09:59.000Z

I think it's reasonable to expect F2003 features to work. IEEE arithmetic features seem to be supported by most compilers, their recent versions that is, see table on page 12: https://www.fortranplus.co.uk/app/download/30202489/fortran_2003_2008_2018_compiler_support.pdf

Answer 23 · 2020-01-30T00:15:06.000Z

So are we just going to assume all real kinds on all target compilers and platforms are IEEE compliant? Or at least have a concept of "nan" that resembles IEEE's?

This is relevant to #118 as well.

Answer 24 · 2020-01-30T01:11:11.000Z

I would simply test it out (see #15). And if it works everywhere, then use it. Note that "inf" are disabled if "-ffast-math" is used, which is the default in my production codes. But otherwise stdlib can use them I think, just like sqrt(-1.) returns a NaN. I tested this simple code:

real :: a
a = -3
print *, sqrt(a)
end

with:

gfortran -O3 -march=native -ffast-math a.f90

and it prints:

NaN

So I think the NaN might work even with -ffast-math. Note that in Debug mode, I use the following options:

gfortran -Wall -Wextra -Wimplicit-interface -g -fcheck=all -fbacktrace -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999 a.f90

and then the code above gives an exception:

$ ./a.out 

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f8837f5f2da in ???
#1  0x7f8837f5e503 in ???
#2  0x7f8837b91f1f in ???
#3  0x55e126135939 in MAIN__
	at /tmp/a.f90:3
#4  0x55e1261359b8 in main
	at /tmp/a.f90:4
Floating point exception (core dumped)

That way any such invalid operation such as sqrt of negative numbers is caught right when it happens with a nice stacktrace. So I suggest we use NaN in the same way as a way to trigger an exception, assuming it could be done.

Answer 25 · 2020-01-30T01:15:34.000Z

So this code:

program test_none
use ieee_arithmetic, only: ieee_quiet_nan, ieee_value
implicit none
real :: nan
real :: a
nan = ieee_value(nan, ieee_quiet_nan)
a = -3
print *, a, nan
a = nan
print *, a, nan
end

triggers the exception at the nan = ieee_value(nan, ieee_quiet_nan) line:

$ ./a.out 

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f4b625ca2da in ???
#1  0x7f4b625c9503 in ???
#2  0x7f4b621fcf1f in ???
#3  0x7f4b6275ced8 in ???
#4  0x5558914f6a44 in test_none
	at /tmp/a.f90:6
#5  0x5558914f6bba in main
	at /tmp/a.f90:2
Floating point exception (core dumped)

So we would not be able to actually assign it to a nan, but if we use it as ieee_value(x, ieee_quiet_nan), then we can assign it when an invalid input is presented to mean and then gfortran should give a nice exception above.

Answer 26 · 2020-01-30T01:20:05.000Z

I then tried the following code:

program test_none
implicit none
real, parameter :: NaN = transfer(Z'7FF80000', 1.0)
real :: a
a = -3
print *, a, NaN
a = NaN
print *, a, NaN
a = a + 5
print *, a, NaN
end

and that unfortunately will not trigger any exception:

$ gfortran -Wall -Wextra -Wimplicit-interface -g -fcheck=all -fbacktrace -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999 a.f90 && ./a.out 
  -3.00000000                  NaN
              NaN              NaN
              NaN              NaN

I don't know how that works, but somehow transfer(Z'7FF80000', 1.0) is different to ieee_value(1.0, ieee_quiet_nan).

Answer 27 · 2020-01-30T01:23:11.000Z

Conclusion: from the above analysis, I would recommend the ieee_value(1.0, ieee_quiet_nan) approach. We can perhaps wrap it into a function like get_nan(1.0), or even just nan(1.0). We can provide an alternative implementation of nan(1.0) on compilers that would not support ieee_value(1.0, ieee_quiet_nan).

Answer 28 · 2020-01-30T09:21:47.000Z

I had a look in GCC ieee_arithmetic.F90 ieee_value function. Interestingly, they generate the NaN result as:

res = -1
res = srqt(res)

So I guess it could be also a way to generate NaN for compilers that do not support ieee_arithmetic. However, It would be pretty old compilers (e.g., older than GFortran 5, ifort 14, ...)

ieee_arithmetic.F90:

....
 elemental real(kind=4) function IEEE_VALUE_4(X, CLASS) result(res)

    real(kind=4), intent(in) :: X
    type(IEEE_CLASS_TYPE), intent(in) :: CLASS
    logical flag

    select case (CLASS%hidden)
      case (1)     ! IEEE_SIGNALING_NAN
        if (ieee_support_halting(ieee_invalid)) then
           call ieee_get_halting_mode(ieee_invalid, flag)
           call ieee_set_halting_mode(ieee_invalid, .false.)
        end if
        res = -1
        res = sqrt(res)
        if (ieee_support_halting(ieee_invalid)) then
           call ieee_set_halting_mode(ieee_invalid, flag)
        end if
      case (2)     ! IEEE_QUIET_NAN
        if (ieee_support_halting(ieee_invalid)) then
           call ieee_get_halting_mode(ieee_invalid, flag)
           call ieee_set_halting_mode(ieee_invalid, .false.)
        end if
        res = -1
        res = sqrt(res)
        if (ieee_support_halting(ieee_invalid)) then
           call ieee_set_halting_mode(ieee_invalid, flag)
        end if
...

Answer 29 · 2020-02-07T14:34:55.000Z

Regarding the discussion on the return value for zero-size arrays or masked arrays with no true entries: the behavior for maxval and minval is to return -huge(x). I think this is preferable to NaN because it can be tested for with non-IEEE floating point types. For instance, I know IBM's real(16) is non-IEEE.

Answer 30 · 2020-02-07T17:17:01.000Z

Regarding the discussion on the return value for zero-size arrays or masked arrays with no true entries: the behavior for maxval and minval is to return -huge(x). I think this is preferable to NaN because it can be tested for with non-IEEE floating point types. For instance, I know IBM's real(16) is non-IEEE.

When I decide to go down this route, in my codes I usually introduce a parameter (e.g. r_undef = -huge(x)) in a module with all similar control parameters, used across the program. Just to keep the code more clean and avoid "magic numbers".

Answer 31 · 2020-02-07T20:13:39.000Z

Regarding the discussion on the return value for zero-size arrays or masked arrays with no true entries: the behavior for maxval and minval is to return -huge(x). I think this is preferable to NaN because it can be tested for with non-IEEE floating point types. For instance, I know IBM's real(16) is non-IEEE.

I don't agree it is preferable to return -huge(x) (or something different than NaN) for zero-size arrays or masked arrrays with no true entries, because these cases are equal to 0./0., which is NaN.
NaN are also returned (by the division of 0./0.) when mask is an array, but for these cases they are dependent on the compiler.

To avoid the dependency on IEEE NaN, would you agree to replace:

res = ieee_value(res, ieee_quiet_nan)

by

res = -1
res = sqrt(res)

(as implemented in ieee_aritmethic.f90 of gcc)

or

res = 0._dp
res = sum(x, .false.) / res

?

Using sqrt could allow the compilation of the code for non-IEEE types.

Answer 32 · 2020-02-07T22:18:00.000Z

I would suggest we provide a function get_nan and internally implement it by any of the approaches that were discussed (ieee_value, sqrt or huge). That way we have just one simple place to modify, and the rest of stdlib uses get_nan and thus does not have to be modified. We should discuss a good name for such a function.

Current

mean - mean of array elements

Description

Syntax

Arguments

Return value

Proposal

mean - mean of array elements

Description

Syntax

Arguments

Return value

`mean` - mean of array elements

`mean` - mean of array elements