Alexander-Barth/NCDatasets.jl

64-bit offset NetCDF and unlimited dimensions


This code, using a 64-bit offset NetCDF file with an unlimited dimension, fails with current master:

using NCDatasets
using Test
sz = (4,5,3)
filename = tempname()

NCDataset(filename,"c",format=:netcdf3_64bit_offset) do ds
    ds.dim["lon"] = sz[1]
    ds.dim["lat"] = sz[2]
    ds.dim["time"] = Inf

    T = Float32
    v = defVar(ds,"var-$T",T,("lon","lat","time"))
    j = 1
    v[:,:,j] = fill(T(j), sz[1:2])
end

This is the stack trace:

julia> include("/home/abarth/.julia/dev/NCDatasets/test/test_variable_unlim2.jl");
ERROR: LoadError: ArgumentError: Chunk sizes must be strictly positive
Stacktrace:
  [1] RegularChunks
    @ ~/.julia/packages/DiskArrays/QlfRF/src/chunks.jl:26 [inlined]
  [2] #24
    @ ~/.julia/packages/DiskArrays/QlfRF/src/chunks.jl:133 [inlined]
  [3] map (repeats 3 times)
    @ ./tuple.jl:318 [inlined]
  [4] DiskArrays.GridChunks(a::Tuple{Int64, Int64, Int64}, chunksize::Tuple{Int64, Int64, Int64}; offset::Tuple{Int64, Int64, Int64})                              
    @ DiskArrays ~/.julia/packages/DiskArrays/QlfRF/src/chunks.jl:132
  [5] GridChunks
    @ ~/.julia/packages/DiskArrays/QlfRF/src/chunks.jl:131 [inlined]
  [6] #GridChunks#18
    @ ~/.julia/packages/DiskArrays/QlfRF/src/chunks.jl:130 [inlined]
  [7] DiskArrays.GridChunks(a::NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, chunksize::Tuple{Int64, Int64, Int64})                                                  
    @ DiskArrays ~/.julia/packages/DiskArrays/QlfRF/src/chunks.jl:130
  [8] eachchunk(v::NCDatasets.Variable{Float32, 3, NCDataset{Nothing}})
    @ NCDatasets ~/.julia/dev/NCDatasets/src/variable.jl:407
  [9] _writeblock!(::NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, ::Array{Float32, 3}, ::Base.OneTo{Int64}, ::Vararg{AbstractVector})                            
    @ DiskArrays ~/.julia/packages/DiskArrays/QlfRF/src/batchgetindex.jl:187
 [10] writeblock!(::NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, ::Array{Float32, 3}, ::Base.OneTo{Int64}, ::Vararg{AbstractVector})                            
    @ NCDatasets ~/.julia/packages/DiskArrays/QlfRF/src/batchgetindex.jl:213
 [11] setindex_disk!(::NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, ::Matrix{Float32}, ::Function, ::Vararg{Any})                                                   
    @ DiskArrays ~/.julia/packages/DiskArrays/QlfRF/src/diskarray.jl:57
 [12] setindex!
    @ ~/.julia/packages/DiskArrays/QlfRF/src/diskarray.jl:229 [inlined]
 [13] setindex!(::CommonDataModel.CFVariable{Float32, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Nothing, Tuple{}, Vararg{Nothing, 5}}}}, ::Matrix{Float32}, ::Colon, ::Colon, ::Int64)                                                                                  
    @ CommonDataModel ~/.julia/dev/CommonDataModel/src/cfvariable.jl:419
 [14] (::var"#29#30")(ds::NCDataset{Nothing})
    @ Main ~/.julia/dev/NCDatasets/test/test_variable_unlim2.jl:18
 [15] NCDataset(::var"#29#30", ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, Symbol, Tuple{Symbol}, NamedTuple{(:format,), Tuple{Symbol}}})                 
    @ NCDatasets ~/.julia/dev/NCDatasets/src/dataset.jl:255
 [16] top-level scope
    @ ~/.julia/dev/NCDatasets/test/test_variable_unlim2.jl:7
 [17] include(fname::String)
    @ Base.MainInclude ./client.jl:478
 [18] top-level scope
    @ REPL[36]:1
in expression starting at /home/abarth/.julia/dev/NCDatasets/test/test_variable_unlim2.jl:7

I am using DiskArrays 0.3.22. Note that 64-bit offset NetCDF files are always unchunked. The error does not appear with NetCDF4 files or with NCDatasets 0.12.
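For reference, the reported chunking can be inspected directly (a small sketch; chunking is the NCDatasets accessor around nc_inq_var_chunking discussed below):

using NCDatasets
NCDataset(filename) do ds
    v = ds["var-Float32"].var       # underlying Variable, without CF transformations
    @show NCDatasets.chunking(v)    # for NetCDF-3 this reports the variable as contiguous
end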

@tcarion or @rafaqz, do you have any ideas?

My attempt to reproduce this with NetCDF.jl fails with a different error:

import NetCDF
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
lon_dim = NetCDF.NcDim("lon",1:3);
v = NetCDF.NcVar("obs",[lon_dim,time_dim]);
NetCDF.create(fname2,v; mode=NetCDF.NC_64BIT_OFFSET) do nc
   nc["obs"][:,1:3] = rand(3,3)
end;

which produces "Not a valid data type or _FillValue type mismatch".

rafaqz commented

Hmm

@meggart recently added a check to avoid broken chunks (zero or negative size)

https://github.com/meggart/DiskArrays.jl/blob/7d297f8b4a0067f7c93cb84e8c2d42b88754e54e/src/chunks.jl#L26
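
The guard is roughly the following (a simplified sketch, not the actual DiskArrays source; the struct name and fields here are assumptions):

# Sketch of the strictly-positive check behind chunks.jl:26
struct RegularChunksSketch
    cs::Int       # chunk size along one axis
    offset::Int   # offset of the first chunk
    s::Int        # total axis length
    function RegularChunksSketch(cs, offset, s)
        cs > 0 || throw(ArgumentError("Chunk sizes must be strictly positive"))
        new(cs, offset, s)
    end
end

With an unlimited dimension of current length 0, the default chunk size along that axis is 0 and the constructor throws.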

This now errors explicitly instead of allowing an object with broken chunks. It is good to have that guarantee in most cases, except yours, where you are trying to break the chunks!

I did warn that the chunks would break if you used setindex! like this ;)

But I don't know what to do besides removing the safety check and letting you have broken chunks. I do have a simple solution that will take a few hours to implement, but you know all about that...

I think the problem in this issue is not that difficult.
For a NetCDF 3 file, the NetCDF library always reports a variable as unchunked, and the chunk size defaults to the initial array size (here, for example, (4,5,0)). But technically, for variables with unlimited dimensions the data is actually chunked in NetCDF 3: each record, i.e. one slice along the unlimited dimension, is stored as a contiguous block (see e.g. https://www.unidata.ucar.edu/software/netcdf/workshops/2007/performance/FileFormat.html).
Simply replacing any zero in the chunk size by one (in NCDatasets) gives the actual chunk size on disk and may solve this issue.

See: https://github.com/Alexander-Barth/NCDatasets.jl/pull/232/files
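
In code, the workaround amounts to something like this (a hedged sketch, not the exact diff of that PR):

# A NetCDF-3 record variable reports a chunk size with a zero along the
# unlimited dimension; replacing each zero by one matches the record layout.
reported  = (4, 5, 0)                      # as reported for the example above
effective = map(s -> max(s, 1), reported)  # -> (4, 5, 1): one record per chunk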

rafaqz commented

Well, seems slightly hacky, but if it works!

Yes, it seems quite hacky. I am wondering if it is actually correct for the netcdf library to report variables with unlimited dimensions as contiguous while they are saved in a non-contiguous manner on disk.

Ref:
Unidata/netcdf-c#2224

rafaqz commented

Is it possible to check for unlimited variables manually when you load a file, and effectively ignore what netcdf reports?

Yes, this is certainly possible, but it could cause some confusion. Maybe NCDatasets.chunking should just report what nc_inq_var_chunking says, but for DiskArrays (and NCDatasets.getchunksize) we would use the chunk size that corresponds to the contiguous data blocks on file (for performance).
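
A sketch of that split (only the names chunking, eachchunk and GridChunks come from this thread and the stack trace; the body is an assumption):

# Keep `chunking` reporting the raw library answer; derive the effective
# on-disk chunks only for the DiskArrays integration.
function eachchunk(v)
    _, reported = chunking(v)              # e.g. (:contiguous, (4, 5, 0)) for NetCDF-3
    cs = map(s -> max(s, 1), reported)     # one record per chunk along unlimited dims
    return DiskArrays.GridChunks(size(v), cs)
end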

rafaqz commented

Yes, that makes sense: the effective chunk pattern is different from the official chunk pattern.

Resolved in a0ca027.