zarr-developers/community

Organizing the Community

jakirkham opened this issue · 10 comments

Am opening this issue to discuss how best to organize the community around Zarr/N5. Basically think things have been going well and adoption has been great. Would like us to organize a bit so we can be a bit more effective and especially accessible to new people. There are a few main areas that keep coming up in different issues and conversations. Will lay them out in the following paragraphs and organize them a bit. Sorry if it is a bit rambly.

There are a few different implementations that have emerged that we are aware of. Some support Zarr, some support N5, some support both (list below). Should add that a C implementation is being discussed ( constantinpape/z5#68 ) (Fortran and many others could be built off this) and a JavaScript implementation may be possible via Rust ( https://github.com/zarr-developers/zarr/issues/289 ). If I missed any, please let me know and we can add them here. May also be worthwhile to get this into a doc page.

Name     Format    Language     URL
Zarr     Zarr      Python       ref
N5       N5        Java         ref
z5       N5, Zarr  C++, Python  ref
rust-n5  N5        Rust         ref
Zarr.jl  Zarr      Julia        ref

First off, not everyone was aware of these other implementations. Through discussion and outreach we were able to discover them. We will likely discover even more people interested in this problem, who are trying to solve it in some way in their own language. So it would be good if we can start doing more outreach. IOW talks, meetups, etc. would be good to start familiarizing more people with this growing standard. As we connect more people, this only grows further.

Second, if Zarr = N5, we have at least 5 language implementations (though potentially more if we include the discussions above). Right now Zarr ≈ N5, mainly via Z5 and some other efforts to move the two closer. Putting effort towards bringing these two specifications closer together is desirable for the growth of the community going forward. There is some great discussion in issue ( https://github.com/zarr-developers/zarr/issues/231 ), including some important observations about where they differ. Also some issues were raised about how to get these to converge. Already work has been done and some plans made. Help on this effort is definitely appreciated.

Third, as our community grows to include more implementations, more use cases, and more developers, our decision model needs to grow to better address these needs. Along these lines it would be quite helpful to start work on Governance and a Code of Conduct. We may also want to consider an Enhancement Proposal model to better handle and discuss the needs of this community. We may also consider applying for NumFOCUS sponsorship, which we could use to fund testing with different cloud storage platforms, sponsor sprints, and generally help spread the word.

Thoughts and feedback on this are welcome.

I thought it might be useful to ping @WardF of Unidata, leader of the netCDF team.

When Ward presented at the recent pangeo meeting, he mentioned that netCDF is considering building its next-generation library on top of zarr. So they will probably be quite interested in issues such as multi-language support and project sustainability. Also, if they go that route, we might see zarr contributors from within Unidata emerge.

My 2c, I would very much enjoy seeing the community around zarr and n5 grow and gain from working together, and also help to build a broad, open community around approaches to storage of ND array data generally.

Zarr is getting great exposure in the scientific Python community particularly via Pangeo, but it would be great to have some talks specifically on zarr and/or cloud/distributed storage of ND array data more generally. My own ability to travel is very limited at the moment, but I would be very happy for others to submit talks on zarr or related areas and I'd be happy to help make slides etc.

I think it would also be very timely to discuss the decision model. This is particularly important where changes to the storage spec and/or public API are being considered, where issues can be technically complex and many possible solutions can be under discussion (a good example is #276).

Zarr has always been intended as an open, hackable vehicle for exploring and experimenting with new approaches to ND array storage. I wouldn't want to lose this spirit or dynamism, especially as we gain more experience with cloud storage, and more people are getting under the hood and trying things out. But at the same time I think there does need to be a decision process.

Some form of Enhancement Proposal model would be worth considering IMO, at least for proposals that affect the storage spec. It might even be worth considering breaking the storage spec out into a separate github repo, and having different decision processes for the spec and the Python implementation, e.g., where any proposed change to the spec requires approval from a representative of all active implementations.

I'd also support approaching NumFOCUS. I know funds are likely to be very limited, but every little helps, even if it is just to support outreach/community-building.

👍 to NumFOCUS. I think zarr is a good candidate.

Another idea I'll throw into the mix here: I think a technical peer-reviewed paper about zarr could have a lot of value. As we make inroads with different scientific communities (genomics, climate, astronomy), we encounter stakeholders who place a lot of value in the peer-reviewed scientific literature. Being able to point to a paper could help increase adoption. @alimanfoo has basically already written it. This could be submitted to a software-focused journal like JOSS or IEEE Computing in Science & Engineering.

I really like the discussion points here. In the Xarray and Pangeo projects, we've recently gone down the governance and code of conduct paths so we can offer some experience in those areas.

I also think moving the zarr spec to its own repository is a really good idea - this will help break the idea that zarr is a Python thing.

We could start zarr as a NumFOCUS Affiliated Project. The point of entry there is quite a bit easier, and then we could look into fiscal sponsorship when that seems necessary.

A proper journal article would also go a long way towards the academic community's recognition of this project.

I'm wondering if there would be appetite for a semi-regular telecon, say bi-monthly, with at least core developers trying to attend, but maybe also open to anyone? It might be useful to review and discuss a roadmap, and work towards a common view on development priorities for the next release. It also might be an opportunity to discuss actions we could take towards supporting the community, e.g., specific people we could reach out to, or planning for a workshop or BOF at a conference. Any thoughts? Other ways we could increase coordination of efforts?

I think a telecon is a good idea. Other things that might help community building:

  • gitter channel
  • code of conduct
  • make contributor guide more prominent on github

See #305 for a PR adding a code of conduct.

@alimanfoo and/or @jakirkham - you would need to create the gitter channel under the zarr-developers name.

Just want to say thanks for all of your thoughts here. They all sound like good ideas.

Also thanks for putting the code of conduct together, @jhamman.

Guess the next step along these lines is a governance document. Would allow us to hash out our decision model as well. This could be valuable to do with core now or it could be good to include some of our key stakeholders (future core members?). Also can be done through a different medium (conference call, email, etc.) or here based on preference. Thoughts on this welcome.

At least from my perspective, joining NumFOCUS makes a number of things easier. Funding is only one of them. Others revolve around interfacing with other orgs or companies (e.g. access to other experienced professionals, non-profit status useful for tax write-offs, opportunities to collaborate with other projects, formal organizational structure, etc.). Starting as a NumFOCUS affiliate project is a great way to go. That worked well for us at conda-forge. Happy to try that here.

Guess the other point that came up recently is whether we should join NumFOCUS or create/join some other YTBD holding organization (like HDFGroup, though not specifically suggesting we join HDFGroup). This could be another good option and deserves discussion. Have no concrete thoughts on it at this time.

There are a bunch of other great points raised in this thread. Excited to discuss these with you. Have moved them into their own issues/threads that are xref'd above so they can be discussed separately. If I missed anything like this, please feel free to raise an issue and xref it to make it easy to find from here.

The "registry" of implementations may be a candidate for a wiki page.