CyNetDiff Submission
eliotwrobson opened this issue · 54 comments
Submitting Author: Name (@eliotwrobson)
All current maintainers: (@eliotwrobson, @abhishekumrawal)
Package Name: CyNetDiff
One-Line Description of Package: A performance-focused library implementing algorithms for simulating network diffusion processes, written in Cython.
Repository Link: https://github.com/eliotwrobson/CyNetDiff
Version submitted: v0.1.13
EIC: @Batalex
Editor: @sneakers-the-rat, @lwasser
Reviewer 1: @Kai-Striega
Reviewer 2: @jeremyzhangsq
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD
Code of Conduct & Commitment to Maintain Package
- I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
- I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.
Description
- Include a brief paragraph describing what your package does:
Network diffusion processes aim to model the spread of information through social networks, represented using graphs. Experimental work involving these models usually involves simulating these processes many times over large graphs, which can be computationally very expensive. At the same time, being able to conduct experiments using a high-level language like Python is helpful to researchers, as this gives greater flexibility in developing research software. To address both of these concerns, CyNetDiff is a Cython module implementing the independent cascade and linear threshold models, two of the most popular network diffusion models. Development has been focused on performance, while still giving an intuitive, high-level interface to assist in research tasks.
Scope
Please indicate which category or categories.
Check out our package scope page to learn more about our
scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- Data retrieval
- Data extraction
- Data processing/munging
- Data deposition
- Data validation and testing
- Data visualization ¹
- Workflow automation
- Citation management and bibliometrics
- Scientific software wrappers
- Database interoperability
Domain Specific
- Geospatial
- Education
Community Partnerships
If your package is associated with an
existing community please check below:
- Astropy: My package adheres to Astropy community standards
- Pangeo: My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook
For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Who is the target audience and what are scientific applications of this package?
This is aimed at researchers working in areas related to network diffusion and influence maximization, and specifically at optimizing the most computationally expensive part of this process. This should enable researchers to conduct experiments on larger graphs than would be possible with a pure-Python package. For a recent work doing experiments that fit the use cases of this package, see https://arxiv.org/abs/2207.08937
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
There is a previous package filling a similar use case called ndlib: https://github.com/GiulioRossetti/ndlib
Our package differs in that it was developed with a focus on performance, and with less emphasis on visualization
and flexibility (for example, we do not have a way of defining custom models). Using code compiled with Cython
allows our package to handle much larger graphs than are possible with a pure-Python package like ndlib.
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
Continuing off of the discussion there, the main goals of the review for me are to finalize the interface and add related features / utility code to aid usability. In terms of concrete features, the algorithms already implemented are enough for a full release.
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
- does not violate the Terms of Service of any service it interacts with.
- uses an OSI approved license.
- contains a README with instructions for installing the development version.
- includes documentation with examples for all functions.
- contains a tutorial with examples of its essential functions and uses.
- has a test suite.
- has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.
Publication Options
Is it possible to add the paper submission after this review is completed? We would like the content of the paper to incorporate any functionality that gets added during the review process.
- Do you wish to automatically submit to the Journal of Open Source Software? If so:
JOSS Checks
- The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- The package is deposited in a long-term repository with the DOI:
Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
- Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.
Confirm each of the following by checking the box.
- I have read the author guide.
- I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.
Please fill out our survey
- Last but not least please fill out our pre-review survey. This helps us track
submission and improve our peer review process. We will also ask our reviewers
and editors to fill this out.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
The editor template can be found here.
The review template can be found here.
Review issues
- eliotwrobson/CyNetDiff#22
- eliotwrobson/CyNetDiff#21
- eliotwrobson/CyNetDiff#25
- eliotwrobson/CyNetDiff#26
- eliotwrobson/CyNetDiff#27
Footnotes
-
Please fill out a pre-submission inquiry before submitting a data visualization package. ↩
Editor in Chief checks
Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.
Please check our Python packaging guide for more information on the elements below.
- Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
- The package imports properly into a standard Python environment with import package.
- Fit The package meets criteria for fit and overlap.
- Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
- User-facing documentation that overviews how to install and start using the package.
- Short tutorials that help a user understand how to use the package and what it can do for them.
- API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
- Core GitHub repository Files
- README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
- Code of Conduct The package has a CODE_OF_CONDUCT.md file.
- License The package has an OSI approved license.
NOTE: We prefer that you have development instructions in your documentation too.
- Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
- Automated tests Package has a testing suite and is tested via a Continuous Integration service.
- Repository The repository link resolves correctly.
- Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
- Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
- Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
- Initial onboarding survey was filled out
We appreciate each maintainer of the package filling out this survey individually. 🙌
Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌
Editor comments
I think the documentation is a little bare currently to get started with the review. I'd like to see the information you wrote in this submission issue in the project repository itself. For example:
Network diffusion processes aim to model the spread of information through social networks, represented using graphs. Experimental work involving these models usually involves simulating these processes many times over large graphs, which can be computationally very expensive. At the same time, being able to conduct experiments using a high-level language like Python is helpful to researchers, as this gives greater flexibility in developing research software. To address both of these concerns, CyNetDiff is a Cython module implementing the independent cascade and linear threshold models, two of the most popular network diffusion models. Development has been focused on performance, while still giving an intuitive, high-level interface to assist in research tasks.
This is a nice elevator pitch to rework and include in the README.
I'd like to see more content in the documentation as well. The API section is great, but the basic example lacks context and does not really show case what CyNetDiff is great for. I want to share with you this resource: Diátaxis framework.
I am not asking that you follow the full framework to the letter, and it would be a huge task to do so, but I think there is some virtue in thinking about the kind of documentation you aim to showcase.
The documentation is the first thing a potential user will take a look at. As such, it needs to quickly guide the reader to:
- what does it do? (= should I try it?)
- how does it work? (= ok, I think this is the tool I need, how do I proceed?)
Thank you for the feedback! I think those changes are reasonable and will significantly enhance the package, I'll try and expand the documentation by Monday so we can get the ball rolling on the review before too long 🚀
@Batalex I have just published a release with the following changes:
- Updated the elevator pitch and example in the README
- Added two longer extended examples to the documentation site
Thanks, this looks great. From what I can tell, we are good to go. I'll get started on finding the perfect editor for this review 🐈⬛
@Batalex I know we don't have an editor yet, but is it too early to begin suggesting reviewers? We've already found someone that might be a good candidate for this.
@Batalex I know we don't have an editor yet, but is it too early to begin suggesting reviewers? We've already found someone that might be a good candidate for this.
That's fantastic!
Could you please orient them to our signups form? https://forms.gle/GHfxvmS47nQFDcBM6
For sure! I'll be sure to email this person as well, but this is @jeremyzhangsq
Shiqi has agreed to review and filled out the form!
@Batalex just wanted to check back in and see if there was anything we could do to help find a suitable editor for this. We're not in a huge rush or anything, but we want to submit this package to a conference before too long.
Hello @eliotwrobson,
Unfortunately, I do not have an editor at the moment. There are two reasons for that: first, we are so successful that our current editors have their hands full (let me brag a little). Second, the domain is more niche than others, so we (or more specifically I) cannot rely on personal connections or previous involvement to find someone with relevant expertise. What would be your target conference? I'll see what I can do to help accelerate the process.
@Batalex Thank you for your detailed response. We don't quite have a target conference in mind, but we're starting more work on a longer version of what will be our paper for this. I mainly asked because we may be able to help search for an editor as well, depending on what the preferred qualifications are. For example, Shiqi has published papers related to this, and might be able to serve, depending on how heavy the workload is, and how difficult it would be to find two other reviewers.
Regarding this general area, anyone who has worked on projects related to influence maximization or even just general graph theory might know enough to get the ball rolling.
I can edit this, although I just was the editor for the automata package and maybe the authors want a break from me ;)
@sneakers-the-rat we would greatly appreciate you serving as editor 🙂
Happy to do it, i'll be able to get started searching for reviewers tmrw :). (i'll let the EiC do the formal assignment)
i'll let the EiC do the formal assignment
Thanks a lot @sneakers-the-rat and thanks to @Batalex who did the hard work as EIC here!
Happy review to all!
Awesome! @sneakers-the-rat just to fill you in, we've already identified a reviewer (@/jeremyzhangsq) who has papers published in this area. So the other reviewer doesn't necessarily need to be a subject area expert, and I think it would be appropriate to have someone with more knowledge of Cython packaging / best practices, since that was the part I was most unfamiliar with when writing this package.
hi there team! i'm just checking in on this review to see the status. it looks like there's been no activity for almost 3 months. Is there anything that i can do to support moving things forward?
Thanks for checking in @lwasser ! We found one reviewer and are looking for another
can we help @eliotwrobson @sneakers-the-rat with the second reviewer? if you have one domain-specific review, we can pull a second from our signup who knows about packaging and can test for usability. i just updated the label to seeking-reviewers as well. please let me know how we can move things forward. I want to support everyone here!
Also thank you for the speedy reply and to everyone here that's already invested in this review ✨
@lwasser your assessment is correct, the first reviewer we have is knowledgeable about the domain, so I think having a reviewer focused on packaging / usability / documentation is the way to go 👍🏽
hey @eliotwrobson @sneakers-the-rat I think I found a second reviewer for this @Kai-Striega . 🚀 Kai if you are still able to review this package with a focus on the Cython side of things, we'd appreciate it!
@lwasser I'm happy to. I should have time this weekend. If I haven't replied by Monday please remind me.
@eliotwrobson the plan is to spend a couple of hours on an informal review tomorrow (informal as I'm still waiting to do the pyOpenSci induction). I've spent some time (~30 mins) this morning building the project from scratch (aka with Cython) in case there are any build problems. Seems to build so far.
ok let's kick off this review!
Below is the editor template that we need to get started with.
Editor response to review:
Editor comments
👋 Hi @Kai-Striega @jeremyzhangsq ! Thank you for volunteering to review
for pyOpenSci! It's so awesome to have you here supporting our open source / open science community 🚀
To begin there are a few items to complete:
Please fill out our pre-review survey
Before beginning your review, can you both please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey--it will only be used in an aggregated format by our Executive Director to improve our processes and programs.
- @Kai-Striega survey completed.
- @jeremyzhangsq survey completed.
Thank you in advance for doing this!
Review templates
The following resources will help you complete your review:
- Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
- Here is the review template that you will need to fill out and submit
here as a comment once your review is complete.
Let's shoot for a review deadline of 8 November 2024. Please get in touch with any questions that you have!! Have an awesome weekend y'all!! 🚀
Reviewers:
Due date: 8 November 2024.
YIKES sorry for being AWOL on this one, i couldn't remember if this was one i agreed to do or one that i waved off because i am a little slammed at the moment.
@lwasser You're pretty busy with fall festival stuff atm, would you like me to pick this up again?
@sneakers-the-rat please don't apologize! It's so understandable!
If you can take this over that would be wonderful!!! Thank you Jonny! 😊
Thanks all for getting the ball rolling! I'm not sure how much @/jeremyzhangsq checks GitHub, so I'll go ahead and ping him again via email.
A quick couple of questions @lwasser /@sneakers-the-rat:
- as previously discussed, my review has focused primarily on the Cython aspects of the project. Do you require me to review the remaining points on the review template too? I'm happy to do this, but I need to know whether it's required.
- I've looked through the package and written plenty of notes but haven't typed them into the review template yet. As this is my first time reviewing, would we be able to organise a mentoring session to ensure I do a great job?
@Kai-Striega 👋🏻 i just saw your discord dm's! So sorry it took me a few days to respond. I answered you there, please do go through the entire template. I'm happy to either provide a mentor session OR to get someone else in our community to do so as well. You're also in great hands here with @sneakers-the-rat so i don't want to step on any toes more than I already have!! so please just let me know how I can best support everyone here 🚀
@eliotwrobson please see my review below. I'm happy to expand/talk about any of the points I have made.
@sneakers-the-rat / @lwasser I haven't received a mentor yet, but had the available time to do it today. Please let me know if I am required to change/update anything.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need clearly stating problems the software is designed to solve and its target audience in README.
- Installation instructions: for the development version of the package and any non-standard dependencies in README.
- Instructions provided, however I had difficulty following them. See comments below.
- Vignette(s) demonstrating major functionality that runs successfully locally.
- Vignette in README and examples/ directory
- Function Documentation: for all user-facing functions.
- Examples for all user-facing functions.
- No examples for user facing documentation page
- No examples in models.pyi file with rest of documentation
- Community guidelines including contribution guidelines in the README or CONTRIBUTING.
- Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.
Readme file requirements
The package meets the readme requirements below:
- Package has a README.md file in the root directory.
The README should include, from top to bottom:
- The package name
- Badges for:
- Continuous integration and test coverage,
- Docs building (if you have a documentation website),
- A repostatus.org badge,
- Python versions supported,
- Current package version (on PyPI / Conda).
- PyPI supported and badged
- conda not yet available, however this package seems appropriate for conda
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)
- Short description of package goals.
- Package installation instructions
- Any additional setup required to use the package (authentication tokens, etc.)
- Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- Link to examples/ directory missing in README
- Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
- Vignettes/examples are sufficient
- Link to your documentation website.
- If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
- Not provided, have not done a lit review to check if this is required.
- Citation information
Usability
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider whether:
- Package documentation is clear and easy to find and use.
- The need for the package is clear
- All functions have documentation and associated examples for use
- Functions/methods have documentation, but not examples in the documentation.
- The package is easy to install
- See comments in reviewer comments about wheels and required C++ compiler
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software been confirmed.
- Performance: Any performance claims of the software been confirmed.
- No performance claims made
- Benchmarks appear fairly ad-hoc.
- Benchmarks provided as part of the examples, should be own directory.
- asv (or another benchmarking framework) could be used to produce more reliable benchmarks
- Automated tests:
- All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- Tests cover essential functions of the package and a reasonable range of inputs and conditions.
- Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
- Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
A few notable highlights to look at:
- Package supports modern versions of Python and not End of Life versions.
  - Missing CPython 3.13 support, however GitHub declares Requires: Python >=3.9
  - Missing free-threaded Python support; this is not unusual as the project heavily uses Cython, which does not officially support free-threaded Python yet
- Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  - Passes with ruff locally
  - Passes in CI for ruff and mypy
  - As this project utilizes Cython heavily, a tool such as cython-lint is recommended
For packages also submitting to JOSS
- The package has an obvious research application according to JOSS's definition in their submission requirements.
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md matching JOSS's requirements with:
- A short summary describing the high-level functionality of the software
- Authors: A list of authors with their affiliations
- A statement of need clearly stating problems the software is designed to solve and its target audience.
- References: With DOIs for all those that have one (e.g. papers, datasets, software).
Final approval (post-review)
- The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing: 8
Review Comments
I have several review comments below. These are ordered by importance.
Requires a C++ compiler to build/lack of wheels
Wheels are provided only for CPython 3.11 on Linux. This requires everyone else to compile the project from source. Compiling is fairly simple, but it adds an extra step and extra requirements that can be avoided. My recommendation is to provide wheels for all officially supported Python versions on all major platforms.
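To make this concrete, below is a minimal sketch of a GitHub Actions job using cibuildwheel, the PyPA tool commonly used for exactly this. The action version pin and workflow layout here are illustrative assumptions, not taken from the project's actual CI:

```yaml
# Hypothetical workflow sketch: build wheels for all supported CPython
# versions across the three major platforms with cibuildwheel.
jobs:
  build_wheels:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: pypa/cibuildwheel@v2.19   # version pin is an assumption
      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: wheelhouse/*.whl
```

cibuildwheel handles provisioning each CPython version and invoking the compiler inside appropriate build environments (e.g. manylinux containers), so users installing from PyPI no longer need a C++ toolchain.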
Cython code generation
The project currently caches the file src/cynetdiff/models.cpp explicitly. I usually recommend not shipping the .c/.cpp files directly. There are several problems with shipping .c/.cpp files:
- They are not guaranteed to be forward compatible with future CPython C-API or Cython versions. The CPython C-API seems to be changing fairly frequently at the moment.
- It's possible that Cython-generated code will not work on certain platforms, e.g. on ARM if generated for x86
This is echoed by the official Cython user guide
it is not recommended to include generated files in source distributions. Instead, require Cython at build-time to generate the C/C++ files
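In practice this means declaring Cython as a build-time dependency so the .pyx files are cythonized on the build machine. A minimal sketch, assuming a setuptools-based backend (adjust to whatever build backend the project actually uses):

```toml
# pyproject.toml sketch: require Cython at build time instead of
# committing the generated models.cpp to the repository.
[build-system]
requires = ["setuptools", "Cython"]
build-backend = "setuptools.build_meta"
```

With this in place, `pip install` of an sdist regenerates the C++ from the .pyx sources using whatever Cython version satisfies the constraint, sidestepping the forward-compatibility concerns above.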
Pseudo random number generation
The package currently uses C's rand function. From models.pyx:

```cython
from libc.stdlib cimport rand, RAND_MAX

cdef double RAND_SCALE = 1.0 / RAND_MAX

cdef inline double next_rand() nogil:
    return rand() * RAND_SCALE
```

There are several issues with rand:
- There are no guarantees about the algorithm, and therefore the quality of the randomness, used to generate random numbers
- It uses global state which can lead to problems with multithreaded applications. While the current project doesn't use any parallelism, this could be significant if CyNetDiff-1 is implemented.
My recommendation is to adopt a more suitable random number generation engine. In the Cython world there are two ways I can recommend of doing this:
1. C++'s <random> header via the Cython bindings
2. Adopt NumPy's random number generation. This is officially documented and an example can be seen here
Note that (2) would also allow you to be compliant with SPEC-7.
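At the Python-API level, the SPEC-7 pattern is to accept either a Generator, a seed, or None and normalize with `np.random.default_rng` (which passes an existing Generator through unchanged). A small sketch of that calling convention, with `simulate` as a hypothetical stand-in for one of the package's entry points:

```python
import numpy as np


def simulate(n_trials, *, rng=None):
    """Run Bernoulli trials with a caller-controlled RNG (SPEC-7 style).

    ``rng`` may be None (fresh OS entropy), an int seed (reproducible),
    or an existing numpy Generator (shared state across calls).
    """
    rng = np.random.default_rng(rng)  # normalizes all three cases
    return int(np.count_nonzero(rng.random(n_trials) < 0.5))


# Same seed -> identical results; this is what rand() cannot offer cleanly.
a = simulate(1000, rng=42)
b = simulate(1000, rng=42)
```

Because the Generator is explicit, callers who later parallelize can give each worker its own Generator (e.g. via `SeedSequence.spawn`) instead of fighting over rand's hidden global state.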
Minor Cython related improvements
Note these points are minor and should not be considered blocking.
Use of array.array
The package uses array.array both for type hints and Cython typing in several places. Consider using Cython's memoryview instead. Memoryviews have the advantage of accepting anything that implements the Buffer Protocol, such as NumPy arrays, which I suspect the package's target audience uses far more frequently than array.array.
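The benefit is easy to demonstrate in pure Python, where `memoryview()` plays the same gatekeeper role a typed Cython memoryview parameter would (`total` here is an illustrative function, not part of CyNetDiff):

```python
from array import array

import numpy as np


def total(buf):
    """Sum the elements of any object exposing the Buffer Protocol.

    A plain-Python analogue of a Cython function taking a typed
    memoryview: array.array and NumPy arrays both satisfy it.
    """
    mv = memoryview(buf)
    return sum(mv)


arr = array("d", [1.0, 2.0, 3.0])
np_arr = np.array([1.0, 2.0, 3.0])
```

A function typed against array.array would reject `np_arr` outright; a memoryview-based signature accepts both without copying.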
Missing potential bounds checks when running cythonize
```text
[1/1] Cythonizing src/cynetdiff/models.pyx
warning: src/cynetdiff/models.pyx:97:51: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:107:35: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:135:39: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:137:38: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:141:34: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:261:39: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:263:45: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:264:34: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:268:38: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:284:52: Use boundscheck(False) for faster access
```

- It may be possible to improve performance by disabling bounds checking in these situations.
- I have not verified if this actually improves performance in these particular circumstances
Review resizing array.array
```cython
for i in range(m):
    influence_arr.append(1.0 / in_degrees[self.edges[i]])
```

This computation resizes influence_arr multiple times, leading to multiple memory allocations. It would be more efficient to use .extend rather than .append.
Use of name mangling
Starting a name with __ introduces name mangling in Python. In C/C++, identifiers beginning with double underscores are reserved for the compiler's internal use. Name mangling can make life difficult. CyNetDiff uses name mangling at several points (e.g. __activation_suceeds and __advance_model). I'd be careful with these.
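For readers unfamiliar with the mechanism, here is what mangling does in practice; `Model` and `__advance` are hypothetical stand-ins, not CyNetDiff's classes:

```python
class Model:
    def __advance(self):
        # A leading-double-underscore name inside a class body is
        # rewritten by the compiler to _Model__advance.
        return "step"


m = Model()

# The attribute exists only under its mangled name from outside the class.
mangled_ok = hasattr(m, "_Model__advance")
plain_missing = not hasattr(m, "__advance")
```

This is why mangled names complicate subclassing, testing, and debugging: external code (and subclasses) must either avoid the name or spell out the mangled form.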
@Kai-Striega thank you for your detailed comments! I have a couple of questions, but I'll try to get started addressing some of these items soon.
Requires a C++ compiler to build/lack of wheels
Do you know of any documentation that has discussion of how to do this? I agree it is much more convenient to not need a compiler installed, but I wasn't sure how to set this up.
Cython code generation
This makes sense to me, I could swear I read the opposite somewhere but I must have been mistaken.
Pseudo random number generation
Thank you for the references! I had actually thought about whether there was a standard way to handle the random number generation. I think I will follow option (2) and add the SPEC-7 compliance.
Use of array.array
Is there a recommended type annotation that expresses what works with the Cython memoryview? It would be nice to add this annotation along with support for numpy arrays.
@Kai-Striega thank you for your detailed comments! I have a couple of questions, but I'll try to get started addressing some of these items soon.
You're welcome.
Feel free to message if you have any more questions. I'll try answer your questions below.
Do you know of any documentation that has discussion of how to do this? I agree it is much more convenient to not need a compiler installed, but I wasn't sure how to set this up.
AFAIK, there isn't much. You could look at how SciPy builds their wheels for nightly builds. This should include all the info you need. I'm happy to help out with this, but might be a bit slow at the moment.
Note that SciPy's CI is far more complex than I would expect for this project.
This makes sense to me, I could swear I read the opposite somewhere but I must have been mistaken.
This isn't cut and dried. In the past, it was advised to do the opposite. This has changed over the last decade or so. Some people still prefer to include the .c/.cpp files. However, this isn't recommended anymore.
> Thank you for the references! I had actually thought about whether there was a standard way to handle the random number generation. I think I will follow option (2) and add the SPEC-7 compliance.
You're welcome :)
> Is there a recommended type annotation that expresses what works with the Cython memoryview? It would be nice to add this annotation along with support for numpy arrays.
Unfortunately, not that I've come across. The problem is that Cython memoryviews accept anything that supports the buffer protocol, which could include custom objects.
My usual tactic is to use ArrayLike with a Python level wrapper, check if the value is scalar and then convert it to an array. This isn't great.
An alternative option would be to create your own Protocol with all the buffer protocol methods.
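To make the wrapper tactic concrete, here is a hedged stdlib-only sketch (the function name is made up for illustration): check for a scalar, then coerce to an `array.array` of doubles, which a Cython memoryview typed `double[:]` would accept.

```python
import array


def as_weight_array(weights) -> array.array:
    """Coerce a scalar or a sequence of numbers into a C double array.

    The result supports the buffer protocol, so a Cython memoryview
    typed double[:] could view it without copying.
    """
    if isinstance(weights, (int, float)):
        # promote a lone scalar to a one-element sequence
        weights = [weights]
    return array.array("d", weights)


print(list(as_weight_array(0.5)))        # [0.5]
print(list(as_weight_array([1, 2, 3])))  # [1.0, 2.0, 3.0]
```

The same idea works with `numpy.asarray` and an `ArrayLike` annotation if NumPy is already a dependency.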
@Kai-Striega I've been able to implement most of the above changes without too many issues. I'm starting to take a look at the SPEC-7 documentation and the examples you linked. Would it be better to provide client code with a wrapper class like the one you linked here that then gets passed to the model class that executes simulations? Or should client code directly pass in a numpy generator? From the timing code in the notebook you linked, it seems like it would be much faster to expose this wrapper class to client code, rather than having a numpy generator passed in.
I'm at work right now, and can't respond in the length I'd like. I'll edit this over the next few days to expand on my thoughts.
TLDR: Neither. SPEC-7 allows for a couple of different options.
Here is the officially recommended way:

```python
from collections.abc import Sequence

import numpy as np

SeedLike = int | np.integer | Sequence[int] | np.random.SeedSequence
RNGLike = np.random.Generator | np.random.BitGenerator


def my_func(*, rng: RNGLike | SeedLike | None = None):
    """My function summary.

    Parameters
    ----------
    rng : `numpy.random.Generator`, optional
        Pseudorandom number generator state. When `rng` is None, a new
        `numpy.random.Generator` is created using entropy from the
        operating system. Types other than `numpy.random.Generator` are
        passed to `numpy.random.default_rng` to instantiate a `Generator`.
    """
    rng = np.random.default_rng(rng)
    ...
```

I would use this as a wrapper for your Cython functions.
Short update that I've addressed most of the items from the review comments in a new release (v0.1.14) besides the rng seeding that is still being discussed and adding more builds. Not quite ready for a re-review until we address the above. Also emailed Shiqi, and it looks like he is traveling now but will be back to review in early November 👍🏽
Catching up - here are the things i see raised in the review:
making a checklist mostly for tidiness' sake, not trying to list out requirements - tell me which of these are taken care of/still outstanding
- docs: examples
- statement of need: we don't have a hard statement of need requirement, and i don't like novelty tests so it's not that, but it would be nice to have some statement for the pyos site that puts this package in context with other packages - "similar to x except y, etc." kind of stuff
- packaging: wheels for multiple platforms (looks ongoing)
- packaging: versioning generated artifacts (looks like this is done?)
- function: use of c's rand function (looks like this is ongoing)
- function: name mangling
Thanks for the detailed review :) I learned some things about cython here too.
If it would be helpful, you might consider opening up the remaining items as issues that link back to this one just to keep track of what's left todo/use issues as a cheap form of threading.
Re: building for multiple platforms, this sounds like something we would want to have in the packaging guide, so if you figure something out, maybe we put instructions there?
@sneakers-the-rat thanks for writing that out. I've removed all the instances of name mangling I could find, and you are correct that removing generated artifacts is done. For the statement of need, would you like something written that can go on the pyos site, or should it also be added to the library documentation? A simple description is that this package is a faster version of ndlib.
> Re: building for multiple platforms, this sounds like something we would want to have in the packaging guide, so if you figure something out, maybe we put instructions there?
While this is noble, I think a unified approach to building will be hard, especially when extension modules are involved. It will depend on which languages are used, which platforms you build for, how complicated you want to get, and so on...
In particular, I feel that Poetry doesn't handle extension modules particularly well. There is SPIN, which much of the scientific Python ecosystem seems to be standardising on; however, this would require a complete rewrite of CyNetDiff's infrastructure.
> A simple description is that this package is a faster version of ndlib.
If you're going to make claims that it's faster, it would be great to show it. I think this should be required if you make performance claims.
> While this is noble, I think a unified approach to building will be hard
Fair! i was thinking of it as a starting point for a more general guide, but let's leave that aside so we don't muddy this review :) Moved that over here: pyOpenSci/python-package-guide#422
> I think this should be required if you make performance claims.
Agreed that we should be able to show a benchmark to claim this. @eliotwrobson to answer your earlier question, yes this would be for the pyos website, where ideally the goal is that we build out the website so that it's possible to see a package and its 'neighbors,' "the family of tools that do things like x," and much like how open reviews on eLife allow readers to know the context of a paper and how people in that field think of it, it would be great (but not required) to provide that same context for software. "why this and not that? etc." If benchmarks are not in reach at the moment, a simple "see also: ndlib" would be fine
For the benchmark, some light benchmarks are included in this example notebook: https://github.com/eliotwrobson/CyNetDiff/blob/main/examples/visualizations.ipynb
I'm not sure if there's a more standard way of doing this? We also include some discussion of these benchmarks in our paper: https://www.vldb.org/pvldb/vol17/p4409-umrawal.pdf
Regarding building, although changing / documenting something to work with Poetry is probably not the most productive given what others have said, I think having some example related to this in the packaging guide would be extremely useful. I had a very hard time even figuring out how to get started writing a package that used Cython, and this would be extremely useful to me on another project I want to get started on before too long.
If you have published benchmarks already, then that's good enough for me for a claim like that - we can just cite them :)
Also am i reading this right, that an identical operation took 14s with ndlib and 0.067s with cynetdiff? that's.... certainly a perf improvement. well done.
Also i forgot to say earlier, i was asking for a context statement for the website but i also generally (i.e. outside of the purposes of the review) think they are good to have in documentation. If you think you've improved on something, I don't think it's disrespectful to compare yourself to it esp on something objective like 'the time it takes to perform an identical operation,' and it's helpful to everyone to have better linking between what's out there. One example i love is how the first two sections of the dask dataframes 'best practices' page are "use pandas" (and then clarifications about when you would want to use dask)
> For the benchmark, some light benchmarks are included in this example notebook: https://github.com/eliotwrobson/CyNetDiff/blob/main/examples/visualizations.ipynb
I'm happy with the benchmarks as-is for the purposes of a formal review, though I think there is room for improvement.
> I'm not sure if there's a more standard way of doing this? We also include some discussion of these benchmarks in our paper: https://www.vldb.org/pvldb/vol17/p4409-umrawal.pdf
There are tools out there to do this for you. The one that seems to be favoured by NumPy/SciPy (where I do most of my FOSS stuff) is airspeed velocity (asv). There are several other tools also available, but I don't know too much about them.
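For context, an asv benchmark is typically just a class whose `time_*` methods asv discovers and times repeatedly; a hedged sketch (class and method names are illustrative, not CyNetDiff's actual benchmarks) might look like:

```python
class DiffusionSuite:
    """asv collects attributes named time_* and reports their runtimes."""

    def setup(self):
        # runs before each timed call; build the input once
        self.nodes = list(range(10_000))

    def time_simulate(self):
        # the body being timed; a stand-in for a diffusion simulation
        sum(self.nodes)
```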
> Regarding building, although changing / documenting something to work with Poetry is probably not the most productive given what others have said, I think having some example related to this in the packaging guide would be extremely useful. I had a very hard time even figuring out how to get started writing a package that used Cython, and this would be extremely useful to me on another project I want to get started on before too long.
I've tried building Cython projects with Poetry and actually gave up. I found it very hard/confusing, so just getting your package to build should be considered an achievement. My go to for building Cython projects is now SPIN. This uses Meson which has an inbuilt Cython Module.
> I've tried building Cython projects with Poetry and actually gave up. I found it very hard/confusing, so just getting your package to build should be considered an achievement. My go to for building Cython projects is now SPIN. This uses Meson which has an inbuilt Cython Module.
@sneakers-the-rat I think that even this simple recommendation alone would be a useful thing to add to the packaging guide 👍🏽
Regarding the remaining review items, I need to refactor another feature for this package to implement something for a paper I'm working on. I think the other reviewer is going to be able to take a look soon, so I'll wait for his comments before attacking the rest of these items.
Great - if it would be possible to open issues for each of the requested changes linking back to this issue (either of you), then I (or you) can make a list in the OP of completed/outstanding work to do so that the second reviewer can know whats already been raised/is being addressed and they can focus their review elsewhere, or comment on those ongoing issues!
@sneakers-the-rat just made associated issues and linked them back here (they are the top 4 in the current issues list).
@eliotwrobson check this out :D
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need clearly stating problems the software is designed to solve and its target audience in README.
- Installation instructions: for the development version of the package and any non-standard dependencies in README.
- Vignette(s) demonstrating major functionality that runs successfully locally.
- Function Documentation: for all user-facing functions.
- Examples for all user-facing functions.
- Community guidelines including contribution guidelines in the README or CONTRIBUTING.
- Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a `pyproject.toml` file or elsewhere.
Readme file requirements
The package meets the readme requirements below:
- Package has a README.md file in the root directory.
The README should include, from top to bottom:
- The package name
- Badges for:
- Continuous integration and test coverage,
- Docs building (if you have a documentation website),
- A [repostatus.org](https://www.repostatus.org/) badge,
- Python versions supported,
- Current package version (on PyPI / Conda).
- Yes for PyPI
NOTE: If the README has many more badges, you might want to consider using a table for badges: [see this example](https://github.com/ropensci/drake). Such a table should be more wide than high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)
- Short description of package goals.
- Package installation instructions
- Any additional setup required to use the package (authentication tokens, etc.)
- Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
- Link to your documentation website.
- If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
- The authors have compared this package with NDlib, a widely accepted library for network diffusion. This package is more efficient (see the PVLDB paper).
- Citation information
Usability
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider whether:
- Package documentation is clear and easy to find and use.
- The need for the package is clear
- All functions have documentation and associated examples for use
- The package is easy to install
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software have been confirmed.
- Performance: Any performance claims of the software have been confirmed.
- Although the authors provided the benchmark code, I suggest the authors give some key results in the README.
- Automated tests:
- All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- Tests cover essential functions of the package and a reasonable range of inputs and conditions.
- Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
- Packaging guidelines: The package conforms to the pyOpenSci [packaging guidelines](https://www.pyopensci.org/python-package-guide).
A few notable highlights to look at:
- Package supports modern versions of Python and not [End of life versions](https://endoflife.date/python).
- Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
For packages also submitting to JOSS
- The package has an obvious research application according to JOSS's definition in their [submission requirements](http://joss.theoj.org/about#submission_requirements).
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md matching [JOSS's requirements](http://joss.theoj.org/about#paper_structure) with:
- A short summary describing the high-level functionality of the software
- Authors: A list of authors with their affiliations
- A statement of need clearly stating problems the software is designed to solve and its target audience.
- References: With DOIs for all those that have one (e.g. papers, datasets, software).
Final approval (post-review)
- The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing: 6 hours
Review Comments
I have to admit that this project really surprised me. Previously, the IM community mainly used C++ for implementation for the sake of efficiency. However, this can cause inconvenience for real-world development and deployment (from my own experience). Thus, I believe this project will help our community a lot. Thanks to other reviewers’ comments, I can quickly catch up with the progress. Overall, this project meets my standards except for two minor points to improve.
- I suggest showing some key evaluation results in the README, such as how fast CyNetDiff can be compared to NDlib.
- To adapt other variants of the IC and LT models in the future, I suggest providing arguments for initial node thresholds and edge weights in:

```python
def linear_threshold(
    G: t.Union[nx.Graph, nx.DiGraph], seeds: t.Iterable[int], steps: int = 0
) -> t.List[t.List[int]]:
```
An example of an LT variant that requires customized thresholds and weights is this:
- Nicola Barbieri and Francesco Bonchi. 2014. Influence maximization with viral product design. In ICDM.
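To illustrate what such customization could look like, here is a minimal pure-Python sketch of the linear threshold process with user-supplied per-node thresholds and per-edge weights (all names are hypothetical, not CyNetDiff's actual API): a node activates once the summed weight of its active in-neighbors reaches its threshold.

```python
def linear_threshold_sketch(nodes, in_neighbors, seeds, thresholds, weights, max_steps=100):
    """Run a linear threshold simulation with custom thresholds and weights.

    nodes: iterable of node ids
    in_neighbors: dict node -> iterable of predecessor nodes
    seeds: initially active nodes
    thresholds: dict node -> activation threshold in [0, 1]
    weights: dict (u, v) -> weight of edge u -> v
    Returns the list of nodes newly activated at each step.
    """
    active = set(seeds)
    layers = [sorted(active)]
    for _ in range(max_steps):
        newly = [
            v
            for v in nodes
            if v not in active
            and sum(weights[(u, v)] for u in in_neighbors.get(v, ()) if u in active)
            >= thresholds[v]
        ]
        if not newly:  # no activations this step: the process has stabilized
            break
        active.update(newly)
        layers.append(sorted(newly))
    return layers


# A 3-node path 0 -> 1 -> 2: node 1 activates (0.6 >= 0.5), but node 2's
# incoming weight (0.4) falls short of its threshold (0.5).
layers = linear_threshold_sketch(
    nodes=[0, 1, 2],
    in_neighbors={1: [0], 2: [1]},
    seeds=[0],
    thresholds={0: 0.5, 1: 0.5, 2: 0.5},
    weights={(0, 1): 0.6, (1, 2): 0.4},
)
print(layers)  # [[0], [1]]
```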
Thanks to both reviewers for the feedback! Have a few more urgent things to take care of but will try to get through some of these changes over the upcoming holiday 👍🏽
EDIT: A little behind schedule (whoops), but will get to everything mentioned here before too long.
Finally had a chance to go over things again, and I managed to get a PR for the most significant item so far (SPEC7 compliance): eliotwrobson/CyNetDiff#28
@Kai-Striega if you have time, could you glance at this PR briefly? No need to spend too much time looking through the internals, I just want to double check that we're doing everything as expected in the spec.
As for some of the other comments, I think that we have a sufficient amount of customization in the model creation for now (edge weights are accepted as arguments by both models, and adding custom thresholds for the linear threshold model won't give a performance boost over pure Python unless we adopt a custom model of randomly generating thresholds in each simulation). So the only remaining items are the following:
- More examples in the documentation.
- ASV / something showing more current benchmarks.
- Releasing wheels on multiple platforms (I probably won't do this because I can't figure out how).
@eliotwrobson I'll take a look, hopefully today.
> Releasing wheels on multiple platforms (I probably won't do this because I can't figure out how).
I think having wheels for multiple platforms is important. If you don't supply them, the user has to figure out how to compile your project on their own. i.e. you're moving the burden of having to figure out how to compile it on their platform from yourself to each individual user.
I'm happy to help with this a bit if you need it, basically for NumPy/SciPy we build it as a CI job. You can see how NumPy does it by looking in the .github/workflows/ directory.
> I'm happy to help with this a bit if you need it, basically for NumPy/SciPy we build it as a CI job. You can see how NumPy does it by looking in the .github/workflows/ directory.
@Kai-Striega Thank you! This would be greatly appreciated 🙏🏽 🙏🏽 I'd be open to switching the build system as well if that is necessary, I'm just currently not sure how to set up automated publishing with respect to this.
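For reference, wheel-building CI of this kind is commonly done with cibuildwheel; a hypothetical minimal GitHub Actions workflow (names, versions, and paths here are illustrative, not CyNetDiff's actual config) might look like:

```yaml
name: wheels
on: [push]

jobs:
  build_wheels:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # cibuildwheel builds wheels for each supported CPython on this OS
      - uses: pypa/cibuildwheel@v2.19
      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: wheelhouse/*.whl
```

A separate job would then download the artifacts and publish to PyPI on tagged releases.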
Update on the status of things here. I've made a pass and resolved all but one of the outstanding requested changes from this review (the remaining is the distribution of wheels for other platforms, which I'll be handling next). The latest released version (0.1.16) has the following changes:
- Examples for all of the user-facing functions in the API, in addition to the extended examples present already: https://eliotwrobson.github.io/CyNetDiff/api/
- Benchmarks run through `pytest-benchmark` on GitHub Actions for the main branch. The comparison here is done with NDlib and there is a link to the latest benchmark run from the main branch in the README. Not the most sophisticated benchmarks in the world, but appropriate given the size of the project (ASV was ultimately too much overhead when there are only really a couple of tasks we care about benchmarking here).
- Support for randomized activation and other utility functions to support additional use cases.
The wheel distribution will take a little more time than everything else, but I'll get to that as soon as I can. However, I think everything else is in a good spot 👍🏽