Pandera: A flexible and expressive pandas data validation library.
cosmicBboy opened this issue · 46 comments
Submitting Author: Niels Bantilan (@cosmicBboy)
All current maintainers: (@cosmicBboy)
Package Name: pandera
One-Line Description of Package: validate the types, properties, and statistics of pandas data structures
Repository Link: https://github.com/unionai-oss/pandera
Version submitted: 0.1.5
Editor: @lwasser
Reviewer 1: @mbjoseph
Reviewer 2: @xmnlab
Archive: https://github.com/pandera-dev/pandera/releases/tag/v0.2.3
Version accepted: v0.2.3
Date Accepted: 10/10/2019
Description
`pandas` data structures can hide a lot of information, and explicitly
validating them at runtime in production-critical or reproducible research
settings is a good idea for building reliable data transformation pipelines.
`pandera` enables users to:
- Check the types and properties of columns in a `DataFrame` or values in a `Series`.
- Perform descriptive and inferential statistical validation, e.g. two-sample t-tests.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
`pandera` provides a flexible and expressive API for performing data validation
on tidy (long-form) and wide data to make data processing pipelines more
readable and robust.
Scope
- Please indicate which category or categories this package falls under:
- Data retrieval
- Data extraction
- Data munging
- Data deposition
- Reproducibility
- Geospatial
- Education
- Data visualization*
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
- Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Data munging: the package makes ETL, data analysis, and data processing
pipelines more robust and reliable by providing users with tools to validate
assumptions about the schema and statistical properties of datasets.
This package supports validation on long (tidy) data and wide data.
Reproducibility: This package enables users to validate `DataFrame` or `Series`
objects at runtime or as unit/integration tests, and can easily be integrated
into existing pipelines using the `check_input` and `check_output` decorators.
It also supports collaboration and reproducible research by programmatically
enforcing assertions made about the statistical properties of a dataset in
addition to making it easier to review pandas code in production-critical
contexts.
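As a rough illustration of the decorator idea described above, here is a minimal standard-library sketch. It is not pandera's implementation: `validate_input`, `validate_output`, `require_columns`, and `ValidationError` are hypothetical names, and rows are plain dicts rather than a `DataFrame`.

```python
from functools import wraps


class ValidationError(ValueError):
    """Raised when a table fails a check (hypothetical name, not a pandera class)."""


def require_columns(rows, required):
    """Check that every row (a dict) has each required key with the expected type."""
    for i, row in enumerate(rows):
        for name, expected_type in required.items():
            if name not in row:
                raise ValidationError(f"row {i}: missing column {name!r}")
            if not isinstance(row[name], expected_type):
                raise ValidationError(
                    f"row {i}: column {name!r} is not {expected_type.__name__}"
                )
    return rows


def validate_input(required):
    """Decorator factory: validate the first positional argument before the call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(rows, *args, **kwargs):
            require_columns(rows, required)
            return fn(rows, *args, **kwargs)
        return wrapper
    return decorator


def validate_output(required):
    """Decorator factory: validate the return value after the call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            return require_columns(fn(*args, **kwargs), required)
        return wrapper
    return decorator


@validate_input({"height_cm": float})
@validate_output({"height_cm": float, "height_m": float})
def add_height_m(rows):
    # The pipeline step itself stays free of validation logic.
    return [{**row, "height_m": row["height_cm"] / 100} for row in rows]


result = add_height_m([{"height_cm": 180.0}])
```

The point of the pattern is that the assumptions about a function's input and output sit next to its definition, so an existing pipeline step can be wrapped without changing its body.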
- Who is the target audience and what are scientific applications of this package?
The target audience of `pandera` consists of data scientists, data engineers,
machine learning engineers, and machine learning scientists who use `pandas` in
their data processing pipelines for various purposes, e.g., transforming data
for reporting, analytics, model training, and data visualization. This tool is
built on top of `pandas` and `scipy` to provide a user-friendly interface for
explicitly specifying the set of properties that a `DataFrame` or `Series` must
fulfill in order to be considered valid. Since `pandera` makes no assumptions
about the domain of study or contents of these `pandas` data structures, it
could be used in a wide variety of quantitative fields that involve the
analysis of tabular data.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
There are a few alternatives to pandera in the Python ecosystem and here
is how they compare:
- https://github.com/alecthomas/voluptuous
- not specific to pandas, applies to JSON/YAML etc.
- very flexible and reasonably simple
- no decorators, hypothesis or sophisticated checks
- https://github.com/keleshev/schema
- similar to voluptuous
- validation of generic python data structures
- https://github.com/TMiguelT/PandasSchema
- has a wider range of 'built-in' validator types
- limited type support (only has a conversion/coercion check)
- no decorators
- implementation has less flexibility than pandera's
- has generic 'check'-like validators
- https://github.com/danielvdende/opulent-pandas
- similar to voluptuous, and conceptually similar to pandera, but lacking
functionality
- https://github.com/c-data/pandas-validator
- not maintained, inflexible syntax
- https://github.com/xguse/table_enforcer
- not maintained
- the `Enforcer` and `Column` objects are very similar to pandera, but it's a
little difficult to follow
Key differentiators of pandera:
- column data types, nullability, and uniqueness are first-class concepts.
- `check_input` and `check_output` decorators enable seamless integration with existing code.
- `Check`s provide flexibility and performance by providing access to `pandas` API by design.
- `Hypothesis` class provides a tidy-first interface for statistical hypothesis testing.
- `Check`s and `Hypothesis` objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.
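To make the first differentiator concrete, the following is a toy sketch of what treating data type, nullability, and uniqueness as first-class properties of a column can mean. It uses only the standard library and is not pandera code: `ColumnSpec` and its `errors` method are hypothetical, and the `nullable` and `allow_duplicates` names are simply borrowed from the `__init__` signature quoted in the pylint output later in this thread.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class ColumnSpec:
    """Toy column description (hypothetical; not part of pandera)."""
    dtype: type
    check: Optional[Callable[[List[Any]], bool]] = None
    nullable: bool = False
    allow_duplicates: bool = True

    def errors(self, name: str, values: List[Any]) -> List[str]:
        """Return a list of human-readable problems found in `values`."""
        problems = []
        present = [v for v in values if v is not None]
        if not self.nullable and len(present) != len(values):
            problems.append(f"{name}: null values are not allowed")
        if any(not isinstance(v, self.dtype) for v in present):
            problems.append(f"{name}: expected {self.dtype.__name__}")
        if not self.allow_duplicates and len(set(present)) != len(present):
            problems.append(f"{name}: duplicate values are not allowed")
        # Only run the custom check on data that passed the structural checks.
        if self.check is not None and not problems and not self.check(present):
            problems.append(f"{name}: custom check failed")
        return problems


spec = ColumnSpec(int, check=lambda vals: all(v > 0 for v in vals),
                  allow_duplicates=False)
ok = spec.errors("id", [1, 2, 3])
bad = spec.errors("id", [1, 1, None])
```

Here the structural properties are declared as plain fields, while anything more specific goes through the user-supplied `check` callable.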
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or `@tag` the editor you contacted:
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
- does not violate the Terms of Service of any service it interacts with.
- has an OSI approved license
- contains a README with instructions for installing the development version.
- includes documentation with examples for all functions.
- contains a vignette with examples of its essential functions and uses.
- has a test suite.
- has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
Publication options
- Do you wish to automatically submit to the Journal of Open Source Software? If so:
JOSS Checks
- The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- The package contains a `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.
- The package is deposited in a long-term repository with the DOI:
Note: Do not submit your package separately to JOSS
Are you OK with Reviewers Submitting Issues to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
- Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.
Code of conduct
- I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here
Previous Repo: https://github.com/cosmicBboy/pandera
Thank you @cosmicBboy !! we will get back to you with the editor / review process next steps !!
Editor checks:
- Fit: The package meets criteria for fit and overlap.
- Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
Might add better dev setup instructions for contributing... but i see a dev envt txt
- License: The package has an OSI accepted license
MIT License
- Repository: The repository link resolves correctly
- Archive (JOSS only, may be post-review): The repository DOI resolves correctly
- Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
Editor comments
Reviewers: @mbjoseph @xmnlab
Due date: @mbjoseph we agreed to do reviews one at a time. Given that, is a 2 week deadline (which would be September 6) ok for your schedule? if that is ok then @xmnlab i will ping you once Max's review is in and you can begin your review!! @cosmicBboy has agreed to issues and PR's if you want to create a review using that approach rather than all text in this issue (links to the issue and/or PR may be preferred). Thank you all for your time!!
thanks everyone for participating in this review! Just FYI, the pandera issues page has a couple of tickets that may be of interest for reviewers.
We're planning on a 0.2.0 release in the next week or so.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need clearly stating problems the software is designed to solve and its target audience in README
- Installation instructions: for the development version of package and any non-standard dependencies in README
- Vignette(s) demonstrating major functionality that runs successfully locally
- Function Documentation: for all user-facing functions
- Examples for all user-facing functions
- Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer`.
Readme requirements
The package meets the readme requirements below:
- Package has a README.md file in the root directory.
The README should include, from top to bottom:
- The package name
- Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
- Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
- Installation instructions
- Any additional setup required (authentication tokens, etc)
- Brief demonstration usage
- Direction to more detailed documentation (e.g. your documentation files or website).
- If applicable, how the package compares to other similar packages and/or how it relates to other packages
- Citation information
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software have been confirmed.
- Performance: Any performance claims of the software have been confirmed.
- Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
- Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
Final approval (post-review)
- The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing: 6
Review Comments
Overall, this is a great package with a clear scope, good docs, and good testing infrastructure. Clearly, a lot of effort has been put into its development, and as somebody who works with raw data, something like this would be immediately useful. With this in mind, most of my comments are fairly minor.
Bigger points:
These relate to the top-level boxes for the pyOpenSci review process that I could not check.
- API documentation is in pretty good shape, but there are some things without a description in the API docs (e.g., https://pandera.readthedocs.io/en/stable/API.html#pandera.Check.error_message).
- I am not checking the box for "Examples for all user-facing functions". Taken literally, there are user-facing functions that do not have examples (e.g., `generic_error_message`), though I believe the examples cover the most common use cases. It might be a good idea to prefix some of these methods that users aren't expected to use with an underscore, or if it makes more sense to add examples (e.g., via doctest in the API docs), that could also be worth considering.
Minor notes
These are a smattering of questions I ran into, and notes that might help improve the package.
- Test coverage is pretty high - any particular reason why the remaining lines are not tested?
- There are some deprecation warnings that arise in running the tests: https://travis-ci.org/pandera-dev/pandera/jobs/579197344#L2287
- Citation info is missing from the README, and could be added if you wanted to make it easy for others to cite the package.
- CI testing on OSX and Windows might be nice too.
- "Column Hypothesis test support testing different column so that assertions can be made about the relationships..." -- would "tests" work better? https://pandera.readthedocs.io/en/stable/dataframe_schemas.html
- Backtick usage is somewhat inconsistent in the docs, e.g., `Column` vs. Column
- SeriesSchema docs seem to have an unfinished sentence on Series Validation: https://pandera.readthedocs.io/en/stable/series_schemas.html#series-validation
- `pd.series` should be `pd.Series` or `pandas.Series` (used below): https://pandera.readthedocs.io/en/stable/checks.html#checking-values-within-a-column
- In describing how the function signature of `Check` changes, there may be a typo: "This changes the function signature of the Check function so that its input is a dict where keys are the group names and keys are subsets of the Column series." (https://pandera.readthedocs.io/en/stable/checks.html#column-check-groups). Should this be keys and values instead of keys and keys?
- There are a few places where a significance threshold/alpha value of 0.5 is used in the Hypothesis docs (https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis-testing). Should this be 0.05, which seems like a more commonly used threshold than 0.5?
- Some URLs still point to `cosmicBboy/pandera`: https://github.com/pandera-dev/pandera/search?q=cosmicbboy%2Fpandera&unscoped_q=cosmicbboy%2Fpandera
- Why not conda-forge instead of the cosmicbboy conda channel?
- Installation instructions look great for released versions, but you could also add installation instructions for the dev version (e.g., `pip install -e .`).
- There is inconsistent capitalization of dataframe (also DataFrame): https://pandera.readthedocs.io/en/stable/dataframe_schemas.html
- pylint points out some places where the code could be streamlined a bit (e.g., unnecessary `else` statements, and some cases where `object` is explicitly declared as a parent class), but none of the output is indicative of major problems. Feel free to address or ignore any of these checks:
>>> pylint pandera
************* Module pandera
pandera/__init__.py:1:0: C0111: Missing module docstring (missing-docstring)
************* Module pandera.dtypes
pandera/dtypes.py:6:0: C0111: Missing class docstring (missing-docstring)
pandera/dtypes.py:17:0: C0103: Constant name "Bool" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:18:0: C0103: Constant name "DateTime" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:19:0: C0103: Constant name "Category" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:20:0: C0103: Constant name "Float" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:21:0: C0103: Constant name "Int" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:22:0: C0103: Constant name "Object" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:23:0: C0103: Constant name "String" doesn't conform to UPPER_CASE naming style (invalid-name)
pandera/dtypes.py:24:0: C0103: Constant name "Timedelta" doesn't conform to UPPER_CASE naming style (invalid-name)
************* Module pandera.constants
pandera/constants.py:1:0: C0111: Missing module docstring (missing-docstring)
************* Module pandera.errors
pandera/errors.py:4:0: C0111: Missing class docstring (missing-docstring)
pandera/errors.py:8:0: C0111: Missing class docstring (missing-docstring)
pandera/errors.py:12:0: C0111: Missing class docstring (missing-docstring)
************* Module pandera.schemas
pandera/schemas.py:252:0: C0330: Wrong hanging indentation (add 1 space).
constants.N_FAILURE_CASES).to_dict()))
^| (bad-continuation)
pandera/schemas.py:258:0: C0330: Wrong hanging indentation (add 1 space).
constants.N_FAILURE_CASES).to_dict()))
^| (bad-continuation)
pandera/schemas.py:268:0: C0330: Wrong hanging indentation (add 1 space).
constants.N_FAILURE_CASES).to_dict()))
^| (bad-continuation)
pandera/schemas.py:11:0: R0205: Class 'DataFrameSchema' inherits from object, can be safely removed from bases in python3 (useless-object-inheritance)
pandera/schemas.py:14:4: R0913: Too many arguments (7/5) (too-many-arguments)
pandera/schemas.py:56:4: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/schemas.py:79:25: W0212: Access to a protected member _checks of a client class (protected-access)
pandera/schemas.py:105:28: C1801: Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)
pandera/schemas.py:118:4: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/schemas.py:172:0: R0205: Class 'SeriesSchemaBase' inherits from object, can be safely removed from bases in python3 (useless-object-inheritance)
pandera/schemas.py:175:4: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/schemas.py:246:16: R1720: Unnecessary "else" after "raise" (no-else-raise)
pandera/schemas.py:219:4: R0912: Too many branches (13/12) (too-many-branches)
pandera/schemas.py:172:0: R0903: Too few public methods (1/2) (too-few-public-methods)
pandera/schemas.py:285:0: C0111: Missing class docstring (missing-docstring)
pandera/schemas.py:287:4: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/schemas.py:287:4: W0235: Useless super delegation in method '__init__' (useless-super-delegation)
pandera/schemas.py:285:0: R0903: Too few public methods (1/2) (too-few-public-methods)
pandera/schemas.py:5:0: C0411: standard import "from typing import Optional" should be placed before "import pandas as pd" (wrong-import-order)
************* Module pandera.checks
pandera/checks.py:98:0: C0330: Wrong hanging indentation (remove 4 spaces).
"%s failed element-wise validator %d:\n"
| ^ (bad-continuation)
pandera/checks.py:100:0: C0330: Wrong hanging indentation (remove 4 spaces).
(parent_schema, check_index,
| ^ (bad-continuation)
pandera/checks.py:59:8: C0103: Attribute name "fn" doesn't conform to snake_case naming style (invalid-name)
pandera/checks.py:12:0: C0111: Missing class docstring (missing-docstring)
pandera/checks.py:12:0: R0205: Class 'Check' inherits from object, can be safely removed from bases in python3 (useless-object-inheritance)
pandera/checks.py:14:4: R0913: Too many arguments (7/5) (too-many-arguments)
pandera/checks.py:77:4: C0111: Missing method docstring (missing-docstring)
pandera/checks.py:163:4: R0201: Method could be a function (no-self-use)
pandera/checks.py:194:8: R1705: Unnecessary "elif" after "return" (no-else-return)
pandera/checks.py:212:8: R1705: Unnecessary "else" after "return" (no-else-return)
pandera/checks.py:238:12: R1705: Unnecessary "elif" after "return" (no-else-return)
pandera/checks.py:261:8: R1720: Unnecessary "elif" after "raise" (no-else-raise)
pandera/checks.py:160:8: W0201: Attribute 'failure_cases' defined outside __init__ (attribute-defined-outside-init)
pandera/checks.py:5:0: C0411: standard import "from functools import partial" should be placed before "import pandas as pd" (wrong-import-order)
pandera/checks.py:6:0: C0411: standard import "from typing import Union, Optional, List, Dict" should be placed before "import pandas as pd" (wrong-import-order)
************* Module pandera.decorators
pandera/decorators.py:64:0: C0330: Wrong hanging indentation (remove 4 spaces).
"error in check_input decorator of function '%s': the "
| ^ (bad-continuation)
pandera/decorators.py:68:0: C0330: Wrong hanging indentation (remove 4 spaces).
(fn.__name__,
| ^ (bad-continuation)
pandera/decorators.py:74:0: C0330: Wrong hanging indentation.
)
| | ^ (bad-continuation)
pandera/decorators.py:13:0: C0103: Argument name "fn" doesn't conform to snake_case naming style (invalid-name)
pandera/decorators.py:22:0: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/decorators.py:57:4: C0103: Argument name "fn" doesn't conform to snake_case naming style (invalid-name)
pandera/decorators.py:62:12: C0103: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
pandera/decorators.py:88:12: C0103: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
pandera/decorators.py:57:21: W0613: Unused argument 'instance' (unused-argument)
pandera/decorators.py:100:0: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/decorators.py:135:4: C0103: Argument name "fn" doesn't conform to snake_case naming style (invalid-name)
pandera/decorators.py:153:8: C0103: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
pandera/decorators.py:135:21: W0613: Unused argument 'instance' (unused-argument)
************* Module pandera.schema_components
pandera/schema_components.py:9:0: C0111: Missing class docstring (missing-docstring)
pandera/schema_components.py:11:4: R0913: Too many arguments (7/5) (too-many-arguments)
pandera/schema_components.py:70:4: W0222: Signature differs from overridden '__call__' method (signature-differs)
pandera/schema_components.py:85:0: C0111: Missing class docstring (missing-docstring)
pandera/schema_components.py:87:4: R0913: Too many arguments (6/5) (too-many-arguments)
pandera/schema_components.py:87:4: W0235: Useless super delegation in method '__init__' (useless-super-delegation)
pandera/schema_components.py:101:4: W0222: Signature differs from overridden '__call__' method (signature-differs)
pandera/schema_components.py:110:0: C0111: Missing class docstring (missing-docstring)
pandera/schema_components.py:115:21: W0212: Access to a protected member _name of a client class (protected-access)
pandera/schema_components.py:115:46: W0212: Access to a protected member _name of a client class (protected-access)
pandera/schema_components.py:116:20: W0212: Access to a protected member _pandas_dtype of a client class (protected-access)
pandera/schema_components.py:117:27: W0212: Access to a protected member _checks of a client class (protected-access)
pandera/schema_components.py:118:29: W0212: Access to a protected member _nullable of a client class (protected-access)
pandera/schema_components.py:119:37: W0212: Access to a protected member _allow_duplicates of a client class (protected-access)
pandera/schema_components.py:127:4: W0222: Signature differs from overridden '__call__' method (signature-differs)
************* Module pandera.hypotheses
pandera/hypotheses.py:237:0: C0301: Line too long (103/100) (line-too-long)
pandera/hypotheses.py:30:4: R0913: Too many arguments (8/5) (too-many-arguments)
pandera/hypotheses.py:148:12: R1720: Unnecessary "else" after "raise" (no-else-raise)
pandera/hypotheses.py:168:8: R1705: Unnecessary "else" after "return" (no-else-return)
pandera/hypotheses.py:177:4: R0913: Too many arguments (8/5) (too-many-arguments)
pandera/hypotheses.py:5:0: C0411: standard import "from functools import partial" should be placed before "import pandas as pd" (wrong-import-order)
pandera/hypotheses.py:8:0: C0411: standard import "from typing import Union, Optional, List, Dict" should be placed before "import pandas as pd" (wrong-import-order)
pandera/hypotheses.py:1:0: R0801: Similar lines in 3 files
==pandera.schema_components:86
==pandera.schemas:174
==pandera.schemas:286
def __init__(
self,
pandas_dtype,
checks: callable = None,
nullable: bool = False,
allow_duplicates: bool = True,
name: str = None): (duplicate-code)
pandera/hypotheses.py:1:0: R0801: Similar lines in 3 files
==pandera.schema_components:10
==pandera.schemas:174
==pandera.schemas:286
def __init__(
self,
pandas_dtype,
checks: callable = None,
nullable: bool = False,
allow_duplicates: bool = True, (duplicate-code)
------------------------------------------------------------------
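For readers unfamiliar with these messages, the sketch below shows what three of the more frequent ones flag and how they are usually resolved. The `Before` and `After` classes are made up for illustration and are not taken from the pandera source.

```python
class Before(object):  # R0205: explicit `object` base is redundant in Python 3
    def describe(self, failures):
        if not len(failures):  # C1801: len() used to test for an empty sequence
            return "ok"
        else:  # R1705: `else` is unnecessary after `return`
            return f"{len(failures)} failure(s)"


class After:
    def describe(self, failures):
        # An empty sequence is falsy, so no len() call is needed.
        if not failures:
            return "ok"
        # The early return above makes the `else` branch redundant.
        return f"{len(failures)} failure(s)"


# Both versions behave identically; only the style differs.
same_when_empty = Before().describe([]) == After().describe([])
same_when_failing = Before().describe([1, 2]) == After().describe([1, 2])
```

None of these changes alter behavior, which is consistent with the note above that the output does not point to major problems.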
thank you @mbjoseph for this extremely thorough review. gosh i'm not sure why i didn't see this in my github notifications. my apologies. @xmnlab you can have a look at the review above. Do you want to give the second review a go after seeing what max has pointed out above? If you need any guidance, please say the word!!
awesome @xmnlab please reach out if you have any questions !! we are all here to support. @cosmicBboy just a note that the second reviewer is starting the process. You could have a look at @mbjoseph review if you'd like in the meantime!! thank you all!! :)
thanks @lwasser!
@mbjoseph your review is much appreciated! I've released v0.2.1, where I addressed many of the points that you raised, check out the release notes. @xmnlab FYI I've taken a crack at some of @mbjoseph's comments.
Most notable changes:
- add citation information
- add dev installation instructions
- improve formatting and wording of sphinx documentation (this addresses several of the points you made about formatting and wording in the documentation)
- make SchemaError message formatting functions private (`generic_error_message` and the other methods should have been private all along)
- add docstrings to error classes
Minor points:
Test coverage is pretty high - any particular reason why the remaining lines are not tested?
I haven't really had too much time to prioritize covering the rest, though I'd like to prioritize the biggest holes and cover those.
There are some deprecation warnings that arise in running the tests: https://travis-ci.org/pandera-dev/pandera/jobs/579197344#L2287
Planning to do this as part of unionai-oss/pandera#110
CI testing on OSX and Windows might be nice too.
Made an issue for this: unionai-oss/pandera#109
Why not conda-forge instead of the cosmicbboy conda channel?
Yes, would love to get a conda-forge recipe going: unionai-oss/pandera#90
pylint points out some places where the code could be streamlined a bit (e.g., unnecessary else statements, and some cases where object is explicitly declared as a parent class), but none of the output is indicative of major problems. Feel free to address or ignore any of these checks:
Cool, made an issue to add pylint to CI: unionai-oss/pandera#108
just one question. the version submitted for review is `0.1.5` but it seems pandera has 2 more versions after that.
should I review just `0.1.5`? the same applies to documentation on readthedocs?
@mbjoseph i think that is a reasonable suggestion!! may i assume you reviewed the most recent version as well? if that is the case then the reviews will be consistent. thank you both!!
That's right @lwasser -- my review was for the most recent version at the time, but the package has been updated since (including updates that address my review). So, probably better to work on the most recent version for review 2.
sorry for throwing a wrench in the review process! I probably should have waited on review 2 before updating the package
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need clearly stating problems the software is designed to solve and its target audience in README
- Installation instructions: for the development version of package and any non-standard dependencies in README
- Vignette(s) demonstrating major functionality that runs successfully locally
- Function Documentation: for all user-facing functions
- Examples for all user-facing functions
- Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer`.
Readme requirements
The package meets the readme requirements below:
- Package has a README.md file in the root directory.
The README should include, from top to bottom:
- The package name
- Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
- Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
- Installation instructions
- Any additional setup required (authentication tokens, etc)
- Brief demonstration usage
- Direction to more detailed documentation (e.g. your documentation files or website).
- If applicable, how the package compares to other similar packages and/or how it relates to other packages
- Citation information
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software have been confirmed.
- Performance: Any performance claims of the software have been confirmed.
- Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
- Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
Final approval (post-review)
- The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing: 4:30
Review Comments
The package looks very good: package structure, documentation, tests and CI look in very good shape. Some points reported by @mbjoseph were already fixed or already added as a GitHub issue.
I am adding just 2 more comments. Actually, the 1st is just a comment related to an issue that was already partially fixed (installation for development) but maybe it could be improved.
- Installation instructions: probably the documentation should recommend `python setup.py develop` or `pip install -e .` for the installation in development mode (as @mbjoseph suggested)
- Examples: maybe it should consider the usage of example sections for docstrings. It seems the project is using sphinx style for docstrings. I didn't find official documentation for that but maybe it could help: http://queirozf.com/entries/python-docstrings-reference-examples
awesome. thanks @xmnlab and great job on your first review !!! @cosmicBboy please note the new round of review comments. Ping me when changes have been implemented / you have questions etc!! Thank you all for a really smooth review process!!
thanks @lwasser @xmnlab @mbjoseph!
I've cut a new pandera release 0.2.2 that adds example docstrings to all public-facing classes and methods. The commit also:
- docstring examples should be reflected in the docs
- changes README with updated development installation instructions.
- adds more test coverage in `schema.py`
- fixes unit test pandas FutureDeprecation warnings
Please let me know if you have any questions.
thank you @cosmicBboy !! @mbjoseph @xmnlab will you please have a look at the latest release? let me know if the changes are acceptable given your review! if so, you can check the "the author has responded to my review" box at the bottom of your review submission. If you see anything that wasn't addressed to your satisfaction please let me know!!
thank you all for such a smooth review process!
@cosmicBboy thanks for addressing my suggestions - v0.2.2 looks good to me!
@xmnlab can you kindly have a look at the above and if you are happy with the edits, check the box in your review that states that the author has addressed everything to your satisfaction.
@lwasser sure thing. I will work on that today.
@cosmicBboy thanks for working on this new release. I will do the review today in a few hours. thanks!
@cosmicBboy good job with the examples in the documentation!
I have checked my "Final approval" checkbox.
Just some minor observations:
- There are some public methods without type hints and/or a docstring entry for `return`. If you could open an issue for that and address it, that would be great.
- There are also some private methods without any docstring or type hints.
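Gaps like these could be located with a small inspection script. The sketch below is one possible approach using the standard `inspect` module. `find_undocumented` and the `Example` class are hypothetical helpers written for illustration and are not part of pandera.

```python
import inspect
from typing import List


def find_undocumented(cls: type) -> List[str]:
    """List public methods of ``cls`` lacking a docstring or return hint.

    :param cls: class to inspect.
    :returns: sorted names of public methods with a missing docstring
        or a missing return type annotation.
    """
    missing = []
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        has_doc = bool(func.__doc__)
        has_return_hint = "return" in getattr(func, "__annotations__", {})
        if not (has_doc and has_return_hint):
            missing.append(name)
    return sorted(missing)


class Example:
    """Hypothetical class used only to exercise the helper."""

    def documented(self, x: int) -> int:
        """Return ``x`` doubled.

        :param x: value to double.
        :returns: twice the input.
        """
        return 2 * x

    def undocumented(self, x):
        return x


print(find_undocumented(Example))  # → ['undocumented']
```

The check is deliberately coarse: it only looks for the presence of a docstring and a return annotation, not for a `:returns:` field inside the docstring.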
thanks for working on the suggestions we made!
this is great. @cosmicBboy @xmnlab are these small things that could be done quickly? it would be nice to have that done and wrapped up before the final approval. pyopensci does value documentation highly!
awesome @cosmicBboy !! I believe this is done!! congratulations as you are the second package to successfully go through our review process!
Next steps.
- I am redoing the website package list so i'll go ahead and add you to that and will provide the link when it's done!!
- I will also add you as a contributor on our contributors page as i am working on that now.
- we'd love for you to add a pyopensci badge to your readme file. Can you please do that?
[![pyOpenSci](https://tinyurl.com/y22nb8up)](https://github.com/pyOpenSci/software-review/issues/12)
Congratulations all for pushing another package through the pyopensci review process.
@cosmicBboy because we are just building out our packages page, I may ask you for a bit more info on Pandera in the future. maybe a few sentence description. it won't be a big ask!!
@lwasser badge added! Yes please let me know when you need more info on the package.
Out of curiosity, what are the requirements for submitting this package to JOSS?
hi @cosmicBboy please check out this page
https://www.pyopensci.org/dev_guide/appendices/templates.html
in our dev guide and scroll down to the section on joss!! (it's actually also in the review template) ... if you are interested in pursuing joss and meet their requirements, please do let us know. it would be the first submission that also went through JOSS!! but we do have an established partnership with them.
thank you so much for also adding a badge!! i will close this submission now UNLESS you decide you'd like to submit to joss as well!! if so we'd need you to add the write up that they require and then we'd push things up to their review process!!
please let me know what you'd like to do!!
given this has been APPROVED, i will close this issue. If there is any reason to reopen it, please say the word!!!
reopening to keep tabs on JOSS submission!
I tried to locate the pandera paper on JOSS, without success. Am I missing anything?
hey there @astrojuanlu i believe that @cosmicBboy hasn't yet submitted to JOSS. I briefly chatted over twitter i think or maybe at scipy and it wasn't submitted yet. it may not be under review yet. @cosmicBboy can you confirm? i can also remove that tag if you don't plan on submitting there but it sounded like you were interested in doing that at some point. the submission process is fast with JOSS once it goes through our review.
Hi @lwasser @astrojuanlu yes I do intend on submitting a paper to JOSS, I'm still working on a draft and plan on submitting within the next 2-3 weeks.
hey there @cosmicBboy did this ever go through JOSS? i just didn't see the issue referenced here. I am going to close this for the time being but if it does go into JOSS please reference this issue and we can update it accordingly! thank you!
thanks @lwasser will do! Just got swamped with other things, but am committed to submitting through JOSS in the new year
hey 👋 @cosmicBboy @mbjoseph @xmnlab ! I hope that you are all well. I am reaching out here to all reviewers and maintainers about pyOpenSci now that i am working full time on the project (read more here). We have a survey that we'd like for you to fill out so we can:
- invite you to our slack channel to participate in our community (if you wish to join - no worries if that is not how you prefer to communicate / participate).
- Collect information from you about how we can improve our review process and also better serve maintainers.
The survey should take about 10 minutes to complete depending upon how much you decide to write. This information will help us greatly as we make decisions about how pyOpenSci grows and serves the community. Thank you so much in advance for filling it out.
NOTE: this is different from the form designed for reviewers to sign up to review.
If there are other maintainers for this project, please ping them here and ask them to fill out the survey as well. It is important that we ensure packages are supported long term or sunsetted with sufficient communication to users. Thus we will check in with maintainers annually about maintenance.
Thank you in advance for doing this and supporting pyOpenSci.
hey there @cosmicBboy @mbjoseph 👋 Just a friendly reminder to take 5-10 minutes to fill out our survey . We really appreciate it. Thank you in advance for helping us by filling out the survey!! 🙌 Niels, it's really important for us to collect information from our maintainers so that we can both stay in touch with you regarding package maintenance and also support you through time. We really appreciate your time in filling this out. Also are you the sole maintainer of this package? if not, please have your co-maintainers also fill it out and please list them here as well. Many thanks in advance!
✨ Ivan you only need to do this once :) ping me on slack with any questions!! 🙌
hi again @cosmicBboy and @mbjoseph i'd be super appreciative if you could fill out our survey
I know you are busy and Niels I know you have super exciting job transition life happening now. But i'd appreciate your time. We'd like to check in with maintainers once a year to ensure all is well with package maintenance. Also your input on the survey helps us improve and show funders we are doing good things! Many thanks for your time!
just filled it out!
You rock!! thanks Niels!
Hi @cosmicBboy we are updating our metadata to be consistent.
When you have a second, can you please confirm for me that at the time of this review you were the only core maintainer? I have added that in the "all current maintainers" field above (as in #109)
Hi @NickleDave sorry for the late response 😅
can you please confirm for me that at the time of this review you were the only core maintainer?
Yes, confirmed