Governance
dhirschfeld opened this issue · 20 comments
cc: @cpcloud, @mrocklin, @llllllllll, @kwmsmith, @cowlicks, @quasiben, @teoliphant
Sorry for the spam, but as the top contributors to this project you may want to weigh in on the future of odo.
I think it's a great project and have been happily using it for a couple of years. I'd like to continue to use it; however, for quite some time now it hasn't had an active maintainer. Worse, no one with commit access seems interested in merging bug-fix PRs from the community, let alone actively fixing issues raised.
My particular bugbear is PR #596, which addresses an issue that makes odo unusable for a number of our databases. I've been unable to convince anyone with commit access to review or merge the PR, and I have had no response to my request for commit access for myself.
At this point, given the lack of engagement with the community, odo is effectively a dead project. Is this likely to change at all in the near future? Does anyone want to give me commit rights so I can at the very least fix the bugs affecting my usage? If not, my plan is to fork the project, but before doing that I wanted to solicit feedback from those who might have an interest in the project's future...
I agree that this project is in an odd state administratively, and I apologize for my personal lack of response on recent PRs. Personally speaking, I would be happy to see commit access extended to others who are more active. As someone who has been inactive for a long period, I don't feel qualified to determine who those people are.
I would trust developers like @llllllllll and @necaris, who have been active more recently, to determine which contributors have been active recently and should be given commit privileges. I would personally trust their judgment and encourage them to be liberal in granting permissions.
I also agree that this project has sort of fallen into disrepair. A few weeks ago my coworker @ehebert also expressed this concern. After that conversation we reached out to Anaconda to get admin access to the blaze organization, which allows me to add new committers. I know that you have been working with and on odo for a long time, so I would be happy to give you commit access.
One issue I see with odo in general is that, by design, we consume a huge set of packages. These packages are all constantly changing and breaking odo. In expectation, no single odo developer uses or cares about the vast majority of edges in odo. I know that this has contributed to my lack of enthusiasm for odo, because it is hard to invest a lot of time maintaining code you never intend to use while trying not to break someone's production systems. I think odo is a project that would be best served by a large number of contributors who each need to think about just a small set of edges. Unfortunately, I don't know how to get there from our current state.

I agree with Matt, especially given how inactive I have been, that we should give out more access to odo; however, I would like to stress that due to the tangled nature of the odo graph, small changes can break seemingly unrelated code. The test suite is not totally comprehensive (or functional all the time), so we still need to be careful about merging code, even when Travis passes.
One step towards faster acceptance and development of patches would be getting continuous integration back to passing, and making passing tests a requirement for merging.

The biggest blocker to getting the tests to pass currently is the MySQL tests. I'm not sure what the current demand for MySQL is; we may be able to drop support for MySQL, and other backends, at least until there is someone who needs it and is willing to take ownership. Off the top of my head, for relational databases, most recent activity has been on Postgres, from Quantopian (myself and @llllllllll), and MSSQL, from @dhirschfeld. I'm not sure if anyone is actively using MySQL; it may be that it just works for those who use it, despite the failing tests.
Could we assign owners/maintainers for backends (and put that on the project wiki or in the docs) and then add skips for backends which are not maintained? Or at least add skips to unowned backends if/when their tests fail?
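A minimal sketch of what that could look like with pytest, assuming a hypothetical ownership list (MAINTAINED_BACKENDS) kept alongside the docs; the backend and driver names are illustrative, not odo's actual test layout:

```python
# Hypothetical sketch: skip a backend's test module when nobody has claimed
# ownership of it. MAINTAINED_BACKENDS and the backend names are made up for
# illustration; odo's real test layout may differ.
import pytest

# Ownership list that would mirror the wiki/docs table of maintainers.
MAINTAINED_BACKENDS = {"postgres", "mssql", "csv"}

# Skip the whole module if the driver isn't even installed locally.
pymysql = pytest.importorskip("pymysql")

# Skip the whole module while the backend has no maintainer.
pytestmark = pytest.mark.skipif(
    "mysql" not in MAINTAINED_BACKENDS,
    reason="MySQL backend currently has no maintainer; see the governance issue",
)
```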
@llllllllll:
| The test suite is not totally comprehensive (or functional all the time) so we still need to be careful about merging code, even when travis passes.
I'm not sure what to do here, either. One initial step we could take is to run coverage on our internal projects' usage of odo and make sure that odo's unit test suite covers the same lines. However, line coverage does not guarantee that similar input states are covered.
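A rough sketch of that comparison, assuming coverage.py 5+'s data API and two data files with placeholder names: one produced by odo's own test suite, one produced by running an internal project's tests with coverage limited to odo:

```python
# Sketch only: compare which odo lines our internal usage exercises against
# which lines odo's own test suite covers. File names are placeholders and
# the coverage.py data API (CoverageData) is assumed to be version 5+.
from coverage import CoverageData

def covered_lines(data_file):
    data = CoverageData(basename=data_file)
    data.read()
    return {f: set(data.lines(f) or ()) for f in data.measured_files()}

odo_suite = covered_lines(".coverage.odo-tests")
internal = covered_lines(".coverage.internal-project")

# Lines our internal usage hits that odo's own tests never touch.
for filename, lines in internal.items():
    gap = lines - odo_suite.get(filename, set())
    if gap:
        print(filename, sorted(gap))
```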
Another tack could be that when our internal projects do break because of updates that are made to master, we go back into odo and add a test case to reproduce the usage that broke. However, that may run into the problem where the changes that were made are necessary for someone else's use case.
Would it be useful to look at Linux maintenance practices for "maintaining a stable user space, while supporting a wide array of backends"?
With a good test suite it's (theoretically) easy - developers are pretty much free to do what they want so long as the tests pass. In my experience though that's never really true - there will always be some aspect which isn't covered by the tests which may be broken by new changes. The only way I've found to deal with this is to insist that any patch for a breakage which wasn't picked up by the test suite is accompanied by a new test. In this way, slowly over time the test suite becomes more comprehensive and therefore such breakage hopefully becomes increasingly unlikely.
Even though the current test suite may not be comprehensive, following such a policy should get us out of the woods; however, it will require that new changes aren't just vetted by the CI but also by the internal test suites of those with skin in the game.
I'd suggest, after getting the current CI passing and in good shape, that PRs which pass CI are allowed to be merged; however, if a change causes breakage for others, the maintainers reserve the right to revert it (if those affected supply a test for the breakage?).
IIUC, what I'm proposing above is essentially the same as what @ehebert suggested.
| However, that may run into the problem where the changes that were made are necessary for someone else's use case.
In this case perhaps the changes can be made in such a way as not to break others' usage. If it's really not possible to cater for both use cases, then I'm optimistic that the pros & cons of adding the new functionality vs keeping the old functionality can be cordially discussed in an issue and a resolution found! Such a resolution would include a deprecation period for the old functionality before it was removed or changed (if that was decided as the way forward).
I don't think odo has a huge userbase these days - I'd imagine most users are early adopters who still have some dependence on it. Given the state of it (no release in over a year), I doubt there are many new users. That being the case, I think we can get by with such an informal governance structure, but we can certainly revisit that if anything changes or we reach an impasse.
I'm using odo against our SQL Server and Oracle databases, some of which use (IMHO idiotic) non-standard data types. I'm happy to set up tests against our databases, report back any breakage, and provide tests for such breakage. PR #596 is in fact a good example of this and provides a simple test/repro for the issue.
FWIW I'm in favour of removing cruft (stuff I don't use) - if people complain it can always be added back, perhaps on the condition of providing maintainership for that backend.
Maybe odo could be pared back to just the core backends and a mechanism provided for people to add their own backends, maintained as separate packages? The core backends could also use the same mechanism but would be shipped by default with odo.
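For what it's worth, here is a sketch of how a separately packaged backend might plug in. The convert.register/resource.register calls below mirror how in-tree backends already register themselves; the entry-point discovery idea and all the names (MyStore, mystore://, odo-mybackend) are hypothetical:

```python
# my_backend/core.py -- shipped as its own package (e.g. "odo-mybackend").
# The registration decorators follow odo's existing backend API; everything
# named "mystore"/"MyStore" is a made-up example.
import pandas as pd
from odo import convert, resource

class MyStore:
    """Hypothetical handle to some external data store."""
    def __init__(self, uri):
        self.uri = uri

@resource.register(r'mystore://.+')
def resource_mystore(uri, **kwargs):
    return MyStore(uri)

@convert.register(pd.DataFrame, MyStore, cost=5.0)
def mystore_to_dataframe(store, **kwargs):
    # Placeholder: real code would fetch rows from the store here.
    return pd.DataFrame()

# An odo-core could then discover such packages at import time, e.g. via a
# hypothetical "odo.backends" entry-point group that each backend package
# declares, so installing the package adds its edges to the graph.
```

The core backends shipped with odo could register through the same path, which would keep the in-tree and out-of-tree cases symmetric.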
I've been having thoughts about a comprehensive testing scheme. It's possible to enumerate combinations of source containers (and their data types) and destination containers. Each backend needs to declare/enumerate all its datatypes and be able to create them (without odo involved). From there you can see if you can successfully convert to the destination. (This test functionality would also include benchmarking.)
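A minimal sketch of how that enumeration could drive pytest, with a made-up registry of backends and the datatypes they declare; the test body is a placeholder since the per-backend fixtures don't exist yet:

```python
# Sketch: turn the (source backend, destination backend, datatype) matrix
# into a pytest parameter grid. BACKENDS is a made-up registry; in practice
# each backend would declare its datatypes and how to create sample data.
import itertools
import pytest

BACKENDS = {
    "csv": {"int64", "float64", "string"},
    "postgres": {"int64", "float64", "string", "datetime"},
    "hdf5": {"int64", "float64"},
}

CASES = [
    (src, dst, dtype)
    for src, dst in itertools.permutations(BACKENDS, 2)
    for dtype in BACKENDS[src] & BACKENDS[dst]
]

@pytest.mark.parametrize("src,dst,dtype", CASES)
def test_conversion_matrix(src, dst, dtype):
    # A real suite would build sample data of `dtype` in `src` (without odo),
    # run odo(source_uri, dest_uri), read it back, compare, and time it.
    pytest.skip("illustrative only: backend fixtures are not defined here")
```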
It may make sense to split the project: odo-core + various (sub)projects for backends. That way we can also have different administrators for the backend projects.
| It may make sense to split the project: odo-core + various (sub)projects for backends. That way we can also have different administrators for the backend projects.
That's a great idea! But this places a dependency on 'core', and core has to be maintained as well. Did you consider the numpy and pandas backends as core as well? They don't need to be, but then I don't know how you'd get (core + backendA) + (core + backendB) to work together without some common, connecting 'core' data container.
With the way things are going, this common container might be Arrow instead of a pandas DataFrame.
I guess another way of expressing what I'm trying to say is that odo is useless without the backend network. Will we have 'distributions' of odo?
@majidaldo, I would see odo-core limited to the Python native data structures (yes, it's very limited). odo-core is essentially the construction of the graph and the dispatcher functions. And yes, in the current core package, convert includes numpy ndarray and pandas DataFrame, but I think they should be in their own backends. I agree with you that Arrow should be at the same level as ndarray or pd.DataFrame. That's why, instead of including Arrow in the core, I would simply push all three into separate backends.
It is then up to the backend projects to build the graph they want. For some, going through ndarray will take priority over going through Arrow, for example.
It is true, however, that the odo-core I described is pretty useless without any backend (and at least 2 to make it worthwhile). But even if ndarray or pd.DataFrame (or Arrow data structures) were to be in the core, it's doubtful that the core would be useful as is (moving data from ndarray to pd.DataFrame is not that challenging).
Distributions (i.e. a bundle of odo-core + backends) would indeed be nice. Still, I should be able to install the backends individually.
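To make the "graph plus dispatcher" core concrete, here is a rough sketch (not odo's actual code) of what such a core reduces to: a weighted directed graph of types whose edges carry conversion functions, with convert() walking the cheapest path. Backend packages would only ever add nodes and edges:

```python
# Toy sketch of an "odo-core": a conversion graph plus a path-walking
# convert(). This is illustrative, not odo's real implementation.
import networkx as nx

graph = nx.DiGraph()

def register(source_type, target_type, cost=1.0):
    """Decorator a backend uses to add one conversion edge to the graph."""
    def decorator(func):
        graph.add_edge(source_type, target_type, cost=cost, func=func)
        return func
    return decorator

def convert(target_type, value):
    """Walk the cheapest path of registered conversions to target_type."""
    path = nx.shortest_path(graph, type(value), target_type, weight="cost")
    for src, dst in zip(path, path[1:]):
        value = graph.edges[src, dst]["func"](value)
    return value

# A backend registers its edges; here, trivial list <-> tuple conversions.
@register(list, tuple)
def list_to_tuple(x):
    return tuple(x)

@register(tuple, list)
def tuple_to_list(x):
    return list(x)

print(convert(tuple, [1, 2, 3]))  # (1, 2, 3)
```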
It might also be nice to just do a bug fix release with the changes since 0.5.1 (0.5.1...master)
I just ran into a bug that was fixed on master in 2016.
@llllllllll any update on the blaze org and new commit permissions?
I thought I had given out commit access to @dhirschfeld, but I don't see that now. Maybe he didn't accept it? Basically, no one has come along and actually demonstrated a willingness to take over, so the status quo remains. I am no longer being paid to develop blaze and odo, so I don't have much time to commit to the project.
I don't think I ever got an invite? Or if I did I must've missed it (which is entirely possible!)
Even if I do get commit access, I'll help out where I can, but my time available to dedicate to blaze/odo is pretty limited too :(
Edit: Sorry, I did miss the invite - have accepted now...
FWIW I'm not being paid to work on this any longer either -- happy to help where I can but my time is also pretty limited.
I'm taking on the responsibility under Quansight of transitioning Blaze and harvesting from it the pieces that are still useful.
The general plan is to ensure the useful projects continue. Projects under the Blaze organization that will continue include datashape and odo, as well as (perhaps) the database adaptors and Text Adaptors.
However, the current plan is to move datashape to the Plures (XND) organization and re-base odo to use the ContinuumIO/intake project. Odo, for now, will live under the Blaze org until a better home is found for it (perhaps the PyData organization).
If there are perspectives or opinions, please speak up on this thread or on this one: blaze/blaze#1669
Quansight does not have specific funding for this at the moment, but is generally working to get funding to sustain open source, and we will work on this in our spare time to help the community. If anyone is also eager to help, or has feedback about our plans, please continue to make it known.
Not sure if this is relevant, but I'm working on a fork of odo at @kineticadb, building out support for a variety of data types (e.g. ESRI Shapefiles, Apache ORC, Apache Parquet, etc.), and we'd ideally have our fork open-sourced by the end of the year. I find odo useful and would like to stay in touch with the Blaze community wherever it goes!
@teoliphant @othermaintainers: wanted to check if odo has been housed under a new project. It was a great tool which provided a lot of flexibility to users in a single line of code. If you are aware of any other such data migration tool in Python, could you please help (since it looks like odo is a dead library now)?
intake, judging from the name and its docs, appears to solve the problem of getting data from various sources into a notebook (dataframes and others), right? odo was interesting to me personally, because it also worked the other way.
For the record, I've been researching Python alternatives to odo (or at least, easy db-to-db transfer capabilities) and found none. As written in intake/intake#91, many-to-many write capabilities are out of scope for intake.
I guess that the best alternative these days for what I'm looking for is to use something like intake for reading + pd.DataFrame.to_sql, although that won't cover all use cases that odo served or attain the simplicity of odo(source, sink).
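Roughly what that workaround looks like in practice; the catalog file, entry name, table name, and connection string below are all placeholders:

```python
# Sketch of the "intake for reading + to_sql for writing" workaround.
# "catalog.yml", "mydb_table", the connection string, and "target_table"
# are placeholders for whatever your environment actually uses.
import intake
from sqlalchemy import create_engine

catalog = intake.open_catalog("catalog.yml")
df = catalog["mydb_table"].read()          # load the source into a DataFrame

engine = create_engine("postgresql://user:password@host/dbname")
df.to_sql("target_table", engine, if_exists="append", index=False)
```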
| For the record, I've been researching Python alternatives to odo (or at least, easy db-to-db transfer capabilities) and found none. As written in intake/intake#91, many-to-many write capabilities are out of scope for intake.
| I guess that the best alternative these days for what I'm looking for is to use something like intake for reading + pd.DataFrame.to_sql, although that won't cover all use cases that odo served or attain the simplicity of odo(source, sink).
heh, even pd.DataFrame.to_sql can go wrong. If you think about the conversion chain df > SQLAlchemy > ODBC > db, plenty can go wrong there! I already know that geospatial datatypes trip this chain up.
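For the geospatial case specifically, one illustrative workaround (names and connection string are placeholders) is to serialise the geometry to WKT text yourself and pin the SQL type via to_sql's dtype argument, so nothing in the df > SQLAlchemy > ODBC > db chain has to guess:

```python
# Illustrative workaround for geometry columns: write them as WKT text and
# tell to_sql exactly which SQL type to use. Table/column names and the
# connection string are placeholders.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("mssql+pyodbc://user:password@my_dsn")

df = pd.DataFrame({
    "id": [1, 2],
    "geom": ["POINT (30 10)", "POINT (40 20)"],   # geometry already serialised to WKT
})

df.to_sql(
    "shapes",
    engine,
    if_exists="append",
    index=False,
    dtype={"geom": sa.types.Text()},   # pin the column type so nothing guesses
)
```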