dmwm/WMCore

Difference in DBS python vs dbs2go result output

vkuznet opened this issue · 27 comments

Impact of the new feature
To migrate from the DBS Python implementation to the Go one, we need to ensure that all DMWM tools can properly parse the JSON results of the dbs2go server. Due to a change in the architecture of the DBS implementation, some DBS APIs changed the output format of specific DBS attributes. This ticket provides all the details.

Is your feature request related to a problem? Please describe.
It is related to the migration from the Python to the Go implementation of the DBS server.

Describe the solution you'd like
I ask the WMCore team to verify that their clients properly parse the API outputs outlined below.

Describe alternatives you've considered
None

Additional context
This change affects only the specific DBS APIs that perform additional data aggregation on the server side; dbs2go does not do that.

To reduce the CPU/RAM footprint on the server side, the dbs2go implementation does not aggregate the output results coming from ORACLE (i.e. it streams them from ORACLE directly). As a result, a few DBS reader APIs have a slightly different output format: instead of aggregated lists they provide individual records (a sketch of client-side re-aggregation follows the examples below).

Below you can find the differences in DBS output for specific API calls:

  • DBS runs API:
# python server
"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/runs?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"
[{"run_num": [99, 97, 98]}]

# Go server
"https://cmsweb-testbed.cern.ch/dbs2go/runs?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"
[
{"run_num":99}
,{"run_num":97}
,{"run_num":98}
]
  • DBS filelumis API:
# python server
"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/filelumis?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"                                                   [
{"lumi_section_num": [26422], "logical_file_name": "/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root", "run_num": 98}
,
{"lumi_section_num": [27414], "logical_file_name": "/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root", "run_num": 97}
,
{"lumi_section_num": [29838], "logical_file_name": "/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root", "run_num": 99}

# Go server
"https://cmsweb-testbed.cern.ch/dbs2go/filelumis?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"
[
{"event_count":null,"logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root","lumi_section_num":29838,"run_num":99}
,{"event_count":null,"logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root","lumi_section_num":27414,"run_num":97}
,{"event_count":null,"logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root","lumi_section_num":26422,"run_num":98}
]
  • DBS fileparents API
# python server
"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/fileparents?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"
[{"parent_logical_file_name": ["/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/p52787/0.root"], "logical_file_name": "/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"}

# Go server
"https://cmsweb-testbed.cern.ch/dbs2go/fileparents?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"
[
{"logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root","parent_file_id":653168717,"parent_logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/p52787/0.root"}
]
  • DBS filechildren API
# python server
"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/filechildren?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/p52787/0.root"
[{"child_logical_file_name": ["/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root"], "logical_file_name": "/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/p52787/0.root"}]

# Go server
"https://cmsweb-testbed.cern.ch/dbs2go/filechildren?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/p52787/0.root"
[
{"child_file_id":653169117,"child_logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/52787/0.root","logical_file_name":"/store/mc/Fall08/BBJets250to500-madgraph/GEN-SIM-RAW/IDEAL_/p52787/0.root"}
]
  • DBS releaseversions API
# python server
"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/releaseversions"
[{"release_version": ["2", "CMSSW", "CMSSWError", "CMSSW_10_0_0", "CMSSW_10_0_0_MWGR1", "CMSSW_10_0_0_pre1", "CMSSW_10_0_0_pre2", "CMSSW_10_0_0_pre3", "CMSSW_10_0_0_pre3_G4VecGeom2", ...}]

# Go server
"https://cmsweb-testbed.cern.ch/dbs2go/releaseversions"
[
{"release_version":"2"}
,{"release_version":"CMSSW"}
,{"release_version":"CMSSWError"}
,{"release_version":"CMSSW_10_0_0"}
,{"release_version":"CMSSW_10_0_0_MWGR1"}
...
]
  • DBS datasetaccesstypes API
# python server
"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/datasetaccesstypes"
[{"dataset_access_type": ["DELETED", "DEPRECATED", "INVALID", "PRODUCTION", "VALID"]}]

# Go server
"https://cmsweb-testbed.cern.ch/dbs2go/datasetaccesstypes"
[
{"dataset_access_type":"DELETED"}
,{"dataset_access_type":"DEPRECATED"}
,{"dataset_access_type":"INVALID"}
,{"dataset_access_type":"PRODUCTION"}
,{"dataset_access_type":"VALID"}
]
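
For clients that still need the old aggregated shape, it can be reconstructed on the client side. A minimal sketch in Python (a hypothetical helper, not part of any client), using the runs API output above:

# Hypothetical helper (not part of any client): collapse the streamed
# dbs2go "runs" records back into the aggregated python-server shape.
def aggregate_runs(records):
    return [{"run_num": [rec["run_num"] for rec in records]}]

print(aggregate_runs([{"run_num": 99}, {"run_num": 97}, {"run_num": 98}]))
# prints [{'run_num': [99, 97, 98]}]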

Since there will be quite a few changes within this GH issue, we might want to take the opportunity and also work on:
#7279

to avoid having 3 different schemas in our DBSReader/DBSWriter wrapper...

Usually we do not make HTTP calls to the DBS server, but use the DBS Python API. Can't the format differences be hidden inside the client?

WMCore is the same, Stefano. All communication with the DBS server goes through the DBS3Reader/DBS3Writer wrapper (which uses dbs3-client). However, over time and through the migration from DBS2 to DBS3, we accumulated "temporary" code that was never refactored/removed, like this:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/DBS/DBS3Reader.py#L32

Maybe clean up DBS3Reader and/or review the need for it then? I often found myself in trouble using the DBSReader wrapper around the DBS client API, since it does not expose the full functionality of the underlying client.

Yes, this is what we have been considering for some time.
It's also expected that the wrapper does not expose all the underlying client functionality; it's meant to be a (custom) wrapper serving our needs and making these APIs easier to use within the system.

@vkuznet Valentin, as much as I like this response data streaming, this is a very significant change to our applications, and I think the dbs2go implementation should make applications' lives easier by staying backward compatible.
How about supporting both plain JSON (as it's done right now by the Python DBS3) and also JSON streaming, as you suggest with this issue? This functionality should be provided together with the Accept (or Accept-Type) request HTTP header and the Content-Type response HTTP header.

To do the right thing on the WMCore side, we would have to consume the response as it gets streamed from the server, but this would definitely require substantial changes to the code. That is in contrast to the quick and dirty workaround of parsing the data streamed from the server and loading all of it into memory before passing the data structure on to be consumed somewhere else.
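
For illustration, a minimal sketch of such streamed consumption, assuming the dbs2go body is a plain JSON array as shown above; ijson is a third-party incremental JSON parser, and the processing here is just a print:

import ijson
import requests

url = ("https://cmsweb-testbed.cern.ch/dbs2go/runs"
       "?logical_file_name=/store/mc/Fall08/BBJets250to500-madgraph"
       "/GEN-SIM-RAW/IDEAL_/52787/0.root")

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    # ijson yields each array element as it arrives on the wire, so the
    # full result set is never held in memory at once.
    for record in ijson.items(resp.raw, "item"):
        print(record)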

+1
All in all, unless one completely rewrites a lot of code to do things more dynamically, WMCore and/or CRAB would end up adding a wrapper around DBS3Reader to do exactly that, and then process things as before :-)

OTOH, it is possible that memory consumption by the current code gets so bad that, at least in some particular cases, we want/need to do it. So once this streaming mode is available, it will be good to have the option to use it.

IIUC the only question is whether the merging into a single big memory structure should be done in DBSClient or in DBS3Reader. Personally I'd prefer the former, so as to be able to move from DBS3Reader to plain DBSClient in my code and still be able to leave higher-level code unchanged until a real need arises.

> IIUC the only question is whether the merging into a single big memory structure should be done in DBSClient or in DBS3Reader. Personally I'd prefer the former, so as to be able to move from DBS3Reader to plain DBSClient in my code and still be able to leave higher-level code unchanged until a real need arises.

I'm in favor of:

  • if the client provides Accept: application/json (which I believe is the case right now), then nothing should change. Applications that work today should keep working with the Go implementation without a single bit of change.
  • however, if the client provides Accept: application/stream+json, then the client will be expecting JSON streaming in the server response, and in this case our application needs to be prepared to parse that output AND to properly consume it as the data flows (see the sketch below).

My understanding is that, right now, we are not planning any changes to the dbs3-client itself, right? Of course, if we find ourselves implementing the same thing in CRAB/WMCore/etc., then we should consider a single implementation in the client.
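
To make the two bullets above concrete, here is how a client could select either behavior, assuming the server implemented both Accept values (the stream+json mode is hypothetical at this point):

import requests

url = "https://cmsweb-testbed.cern.ch/dbs2go/datasetaccesstypes"

# Current behavior: one aggregated JSON document, parsed in full.
resp = requests.get(url, headers={"Accept": "application/json"})
data = resp.json()

# Opt-in behavior: the server streams records and the client consumes
# them incrementally (e.g. with the ijson sketch earlier in this thread).
resp = requests.get(url, headers={"Accept": "application/stream+json"},
                    stream=True)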

But I do not think that the current DBSClient has an argument to tell it to pass one header or another. At some point it might be useful, but I agree that the default DBSClient behavior should not change; streaming should be a new feature to be enabled optionally. I do not use direct HTTP calls to DBS; if you do, I think your proposal is fine. But AFAIK direct HTTP has been discouraged since ever, so hopefully it is very, very limited.

The DBS Python code does perform some aggregation for certain queries. Since we may have timeouts due to long-running DBS processes, I decided to avoid doing these steps in the Go implementation. The Go code takes results from ORACLE and puts them on the wire. This avoids several issues: extra CPU cycles and extra RAM usage to hold intermediate results, and it reduces query latency on the client side. I truly believe this is the correct approach on the server in the long run, to avoid FE timeouts for long-running queries. The server should provide results as fast as possible, while the client can decide what to do with those results and can use its own resources to do the aggregation as desired. Therefore, I think the proper solution would be to adjust the DBS Python client to perform these steps. If you want, we may add a separate header to select whether the data is returned streamed or not.

I'm not sure what @belforte refers to by "DBSClient or DBS3Reader"; to me, the aggregation should be done in the client and not on the server. I would appreciate it if you could further clarify what "DBS3Reader" means here.

Therefore, my suggestion is the following:

  • keep ORACLE streaming, since it removes complexity from the server code, makes the server fast, and will allow us to avoid memory issues over there, avoid FE timeouts, etc.
  • add aggregation to the DBS client code, since all production tools use it (sketched below)
  • we can add support for different HTTP headers (if you would like to do that)
  • if any other HTTP clients are in use, they will get non-aggregated results and should deal with them properly

This approach will provide backward compatibility and avoid undesired bottlenecks on the server side.
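
A sketch of what such client-side aggregation could look like for the filelumis case (hypothetical code, not the actual DBS client implementation):

from collections import defaultdict

# Hypothetical client-side aggregation: group streamed per-lumi records
# into one record per (file, run), as the python server used to return.
def aggregate_filelumis(records):
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["logical_file_name"], rec["run_num"])].append(
            rec["lumi_section_num"])
    return [{"logical_file_name": lfn, "run_num": run,
             "lumi_section_num": lumis}
            for (lfn, run), lumis in grouped.items()]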

@vkuznet I referred to this as "DBS3Reader": https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/DBS/DBS3Reader.py
i.e. the wrapper around DBS Python Client API which is used in WMCore and CRAB (actually by further wrapping it with https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/DBS/DBSReader.py yeah.. lots of history here).

that said, your suggestion looks fine to me.

@belforte, thanks for the clarification. It always puzzles me why we develop wrappers (DBS3Reader) around wrappers (DBS client) to perform a simple (HTTP) action which can be done through a standard library, and then complain that we have timeouts, slow queries, huge builds, etc. Anyway, it only confirms that we should strive to simplify and streamline data access and reduce complexity.

yeah.. but the people who had that idea are now gone, we'll never know.

> Therefore, my suggestion is the following:
> keep ORACLE streaming, since it removes complexity from the server code, makes the server fast, and will allow us to avoid memory issues over there, avoid FE timeouts, etc.
> add aggregation to the DBS client code, since all production tools use it
> we can add support for different HTTP headers (if you would like to do that)
> if any other HTTP clients are in use, they will get non-aggregated results and should deal with them properly
> This approach will provide backward compatibility and avoid undesired bottlenecks on the server side.

I'm afraid this proposal is NOT backwards compatible. It will require applications using dbs3-client to actually upgrade it; otherwise they won't get the data aggregation functionality (which is currently provided by the server).

If we want a backward-compatible migration, we need to provide aggregated data, by default, from the dbs2go server, such that old clients keep exactly the same functionality and behavior.
Other applications that want to take the best of data streaming would have to upgrade their dbs3-client package and properly define the HTTP Accept header key.

Note, however, that this is a new approach that we are discussing. Up to now, we were saying that zero changes would be required in the dbs3-client package.

> It always puzzles me why we develop wrappers (DBS3Reader) around wrappers (DBS client) to perform a simple (HTTP) action which can be done through a standard library, and then complain that we have timeouts, slow queries, huge builds, etc.

If you have to perform the same call from different modules, or even different services, having extra data manipulation, error handling, and caching becomes quite important. You can keep all of that in a single place, instead of re-implementing the same thing in different parts of the repository. That said, IMO having a wrapper around a given service pays off in the long run, regardless of whether there is already a client for that service or not.
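
As a toy illustration of that point (hypothetical names, not the actual DBS3Reader code), a wrapper centralizes caching and error normalization for every caller:

# Toy wrapper, hypothetical names: one place for caching and error
# handling, so every caller gets the same behavior for free.
class DBSService:
    def __init__(self, api):
        self.api = api        # e.g. a dbs3-client DbsApi instance
        self._cache = {}

    def listFileLumis(self, lfn):
        if lfn not in self._cache:
            try:
                self._cache[lfn] = self.api.listFileLumis(logical_file_name=lfn)
            except Exception as exc:
                raise RuntimeError("DBS filelumis lookup failed for %s" % lfn) from exc
        return self._cache[lfn]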

@amaltaro do you worry about applications which talk to the DBS server directly, without going through the DBS Python client? Are there many such applications?

Alan, and what is stopping us from upgrading the DBS client?

I intentionally tried to avoid adding complexity to the server implementation, knowing how it influences performance. One of the main reasons I started the Go implementation of the DBS server was to solve the high RAM utilization, etc. There are different requirements on the server and in a client, and I'm trying to find common ground such that we can improve the server (and eliminate all the issues we have, including high RAM usage, timeouts, etc.) while keeping existing functionality. Unless you give me a significant reason why we can't upgrade the DBS client, I don't see it as a blocker for backward compatibility. I'll be happy to adjust the DBS client to add aggregation, and then we can upgrade it gradually everywhere. Then all code will be ready to use the new server. Does such a path seem reasonable to you?

@vkuznet, while it might not be impossible to upgrade applications that use the client, we are working in a resource-constrained environment where the precious resource is developer time. Updating all of the code that uses the DBS client for a change in server behavior is not on the list of Q4 development priorities, so we'd have to displace something else in order to get to it. If this can be handled by adjusting the client to do the aggregation (and otherwise shielding, as much as possible, the code that uses the client from changes to the underlying server), I think that would be a good compromise.

AFAICT there is some misunderstanding at work. E.g. I personally do not see any problem. I humbly suggest that you guys have a talk.

@klannon, I'm not sure I understand your comment. I never claimed that these changes are required in Q4, we have not yet defined any particular timeline for the DBS Go migration, and I have not asked the dev teams to do anything on this subject. I only outlined the differences, explained why they exist in the different implementations, and sought feedback. But your last sentence seems aligned with what I propose.

Yes, at this stage it is probably better to have a discussion over Zoom, and then update this issue with the outcome of that discussion. There are many fronts that we can explore here, and it looks like we are targeting something that is not going to be backwards compatible.

I guess we had better explore those possibilities and come up with the best cost-benefit in a meeting. It's planned for the next WMCore chat (next Monday), but I could make it tomorrow if others can too. Starting a thread over Slack.

Ah, I think I might have confused people, which is what I get for trying to dash off a response in between meetings. Definitely, @vkuznet, I am trying to agree with your proposal to handle this in the client. I thought you were asking for justification before pursuing this approach, which I was trying to give. I would really like to switch over to the Go-based DBS, so I favor options that don’t require additional development for WM or CRAB before we can make that change.

A meeting sounds like a good idea, but I’m booked up during CERN friendly hours for the rest of this week. I trust any solution that @amaltaro, @belforte, and @vkuznet agree to if you want to meet without me. Otherwise, maybe we can discuss in the next WMCore Chat?

The following PR, dmwm/DBSClient#20, contains all the required changes to the DBS client to provide aggregation, and therefore makes the DBS client functionality fully compatible between the Python and Go-based servers. The PR covers all cases listed in this issue.

Just a bit of history: the DBS server did not do aggregation initially, but we got a strong push to do it. The argument was that repeated data wasted network resources, and some clients needed to do aggregation anyway. So later, we added aggregation to some APIs that have heavy outputs and are widely used. That is why we have mixed outputs: some are aggregated and others are not.

I don't think the aggregation was the root cause of slow APIs. The major problem in the Python server was abuse of the server, such as applications using DBS as a cache by repeating queries in a loop, and someone mistakenly querying DBS from the grid. I hope the Go server can avoid all of this by making throttling work (throttling does not work in DBS right now), or Go may have better ways to control access from a single client.

I saw that the Go API outputs database primary keys, such as "child_file_id":653169117. These are internal to the DB. We can tweak the queries a bit to get rid of them without aggregation.

@yuyi, you're absolutely correct. Over time we learned things and reevaluated them. I was very critical with myself about which features should go into the new implementation. The aggregation procedure requires time and resources, regardless of how it is done, and I opted not to do it in the new implementation. This keeps the code simple and efficient. I realized that this step can easily be done in a client, which is not constrained by the cmsweb limitations (a 5-minute threshold for all HTTP requests), server hardware limitations, etc.

I also pointed out in a different email thread that our APIs should start using gzip to compress input/output data, which addresses the original argument for aggregation (reducing network bandwidth). I demonstrated that we can save a factor of 10+ in network bandwidth if clients simply request data in gzipped form. This feature is implemented in dbs2go too. For the record, I provided a full example of how to adapt python code to use gzip here.
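
For completeness, a minimal sketch of requesting gzipped output with only the Python standard library (the linked example is the authoritative one; this just shows the idea):

import gzip
import json
import urllib.request

url = "https://cmsweb-testbed.cern.ch/dbs2go/datasetaccesstypes"
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    # Decompress only if the server actually honored the request.
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    data = json.loads(body)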

And I hope we'll check all SQL queries to ensure that we do not leak extra information. So far, I took the queries as they were written in the Python code, but sometimes the Python code itself drops some input. In the new implementation all queries are in templates, so it will be easier to tweak them.

For the record, I added the separate issue #10872 to demonstrate gzip encoding both in the server and in the client. Since we use various clients, and the DBS client is one of them, it would be nice to add gzip encoding requests to it, but that is out of the scope of this issue.

@vkuznet Valentin, I reviewed the whole discussion in this thread and my conclusion is that all the APIs listed in the initial description of this issue now support data aggregation through dbs3-client, such that applications running with the latest dbs3-client are fully compatible with both the Python and Go-based DBS server implementations. Can you confirm that?

If you agree with this, then I think we can close this issue out.

@amaltaro to the best of my knowledge, that is the case. All APIs which require aggregation are now covered in the new dbs3-client. That said, if WMCore uses direct DBS APIs (as is the case in Common.py), then it should be checked. We already fixed Common.py, and I'm not aware of another place in WMCore which relies on direct REST calls to DBS; you'd better tell me. I'll check DBSReader.py to confirm, though; that is a separate issue: #10842

Thanks Valentin. I'm closing this one out then.