Unidata/thredds

URI encoding query strings to data servers

rschmunk opened this issue · 39 comments

A developer at OPeNDAP was testing my Panoply app and reported a problem accessing data on his testbed server. The same issue arises with IDV. It seems that when a variable in the remote dataset has been chosen and a subset of that variable is selected, the constraint query string sent to the server is not URI-encoded. Apparently, due to security concerns, some server software (e.g. Tomcat) is tightening requirements and enforcing URI encoding rules on incoming HTTP requests.

Here's part of the stack trace from IDV 5.5:

java.io.IOException: opendap.dap.DAP2Exception: Method failed:HTTP/1.1 400  on URL= http://theserver:8080/opendap/CMR/C1276812863-GES_DISC/temporal/1985/06/05/M2T1NXSLV.5.12.4%3AMERRA2_100.tavg1_2d_slv_Nx.19850605.nc4.dods?U250[2:1:2][0:1:360][0:1:575]; U250 -- 2:2,0:360,0:575
	at ucar.nc2.dods.DODSNetcdfFile.readData(DODSNetcdfFile.java:1693)
	at ucar.nc2.Variable.reallyRead(Variable.java:913)
	at ucar.nc2.Variable._read(Variable.java:898)
	at ucar.nc2.Variable.read(Variable.java:709)
	at ucar.nc2.dataset.VariableDS.reallyRead(VariableDS.java:557)
	at ucar.nc2.dataset.VariableDS._read(VariableDS.java:537)
	at ucar.nc2.Variable.read(Variable.java:709)
	at ucar.nc2.Variable.read(Variable.java:655)
	at ucar.nc2.dt.grid.GeoGrid.readDataSlice(GeoGrid.java:606)
	at ucar.visad.data.GeoGridFlatField.readData(GeoGridFlatField.java:246)
        ...

A stack trace in Panoply using a current netCDF-Java 5.0 snapshot reveals a bit more depth:

opendap.dap.DAP2Exception: Method failed:HTTP/1.1 400  on URL= http://theserver:8080/opendap/CMR/C1276812863-GES_DISC/temporal/1985/06/05/M2T1NXSLV.5.12.4%3AMERRA2_100.tavg1_2d_slv_Nx.19850605.nc4.dods?T250[0:1:0][0:1:360][0:1:575]
	at opendap.dap.DConnect2.openConnection(DConnect2.java:297)
	at opendap.dap.DConnect2.getData(DConnect2.java:913)
	at opendap.dap.DConnect2.getData(DConnect2.java:1213)
	at ucar.nc2.dods.DODSNetcdfFile.readDataDDSfromServer(DODSNetcdfFile.java:1436)
	at ucar.nc2.dods.DODSNetcdfFile.readData(DODSNetcdfFile.java:1596)
	at ucar.nc2.Variable.reallyRead(Variable.java:823)
	at ucar.nc2.Variable._read(Variable.java:808)
	at ucar.nc2.Variable.read(Variable.java:619)
	at ucar.nc2.dataset.VariableDS.reallyRead(VariableDS.java:424)
	at ucar.nc2.dataset.VariableDS._read(VariableDS.java:408)
	at ucar.nc2.Variable.read(Variable.java:619)
	at ucar.nc2.Variable.read(Variable.java:565)
        ...

The problem is that the query strings in these requests look like T250[0:1:0][0:1:360][0:1:575], but :, [ and ] are reserved characters and need to be percent-encoded.

For the purposes of Panoply, I found I could hack DODSNetcdfFile.readDataDDSfromServer to alter the CE (constraint expression) via CE = java.net.URLEncoder.encode(CE, "UTF-8"); before the call to dodsConnection.getData(CE, null) at line 1436. The request went through without complaint and the data came back.
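
For reference, a minimal self-contained sketch of that kind of encoding (the class and helper names here are hypothetical; only URLEncoder is from the JDK). It just shows what percent-encoding does to a constraint expression:

import java.net.URLEncoder;

public class CeEncodingSketch {
  // Hypothetical helper: percent-encode a DAP constraint expression before
  // passing it to DConnect2.getData(). Note that URLEncoder does HTML-form
  // encoding, so spaces become '+', which may need separate handling.
  static String encodeCE(String ce) throws Exception {
    return URLEncoder.encode(ce, "UTF-8");
  }

  public static void main(String[] args) throws Exception {
    String ce = "U250[2:1:2][0:1:360][0:1:575]";
    // Prints U250%5B2%3A1%3A2%5D%5B0%3A1%3A360%5D%5B0%3A1%3A575%5D
    System.out.println(encodeCE(ce));
  }
}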

However, I'm wondering if this is the best place for that fix. Perhaps encoding the CE should be done within DConnect2, perhaps within its getCompleteCE(String) method?

BTW: The OPeNDAP developers found that a 4-year-old version of Panoply based on netCDF-Java 4.3.22 properly encoded the URI when communicating with their server. Further testing on my part found that this problem arose when my code started using NJ 4.5.x.

This has begun to occur because Tomcat changed a default about what characters are legal in incoming URLs. If memory serves, it is possible to undo that change in the Tomcat configuration file. It is assumed that URL encoding is to be done at the lowest levels (somewhere in libcurl). Since we have to deal with both situations on different Tomcat servers, it will cause problems if we start encoding square brackets at a higher level. This is all complicated because OPeNDAP uses %xx encoding as its own escaping mechanism.
I would suggest that server administrators undo the change in the Tomcat configuration.

Are we certain that doing proper URL encoding will break a TDS running on older Tomcat versions? It seems like having the URL encoded should always be OK.

As for older servers, we would need to stand up such a server. I'm not sure how far back we need to go (probably pre-GitHub, since I recall addressing this issue many years ago).
We know that encoding the '[]' causes the server to fail to process the request, so fixing that is the most important task.

Hi folks,

I have spent quite a bit of time looking into this issue in Hyrax and I would like to share my findings.

The change in Tomcat was not actually a change in "what characters are legal". These characters were never legal unless URI-encoded. Unfortunately, Tomcat (and pretty much every other web server) was lenient with respect to this issue, and over time clients became non-conformant. It is the client application's responsibility to ensure that the URL content is encoded correctly before the request is transmitted. Not doing this is now considered a significant security issue, and all major web servers are moving to block unencoded URL content.

My first inclination was also to reconfigure Tomcat to allow the verboten characters. But after discussing this with James and several others we came to the conclusion that suggesting to sys admins that they lower the security barriers in order to accept unencoded content from non-conformant clients doesn't seem to be an advisable path forward. (And imagine the fallout if you make this suggestion, they do it, and then they get compromised. Ouch.)

And I think it is not necessary. Older servers should accept the encoded content just fine, since the specification has always required that clients transmit correctly encoded URL content. Tomcat et al. have been lenient, but they also accept correct inputs.

For example, our test server, test.opendap.org, is running an older Hyrax (1.14.0) and Tomcat-7.0.90. Both encodings work:

curl -g "http://test.opendap.org/opendap/w10n/data/nc/fnoc1.nc/u[0][0][0]?output=json"
curl -g "http://test.opendap.org/opendap/w10n/data/nc/fnoc1.nc/u%5B0%5D%5B0%5D%5B0%5D?output=json"

(Seems like either GitHub's link builder or my browser was encoding all of the URLs in my examples so I buried them in code blocks to stop it. hah.)

However, this test mule is running Hyrax on top of Tomcat-8.5.33 and only the encoded URL succeeds:

curl -g "http://52.23.165.57:8080/opendap/w10n/data/nc/fnoc1.nc/u[0][0][0]?output=json"
curl -g "http://52.23.165.57:8080/opendap/w10n/data/nc/fnoc1.nc/u%5B0%5D%5B0%5D%5B0%5D?output=json"

One thing that I found a bit vexing is that when the encoding is rejected, Tomcat returns only HTTP headers and no response body:

curl -i -g "http://52.23.165.57:8080/opendap/w10n/data/nc/fnoc1.nc/u[0][0][0]?output=json"
HTTP/1.1 400 
Transfer-Encoding: chunked
Date: Tue, 11 Sep 2018 00:36:11 GMT
Connection: close

Which can make understanding the problem on the client side difficult.

I think Robert is dead on with this:

However, I'm wondering if this is the best place for that fix. Perhaps encoding the CE should be done within DConnect2, perhaps within its getCompleteCE(String) method?

I hope you will consider revisiting this before choosing not to make the change.

Sincerely,
Nathan

Certainly we need to fix the server to be able to handle URL-encoded requests, and we need netCDF-Java to make encoded requests by default (and perhaps provide a signature to allow non-encoded requests). Dennis - can you investigate this on 5.0 first? Ideally, this should be done on the 4.6.x branch as well if possible (I will make a 4.6.12 release as soon as it is in).

Is there anything to fix in TDS? If you enter a DAP URL into a browser, the browser encodes the URL before making the request. I haven't yet had trouble mucking about with DAP URLs in a browser. So, I suspect TDS handles encoded DAP URLs.

Is there anything to fix in TDS? If you enter a DAP URL into a browser, the browser encodes the URL before making the request. I haven't yet had trouble mucking about with DAP URLs in a browser. So, I suspect TDS handles encoded DAP URLs.

From Dennis's comment above:

We know that encoding the '[]' causes the server to fail to process the request, so fixing that
is the most important task.

So I was basing my comment about fixing the server on that.

Did I misinterpret what failure we were seeing?

The problem that Robert originally raised is that the netCDF-Java library is no longer encoding DAP request URLs where previously it did. (And it should still.)

Dennis stated that doing so would confound the TDS.

I have not tested the TDS. But I can say that in Hyrax I have had to make zero changes to accommodate the encoded URLs - neither in the OLFS (the Java servlet in Tomcat) nor in our C++ BES code. For what it's worth, I checked my code in a debugger and it's clear that Tomcat is doing the decoding. For example, HttpServletRequest.getPathInfo() returns the decoded path part of the request URL, and HttpServletRequest.getParameterMap() returns the decoded query-string components. Of course, like all things, YMMV.
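
To illustrate what that looks like from the servlet side, here is a minimal sketch (assuming the standard javax.servlet API that Tomcat provides; the servlet itself is hypothetical, not OLFS code):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DecodeProbeServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // For a request to .../u%5B0%5D%5B0%5D%5B0%5D?output=json, getPathInfo()
    // already returns ".../u[0][0][0]" - Tomcat has done the percent-decoding.
    resp.getWriter().println("pathInfo     = " + req.getPathInfo());
    // getParameterMap() likewise exposes decoded query-string components.
    resp.getWriter().println("parameterMap = " + req.getParameterMap());
    // getQueryString(), by contrast, is returned exactly as the client sent it.
    resp.getWriter().println("queryString  = " + req.getQueryString());
  }
}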

When I first came to Unidata, the fact that the DAP2 protocol used
%xx encoding as its escape encoding caused us a lot of problems
in terms of conflicts between URL encoding and the occurrence of
DAP2 escapes. So, we made some changes to deal with that problem.
Frankly, I no longer remember the problems or the solution we came up with.
So, a side experiment is to see what happens when you have a DAP2 request that
has %xx escapes due to DAP2 escaping separate from the URL encoding.

In any case, it will take me a while to reconstruct what the problem was and how we solved it. My speculation is that somewhere we are doing the encoding using a now-faulty set of characters to escape, or are using something like the Java URLEncoder class that needs updating.

A few points to clarify how and why OPeNDAP's C++ server-side software uses the %xx encoding, or more to the point, takes advantage of its presence. We don't use the encoding in any way that is incompatible with the web, so there's no reason that encoding a URL will break any OPeNDAP software (assuming that the software is correctly written, but that's a given). In our C++ server software (code that is run long after the information has been read off the wire and passed down through several layers), the 'Web URL encoding' is removed. Dennis, I think you may be remembering that the software does this in two stages: first all but the spaces are decoded, then the constraint expression is parsed, and then the spaces are treated. So, in that very limited sense, our software uses the encoding. But this happens far down inside the server, and properly encoding a URL in the client will in no way affect it. As for the TDS code, it's hard to imagine that correctly encoding URLs would break it. Certainly, web browsers feed the server encoded URLs.

@dmh - any updates? I'll be making a PR that makes sure the URLs sent by the OPeNDAP Dataset Access Form are encoded properly, as the requests made by the Get Ascii and Get Binary buttons are built by JavaScript and are not being encoded correctly.

The bigger problem lies in the httpservices module - this impacts all of netCDF-Java, as well as any client using the httpservices module.

I noticed in 4.6 and 5.0 that there is a comment in the HTTPSession code that says:

 * Note that the term legalurl in the following code means that the url has
 * reserved characters within identifiers in escaped form. This is
 * particularly an issue for queries. Especially: ?x[0:5] is legal and the
 * square brackets need not be encoded.

We need to start encoding those square brackets.
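
As a sketch of what that could look like (illustrative only, not the code that went into httpservices), the query part could have just the rejected characters percent-encoded, leaving everything else untouched:

public class BracketEncodeSketch {
  // Encode only the characters Tomcat now rejects in the query part,
  // leaving the rest of the string (including any existing %xx escapes) as-is.
  static String encodeBrackets(String query) {
    return query.replace("[", "%5B").replace("]", "%5D");
  }

  public static void main(String[] args) {
    // Prints x%5B0:5%5D
    System.out.println(encodeBrackets("x[0:5]"));
  }
}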

Once this is fixed, I'll cut a 4.6.12 release and a 5.0.0-beta6 release.

I knew I forgot something :-)
I will try to solve this ASAP.

So I added encoding in the branch fixurlencode.tmp
and it caused two extra tests to fail on this URL:
http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods.dods?olr[0:1:0][0:1:72][0:1:143]
But putting that url into a browser succeeded.

Well, heck - it looks like they return a 404 with a slightly better message in the response body:

Request Headers2 = 
 
Status = 404 HTTP/1.1 404 Not Found
Status Line = HTTP/1.1 404 Not Found
Response Headers = 
  Date: Fri, 02 Nov 2018 22:18:27 GMT
  Server: Ingrid 0.9
  Mime-Version: 1.0
<snip>

ResponseBody---------------
Error {
    code = 3;
    message ="The identifier `olr%5B0:1:0%5D%5B0:1:72%5D%5B0:1:143%5D' is not in the dataset.";
};

We could start getting all kinds of hack-y here, and check the headers for Server: Ingrid, but let's not start down that road of madness.

Does anyone have a connection to the developers of Ingrid? This is the first I've personally encountered it. @ethanrd?

Do we have any idea what kind of server/servlet setup they are using? It does not look like THREDDS; it looks home-grown.

According to http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods.ver:

Ingrid is currently at version 9.10.59

I can't seem to find anything that looks like what we are seeing here. The closest would be https://www.ingrid-oss.eu/latest/index.html, but I'm pretty sure that's not right (they are at version 4.5 of their software stack); that looks very WMS-y, and not very DAP-y.

Maybe @rabernat knows someone with more information about Ingrid?

Benno Blumenthal was the original author of the Ingrid data system. I believe he, for personal reasons, stopped working on Ingrid several years ago. I don't know who might have taken over.

@DennisHeimbigner Since the URL you gave works in the browser, I would check the tests again.

(FYI the link text is not encoded but the link URL is already encoded. So following the link ends up with the browser double encoding the URL.)

Sorry, I do not understand your comment. Who is double encoding? ncdump? the web browser?

@ethanrd - I think I might have double encoded using toolsUI (looks like it might be encoding in the UrlDump tab), but I don't think our tests are:

public void testGrid() throws IOException, InvalidRangeException {
  try (DODSNetcdfFile dodsfile = TestDODSRead.openAbs("http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods")) {
    Variable dataV = null;
    // array
    assert (null != (dataV = dodsfile.findVariable("olr")));
    assert dataV instanceof DODSVariable;
    // maps
    Variable v = null;
    assert (null != (v = dodsfile.findVariable("time")));
    assert (null != (v = dodsfile.findVariable("lat")));
    assert (null != (v = dodsfile.findVariable("lon")));
    // read data
    Array data = dataV.read("0, 0:72:1, 0:143:1");
    assert null != data;
  }
}

Currently, I am encoding down in HTTPMethod.java, so
doing it anywhere higher in the stack is probably going
to lead to double encoding.

@DennisHeimbigner Sorry, I was waiting for a plane and writing on my phone for that last, very terse message.

I just meant that, since browsers encode URLs, if the Ingrid URL with square brackets works in a browser, the server probably is not the problem.

I really think this is a server problem.

Using curl without encoding works:

 curl -G -v http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods.dods?olr[0:1:0][0:1:2][0:1:3]
*   Trying 129.236.110.35...
* Connected to iridl.ldeo.columbia.edu (129.236.110.35) port 80 (#0)
> GET /SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods.dods?olr[0:1:0][0:1:2][0:1:3] HTTP/1.1
> Host: iridl.ldeo.columbia.edu
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< XDODS-Server: Ingrid/2.15
< Date: Mon, 05 Nov 2018 16:28:53 GMT
< Server: Ingrid 0.9
< Mime-Version: 1.0
< Access-Control-Allow-Origin: *
< Cache-Control: public
< Content-Description: dods_data
< Content-Type: application/octet-stream
< Last-Modified: Mon, 30 Apr 2018 07:04:57 GMT
< Content-Length: 321
< X-Cache: MISS from gfs2geo2.ldeo.columbia.edu
< X-Cache-Lookup: MISS from gfs2geo2.ldeo.columbia.edu:3128
< X-Cache: MISS from iridls1.ldeo.columbia.edu
< X-Cache-Lookup: MISS from iridls1.ldeo.columbia.edu:80
< Via: 1.0 gfs2geo2.ldeo.columbia.edu:3128 (squid/2.7.STABLE9), 1.0 iridls1.ldeo.columbia.edu:80 (squid/2.7.STABLE9)
<
Dataset {
    Grid {
     ARRAY:
        Int16 olr[time = 1][lat = 3][lon = 4];
     MAPS:
        Float32 time[time = 1];
        Float32 lat[lat = 3];
        Float32 lon[lon = 4];
    } olr;
} daily;
Data:
* Connection #0 to host iridl.ldeo.columbia.edu left intact
<binary DODS data>

but with encoding fails:

curl -G -v http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods.dods --data-urlencode "olr[0:1:0][0:1:2][0:1:3]"
*   Trying 129.236.110.35...
* Connected to iridl.ldeo.columbia.edu (129.236.110.35) port 80 (#0)
> GET /SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods.dods?olr%5B0%3A1%3A0%5D%5B0%3A1%3A2%5D%5B0%3A1%3A3%5D HTTP/1.1
> Host: iridl.ldeo.columbia.edu
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Date: Mon, 05 Nov 2018 16:30:44 GMT
< Server: Ingrid 0.9
< Mime-Version: 1.0
< Access-Control-Allow-Origin: *
< Cache-Control: public
< Content-Type: text/plain
< XDODS-Server: Ingrid/2.15
< Content-Description: dods_error
< X-Cache: MISS from gfs2geo2.ldeo.columbia.edu
< X-Cache-Lookup: MISS from gfs2geo2.ldeo.columbia.edu:3128
< X-Cache: MISS from iridls1.ldeo.columbia.edu
< X-Cache-Lookup: MISS from iridls1.ldeo.columbia.edu:80
< Transfer-Encoding: chunked
< Via: 1.0 gfs2geo2.ldeo.columbia.edu:3128 (squid/2.7.STABLE9), 1.0 iridls1.ldeo.columbia.edu:80 (squid/2.7.STABLE9)
< Connection: close
<
Error {
    code = 3;
    message ="The identifier `olr%5B0%3A1%3A0%5D%5B0%3A1%3A2%5D%5B0%3A1%3A3%5D' is not in the dataset.";
};
* Closing connection 0

I'll try pinging some people from the IRI Ingrid world to see if they have any insights on this issue:

@jdelcorral, @remicousin, @agmunozs, @ikhomyakov

@lesserwhirls Yes, similar to the results when pasting the URL into a browser.

I believe both the browser and curl are encoding the URL behind the scenes. So, for your first request with the unencoded URL [1], the server gets the encoded URL [2], which it decodes back to the unencoded URL [1]. The query part the server interprets is then a valid DAP constraint expression, and the request succeeds.

For your second request with the encoded URL [2], the server gets a double-encoded URL [3], which it decodes back to the encoded URL [2]. The server tries to interpret that as a DAP constraint expression and fails.

[1] Unencoded URL: ?olr[0:1:0][0:1:2][0:1:3]
[2] Encoded URL: ?olr%5B0%3A1%3A0%5D%5B0%3A1%3A2%5D%5B0%3A1%3A3%5D
[3] Double encoded URL: ?olr%255B0%253A1%253A0%255D...
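
A quick way to see that double encoding in Java (a sketch; real clients may encode slightly differently than URLEncoder does):

import java.net.URLEncoder;

public class DoubleEncodeDemo {
  public static void main(String[] args) throws Exception {
    String plain = "olr[0:1:0][0:1:2][0:1:3]";
    String once  = URLEncoder.encode(plain, "UTF-8"); // olr%5B0%3A1%3A0%5D...
    String twice = URLEncoder.encode(once, "UTF-8");  // olr%255B0%253A1%253A0%255D...
    System.out.println(once);
    System.out.println(twice);
  }
}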

@agmunozs We noticed, after enforcing URL encoding in netCDF-Java (which sits on top of the Apache HttpClient library), that the Ingrid server does not appear to handle encoded URLs, specifically when there are encoded square brackets or colons in the query part of the request. See #1144 (comment) and #1144 (comment).

Ethan -
libcurl explicitly claims it does not encode anything and that it is up to the caller to do that.

@DennisHeimbigner Yes, I talked to Sean about this offline and he convinced me that curl and browsers don't encode URLs (or at least different browsers do it differently).

It makes sense to me that libcurl would leave that to the caller. But I'm surprised that browsers don't encode URLs. From previous run-ins with this issue, I believe browsers used to encode URLs.

@DennisHeimbigner - let's take your fix as is, broken tests and all. Unless you are super bored, I can add some code to look for a Java system property (something like -Dhttpservices.urlencode=false) to override the new encoding by httpservices (default true). Then, we can update the failing test to set the property to false, and voilà - new feature test. Once that's all done, I can backport to 4.6.

Ok, netCDF-Java will now, by default, encode URLs. This has been addressed in the 4.6 and 5.0 branches. I will be cutting a 4.6.12 and a 5.0.0-beta6 release this week.

We have found some servers that do not handle this (e.g. iridl.ldeo.columbia.edu), so there is an option to turn encoding on and off as needed. This can be done by setting a Java property, either by passing -Dhttpservices.urlencode="false" on the command line or by using System.setProperty("httpservices.urlencode", "false"); in code (the default is "true").
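
For example, a minimal sketch of disabling the encoding in code (the property name is from the comment above; the dataset URL is just the Ingrid example from this thread):

import ucar.nc2.dataset.NetcdfDataset;

public class EncodingToggleExample {
  public static void main(String[] args) throws Exception {
    // Turn off URL encoding in httpservices for servers that reject it.
    System.setProperty("httpservices.urlencode", "false");
    String url = "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GLOBAL/.daily/dods";
    try (NetcdfDataset ncd = NetcdfDataset.openDataset(url)) {
      System.out.println(ncd.findVariable("olr"));
    }
  }
}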

@lesserwhirls, Any chance the 5.0b6 will be posted Thursday? I'm on the verge of posting a Panoply bugfix release.

@msdsoftware - there are a few 3rd party deps that need to be updated, but I will work hard to get 5.0b6 up today to make sure it's ready for you.

@lesserwhirls, Yes, netCDFAll.jar is the only jar I actually use. Building via gradle would have been no problem, but thanks for providing a compiled binary.

Great - let me know how it goes!