Using Rasterio's new linux wheels
sgillies opened this issue · 8 comments
We're making binary wheels for Linux that include all the C libraries Rasterio needs for all of the pre-1.0 releases. This is a post about how to use them.
These wheels are not intended for production use on the internet, but should be perfectly adequate for integration testing of Python software that requires Rasterio. They might even be useful for developing prototype services.
The GDAL library included in these wheels is only lightly provisioned with format drivers. The JPEG2000 driver based on Jasper is the only non-default driver. There are no proprietary drivers.
My example: installing Rasterio wheels on Ubuntu 14.04 and performing the little extra configuration needed to access AWS Public Datasets like Landsat on AWS.
Installation
I'm going to use a container based on the Ubuntu 14.04 image in Docker Hub as a host. It has Python 3 installed, but pip, the program we're going to use to install Rasterio, is not installed. Rather than install the python3-pip apt package (possibly requiring apt-get update) and drag in a mess of other dependencies, let's get pip via wget.
$ apt-get install wget
$ wget https://bootstrap.pypa.io/get-pip.py
$ python3 get-pip.py
Rasterio has a host of extra Python dependencies, thus it's always a good idea to install Rasterio applications in a dedicated environment. Create and activate one with virtualenv.
$ pip install virtualenv
$ virtualenv -p python3 venv
$ source venv/bin/activate
Now install Rasterio into the environment using pip, also requesting the optional "s3" set of extra dependencies (boto3 and more).
(venv)$ pip install --pre "rasterio[s3]>=1.0a4"
This fetches the rasterio-1.0a4-cp34-cp34m-manylinux1_x86_64.whl file from the Python Package Index and extracts it into the environment's site-packages directory. A peek into site-packages reveals the included C libraries.
(venv)$ ls -l venv/lib/python3.4/site-packages/rasterio/.libs/
total 122004
-rwxr-xr-x 1 root root 3659864 Dec 8 09:29 libcurl-96d9b940.so.4.4.0
-rwxr-xr-x 1 root root 94185184 Dec 8 09:29 libgdal-03eecd3b.so.20.1.2
-rwxr-xr-x 1 root root 22032320 Dec 8 09:29 libgeos-3-fc05f4c1.5.0.so
-rwxr-xr-x 1 root root 1499128 Dec 8 09:29 libgeos_c-09576097.so.1.9.0
-rwxr-xr-x 1 root root 1428600 Dec 8 09:29 libjasper-fb9de72f.so.1.0.0
-rwxr-xr-x 1 root root 43712 Dec 8 09:29 libjson-c-ca0558d5.so.2.0.1
-rwxr-xr-x 1 root root 2074320 Dec 8 09:29 libproj-18c59ecd.so.12.0.0
Yes, the libs are big. The wheels are heavy. I'm working on it, I promise.
Start a Python interpreter and import rasterio as a last check.
(venv)$ python
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rasterio
>>> rasterio.__gdal_version__
'2.1.2'
Configuration
Rasterio includes a program named "rio" and its "info" sub-command provides many of the same features as the venerable "gdalinfo" program. Before you can use it to query datasets on S3, you need to do a little extra system configuration.
First, set language and locale environment variables so rio will run properly with Python 3.
(venv)$ export LC_ALL=C.UTF-8
(venv)$ export LANG=C.UTF-8
Next, specify where to find the SSL certs on your host. Rasterio's libcurl, which is built on CentOS, expects /etc/pki/tls/certs/ca-bundle.crt. Ubuntu's are in a different location.
(venv)$ export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
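If the same code has to run on more than one distribution, the bundle location can be picked at startup instead of being hard-coded. Here is a minimal sketch in Python, assuming the two paths mentioned above are the only candidates; the function names are hypothetical and not part of Rasterio.

```python
import os

# Candidate CA bundle locations, taken from the text above. Treat this list
# as an assumption and extend it for other distributions.
CANDIDATE_BUNDLES = [
    "/etc/ssl/certs/ca-certificates.crt",  # Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",    # CentOS
]


def find_ca_bundle(candidates=CANDIDATE_BUNDLES):
    """Return the first candidate path that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_curl_ca_bundle(environ=os.environ, candidates=CANDIDATE_BUNDLES):
    """Set CURL_CA_BUNDLE unless it is already set; return the value in effect."""
    if "CURL_CA_BUNDLE" not in environ:
        bundle = find_ca_bundle(candidates)
        if bundle is not None:
            environ["CURL_CA_BUNDLE"] = bundle
    return environ.get("CURL_CA_BUNDLE")
```

Calling configure_curl_ca_bundle() early, before any dataset is opened, is probably the safest ordering. A value already exported in the shell is left untouched.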
Finally, set up AWS credentials. Rasterio uses boto3 to deal with credentials and these can be configured following the directions in the AWS CLI guide.
(venv)$ mkdir ~/.aws
(venv)$ cat << EOF > ~/.aws/credentials
> [default]
> aws_access_key_id = AWS_ACCESS_KEY_ID
> aws_secret_access_key = AWS_SECRET_ACCESS_KEY
> EOF
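The same file can be written from Python, which may be handier in a provisioning script. This is only a sketch using the standard library's configparser; the helper name write_aws_credentials is hypothetical, and the key values are placeholders you would replace with your own.

```python
import configparser
import os


def write_aws_credentials(access_key_id, secret_access_key, path=None, profile="default"):
    """Write (or update) one profile in an AWS shared credentials file."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".aws", "credentials")
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # interpolation=None so that a "%" in a key is stored literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)  # keeps other profiles if the file already exists
    config[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }

    # Create the file with owner-only permissions, since it holds a secret.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        config.write(f)
    return path
```

For example, write_aws_credentials("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY") produces a [default] section equivalent to the heredoc above.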
Running rio-info
Give an s3-prefixed object identifier, the same kind you would use with the AWS CLI, to rio info with a --indent 2 option to get pretty-printed JSON.
(venv)$ rio info --indent 2 s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
{
"blockxsize": 512,
"blockysize": 512,
"bounds": [
381885.0,
2279085.0,
610515.0,
2512815.0
],
"colorinterp": [
"grey"
],
"compress": "deflate",
"count": 1,
"crs": "EPSG:32645",
"descriptions": [
null
],
"driver": "GTiff",
"dtype": "uint16",
"height": 7791,
"indexes": [
1
],
"interleave": "band",
"lnglat": [
86.96327090815723,
21.666821827007773
],
"mask_flags": [
[
"all_valid"
]
],
"nodata": null,
"res": [
30.0,
30.0
],
"shape": [
7791,
7621
],
"tiled": true,
"transform": [
30.0,
0.0,
381885.0,
0.0,
-30.0,
2512815.0,
0.0,
0.0,
1.0
],
"units": [
null
],
"width": 7621
}
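Because the output is plain JSON, it is easy to consume from other programs. As a small worked example, the sketch below shows how the "bounds" and "res" items relate to "transform", "width", and "height". It uses only the standard library and a trimmed copy of the output above; the helper names are hypothetical, and the bounds calculation assumes a north-up raster like this one.

```python
import json

# A trimmed copy of the `rio info` output above (only the fields used here).
info = json.loads("""
{
  "width": 7621,
  "height": 7791,
  "bounds": [381885.0, 2279085.0, 610515.0, 2512815.0],
  "res": [30.0, 30.0],
  "transform": [30.0, 0.0, 381885.0, 0.0, -30.0, 2512815.0, 0.0, 0.0, 1.0]
}
""")


def pixel_to_xy(transform, col, row):
    """Map a (col, row) pixel corner to (x, y) in the dataset's CRS."""
    a, b, c, d, e, f = transform[:6]
    return (a * col + b * row + c, d * col + e * row + f)


def bounds_from_transform(transform, width, height):
    """Return (left, bottom, right, top), assuming a north-up raster."""
    left, top = pixel_to_xy(transform, 0, 0)
    right, bottom = pixel_to_xy(transform, width, height)
    return (left, bottom, right, top)


bounds = bounds_from_transform(info["transform"], info["width"], info["height"])
resolution = (info["transform"][0], -info["transform"][4])
print(bounds)      # matches info["bounds"]
print(resolution)  # matches info["res"]
```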
Efficient metadata queries
Access to S3 GeoTIFF metadata is very efficient. Thanks to GDAL's support for HTTP range requests, Rasterio only needs to download 0.03% of the dataset's bytes in order to query its metadata. Turn up the verbosity of rio-info and ask for extra curl logging to see the individual HTTP requests.
(venv)$ CPL_CURL_VERBOSE=1 rio -vv info s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF 2>&1 > /dev/null | grep '< '
< HTTP/1.1 400 Bad Request
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Thu, 08 Dec 2016 09:53:18 GMT
< Connection: close
< Server: AmazonS3
<
< HTTP/1.1 200 OK
< Date: Thu, 08 Dec 2016 09:53:21 GMT
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Server: AmazonS3
<
< HTTP/1.1 206 Partial Content
< Date: Thu, 08 Dec 2016 09:53:21 GMT
< Last-Modified: Sat, 14 Mar 2015 23:20:01 GMT
< ETag: "f08bdf1e626bf0039746c102fbd2c2b8"
< Accept-Ranges: bytes
< Content-Range: bytes 0-16383/51099231
< Content-Type: image/tiff
< Content-Length: 16384
< Server: AmazonS3
<
The HTTP/1.1 400 Bad Request is in response to probing of the object's folder that GDAL does by default. In a future version of GDAL the probing can be disabled.
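The 0.03% figure can be checked against the Content-Range header of the 206 response. A minimal sketch, assuming the header always has the "bytes FIRST-LAST/TOTAL" form shown in the log; the function names are hypothetical.

```python
import re


def parse_content_range(value):
    """Parse 'bytes FIRST-LAST/TOTAL' into (first, last, total) integers."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip())
    if match is None:
        raise ValueError("unsupported Content-Range: %r" % value)
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


def fraction_transferred(value):
    """Fraction of the whole object covered by one partial response."""
    first, last, total = parse_content_range(value)
    return (last - first + 1) / total  # byte ranges are inclusive


fraction = fraction_transferred("bytes 0-16383/51099231")
print("{:.2%}".format(fraction))  # 16384 of 51099231 bytes
```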
Efficient partial data queries
Because the Landsat GeoTIFFs are tiled, subsets of them can be queried for a fraction of the cost of downloading the entire dataset. I'm going to use Rasterio's dataset inspector, rio-insp, to demonstrate. Knowing that the GeoTIFF is tiled and that the tiles are 512 x 512 pixels, I'm going to request a subset corresponding to a single tile in the middle of the raster.
(venv)$ CPL_CURL_VERBOSE=1 rio -vv insp s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
Rasterio 1.0a4 Interactive Inspector (Python 3.4.3)
Type "src.meta", "src.read(1)", or "help(src)" for more information.
>>> from rasterio.windows import Window
>>> src.read(window=Window(2048, 2048, 512, 512))
Here are the request details printed to stderr:
> GET /L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF HTTP/1.1
Host: landsat-pds.s3.amazonaws.com
Range: bytes=12189696-12550143
Accept: */*
< HTTP/1.1 206 Partial Content
< Date: Fri, 09 Dec 2016 10:09:59 GMT
< Last-Modified: Sat, 14 Mar 2015 23:20:01 GMT
< ETag: "f08bdf1e626bf0039746c102fbd2c2b8"
< Accept-Ranges: bytes
< Content-Range: bytes 12189696-12550143/51099231
< Content-Type: image/tiff
< Content-Length: 360448
< Server: AmazonS3
<
And here is the abbreviated representation of the 512 x 512 array in the Python console:
array([[[10311, 10249, 10306, ..., 10736, 10637, 10468],
[10320, 10262, 10231, ..., 10834, 10682, 10461],
[10225, 10287, 10305, ..., 10742, 10660, 10516],
...,
[10055, 10072, 10042, ..., 10509, 10555, 10548],
[10034, 10055, 10042, ..., 10566, 10529, 10563],
[10005, 9996, 10030, ..., 10592, 10549, 10551]]], dtype=uint16)
Only about 0.7% of the dataset's bytes have to be read in order to get that subset. If I ask for the tile in the upper left corner, which happens to be all zeros and has been compressed to nearly nothing, there's no additional HTTP request: all the data for that tile was already picked up in the initial 16 KB request and cached by GDAL.
>>> src.read(window=Window(0, 0, 512, 512))
array([[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]], dtype=uint16)
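Reads are cheapest when the window lines up with the internal tiles, since every tile a window touches has to be fetched and decompressed. The sketch below is plain arithmetic, not a Rasterio API: it lists the tiles a window intersects, assuming the 512 x 512 block size reported by rio info. The function name blocks_for_window is hypothetical.

```python
def blocks_for_window(col_off, row_off, width, height, blockxsize=512, blockysize=512):
    """Return (block_row, block_col) indexes of every tile a window touches."""
    first_col = col_off // blockxsize
    last_col = (col_off + width - 1) // blockxsize
    first_row = row_off // blockysize
    last_row = (row_off + height - 1) // blockysize
    return [
        (block_row, block_col)
        for block_row in range(first_row, last_row + 1)
        for block_col in range(first_col, last_col + 1)
    ]


# The two aligned windows used above each touch exactly one tile.
print(blocks_for_window(2048, 2048, 512, 512))  # [(4, 4)]
print(blocks_for_window(0, 0, 512, 512))        # [(0, 0)]

# A window of the same size that is not aligned touches four tiles.
print(blocks_for_window(2000, 2000, 512, 512))  # [(3, 3), (3, 4), (4, 3), (4, 4)]
```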
That's it for examples in this post. There's more spacewalking to be done with other datasets and other formats. I'll leave that up to you.
See also
The manylinux project is the one that we're closely following to learn how to build these wheels.
The wheel building infrastructure is here: https://github.com/sgillies/frs-wheel-builds.
Feedback is very welcome
Are these useful to you? Can they be more useful with a modest amount of effort? Please let us know.
Thanks for reading!
I made edits to the post a few minutes ago. A reader pointed out to me in an email that Rasterio shouldn't be making so many requests for a tile. Indeed: I'd misused Rasterio's Window() constructor, and after fixing my usage I find that partial data access is even more efficient than I'd initially reported and correct in comparison to gdal_translate results.
I need similar functionality but with google cloud. Is it possible?
@MelnykAndriy search the repo for "google cloud": https://github.com/mapbox/rasterio/search?q=google+cloud&type=Issues&utf8=%E2%9C%93.
Is there a possibility of using "/vsizip/" as well as S3 to query metadata from a large zip compressed geotiff on S3?
I have tried to use this with Sentinel-2 images from the sentinel-s2-l1c AWS Public Dataset bucket. The Sentinel images are stored in the JPEG2000 format and internally tiled in blocks of 1024x1024 pixels. The windowed partial data query works fine as long as I only request data from "within" one internal tile. If I request a block that spans multiple tiles, the routine gives an error I cannot interpret.
Using
(venv)$ CPL_CURL_VERBOSE=1 rio -vv insp s3://sentinel-s2-l1c/tiles/29/S/ND/2017/11/16/0/B03.jp2
The following works, but seems to be less efficient, as it makes many more requests than in the TIF file example above.
>>> from rasterio.windows import Window
>>> src.read(window=Window(1024, 1024, 512, 512))
... (lots of output)
array([[[ 817, 779, 940, ..., 781, 669, 720],
[ 811, 797, 966, ..., 930, 695, 707],
[ 859, 894, 971, ..., 1161, 927, 806],
...,
[ 759, 772, 763, ..., 886, 844, 728],
[ 751, 747, 725, ..., 847, 825, 745],
[ 723, 678, 683, ..., 1022, 938, 806]]], dtype=uint16)
The following fails
>>> src.read(window=Window(1000, 1000, 512, 512))
DEBUG:rasterio._io:Output nodata value read from file: None
DEBUG:rasterio._io:Output nodata values: [None]
DEBUG:rasterio._io:Jump straight to _read()
DEBUG:rasterio._io:Window: Window(col_off=1000, row_off=1000, width=512, height=512)
DEBUG:rasterio._io:IO window xoff=1000.0 yoff=1000.0 width=512.0 height=512.0
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "rasterio/_io.pyx", line 330, in rasterio._io.DatasetReaderBase.read
File "rasterio/_io.pyx", line 591, in rasterio._io.DatasetReaderBase._read
OSError: Read or write failed
So I guess I have two questions: why is the partial querying doing many more requests than in the TIF example, and why am I getting the above errors?
My GDAL version
>>> import rasterio
>>> rasterio.__gdal_version__
'2.2.2'
@yellowcap I have also noticed poor performance with the same JP2 files. I think they're not optimized for remote access with GDAL like the Landsat PDS GeoTIFFs are.
Can you make a new ticket for the OSError issue above? That looks like a bug to me.
Thanks for the info @sgillies regarding performance. Any chance the Sentinel-2 data access can be optimized in the future through a software update without changes to the files? Or is that related to the files and cannot be worked around? I opened a separate ticket for the error as requested.