Automatically predicting the list of species most likely to be observed at a given location is useful for many scenarios in biodiversity informatics. First of all, it could improve species identification processes and tools, whether automated, semi-automated or based on classical field guides and floras, by reducing the list of candidate species observable at a given location. More generally, it could facilitate biodiversity inventories through the development of location-based recommendation services (typically on mobile phones) and the involvement of non-expert nature observers. Last but not least, it might serve educational purposes through biodiversity discovery applications providing functionalities such as contextualized educational pathways.
The rest of this document presents (1) the data and (2) the Python code.
The data are composed of two parts: the environmental rasters and the actual dataset containing all the occurrences. All the data can be downloaded from the CrowdAI page. This section describes both parts; see the protocol note for more details.
The rasters are available directly on CrowdAI. The following variables are available:
Name | Description | Nature | Values |
---|---|---|---|
CHBIO_1 | Annual Mean Temp. (mean of monthly) | quanti. | [-10.7,18.4] |
CHBIO_2 | Mean diurnal range (mean of monthly max temp - min temp) | quanti. | [7.8, 21.0] |
CHBIO_3 | Isothermality (100 * CHBIO_2 / CHBIO_7) | quanti. | [41.1,60.0] |
CHBIO_4 | Temp. seasonality (std.dev*100) | quanti. | [302.7, 777.8] |
CHBIO_5 | Max Temp of warmest month | quanti. | [6.1,36.6] |
CHBIO_6 | Min Temp of coldest month | quanti. | [-28.3,5.4] |
CHBIO_7 | Temp. annual range | quanti. | [16.7,42.0] |
CHBIO_8 | Mean temp. of wettest quarter | quanti. | [-14.2,23.0] |
CHBIO_9 | Mean temp. of driest quarter | quanti. | [-17.7,26.5] |
CHBIO_10 | Mean temp. of warmest quarter | quanti. | [-2.8, 26.5] |
CHBIO_11 | Mean temp. of coldest quarter | quanti. | [-17.7, 11.8] |
CHBIO_12 | Annual precipitation | quanti. | [318.3,2543.3] |
CHBIO_13 | Precipitation of wettest month | quanti. | [43.0,285.5] |
CHBIO_14 | Precipitation of driest month | quanti. | [3.0,135.6] |
CHBIO_15 | Precipitation seasonality (coef. of var.) | quanti. | [8.2,26.5] |
CHBIO_16 | Precipitation of wettest quarter | quanti. | [121.6,855.6] |
CHBIO_17 | Precipitation of driest quarter | quanti. | [19.8,421.3] |
CHBIO_18 | Precipitation of warmest quarter | quanti. | [198,851.7] |
CHBIO_19 | Precipitation of coldest quarter | quanti. | [60.5,520.4] |
etp | Potential evapotranspiration | quanti. | [133,1176] |
alti | Elevation | quanti. | [-188,4672] |
awc_top | Topsoil available water capacity | ordinal | {0,120,165,210} |
bs_top | Base saturation of the topsoil | ordinal | {35,62,85} |
cec_top | Topsoil cation exchange capacity | ordinal | {7,22,50} |
crusting | Soil crusting class | ordinal | [0,5] |
dgh | Depth to a gleyed horizon | ordinal | {20,60,140} |
dimp | Depth to an impermeable layer | ordinal | {60,100} |
erodi | Soil erodibility class | ordinal | [0,5] |
oc_top | Topsoil organic carbon content | ordinal | {1,2,4,8} |
pd_top | Topsoil packing density | ordinal | {1,2} |
text | Dominant surface textural class | ordinal | [0,5] |
proxi_eau_fast | <50 meters to fresh water | boolean | {0,1} |
clc | Land cover (CORINE Land Cover) | categorical | [1,48] |
More details about each raster are available within the archive.
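Before using the extractor described below, a raster can also be inspected directly with rasterio. This is only a minimal sketch: the directory layout and file name are assumptions to adapt to the actual archive contents.

```python
import rasterio

# Hypothetical path: adapt it to the actual layout of the raster archive.
with rasterio.open('/home/test/rasters/chbio_1/chbio_1.tif') as src:
    band = src.read(1)                   # read the first band as a numpy array
    print(src.crs, src.res, band.shape)  # coordinate system, resolution, grid size
```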
The dataset is composed of multiple files:
- PL_complete.csv
- PL_trusted.csv
- noPlant.csv
- GLC_2018.csv
More details about the dataset are given in the protocol note. The dataset columns include:
Name | Description | Data source |
---|---|---|
Longitude | decimal longitude in the WGS84 coordinate system. | All |
Latitude | decimal latitude in the WGS84 coordinate system. | All |
glc19SpId | The GLC19 reference identifier for the species name. | All |
scName | the taxon name of the occurrence in the original data source. | All |
coordinateuncertaintyinmeters | location uncertainty in meters. | GBIF |
accuracy | coordinate uncertainty in meters mostly computed by smartphone devices. | PL |
date | date of the observation. | PL |
eventDate | date of the observation. | GBIF |
X_key | a key for the observation. | PL |
session | Pl@ntNet session ID | PL |
project | the Pl@ntNet taxonomic referential to which the original taxon name belongs. | PL |
FirstResPLv2Score | the confidence score of the automatically identified species. | PL |
Notice that the most important fields are Latitude and Longitude, which are used to extract the environmental patch, and glc19SpId, which contains the species ID.
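For example, these fields can be loaded with pandas; the file name is taken from the list above, and the separator is an assumption to check against the actual files.

```python
import pandas as pd

# The ';' separator is an assumption; check the actual CSV files.
df = pd.read_csv('PL_trusted.csv', sep=';', low_memory=False)
positions = df[['Latitude', 'Longitude']].values  # used to extract environmental patches
species_ids = df['glc19SpId'].values              # target species identifiers
```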
The file `environmental_raster_glc.py` provides participants of the GLC19 challenge with a means to extract environmental patches or vectors from the provided rasters. Given a set of input rasters, it enables either the online (in-memory) extraction of environmental patches at a given spatial position, or the offline construction (on disk) of all the patches for a set of spatial positions.
The following examples are for Python 3, but the code should also work with Python 2.
The `environmental_raster_glc.py` script serves two purposes:
- in-code use,
- command-line use.
In-code use makes it possible to extract environmental tensors on the fly, for instance within a PyTorch dataset (see the sketch further below), thus reducing I/O and improving training performance.
Command-line use makes it possible to export the dataset to disk.
In addition to the standard libraries, this code requires the following packages:

```python
import rasterio
import pandas
import numpy
import matplotlib
```
The core object to manipulate is the `PatchExtractor`, which manages the available rasters. Constructing the extractor only requires the `root_path` of the raster data:
```python
# constructing the extractor
extractor = PatchExtractor(root_path='/home/test/rasters')
```
By default, the extractor returns n×64×64 patches (where n depends on the rasters). For a custom size (other than 64), the constructor also accepts an additional `size` parameter:
```python
# constructing the extractor
extractor = PatchExtractor(root_path='/home/test/rasters', size=256)
```
Be careful: if `size` is too large, some patches of the dataset will be smaller, because they extend beyond the border of the raster map.
If `size` equals 1, the extractor returns an environmental vector instead of an environmental tensor.
Once the extractor is available, the rasters can be added. Two strategies are possible: either adding all the rasters at once, or adding them one by one, which allows specific transformations to be applied and some rasters to be skipped.
```python
# adding a single raster
extractor.append('clc', nan=0, normalized=True, transform=some_user_defined_function)
```
or
```python
# adding all the rasters at root_path
extractor.add_all(nan=0, normalized=True, transform=some_user_defined_function)
```
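Here, `some_user_defined_function` is a placeholder for a user-provided transform. Its exact contract is defined in `environmental_raster_glc.py`; a minimal sketch, assuming the transform receives a numpy patch and returns a modified patch:

```python
import numpy as np

# Hypothetical transform: the assumed contract is numpy patch in, numpy patch out.
# Check environmental_raster_glc.py for the exact signature expected.
def some_user_defined_function(patch):
    # clip extreme values and cast to float32 for training
    return np.clip(patch, -1e4, 1e4).astype(np.float32)
```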
In addition, some rasters are better used through a one-hot encoding representation, which increases the depth of the environmental tensor. The global parameter `raster_metadata` enables setting some of these properties on a per-raster basis. If parameters are not set, default values are used. For instance, `nan` has a default value on a per-raster basis; to change it, either modify the metadata or set the parameter explicitly.
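For instance, the default nan replacement can be overridden when appending a single raster (the value below is purely illustrative):

```python
# Override the default nan replacement value for one raster (illustrative value).
extractor.append('alti', nan=-1)
```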
Please check the `environmental_raster_glc.py` file for more details.
The extractor acts as an array. For instance, `len(extractor)` gives the number of available rasters. Accessing a specific vector or tensor is done by indexing with a latitude and a longitude:
```python
env_tensor = extractor[43.61, 3.88]
# env_tensor is a numpy array
```
Note that the shape of `env_tensor` does not necessarily correspond to `len(extractor)`, as variables using a one-hot encoding representation produce a deeper tensor.
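As an illustration of the in-code use mentioned earlier, the extractor can be wrapped in a PyTorch dataset. This is only a minimal sketch, assuming `occurrences` is a list of (latitude, longitude) pairs and `labels` the matching glc19SpId values:

```python
import torch
from torch.utils.data import Dataset

class EnvironmentalDataset(Dataset):
    # Minimal sketch: occurrences is a list of (latitude, longitude) pairs,
    # labels the corresponding glc19SpId values (both names are assumptions).
    def __init__(self, extractor, occurrences, labels):
        self.extractor = extractor
        self.occurrences = occurrences
        self.labels = labels

    def __len__(self):
        return len(self.occurrences)

    def __getitem__(self, index):
        latitude, longitude = self.occurrences[index]
        patch = self.extractor[latitude, longitude]  # online (in-memory) extraction
        return torch.from_numpy(patch), self.labels[index]
```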
The extractor also enables plotting a specific patch:
```python
extractor.plot((43.61, 3.88))
# an optional style parameter can modify the style temporarily
```
Resulting in images of the following type:
The plot method accepts a `cancel_one_hot` parameter whose value is True by default, so a variable initially set to use a one-hot encoding is represented as a single patch. In the previous image, `clc` is set to a one-hot encoding representation but is plotted as a single patch.
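To inspect each dimension of a one-hot encoded variable separately, the parameter can be set to False:

```python
# plot one-hot encoded variables as separate patches instead of a single one
extractor.plot((43.61, 3.88), cancel_one_hot=False)
```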
Using the online extraction of patches is fast but requires a significant amount of memory to store all the rasters. For those who would rather export the patches to disk, an additional functionality is provided.
The patches corresponding to a CSV dataset can be extracted with the following command:

```
python3.7 extract_offline.py rasters_directory dataset.csv destination_directory
```
The destination_directory will be created if it does not already exist. Its content might be overwritten if two files have the same name.
The extraction code has been designed for low memory usage, but can be slower as a result.
Notice that the patches are exported in numpy format. The R library RcppCNPy can read this format.
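From Python, an exported patch can be read back with numpy; the file name below is purely illustrative, as the actual naming scheme is defined by the export script:

```python
import numpy as np

# Illustrative file name: the export script defines the actual naming scheme.
patch = np.load('destination_directory/patch_0.npy')
print(patch.shape)
```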
The help command returns:

```
usage: environmental_raster_glc.py [-h] [--size SIZE] [--normalized NORM]
                                   rasters dataset destination

extract environmental patches to disk

positional arguments:
  rasters            the path to the raster directory
  dataset            the dataset in CSV format
  destination        The directory where the patches will be exported

optional arguments:
  -h, --help         show this help message and exit
  --size SIZE        size of the final patch (default : 64)
  --normalized NORM  true if patch normalized (False by default)
```
Notice that some rasters (`proxi_eau_fast` in particular) require a lot of memory; they can be excluded from the extraction by using the exception variable (in the `extract_offline.py` file).