GeoLifeCLEF 2019

Automatically predicting the list of species that are the most likely to be observed at a given location is useful for many scenarios in biodiversity informatics. First of all, it could improve species identification processes and tools by reducing the list of candidate species that are observable at a given location (be they automated, semi-automated or based on classical field guides or flora). More generally, it could facilitate biodiversity inventories through the development of location-based recommendation services (typically on mobile phones) as well as the involvement of non-expert nature observers. Last but not least, it might serve educational purposes thanks to biodiversity discovery applications providing functionalities such as contextualized educational pathways.

The rest of this document presents (1) the data and (2) the Python code.

1. Data

The data are composed of two parts: the environmental rasters and the actual dataset containing all the occurrences. All the data can be downloaded from the CrowdAI page. This section describes both parts; see the protocol note for more details.

Environmental Rasters

The rasters are available directly on CrowdAI. The following variables are available:

| Name | Description | Nature | Values |
|------|-------------|--------|--------|
| CHBIO_1 | Annual mean temp. (mean of monthly) | quanti. | [-10.7, 18.4] |
| CHBIO_2 | Max temp - min temp | quanti. | [7.8, 21.0] |
| CHBIO_3 | Isothermality (100 * 2/7) | quanti. | [41.1, 60.0] |
| CHBIO_4 | Temp. seasonality (std. dev. * 100) | quanti. | [302.7, 777.8] |
| CHBIO_5 | Max temp. of warmest month | quanti. | [6.1, 36.6] |
| CHBIO_6 | Min temp. of coldest month | quanti. | [-28.3, 5.4] |
| CHBIO_7 | Temp. annual range | quanti. | [16.7, 42.0] |
| CHBIO_8 | Mean temp. of wettest quarter | quanti. | [-14.2, 23.0] |
| CHBIO_9 | Mean temp. of driest quarter | quanti. | [-17.7, 26.5] |
| CHBIO_10 | Mean temp. of warmest quarter | quanti. | [-2.8, 26.5] |
| CHBIO_11 | Mean temp. of coldest quarter | quanti. | [-17.7, 11.8] |
| CHBIO_12 | Annual precipitation | quanti. | [318.3, 2543.3] |
| CHBIO_13 | Precipitation of wettest month | quanti. | [43.0, 285.5] |
| CHBIO_14 | Precipitation of driest month | quanti. | [3.0, 135.6] |
| CHBIO_15 | Precipitation seasonality (coef. of var.) | quanti. | [8.2, 26.5] |
| CHBIO_16 | Precipitation of wettest quarter | quanti. | [121.6, 855.6] |
| CHBIO_17 | Precipitation of driest quarter | quanti. | [19.8, 421.3] |
| CHBIO_18 | Precipitation of warmest quarter | quanti. | [198, 851.7] |
| CHBIO_19 | Precipitation of coldest quarter | quanti. | [60.5, 520.4] |
| etp | Potential evapotranspiration | quanti. | [133, 1176] |
| alti | Elevation | quanti. | [-188, 4672] |
| awc_top | Topsoil available water capacity | ordinal | {0, 120, 165, 210} |
| bs_top | Base saturation of the topsoil | ordinal | {35, 62, 85} |
| cec_top | Topsoil cation exchange capacity | ordinal | {7, 22, 50} |
| crusting | Soil crusting class | ordinal | [0, 5] |
| dgh | Depth to a gleyed horizon | ordinal | {20, 60, 140} |
| dimp | Depth to an impermeable layer | ordinal | {60, 100} |
| erodi | Soil erodibility class | ordinal | [0, 5] |
| oc_top | Topsoil organic carbon content | ordinal | {1, 2, 4, 8} |
| pd_top | Topsoil packing density | ordinal | {1, 2} |
| text | Dominant surface textural class | ordinal | [0, 5] |
| proxi_eau_fast | < 50 meters to fresh water | boolean | {0, 1} |
| clc | Ground occupation | categorical | [1, 48] |

More details about each raster are available within the archive.

Dataset of Occurrences

The dataset is composed of multiple files:

  • PL_complete.csv
  • PL_trusted.csv
  • noPlant.csv
  • GLC_2018.csv

More details about the dataset are given in the protocol note. The dataset columns include:

| Name | Description | Data source |
|------|-------------|-------------|
| Longitude | Decimal longitude in the WGS84 coordinate system. | All |
| Latitude | Decimal latitude in the WGS84 coordinate system. | All |
| glc19SpId | The GLC19 reference identifier for the species name. | All |
| scName | The original data source taxon name of the occurrence. | All |
| coordinateuncertaintyinmeters | Location uncertainty. | GBIF |
| accuracy | Coordinate uncertainty in meters, mostly computed by smartphone devices. | PL |
| date | Date of the observation. | PL |
| eventDate | Date of the observation. | GBIF |
| X_key | A key for the observation. | PL |
| session | Pl@ntNet session ID. | PL |
| project | The Pl@ntNet taxonomic referential to which the original taxon name belongs. | PL |
| FirstResPLv2Score | The confidence score of the automatically identified species. | PL |

Notice that the most important fields are Latitude and Longitude, used to extract the environmental patch, and glc19SpId, which contains the species ID.
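For instance, these fields can be pulled out with pandas; the sample rows and the semicolon separator below are assumptions for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the occurrence CSV columns
# (the real files are PL_complete.csv, PL_trusted.csv, etc.;
# the separator is assumed to be a semicolon here).
sample_csv = io.StringIO(
    "Longitude;Latitude;glc19SpId;scName\n"
    "3.88;43.61;1234;Quercus ilex L.\n"
    "2.35;48.85;5678;Taraxacum officinale\n"
)
df = pd.read_csv(sample_csv, sep=";")

# The fields needed for patch extraction and supervision
positions = df[["Latitude", "Longitude"]].values  # one (lat, lng) row per occurrence
labels = df["glc19SpId"].values                   # species IDs
```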

2. Python3

The file environmental_raster_glc.py provides participants of the GLC19 challenge with a means to extract environmental patches or vectors from the provided rasters. Given a set of input rasters, it enables either the online (in-memory) extraction of environmental patches at a given spatial position, or the offline construction (on disk) of all the patches for a set of spatial positions.

The following examples are for Python 3, but the code should work with Python 2.

The environmental_raster_glc.py script supports two usage modes:

  • in code use,
  • command line use.

In-code use enables extracting environmental tensors on the fly, for instance within a PyTorch dataset, thus reducing I/O and improving training performance.
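A rough sketch of this pattern is given below. The stub extractor, the raster depth of 33, and the dataset class are hypothetical; a real setup would use PatchExtractor and, for instance, a torch.utils.data.Dataset:

```python
import numpy as np

class StubExtractor:
    """Hypothetical stand-in for PatchExtractor; always returns a zero patch."""
    def __getitem__(self, position):
        lat, lng = position
        return np.zeros((33, 64, 64), dtype=np.float32)  # n x 64 x 64 tensor

class OccurrenceDataset:
    """Minimal map-style dataset: patches are extracted lazily, on access,
    so nothing is written to disk and I/O stays low during training."""
    def __init__(self, positions, labels, extractor):
        self.positions = positions
        self.labels = labels
        self.extractor = extractor

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        lat, lng = self.positions[idx]
        return self.extractor[lat, lng], self.labels[idx]

dataset = OccurrenceDataset([(43.61, 3.88)], [1234], StubExtractor())
patch, label = dataset[0]  # patch extracted on the fly
```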

Command-line use enables exporting the dataset to disk.

In code

In addition to the standard libraries, this code requires the following packages:

import rasterio
import pandas
import numpy
import matplotlib

Constructing the Extractor

The core object to manipulate is the PatchExtractor, which manages the multiple available rasters. Constructing the extractor requires only setting the root_path of the raster data:

# constructing the extractor
extractor = PatchExtractor(root_path='/home/test/rasters')

By default, the extractor returns n×64×64 patches (where n depends on the rasters). For a custom size (other than 64), the constructor also accepts an additional size parameter:

# constructing the extractor
extractor = PatchExtractor(root_path='/home/test/rasters', size=256)

Note that if size is too large, some patches from the dataset will be smaller due to an overflow at the raster boundaries.
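This clipping behaviour can be illustrated with plain NumPy (extract_patch below is a simplified, hypothetical helper, not the actual extractor code):

```python
import numpy as np

def extract_patch(raster, row, col, size):
    # Naive centered crop: near the border, the slice is clipped by NumPy,
    # so the returned patch comes out smaller than size x size.
    half = size // 2
    top, left = max(0, row - half), max(0, col - half)
    return raster[top:row + half, left:col + half]

raster = np.arange(100 * 100).reshape(100, 100)
inner = extract_patch(raster, 50, 50, 64)   # fully inside the raster
border = extract_patch(raster, 10, 50, 64)  # overflows the top edge
```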

If size equals 1, the extractor returns an environmental vector instead of an environmental tensor.

Once the extractor is available, the rasters can be added. Two strategies are possible: either adding all the rasters at once, or adding them one by one, which allows specifying per-raster transformations and skipping some rasters.

# adding a single raster
extractor.append('clc', nan=0, normalized=True, transform=some_user_defined_function)

or

# adding all the rasters at root_path
extractor.add_all(nan=0, normalized=True, transform=some_user_defined_function)

In addition, some rasters are preferably used through a one-hot encoding representation, thus increasing the depth of the environmental tensor. The global parameter raster_metadata enables setting some of these properties on a per-raster basis.
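The effect of one-hot encoding on the tensor depth can be sketched as follows (one_hot_patch is an illustrative helper, not part of environmental_raster_glc.py):

```python
import numpy as np

def one_hot_patch(patch, n_classes):
    # A (H, W) categorical patch becomes an (n_classes, H, W) binary tensor,
    # contributing n_classes channels instead of one to the environmental tensor.
    return np.stack([(patch == c).astype(np.float32) for c in range(n_classes)])

patch = np.array([[0, 1],
                  [2, 1]])          # tiny hypothetical categorical patch
encoded = one_hot_patch(patch, 3)   # one binary channel per class
```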

If parameters are not set, default values are used. For instance, nan has a default value on a per-raster basis. To change it, either modify the metadata or set the parameter explicitly.

Please check the environmental_raster_glc.py file for more details.

Using the Extractor

The extractor acts as an array. For instance, len(extractor) gives the number of available rasters. Accessing a specific vector or tensor is done in the following way, by giving latitude and longitude:

env_tensor = extractor[43.61, 3.88]
# env_tensor is a numpy array

Note that the shape of env_tensor does not necessarily correspond to len(extractor), as variables using a one-hot encoding representation actually correspond to a deeper representation.
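The resulting depth can be reasoned about as follows. The 48 land-cover classes of clc come from the raster table above; the choice of rasters and treating only clc as one-hot are assumptions for illustration:

```python
# Channels contributed by each raster: 1 for plain variables,
# one per class for one-hot encoded ones (the selection is illustrative).
channels = {"alti": 1, "etp": 1, "clc": 48}

n_rasters = len(channels)        # what len(extractor) would report
depth = sum(channels.values())   # actual first dimension of the tensor
```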

The extractor can also plot a specific patch:

extractor.plot((43.61, 3.88))
# accept an optional style parameter to modify the style temporarily

This results in a figure of the following type: [figure: the extracted patches, one image per variable]

The plot method accepts a cancel_one_hot parameter, which is True by default: a variable set to use a one-hot encoding is then represented as a single patch. In the previous image, clc is set to have a one-hot encoding representation but is plotted as a single patch.

Command line use

Online extraction of patches is fast but requires a significant amount of memory to store all the rasters. For those who would rather export the patches to disk, an additional functionality is provided.

The patches corresponding to a CSV dataset can be extracted using the following command:

python3.7 extract_offline.py rasters_directory dataset.csv destination_directory

The destination_directory will be created if it does not exist yet. Note that its content might be overwritten if two files have the same name.

The extractor code has been designed for low memory usage, which may make it slower.

Notice that the patches are exported in NumPy format. In R, the RcppCNPy library can read this format.
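A minimal round trip with NumPy looks like the following; the file name scheme is illustrative (in practice the files are written by the export command):

```python
import tempfile
from pathlib import Path
import numpy as np

# A hypothetical exported patch: 33 environmental channels of 64 x 64
patch = np.random.rand(33, 64, 64).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "1234.npy"  # e.g. named after the occurrence key
    np.save(path, patch)           # what the offline export produces
    loaded = np.load(path)         # reading a patch back for training
```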

The help command returns:

usage: environmental_raster_glc.py [-h] [--size SIZE] [--normalized NORM]
                                   rasters dataset destination

extract environmental patches to disk

positional arguments:
  rasters            the path to the raster directory
  dataset            the dataset in CSV format
  destination        The directory where the patches will be exported

optional arguments:
  -h, --help         show this help message and exit
  --size SIZE        size of the final patch (default : 64)
  --normalized NORM  true if patch normalized (False by default)

Notice that some rasters (proxi_eau_fast in particular) require a lot of memory; they can be excluded from the extraction by using the exception variable (in the extract_offline.py file).