Primary language: Jupyter Notebook. License: Apache-2.0.

Spark Esri

Project to demonstrate the usage of Apache Spark within a Jupyter notebook within ArcGIS Pro.

Notes

Oct 25, 2022 - Updated to support upcoming Pro 3.1. See SparkGeo2 notebook for integration with Apache Arrow :-)

Apr 12, 2022 - Running PySpark in Pro 2.9 requires the PYSPARK_PYTHON environment variable to be set. It should point to the python.exe executable of your active conda environment, e.g., C:\Users\%USERNAME%\AppData\Local\ESRI\conda\envs\spark_esri\python.exe. Defining CONDA_DEFAULT_ENV is neither sufficient nor necessary.
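
The variable can also be set from inside the notebook before the Spark session is created. The following is a minimal sketch, not code from this repo: the helper names are hypothetical, and it assumes that sys.prefix is the folder of the active conda environment (inside Pro, sys.executable may point at the host application rather than python.exe, which is why it is not used here).

```python
import os
import sys
from pathlib import PureWindowsPath


def python_exe_for_env(env_dir: str) -> str:
    """Return the python.exe path inside a conda environment folder (hypothetical helper)."""
    return str(PureWindowsPath(env_dir) / "python.exe")


def set_pyspark_python(env_dir: str, environ=os.environ) -> str:
    """Point PYSPARK_PYTHON at the environment's python.exe unless it is already set."""
    # setdefault keeps any value the user has already defined.
    return environ.setdefault("PYSPARK_PYTHON", python_exe_for_env(env_dir))


if __name__ == "__main__":
    # Assumption: sys.prefix is the folder of the active conda environment.
    print(set_pyspark_python(sys.prefix))
```

Using setdefault means a value defined at the system level always wins over the guess made in the notebook.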

Dec 16, 2021 - Added check for env var SPARK_HOME to override built-in spark. See instructions below.

Oct 30, 2021 - Pro 2.8 relies on the Windows registry to find the active conda environment. The registry key is HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv. The value of this key is used to set the OS environment variable PYSPARK_PYTHON, which PySpark requires to work correctly in a Pro notebook.

As of this writing, the active conda environment is detected in the following order:

  • look for env var CONDA_DEFAULT_ENV.
  • look for %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt, in case of an older Pro version.
  • look for HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv.
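
The detection order above can be sketched as follows. This is an illustration under stated assumptions, not the module's actual implementation: the function names are hypothetical, PythonCondaEnv is assumed to be a value stored under the ArcGISPro registry key, and the raw string found is returned unchanged because the notes do not say whether each source holds an environment name or a full path.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def read_registry_env() -> Optional[str]:
    """Read PythonCondaEnv from the registry; None off Windows or if it is missing."""
    try:
        import winreg  # Windows-only standard library module
    except ImportError:
        return None
    try:
        # Assumption: PythonCondaEnv is a value under the ArcGISPro key.
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"SOFTWARE\ESRI\ArcGISPro") as key:
            value, _ = winreg.QueryValueEx(key, "PythonCondaEnv")
            return str(value) or None
    except OSError:
        return None


def read_proenv_file(environ: Mapping[str, str]) -> Optional[str]:
    """Read proenv.txt under %LOCALAPPDATA%, as used by older Pro versions."""
    local = environ.get("LOCALAPPDATA")
    if not local:
        return None
    path = Path(local) / "ESRI" / "conda" / "envs" / "proenv.txt"
    try:
        return path.read_text().strip() or None
    except OSError:
        return None


def detect_conda_env(
    environ: Mapping[str, str] = os.environ,
    registry_reader: Callable[[], Optional[str]] = read_registry_env,
) -> Optional[str]:
    """Return the first value found, checking the three sources in the listed order."""
    return (
        environ.get("CONDA_DEFAULT_ENV")
        or read_proenv_file(environ)
        or registry_reader()
    )
```

Passing the environment mapping and the registry reader as parameters keeps the lookup order easy to exercise on machines without Pro installed.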

Oct 27, 2021 - Pro 2.8.3 removed the file %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt and no longer relies on it. It now depends on env var CONDA_DEFAULT_ENV to determine the active conda env.

Sep 16, 2021 - Perform the following as a patch for Pro 2.8.3:

cd c:\
git clone https://github.com/kontext-tech/winutils

Define a system environment variable HADOOP_HOME with the value C:\winutils\hadoop-3.3.0, and add %HADOOP_HOME%\bin to the system PATH variable.
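
To confirm the patch took effect, a small check can be run from a notebook cell. This is a sketch only: check_hadoop_home is a hypothetical helper, and it assumes that winutils.exe sits in the bin folder of the cloned Hadoop distribution.

```python
import os
from pathlib import Path
from typing import List, Mapping


def _norm(path: str) -> str:
    """Normalize a path for comparison (case-insensitive on Windows)."""
    return os.path.normcase(os.path.normpath(path))


def check_hadoop_home(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems with the HADOOP_HOME / PATH setup (empty if none)."""
    hadoop_home = environ.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]
    problems = []
    bin_dir = Path(hadoop_home) / "bin"
    # Assumption: winutils.exe lives directly in %HADOOP_HOME%\bin.
    if not (bin_dir / "winutils.exe").is_file():
        problems.append(f"winutils.exe not found in {bin_dir}")
    path_entries = {_norm(p) for p in environ.get("PATH", "").split(os.pathsep) if p}
    if _norm(str(bin_dir)) not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_hadoop_home():
        print(problem)
```

An empty result means both the variable and the PATH entry are in place. Environment changes made through the Windows system dialog are only visible to processes started afterwards, so restart Pro before checking.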

NOTE: This works in Pro 2.6 ONLY. There is a small "issue" with Pro 2.7 and pyarrow. The folks in Redlands have a fix that will be in 2.8 :-(

Installation

Install Spark (Optional).

If you do not wish to use Pro's built-in Spark, you can download and install Spark 3.x separately. For example, download spark-3.2.1-bin-hadoop3.2.tgz and set the environment variable SPARK_HOME to the folder where you extracted the archive. It's best to avoid spaces in the folder path.
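
The override can be pictured with the short sketch below. It is not the module's actual code: resolve_spark_home and has_spaces are hypothetical helpers, and returning None is simply this sketch's way of saying "fall back to Pro's built-in Spark".

```python
import os
from typing import Mapping, Optional


def resolve_spark_home(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return SPARK_HOME if it is set and non-empty, else None (use the built-in Spark)."""
    spark_home = environ.get("SPARK_HOME", "").strip()
    return spark_home or None


def has_spaces(path: str) -> bool:
    """Flag folder paths containing spaces, which are best avoided for SPARK_HOME."""
    return " " in path


if __name__ == "__main__":
    home = resolve_spark_home()
    if home is None:
        print("SPARK_HOME not set; the built-in Spark would be used")
    elif has_spaces(home):
        print(f"SPARK_HOME contains spaces: {home}")
    else:
        print(f"Using Spark at {home}")
```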

Create a new Pro Conda Environment.

Start a Python Command Prompt:

Note: You might need to add proxy settings to .condarc located in C:\Program Files\ArcGIS\Pro\bin\Python.

conda config --set proxy_servers.http http://username:password@host:port
conda config --set proxy_servers.https https://username:password@host:port

These commands produce a .condarc similar to the following:

ssl_verify: true
proxy_servers:
  http: http://domainname\username:password@host:port
  https: http://domainname\username:password@host:port

Create a new conda environment:

proswap arcgispro-py3
conda remove --yes --all --name spark_esri
conda create --yes --name spark_esri --clone arcgispro-py3
proswap spark_esri

Optional:

pip install fsspec==2021.8.1 boto3==1.18.35 s3fs==0.4.2 pyarrow==1.0.1

conda install --yes -c esri -c conda-forge -c defaults^
    "numba=0.53.*"^
    "pandas=1.2.*"^
    "pyodbc=4.0.*"^
    "gcsfs=0.7.*"

Install the Esri Spark module.

Note: You might need to install Git for Windows.

git clone https://github.com/mraad/spark-esri.git
cd spark-esri
python setup.py install

MicroPathing Notebook

Note the use of the range slider on the map to filter the micropaths by a user-defined hour-of-day range.

The following are the resulting crossing-point and gate statistics.

TODO

  • Unify spark_esri and spark_dbconnect python modules.

References