ESA-PhiLab/iris

LightGBM consistently segfaults

BradNeuberg opened this issue · 5 comments

I've attempted to get the "AI Scoring" to work, but it consistently segfaults on me. I manually annotate a few heavy cloud and shadow pixels, then hit "A". I get the following segfault:

/Users/bradneuberg/src/iris/iris_env/lib/python3.9/site-packages/lightgbm/sklearn.py:598: UserWarning: 'silent' argument is deprecated and will be removed in a future release of LightGBM. Pass 'verbose' parameter via keyword arguments instead.
  _log_warning("'silent' argument is deprecated and will be removed in a future release of LightGBM. "
/Users/bradneuberg/src/iris/iris_env/lib/python3.9/site-packages/lightgbm/sklearn.py:726: UserWarning: 'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.
  _log_warning("'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. "
/Users/bradneuberg/src/iris/iris_env/lib/python3.9/site-packages/lightgbm/sklearn.py:736: UserWarning: 'verbose' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
  _log_warning("'verbose' argument is deprecated and will be removed in a future release of LightGBM. "
Segmentation fault: 11

It segfaults at this line:

gbm.fit(
    inputs[train_indices, :], train_labels,
    eval_set=[(inputs[val_indices, :], val_labels)],
    early_stopping_rounds=4, verbose=0
)
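The deprecation warnings in the log above are probably unrelated to the crash, but they do point to the callback-based form of this call. Here is a minimal sketch of that form, assuming LightGBM 3.3 or later, where `lgb.early_stopping` and `lgb.log_evaluation` exist. The helper name `fit_with_callbacks` and its argument names are hypothetical and not part of iris:

```python
def fit_with_callbacks(gbm, X_train, y_train, X_val, y_val):
    """Fit `gbm` with callbacks instead of the deprecated keyword arguments.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Imported lazily so the helper can be defined before LightGBM is loaded.
    import lightgbm as lgb

    callbacks = [
        # Replaces early_stopping_rounds=4; verbose=False silences its messages.
        lgb.early_stopping(stopping_rounds=4, verbose=False),
        # Replaces verbose=0; a period of 0 disables per-iteration logging.
        lgb.log_evaluation(period=0),
    ]
    return gbm.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        callbacks=callbacks,
    )
```

With the arrays from this issue, the call would look like `fit_with_callbacks(gbm, inputs[train_indices, :], train_labels, inputs[val_indices, :], val_labels)`. The `silent=True` constructor argument triggers a similar warning, which suggests passing `verbose=-1` to `LGBMClassifier` instead.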

I've tried upping the version of LightGBM and supporting libraries as follows, but it does not fix things:

-numpy==1.22.0
+numpy==1.23.3
 pyyaml==5.4.1
-lightgbm==3.3.0
+lightgbm==3.3.2
 rasterio==1.2.10
 requests==2.26.0
-scipy==1.7.1
+scipy==1.9.1

I've also tried removing the Sentinel-1 functionality from the demo config file, as I thought that leaving Sentinel-1 pixels unlabeled might be causing invalid inputs to be passed to LightGBM. This did not fix things.

I'm running Iris in a virtualenv environment using Python 3.9.2 on Mac OS X 12.6.

I've serialized the training and validation data to disk in numpy npz format:

train_indices, val_indices, train_labels, val_labels = train_test_split(
    user_indices, user_labels, stratify=user_labels,
    test_size=0.3, random_state=42
)

np.save("X", inputs[train_indices, :])
np.save("X_labels", train_labels)

np.save("y", inputs[val_indices, :])
np.save("y_labels", val_labels)

np.save("inputs", inputs)

gbm = lgb.LGBMClassifier(
    num_leaves=config['ai_model']['n_leaves'],
    max_bin=128,
    max_depth=config['ai_model']['max_depth'],
    # min_data_in_leaf=1000,
    # bagging_fraction=0.2,
    # boosting_type='dart',
    tree_learner='data',
    learning_rate=0.05,
    n_estimators=config['ai_model']['n_estimators'],
    silent=True,
    # n_jobs=10,
)

I then created a reduced Jupyter notebook that loads exactly the same data and uses the same config and calls, and it does not crash. So something else imported into the Python environment must be causing this issue (perhaps thread issues?).
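One way to narrow down where the crash happens inside the running app is Python's standard `faulthandler` module, which dumps the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. This is a sketch, assuming it is called once near the top of the app's entry point. The function name and the log path are hypothetical:

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Dump Python tracebacks for all threads when a fatal signal arrives.

    Hypothetical helper. Writes to `log_path` if given, otherwise to stderr.
    Returns the stream so the caller can keep it open for the process lifetime.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream
```

The same effect is available without code changes by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER=1` environment variable.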

Here's a Google Drive folder where I've put the *.npz files and my Jupyter notebook, named gbm.ipynb:
https://drive.google.com/drive/folders/1cv1n-kCGZzPlVdQq59cgTEafcU4YvJP7?usp=sharing

I tried earlier versions of Python to see if that was causing the issue; Python 3.6 is impossible due to version skew between imageio and numpy. I was able to install Python 3.7 with earlier versions of numpy and everything else installed, and lightgbm still segfaulted. I suspect that there is a threading issue between Flask and LightGBM on Mac OS X for some reason. I tried to update to the latest version of Flask to see if that resolved things but the Flask API has changed and is now incompatible.
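To check the suspicion that the crash depends on what else is loaded in the server process, the training step can be run as a separate Python process and its exit status inspected. The sketch below uses only the standard library. `describe_exit` and `run_script_isolated` are hypothetical names, and nothing here is taken from the iris code base:

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a child's return code into a short human-readable description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exit code {returncode}"


def run_script_isolated(argv, timeout=None):
    """Run `python <argv...>` in a fresh interpreter and describe how it ended."""
    completed = subprocess.run([sys.executable, *argv], timeout=timeout)
    return describe_exit(completed.returncode)
```

For example, `run_script_isolated(["train_gbm.py"])`, where `train_gbm.py` is a placeholder for a script that loads the saved arrays and calls `gbm.fit`, could be run once as-is and once with `import flask` added at the top of the script. A segfault in only one of the two variants would help confirm or rule out the interaction.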

As a next step, I'll spin up a Google Cloud server running Ubuntu, serve Iris from there, and see whether it works in that environment.

BTW, when you start up iris it reports that it's running on a debug, non-production server. Are there any recommendations for deploying iris in a production manner for access by multiple users?

I've been meaning to do a bit more testing on the Mac OS X side, as I don't use it regularly. I'll do some digging too, but let me know how you get on and whether you find a workable solution. For my own work with IRIS, I always use Python 3.9 in a conda environment on Linux or (sometimes) Windows. This tends to work consistently for me.

In the meantime, I've created a PR to add a WSGI production server using the gevent package. Take a look and see what you think. I'm not an expert on deployment/web development, so I'm happy to hear your opinion on whether what I've done is useful/sensible: #22
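For readers unfamiliar with the approach, serving a WSGI application with gevent generally looks like the sketch below. It illustrates the general pattern and is not the contents of PR #22. The `app` callable, host, and port are placeholders, and a Flask application object could be passed in the same position, since Flask apps are WSGI callables:

```python
def app(environ, start_response):
    """Placeholder WSGI application that answers every request with 'ok'."""
    body = b"ok\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(wsgi_app, host="127.0.0.1", port=5000):
    """Serve `wsgi_app` with gevent's WSGI server until interrupted."""
    # Imported lazily: this requires the third-party gevent package.
    from gevent.pywsgi import WSGIServer

    server = WSGIServer((host, port), wsgi_app)
    server.serve_forever()
```

Calling `serve(app)` blocks and handles requests in gevent greenlets rather than with Flask's built-in development server, which is what prints the debug-server warning.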

Thanks!

Any updates on this? Regarding the production server, I've merged PR #22