facebookresearch/ReAgent

Installation Issues

tfurmston opened this issue · 4 comments

Hi,

I am trying to do a local installation of the project so that I can play around with it, but am having some issues with the installation.

There is a fair amount going on with the install, so I have decided to do it through Docker. I couldn't find the image referenced in the documentation, so I am writing my own. Here is what I have so far:

FROM python:3.7-slim-stretch

ENV PROJECT_LOCATION /srv/reagent
RUN mkdir -p $PROJECT_LOCATION
WORKDIR $PROJECT_LOCATION

RUN apt-get update -qq \
  && apt-get install --no-install-recommends -y \
    build-essential \
    openssh-client \
    git \
    software-properties-common \
    libblas-dev \
    libffi-dev \
    liblapack-dev \
    libopenblas-base \
    libsasl2-dev \
    libssl-dev \
    libsasl2-modules \
    python3-dev \
    libpq-dev \
    ffmpeg \
    libsm6 \
    libxext6 \
    curl \
    unzip \
    zip \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/facebookresearch/ReAgent.git $PROJECT_LOCATION
RUN python -m pip install ".[gym]"
RUN python -m pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

RUN curl -s "https://get.sdkman.io" | bash
SHELL ["/bin/bash", "-c", "source $HOME/.sdkman/bin/sdkman-init.sh"]
RUN sdk version
RUN sdk install java 8.0.272.hs-adpt
RUN sdk install scala
RUN sdk install maven
RUN sdk install spark 2.4.6
RUN apt-get update
RUN apt-get install bc

This builds successfully. (I took some of the configuration from the CI, as the documentation seemed a bit out of date.)

However, when I try to run through the offline RL training (batch) introduction, I run into some issues.

In particular, when I get to the line:

./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.timeline_operator $CONFIG

I get the following error:

Building with config: 
{'spark.app.name': 'ReAgent',
 'spark.driver.extraClassPath': '/usr/local/lib/python3.7/site-packages/reagent/../preprocessing/target/rl-preprocessing-1.1.jar',
 'spark.driver.host': '127.0.0.1',
 'spark.master': 'local[*]',
 'spark.sql.catalogImplementation': 'hive',
 'spark.sql.execution.arrow.enabled': 'true',
 'spark.sql.session.timeZone': 'UTC',
 'spark.sql.shuffle.partitions': '12',
 'spark.sql.warehouse.dir': '/srv/reagent/spark-warehouse'}
JAVA_HOME is not set
Traceback (most recent call last):
  File "./reagent/workflow/cli.py", line 89, in <module>
    reagent()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "./reagent/workflow/cli.py", line 77, in run
    func(**config.asdict())
  File "/usr/local/lib/python3.7/site-packages/reagent/workflow/gym_batch_rl.py", line 75, in timeline_operator
    spark = get_spark_session()
  File "/usr/local/lib/python3.7/site-packages/reagent/workflow/spark_utils.py", line 62, in get_spark_session
    spark = spark.getOrCreate()
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/usr/local/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/usr/local/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/usr/local/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/usr/local/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/usr/local/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
> /usr/local/lib/python3.7/site-packages/pyspark/java_gateway.py(108)_launch_gateway()
-> raise Exception("Java gateway process exited before sending its port number")
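
One observation on the Dockerfile above: each RUN instruction executes in a fresh shell, so sourcing `sdkman-init.sh` in one step (or via the SHELL directive) does not carry environment variables such as JAVA_HOME into later steps, or into the container at runtime. A minimal sketch of pinning it with ENV instead — this assumes sdkman's default install location under /root and its `current` symlink, which may differ in your image:

```dockerfile
# Sketch, assuming sdkman's default layout; the symlink target depends
# on which JDK `sdk install java` actually picked.
ENV SDKMAN_DIR=/root/.sdkman
ENV JAVA_HOME=$SDKMAN_DIR/candidates/java/current
ENV PATH=$JAVA_HOME/bin:$PATH
```

With JAVA_HOME baked into the image this way, every later RUN step and the final container see the same JVM location.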

Am I missing something in my install? Any help would be much appreciated.

Also, generally, I think it would be helpful if you provided a Dockerfile for people. Happy to contribute mine once I have finished fixing it, if that helps.

I am having the same problem. Also, the CI seems to be failing on the master branch.
Is there any update on adding a Dockerfile? I agree it would be very helpful, and including it in the CI would also keep it up to date.

I was also looking for a Docker install option, but couldn't find the image anywhere in the documentation.

We don't use Docker anymore, since the installation is all done with pip. But you can use a stock Ubuntu image and pip install it in there.

Sorry, maybe I am missing something, but aren't there also non-Python dependencies? For example, I thought part of the project uses Spark.

From the error message above, my impression was that the failure comes from the Spark pipeline that pre-processes the data. Did I misunderstand something?
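
For what it's worth, the `JAVA_HOME is not set` line in the log comes from Spark's launcher, which gives up when neither JAVA_HOME nor a `java` binary on PATH resolves. A small sanity check along the same lines — a sketch, not Spark's exact lookup, and `find_java_home` is a hypothetical helper name:

```python
import os
import shutil


def find_java_home():
    """Approximate Spark's JVM lookup: prefer a valid JAVA_HOME,
    then fall back to a `java` executable found on PATH."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home and os.path.exists(os.path.join(java_home, "bin", "java")):
        return java_home
    java = shutil.which("java")
    if java:
        # .../some/jvm/bin/java -> strip the trailing bin/java
        return os.path.dirname(os.path.dirname(java))
    return None


if __name__ == "__main__":
    home = find_java_home()
    print("JVM found at:", home or "nothing - PySpark cannot launch its gateway")
```

Running this inside the container before invoking `cli.py` makes it easy to tell whether the image problem is a missing JDK or just an unset JAVA_HOME.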