floydhub/floyd-cli

Cannot mount data to a specific directory nor add multiple datasets

jmsmkn opened this issue · 2 comments

Three problems here:

  1. You cannot specify the data mount directory with the CLI
  2. Data is mounted at /input and not /<ID>
  3. As a result, you cannot mount more than one dataset

The docs show how to mount datasets to specific directories, but doing this with the CLI results in an error. Take, for example, the public Kaggle cats and dogs dataset with ID SyccinddLDdS7p3vzcwGQ2.

Command:
$ floyd run --data SyccinddLDdS7p3vzcwGQ2:dataset

Expected output:
The experiment launches with the data mounted at /dataset

Actual output:

$ floyd run --data SyccinddLDdS7p3vzcwGQ2:dataset
New version of CLI (0.9.7) is now available. To upgrade run:
    pip install -U floyd-cli
            
Creating project run. Total upload size: 9.9MiB
Syncing code ...
Error: One or more request parameter is incorrect.

If I take away the dataset key it works:

$ floyd run --mode jupyter --data SyccinddLDdS7p3vzcwGQ2

New version of CLI (0.9.7) is now available. To upgrade run:
    pip install -U floyd-cli
            
Creating project run. Total upload size: 9.9MiB
Syncing code ...
RUN ID                  NAME                  VERSION
----------------------  ------------------  ---------
[Truncated]

However, the dataset is not mounted at /<ID> as stated in the docs, but at /input. On the instance, the data is there:

# ls /
bin   code  etc   input  lib64  mnt  output  root  run_jupyter.sh  srv  tmp  var
boot  dev   home  lib    media  opt  proc    run   sbin            sys  usr
# ls /input
test  train

Now, if I add a second dataset, say the public MNIST dataset with ID Gbya2j64ApqjSHt3vDpdSh, it gets mounted at /input and takes precedence over the first dataset (so I never actually have access to the SyccinddLDdS7p3vzcwGQ2 data):

$ floyd run --mode jupyter --data SyccinddLDdS7p3vzcwGQ2 --data Gbya2j64ApqjSHt3vDpdSh

Then on the instance:

# ls /
bin   code  etc   input  lib64  mnt  output  root  run_jupyter.sh  srv  tmp  var
boot  dev   home  lib    media  opt  proc    run   sbin            sys  usr
# ls /input
t10k-images-idx3-ubyte.gz  t10k-labels-idx1-ubyte.gz  train-images-idx3-ubyte.gz  train-labels-idx1-ubyte.gz

Note that test and train from the Kaggle dataset are missing.

System info: floyd-cli 0.9.9, Python 3.5, see #51

houqp commented

Hi, sorry that the documentation is a little confusing; we are working on updating it. There are two things to know:

  1. If you specify only one data input, it will always be mounted at "/input".
  2. If you specify multiple data inputs, you need to tag each of them, for example: --data SyccinddLDdS7p3vzcwGQ2:foo --data Gbya2j64ApqjSHt3vDpdSh:bar. You should then be able to access them in the "/foo" and "/bar" directories.
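
As a rough illustration of the two rules above, here is a minimal Python sketch. The function resolve_mounts is a hypothetical helper written only for this explanation. It is not part of floyd-cli, and it models nothing beyond the behavior described in this comment (in particular, it assumes a lone data input lands at /input even if it carries a tag).

```python
def resolve_mounts(data_args):
    """Map each --data argument to a mount path.

    Hypothetical model of the rules described in this thread,
    NOT the actual floyd-cli or server implementation.
    """
    specs = []
    for arg in data_args:
        # "ID:tag" splits into ID and tag; a bare "ID" has no tag.
        data_id, sep, tag = arg.partition(":")
        specs.append((data_id, tag if sep else None))

    if len(specs) == 1:
        # Rule 1: a single data input is always mounted at /input.
        return {specs[0][0]: "/input"}

    mounts = {}
    for data_id, tag in specs:
        if not tag:
            # Rule 2: with multiple inputs, every one needs a tag.
            raise ValueError(
                "dataset {} needs a tag when mounting multiple datasets".format(data_id)
            )
        mounts[data_id] = "/" + tag
    return mounts


if __name__ == "__main__":
    print(resolve_mounts(["SyccinddLDdS7p3vzcwGQ2"]))
    print(resolve_mounts(["SyccinddLDdS7p3vzcwGQ2:foo", "Gbya2j64ApqjSHt3vDpdSh:bar"]))
```

Under this model, passing two untagged IDs raises an error rather than letting the second dataset silently replace the first at /input, which is the situation reported above.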

We will be releasing new documentation this week to clarify this.

I've seen that single datasets can have mount points again too. Thank you for fixing it so quickly!