dnload_raw operation requires the ability to `sudo -u hdfs`
The eggo CLI tool for downloading a dataset on a CDH cluster uses Hadoop Streaming essentially as a job scheduler to download files into HDFS.
The CLI is here: line 33 in c0e980f.
And the implementation is here: lines 38 to 84 in c0e980f.
The mapper script is here:
https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/resources/download_mapper.py
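For readers without the repo handy, the pattern the mapper follows is roughly this. It's a sketch rather than the actual `download_mapper.py`; the `STAGING_DIR` environment variable (passed via `-cmdenv`) and the curl flags are assumptions made for the example.

```python
#!/usr/bin/env python
"""Illustrative Hadoop Streaming mapper: one download per input record."""
import os
import subprocess
import sys
import tempfile

# Hypothetical: the driver passes the HDFS staging directory to the tasks,
# e.g. via the streaming -cmdenv option.
STAGING_DIR = os.environ.get('STAGING_DIR', '/tmp/eggo_staging')


def main():
    # With TextInputFormat, streaming feeds the mapper one input line
    # (here, one URL) per line on stdin.
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        local_path = os.path.join(tempfile.mkdtemp(), os.path.basename(url))
        # Download into the task's local scratch space...
        subprocess.check_call(['curl', '-L', '-s', '-o', local_path, url])
        # ...then push it into the shared HDFS staging directory. The task
        # runs as user "yarn", so the file ends up owned by yarn -- which is
        # the root of this issue.
        subprocess.check_call(['hadoop', 'fs', '-put', '-f', local_path, STAGING_DIR])
        os.remove(local_path)
        # Emit a status record so the (map-only) job has some output.
        sys.stdout.write('%s\tdone\n' % url)


if __name__ == '__main__':
    main()
```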
The user creates a temporary HDFS directory to receive the data (line 40 in c0e980f).
The MR job then downloads the data into that directory using curl, but the map tasks run as user "yarn".
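Schematically, the driver side of that flow amounts to something like the following. This is a sketch, not the eggo code itself: the streaming jar path, directory names, and URL-list location are all placeholders.

```python
import subprocess

# Placeholders for the example; the real code derives these from the CLI args.
STREAMING_JAR = '/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar'
URL_LIST = '/user/ec2-user/datasets/urls.txt'   # one URL per line, already in HDFS
STAGING_DIR = '/tmp/eggo_staging'               # receives the downloaded files
JOB_OUTPUT = '/tmp/eggo_job_output'             # the streaming job's own (status) output

# 1. Create the temporary HDFS directory that will receive the data.
subprocess.check_call(['hadoop', 'fs', '-mkdir', '-p', STAGING_DIR])

# 2. Run a map-only streaming job; each map task downloads its share of URLs.
#    Under the DefaultContainerExecutor the tasks execute as user "yarn", so
#    everything they -put into STAGING_DIR is owned by yarn.
subprocess.check_call([
    'hadoop', 'jar', STREAMING_JAR,
    '-D', 'mapreduce.job.reduces=0',
    '-files', 'download_mapper.py',
    '-cmdenv', 'STAGING_DIR=%s' % STAGING_DIR,
    '-input', URL_LIST,
    '-output', JOB_OUTPUT,
    '-mapper', 'python download_mapper.py',
])
```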
After the dataset is downloaded, we create the final output directory (line 78 in c0e980f).
Ideally we'd just move all the data there. However, all of it is owned by user "yarn", which causes lots of permissions problems downstream. Instead, we chown all the data here: lines 79 to 81 in c0e980f. (Note: this chowns to user ec2-user, but it's easy to change to whatever the current user is.) That chown requires the `sudo -u hdfs` capability.
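Concretely, that final step boils down to something like this (placeholder paths; HDFS only allows the superuser to change file ownership, which is where the `sudo -u hdfs` requirement comes from):

```python
import getpass
import subprocess

STAGING_DIR = '/tmp/eggo_staging'                     # populated by the yarn-owned map tasks
FINAL_DIR = '/user/ec2-user/datasets/mydataset/raw'   # placeholder final location
owner = getpass.getuser()                             # e.g. ec2-user

# Create the final output directory and move the staged files into it.
subprocess.check_call(['hadoop', 'fs', '-mkdir', '-p', FINAL_DIR])
subprocess.check_call(['hadoop', 'fs', '-mv', STAGING_DIR + '/*', FINAL_DIR])

# The moved files are still owned by "yarn". Only the HDFS superuser can
# change ownership, hence the need to run the chown as the hdfs user.
subprocess.check_call(
    ['sudo', '-u', 'hdfs', 'hadoop', 'fs', '-chown', '-R', owner, FINAL_DIR])
```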
Any way around this? cc @tomwhite
There are a couple of options:
- Set `dfs.permissions.enabled` to false, so that permission checking is disabled.
- Enable the `LinuxContainerExecutor` so that containers run as the user that submitted the job.
The second is preferable from a security point of view. See the following for more info:
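For concreteness, the settings involved look roughly like this (standard Hadoop property names; the group value is only an example, and the LinuxContainerExecutor additionally needs a properly configured `container-executor.cfg` on each NodeManager):

```xml
<!-- Option 1: hdfs-site.xml -- disable HDFS permission checking entirely -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>

<!-- Option 2: yarn-site.xml -- run containers as the submitting user -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value> <!-- example value -->
</property>
```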
So anyone who wants to use the download tool on their cluster has to set up the LinuxContainerExecutor? Do you think that makes for an overly restrictive environment? Should I just not be using MapReduce to download the datasets?