bigdatagenomics/eggo

dnload_raw operation requires the ability to `sudo -u hdfs`

The eggo CLI tool for downloading a dataset onto a CDH cluster uses Hadoop Streaming essentially as a job scheduler to download files into HDFS.

The CLI entry point is here:

```python
def dnload_raw(input, output):
```

And the implementation is here (eggo/eggo/operations.py, lines 38 to 84 at c0e980f):

```python
def download_dataset_with_hadoop(datapackage, hdfs_path):
    with make_local_tmp() as tmp_local_dir:
        with make_hdfs_tmp(permissions='777') as tmp_hdfs_dir:
            # NOTE: 777 used so user yarn can write to this dir
            # create input file for MR job that downloads the files and puts
            # them in HDFS
            local_resource_file = pjoin(tmp_local_dir, 'resource_file.txt')
            with open(local_resource_file, 'w') as op:
                for resource in datapackage['resources']:
                    op.write('{0}\n'.format(json.dumps(resource)))
            check_call('hadoop fs -put {0} {1}'.format(local_resource_file,
                                                       tmp_hdfs_dir),
                       shell=True)
            # construct and execute hadoop streaming command to initiate dnload
            cmd = ('hadoop jar {streaming_jar} '
                   '-D mapreduce.job.reduces=0 '
                   '-D mapreduce.map.speculative=false '
                   '-D mapreduce.task.timeout=12000000 '
                   '-files {mapper_script_path} '
                   '-input {resource_file} -output {dummy_output} '
                   '-mapper {mapper_script_name} '
                   '-inputformat {input_format} -outputformat {output_format} '
                   '-cmdenv STAGING_PATH={staging_path} ')
            args = {'streaming_jar': STREAMING_JAR,
                    'resource_file': pjoin(tmp_hdfs_dir, 'resource_file.txt'),
                    'dummy_output': pjoin(tmp_hdfs_dir, 'dummy_output'),
                    'mapper_script_name': 'download_mapper.py',
                    'mapper_script_path': pjoin(
                        os.path.dirname(__file__), 'resources',
                        'download_mapper.py'),
                    'input_format': (
                        'org.apache.hadoop.mapred.lib.NLineInputFormat'),
                    'output_format': (
                        'org.apache.hadoop.mapred.lib.NullOutputFormat'),
                    'staging_path': pjoin(tmp_hdfs_dir, 'staging')}
            print(cmd.format(**args))
            check_call(cmd.format(**args), shell=True)
            # move dnloaded data to final path
            check_call('hadoop fs -mkdir -p {0}'.format(hdfs_path), shell=True)
            check_call(
                'sudo -u hdfs hadoop fs -chown -R ec2-user:supergroup {0}'
                .format(tmp_hdfs_dir), shell=True)
            check_call(
                'hadoop fs -mv "{0}/*" {1}'.format(
                    pjoin(tmp_hdfs_dir, 'staging'), hdfs_path), shell=True)
```

The mapper script is here:
https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/resources/download_mapper.py
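The mapper reads one JSON-serialized resource per line from stdin and pulls each one into the shared staging directory. For orientation only, here is a minimal sketch of that pattern; this is not the contents of download_mapper.py, and the resource field names ('url', 'name') and the local temp handling are assumptions:

```python
#!/usr/bin/env python
# Sketch of a streaming download mapper (not the actual download_mapper.py).
# Each input line is assumed to be a JSON resource; 'url' and 'name' fields
# are assumptions about the datapackage schema.
import json
import os
import sys
from subprocess import check_call

# staging dir in HDFS, passed via '-cmdenv STAGING_PATH=...' in the job above
staging_path = os.environ['STAGING_PATH']

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # NLineInputFormat may prefix each record with a byte-offset key and a tab
    if '\t' in line:
        line = line.split('\t', 1)[1]
    resource = json.loads(line)
    url = resource['url']
    name = resource['name']
    local_tmp = os.path.join('/tmp', name)
    # download locally with curl, then push into the shared HDFS staging dir
    check_call('curl -L -o {0} {1}'.format(local_tmp, url), shell=True)
    check_call('hadoop fs -put {0} {1}/{2}'.format(local_tmp, staging_path,
                                                   name), shell=True)
    os.remove(local_tmp)
```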

The user creates a tmp HDFS directory to receive the data:

```python
with make_hdfs_tmp(permissions='777') as tmp_hdfs_dir:
```
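make_hdfs_tmp itself isn't shown in the issue; a hypothetical sketch of such a helper (not the eggo implementation) would be a context manager along these lines:

```python
# Hypothetical helper, for illustration only: make a temp dir in HDFS with the
# requested permissions, yield its path, and clean it up afterwards.
import uuid
from contextlib import contextmanager
from subprocess import check_call

@contextmanager
def make_hdfs_tmp(prefix='tmp_eggo', permissions='755'):
    tmp_dir = '/tmp/{0}_{1}'.format(prefix, uuid.uuid4().hex)
    check_call('hadoop fs -mkdir -p {0}'.format(tmp_dir), shell=True)
    check_call('hadoop fs -chmod {0} {1}'.format(permissions, tmp_dir),
               shell=True)
    try:
        yield tmp_dir
    finally:
        check_call('hadoop fs -rm -r -skipTrash {0}'.format(tmp_dir),
                   shell=True)
```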

The MR job then downloads the data with curl into that directory, but the map tasks run as user "yarn".

After the dataset is downloaded, we create the final output directory:

```python
check_call('hadoop fs -mkdir -p {0}'.format(hdfs_path), shell=True)
```

Ideally we'd just move all the data there. However, all the data is owned by user "yarn", which causes lots of permissions problems downstream. Instead, we first chown the data:

eggo/eggo/operations.py, lines 79 to 81 at c0e980f:

```python
check_call(
    'sudo -u hdfs hadoop fs -chown -R ec2-user:supergroup {0}'
    .format(tmp_hdfs_dir), shell=True)
```

(Note: this chowns to user ec2-user, but it's easy to change to whatever the current user is.) The chown is what requires the ability to `sudo -u hdfs`.
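As the note suggests, the hard-coded ec2-user could be swapped for the submitting user. A sketch of that tweak, assuming the local username matches the desired HDFS owner:

```python
# Drop-in variant of the chown above: target the current local user instead
# of hardcoding ec2-user. tmp_hdfs_dir is the same variable as in
# download_dataset_with_hadoop; the supergroup name is cluster-specific.
import getpass
from subprocess import check_call

current_user = getpass.getuser()
check_call(
    'sudo -u hdfs hadoop fs -chown -R {0}:supergroup {1}'
    .format(current_user, tmp_hdfs_dir), shell=True)
```

This removes the hard-coded username but still needs the sudo capability, which is the crux of this issue.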

Any way around this? cc @tomwhite

There are a couple of options:

  1. Set `dfs.permissions.enabled` to false, so that permission checking is disabled.
  2. Enable the `LinuxContainerExecutor` so that containers run as the user that submitted the job (a config sketch follows below).

The second is preferable from a security point of view; see the Hadoop documentation on the `LinuxContainerExecutor` for more info.
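For illustration only (not from the original thread): on a YARN cluster, option 2 is typically wired up with yarn-site.xml settings along these lines, plus a matching container-executor.cfg; the group value here is an assumption and varies by cluster.

```xml
<!-- yarn-site.xml: sketch of option 2 (LinuxContainerExecutor) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- assumption: this group must match the one configured in
       container-executor.cfg -->
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```

Option 1 is just the `dfs.permissions.enabled` property set to false in hdfs-site.xml.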

So anyone that wants to use the download tool on their cluster has to set up the LinuxContainerExecutor? Do you think that's an overly restrictive requirement? Should I just not be using MapReduce to download the datasets?