bigdatagenomics/eggo

dnload_raw operation requires the ability to `sudo -u hdfs`

The eggo CLI tool for downloading a dataset onto a CDH cluster uses Hadoop Streaming essentially as a job scheduler to download files into HDFS.

The CLI entry point is here:

```python
def dnload_raw(input, output):
```

And the implementation is here (eggo/eggo/operations.py, lines 38 to 84 at c0e980f):

```python
def download_dataset_with_hadoop(datapackage, hdfs_path):
    with make_local_tmp() as tmp_local_dir:
        with make_hdfs_tmp(permissions='777') as tmp_hdfs_dir:
            # NOTE: 777 used so user yarn can write to this dir
            # create input file for MR job that downloads the files and puts
            # them in HDFS
            local_resource_file = pjoin(tmp_local_dir, 'resource_file.txt')
            with open(local_resource_file, 'w') as op:
                for resource in datapackage['resources']:
                    op.write('{0}\n'.format(json.dumps(resource)))
            check_call('hadoop fs -put {0} {1}'.format(local_resource_file,
                                                       tmp_hdfs_dir),
                       shell=True)
            # construct and execute hadoop streaming command to initiate dnload
            cmd = ('hadoop jar {streaming_jar} '
                   '-D mapreduce.job.reduces=0 '
                   '-D mapreduce.map.speculative=false '
                   '-D mapreduce.task.timeout=12000000 '
                   '-files {mapper_script_path} '
                   '-input {resource_file} -output {dummy_output} '
                   '-mapper {mapper_script_name} '
                   '-inputformat {input_format} -outputformat {output_format} '
                   '-cmdenv STAGING_PATH={staging_path} ')
            args = {'streaming_jar': STREAMING_JAR,
                    'resource_file': pjoin(tmp_hdfs_dir, 'resource_file.txt'),
                    'dummy_output': pjoin(tmp_hdfs_dir, 'dummy_output'),
                    'mapper_script_name': 'download_mapper.py',
                    'mapper_script_path': pjoin(
                        os.path.dirname(__file__), 'resources',
                        'download_mapper.py'),
                    'input_format': (
                        'org.apache.hadoop.mapred.lib.NLineInputFormat'),
                    'output_format': (
                        'org.apache.hadoop.mapred.lib.NullOutputFormat'),
                    'staging_path': pjoin(tmp_hdfs_dir, 'staging')}
            print(cmd.format(**args))
            check_call(cmd.format(**args), shell=True)
            # move dnloaded data to final path
            check_call('hadoop fs -mkdir -p {0}'.format(hdfs_path), shell=True)
            check_call(
                'sudo -u hdfs hadoop fs -chown -R ec2-user:supergroup {0}'
                .format(tmp_hdfs_dir), shell=True)
            check_call(
                'hadoop fs -mv "{0}/*" {1}'.format(
                    pjoin(tmp_hdfs_dir, 'staging'), hdfs_path), shell=True)
```

The mapper script is here:
https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/resources/download_mapper.py
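The mapper reads one JSON-serialized resource per line from stdin and pulls each one into the shared staging directory. For orientation only, here is a minimal sketch of that pattern; this is not the contents of download_mapper.py, and the resource field names ('url', 'name') and the local temp handling are assumptions:

```python
#!/usr/bin/env python
# Sketch of a streaming download mapper (not the actual download_mapper.py).
# Each input line is assumed to be a JSON resource; 'url' and 'name' fields
# are assumptions about the datapackage schema.
import json
import os
import sys
from subprocess import check_call

# staging dir in HDFS, passed via '-cmdenv STAGING_PATH=...' in the job above
staging_path = os.environ['STAGING_PATH']

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # NLineInputFormat may prefix each record with a byte-offset key and a tab
    if '\t' in line:
        line = line.split('\t', 1)[1]
    resource = json.loads(line)
    url = resource['url']
    name = resource['name']
    local_tmp = os.path.join('/tmp', name)
    # download locally with curl, then push into the shared HDFS staging dir
    check_call('curl -L -o {0} {1}'.format(local_tmp, url), shell=True)
    check_call('hadoop fs -put {0} {1}/{2}'.format(local_tmp, staging_path,
                                                   name), shell=True)
    os.remove(local_tmp)
```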

The user creates a tmp HDFS directory to receive the data:

```python
with make_hdfs_tmp(permissions='777') as tmp_hdfs_dir:
```
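make_hdfs_tmp itself isn't shown in the issue; a hypothetical sketch of such a helper (not the eggo implementation) would be a context manager along these lines:

```python
# Hypothetical helper, for illustration only: make a temp dir in HDFS with the
# requested permissions, yield its path, and clean it up afterwards.
import uuid
from contextlib import contextmanager
from subprocess import check_call

@contextmanager
def make_hdfs_tmp(prefix='tmp_eggo', permissions='755'):
    tmp_dir = '/tmp/{0}_{1}'.format(prefix, uuid.uuid4().hex)
    check_call('hadoop fs -mkdir -p {0}'.format(tmp_dir), shell=True)
    check_call('hadoop fs -chmod {0} {1}'.format(permissions, tmp_dir),
               shell=True)
    try:
        yield tmp_dir
    finally:
        check_call('hadoop fs -rm -r -skipTrash {0}'.format(tmp_dir),
                   shell=True)
```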

The MR job then downloads the data with curl into that directory, but the map tasks run as user "yarn".

After the dataset is downloaded, we create the final output directory:

```python
check_call('hadoop fs -mkdir -p {0}'.format(hdfs_path), shell=True)
```

Ideally we'd just move all the data there. However, all the data is owned by user "yarn", which causes lots of permissions problems downstream. Instead, we first chown the data:

eggo/eggo/operations.py, lines 79 to 81 at c0e980f:

```python
check_call(
    'sudo -u hdfs hadoop fs -chown -R ec2-user:supergroup {0}'
    .format(tmp_hdfs_dir), shell=True)
```

(Note: this chowns to user ec2-user, but it's easy to change to whatever the current user is.) The chown is what requires the ability to `sudo -u hdfs`.
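As the note suggests, the hard-coded ec2-user could be swapped for the submitting user. A sketch of that tweak, assuming the local username matches the desired HDFS owner:

```python
# Drop-in variant of the chown above: target the current local user instead
# of hardcoding ec2-user. tmp_hdfs_dir is the same variable as in
# download_dataset_with_hadoop; the supergroup name is cluster-specific.
import getpass
from subprocess import check_call

current_user = getpass.getuser()
check_call(
    'sudo -u hdfs hadoop fs -chown -R {0}:supergroup {1}'
    .format(current_user, tmp_hdfs_dir), shell=True)
```

This removes the hard-coded username but still needs the sudo capability, which is the crux of this issue.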

Any way around this? cc @tomwhite

There are a couple of options:

  1. Set `dfs.permissions.enabled` to false, so that permission checking is disabled.
  2. Enable the `LinuxContainerExecutor` so that containers run as the user that submitted the job (a config sketch follows below).

The second is preferable from a security point of view; see the Hadoop documentation on the `LinuxContainerExecutor` for more info.
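For illustration only (not from the original thread): on a YARN cluster, option 2 is typically wired up with yarn-site.xml settings along these lines, plus a matching container-executor.cfg; the group value here is an assumption and varies by cluster.

```xml
<!-- yarn-site.xml: sketch of option 2 (LinuxContainerExecutor) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- assumption: this group must match the one configured in
       container-executor.cfg -->
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```

Option 1 is just the `dfs.permissions.enabled` property set to false in hdfs-site.xml.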

So anyone that wants to use the download tool on their cluster has to set up the LinuxContainerExecutor? Do you think that's an overly restrictive requirement? Should I just not be using MapReduce to download the datasets?