Netflix/genie

Beginner question: how does genie connect to a cluster?

PeteW opened this issue · 4 comments

PeteW commented

I'm trying to understand something about authentication, specifically how Genie initiates the conversation with a "cluster" when it's time to run a job.

I presume that at the time of job invocation, genie can copy application/job scripts/files to the cluster. It can run optional application installation/setup and issue a "command" from within a directory of a machine in the cluster.

I'm a beginner with Genie and EMR, but in my experience with workflow tools such as Oozie or Luigi, we might leverage SSH as the means of caller/worker communication. That would require credentials and a specific target machine in the cluster. I couldn't find (so far) any related information, though I anticipated this would be part of the cluster data model. I need to better understand how Genie was intended to "talk" to a target cluster (or possibly even a specific machine in that cluster), and how to configure a cluster accordingly.

Thank you for what is clearly a wonderfully well-designed architecture.

Hello Pete,

Genie does not natively communicate with any cluster.

What Genie provides is a framework for administrators to templatize complex commands that use clusters and remote resources, so that users can make use of those commands without having to worry about client configuration, cluster endpoints, binary versions, etc.

I suggest taking a look at the demo if you haven't already.
If you look at what the various scripts and components do, you'll see that submitting a Hadoop job to Genie just means configuring a client with all the necessary bits of information to run such a job.
But Genie itself is not aware of Hadoop and does not talk directly to the cluster.
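In other words, the "connection" is really metadata resolution: admins register clusters as tagged records, and a job's cluster criteria are matched against those tags. A rough sketch of that idea (illustrative only, not Genie's actual implementation; the names here are made up):

```python
# Illustrative sketch of tag-based cluster resolution (not Genie's real code).
# A registered "cluster" is just metadata: a name plus a set of admin-assigned tags.

def resolve_cluster(clusters, criteria_tags):
    """Return the first registered cluster whose tag set covers all criteria."""
    wanted = set(criteria_tags)
    for cluster in clusters:
        if wanted <= cluster["tags"]:
            return cluster
    return None

registered = [
    {"name": "prod-yarn", "tags": {"sched:adhoc", "type:yarn"}},
    {"name": "prod-presto", "tags": {"sched:adhoc", "type:presto"}},
]

match = resolve_cluster(registered, ["sched:adhoc", "type:yarn"])
print(match["name"])  # prod-yarn
```

No credentials or SSH targets appear anywhere in this model; the cluster record only needs enough metadata (tags, config file locations) for the client-side command to find it.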

PeteW commented

Marco, thanks. I appreciate Genie's position as a job manager. What I continue to miss is the handoff point between the job scheduler and the cluster. Take the demo's "run YARN job" example: this code configures a job, its command, and its cluster requirements, and then the last line POSTs the job to the Genie API for execution.

```python
import sys

import pygenie

# Create a job instance and fill in the required parameters
job = pygenie.jobs.HadoopJob() \
    .job_name('Genie Demo YARN Job') \
    .genie_username('root') \
    .job_version('3.0.0')

# Set cluster criteria which determine the cluster to run the job on
job.cluster_tags(['sched:' + str(sys.argv[1]), 'type:yarn'])

# Set command criteria which will determine what command Genie executes for the job
job.command_tags(['type:yarn'])

# Any command line arguments to run along with the command. In this case it holds
# the actual query but this could also be done via an attachment or file dependency.
job.command("application -list --appStates ALL")

# Submit the job to Genie
running_job = job.execute()
```

Genie will locate all known clusters matching the requested tags and choose one based on a configurable pecking order. Once a cluster is selected, an actual shell command is executed (in this case, something like `${HADOOP_HOME}/bin/yarn application -list --appStates ALL`). My question is: where is this shell command executed? Is that shell call always executed locally on the server running Genie? I'm starting to think so, but want to make sure I follow.
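The mechanics Pete is describing can be sketched as a local child process: the resolved command contributes the executable, the job contributes the arguments, and the whole thing runs in the job's working directory on the Genie node. This is an illustrative sketch under that assumption, not Genie's actual source:

```python
import subprocess

def run_job(command_executable, job_arguments, job_dir):
    """Launch the job as a local child process in its own working directory,
    on whichever node accepted the request (illustrative, not Genie's code)."""
    cmdline = f"{command_executable} {job_arguments}"
    return subprocess.run(cmdline, shell=True, cwd=job_dir,
                          capture_output=True, text=True)

# Using `echo` as a stand-in for the real yarn binary:
result = run_job("echo", "application -list --appStates ALL", ".")
print(result.stdout.strip())  # application -list --appStates ALL
```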

Hey @PeteW, sorry for the delayed response; most of us have been disconnected over the holidays.

In Genie 3.x your assumption is correct: the job (client process) runs on the node where the API request lands, provided that node has enough memory capacity to handle the job. In a Genie cluster, the node a job is running on is saved in the database so it can be located later.
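As a sketch of what "saved in the database" buys a client, you can look up which node ran a job after the fact. The field names below are assumptions modeled loosely on Genie 3's job-execution resource; check the REST API docs for your version before relying on them:

```python
# Hypothetical response body from a job-execution lookup -- the field
# names here are assumptions, not a documented Genie contract.
execution_record = {
    "id": "demo-job-id",
    "hostName": "genie-node-2.example.com",
    "processId": 12345,
}

def job_host(execution):
    """Extract the node that ran (or is running) the job."""
    return execution["hostName"]

print(job_host(execution_record))  # genie-node-2.example.com
```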

A follow-up question: how does Genie achieve unified data access within a job without distinguishing between clusters? As I understand it, only when the job program accesses data in a cluster-agnostic way can Genie automatically migrate a computing task to a different cluster, especially when a cluster goes offline.