This project contains various developer helper scripts that simplify everyday tasks related to Apache Hadoop YARN development.
- gitpython - GitPython is a Python library used to interact with git repositories, high-level like git-porcelain, or low-level like git-plumbing.
- tabulate - python-tabulate: Pretty-print tabular data in Python, a library and a command-line utility.
- bs4 - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- TODO: Missing dependencies
TODO
- Szilard Nemeth - Initial work - Szilard Nemeth
TODO
TODO
In order to use this tool, you need to have at least Python 3.8 installed.
If you don't want to tinker with the source code, you can download yarn-dev-tools from PyPI as well.
This is probably the easiest way to use it.
You don't need to install anything manually as I created a script that performs the installation automatically.
The script has a setup-vars function at the beginning that defines the following environment variables:
- YARNDEVTOOLS_ROOT: Specifies the directory where the Python virtualenv will be created. yarn-dev-tools will be installed into this virtualenv.
- HADOOP_DEV_DIR: Should be set to the upstream Hadoop repository root, e.g.: "~/development/apache/hadoop/"
- CLOUDERA_HADOOP_ROOT: Should be set to the downstream Hadoop repository root, e.g.: "~/development/cloudera/hadoop/"
It is best to add the latter two environment variables to your bashrc / zshrc file (depending on which shell you use) so that they persist across shell sessions.
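
Before sourcing any script that relies on these variables, it can help to confirm that the two repository variables are set and point to existing directories. The following is a minimal sketch of such a check, not part of yarn-dev-tools itself: the function name find_repo_env_problems is hypothetical, and only the variable names come from this README.

```python
import os
from pathlib import Path
from typing import List, Mapping

# Variable names come from this README; the checking logic is illustrative only.
REPO_ENV_VARS = ("HADOOP_DEV_DIR", "CLOUDERA_HADOOP_ROOT")


def find_repo_env_problems(env: Mapping[str, str]) -> List[str]:
    """Return one human-readable problem per unset or invalid variable."""
    problems = []
    for name in REPO_ENV_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        # Expand "~" so values like "~/development/apache/hadoop/" work.
        path = Path(value).expanduser()
        if not path.is_dir():
            problems.append(f"{name} does not point to a directory: {path}")
    return problems


if __name__ == "__main__":
    for problem in find_repo_env_problems(os.environ):
        print(problem)
```

An empty result means both variables are usable. Passing a mapping instead of reading os.environ directly keeps the check easy to test.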
If you want to use yarn-dev-tools from source, first you need to install its dependencies.
The project root contains a pyproject.toml file that has all the dependencies listed.
The project uses Poetry to resolve the dependencies so you need to install poetry as well.
Simply go to the root of this project and execute poetry install --without localdev.
Alternatively, you can run make from the root of the project.
If you completed the installation (either by source or by package), you may want to define some shell aliases to use the tool more easily.
In my system, I have these.
Please make sure to source this script so that the command 'yarndevtools' will be available since it's defined as a function.
It is important to specify HADOOP_DEV_DIR and CLOUDERA_HADOOP_ROOT as mentioned above, before sourcing the script.
After these steps, you will have a basic set of aliases that is enough to get you started.
- Check out the branch 'cloudera-mirror-version'
- Upload the initial setup scripts to the CDSW files, to the root directory (/home/cdsw). You can do this by drag & drop, after choosing "Files" from the left-hand side menu.
- Create and launch a new CDSW session. Wait for the session to be launched and open up a terminal by clicking "Terminal access" on the top menu bar.
- Execute this command in the CLI:
~/initial-cdsw-setup.sh user cloudera
The initial-cdsw-setup.sh script performs the following actions:
- Downloads the scripts that clone the upstream and downstream Hadoop repositories and install yarndevtools itself as a Python module. The download location is: /home/cdsw/scripts. Please note that the files will be downloaded from the GitHub master branch of this repository!
- Executes the script described in step 2. This can take some time, especially cloning Hadoop. Note: The individual CDSW jobs should make sure for themselves to clone the repositories.
- Copies the python-based job configs for all jobs to /home/cdsw/jobs
After this, all you have to do in CDSW is to set up the projects and their starter scripts like this:
Project | Starter script location | Arguments for script |
---|---|---|
Jira umbrella data fetcher | scripts/start_job.py | jira-umbrella-data-fetcher |
Unit test result aggregator | scripts/start_job.py | unit-test-result-aggregator |
Unit test result fetcher | scripts/start_job.py | unit-test-result-fetcher |
Branch comparator | scripts/start_job.py | branch-comparator |
Review sheet backport updater | scripts/start_job.py | review-sheet-backport-updater |
Reviewsync | scripts/start_job.py | reviewsync |
The two provided arguments, user and cloudera, correspond to:
PYTHON_MODULE_MODE=user
EXEC_MODE=cloudera
In any case, the scripts that download the Hadoop repos (either upstream or downstream) are downloaded from https://github.com/szilard-nemeth/yarn-dev-tools. See this code block for details.
PYTHON_MODULE_MODE can be set to user or global. It controls if the Python package should be installed globally or just for the user.
See this code block for more details.
EXEC_MODE controls just one thing: the downstream Hadoop repo will only be downloaded if EXEC_MODE is set to cloudera.
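
The way the two positional arguments map to settings can be sketched in Python as follows. This is only an illustration of the rules described above, not the actual shell implementation: the function name parse_setup_args and the clone_downstream key are hypothetical, while the accepted values of PYTHON_MODULE_MODE and the meaning of EXEC_MODE=cloudera are taken from this README.

```python
from typing import Dict, Sequence

# Accepted values as listed in this README.
VALID_PYTHON_MODULE_MODES = ("user", "global")


def parse_setup_args(args: Sequence[str]) -> Dict[str, object]:
    """Map '<python-module-mode> <exec-mode>' to the settings they control."""
    if len(args) != 2:
        raise ValueError("expected two arguments: <python-module-mode> <exec-mode>")
    python_module_mode, exec_mode = args
    if python_module_mode not in VALID_PYTHON_MODULE_MODES:
        raise ValueError(f"PYTHON_MODULE_MODE must be one of {VALID_PYTHON_MODULE_MODES}")
    return {
        "PYTHON_MODULE_MODE": python_module_mode,
        "EXEC_MODE": exec_mode,
        # The downstream repo is only downloaded when EXEC_MODE is 'cloudera'.
        "clone_downstream": exec_mode == "cloudera",
    }
```

For the command shown earlier, parse_setup_args(["user", "cloudera"]) yields a user-level install with the downstream repository enabled.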
The script called install-requirements.sh will be executed.
What does install-requirements.sh do?
- Uninstalls the yarn-dev-tools python package
- Installs the yarn-dev-tools python package
Installation details can be found here
As you can see in this code block, the env var called YARNDEVTOOLS_VERSION controls how the package should be installed.
In the current setup, YARNDEVTOOLS_VERSION=repo (set as env var in CDSW / Project Settings / Advanced), therefore the package will be installed from the github.com repository, with the command:
pip3 install git+https://github.com/szilard-nemeth/yarn-dev-tools.git@cloudera-mirror-version
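
The selection logic can be approximated in Python as shown below. The real logic lives in install-requirements.sh, so treat this as a sketch under assumptions: only the command for the repo value is taken from this README. The handling of latest (an upgrade install from PyPI) and of any other value (an exact version pin) is assumed, and the function name pip_install_args is hypothetical.

```python
from typing import List

PACKAGE_NAME = "yarn-dev-tools"
# Taken verbatim from the install command in this README.
REPO_URL = "git+https://github.com/szilard-nemeth/yarn-dev-tools.git@cloudera-mirror-version"


def pip_install_args(version: str) -> List[str]:
    """Build a pip command line for a given YARNDEVTOOLS_VERSION value."""
    if version == "repo":
        return ["pip3", "install", REPO_URL]
    if version == "latest":
        # Assumption: 'most recent PyPI version' is done as an upgrade install.
        return ["pip3", "install", "--upgrade", PACKAGE_NAME]
    # Assumption: any other value is treated as an exact version pin.
    return ["pip3", "install", f"{PACKAGE_NAME}=={version}"]
```

Returning an argument list rather than a single string makes it straightforward to pass the result to subprocess.run without shell quoting concerns.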
See https://jira.cloudera.com/browse/COMPX-17121 for detailed execution logs.
All common environment variables are declared in a class called CdswEnvVar.
Name | Level | Mandatory? | Default value | Description |
---|---|---|---|---|
MAIL_ACC_USER | Project | Yes | N/A | Username for the Gmail account that is being used for sending emails |
MAIL_ACC_PASSWORD | Project | Yes | N/A | Password for the Gmail account that is being used for sending emails |
MAIL_RECIPIENTS | Project or Job | No | yarn_eng_bp@cloudera.com | Comma separated email addresses to send emails to. If not specified, the YARN mailing list is the default: yarn_eng_bp@cloudera.com. Can be specified on Job-level, too. |
ENABLE_GOOGLE_DRIVE_INTEGRATION | Project or Job | No | True | Whether to enable Google Drive integration for saving result files. |
DEBUG_ENABLED | Project or Job | No | Job-level default | Whether to enable debug mode for yarndevtools commands. Adds the --debug switch to CLI commands. Accepted values: True, False |
OVERRIDE_SCRIPT_BASEDIR | Project | No | N/A | Option to change the scripts dir for CDSW jobs. Do not modify unless absolutely necessary! |
ENABLE_LOGGER_HANDLER_SANITY_CHECK | Project or Job | No | True | Whether to enable sanity checking the number of loggers after first logger initialization. Can be disabled if errors come up during logger setup. |
CLOUDERA_HADOOP_ROOT | Project | Yes | <CDSW_BASEDIR>/repos/cloudera/hadoop/ | Downstream repository path for Hadoop. Auto set for CDSW |
HADOOP_DEV_DIR | Project | Yes | <CDSW_BASEDIR>/repos/apache/hadoop/ | Upstream repository path for Hadoop. Auto set for CDSW |
PYTHONPATH | Project | No | $PYTHONPATH:/home/cdsw/scripts | Tweaked PYTHONPATH, to correctly reload python dependencies. Do not modify unless absolutely necessary! |
TEST_EXEC_MODE | Project | No | cloudera | Test execution mode. Can take values of TestExecMode enum. For CDSW, it should be always set to TestExecMode.CLOUDERA |
PYTHON_MODULE_MODE | Project | No | user | Python module mode. Can take values of user and global. For CDSW, it should be always set to user. |
INSTALL_REQUIREMENTS | Project | No | True | Whether to run the install-requirements.sh script. Do not modify unless absolutely necessary! |
RESTART_PROCESS_WHEN_REQUIREMENTS_INSTALLED | Project | No | False | Only used for testing |
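
To illustrate how the mandatory flags and default values in the table above interact, here is a small sketch that resolves the common variables from an environment mapping. It is not the CdswEnvVar implementation: the function names are hypothetical, only the two mail account variables are treated as mandatory (the repository paths are auto set on CDSW), and the defaults are copied from the table.

```python
from typing import Dict, List, Mapping

# Mandatory variables without a default, per the table above.
MANDATORY_VARS = ("MAIL_ACC_USER", "MAIL_ACC_PASSWORD")

# Defaults copied from the table above (fixed string defaults only).
DEFAULTS: Dict[str, str] = {
    "MAIL_RECIPIENTS": "yarn_eng_bp@cloudera.com",
    "ENABLE_GOOGLE_DRIVE_INTEGRATION": "True",
    "ENABLE_LOGGER_HANDLER_SANITY_CHECK": "True",
    "TEST_EXEC_MODE": "cloudera",
    "PYTHON_MODULE_MODE": "user",
    "INSTALL_REQUIREMENTS": "True",
    "RESTART_PROCESS_WHEN_REQUIREMENTS_INSTALLED": "False",
}


def resolve_common_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Apply defaults and fail early when a mandatory variable is missing."""
    missing = [name for name in MANDATORY_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing mandatory environment variables: " + ", ".join(missing))
    resolved = dict(DEFAULTS)
    for name in list(DEFAULTS) + list(MANDATORY_VARS):
        if env.get(name):
            resolved[name] = env[name]
    return resolved


def mail_recipients(resolved: Mapping[str, str]) -> List[str]:
    """Split the comma separated MAIL_RECIPIENTS value into addresses."""
    return [addr.strip() for addr in resolved["MAIL_RECIPIENTS"].split(",") if addr.strip()]
```

Calling resolve_common_env(os.environ) at the start of a job would surface a missing mail account setting before any work is done.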
Corresponding class: JiraUmbrellaFetcherEnvVar
Name | Mandatory? | Default value | Actual value | Description |
---|---|---|---|---|
UMBRELLA_IDS | Yes | N/A | "YARN-10888 YARN-10889" | Comma separated list of umbrella Jira IDs |
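
A value such as the one above can be split into individual Jira IDs with a few lines of Python. The table describes the value as comma separated while the sample value uses spaces, so this sketch accepts both separators. The function name parse_umbrella_ids and the ID pattern (uppercase project key, dash, number) are assumptions, not the tool's actual parsing code.

```python
import re
from typing import List

# Assumed Jira ID shape: uppercase project key, a dash, then digits.
JIRA_ID_PATTERN = re.compile(r"^[A-Z][A-Z0-9]*-\d+$")


def parse_umbrella_ids(value: str) -> List[str]:
    """Split UMBRELLA_IDS on commas and/or whitespace and validate each ID."""
    ids = [item for item in re.split(r"[,\s]+", value.strip()) if item]
    invalid = [item for item in ids if not JIRA_ID_PATTERN.match(item)]
    if invalid:
        raise ValueError("Not valid Jira IDs: " + ", ".join(invalid))
    return ids
```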
Corresponding class: UnitTestResultAggregatorEmailEnvVar
Name | Mandatory? | Default value | Actual value | Description |
---|---|---|---|---|
GSHEET_CLIENT_SECRET | Yes | N/A | /home/cdsw/.secret/projects/cloudera/hadoop-reviewsync/client_secret_service_account_snemeth_cloudera_com.json | Path to the Google Sheets client secret file. Used for authenticating with Google Sheets. |
GSHEET_SPREADSHEET | Yes | N/A | "Failed testcases parsed from emails [generated by script]" | Name of the Google Sheets to work on |
GSHEET_WORKSHEET | Yes | N/A | "Failed testcases" | Name of the Google Sheets worksheet to work on |
REQUEST_LIMIT | No | 999 | 3000 | Limit the number of Gmail threads to query. |
MATCH_EXPRESSION | Yes | N/A | YARN::org.apache.hadoop.yarn MR::org.apache.hadoop.mapreduce | Match expressions that serve as a basis for grouping and rendering tables of test failures. See this file for more details |
ABBREV_TC_PACKAGE | No | N/A | org.apache.hadoop.yarn.server | Whether to abbreviate testcase package names in outputs in order to save screen space. The specified string will be abbreviated with the starting letters. |
AGGREGATE_FILTERS | No | N/A | CDPD-7.1.x CDPD-7.x | The resulted emails and testcases for each filter will be aggregated to a separate worksheet with name aggregated where WS is equal to the value specified by the --gsheet-worksheet argument. |
SKIP_AGGREGATION_RESOURCE_FILE | No | N/A | N/A | Specify a file that defines lines to skip. Lines starting with these strings will not be considered as lines to parse from the emails. |
SKIP_AGGREGATION_RESOURCE_FILE_AUTO_DISCOVERY | Yes | N/A | 1 | Whether to enable auto-discovery of skip aggregation resource file. Can take values to enable: ("True", "true", "1") or to disable: ("False", "false", "0"). |
GSHEET_COMPARE_WITH_JIRA_TABLE | No | N/A | "testcases with jiras" | A value should be provided if comparison of failed testcases with reported Jira table must be performed. The value is a name to a Google Sheets worksheet, for example 'testcases with jiras' |
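
Two of the values in the table above have a small syntax of their own. The sketch below parses them in Python: the boolean spellings are the ones listed for SKIP_AGGREGATION_RESOURCE_FILE_AUTO_DISCOVERY, while the alias::pattern format of MATCH_EXPRESSION is inferred from the sample value and may not cover everything the tool accepts. Both function names are hypothetical.

```python
from typing import Dict

# Accepted spellings as listed in the table above.
TRUE_VALUES = ("True", "true", "1")
FALSE_VALUES = ("False", "false", "0")


def parse_bool_flag(value: str) -> bool:
    """Convert one of the documented spellings to a bool."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"Unsupported boolean value: {value!r}")


def parse_match_expressions(value: str) -> Dict[str, str]:
    """Parse space separated 'ALIAS::pattern' items into a dict (assumed format)."""
    result = {}
    for item in value.split():
        alias, separator, pattern = item.partition("::")
        if not separator or not alias or not pattern:
            raise ValueError(f"Expected ALIAS::pattern, got: {item!r}")
        result[alias] = pattern
    return result
```

With the sample value from the table, the result maps YARN to org.apache.hadoop.yarn and MR to org.apache.hadoop.mapreduce.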
Corresponding class: UnitTestResultFetcherEnvVar
For legacy reasons, Jenkins-related env vars are declared in the class called CdswEnvVar.
Name | Mandatory? | Default value | Actual value | Description |
---|---|---|---|---|
JENKINS_USER | Yes | N/A | snemeth | User name for Cloudera Jenkins API access. |
JENKINS_PASSWORD | Yes | N/A | | Password for Cloudera Jenkins API access. |
BUILD_PROCESSING_LIMIT | No | 999 | 999 | Limit the number of Jenkins builds to fetch |
FORCE_SENDING_MAIL | No | False | False | Force sending email for all Jenkins runs even if they were sent out earlier |
RESET_JOB_BUILD_DATA | No | False | False | Reset job build data for specified jobs. Useful when job build data is corrupted. |
Corresponding class: BranchComparatorEnvVar
Name | Mandatory? | Default value | Actual value | Description |
---|---|---|---|---|
BRANCH_COMP_FEATURE_BRANCH | No | origin/CDH-7.1-maint | origin/CDH-7.1-maint | Name of the feature branch |
BRANCH_COMP_MASTER_BRANCH | No | origin/cdpd-master | origin/cdpd-master | Name of the master branch |
BRANCH_COMP_REPO_TYPE | No | downstream (RepoType.DOWNSTREAM) | N/A | Repository type. Can take a value of RepoType enum |
Corresponding class: ReviewSheetBackportUpdaterEnvVar
Name | Mandatory? | Default value | Actual value | Description |
---|---|---|---|---|
GSHEET_CLIENT_SECRET | Yes | N/A | /home/cdsw/.secret/projects/cloudera/hadoop-reviewsync/client_secret_service_account_snemeth_cloudera_com.json | Path to the Google Sheets client secret file. Used for authenticating with Google Sheets. |
GSHEET_SPREADSHEET | Yes | N/A | "YARN/MR Reviews" | Name of the Google Sheets to work on |
GSHEET_WORKSHEET | Yes | N/A | "Reviews done" | Name of the Google Sheets worksheet to work on |
GSHEET_JIRA_COLUMN | Yes | N/A | "JIRA" | Name of the column that contains Jira issue IDs in the Google Sheets spreadsheet |
GSHEET_UPDATE_DATE_COLUMN | Yes | N/A | "Last Updated" | Name of the column where this script will store last updated date in the Google Sheets spreadsheet |
GSHEET_STATUS_INFO_COLUMN | Yes | N/A | "Backported" | Name of the column where this script will store patch status info in the Google Sheets spreadsheet |
BRANCHES | Yes | N/A | origin/CDH-7.1-maint origin/cdpd-master origin/CDH-7.1.6.x origin/CDH-7.1.7.1057 origin/CDH-7.1.7.2000 origin/CDH-7.1.8.x | Check backports against these branches. Values should be separated by space. |
Corresponding class: ReviewSyncEnvVar
Name | Mandatory? | Default value | Actual value | Description |
---|---|---|---|---|
GSHEET_CLIENT_SECRET | Yes | N/A | /home/cdsw/.secret/projects/cloudera/hadoop-reviewsync/client_secret_service_account_snemeth_cloudera_com.json | Path to the Google Sheets client secret file. Used for authenticating with Google Sheets. |
GSHEET_SPREADSHEET | Yes | N/A | "YARN/MR Reviews" | Name of the Google Sheets to work on |
GSHEET_WORKSHEET | Yes | N/A | Incoming | Name of the Google Sheets worksheet to work on |
GSHEET_JIRA_COLUMN | Yes | N/A | "JIRA" | Name of the column that contains Jira issue IDs in the Google Sheets spreadsheet |
GSHEET_UPDATE_DATE_COLUMN | Yes | N/A | "Last Updated" | Name of the column where this script will store last updated date in the Google Sheets spreadsheet |
GSHEET_STATUS_INFO_COLUMN | Yes | N/A | "Reviewsync" | Name of the column where this script will store patch status info in the Google Sheets spreadsheet |
BRANCHES | Yes | N/A | branch-3.2 branch-3.1 | List of branches to apply patches that are targeted to trunk. Values should be separated by space. |
Name | Mandatory? | Default value | Class | Description |
---|---|---|---|---|
IGNORE_SMTP_AUTH_ERROR | No | False | EnvVar | Enable to ignore SMTPAuthenticationErrors |
FORCE_COLLECTING_ARTIFACTS | No | False | YarnDevToolsTestEnvVar | Enable to always collect all test artifacts. |
PROJECT_DETERMINATION_STRATEGY | Yes | N/A | YarnDevToolsEnvVar | Method for detecting the project name. Value can be one of: common_file, sys_path, repository_dir. common_file is suitable for most of the use-cases. Behaviour defined in external repo (https://github.com/szilard-nemeth/python-commons) |
ENV_CLOUDERA_HADOOP_ROOT | Yes | N/A | YarnDevToolsEnvVar | Alias of CLOUDERA_HADOOP_ROOT, see CDSW env vars above |
ENV_HADOOP_DEV_DIR | Yes | N/A | YarnDevToolsEnvVar | Alias of HADOOP_DEV_DIR, see CDSW env vars above |
YARNDEVTOOLS_VERSION | Yes | repo | N/A (script) | Used by script install-requirements.sh. See this function for details. Special value of latest means using the most recent pypi version. Special value of repo means use the most recent version from the repository. |
To backport YARN-6221 to 2 branches, run these commands:
yarn-backport YARN-6221 COMPX-6664 cdpd-master
yarn-backport YARN-6221 COMPX-6664 CDH-7.1-maint --no-fetch
The first argument is the upstream Jira ID.
The second argument is the downstream Jira ID.
The third argument is the downstream branch.
The --no-fetch option is a means to skip git fetch on both repos.
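
When the same change goes to several branches, the commands can be generated rather than typed by hand. The sketch below assumes the intent of the example above is to fetch once on the first call and pass --no-fetch on every later call. The function name backport_commands is hypothetical, and the yarn-backport alias itself must already exist in your shell.

```python
import shlex
from typing import List, Sequence


def backport_commands(upstream_jira: str, downstream_jira: str,
                      branches: Sequence[str]) -> List[List[str]]:
    """Build one yarn-backport argument list per downstream branch."""
    commands = []
    for index, branch in enumerate(branches):
        command = ["yarn-backport", upstream_jira, downstream_jira, branch]
        if index > 0:
            # Assumption: the repos were already fetched by the first call.
            command.append("--no-fetch")
        commands.append(command)
    return commands


if __name__ == "__main__":
    for command in backport_commands("YARN-6221", "COMPX-6664",
                                     ["cdpd-master", "CDH-7.1-maint"]):
        print(shlex.join(command))
```

The script only prints the commands, so they can be reviewed before being pasted into a shell where the alias is defined.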
- Go to Gerrit UI and download the patch. For example:
git fetch "https://gerrit.sjc.cloudera.com/cdh/hadoop" refs/changes/29/156429/5 && git checkout FETCH_HEAD
- Checkout a new branch
git checkout -b my-relation-chain
- Run backporter with:
yarn-backport YARN-10314 COMPX-7855 CDH-7.1.7.1000 --no-fetch --downstream_base_ref my-relation-chain
where:
The first argument is the upstream Jira ID.
The second argument is the downstream Jira ID.
The third argument is the downstream branch.
The --no-fetch option is a means to skip git fetch on both repos.
The --downstream_base_ref <local-branch> option is a way to use a local branch to base the backport on so the Git remote name won't be prepended.
Finally, I set up two aliases for pushing the changes to the downstream repo:
alias git-push-to-cdpdmaster="git push <REMOTE> HEAD:refs/for/cdpd-master%<REVIEWER_LIST>"
alias git-push-to-cdh71maint="git push <REMOTE> HEAD:refs/for/CDH-7.1-maint%<REVIEWER_LIST>"
where REVIEWER_LIST is in this format: "r=user1,r=user2,r=user3,..."
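
The push target used by these aliases follows a fixed shape, so it can also be assembled programmatically. The following is a minimal sketch: the function name gerrit_push_command is hypothetical, and the remote name passed in is whatever your Gerrit remote is called locally (the aliases above leave it as a placeholder).

```python
from typing import List, Sequence


def gerrit_push_command(remote: str, branch: str,
                        reviewers: Sequence[str]) -> List[str]:
    """Build 'git push <remote> HEAD:refs/for/<branch>%r=a,r=b'."""
    refspec = f"HEAD:refs/for/{branch}"
    if reviewers:
        # REVIEWER_LIST format from this README: r=user1,r=user2,...
        refspec += "%" + ",".join(f"r={user}" for user in reviewers)
    return ["git", "push", remote, refspec]
```

For example, gerrit_push_command("my-remote", "cdpd-master", ["user1", "user2"]) targets HEAD:refs/for/cdpd-master%r=user1,r=user2, where my-remote is a placeholder.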
Configure precommit as described in this blogpost.
Commands:
- Install precommit:
pip install pre-commit
- Make sure to add pre-commit to your path. For example, on a Mac system, pre-commit is installed here: $HOME/Library/Python/3.8/bin/pre-commit.
- Execute pre-commit install to install git hooks in your .git/ directory.
TODO
If you run into an error similar to this one:
An error has occurred: InvalidManifestError:
=====> /<userhome>/.cache/pre-commit/repoBP08UH/.pre-commit-hooks.yaml does not exist
Check the log at /<userhome>/.cache/pre-commit/pre-commit.log
please run: pre-commit autoupdate
More info can be found here.