CoffeaHarvester (NanoAOD coffeabeans producer)

Set up CMSSW

cmsrel CMSSW_9_4_11_cand2
cd CMSSW_9_4_11_cand2/src
cmsenv

Hack NanoAOD framework (when running over 94X MiniAODv3)

git cms-addpkg PhysicsTools/NanoAOD

# comment out L183 of PhysicsTools/NanoAOD/python/nano_cff.py
run2_nanoAOD_94X2016.toModify(process, nanoAOD_addDeepFlavourTagFor94X2016) 
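
If you prefer not to edit the file by hand, a sed one-liner along these lines should do it (it assumes the line still reads exactly as shown above; the content and line number may differ between NanoAOD versions, so double-check the result):

sed -i '/run2_nanoAOD_94X2016\.toModify(process, nanoAOD_addDeepFlavourTagFor94X2016)/ s|^|# |' PhysicsTools/NanoAOD/python/nano_cff.py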

Apply changes to PhysicsTools/NanoAOD

# this one contains the updated EGM corrections for electron/photons (**only needed for legacy 2016**)
git cms-merge-topic -u hqucms:deep-boosted-jets-94X-custom-nano

Get customized NanoAOD producers

git clone https://github.com/CoffeaTeam/CoffeaHarvester PhysicsTools/NanoTuples

Compile

scram b -j16

Test

cd PhysicsTools/NanoTuples/test

MC (80X, MiniAODv2):

cmsDriver.py mc -n -1 --mc --eventcontent NANOAODSIM --datatier NANOAODSIM --conditions 94X_mcRun2_asymptotic_v2 --step NANO --nThreads 4 --era Run2_2016,run2_miniAOD_80XLegacy --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeMC --filein file:step-1.root --fileout file:nano.root --no_exec

# test file: /store/mc/RunIISummer16MiniAODv2/ttHToCC_M125_TuneCUETP8M2_13TeV_powheg_pythia8/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/50000/106F8E1B-23ED-E711-9F58-0025905B861C.root

Data (2016 ReReco):

cmsDriver.py data -n -1 --data --eventcontent NANOAOD --datatier NANOAOD --conditions 94X_dataRun2_v4 --step NANO --nThreads 4 --era Run2_2016,run2_miniAOD_80XLegacy --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeData_METMuEGClean --filein file:step-1.root --fileout file:nano.root --no_exec

# test file: /store/data/Run2016H/MET/MINIAOD/03Feb2017_ver3-v1/80000/2A9DE5C7-ADEA-E611-9F9C-008CFA111290.root

MC (2016 Legacy):

cmsDriver.py mc -n -1 --mc --eventcontent NANOAODSIM --datatier NANOAODSIM --conditions 94X_mcRun2_asymptotic_v3 --step NANO --nThreads 4 --era Run2_2016,run2_miniAOD_80XLegacy --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeMC --filein file:step-1.root --fileout file:nano.root --no_exec

# test file: /store/mc/RunIISummer16MiniAODv2/ttHToCC_M125_TuneCUETP8M2_13TeV_powheg_pythia8/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/50000/106F8E1B-23ED-E711-9F58-0025905B861C.root

Data (2016 Legacy):

cmsDriver.py data -n -1 --data --eventcontent NANOAOD --datatier NANOAOD --conditions 94X_dataRun2_v10 --step NANO --nThreads 4 --era Run2_2016,run2_nanoAOD_94X2016 --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeData --filein file:step-1.root --fileout file:nano.root --no_exec

# test file: /store/data/Run2016H/MET/MINIAOD/17Jul2018-v2/00000/0A0B71F7-75B8-E811-BAB7-0425C5DE7BE4.root

MC (94X, re-miniAOD 12Apr2018):

cmsDriver.py mc -n 100 --mc --eventcontent NANOAODSIM --datatier NANOAODSIM --conditions 94X_mc2017_realistic_v14 --step NANO --nThreads 4 --era Run2_2017,run2_miniAOD_94XFall17 --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeMC --filein /store/mc/RunIIFall17MiniAODv2/ttHToCC_M125_TuneCP5_13TeV-powheg-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/70000/EED096D8-EE98-E811-A327-0CC47A7C3572.root --fileout nano_mc2017.root --no_exec

Data (94X, re-miniAOD 31Mar2018):

cmsDriver.py data -n 100 --data --eventcontent NANOAOD --datatier NANOAOD --conditions 94X_dataRun2_v6 --step NANO --nThreads 4 --era Run2_2017,run2_miniAOD_94XFall17 --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeData --filein /store/data/Run2017F/MET/MINIAOD/31Mar2018-v1/910000/A0858FDD-E73B-E811-803F-0CC47A7C34A6.root --fileout nano_data2017.root --no_exec

Production

Step 0: switch to the crab production directory and set up grid proxy, CRAB environment, etc.

cd $CMSSW_BASE/src/PhysicsTools/NanoTuples/crab
# set up grid proxy
voms-proxy-init -rfc -voms cms --valid 168:00
# set up CRAB env (must be done after cmsenv)
source /cvmfs/cms.cern.ch/crab3/crab.sh

Step 1: generate the Python config files with cmsDriver.py using the following commands:

MC (80X, MiniAODv2):

cmsDriver.py mc -n -1 --mc --eventcontent NANOAODSIM --datatier NANOAODSIM --conditions 94X_mcRun2_asymptotic_v2 --step NANO --nThreads 4 --era Run2_2016,run2_miniAOD_80XLegacy --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeMC --filein file:step-1.root --fileout file:nano.root --no_exec

Data (23Sep2016 ReReco):

cmsDriver.py data -n -1 --data --eventcontent NANOAOD --datatier NANOAOD --conditions 94X_dataRun2_v4 --step NANO --nThreads 4 --era Run2_2016,run2_miniAOD_80XLegacy --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeData_METMuEGClean --filein file:step-1.root --fileout file:nano.root --no_exec

MC (94X, re-miniAOD 12Apr2018):

cmsDriver.py mc -n -1 --mc --eventcontent NANOAODSIM --datatier NANOAODSIM --conditions 94X_mc2017_realistic_v14 --step NANO --nThreads 4 --era Run2_2017,run2_miniAOD_94XFall17 --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeMC --filein file:step-1.root --fileout file:nano.root --no_exec

Data (94X, re-miniAOD 31Mar2018):

cmsDriver.py data -n -1 --data --eventcontent NANOAOD --datatier NANOAOD --conditions 94X_dataRun2_v6 --step NANO --nThreads 4 --era Run2_2017,run2_miniAOD_94XFall17 --customise PhysicsTools/NanoTuples/nanoTuples_cff.nanoTuples_customizeData --filein  file:step-1.root --fileout file:nano.root --no_exec

Global tags and eras are taken from: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookMiniAOD

Step 2: use the crab.py script to submit the CRAB jobs:

For MC:

python crab.py -p mc_NANO.py -o /store/group/lpccoffea/coffeabeans/nano_mc_[version] -t NanoTuples-[version] -i mc_[ABC].txt --num-cores 4 --send-external -s EventAwareLumiBased -n 50000 --work-area crab_projects_mc_[ABC] --dryrun

For data:

python crab.py -p data_NANO.py -o /store/group/lpccoffea/coffeabeans/nano_data_[version] -t NanoTuples-[version] -i data.txt --num-cores 4 --send-external -s EventAwareLumiBased -n 50000 --work-area crab_projects_data --dryrun

A JSON file can be applied to data samples with the -j option. By default, we use the golden JSON for 2016:

https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions16/13TeV/ReReco/Final/Cert_271036-284044_13TeV_23Sep2016ReReco_Collisions16_JSON.txt

For 2017, the recommended JSON is:

https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions17/13TeV/ReReco/Cert_294927-306462_13TeV_EOY2017ReReco_Collisions17_JSON.txt

For updated information, check: https://twiki.cern.ch/twiki/bin/viewauth/CMS/PdmV2017Analysis
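
As an illustration, the data submission command from Step 2 with the 2016 golden JSON applied might look like the following (this assumes -j takes a local file path, so the JSON is downloaded first; if the script accepts a URL directly, the wget step can be skipped):

# download the golden JSON, then pass it to crab.py via -j
wget https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions16/13TeV/ReReco/Final/Cert_271036-284044_13TeV_23Sep2016ReReco_Collisions16_JSON.txt
python crab.py -p data_NANO.py -o /store/group/lpccoffea/coffeabeans/nano_data_[version] -t NanoTuples-[version] -i data.txt -j Cert_271036-284044_13TeV_23Sep2016ReReco_Collisions16_JSON.txt --num-cores 4 --send-external -s EventAwareLumiBased -n 50000 --work-area crab_projects_data --dryrun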

These commands perform a "dryrun", printing out the CRAB configuration files. Please check that everything is correct (e.g., the output path, version number, requested number of cores, etc.) before submitting the actual jobs. To actually submit the jobs to CRAB, simply remove the --dryrun option at the end.

Step 3: check job status

The status of the CRAB jobs can be checked with:

./crab.py --status --work-area crab_projects_[ABC]

Note that this will also resubmit failed jobs automatically.

The CRAB dashboard can also be used to get a quick overview of the job status: https://dashb-cms-job.cern.ch/dashboard/templates/task-analysis

More options of this crab.py script can be found with:

./crab.py -h

Dataset Loading Instructions

Obtaining an account and setting up the environment

To upload datasets to the database you need an account on ifdb02. Ask Igor for one by email (ivm@fnal.gov), providing your FNAL username in the request. Once you have the account, obtain a Kerberos ticket with kinit and log in to ifdb02:

kinit FNAL_USERNAME@FNAL.GOV
ssh FNAL_USERNAME@ifdb02.fnal.gov 

In your home directory on ifdb02, download and install Miniconda2

wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh

Follow the on-screen instructions (select yes/ENTER in all cases). Once that is done, clone the data loading tools into your home directory

git clone http://cdcvs.fnal.gov/projects/nosql-ldrd bigdata

Change into the bigdata/ingest directory and untar the Couchbase client and pyxrootd libraries into the Python site-packages directory

cd ~/bigdata/ingest
tar -xzvf couchbase_python.tar.gz -C ~/miniconda2/lib/python2.7/site-packages/
tar -xzvf pyxrootd.tar.gz -C ~/miniconda2/lib/python2.7/site-packages/

Now set up Python with Miniconda

export PATH=~/miniconda2/bin:$PATH
conda create -n py2 python

(select yes 'y' when asked)

Install uproot 3.3.0

pip install uproot==3.3.0

Change into the bigdata directory and run the setup.py script

cd ~/bigdata/
python setup.py install

and copy the following directory into your home directory.

cp -r /data3/fnavarro/build ~/

Switch into the following directory

cd ~/bigdata/ingest/ingestion

Create a file named setup.py and fill it with the following (despite the .py extension, this is a shell snippet that will be sourced later)

export STRIPED_HOME=${HOME}/striped_home
export PYTHONPATH=${HOME}/build/striped:${HOME}/pythreader
export COUCHBASE_BACKEND_CFG=`pwd`/couchbase.cfg

In the same directory, create a file named couchbase.cfg with the following contents

[CouchBase]
Username = striped
Password = Striped501
Readonly_Username = readonly
Readonly_Password = StripedReadOnly
ClusterURL = couchbase://dbdev112,dbdev115?operation_timeout=100

With the environment set up, you can run the following to check that everything is working properly.

cd ~/bigdata/ingest/ingestion/
source setup.py
cd ../tools 
python deleteDataset.py Sandbox user_testDataset
python createDataset.py nanoMC2016.json Sandbox user_testDataset
cd ../ingestion
python loadFile.py root://cmseos.fnal.gov//store/group/lpccoffea/coffeabeans/nano_2016/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/NanoTuples-2016_RunIISummer16MiniAODv2-PUMoriond17_80X_v6-v1/181126_171720/0000/nano_1.root Sandbox user_testDataset

If the last command fails with an error complaining that lzma is not installed, run

conda install -c conda-forge backports.lzma

and re-run

python loadFile.py root://cmseos.fnal.gov//store/group/lpccoffea/coffeabeans/nano_2016/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/NanoTuples-2016_RunIISummer16MiniAODv2-PUMoriond17_80X_v6-v1/181126_171720/0000/nano_1.root Sandbox user_testDataset

The output should be

Using file name: nano_1.root
No user profile will be created
Checking if the file is already uploaded
Found 0 events from this file in the database
Starting RGID: 0, 3 row groups, average row group size:16649

Loading data

Before using any of the following scripts, you must run the following once every time you log in

cd ~/bigdata/ingest/ingestion/
source setup.py

Step 1: Generate a dataset schema

You will create a JSON file that describes the structure of the input ROOT files. Expect a different schema for data and for MC simulation; schemas can also differ for other reasons. For instance, in NanoAOD the trigger bits are stored individually and the list of triggers can vary from file to file. Taking this into account, you can generate the schema as follows.

First, switch into

cd $HOME/bigdata/ingest/datasets/yourFileFormat

where yourFileFormat is either bacon or NanoAOD

Now run the following script

python make_schema.py <file_path> <schema>

where file_path is the absolute path to one of the files from the dataset you want to upload and schema is the name you will give to the created JSON file. You may use XRootD for the input file; in that case the path would be: root://cmseos.fnal.gov//path/to/file/in/eos
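
For example, to build a schema named nanoMC2016.json from the NanoAOD test file used earlier (any file from the dataset you plan to upload works equally well; the file and schema names here are just for illustration):

python make_schema.py root://cmseos.fnal.gov//store/group/lpccoffea/coffeabeans/nano_2016/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/NanoTuples-2016_RunIISummer16MiniAODv2-PUMoriond17_80X_v6-v1/181126_171720/0000/nano_1.root nanoMC2016.json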

Step 2: Create the datasets in the database

Before uploading your files you need to define datasets. Datasets can be defined based on the simulated physics process for MC, or on the primary dataset for data. This step is fairly flexible: as an example, one can define a single W+jets dataset and pass the HT/pT bin information as metadata, or create separate datasets for each HT/pT bin.

To do this:

cd ~/bigdata/ingest/tools/ 

and run

python createDataset.py <schema> <bucket name> <dataset name>

You may view the available buckets at http://dbdev121.fnal.gov:8091/ui/index.html (user: admin, password: ad___501) by clicking the "Buckets" button at the top left. <schema> is the previously created schema JSON file, and <dataset name> is whatever name you want to give to the dataset; you will use this name to access the files when running an analysis. You may obtain premade schemas for nanoAOD datasets from /data3/fnavarro/schemas/. To copy all of them into your ingestion directory:

cp -r /data3/fnavarro/schemas ~/bigdata/ingest/ingestion/
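
For instance, assuming nanoMC2016.json is among the premade schemas you just copied and that createDataset.py accepts a full path to the schema file, a creation command could look like this (the dataset name is illustrative):

cd ~/bigdata/ingest/tools/
python createDataset.py ~/bigdata/ingest/ingestion/schemas/nanoMC2016.json Sandbox user_DYJetsToLL_HT100to200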

Step 3: Load datasets

To upload the files you will need a list containing the paths to the files. These may be local files or files stored on EOS. A script that generates these file lists can be found at

/data3/fnavarro/fileListScripts/

Instructions on how to use them are found on the README file.

Switch into the ingestion directory (bigdata/ingest/ingestion); there you will use the script loadFiles.py. The minimum parameters you need are

python loadFiles.py  <bucket name> <dataset-name> @<file containing list of files>  # '@' before the name of the file is necessary

This will load the files with a default column size of 10000 and store them under their original names. You can increase the column size with the -n parameter, and you can prepend a prefix to the file paths in your list with the -p parameter. Also, if files within the same dataset share the same name (for example, files from a dataset may be located under dataset/.../0000/nano_1.root and dataset/.../0001/nano_1.root), you can use the -k option to fold the n directories preceding the file into its name (in the previous example, 0000 and 0001 would be prepended to nano_1.root, giving 0000_nano_1.root and 0001_nano_1.root), so that the files can be uploaded into the same dataset. You can find example file lists under /data3/fnavarro/exampleUploadFiles. An example combining these options is sketched below.
To see additional options run

python loadFiles.py
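
Putting the options above together, an upload command might look like the sketch below; the bucket, dataset name, and list file are illustrative, the list is assumed to contain one path per line, and -k 1 is assumed to be the way to fold one parent directory into the file name, so confirm the exact option syntax with loadFiles.py itself:

# filelist_DY.txt (illustrative) is assumed to contain one path per line, e.g.
#   /store/group/lpccoffea/coffeabeans/nano_2016/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/NanoTuples-2016_RunIISummer16MiniAODv2-PUMoriond17_80X_v6-v1/181126_171720/0000/nano_1.root
python loadFiles.py -n 20000 -p root://cmseos.fnal.gov/ -k 1 Sandbox user_DYJetsToLL_HT100to200 @filelist_DY.txt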

You can see a complete list of uploaded datasets at http://dbweb6.fnal.gov:9090/striped_130tb/app/datasets

Loading tools extras

Listing Dataset info

A tool you might want to use to check whether a dataset was uploaded properly is listDataset.py, located under ~/bigdata/ingest/tools/. It is used as follows

python listDataset.py [-f|-l] <bucket> <dataset>

Running this without the -f or -l options will output the number of files uploaded, the total number of events, the number of row groups, and the missing row groups, along with some other info. Using -f will list the files, and -l will list the files together with the row group ID they belong to.
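
For example, to inspect the test dataset created earlier and list its files:

python listDataset.py -f Sandbox user_testDataset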

Re-running failed jobs

If an upload gets interrupted for any reason, you may either continue uploading the remaining files or start from scratch by first deleting the dataset and creating it again. To remove a dataset:

cd ~/bigdata/ingest/tools/
python deleteDataset.py <bucket> <dataset>

You may use listDataset.py to check that it was properly removed. A nonexistent dataset should output

0 items removed

If a dataset upload fails before it is done, you may view the files that have been uploaded with

python listDataset.py -f <bucket> <dataset> 

then remove those files from the file list you are using for the upload and re-run loadFiles.py with the same options you were using before.
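
A possible way to build the list of remaining files is sketched below. It assumes listDataset.py -f prints one uploaded file name per line and that those names appear verbatim in your original list (which may not hold if -k was used to rename files), so treat it only as a starting point:

cd ~/bigdata/ingest/tools/
# save the names of the files already in the database
python listDataset.py -f Sandbox user_DYJetsToLL_HT100to200 > uploaded.txt
# keep only the paths that do not match any already-uploaded file name
grep -v -F -f uploaded.txt filelist_DY.txt > filelist_DY_remaining.txt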

Extra tip: if your computer gets temporarily disconnected from the internet while an upload is in progress, do not interact with the terminal running the job by clicking or typing in it; doing so will log you out with a connection error. If left alone, the job will likely continue once you reconnect.