Thank you for applying for a position with SAP. If you are reading this, it means that you have exhibited a set of desirable qualities during your first-stage interview. We would like to assess your technical skills further through two simple problems.
Confidentiality requirements:
- Please do not share the task or data with anyone.
- Please delete the forked repository and your local files after the test. However, you are free to keep your Part 2 code if you want.
This technical test is to assess the following:
- Python proficiency: ability to write good-quality code.
- Machine Learning knowledge: ability to apply appropriate feature engineering and modelling algorithms.
- Learning aptitude: ability to learn on the job.
There are 2 parts to the test:
- Part 1: Data preparation based on requirements.
- Part 2: Machine-learning task.
- Fork this repository to your account.
- Clone from your own repository.
- Add your code, documentation and/or plots in the Jupyter notebooks.
- Save your notebooks and output files (if any) in the `submit/` folder.
- When you are done, commit and push your code to your repository.
- Create a Pull Request to this repository.
- Please make your submission within 1 week from the day you fork this repository.
- If you are not sure about any of the instructions, please exercise your own judgement and proceed accordingly.
- For non-test related queries, please contact ray.han@sap.com or traci.lim@sap.com.
Your team is working on a contract-related project. Your colleagues want to train text-classification models with contract text data, and they need your help to prepare the data for the task. This task is about filtering relevant contracts and merging them with labels.
- In `data/`, you are given 2 compressed pandas DataFrames: `df_cases_200906.gzip` and `df_label_200906.gzip`.
- `df_cases_200906.gzip`:
  - Each row represents a unique contract.
  - Schema:

| Column | Description |
| --- | --- |
| CaseId | A unique ID given to a group of contracts |
| FileName | File name of the contract |
| Language | Language of the contract |
| StartDate | Start date of the contract's creation |
| DocumentType | Type of contract |
| IsExecuted | Boolean describing the execution status of the contract |
| OcrText | OCR result of the contract PDF |
| QualityScore | A numeric float (0 to 1), to assess the quality of the text from the OCR conversion |
- `df_label_200906.gzip`:
  - Each row represents a unique Case ID.
  - Schema:

| Column | Description |
| --- | --- |
| CaseId | A unique ID given to a group of contracts |
| label_1 | Boolean (description of label is not needed for the test) |
| label_2 | Boolean (description of label is not needed for the test) |
- The main objective is to merge `label_1` and `label_2` into `df_cases_200906.gzip`, so that the team can train text-classification models using data from the `OcrText` field.
- A `CaseId` can contain either one contract or a group of contracts. However, each `CaseId` is given only 1 set of labels. Therefore, for each `CaseId`, you have to concatenate the `OcrText` fields of all VALID contracts.
- Contracts are considered INVALID if either of the following occurs:
  - `IsExecuted == False`
  - `QualityScore < 0.81`
- The final prepared dataset should contain the following columns:

| Column | Description |
| --- | --- |
| CaseId | A unique ID given to a group of contracts |
| ValidFileNames | List of all valid file names |
| InvalidFileNames | List of all invalid file names |
| OcrText | Concatenated OCR text of all valid contracts |
| label_1 | Boolean (description of label is not needed for the test) |
| label_2 | Boolean (description of label is not needed for the test) |
- Export the final prepared dataset as `submit/df_final.gzip`.
- Example of final prepared dataset:
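The filtering-and-merge steps above can be sketched as follows. This is a minimal illustration on tiny made-up DataFrames, not the actual data: every value below is invented, and only the columns needed for the logic are included.

```python
import pandas as pd

# Tiny invented stand-ins for df_cases_200906 / df_label_200906
# (schema columns only; all values are hypothetical).
df_cases = pd.DataFrame({
    "CaseId":       [1, 1, 2],
    "FileName":     ["a.pdf", "b.pdf", "c.pdf"],
    "IsExecuted":   [True, False, True],
    "QualityScore": [0.95, 0.90, 0.70],
    "OcrText":      ["foo", "bar", "baz"],
})
df_label = pd.DataFrame({
    "CaseId":  [1, 2],
    "label_1": [True, False],
    "label_2": [False, True],
})

# A contract is INVALID if IsExecuted == False or QualityScore < 0.81.
df_cases["IsValid"] = df_cases["IsExecuted"] & (df_cases["QualityScore"] >= 0.81)

# Per CaseId: collect valid/invalid file names, concatenate valid OcrText.
agg = df_cases.groupby("CaseId").apply(lambda g: pd.Series({
    "ValidFileNames":   g.loc[g["IsValid"], "FileName"].tolist(),
    "InvalidFileNames": g.loc[~g["IsValid"], "FileName"].tolist(),
    "OcrText":          " ".join(g.loc[g["IsValid"], "OcrText"]),
})).reset_index()

# Attach the per-case labels.
df_final = agg.merge(df_label, on="CaseId", how="left")
# df_final.to_pickle("submit/df_final.gzip")
```

Here `agg` is just a hypothetical intermediate name; on the full dataset a vectorized `groupby(...).agg(...)` would likely be faster than `apply`, which matters since runtime is evaluated.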
- You are required to code in a Jupyter Notebook, using Python.
- Use the following command to read a gzip file with pandas:

```python
df = pd.read_pickle('df_cases_200906.gzip')
```
- Save the following files into the `submit/` folder:
  - One Jupyter notebook, detailing the data preparation code
  - The final prepared dataset, `df_final.gzip`
- Please make your notebook clean and readable; your code should run from top to bottom without any errors.
- The following will be evaluated:
- Code correctness
- Readability
- Runtime
Your submission in this section is NOT to train the best model. It is to help us assess your ability to learn and apply NLP modelling techniques.
- Dataset: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
- Train 3 binary text-classification models to detect Fake/True news. You are free to use:
  - Any text processing/feature-engineering setup: bag-of-words, TF-IDF, pretrained embeddings (FastText, GloVe), BERT, XLNet, RoBERTa, etc.
  - Any algorithms: XGBoost, LightGBM, CNN, LSTM, etc.
- Sample your own train and test set, and report the test set's F1 score and accuracy for each of your models.
- Add a summary section describing why you chose to train those models, and which one is preferred.
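To make the expected deliverable concrete, here is a minimal sketch of one such baseline: a TF-IDF + logistic-regression pipeline with scikit-learn, reporting test-set F1 and accuracy. The texts and labels below are invented toy stand-ins, not the Kaggle data, so the scores are meaningless beyond demonstrating the pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for the fake/real news texts (1 = fake).
texts = [
    "shocking secret cure doctors hide",
    "senate passes annual budget bill",
    "celebrity exposes alien conspiracy",
    "central bank raises interest rates",
] * 10
labels = [1, 0, 1, 0] * 10

# Sample a held-out test set, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# TF-IDF features + a linear classifier in a single pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(f"F1: {f1_score(y_test, pred):.3f}  "
      f"accuracy: {accuracy_score(y_test, pred):.3f}")
```

The same fit/predict/report pattern applies to the other two models; any of the feature setups and algorithms listed above can be swapped in.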
- You are required to code in a Jupyter Notebook, using Python.
- Example libraries/frameworks you can use: scikit-learn, TensorFlow, Keras, PyTorch, Hugging Face's transformers, etc.
- Feel free to take blocks of your code from existing public Kaggle kernels or external tutorials online. Add a comment to indicate the source of your code if you do so.
- Save the following files into the `submit/` folder:
  - One Jupyter notebook, detailing your entire ML pipeline as per the instructions.
- Do not submit any locally-saved embedding files.
- Do not attempt to hyper-parameter tune your models.
- If you lack CPU or GPU resources, feel free to use Google Colab, or re-sample your dataset to be smaller.
- Other than the points listed in the instructions, there are no strict rules on what to include in your Jupyter notebook. You are free to apply data cleaning and exploratory data analysis on top of whatever is required.
- The following will be evaluated:
- Code correctness
- Readability
- Summary
Good luck!