Data Scientist test

Overview

This repository contains data and instructions for a test as part of your application. It contains a simulated dataset, with 300 patients from different organisations, with columns:

  • ID - a unique row ID.
  • Organisation - The organisation the patient was seen at.
  • Age - The patient’s age.
  • LOS - The patent’s length-of-stay in hospital, in whole days.
  • Death - A flag indicating whether the patient died, coded: 0 = survived, 1 = died.
  • Category - The risk category the patient falls into.

The task

The task in this assessment is:

build the best model you can, predicting the ‘Death’ column, in this dataset.

We will not restrict you to particular model classes, statistical paradigms, or sets of assumptions. Please consider the following:

  • Some exploratory analyses and visualisations of the data items
  • Statistical / machine learning models to predict the ‘Death’ field
  • A method of assessing how ‘good’ a model is, and how you select the ‘best’ model
  • Interpretation of the model outputs

We advise you to spend no more than two hours on this test, as it is only part of the selection process. The first stage of interview will be to discuss you analysis approach, regardless of whether it ‘worked,’ so we ask you to leave any partial code/documents you create, even if they are incomplete. This will allows us to see your thought process, as there are many ways to approach this exercise.

Instructions

To access the data and share your answers with us:

  • Download a copy of the data in this repository:
  • If you are familiar with GitHub, please clone the repository and push your work to your own repository.
  • If you are completely new to GitHub, you can download the data using the green ‘Code’ or ‘Clone or download’ buttons towards the top of this page, click on button and select ‘Download ZIP’, then send your answers by email to HED@uhb.nhs.uk.

Git is a tool used within the HED Data Science work-flow, and small weighting will be added in-favour of using GitHub properly.

Once you have downloaded the data/repository:

  • Load the simulated_data.csv file to your analysis environment of choice. We encourage the use of R and Rmarkdown in particular, but are happy to assess most common analysis environments provided we can access them.

Please submit you answers by 1700 BST, Monday 3rd August 2020

If you have questions, or need further support, please contact HED@uhb.nhs.uk