/Take-home-Allen

My attempt

Primary LanguageJupyter Notebook

Take-Home

The take-home problem for candidates for pivotal life sciences (LLM data scientist).

We are looking for a candidate who can demonstrate their ability to learn new technologies and solve problems. Take this as an opportunity to showcase parts of your skillset.

You are free to present the information in any way you desire, while jupyter notebooks are preferred, but not required (there is no penalty for using a different format, use whatever you think you will perform strongest with).

There are two parts to the assignment - 1 - a series of NLP/LLM related questions below and 2 - the coding assignment - both parts should be completed for the assignment.

Part 1:

  • Have you worked with any LLM libraries or frameworks (e.g. NLTK, GPT-4)? If so, can you describe your experience and the specific tasks you used them for?
  • How do you approach preprocessing and cleaning text data for LLM tasks? What techniques do you use to deal with common issues such as noise, missing data, or outliers?
  • How do you imagine LLM being applied to investment tasks such as risk assessment, market analysis, or financial forecasting? Can you provide an example of a specific application or use case?
  • How do you envision using LLMs to extract insights from financial news articles, social media posts, medical literature, or other text-based sources of information?
  • Have you worked with any financial data sources that incorporate text data, such as earnings call transcripts or analyst reports? If so, can you describe your experience working with these types of data and the insights you were able to glean? If not, you can use any other text based data that you've worked with that might be relevant in the life sciences investment context.
  • How do you approach using LLMs to inform a comprehensive investment analysis? Can you provide an example of a time when you combined text data with other types of financial data to inform a business decision?

Submission Instructions

All answers should be in the form of a pull request from a fork of this repository (being able to use git basics is a requirement for this position). You will not be able to push/merge branches directly (you do not have permissions) so please fork this repository and submit a pull request from there (via "contribute").

Part 2:

Choose 1 exercise to complete from the following list:

There preferred exercise for the LLM role is Exercise #3 but feel free to choose any of the exercises that best showcase your LLM experience.

Exercise 1: Chess data

The file chess_games.csv is a collection of chess games from Lichess and Chess.com along with a collection of metrics of each player's performance during the game (the actual game is not included).

The task is this: Determine the difference in player behavior when they’re winning vs losing vs maintaining their elo score

Be sure to look at the problem carefully, while we are looking at overall patterns amongst players, how you break up a player's performance into winning, losing, and maintaining is important.

If you have any questions please reach out to shiva.amiri@pivotallifesciences.com.

Exercise 2: Financial data

This exercise is a BYOD (bring your own data) exercise. You are free to use any data you wish, but you must either include the data as a file in the repository or provide a link to the data in the README.md file. (No matter what you choose, you must explain why you chose this particular data set).

The task is this: which companies inside the XBI index were hit the hardest by the latest market downturn? Why?

This problem is equally graded on a candidate's ability to communicate their thought process as it is their ability to pull and extract meaningful data for a problem.

If you have any questions please reach out to shiva.amiri@pivotallifesciences.com.

Exercise 3: Reddit Data

The file askscience_data.csv is a collection of posts from the subreddit r/askscience. The task comes with two parts:

  1. Determine the attributes of a successful post on r/askscience
  2. Build a model that can predict the score of a post on r/askscience given at least the title and body of the post (There is no need to limit it to just the title and body, but you must explain why you chose the features you did).

If you have any questions please reach out to shiva.amiri@pivotallifesciences.com.