| id (string) | first_name (string) | last_name (string) | age (integer) | gender (string) | visit_date (date) |
|---|---|---|---|---|---|
- Create a new database (a SQL RDBMS, e.g. PostgreSQL or SQLite).
- Import the "users" CSV file into a new db table called "users".
- Create a new table named "daily_user_counts". The new table should have 5 columns: year, month, day, observed, and count.
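The two tables above can be sketched as follows. This is a minimal illustration using an in-memory SQLite database for portability; the same DDL works with minor type changes on PostgreSQL, and the column types are inferred from the spec above rather than taken from any actual schema file.

```python
import sqlite3

# In-memory SQLite stand-in for the real database.
conn = sqlite3.connect(":memory:")

# "users" table mirroring the CSV columns described above.
conn.execute("""
    CREATE TABLE users (
        id TEXT,
        first_name TEXT,
        last_name TEXT,
        age INTEGER,
        gender TEXT,
        visit_date DATE
    )
""")

# "daily_user_counts" with the 5 required columns.
conn.execute("""
    CREATE TABLE daily_user_counts (
        year INTEGER,
        month INTEGER,
        day INTEGER,
        observed BOOLEAN,
        count INTEGER
    )
""")
```

Importing the CSV would then be a matter of `pandas.read_csv(...).to_sql("users", conn, if_exists="append", index=False)` with the deployment-specific file path.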
- Create a Python 3 script that:
- Reads the "users" table into a pandas dataframe.
- Counts the number of users by day.
- Calculates the number of users expected to sign up 1 day into the future. Note: a simple mean/average is sufficient for communicating an expected value.
- Appends the new record with the expected count to the dataframe.
- Loads/inserts the results of the calculations (the dataframe) into the "daily_user_counts" table.
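The script's steps can be sketched as below. This is an illustrative walk-through against an in-memory SQLite database with made-up seed rows; a real run would connect to the PostgreSQL instance, and the id values and dates here are purely hypothetical.

```python
import sqlite3
import pandas as pd

# Stand-in database with a few seed rows for illustration.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "id": ["17-0042", "18-7001", "19-1234"],
    "visit_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
}).to_sql("users", conn, index=False)

# 1. Read the "users" table into a dataframe.
users = pd.read_sql("SELECT * FROM users", conn)
users["visit_date"] = pd.to_datetime(users["visit_date"])

# 2. Count users per day (the observed rows).
daily = (users.groupby(users["visit_date"].dt.date)
              .size()
              .rename("count")
              .reset_index())
daily["observed"] = True

# 3. Expected signups one day out: a simple mean of the daily counts.
expected = daily["count"].mean()
next_day = daily["visit_date"].max() + pd.Timedelta(days=1)

# 4. Append the forecast row, flagged as not observed.
daily = pd.concat(
    [daily, pd.DataFrame([{"visit_date": next_day,
                           "count": expected,
                           "observed": False}])],
    ignore_index=True)

# 5. Split the date into year/month/day and load into daily_user_counts.
daily["visit_date"] = pd.to_datetime(daily["visit_date"])
daily["year"] = daily["visit_date"].dt.year
daily["month"] = daily["visit_date"].dt.month
daily["day"] = daily["visit_date"].dt.day
daily[["year", "month", "day", "observed", "count"]].to_sql(
    "daily_user_counts", conn, index=False, if_exists="append")
```

With `if_exists="append"`, the insert targets the pre-created table in the real deployment (and creates it on the fly in this sketch).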
- The above request is needed to solve a problem, and it outlines the "best" solution considered.
- The solution will be considered a one-off used to solve an edge case.
- In a typical implementation, a Docker container would likely not be used for the PostgreSQL db; instead, it would be deployed to RDS.
- The csv file contains sensitive data
- The id column is composed of two ids: a user id and possibly the last four digits of a Social Security number.
- The analysis of the user data would be needed on a repeated schedule
- Security - Are there any guidelines in place to enforce HIPAA compliance and/or other security standards?
- Speed to deploy vs. repeatability - Because this is considered a one-off, speed is favored over the ability to recreate it multiple times.
- Because of the compound user id, the full id is needed for uniqueness.
- The second section of the id can have a leading zero, so it is treated as a string.
- When addressing uniqueness, the original id structure is needed.
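The id handling described above can be illustrated as follows. This is a hypothetical sketch: the `"-"` separator and the sample values are assumptions for illustration, not taken from the actual data.

```python
import pandas as pd

# Assumed id structure: "<user_id>-<last_four>", where last_four may
# carry a leading zero and therefore must stay a string.
ids = pd.Series(["17-0042", "18-7001"], dtype="string")

# Split into the two parts without losing the leading zero.
parts = ids.str.split("-", n=1, expand=True)
parts.columns = ["user_id", "last_four"]

# The full original id remains the uniqueness key.
parts["id"] = ids
```

Keeping the column as a string dtype end to end (including on import with `dtype=str`) is what prevents `0042` from silently becoming `42`.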
- Forecasting the number of users expected to sign up the next day would likely be a repeated need, so it makes sense to implement a class.
- Also, such a requirement would likely call for a more complicated solution that is refined over time. By having it in a class, it can more easily be shaped into a more refined forecast model.
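The class-based shape argued for above might look like this minimal sketch. The class name and method names are hypothetical; the point is that the simple mean lives behind an interface that a more refined model could later replace.

```python
import pandas as pd

class SignupForecaster:
    """Minimal mean-based one-step forecaster; the predict method is
    the seam where a more refined model (moving average, ARIMA, etc.)
    could later be swapped in."""

    def fit(self, daily_counts: pd.Series) -> "SignupForecaster":
        # daily_counts: observed signup counts, one value per day.
        self._counts = daily_counts
        return self

    def predict_next_day(self) -> float:
        # A simple mean is sufficient here for an expected value.
        return float(self._counts.mean())

# Usage with made-up daily counts:
counts = pd.Series([2, 1, 3])
forecast = SignupForecaster().fit(counts).predict_next_day()
```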
- With the current deployment, the tables have no primary keys. Although primary keys are important, the implementation didn't involve any joins, so I thought it would be acceptable to leave them out. Ideally, the user_id and last_four columns would form the composite key for the users table, and the index column would be the primary key.
- Lastly, I admit the solution is a bit convoluted for the ask; however, this approach provides better visibility into my skills.
- Set environment variables as in the following example. They will be read via the env.list files.

  Example:

  ```shell
  export POSTGRES_USER="some_user_name"
  export POSTGRES_PASSWORD="some_super_secret_pw"
  export CSV_FILE="/full/path/to/file.csv"
  export POSTGRES_HOST="some-postgres"
  ```
- Run the following from the root directory to build and run the PostgreSQL database and load the CSV into the db:

  ```shell
  docker-compose -f docker-compose.yaml up
  ```
- Run the following from the "process_user_signup_forecast" directory when ready to create the daily_user_counts table:

  ```shell
  docker-compose -f docker-compose.yaml up
  ```