Data Engineer Test

The purpose of this exercice is ...

Part I

Download and save csv files in utf-8 encoding from Base de donnée publique du médicament (Bdpm).
You will only need the composition and specialties files (Fichier des spécialités, Fichier des compositions)

Tips: Files are tab (\t) seprated and encoded using cp1252

Open the archive bdpm_headers.tgz (in exercice/)to acess the headers for the Bdpm's csv.

Open the Bdpm's csv in pandas and add the corresponding headers to them.

Pull a docker image of mysql and run it ! (Create a dockerfile in results/mysql/ showing what you did).
Connect to it from within you previously created script (the one with the pandas dataframes) then add the dataframe to the mysql instance (one dataframe is a separated table in the database named respectively CIS_bdpm and COMPO_bdpm).
You can view your newly created mysql database using a GUI tool like TablePlus (mac/windows) or Sequel Pro (mac only) or simply using CLI.
Make a dump of the created database, name it bdpm.sql, compress its and drop it in results/mysql/.

In results/mysql/ create a bdpm_request.sql file.
In this file we will write the sql request to join CIS_bdpm and COMPO_bdpm.
(There is usefull resources to learn how to do a join in mysql here).
We will join COMPO_bdpm to CIS_bdpm on the equivalence of the foreign key cis.
In a new python script, using pandas, execute the previously created sql request.
You can print the 5 first lines of the resulting pandas dataframe.

In the resulting dataframe we want to keep only active substances, so you'll need to filter it by sub_nature = SA.
In the same dataframe, we want to create a new column called normalized_longname that is a copy of the longname column except we want the values to be lowered and cleared of any accents. In addition we want to get rid of the portion corresponding to the drug form after the coma. Ex: ANASTROZOLE ACCORD 1 mg, comprimé pelliculé -> anastrozole accord 1 mg

Tips: We want to convert the coma between digit to a dot.

To avoid unwanted data loading, we can filter the active substances directly in mysql.
Redo Step 6 but this time don't filter (on sub_nature= SA) in pandas but directly by amending bdpm_request.sql (practically create a new file called filtered_bdpm_request.sql in the same directory)

In a new python file / notebook, load this html web page Zopiclone - Wikipédia.
From the harvested page, extract p and h2.
We end up with a list of selectors, we want to fusion p until we meet a h2 then start fusionning again.
Fusion the h2 text content to the corresponding paragraph.

Now that we have some clean paragraphs, we need to allow full text query in them. For this part we will use Elasticsearch.
Use the docker-compose file from here and start the Elasticsearch, Kibana, Logstash (ELK) stack.
You will need to use a way to index documents from python into elasticsearch. You can use this package.
When documents are indexed, take a look at kibana to check everything is correctly in.
Now create a function that, given a string, query Elasticsearch and return the result.
Print the result then save it as a json file in results/elasticsearch/.