The purpose of this exercice is ...
- Download and save csv files in utf-8 encoding from Base de donnée publique du médicament (Bdpm).
- You will only need the composition and specialties files (
Fichier des spécialités
,Fichier des compositions
)
Tips: Files are tab (\t
) seprated and encoded usingcp1252
Open the archive bdpm_headers.tgz
(in exercice/
)to acess the headers for the Bdpm's csv.
Open the Bdpm's csv in pandas and add the corresponding headers to them.
- Pull a docker image of
mysql
and run it ! (Create adockerfile
inresults/mysql/
showing what you did). - Connect to it from within you previously created script (the one with the pandas dataframes) then add the dataframe to the
mysql
instance (one dataframe is a separated table in the database named respectivelyCIS_bdpm
andCOMPO_bdpm
). - You can view your newly created
mysql
database using a GUI tool like TablePlus (mac/windows) or Sequel Pro (mac only) or simply using CLI. - Make a dump of the created database, name it
bdpm.sql
, compress its and drop it inresults/mysql/
.
- In
results/mysql/
create abdpm_request.sql
file. - In this file we will write the
sql
request to joinCIS_bdpm
andCOMPO_bdpm
. - (There is usefull resources to learn how to do a join in mysql here).
- We will join
COMPO_bdpm
toCIS_bdpm
on the equivalence of the foreign keycis
. - In a new python script, using pandas, execute the previously created sql request.
- You can print the 5 first lines of the resulting pandas dataframe.
- In the resulting dataframe we want to keep only active substances, so you'll need to filter it by
sub_nature
=SA
. - In the same dataframe, we want to create a new column called
normalized_longname
that is a copy of thelongname
column except we want the values to be lowered and cleared of any accents. In addition we want to get rid of the portion corresponding to the drug form after the coma. Ex:ANASTROZOLE ACCORD 1 mg, comprimé pelliculé
->anastrozole accord 1 mg
Tips: We want to convert the coma between digit to a dot.
- To avoid unwanted data loading, we can filter the active substances directly in mysql.
- Redo
Step 6
but this time don't filter (onsub_nature
=SA
) in pandas but directly by amendingbdpm_request.sql
(practically create a new file calledfiltered_bdpm_request.sql
in the same directory)
- In a new python file / notebook, load this html web page Zopiclone - Wikipédia.
- From the harvested page, extract
p
andh2
. - We end up with a list of selectors, we want to fusion
p
until we meet ah2
then start fusionning again. - Fusion the
h2
text content to the corresponding paragraph.
- Now that we have some clean paragraphs, we need to allow full text query in them. For this part we will use Elasticsearch.
- Use the
docker-compose
file from here and start the Elasticsearch, Kibana, Logstash (ELK) stack. - You will need to use a way to index documents from python into elasticsearch. You can use this package.
- When documents are indexed, take a look at kibana to check everything is correctly in.
- Now create a function that, given a string, query Elasticsearch and return the result.
- Print the result then save it as a json file in
results/elasticsearch/
.