Extract the EU Clinical Trials registry into a database for analysis
The intent of this project is to make information in the EU Clinical Trials Registry (EUCTReg) accessible for analysis. The problem is that although the EUCTReg is online and searchable, and was established to make clinical trial regulatory information transparently available, it was deliberately designed to output data in a way that does not facilitate research. This is a shame, since other clinical trial registries such as ClinicalTrials.gov provide a wealth of useful information for researchers involved with clinical trials, but only a fraction of trials conducted in the European Union are replicated in other registries.
The first step was to develop a Python script (scrape.py) to download the full text content of the registry. That is working, but rather than have everyone run it and potentially burden the EUCTReg website with large data requests, my plan is to run the script periodically and upload the result to a public repository. The resulting text file is about 2 GB in size, so the next step is the scan.py script, which parses the text file and stores the result in an SQLite database (for now; eventually, a Postgres database). I did not attempt to extract every data element in the registry, but the script was written with modification in mind. Finally, the toexcel.py script provides a very rudimentary search facility and outputs the selected data set as an Excel spreadsheet. The future plan is to build a web front-end to facilitate searching and provide both on-screen display and downloads of search results. Getting to that point could take a while. For now, I suggest taking advantage of what is already available: the dump of the full text contents of the website, and the corresponding SQLite database or Excel spreadsheet that includes all registry entries. This is a work in progress, so expect formats and features to shift as the process is refined. I should mention that I am not at all experienced when it comes to Python, so suggestions are welcome.
The EU Clinical Trials Registry is a public-facing portal that makes available non-confidential information stored in the EudraCT (European Union Drug Regulating Authorities Clinical Trials) database, which was established under the provisions of the EU clinical trial directive of 2001. That directive took effect in 2004, and since that time, when sponsors submit clinical trial applications to the competent authorities of EU member states (CAs, i.e., regulatory agencies), the information goes into the EudraCT database, which is maintained by the European Medicines Agency (EMA). The database does not include all kinds of clinical trials -- its remit is limited to those that involve a medicinal product for human use and at least one EU member state. Consequently, the registry does not list trials that focus solely on surgical procedures, medical devices or psychotherapeutic procedures. Additionally, the registry does not list trials that are phase I only (phase I/II trials are listed). Some pediatric studies completed prior to 2007 are listed on another part of the registry website; since they follow a different format, they are not part of this project. Pediatric studies after that point are listed in the registry.
The EMA does not offer any API or web services for external parties to extract data from the registry; the only interface is a web form that is intended to be used manually. Aside from on-screen display in HTML, the only export format is plain text. Clinical trial applications (CTAs) are submitted to EU member state competent authorities as XML documents, and these roots are apparent in the plain text export, which follows the same outline but, unfortunately, lacks the formal tag structure.
A major complicating factor is that a given trial, represented by a unique EudraCT number, appears multiple times in the listing because sponsors were required to submit a version of the application to each EU member state involved. This allowed certain items to vary across jurisdictions and some items to be entered in the local language. However, it is also a huge source of error, as the process involved manual data entry and not much data validation. Different versions of the CTA for the same trial are difficult to align because the data often vary (spelling, spacing, punctuation, language), appear in a different order, represent different times in the trial, and are sometimes contradictory. The process has apparently improved over time, and more recent records have better data quality.
The 2014 Clinical Trial Regulation has not entered into force (at the time of writing), but it is likely to further improve data quality, as sponsors will submit a CTA to a single competent authority. It is not clear whether the new regulation will result in other structural changes in the registry.
-
Get the registry contents - the scrape.py script submits a blank query to the site, so the site replies with a paginated index of all clinical trials. The script determines the number of the final page and then requests page after page of text until it reaches the final page. All of that text ends up in one large, UTF-8 encoded text file named with the current date. The file is large, about 2 GB, and takes hours to download. This is not a matter of the connection on my end: I run this script on a computer at my ISP, so connectivity is good. I have the impression that the registry server is slow for some reason, perhaps intentionally throttled. All the more reason to grab the file that I have already uploaded to my Google Drive.
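The paging loop described above can be sketched roughly as follows. This is not the actual scrape.py: the `fetch_index` and `fetch_text` callables stand in for whatever HTTP requests the real script makes, and the `page=N` link pattern and the output file name are assumptions made purely for illustration.

```python
import datetime
import re
from typing import Callable

# Hypothetical pattern: assumes the index page links to result pages as "page=N".
PAGE_LINK = re.compile(r"page=(\d+)")


def last_page_number(index_html: str) -> int:
    """Return the highest page number linked from the index page (1 if none)."""
    numbers = [int(n) for n in PAGE_LINK.findall(index_html)]
    return max(numbers, default=1)


def dated_filename(today: datetime.date) -> str:
    # Hypothetical naming scheme; the text only says the file is named with the date.
    return f"euctr-{today.isoformat()}.txt"


def download_all(fetch_index: Callable[[], str],
                 fetch_text: Callable[[int], str],
                 out_path: str) -> int:
    """Write the text of every result page to one UTF-8 file; return the page count."""
    final_page = last_page_number(fetch_index())
    with open(out_path, "w", encoding="utf-8") as out:
        for page in range(1, final_page + 1):
            out.write(fetch_text(page))
            out.write("\n")
    return final_page
```

Passing the fetch functions in as parameters keeps the loop testable without touching the registry server, which matters given how slow the real downloads are.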
-
Parse the text into a database. One approach would have been to put the data back into XML format and then use standard tools to process it, more or less trying to put the genie back in the bottle. I took the less elegant approach of going through the text file line by line, matching up headings and extracting data. I encountered a few difficulties along the way:
- Redundancy of the records - A separate clinical trial application is submitted for every EU member state participating in the trial, so there are many pages of redundant text for each trial; the script consolidates these entries into a single view of the trial.
- Records for the same trial are not consistent - There can be contradictions between data submitted in different EU member states due to language differences, entry errors, and changes that may have occurred as the study matured. The script takes a conservative approach to merging differences. For example, if a field is blank in one record but a value is supplied in another, it takes the value supplied. In general, it does not try to correct data except for making spacing regular. Trial status is often not concordant among records, which may be due to error or to the trial closing in one member state before another. Since my interest is the trial as a whole, if the date of global completion is listed but the trial status field indicates it is still ongoing, the value is updated to "not ongoing". That value does not indicate whether the trial closed prematurely, was withdrawn or continued until completion.
- IMP entries are complicated - IMPs (Investigational Medicinal Products) are numbered in each CTA, but a given IMP may be listed in a different order in one member state's submission versus another. It is difficult to match up the IMPs because different local names may be used for the same product, and often a dose or other information is appended to the same field. Where there is some overlapping term in trade name, product name or sponsor's product code, the script consolidates the entry. In some cases, none of these fields overlap for a given drug, so that drug may have more than one entry in the database -- better to be redundant than to throw away data. For each IMP, the registry sometimes lists active substances. One IMP may have multiple active substances. Due to the difficulty of sorting out IMPs from the text file, I did not attempt to extract information at the active substance level, although that could be useful information.
- Sponsor data is hard to consolidate - When the sponsor data is exactly the same between records for a given trial, it is consolidated, but very often the sponsor's name is written differently across the applications: differences in the national name of a company, abbreviations, an extra comma here and there, etc. Also, a given sponsor may have one CRO run the trial in one country and a different CRO in another country. Often, there are different sponsor contacts in different regions. For these reasons, there may be multiple sponsor entries for a given study. Again, the decision was to err in the direction of retaining information, mostly to facilitate searching.
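The merging rules in the list above can be sketched as follows. This is a minimal illustration rather than the code in scan.py: the field names (`status`, `global_end_date`, `trade_name`, `product_name`, `product_code`) are hypothetical, and treating "overlap" as an exact, case-insensitive match of a whole field value (including across fields) is an assumption.

```python
import re

# Hypothetical field names for the three identifying IMP fields.
IMP_KEYS = ("trade_name", "product_name", "product_code")


def normalize(value):
    """Collapse runs of whitespace and trim; no other correction is attempted."""
    return re.sub(r"\s+", " ", value or "").strip()


def merge_records(records):
    """Merge the per-member-state records of one trial, keeping the first non-blank value."""
    merged = {}
    for record in records:
        for field, value in record.items():
            value = normalize(value)
            if value and not merged.get(field):
                merged[field] = value
    # Trial-level rule: a global completion date overrides an "ongoing" status.
    if merged.get("global_end_date") and merged.get("status", "").lower() == "ongoing":
        merged["status"] = "not ongoing"
    return merged


def imp_terms(imp):
    """Lower-cased, non-blank identifying terms of one IMP entry."""
    return {normalize(imp.get(key)).lower() for key in IMP_KEYS} - {""}


def consolidate_imps(imps):
    """Fold an IMP into an earlier entry when any identifying term overlaps."""
    groups = []  # list of (set of terms seen so far, merged entry)
    for imp in imps:
        terms = imp_terms(imp)
        for group_terms, entry in groups:
            if terms & group_terms:
                group_terms |= terms  # in-place update of the group's term set
                for key, value in imp.items():
                    if normalize(value) and not entry.get(key):
                        entry[key] = normalize(value)
                break
        else:
            # No overlap with any earlier entry: keep it as a separate IMP.
            groups.append((terms, {k: normalize(v) for k, v in imp.items()}))
    return [entry for _, entry in groups]
```

Exact matching like this would miss a field with a dose appended to the name, so such an IMP would stay as a separate entry, which is consistent with preferring redundancy over discarding data.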
-
Output to Excel. It is easy to manipulate and export data from Excel, and not everyone wants to deal directly with a database, so the final script bridges the data from the database into an Excel spreadsheet with one row per trial. Since there is a one-to-many relationship for drug and sponsor information, these fields are condensed. All the drugs are listed in one cell, with terms separated by semicolons. The same applies to sponsor information. Not all the drug and sponsor information appears in Excel -- if needed, it can be accessed directly from the database.
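Condensing the one-to-many fields into one row per trial can be done in the query itself. The sketch below assumes a hypothetical three-table schema (`trial`, `drug`, `sponsor`); the real database layout produced by scan.py may differ, and writing the rows out to a spreadsheet is left to whichever Excel library is used.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE trial   (eudract_number TEXT PRIMARY KEY, title TEXT);
CREATE TABLE drug    (eudract_number TEXT, name TEXT);
CREATE TABLE sponsor (eudract_number TEXT, name TEXT);
"""


def trial_rows(conn):
    """Return one row per trial, with drugs and sponsors each condensed into one cell."""
    query = """
        SELECT t.eudract_number,
               t.title,
               (SELECT group_concat(d.name, '; ')
                  FROM drug d
                 WHERE d.eudract_number = t.eudract_number) AS drugs,
               (SELECT group_concat(s.name, '; ')
                  FROM sponsor s
                 WHERE s.eudract_number = t.eudract_number) AS sponsors
          FROM trial t
         ORDER BY t.eudract_number
    """
    return conn.execute(query).fetchall()
```

Using a separate subquery for each list avoids the duplicated names that would appear if both one-to-many tables were joined to `trial` in a single query.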