Revise the makeup of the central metadata tables
jordantgh commented
The central fact tables should record, for each table, the PMCID, title, and URL of the study it came from. The original spreadsheet URL/filename would also be useful, as would the sheet name/number, if any. These fields are in addition to the full unique table IDs. This will make downstream analysis easier, e.g. grouping tables by study.
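As a rough illustration of the fields listed above, here is a minimal sketch of what one metadata row could look like, and how it would allow grouping tables by study. The `TableMetadata` class, its field names, the `group_by_study` helper, and all the example values (PMCIDs, titles, URLs) are made up for illustration; none of them exist in the codebase yet.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableMetadata:
    """One row of the metadata table (hypothetical schema, not the current one)."""
    table_id: str               # full unique table ID
    pmcid: str                  # PMCID of the source study
    title: str                  # study title
    study_url: str              # URL of the study
    file_url: str               # URL of the original spreadsheet
    filename: str               # filename of the original spreadsheet
    sheet_name: Optional[str]   # sheet name/number, None if not applicable


def group_by_study(rows):
    """Map each PMCID to the IDs of the tables extracted from that study."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.pmcid].append(row.table_id)
    return dict(grouped)


# Placeholder rows; every value here is invented for the example.
rows = [
    TableMetadata("supp1_Sheet1_Table0", "PMC0000001", "Example study A",
                  "https://example.org/PMC0000001",
                  "https://example.org/supp1.xlsx", "supp1.xlsx", "Sheet1"),
    TableMetadata("supp1_Sheet2_Table0", "PMC0000001", "Example study A",
                  "https://example.org/PMC0000001",
                  "https://example.org/supp1.xlsx", "supp1.xlsx", "Sheet2"),
    TableMetadata("supp2_Sheet1_Table0", "PMC0000002", "Example study B",
                  "https://example.org/PMC0000002",
                  "https://example.org/supp2.xlsx", "supp2.xlsx", "Sheet1"),
]

tables_by_study = group_by_study(rows)
```

With something like this, grouping by study is a lookup on `pmcid` rather than parsing it back out of the unique table ID.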
jordantgh commented
The relevant code for getting some of the key names is in app/model/tabular_operations.py. We should investigate where it is most efficient to add them to the metadata table.
```python
def parse_tables(selected_articles, ...):
    for index, article in enumerate(selected_articles):
        processed_table_ids = []
        for file in article.supp_files:
            ...
            try:
                fname = download_supp(file.url)
                if fname is None:
                    continue
                ...
                data = extract_dfs(fname)
                ...
                for sheetname, df in data.items():
                    ...
                    base_name = os.path.splitext(fname)[0]
                    unique_id = f"{base_name}_{sheetname}_Table{i}"
```