Revise the makeup of the central metadata tables

Question

Opened this issue a year ago · 1 comment

The central fact tables should include the PMCID, title, and URL of the source study for each table. The original spreadsheet URL or filename would also be useful, as would the sheet name or number, if any. These fields are in addition to the full unique table ID. Together they would make downstream analysis easier, e.g. grouping tables by study.
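To make the request concrete, here is a rough sketch of what one row of the extended metadata table might hold, and how it would support grouping by study. The class name, field names, and sample values are all hypothetical placeholders, not the project's existing schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableMetadata:
    """One row of the central metadata table (hypothetical field names)."""
    table_id: str                     # full unique table ID
    pmcid: str                        # PMCID of the source study
    title: str                        # study title
    article_url: str                  # study URL
    source_url: str                   # original spreadsheet URL
    source_filename: str              # original spreadsheet filename
    sheet_name: Optional[str] = None  # sheet name/number, if any


def group_by_study(rows):
    """Map each PMCID to the IDs of the tables extracted from that study."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.pmcid].append(row.table_id)
    return dict(grouped)


# Made-up placeholder rows, for illustration only.
rows = [
    TableMetadata("supp1_Sheet1_Table0", "PMC0000001", "Example study A",
                  "https://example.org/PMC0000001",
                  "https://example.org/supp1.xlsx", "supp1.xlsx", "Sheet1"),
    TableMetadata("supp1_Sheet2_Table0", "PMC0000001", "Example study A",
                  "https://example.org/PMC0000001",
                  "https://example.org/supp1.xlsx", "supp1.xlsx", "Sheet2"),
    TableMetadata("supp2_Sheet1_Table0", "PMC0000002", "Example study B",
                  "https://example.org/PMC0000002",
                  "https://example.org/supp2.xlsx", "supp2.xlsx", "Sheet1"),
]
tables_by_study = group_by_study(rows)
```

With the PMCID stored on every row, grouping becomes a single pass over the table rather than a parse of the unique ID string.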

Answer 1 · 2023-10-11T06:25:31.000Z

The relevant code for building some of the key names is in app/model/tabular_operations.py. We should investigate where it is most efficient to add these fields to the metadata table.

```python
def parse_tables(selected_articles, ...):
    for index, article in enumerate(selected_articles):
        processed_table_ids = []
        for file in article.supp_files:
            ...
            try:
                fname = download_supp(file.url)
                if fname is None:
                    continue
                ...
                data = extract_dfs(fname)
                ...
                for sheetname, df in data.items():
                        ...
                        base_name = os.path.splitext(fname)[0]
                        unique_id = f"{base_name}_{sheetname}_Table{i}"
```
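One candidate spot is the point where `unique_id` is built, since `file.url`, `fname`, and `sheetname` are all in scope there. The helper below is only a sketch under that assumption: `build_table_record` is a hypothetical name, and the `pmcid`, `title`, and `article_url` arguments stand in for whatever the `article` object actually exposes, which the excerpt does not show.

```python
import os


def build_table_record(pmcid, title, article_url, file_url,
                       fname, sheetname, table_index):
    """Return one metadata row (as a dict) for a single extracted table.

    Hypothetical helper: the ID format mirrors the f-string in the excerpt,
    but the dictionary keys are illustrative, not an existing schema.
    """
    base_name = os.path.splitext(fname)[0]
    return {
        "table_id": f"{base_name}_{sheetname}_Table{table_index}",
        "pmcid": pmcid,
        "title": title,
        "article_url": article_url,
        "source_url": file_url,                      # original spreadsheet URL
        "source_filename": os.path.basename(fname),  # filename without directory
        "sheet_name": sheetname,
    }


# Placeholder values, for illustration only.
record = build_table_record(
    pmcid="PMC0000001",
    title="Example study A",
    article_url="https://example.org/PMC0000001",
    file_url="https://example.org/supp/data.xlsx",
    fname="downloads/data.xlsx",
    sheetname="Sheet1",
    table_index=0,
)
```

Building the record next to the ID keeps the two in sync, at the cost of passing the article-level fields down into the innermost loop.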