jordantgh/meta-analyser

Revise the makeup of the central metadata tables


The central fact tables should include the PMCID, title, and URL of the study each table came from. The original spreadsheet URL/filename would also be useful, as would the sheet name/number, if any. These fields are in addition to the full unique table IDs. Together they will make downstream analysis easier, e.g. grouping tables by study.
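As a rough sketch of what one row of such a table could hold, and of the "group tables by study" use case, something like the following might work. The class name, field names, and `group_by_study` helper are all hypothetical and not part of the current codebase; the values used with them would be whatever the app already collects.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableMetadata:
    """One row of the proposed metadata table (all field names are hypothetical)."""
    table_id: str              # full unique table ID
    pmcid: str                 # PMCID of the study the table came from
    title: str                 # study title
    article_url: str           # URL of the study
    file_url: str              # URL of the original spreadsheet
    file_name: str             # filename of the original spreadsheet
    sheet_name: Optional[str]  # sheet name/number, or None if there is none


def group_by_study(rows):
    """Map each PMCID to the IDs of the tables extracted from that study."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.pmcid].append(row.table_id)
    return dict(grouped)
```

Keeping the PMCID on every row means grouping by study needs no join against a separate article table, at the cost of repeating the study-level fields for each table.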

The relevant code for getting some of the key names is in app/model/tabular_operations.py (excerpt below). We should investigate where it is most efficient to add these fields to the metadata table.

```python
def parse_tables(selected_articles, ...):
    for index, article in enumerate(selected_articles):
        processed_table_ids = []
        for file in article.supp_files:
            ...
            try:
                fname = download_supp(file.url)
                if fname is None:
                    continue
                ...
                data = extract_dfs(fname)
                ...
                for sheetname, df in data.items():
                        ...
                        base_name = os.path.splitext(fname)[0]
                        unique_id = f"{base_name}_{sheetname}_Table{i}"
```
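One possible place to collect the extra fields is the point where `unique_id` is built, since the article, file, filename, and sheet name are all in scope there. The sketch below assumes a hypothetical helper `build_metadata_row` and assumes the article object exposes `pmcid`, `title`, and `url` attributes; only `file.url` is confirmed by the excerpt above, so the other attribute names would need checking against the real classes.

```python
import os
from types import SimpleNamespace


def build_metadata_row(article, file, fname, sheetname, i):
    """Collect study- and file-level metadata for one extracted table.

    Hypothetical helper: `article.pmcid`, `article.title` and `article.url`
    are assumed attribute names; only `file.url` appears in the excerpt.
    """
    base_name = os.path.splitext(fname)[0]
    return {
        # Same ID format as in parse_tables.
        "table_id": f"{base_name}_{sheetname}_Table{i}",
        "pmcid": article.pmcid,
        "title": article.title,
        "article_url": article.url,
        "file_url": file.url,
        "file_name": os.path.basename(fname),
        "sheet_name": sheetname,
    }


# Stand-in objects with placeholder values, only to exercise the helper.
article = SimpleNamespace(
    pmcid="PMC0000000",
    title="Placeholder title",
    url="https://example.org/PMC0000000",
)
file = SimpleNamespace(url="https://example.org/supp.xlsx")
row = build_metadata_row(article, file, "supp.xlsx", "Sheet1", 0)
```

Building the row inside the innermost loop avoids parsing the unique ID later to recover the filename and sheet name.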