Duplicate lines in atomic database

Question

Opened this issue a year ago · 1 comment

Describe the bug

The atomic database generated from the quickstart notebook (i.e. the TARDIS quickstart database kurucz_cd23_chianti_H_He) contains duplicate lines_data/lines entries. This can cause problems, e.g. when reindexing the database (see TARDIS #2442).

To Reproduce

import pandas as pd
from tardis.io.atom_data.util import download_atom_data

download_atom_data('kurucz_cd23_chianti_H_He')
store = pd.HDFStore('/path/to/kurucz_cd23_chianti_H_He.h5')

def check_duplicates(df, verbose=False):
    # Index values that occur more than once in the table
    dup_idx = df.index[df.index.duplicated()]
    not_identical = 0
    for idx in dup_idx:
        data = df.loc[idx]
        assert len(data) > 1, "Expected more than one row for a duplicated index"
        for i in range(len(data) - 1):
            # Compare consecutive rows; line_id is expected to differ, so drop it
            data_a = data.iloc[i].drop(labels="line_id")
            data_b = data.iloc[i+1].drop(labels="line_id")
            if not data_a.equals(data_b):
                if verbose:
                    print(data)
                not_identical += 1
    if not_identical > 0:
        raise ValueError("Not all data is identical! (%d not identical)" % not_identical)

check_duplicates(store["lines"])
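The same check can probably be done without the row-by-row loop. Below is a minimal sketch, not tested against the real database: `split_duplicates` is a hypothetical helper, and the toy frame with its `wavelength` column is made up for illustration. It separates rows with a repeated index into exact copies (identical apart from `line_id`) and conflicting rows.

```python
import pandas as pd

def split_duplicates(df, ignore=("line_id",)):
    """Split rows with a repeated index into exact copies and conflicting rows."""
    # Keep every row whose index value occurs more than once
    repeated = df[df.index.duplicated(keep=False)]
    # Compare the index plus all columns except the ignored ones
    flat = repeated.drop(columns=list(ignore)).reset_index()
    exact = flat.duplicated(keep=False).to_numpy()
    return repeated[exact], repeated[~exact]

# Toy stand-in for store["lines"]; values are invented for illustration
toy = pd.DataFrame(
    {"line_id": [0, 1, 2, 3, 4], "wavelength": [1.0, 1.0, 2.0, 2.5, 3.0]},
    index=["a", "a", "b", "b", "c"],
)
exact_rows, conflicting_rows = split_duplicates(toy)
print(len(exact_rows), len(conflicting_rows))
```

On the real table, `split_duplicates(store["lines"])` would be the corresponding call; a non-empty second frame is the case in which `check_duplicates` raises.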

System

OS:
- GNU/Linux
- macOS
Environment (conda list):
(Default carsus env)

Answer 1 · 2024-09-10T17:44:13.000Z

I'm not sure why this code raises a ValueError when the data are not identical; I assume it should be looking at the count of duplicates instead. I tried this with a newly generated output from Carsus and there don't seem to be duplicates (hard to say), so regenerating the basic TARDIS atom data seems like a good idea.
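For what it's worth, counting the duplicates directly could look something like the sketch below. This is only an assumption about what the check is meant to report: `count_duplicates` is a hypothetical helper, and the toy frame (including the index level names) is invented for illustration rather than taken from the real file.

```python
import pandas as pd

def count_duplicates(df):
    """Return (extra rows sharing an index value, distinct duplicated index values)."""
    # True for every repeat of an index value after its first occurrence
    dup_mask = df.index.duplicated(keep="first")
    n_extra_rows = int(dup_mask.sum())
    n_dup_keys = len(df.index[dup_mask].unique())
    return n_extra_rows, n_dup_keys

# Toy stand-in for the lines table; index names and values are assumptions
toy_lines = pd.DataFrame(
    {"line_id": [0, 1, 2, 3], "wavelength": [100.0, 100.0, 200.0, 300.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0, 0, 1), (1, 0, 0, 1), (1, 0, 0, 2), (2, 0, 0, 1)],
        names=["atomic_number", "ion_number",
               "level_number_lower", "level_number_upper"],
    ),
)
n_extra_rows, n_dup_keys = count_duplicates(toy_lines)
print(n_extra_rows, n_dup_keys)
```

A result of `(0, 0)` on a regenerated file would mean there are no repeated index entries at all.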