Duplicate lines in atomic database
AlexHls commented
Describe the bug
The atomic database generated from the quickstart notebook (i.e. the TARDIS quickstart database kurucz_cd23_chianti_H_He) contains duplicate lines_data / lines entries. This can lead to issues, e.g., when reindexing the database (see TARDIS #2442).
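As a minimal illustration of why duplicate index labels are a problem, here is a sketch on a toy frame (not the actual TARDIS data; the flat integer index, the column values, and the helper name try_reindex are made up for this example). pandas refuses to reindex a frame whose index contains duplicate labels:

```python
import pandas as pd

# Toy stand-in for the "lines" table: two rows share the index label 0.
toy_lines = pd.DataFrame(
    {"line_id": [10, 11, 12], "wavelength": [1215.67, 1215.67, 584.33]},
    index=[0, 0, 1],
)


def try_reindex(df):
    """Return the error message if reindexing fails, else None (hypothetical helper)."""
    try:
        df.reindex(df.index.unique())
    except ValueError as err:
        return str(err)
    return None


# Reindexing the frame with duplicated labels raises a ValueError.
reindex_error = try_reindex(toy_lines)
print(reindex_error)

# After dropping the repeated labels, the same call succeeds.
deduplicated = toy_lines[~toy_lines.index.duplicated()]
print(try_reindex(deduplicated))
```

The exact error text depends on the pandas version, so the sketch only reports whether an error occurred.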
To Reproduce
import pandas as pd
from tardis.io.atom_data.util import download_atom_data

download_atom_data('kurucz_cd23_chianti_H_He')
store = pd.HDFStore('/path/to/kurucz_cd23_chianti_H_He.h5')

def check_duplicates(df, verbose=False):
    # Index labels that occur more than once
    dup_idx = df.index[df.index.duplicated()]
    not_identical = 0
    for idx in dup_idx:
        data = df.loc[idx]
        assert len(data) > 1, "Ups"
        identical = True
        for i in range(len(data) - 1):
            # Line ID will obviously be different
            data_a = data.iloc[i].drop(labels="line_id")
            data_b = data.iloc[i + 1].drop(labels="line_id")
            if not data_a.equals(data_b):
                # Stop at the first mismatch so a later pair cannot overwrite it
                identical = False
                break
        if not identical:
            if verbose:
                print(data)
            not_identical += 1
    if not_identical > 0:
        raise ValueError("Not all data is identical! (%d not identical)" % not_identical)

check_duplicates(store["lines"])
Screenshots
System
- OS:
  - GNU/Linux
  - macOS
- Environment (conda list): (Default carsus env)
Additional context
andrewfullard commented
I'm not sure why this code raises a ValueError when the data are not identical, when I assume it should be looking at the count of duplicates. I tried this with a newly generated output from Carsus and there don't seem to be duplicates (hard to say), so regenerating the basic TARDIS atom data seems like a good idea.
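For reference, a check that reports the number of duplicates rather than raising could look roughly like the sketch below. It runs on a toy frame standing in for store["lines"], and the helper names count_duplicates and duplicated_groups are hypothetical, not part of Carsus or TARDIS:

```python
import pandas as pd


def count_duplicates(df):
    """Count rows whose index label already appeared earlier in the frame."""
    return int(df.index.duplicated().sum())


def duplicated_groups(df):
    """Return, for each index label that occurs more than once, how many rows share it."""
    counts = df.index.value_counts()
    return counts[counts > 1]


# Toy stand-in for the "lines" table: index label 0 appears twice.
toy = pd.DataFrame(
    {"line_id": [10, 11, 12, 13], "nu": [1.0, 1.0, 2.0, 3.0]},
    index=[0, 0, 1, 2],
)

n_duplicates = count_duplicates(toy)
groups = duplicated_groups(toy)
print(n_duplicates)
print(groups)
```

On real data the same two calls could be pointed at store["lines"]; a result of zero from count_duplicates would mean the index has no repeated labels.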