Starlitnightly/omicverse

Duplicate-index handling does not actually retain only the highest expressed gene

Closed this issue · 2 comments

The tutorial states: "We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes"
However, the code simply keeps the first occurrence of each duplicated index. That retains the highest expressed gene only if the rows are already sorted in descending order of their row sums, which the code assumes but never enforces.

import pandas as pd
import omicverse as ov

def data_drop_duplicates_index(data:pd.DataFrame)->pd.DataFrame:
    r"""
    Drop the duplicated index of data.

    Arguments:
        data: The data to be processed.

    Returns:
        data: The data after dropping the duplicated index.
    """
    index=data.index
    data=data.loc[~index.duplicated(keep='first')]
    return data
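# Reproduction, following the tutorial: load the count matrix and tidy the sample names
data = pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt', index_col=0, sep='\t', header=1)
data.columns = [i.split('/')[-1].replace('.bam', '') for i in data.columns]
data.head()

# Map gene IDs to gene names; this is the step that introduces duplicated index labels
data = ov.bulk.Matrix_ID_mapping(data, 'genesets/pair_GRCm39.tsv')
data
# Show how many rows share each gene name
print(data.index.value_counts())
dds = ov.bulk.pyDEG(data)
x = dds.drop_duplicates_index()
print('... drop_duplicates_index success')
x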

Inspecting the intermediate data shows that the 7SK row retained in the tutorial is not the one with the highest expression level.
[Two screenshots of the intermediate data were attached here.]
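
To make the expected behaviour concrete, here is a minimal sketch of one way to keep the highest expressed row per duplicated index. It is not the fix that was shipped in omicverse. The function name `drop_duplicates_keep_highest` and the toy data are hypothetical, and the sketch assumes every column is a numeric count column, so that "highest expressed" can be read as "largest row sum".

```python
import numpy as np
import pandas as pd

def drop_duplicates_keep_highest(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: for each duplicated index label, keep the row
    with the largest row sum. Assumes all columns are numeric."""
    # Total expression per row, summed across all sample columns.
    totals = data.sum(axis=1).to_numpy()
    # Positional order of rows by descending total (stable for ties).
    order = np.argsort(-totals, kind="stable")
    ranked = data.iloc[order]
    # After sorting, the first occurrence of each label is its highest row.
    return ranked[~ranked.index.duplicated(keep="first")]

# Toy data (made up): two rows share the label "7SK".
toy = pd.DataFrame(
    {"s1": [1, 50, 3], "s2": [2, 60, 4]},
    index=["7SK", "7SK", "Actb"],
)
result = drop_duplicates_keep_highest(toy)
print(result)
```

On the toy data, keep='first' without sorting would retain the 7SK row with counts 1 and 2, whereas the sketch retains the row with counts 50 and 60. Note that the returned rows are ordered by descending row sum rather than in their original order.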

Thanks for the report. This bug will be fixed in the next version.

We have fixed this error in 1.6.4.