Duplicated gene names: the highest expressed gene is not actually retained
Closed this issue · 2 comments
user-tq commented
The tutorial states: "We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes"
However, the code only keeps the first occurrence of each duplicated index. That matches the tutorial's description only if the rows are already sorted in descending order of their row sums, which the code assumes but never enforces.
import pandas as pd

def data_drop_duplicates_index(data: pd.DataFrame) -> pd.DataFrame:
    r"""
    Drop the duplicated index of data.

    Arguments:
        data: The data to be processed.

    Returns:
        data: The data after dropping the duplicated index.
    """
    index = data.index
    # keep='first' retains whichever duplicate appears first, regardless of expression level
    data = data.loc[~index.duplicated(keep='first')]
    return data
import omicverse as ov

data = pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1)
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
data.head()
data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
data
print(data.index.value_counts())
dds=ov.bulk.pyDEG(data)
x=dds.drop_duplicates_index()
print('... drop_duplicates_index success')
x
Inspecting the intermediate data shows that the 7SK row retained in the tutorial is not the one with the highest expression level.
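For illustration, here is a minimal sketch of the behaviour described above and one possible way to get the documented result. The toy values and the helper name `drop_duplicates_keep_highest` are hypothetical, and this is not necessarily how omicverse implements it: it sorts rows by their total expression first and only then drops duplicates.

```python
import numpy as np
import pandas as pd


def drop_duplicates_keep_highest(data: pd.DataFrame) -> pd.DataFrame:
    """For each duplicated index label, keep the row with the largest row sum.

    Hypothetical helper for illustration only; not part of omicverse.
    """
    # Total expression of each row, summed across all samples.
    row_sums = data.sum(axis=1).to_numpy()
    # Positional order from highest to lowest total; a stable sort keeps
    # tied rows in their original relative order.
    order = np.argsort(-row_sums, kind="stable")
    sorted_data = data.iloc[order]
    # After sorting, the first occurrence of a label is its highest expressed row.
    return sorted_data.loc[~sorted_data.index.duplicated(keep="first")]


# Toy counts (made-up values): "7SK" appears twice, low expressed row first.
toy = pd.DataFrame(
    {"sample_1": [1, 50, 7], "sample_2": [2, 60, 8]},
    index=["7SK", "7SK", "Actb"],
)

# Behaviour reported above: the first duplicate wins, whatever its expression.
first_kept = toy.loc[~toy.index.duplicated(keep="first")]

# Sorting by row sum before dropping duplicates keeps the highest expressed row.
highest_kept = drop_duplicates_keep_highest(toy)

print(first_kept)
print(highest_kept)
```

Note that this sketch returns the rows ordered by total expression rather than in their original order; if the original order matters, the result can be reindexed afterwards.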
Starlitnightly commented
Thanks for the report. This bug will be fixed in the next version.
Starlitnightly commented
We have fixed this error in 1.6.4.