Error in generating a customized interaction database

Question

Closed this issue a year ago · 3 comments

I can successfully generate the database using the default data files:

from cellphonedb.utils import db_utils
db_utils.create_db('/home/Tools/cellphoneDB/db/v4.1.0/') 
Created /home/UTHSCSA/hef/Tools/cellphoneDB/db/v4.1.0/cellphonedb_08_04_2023_18:15:57.zip successfully

But when changing to my custom interaction dataset:
from cellphonedb.utils import db_utils
db_utils.create_db('/home/Tools/cellphoneDB/db/customized')

WARNING: The following sets of interaction partners appear in multiple rows of interaction_input.csv file:
H0Y2W6,Q9NZQ7
H0Y2W6,Q9BQ51
Q0Q5F1,A0N0P2
Q9NZQ7,A0N0P2
H3BM68,Q0Q5F1
H3BM68,A0N0P2
Q96DU3,Q13291
Q0Q5F1,C9JXS1

WARNING: protein_input.csv has the following UniProt accession duplicates:
Q14213


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 3
      1 from cellphonedb.utils import db_utils
----> 3 db_utils.create_db('/home/UTHSCSA/hef/Tools/cellphoneDB/db/customized') 

File ~/Tools/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/utils/db_utils.py:168, in create_db(target_dir)
    164 dataDFs = getDFs(gene_input=gene_input, protein_input=protein_input, complex_input=complex_input,
    165                  interaction_input=interaction_input, gene_synonyms_input=gene_synonyms_input)
    167 # Perform sanity tests on *_input files and report any issues to the user as warnings
--> 168 run_sanity_tests(dataDFs)
    170 # Collect protein data
    171 protein_db_df = dataDFs['protein_input'][['protein_name', 'tags', 'tags_reason', 'tags_description', 'uniprot']]

File ~/Tools/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/utils/db_utils.py:433, in run_sanity_tests(dataDFs)
    431 unknown_proteins = set()
    432 for col in PROTEIN_COLUMN_NAMES:
--> 433     aux_df = pd.merge(complex_db_df, protein_db_df, left_on=col, right_on='uniprot', how='outer')
    434     unknown_complex_proteins = set(aux_df[pd.isnull(aux_df['uniprot']) & ~pd.isnull(aux_df[col])][col].tolist())
    435     unknown_proteins = unknown_proteins.union(unknown_complex_proteins)
....
ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat

My interaction set didn't include any complex data. Since I can't run create_db without complex data or with an empty complex input, I generated a pseudo complex file and added both the gene and protein data to the final input. But it turns out that my "protein_db_df" didn't have a "uniprot_1" column, so pd.merge failed.

Any suggestions for dealing with this problem? If I don't have complex data, how should I curate the database table?

Thanks,
Funan

Answer 1 · 2023-08-06T21:18:25.000Z

Hi.

It seems you are introducing interactions that are already present in interaction_input.csv (first warning) and a protein that is already present in protein_input.csv (second warning). Identify the duplicated interactions and proteins and remove them from the input files.

Check that the format of your custom db matches the format of the downloaded one.
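
One way to track down the duplicates named in the two warnings is sketched below with pandas. This is only an illustration, not CellPhoneDB's own check: the column names "partner_a" and "partner_b" are an assumption about interaction_input.csv, the helper names are hypothetical, and the small tables are made-up stand-ins that reuse accessions from the warnings. The "uniprot" column name is taken from the traceback.

```python
import pandas as pd


def duplicated_partner_sets(interactions: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose unordered (partner_a, partner_b) pair occurs more than once."""
    # Sort each pair so that "A,B" and "B,A" count as the same set of partners.
    # NOTE: "partner_a"/"partner_b" are assumed column names.
    keys = pd.Series(
        [
            tuple(sorted(pair))
            for pair in zip(
                interactions["partner_a"].astype(str),
                interactions["partner_b"].astype(str),
            )
        ],
        index=interactions.index,
    )
    return interactions[keys.duplicated(keep=False)]


def duplicated_accessions(proteins: pd.DataFrame) -> list:
    """Return the UniProt accessions that appear in more than one row."""
    dup = proteins["uniprot"][proteins["uniprot"].duplicated()]
    return sorted(dup.unique())


# Made-up stand-ins for interaction_input.csv and protein_input.csv.
interactions = pd.DataFrame(
    {
        "partner_a": ["H0Y2W6", "Q9NZQ7", "Q96DU3"],
        "partner_b": ["Q9NZQ7", "H0Y2W6", "Q13291"],
    }
)
proteins = pd.DataFrame({"uniprot": ["Q14213", "Q13291", "Q14213"]})

print(duplicated_partner_sets(interactions))
print(duplicated_accessions(proteins))
```

With real files, the two DataFrames would come from pd.read_csv on the input files in the custom db directory.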

Regards

Answer 2 · 2023-08-07T15:49:20.000Z

I deleted those duplicate lines, but the same error still occurs: "ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat"

I've checked the format and it matches, but I have left a lot of columns blank.

Is it OK to have so many "NaN" columns?
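
Whether blank columns are a problem probably depends on the dtype pandas assigns to them when the file is read. A minimal sketch for checking this is below. The CSV text is a hypothetical stand-in for a complex input file (the "complex_name" column and the helper name are assumptions), and it shows only how pandas itself treats a column with no values.

```python
import io

import pandas as pd

# Hypothetical stand-in for a complex input file: uniprot_3 is blank in every row.
csv_text = """complex_name,uniprot_1,uniprot_2,uniprot_3
pseudo_complex,Q14213,Q13291,
"""

complex_df = pd.read_csv(io.StringIO(csv_text))


def all_empty_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that contain only missing values."""
    return [col for col in df.columns if df[col].isna().all()]


empty_cols = all_empty_columns(complex_df)
print(empty_cols)          # columns with no values at all
print(complex_df.dtypes)   # dtype that pandas inferred for each column
```

A column that is blank in every row has no text from which to infer a string type, so pandas reads it as float64, which matches the "float64" mentioned in the error message.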

Answer 3 · 2023-08-07T15:58:58.000Z

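I figured out why this happened. The pseudo complex only included two UniProt proteins, so the "uniprot_3" to "uniprot_5" columns were completely empty (and were presumably read by pandas as float64 rather than as text, which is what the merge complained about). I deleted those columns and changed the setting in db_utils.py to "PROTEIN_COLUMN_NAMES = ['uniprot_1','uniprot_2']".
After that, I could generate the final cellphonedb.zip.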
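
The failure can be reproduced outside CellPhoneDB with a few lines of pandas. The sketch below is an assumption-laden illustration, not the library's code: the tables are made up, try_merge is a hypothetical helper, and only the pd.merge call mirrors the line shown in the traceback.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for protein_db_df and complex_db_df.
proteins = pd.DataFrame({"uniprot": ["Q14213", "Q13291"]})
complexes = pd.DataFrame(
    {
        "uniprot_1": ["Q14213"],
        "uniprot_2": ["Q13291"],
        "uniprot_3": [np.nan],  # blank in every row, so the dtype is float64
    }
)


def try_merge(left: pd.DataFrame, col: str) -> str:
    """Run the same kind of merge as the traceback and report the outcome."""
    try:
        pd.merge(left, proteins, left_on=col, right_on="uniprot", how="outer")
        return "ok"
    except ValueError as err:
        return f"ValueError: {err}"


# Populated columns merge fine; the all-empty float64 column does not.
results = {col: try_merge(complexes, col) for col in complexes.columns}
for col, outcome in results.items():
    print(col, "->", outcome)

# Casting the empty column to object lets the merge go through.
fixed = complexes.astype({"uniprot_3": object})
result_fixed = try_merge(fixed, "uniprot_3")
print("uniprot_3 as object ->", result_fixed)
```

Dropping the empty columns and shortening PROTEIN_COLUMN_NAMES, as described above, avoids the same mismatch by never merging on a float64 column. Note that editing db_utils.py inside site-packages is lost whenever the package is reinstalled or upgraded.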