Su-informatics-lab/DSTG

Data format of our own data

leihouyeung opened this issue · 11 comments

Could you explain the expected format of each input file when using our own data (scRNAseq_data.RDS, spatial_data.RDS, and scRNAseq_label.RDS)?
What should each file contain, and what do its rows and columns represent? Thanks!

'scRNAseq_data.RDS' is the single-cell RNA-seq data used for deconvolution: a data matrix with genes as rows and cells as columns. 'scRNAseq_label.RDS' is a data frame whose row names are the cell names, with one column giving the cell type. 'spatial_data.RDS' is the spatial transcriptomics data matrix, with genes as rows and spots as columns.
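To make the layout concrete, here is a minimal sketch of the same three objects as pandas data frames. It only illustrates the shapes described above; the real inputs are R objects saved as .RDS files. The gene, cell, and spot names, the `cell_type` column name, and the `check_inputs` helper are all hypothetical, and the checks (matching cell names, at least one shared gene) are assumptions about what a sensible input looks like, not requirements stated by the authors.

```python
import numpy as np
import pandas as pd

genes = ["GeneA", "GeneB", "GeneC"]
cells = ["cell1", "cell2", "cell3", "cell4"]
spots = ["spot1", "spot2"]

# scRNAseq_data: genes (rows) x cells (columns)
sc_counts = pd.DataFrame(np.arange(12).reshape(3, 4), index=genes, columns=cells)

# scRNAseq_label: row names are cell names, one column with the cell type
sc_labels = pd.DataFrame({"cell_type": ["T", "B", "T", "B"]}, index=cells)

# spatial_data: genes (rows) x spots (columns)
st_counts = pd.DataFrame(np.arange(6).reshape(3, 2), index=genes, columns=spots)


def check_inputs(sc_counts, sc_labels, st_counts):
    """Return a list of layout problems (empty if none were found)."""
    problems = []
    if sc_labels.shape[1] != 1:
        problems.append("labels should have exactly one column (cell type)")
    if set(sc_labels.index) != set(sc_counts.columns):
        problems.append("label row names should match the cell (column) names")
    if sc_counts.index.intersection(st_counts.index).empty:
        problems.append("scRNA-seq and spatial matrices share no gene names")
    return problems


print(check_inputs(sc_counts, sc_labels, st_counts))
```

An empty list means the three objects are at least consistent with the layout described in the reply.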

Thanks for your response! I have another question. I have tried running my own data (not the example data). What is the purpose of the second data frame in "st_labels", returned by the function "data_process"? I know the first one is for the pseudo-ST data mixed from the raw scRNA-seq data, but I am confused about why the second one exists. How could we already have the real-ST labels?

Hey, good question. Yes, the returned list includes two elements. The first is the mixed labels of the pseudo-ST data, but the second is not the real-ST labels. The second is only there to keep the data structure at the right size and is not used in the learning process. You will obtain the real-ST labels after you finish the whole pipeline.
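One common way to carry placeholder labels without letting them affect training is to mask them out of the loss. The sketch below illustrates that idea only; it is an assumption about the general pattern, not the code in this repository, and `masked_mse`, the array shapes, and the squared-error loss are hypothetical choices.

```python
import numpy as np

# Known composition labels for 3 pseudo-ST spots over 2 cell types.
pseudo_labels = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])

# Placeholder rows for 2 real-ST spots: same width, arbitrary content.
n_real = 2
placeholder = np.zeros((n_real, pseudo_labels.shape[1]))

# Stack so the label matrix has one row per node in the joint graph.
labels = np.vstack([pseudo_labels, placeholder])

# Only pseudo-ST rows take part in the loss.
train_mask = np.concatenate([
    np.ones(len(pseudo_labels), dtype=bool),
    np.zeros(n_real, dtype=bool),
])


def masked_mse(pred, target, mask):
    """Mean squared error over the rows selected by `mask`."""
    return float(((pred[mask] - target[mask]) ** 2).mean())


pred = np.full(labels.shape, 0.5)
loss = masked_mse(pred, labels, train_mask)

# Changing the placeholder rows leaves the loss unchanged.
altered = labels.copy()
altered[~train_mask] = 123.0
print(loss == masked_mse(pred, altered, train_mask))
```

Under this pattern the placeholder rows matter only for their shape, which matches the description of the second element above.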

Thanks for your explanation.
FYI, when I ran convert_data.R with my own data, the function "SPOTlight::test_spot_fun" raised an error: "missing values are not allowed in subscripted assignments of data frames". I fixed it by adding the following code before the call to "test_spot_fun": rownames(st_label[[1]]) = colnames(st_count[[1]]). I hope this is helpful.
Nice work :)

Could you please explain the purpose of the "filterEdge" function in gutils.py? What is it used for?

I am also confused about the adjacency matrix construction in utils.py:
id_grp1 = np.array([ np.concatenate((np.where(find1 == id_graph2.iloc[i, 1])[0], np.where(find1 == id_graph2.iloc[i, 0])[0])) for i in range(len(id_graph2)) ])
I think it should be
id_grp1 = np.array([ np.concatenate((np.where(find1 == id_graph2.iloc[i, 2])[0], np.where(find1 == id_graph2.iloc[i, 1])[0])) for i in range(len(id_graph2)) ])
because id_graph2.iloc[i, 0] holds the indices of all edges.

Thanks for the comments. I will check the names and update the code.

The link graph between the pseudo-ST and real-ST data is built primarily in the reduced-dimension space. The 'filterEdge' function then filters the link graph further for reliability, based on the original pseudo-ST and real-ST data. Hope my explanation helps.

In the code as written, the variable id_graph2 should have two columns. If your id_graph2 has three columns, then the first column holds the indices of all edges, and you should change '1' to '2' and '0' to '1' accordingly.
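The column shift can be checked on a toy example. The sketch below reuses the expression from the question, with a hypothetical `first_col` parameter for the position of the first endpoint column; the node names, the `src`/`dst` column names, and the `edge_positions` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical node identifiers; in utils.py `find1` comes from the data.
find1 = np.array(["a", "b", "c", "d"])

# Two-column edge list, the layout the original code expects.
id_graph2 = pd.DataFrame({"src": ["a", "c"], "dst": ["b", "d"]})


def edge_positions(edges, nodes, first_col=0):
    """Map both endpoints of every edge to their positions in `nodes`."""
    return np.array([
        np.concatenate((
            np.where(nodes == edges.iloc[i, first_col + 1])[0],
            np.where(nodes == edges.iloc[i, first_col])[0],
        ))
        for i in range(len(edges))
    ])


two_col = edge_positions(id_graph2, find1)

# The same edges with an extra leading index column (three columns).
with_index = id_graph2.reset_index()
three_col = edge_positions(with_index, find1, first_col=1)

print(two_col.tolist())
print(np.array_equal(two_col, three_col))
```

An alternative to shifting the indices is to drop the extra column first, for example `with_index.iloc[:, 1:]`, and leave the original expression unchanged.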

I am confused about the "position" vector in this function. It holds the indices of "nn[1]", so why can it be used on "edge"? They are different data frames. I am also confused about the meaning of "fedge" in this function.

If I understand your question correctly, the indices of nn[1] represent the node indices, so they apply to the edges as well. That line of code identifies the neighbors between a given pseudo-ST node and a real-ST node that appear in "edges" as well as in "nn".
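The general idea of keeping only edges that are also supported by neighbors in the original data can be sketched as follows. This is not the 'filterEdge' implementation from gutils.py; `knn_indices`, `filter_edges_sketch`, the Euclidean distance, and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np


def knn_indices(query, reference, k):
    """For each query row, indices of its k nearest reference rows (Euclidean)."""
    dists = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=2)
    return np.argsort(dists, axis=1)[:, :k]


def filter_edges_sketch(edges, pseudo, real, k):
    """Keep a candidate edge (i, j) only if real node j is among the
    k nearest neighbors of pseudo node i in the original feature space."""
    nn = knn_indices(pseudo, real, k)
    return [(i, j) for i, j in edges if j in set(nn[i].tolist())]


# Toy original-space features for 2 pseudo-ST and 3 real-ST nodes.
pseudo = np.array([[0.0, 0.0], [10.0, 10.0]])
real = np.array([[0.0, 1.0], [9.0, 10.0], [5.0, 5.0]])

# Candidate edges (pseudo index, real index), e.g. from a reduced space.
candidates = [(0, 0), (0, 1), (1, 1), (1, 2)]

kept = filter_edges_sketch(candidates, pseudo, real, k=1)
print(kept)
```

Because the neighbor table and the edge list both refer to nodes by the same indices, a position found in one can be used to look up entries in the other, which is the point made in the reply above.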