About data preprocessing for diffusion pseudotime #26

ShuhuaGao · 2017-07-03T11:13:37Z

Hi, first, thanks for sharing this analysis tool. I much prefer Python to R, though most bioinformatics tools are written in R. I have a question about how data should be processed before we feed it as adata into dpt for pseudotime ordering.

The DPT algorithm can accept multiple types of data, most commonly single-cell qPCR (Ct values) and RNA-Seq (FPKM/TPM). Is the preprocessing procedure the same for all of them? I have also looked at Monocle 2, where this seems much more complicated. For instance, on page 4 of its documentation (link), it asks you to specify a different expressionFamily, i.e., the appropriate distribution, for each kind of data. How about the dpt function in scanpy? Does it treat all kinds of data the same way?

According to my understanding,

For qPCR data, we should provide delta_Ct=LOD-Ct values to dpt (LOD: limit of detection);
For RNA-Seq data, we should offer log2(FPKM+1) to dpt.
Is it right?
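
To make the two transforms in this question concrete, here is a minimal plain-Python sketch. The function names are hypothetical, and clipping undetected values (Ct at or above the LOD) to 0 is an assumption rather than something stated in this thread.

```python
import math


def delta_ct(ct_values, lod):
    """Return delta_Ct = LOD - Ct for each raw Ct value.

    Assumption: Ct values at or above the limit of detection (LOD)
    are treated as undetected and clipped to 0.
    """
    return [max(lod - ct, 0.0) for ct in ct_values]


def log2_fpkm(fpkm_values):
    """Return log2(FPKM + 1) for each FPKM value."""
    return [math.log2(v + 1.0) for v in fpkm_values]


# Hypothetical example values, not taken from any dataset.
print(delta_ct([18.0, 24.0, 30.0], lod=28.0))  # [10.0, 4.0, 0.0]
print(log2_fpkm([0.0, 1.0, 7.0]))              # [0.0, 1.0, 3.0]
```

Both outputs are on a log2-like scale, which is why the two data types can end up looking comparable to a downstream method.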

Any help is appreciated.

falexwolf · 2017-07-03T11:39:43Z

Hi ShuhuaGao,

thanks for your input! Monocle 2 has many more options for preprocessing, that's right. I believe, though, that you can get by with the limited options of Scanpy for pseudotime and branching inference using DPT, simply because DPT is very robust. Nonetheless, I have to admit that I've not worked with an extensive number of data types. From that experience, my understanding is the following:

For RNA-Seq data, you should normalize and extract highly variable genes. This is most simply done with the Cell Ranger procedure, sc.pp.recipe_zheng17 (example here), or, if you want more control, with the Seurat workflow (example here).
For qPCR, a simple log-normalization (sc.pp.log1p) should suffice (see example here). You might also consider "normalizing per cell / UMI correction" (sc.pp.normalize_per_cell), one of the steps done in the RNA-seq part.
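
As a rough illustration of what per-cell normalization followed by a log1p transform computes, here is a plain-Python sketch. It is not Scanpy's implementation: the function names are hypothetical, and defaulting the target total to the median of the per-cell totals is an assumption made for this example.

```python
import math
from statistics import median


def normalize_rows(counts, target_sum=None):
    """Scale each cell (row) so that its total equals target_sum.

    Assumption: when target_sum is None, use the median of the
    per-cell totals. Cells with a total of 0 are left as zeros.
    """
    totals = [sum(row) for row in counts]
    if target_sum is None:
        target_sum = median(totals)
    return [
        [v * target_sum / total if total > 0 else 0.0 for v in row]
        for row, total in zip(counts, totals)
    ]


def log1p_rows(matrix):
    """Apply log(1 + x) elementwise."""
    return [[math.log1p(v) for v in row] for row in matrix]


# Two hypothetical cells with the same composition but 10x different depth.
counts = [[1, 3], [10, 30]]
normalized = normalize_rows(counts, target_sum=10)
print(normalized)  # [[2.5, 7.5], [2.5, 7.5]]
logged = log1p_rows(normalized)
```

After normalization the two cells are identical, which is the point of the step: differences in sequencing depth no longer dominate the distances between cells.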

Ask if you have further questions. 😄

ShuhuaGao · 2017-07-03T13:49:37Z

Hi, Alex,

Many thanks for your quick reply. I just saw your reply as it is almost 10PM in Singapore now. It is understandable to perform quality control, in-cell normalization and to extract the highly variable genes for ordering. I got your point.

Regarding your reply about qPCR: do we need a log normalization? I think a log transform is only required for RNA-Seq data, to make the distribution less skewed and closer to normal. For qPCR data, the delta_Ct value is already on a log scale. In the example you mentioned, there is no call to sc.pp.log1p either. Instead, we just read the data with
adata = sc.read(filename, sheet='dCt_values.txt', backup_url=backup_url)
and no further processing is applied. As can be found in the original paper, the so-called dCt_value is simply defined as HK_Ct - Ct, where HK_Ct is the mean Ct of 4 housekeeping genes, computed per cell.
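
For a single cell, that definition can be sketched as follows. The gene names and Ct values are hypothetical, and the helper is only an illustration of dCt = HK_Ct - Ct, not code from the paper or from Scanpy.

```python
from statistics import mean


def dct_for_cell(ct_by_gene, housekeeping_genes):
    """Return dCt = HK_Ct - Ct for every non-housekeeping gene in one cell.

    HK_Ct is the mean Ct of the housekeeping genes in that same cell.
    """
    hk_ct = mean([ct_by_gene[g] for g in housekeeping_genes])
    return {
        gene: hk_ct - ct
        for gene, ct in ct_by_gene.items()
        if gene not in housekeeping_genes
    }


# Hypothetical Ct values for one cell; HK1..HK4 stand in for housekeeping genes.
cell = {
    "HK1": 14.0, "HK2": 15.0, "HK3": 16.0, "HK4": 15.0,
    "GeneA": 20.0, "GeneB": 12.0,
}
print(dct_for_cell(cell, ["HK1", "HK2", "HK3", "HK4"]))
# {'GeneA': -5.0, 'GeneB': 3.0}
```

Because Ct is itself roughly a log2 measure of abundance, the resulting dCt values are already on a log scale, so no further log transform is applied.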

Besides, in many cases, there may be no UMI data available. In such a case, the normalization per cell for RNA-Seq is actually to compute the FPKM/TPM to compensate for the sequencing depth, right? Usually, the RNA-Seq data in FPKM form is already provided in publications. And then we work on this data to find the highly variable genes. (Just personal understanding. I am new to this field from mechatronics engineering.)

Anyway, thanks again for your help. I noticed that there are no examples for pseudo-time ordering with RNA-Seq data. Maybe I can provide one in the near future, as I am working on gene network modeling based on the pseudo-time information.

falexwolf · 2017-07-03T14:04:44Z

Hi!

Everything that you write makes sense: if the qPCR values are already on a log scale, you shouldn't log-transform them again, and if the RNA-seq data is already in FPKM form, you do not need to account for UMI correction ...

Regarding the pseudotime example for RNA-seq data: here is a public one. But it would be nice to have more!

Thanks for your input!

ShuhuaGao · 2017-07-04T04:58:02Z

Thanks for your reply. I will try that and may give more feedback. Cheers!

ShuhuaGao closed this as completed Jul 4, 2017
