Export expression matrix from h5ad #262

Closed
cartal opened this issue Sep 13, 2018 · 30 comments

@cartal
cartal commented Sep 13, 2018

Hi,

I would like to extract the expression matrix (genes and counts) from a h5ad file.
How can I do this? I have searched the documentation but I couldn't find anything about this (maybe I missed it).

@LuckyMD
Contributor
LuckyMD commented Sep 13, 2018

If you want to extract it in python, you can load the h5ad file using adata = sc.read(filename) and then use adata.X, which is the expression matrix.

To extract the matrix into R, you can use the rhdf5 library. That's a bit more complicated as there was a recent update to this library I believe. Note that you need to transpose the expression matrix from python into R due to different conventions (R expects a genes x cells matrix, python a cells x genes matrix).

An alternative to the rhdf5 library is to save the expression matrix with numpy.savetxt(), for example as a space-delimited file.

I hope this helps.
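
As a rough illustration of the numpy.savetxt() route and the transposition mentioned above, here is a minimal sketch. It assumes the matrix is already a dense cells x genes array (a sparse adata.X would first need .toarray()); the toy values, cell names, and output file name are made up for the example.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for adata.X: 2 cells (rows) x 3 genes (columns).
X = np.array([[1, 0, 3], [0, 2, 5]], dtype=float)
cell_names = ["cell_a", "cell_b"]  # hypothetical barcodes

out_path = os.path.join(tempfile.mkdtemp(), "expression.txt")

# Transpose so the file is genes x cells, the orientation R tools expect.
# The header line carries the cell names; comments="" drops the leading "# ".
np.savetxt(out_path, X.T, delimiter=" ", fmt="%g",
           header=" ".join(cell_names), comments="")

# Read the numbers back (skipping the header) to confirm the orientation.
restored = np.loadtxt(out_path, skiprows=1)
```

Gene names are not written by this sketch; the row order follows adata.var_names, so those would have to be saved alongside the matrix.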

@cartal
Author
cartal commented Sep 14, 2018

Hi, this is very useful. But I think I formulated my question the wrong way.

Can I export the h5ad file to a standard 10X h5 file?

@LuckyMD
Contributor
LuckyMD commented Sep 14, 2018

That, I can't help you with I'm afraid. I'm not as familiar with the h5ad format.

@maximilianh
Contributor
maximilianh commented Sep 14, 2018 via email

@falexwolf
Member

Hi @cartal, it wouldn't be very hard to export to a 10x h5 file, but I'd need to write a custom function for it. Why is it needed? Does 10x offer any downstream analysis that you'd want to use on the data? I thought there are none, hence there is only sc.read_10x_h5 and no sc.write_10x_h5.

@cartal
Author
cartal commented Dec 14, 2018

Hi, I'm sorry I forgot about this thread. I just stumbled onto the same issue again.

Let's say I have done my analysis in scanpy and everything is good and nice, but now I want to run, say, cluster 10 from the louvain clustering through Palantir. Palantir can read 10X and 10X_H5 files. Is there a way to plug-and-play this with scanpy?

In another case, I want to extract the subset expression matrix, where rows are genes (with gene symbols as rownames) and columns are cells (with cell names as colnames), so I can use it with SCENIC. How could I get this from the scanpy adata?

I apologise in advance if I'm asking something very basic, but it would be really nice to have some sort of interconnectivity between tools, since scanpy is so nice to have as a major analysis suite.

@falexwolf
Member

You can always export as a .csv file and read that into other tools using adata.write_csvs(filename, skip_data=False). You can transpose with adata.T beforehand.

I can imagine that Palantir would also accept AnnData objects; you could open an issue there. Also, have you tried tl.paga for trajectory inference? The paper is here.

@cartal
Author
cartal commented Dec 21, 2018

Hi, I am trying PAGA. Thanks a lot for the help.

@aditisk
aditisk commented Feb 4, 2019

@falexwolf @cartal @LuckyMD I am also trying to export a gene-by-cell expression file. I tried using adata.write_csvs(filename, skip_data=False), but that wrote the output to multiple files. Is there a way to generate a single file with genes as rows (with gene names as row IDs) and cells as columns (barcodes as column IDs)?

Thank you in advance for your help.

@falexwolf
Member

No, there is no way to produce a single file with data and metadata. Having genes as rows can simply be achieved by transposing the matrix (adata.T.write_csvs(...)).

@maximilianh
Contributor
maximilianh commented Feb 5, 2019 via email

@aditisk
aditisk commented Feb 7, 2019

@falexwolf thanks for the feedback. As @maximilianh suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help.

@maximilianh
Contributor
maximilianh commented Feb 7, 2019 via email

@aditisk
aditisk commented Apr 1, 2019

@maximilianh I was able to use the cell browser export function in the past but this time I am getting an error message:

INFO:root:Writing scanpy matrix to adata_cellbrowser_04_01_19_CD8_subclustered/exprMatrix.tsv.gz
INFO:root:Transposing matrix
INFO:root:Writing gene-by-gene, without using pandas
INFO:root:Writing 8068 genes in total
INFO:root:Wrote 0 genes
INFO:root:Wrote 2000 genes
INFO:root:Wrote 4000 genes
INFO:root:Wrote 6000 genes
INFO:root:Wrote 8000 genes
INFO:root:Writing UMAP coords to adata_cellbrowser_04_01_19_CD8_subclustered/umap_coords.tsv
ERROR:root:Couldnt find cluster markers list

I am using an h5ad file to import my AnnData object. Is that why there is an issue with finding cluster markers? I am able to plot the clusters in a UMAP plot, so I know that the 'louvain' observation exists. Any thoughts on why this is happening?

Thanks.

@LuckyMD
Contributor
LuckyMD commented Apr 1, 2019

Just a thought... have you run sc.tl.rank_genes_groups() to obtain cluster markers?

@maximilianh
Contributor
maximilianh commented Apr 1, 2019 via email

@aditisk
aditisk commented Apr 1, 2019

@LuckyMD I did not run sc.tl.rank_genes_groups() which was the problem.

@maximilianh I think including the cluster-specific markers should be optional, so maybe keeping it as a warning would be best? That way the user has control over what they want to include.

@flying-sheep
Member

@maximilianh I think those messages are from your code? Maybe you should improve the error message to include something like:

Try running sc.tl.rank_genes_groups(adata) to create the cluster annotation

@hejing3283

Hi, the expression matrix I exported with adata.write only has the top variable genes. Is there a way to output the raw matrix including all genes?

@maximilianh
Contributor
maximilianh commented Jun 3, 2019 via email

@hejing3283
hejing3283 commented Jun 5, 2019 via email

@LuckyMD
Contributor
LuckyMD commented Jun 5, 2019

Hi @hejing3283,

The wrong shape is probably because you have subsetted adata.X to highly variable genes, or did some additional filtering after storing data in adata.raw. For a while now scanpy has avoided subsetting the data to highly variable genes and instead annotates them in adata.var['highly_variable'], which is then used in sc.pp.pca(). I would suggest you use subset=False next time you use sc.pp.highly_variable_genes() to avoid different dimensions in adata.X and adata.raw.X.

You can easily proceed by just making a new anndata object from adata.raw.X, adata.raw.var and adata.obs (adata.raw has no obs of its own, but it shares its cells with adata) and storing this to be loaded into cellxgene. Just do the following:

adata_raw = sc.AnnData(X=adata.raw.X, obs=adata.obs, var=adata.raw.var)
adata_raw.write(my_file)

@hejing3283
hejing3283 commented Jun 5, 2019 via email

@ekernf01
ekernf01 commented Nov 5, 2019

Here's my solution to a similar inter-operability hiccup. This produces files similar to 10X v2 triplet format, plus an extra cell metadata file.

import os
import pandas as pd
import scipy.io
# ad is the AnnData object; destination is an existing output directory.
pd.DataFrame(ad.var.index).to_csv(os.path.join(destination, "genes.tsv"), sep="\t", index=False)
pd.DataFrame(ad.obs.index).to_csv(os.path.join(destination, "barcodes.tsv"), sep="\t", index=False)
ad.obs.to_csv(os.path.join(destination, "metadata.tsv"), sep="\t", index=True)
scipy.io.mmwrite(os.path.join(destination, "matrix.mtx"), ad.X)

@mariafiruleva

You can convert h5ad format to Seurat object using sceasy.

@chris-rands
Contributor
chris-rands commented Mar 9, 2020

To write a genes (rows) vs. cells (columns) matrix, I tried this:

adata.T.to_df().to_csv('matrix.csv')

@scverse scverse deleted a comment from ekernf01 Mar 9, 2020
@scverse scverse deleted a comment from ekernf01 Mar 9, 2020
@shizhiwen1990

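Use this code:
import datatable as dt
import pandas as pd
# adata.X is assumed to be a scipy sparse matrix here (dense arrays have no .toarray())
X = pd.DataFrame(adata.X.toarray().T, columns=adata.obs_names)
Symbol = dt.Frame({'Symbol': adata.var_names.values})
X = dt.cbind([Symbol, dt.Frame(X)])
X.to_csv('X.csv')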

@apredeus
scipy.io.mmwrite

This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy or Seurat wrapper functions. Perhaps it's a package version thing though..
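
The switched rows and columns are consistent with AnnData storing cells x genes while 10x-style matrix.mtx files are laid out as genes x cells. A hedged sketch of a transposed export follows; write_triplet is a hypothetical helper, the toy matrix and names are invented, and the two-column genes.tsv (id and symbol, here simply repeated) is an assumption about what v2-style readers expect rather than something confirmed in this thread.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import scipy.io
import scipy.sparse as sp


def write_triplet(X, gene_names, barcodes, destination):
    """Write a cells x genes matrix as genes.tsv, barcodes.tsv and matrix.mtx."""
    os.makedirs(destination, exist_ok=True)
    # Two columns (id, symbol); names are repeated because no separate ids exist.
    pd.DataFrame({"id": gene_names, "symbol": gene_names}).to_csv(
        os.path.join(destination, "genes.tsv"), sep="\t", index=False, header=False
    )
    pd.Series(barcodes).to_csv(
        os.path.join(destination, "barcodes.tsv"), sep="\t", index=False, header=False
    )
    # Transpose: the input is cells x genes, the file should be genes x cells.
    scipy.io.mmwrite(os.path.join(destination, "matrix.mtx"), sp.csr_matrix(X).T)


# Toy stand-in for adata.X, adata.var_names and adata.obs_names.
X = sp.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]], dtype=np.float32))
out_dir = os.path.join(tempfile.mkdtemp(), "triplet")
write_triplet(X, ["GeneA", "GeneB", "GeneC"], ["AAAC-1", "AAAG-1"], out_dir)

# Read the matrix back to confirm it is genes x cells.
restored = scipy.io.mmread(os.path.join(out_dir, "matrix.mtx"))
```

Whether a given reader accepts the result may still depend on its version, for example on whether it expects gzipped files or a features.tsv instead of genes.tsv.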

@indapa
indapa commented Apr 24, 2023
scipy.io.mmwrite

This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy or Seurat wrapper functions. Perhaps it's a package version thing though..

I was having the same issue as well. I ended up doing what was suggested above:

adata.T.to_df().to_csv('matrix.csv')

@pcm32
pcm32 commented May 8, 2024

We have this method in scanpy-scripts: https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/6297be21119d6964e074fa0b40a3b6fcaec53bbc/scanpy_scripts/cmd_utils.py#L137. You could also use it directly from there with one of the containers: https://quay.io/repository/biocontainers/scanpy-scripts?tab=tags&tag=latest. I think it can be used through the filtering CLI call, given thresholds that won't filter anything out.
