Export expression matrix from h5ad #262

cartal · 2018-09-13T15:22:33Z

Hi,

I would like to extract the expression matrix (genes and counts) from a h5ad file.
How can I do this? I have searched the documentation but I couldn't find anything about this (maybe I missed it).

LuckyMD · 2018-09-13T17:11:56Z

If you want to extract it in python, you can load the h5ad file using adata = sc.read(filename) and then use adata.X, which is the expression matrix.

To extract the matrix into R, you can use the rhdf5 library. That's a bit more complicated as there was a recent update to this library I believe. Note that you need to transpose the expression matrix from python into R due to different conventions (R expects a genes x cells matrix, python a cells x genes matrix).

An alternative to the rhdf5 library is to just save the expression matrix via numpy.savetxt() to save it, for example, as a space-delimited file.

I hope this helps.

cartal · 2018-09-14T08:26:00Z

Hi, this is very useful. But I think I formulated my question the wrong way.

Can I export the h5ad file to a standard 10X h5 file?

LuckyMD · 2018-09-14T09:34:50Z

That, I can't help you with I'm afraid. I'm not as familiar with the h5ad format.

maximilianh · 2018-09-14T09:50:00Z

This is an interesting question... I just googled it and found only examples in R, but that may be good enough for you: https://rdrr.io/github/MarioniLab/DropletUtils/man/write10xCounts.html

falexwolf · 2018-09-17T17:01:55Z

Hi @cartal, it wouldn't be very hard to export to a 10x h5 file, but I'd need to write a custom function for it. Why is it needed? Does 10x offer any downstream analysis that you'd want to use on the data? I thought there are none, hence there is only sc.read_10x_h5 and no sc.write_10x_h5.

cartal · 2018-12-14T10:48:44Z

Hi, I'm sorry I forgot about this trend, I just stumble into the same issue.

Lets say I have done my analysis in scanpy and everything is good and nice, but now I want to run, say, the cluster 10 from the louvain subset, with Palantir. Palantir can read 10X and 10X_H5 files. Is there a way to plug-and-play this with scanpy?

In another case, if I want to extract the subset expression matrix, where rows are genes (with rownames as gene symbols) and columns are cells (with colnames as cells), so I can use this with SCENIC. How could I get this from the scanpy adata?

I apologise in advance if I'm asking something very basic, but it will be really nice to have some sort of interconectivity between tools, since scanpy is so nice to have as a major analysis suite.

falexwolf · 2018-12-16T03:01:54Z

You can always export as a .csv file and read that into other tools using adata.write_csvs(filename, skip_data=False). You can call adata.T for transposing before.

I can imagine that Palantir would also accept AnnData objects, you could make an issue there. Also, have you tried tl.paga for trajectory inference, the paper is here?

cartal · 2018-12-21T17:01:42Z

Hi, I am trying PAGA. Thanks a lot for the help.

aditisk · 2019-02-04T20:24:06Z

@falexwolf @cartal @LuckyMD I am also trying to export a gene by cell expression file. I tried using adata.write_csvs(filename, skip_data=False) but that wrote the output to multiple files. Is there a way to generate a single file with genes as rows (with gene names as row IDs) and cells as columns (barcodes as column IDs) ?

Thank you in advance for your help.

falexwolf · 2019-02-05T01:05:38Z

No, there is no way to produce a single file with data and metadata. Having genes as rows can simply be achieved by transposing the matrix (adata.T.write_csvs(...)).

maximilianh · 2019-02-05T01:11:58Z

In exporting.cellbrowser the cellbrowser export function creates two files, one with the expression matrix (genes on rows, cells as columns) and one for the cell metadata. That may be what you're looking for?

…

aditisk · 2019-02-07T18:21:05Z

@falexwolf thanks for the feedback. As @maximilianh suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help.

maximilianh · 2019-02-07T19:10:49Z

Nice to hear that it worked. As a side product, you can now create a cell browser html directory from the generated directory.

…

On Thu, Feb 7, 2019 at 10:21 AM aditisk ***@***.***> wrote: @falexwolf <https://github.com/falexwolf> thanks for the feedback. As @maximilianh <https://github.com/maximilianh> suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#262 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAS-TXGCkdaQWO8ks_x7uOm-P2_ISArRks5vLG6RgaJpZM4Wne7Z> .

aditisk · 2019-04-01T18:10:21Z

@maximilianh I was able to use the cell browser export function in the past but this time I am getting an error message:

INFO:root:Writing scanpy matrix to adata_cellbrowser_04_01_19_CD8_subclustered/exprMatrix.tsv.gz
INFO:root:Transposing matrix
INFO:root:Writing gene-by-gene, without using pandas
INFO:root:Writing 8068 genes in total
INFO:root:Wrote 0 genes
INFO:root:Wrote 2000 genes
INFO:root:Wrote 4000 genes
INFO:root:Wrote 6000 genes
INFO:root:Wrote 8000 genes
INFO:root:Writing UMAP coords to adata_cellbrowser_04_01_19_CD8_subclustered/umap_coords.tsv
ERROR:root:Couldnt find cluster markers list

I am using an h5ad file to import my ann data object. Is that why there is some issue with finding cluster markers ? I am able to plot the clusters in a UMAP plot so I know that the 'louvain' observation exists. Any thoughts on why this is happening ?

Thanks.

LuckyMD · 2019-04-01T18:46:49Z

Just a thought... have you run sc.tl.rank_genes_groups() to obtain cluster markers?

maximilianh · 2019-04-01T20:53:09Z

Yes, it seems that I should change this to a warning. People may not want cluster-specific marker genes in their browser. Do you agree @aditisk ?

…

aditisk · 2019-04-01T21:17:13Z

@LuckyMD I did not run sc.tl.rank_genes_groups() which was the problem.

@maximilianh I think it should be optional to include the cluster-specific markers so maybe keeping it as a warning might be the best ? That way the user has control on what they want to include.

flying-sheep · 2019-04-02T09:00:41Z

@maximilianh I think those messages are from your code? maybe you should improve the error message to include something like

Try running sc.tl.rank_genes_groups(adata) to create the cluster annotation

hejing3283 · 2019-05-31T15:14:14Z

Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?

maximilianh · 2019-06-03T10:03:10Z

The scanpyToCellbrowser function has an option useRaw that will use the .raw matrix, if present, for the .tsv export. Otherwise, the raw matrix of all genes is stored as ad.raw.X and the variable names are in ad.raw.var. You can use scanpyToCellbrowser to write the matrix and all annotations, or anndataToTsv to write just the matrix. Or use code from there to write your own.

…

On Fri, May 31, 2019 at 5:14 PM Jing He ***@***.***> wrote: Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#262>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AACL4TORHPOQ2GTWTUGTAI3PYE6ENANCNFSM4FU553MQ> .

hejing3283 · 2019-06-05T13:50:52Z

Thanks for the suggestion. Actually, I am using cellxgene which takes the h5ad file as an input. when using anndata.write() function, it only output the anndata.X as the expression matrix. And also there is no option of useRaw here. Also, I tried to re-assign anndata.X = anndata.raw.X, but it returns an error saying its wrong shape. Do you have any suggestions? Thanks a lot! On Mon, Jun 3, 2019 at 6:03 AM Maximilian Haeussler < notifications@github.com> wrote:

…

The scanpyToCellbrowser function has an option useRaw that will use the .raw matrix, if present, for the .tsv export. Otherwise, the raw matrix of all genes is stored as ad.raw.X and the variable names are in ad.raw.var. You can use scanpyToCellbrowser to write the matrix and all annotations, or anndataToTsv to write just the matrix. Or use code from there to write your own. On Fri, May 31, 2019 at 5:14 PM Jing He ***@***.***> wrote: > Hi, the expression matrix I exported from adata.write only have the top > variable genes. Is there a way to output the raw matrix including all genes? > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > < #262 >, > or mute the thread > < https://github.com/notifications/unsubscribe-auth/AACL4TORHPOQ2GTWTUGTAI3PYE6ENANCNFSM4FU553MQ > > . > — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#262>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAUAIIM2RYJQTSDTKQ4XZLTPYTT6BANCNFSM4FU553MQ> .

-- Cheers! Jing

LuckyMD · 2019-06-05T14:39:44Z

Hi @hejing3283,

The wrong shape is probably because you have subsetted adata.X to highly variable genes, or did some additional filtering after storing data in adata.raw. For a while now scanpy avoids filtering highly variable genes, but instead annotates them in adata.var['highly_variable'] which is then used in sc.pp.pca(). I would suggest you use subset=False next time you use sc.pp.highly_variable() to avoid different dimensions in adata.X and adata.raw.X.

You can easily proceed by just making a new anndata object from adata.raw.X, adata.raw.var and adata.raw.obs and storing this to be loaded into cellxgene. Just do the following:

adata_raw = sc.AnnData(X=adata.raw.X, obs=adata.raw.obs, var=adata.raw.var)
adata_raw.write(my_file)

hejing3283 · 2019-06-05T15:13:43Z

Hi, Thanks so much for the explanations! Doing it now and it works. Best, Jing

…

On Jun 5, 2019, at 10:39, MalteDLuecken ***@***.***> wrote: Hi @hejing3283, The wrong shape is probably because you have subsetted adata.X to highly variable genes, or did some additional filtering after storing data in adata.raw. For a while now scanpy avoids filtering highly variable genes, but instead annotates them in adata.var['highly_variable'] which is then used in sc.pp.pca(). I would suggest you use subset=False next time you use sc.pp.highly_variable() to avoid different dimensions in adata.X and adata.raw.X. You can easily proceed by just making a new anndata object from adata.raw.X, adata.raw.var and adata.raw.obs and storing this to be loaded into cellxgene. Just do the following: adata_raw.write(my_file) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

ekernf01 · 2019-11-05T18:02:26Z

Here's my solution to a similar inter-operability hiccup. This produces files similar to 10X v2 triplet format, plus an extra cell metadata file.

pd.DataFrame(ad.var.index).to_csv(os.path.join(destination, "genes.tsv" ),   sep = "\t", index_col = False)
pd.DataFrame(ad.obs.index).to_csv(os.path.join(destination, "barcodes.tsv"), sep = "\t", index_col = False)
ad.obs.to_csv(os.path.join(destination, "metadata.tsv"), sep = "\t", index_col = True)
scipy.io.mmwrite(os.path.join(destination, "matrix.mtx"), ad.X)

mariafiruleva · 2020-01-08T08:45:13Z

You can convert h5ad format to Seurat object using sceasy.

chris-rands · 2020-03-09T12:48:15Z

To write a genes (rows) vs. cells (columns) matrix, i tried this

adata.T.to_df().to_csv('matrix.csv')

shizhiwen1990 · 2023-02-12T03:50:35Z

Use this code:
import datatable as dt
X = pd.DataFrame(adata.X.toarray().T,columns=adata.obs_names)
Symbol=dt.Frame({'Symbol':adata.var_names.values})
X = dt.cbind([Symbol,dt.Frame(X)])
X.to_csv('X.csv')

apredeus · 2023-03-20T11:10:48Z

scipy.io.mmwrite

This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy or Seurat wrapper functions. Perhaps it's a package version thing though..

indapa · 2023-04-24T19:14:18Z

scipy.io.mmwrite
This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy or Seurat wrapper functions. Perhaps it's a package version thing though..

I was having the same issue as well. I ended up doing what was suggested above:

adata.T.to_df().to_csv('matrix.csv')

pcm32 · 2024-05-08T08:34:54Z

We have this method in scanpy-scripts https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/6297be21119d6964e074fa0b40a3b6fcaec53bbc/scanpy_scripts/cmd_utils.py#L137 - you could as well use it just from there with one of the containers https://quay.io/repository/biocontainers/scanpy-scripts?tab=tags&tag=latest I think it can be used through the filtering CLI call, given numbers that won't filter anything out.

cartal closed this as completed Dec 21, 2018

falexwolf mentioned this issue Dec 26, 2018

QUESTION: Subset expression matrix from custom genes scverse/anndata#91

Closed

scverse deleted a comment from ekernf01 Mar 9, 2020

josinejansen mentioned this issue Jan 27, 2021

what is the .h5ad file? scverse/anndata#180

Closed

ivirshup mentioned this issue Jul 26, 2021

write anndata to h5 10x format scverse/anndata#595

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Export expression matrix from h5ad #262

Export expression matrix from h5ad #262

Export expression matrix from h5ad #262

Export expression matrix from h5ad #262

Comments