Phylo newick parser fails when names contain single quotes #2227

Open
shz9 opened this issue Aug 22, 2019 · 3 comments

@shz9
shz9 commented Aug 22, 2019

Hi everyone,

There seems to be a bug in the Phylo Newick parser: it fails if a node name contains a single quote (e.g. "Burton's_mouthbreeder"). A similar issue was raised before (#887), though in that case the request was to allow parentheses in quoted node names.

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)
3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]
CPython
Linux-3.10.0-957.el7.x86_64-x86_64-with-centos-7.6.1810-Core
1.73

Steps to reproduce

You can reproduce the same error if you try to parse the multiz100way Newick tree with common names from the UCSC genome browser (download).

When I try to parse the UCSC tree file with common names, I get the following error message:

from Bio import Phylo
ucsc_tree = Phylo.read("hg19.100way.commonNames.nh", "newick")
NewickError: Number of open/close parentheses do not match.

Now, if I remove the single quotes from the node names, the error doesn't arise:

from Bio import Phylo
import io
with open("hg19.100way.commonNames.nh", "r") as pf:
    phy = pf.read().replace("'", "")
    ucsc_tree = Phylo.read(io.StringIO(phy), "newick")

Other parsers (e.g. ETE Toolkit) don't have issues with single quotes in the node names.

@MortenHofft
MortenHofft commented Nov 18, 2021

> Other parsers (e.g. ETE Toolkit) don't have issues with single quotes in the node names.

Single quotes do not work on ETE either. Or rather, it appears to work but isn't interpreted correctly. It just doesn't fail. At least that is the case for their website in that link as of today.

UPDATE: a silly example: ('flower (Linnaeus, 1707:1778)':3, shrub:2)plant:0;
To the best of my knowledge, anything is allowed inside single quotes (except single quotes themselves, which need to be escaped by another single quote).
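
As a rough sketch of that quoting convention (the helper names `quote_label` and `unquote_label` are hypothetical, not part of Biopython or ETE, and the set of characters that triggers quoting is an assumption):

```python
import re

# Assumption for this sketch: a label needs quoting if it contains
# whitespace, Newick structural characters, or a single quote.
_NEEDS_QUOTING = re.compile(r"[\s()\[\]':;,]")


def quote_label(name: str) -> str:
    """Wrap a label in single quotes, doubling any embedded single quote."""
    if not _NEEDS_QUOTING.search(name):
        return name
    return "'" + name.replace("'", "''") + "'"


def unquote_label(label: str) -> str:
    """Reverse quote_label: strip outer quotes and collapse doubled quotes."""
    if len(label) >= 2 and label[0] == "'" and label[-1] == "'":
        return label[1:-1].replace("''", "'")
    return label


print(quote_label("Burton's_mouthbreeder"))         # 'Burton''s_mouthbreeder'
print(quote_label("flower (Linnaeus, 1707:1778)"))  # 'flower (Linnaeus, 1707:1778)'
```

Under this convention the apostrophe in a name survives a quote/unquote round trip, and a plain label such as `shrub` is left unquoted.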

@peterjc
Member
peterjc commented Nov 19, 2021

Have you checked your file is valid according to the specification? The details are not fresh in my mind, but perhaps a single quote in a name needs to be escaped to be valid?
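
One way to check this without any parser is a quote-aware scan of the raw text. The following is a minimal sketch, assuming that a single quote opens a quoted label and that a doubled single quote inside it is an escaped literal quote; `check_newick_quoting` is a hypothetical helper, not a Biopython function, and it ignores bracketed comments.

```python
def check_newick_quoting(text: str) -> str:
    """Return "ok" or a short description of the first problem found."""
    depth = 0
    in_quote = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quote:
            if ch == "'":
                if i + 1 < len(text) and text[i + 1] == "'":
                    i += 1  # doubled quote: escaped literal, stay in the label
                else:
                    in_quote = False  # closing quote of the label
        elif ch == "'":
            in_quote = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return "unmatched closing parenthesis"
        i += 1
    if in_quote:
        return "unterminated quoted label"
    if depth != 0:
        return "unbalanced parentheses"
    return "ok"


print(check_newick_quoting("(Burton's_mouthbreeder:1,Human:2);"))
print(check_newick_quoting("('Burton''s_mouthbreeder':1,Human:2);"))
```

With these assumptions, the unescaped apostrophe in the first tree is read as the start of a quoted label that never ends, so every later parenthesis is swallowed by the "label". That would be consistent with a quote-aware parser reporting mismatched parentheses.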

@corneliusroemer
corneliusroemer commented Dec 13, 2023

I've hit a superficially similar issue related to single quotes. They don't round trip correctly.

A reproducible example is very simple, doesn't require loading a huge genome:

from Bio import Phylo

tree = Phylo.BaseTree.Tree(root=Phylo.BaseTree.Clade(name="Root"))
clade_with_quote = Phylo.BaseTree.Clade(name="Node'Name")
tree.root.clades.append(clade_with_quote)

with open("tree_with_quote.newick", "w") as file:
    Phylo.write(tree, file, "newick")

with open("tree_with_quote.newick", "r") as file:
    tree_read = Phylo.read(file, "newick")

original_node_names = [clade.name for clade in tree.find_clades()]
read_node_names = [clade.name for clade in tree_read.find_clades()]

mismatches = [(original, read) for original, read in zip(original_node_names, read_node_names) if original != read]
print(f"Mismatches: {mismatches}")
# Mismatches: [("Node'Name", "Node\\'Name")]

I've opened two separate issues for the erroneous serialization and deserialization: #4536 and #4537.


5 participants