Phylo newick parser fails when names contain single quotes #2227
Comments
Single quotes do not work in ETE either. Or rather, they appear to work but are not interpreted correctly; the parser just doesn't fail. At least that is the case on their website at that link as of today. UPDATE: silly example
Have you checked that your file is valid according to the specification? The details are not fresh in my mind, but perhaps a single quote in a name needs to be escaped to be valid?
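On the escaping question: in the commonly cited Newick description, a label containing a single quote has to be wrapped in single quotes, with each inner quote written as two single quotes. A minimal sketch of that convention follows. The helper names `quote_label` and `unquote_label` are hypothetical and are not part of Biopython or ETE, and the sketch ignores the rule that underscores in unquoted labels may be read as blanks.

```python
import re

# Characters that force quoting under the common Newick description:
# whitespace, parentheses, square brackets, single quotes, colons,
# semicolons and commas.
_NEEDS_QUOTES = re.compile(r"[\s()\[\]':;,]")


def quote_label(name: str) -> str:
    """Hypothetical helper: return `name` as a Newick label, quoted only if needed."""
    if name and not _NEEDS_QUOTES.search(name):
        return name
    # Inside a quoted label, a single quote is written as two single quotes.
    return "'" + name.replace("'", "''") + "'"


def unquote_label(label: str) -> str:
    """Hypothetical helper: invert quote_label for one isolated label token."""
    if len(label) >= 2 and label[0] == "'" and label[-1] == "'":
        return label[1:-1].replace("''", "'")
    return label


print(quote_label("Burton's_mouthbreeder"))  # 'Burton''s_mouthbreeder'
```

Under this convention a name such as Node'Name would be written as 'Node''Name' rather than with a backslash, and reading it back with `unquote_label` recovers the original string.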
I've hit a superficially similar issue related to single quotes: they don't round-trip correctly. A reproducible example is very simple and doesn't require loading a huge genome:

from Bio import Phylo

# Build a two-node tree whose child name contains a single quote.
tree = Phylo.BaseTree.Tree(root=Phylo.BaseTree.Clade(name="Root"))
clade_with_quote = Phylo.BaseTree.Clade(name="Node'Name")
tree.root.clades.append(clade_with_quote)

# Write the tree to disk, then read it back.
with open("tree_with_quote.newick", "w") as file:
    Phylo.write(tree, file, "newick")
with open("tree_with_quote.newick", "r") as file:
    tree_read = Phylo.read(file, "newick")

# Compare node names before and after the round trip.
original_node_names = [clade.name for clade in tree.find_clades()]
read_node_names = [clade.name for clade in tree_read.find_clades()]
mismatches = [(original, read) for original, read in zip(original_node_names, read_node_names) if original != read]
print(f"Mismatches: {mismatches}")
# Mismatches: [("Node'Name", "Node\\'Name")]

I've opened two separate issues related to erroneous serialization and deserialization: #4536 and #4537.
Hi everyone,
It seems there's a bug in the Phylo Newick parser: parsing fails if a node name contains a single quote (e.g. "Burton's_mouthbreeder"). A similar issue was raised before (#887), though in that case the request was to allow parentheses in quoted node names.
Setup
I am reporting a problem with Biopython version, Python version, and operating
system as follows:
Steps to reproduce
You can reproduce the same error if you try to parse the multiz100way Newick tree with common names from the UCSC genome browser (download).
When I try to parse the UCSC tree file with common names, I get the following error message:
Now, if I remove the single quotes from the node names, the error doesn't arise:
Other parsers (e.g. ETE Toolkit) don't have issues with single quotes in the node names.
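As a rough illustration of the quote-removal workaround mentioned above (not the code used in this report), the sketch below strips single quotes from Newick text before parsing. The helper `strip_single_quotes` and the two-leaf tree are made up for the example; the tree is not the UCSC file.

```python
import io


def strip_single_quotes(newick_text: str) -> str:
    """Hypothetical helper: drop every single quote from Newick text.

    Caveat: this also removes the delimiters of properly quoted labels,
    so it is only safe when quotes occur solely as apostrophes in names.
    """
    return newick_text.replace("'", "")


raw = "(Human:0.1,Burton's_mouthbreeder:0.2);"  # made-up tree for illustration
cleaned = strip_single_quotes(raw)
handle = io.StringIO(cleaned)  # file-like object a parser can read from
print(cleaned)  # (Human:0.1,Burtons_mouthbreeder:0.2);
```

The resulting `handle` could then be passed to `Phylo.read(handle, "newick")`. The names are altered in the process, so this only sidesteps the error rather than preserving the original labels.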