[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134

Open
corneliusroemer opened this issue Oct 15, 2022 · 2 comments · May be fixed by #4139
Open

Comments

@corneliusroemer
Copy link

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:52) 
[Clang 13.0.1 ]
CPython
macOS-12.6-x86_64-i386-64bit
1.79

Expected behaviour

Phylo.parse(file,"newick") will raise an Exception when a tree name invalidly contains whitespace in the middle of a node name as specified in the specifications:
image
https://evolution.genetics.washington.edu/phylip/newicktree.html

Actual behaviour

No Exception is raised. Instead, anything that follows the first blank is simply ignored. This is a problem, because I'm handwriting a Newick tree and occasionally forget to put a comma. In that case Phylo should complain. But it doesn't.

Steps to reproduce

from io import StringIO
from Bio import Phylo
newick_string = """(A:0.1,invalid name with whitespace:0.2,(C:0.3,D:0.4)E:0.5)F;"""
Phylo.draw_ascii(Phylo.read(StringIO(newick_string), "newick"))

Result:

  ______ A
 |
_|_____________ whitespace
 |
 |                                     _____________________ C
 |____________________________________|
                                      |_____________________________ D

Wrong because it should error.

@corneliusroemer
Copy link
Author
corneliusroemer commented Oct 15, 2022

Potentially related to: #1782

Can't find an official specification, but many explanations of the Newick format mention space in identifiers is invalid:
image

The whole line is loaded as one unit, so no white spaces or tabs are allowed in any of the identifiers or numbers or before or after parentheses, kommas or colons.

Source: http://bioinformatics.intec.ugent.be/MotifSuite/treeformat.php

Tips are represented by their names. A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets.
Because you may want to include a blank in a name, it is assumed that an underscore character ("_") stands for a blank; any of these in a name will be converted to a blank when it is read in. Any name may also be empty: a tree like
Blanks can be inserted at any point except in the middle of a species name or a branch length.

Source: https://evolution.genetics.washington.edu/phylip/newicktree.html

Scikit bio's formal grammar shows that whitespace is invalid, so scikit-bio will error, as I would expect Bio.Phylo to do as well.

image

Source: http://scikit-bio.org/docs/0.2.2/generated/skbio.io.newick.html

@corneliusroemer corneliusroemer changed the title Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace Oct 17, 2022
@amatria
Copy link
amatria commented Oct 20, 2022

The issue is that the current regex expression that matches an unquoted label returns several tokens for an invalid unquoted label string:

imagen

Therefore, _parse_tree iterates over the four different unquoted label tokens and updates the node's name four times in a row. So, the final node's name ends up being the last token in the regex group above. That is why your label ends up being whitespace and the function does not throw and exception.

The solution I have come up with is to return the entire unquoted label string with whitespaces, carriage returns, etc. included, and do the checking manually when the _parse_tree method encounters an unquoted label token.

The following changes should do the trick:

# NewickIO.py

UNSAFE_CHARS = ['\r', '\n', '\t', '\f', '\v', ' ']
tokens = [
    (r"[^\(\)\[\]\'\:\;\,]+", "unquoted node label"),
]

# ...

def _parse_tree(self, text):
    # ...
    for match in tokens:
        # ...
        else:
            # unquoted node label
            token_has_unsafe_char = any([c in UNSAFE_CHARS for c in token])
            if token_has_unsafe_char:
                raise NewickError(f"Invalid character detected in node label '{token}'")
            current_clade.name = token

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants