BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134

corneliusroemer · 2022-10-15T16:52:03Z

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:52) 
[Clang 13.0.1 ]
CPython
macOS-12.6-x86_64-i386-64bit
1.79

Expected behaviour

Phylo.parse(file,"newick") will raise an Exception when a tree name invalidly contains whitespace in the middle of a node name as specified in the specifications:

https://evolution.genetics.washington.edu/phylip/newicktree.html

Actual behaviour

No Exception is raised. Instead, anything that follows the first blank is simply ignored. This is a problem, because I'm handwriting a Newick tree and occasionally forget to put a comma. In that case Phylo should complain. But it doesn't.

Steps to reproduce

from io import StringIO
from Bio import Phylo
newick_string = """(A:0.1,invalid name with whitespace:0.2,(C:0.3,D:0.4)E:0.5)F;"""
Phylo.draw_ascii(Phylo.read(StringIO(newick_string), "newick"))

Result:

  ______ A
 |
_|_____________ whitespace
 |
 |                                     _____________________ C
 |____________________________________|
                                      |_____________________________ D

Wrong because it should error.

The text was updated successfully, but these errors were encountered:

corneliusroemer · 2022-10-15T16:53:33Z

Potentially related to: #1782

Can't find an official specification, but many explanations of the Newick format mention space in identifiers is invalid:

The whole line is loaded as one unit, so no white spaces or tabs are allowed in any of the identifiers or numbers or before or after parentheses, kommas or colons.

Source: http://bioinformatics.intec.ugent.be/MotifSuite/treeformat.php

Tips are represented by their names. A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets.
Because you may want to include a blank in a name, it is assumed that an underscore character ("_") stands for a blank; any of these in a name will be converted to a blank when it is read in. Any name may also be empty: a tree like
Blanks can be inserted at any point except in the middle of a species name or a branch length.

Source: https://evolution.genetics.washington.edu/phylip/newicktree.html

Scikit bio's formal grammar shows that whitespace is invalid, so scikit-bio will error, as I would expect Bio.Phylo to do as well.

Source: http://scikit-bio.org/docs/0.2.2/generated/skbio.io.newick.html

amatria · 2022-10-20T19:04:41Z

The issue is that the current regex expression that matches an unquoted label returns several tokens for an invalid unquoted label string:

Therefore, _parse_tree iterates over the four different unquoted label tokens and updates the node's name four times in a row. So, the final node's name ends up being the last token in the regex group above. That is why your label ends up being whitespace and the function does not throw and exception.

The solution I have come up with is to return the entire unquoted label string with whitespaces, carriage returns, etc. included, and do the checking manually when the _parse_tree method encounters an unquoted label token.

The following changes should do the trick:

# NewickIO.py

UNSAFE_CHARS = ['\r', '\n', '\t', '\f', '\v', ' ']
tokens = [
    (r"[^\(\)\[\]\'\:\;\,]+", "unquoted node label"),
]

# ...

def _parse_tree(self, text):
    # ...
    for match in tokens:
        # ...
        else:
            # unquoted node label
            token_has_unsafe_char = any([c in UNSAFE_CHARS for c in token])
            if token_has_unsafe_char:
                raise NewickError(f"Invalid character detected in node label '{token}'")
            current_clade.name = token

corneliusroemer changed the title ~~Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace~~ BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace Oct 17, 2022

amatria linked a pull request Oct 20, 2022 that will close this issue

raise and error if an invalid Newick unquoted label name is found while parsing the tree #4139

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134

BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134

BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134

BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134

Comments

Setup

Expected behaviour

Actual behaviour

Steps to reproduce