BUG: Phylo.parse(file,"newick") does not throw error if name invalidly contains whitespace #4134
Comments
Potentially related to: #1782
I can't find an official specification, but many explanations of the Newick format state that a space in an identifier is invalid:
Source: http://bioinformatics.intec.ugent.be/MotifSuite/treeformat.php
Source: https://evolution.genetics.washington.edu/phylip/newicktree.html
scikit-bio's formal grammar shows that whitespace is invalid, so scikit-bio raises an error, as I would expect Bio.Phylo to do as well.
Source: http://scikit-bio.org/docs/0.2.2/generated/skbio.io.newick.html
The issue is that the current regular expression that matches an unquoted label returns several tokens for an invalid unquoted label string:
Therefore,
The solution I have come up with is to return the entire unquoted label string, with whitespace, carriage returns, etc. included, and do the checking manually when the
The following changes should do the trick:
    # NewickIO.py
    UNSAFE_CHARS = ['\r', '\n', '\t', '\f', '\v', ' ']
    tokens = [
        (r"[^\(\)\[\]\'\:\;\,]+", "unquoted node label"),
    ]
    # ...
    def _parse_tree(self, text):
        # ...
        for match in tokens:
            # ...
            else:
                # unquoted node label: reject whitespace and control characters
                token_has_unsafe_char = any(c in UNSAFE_CHARS for c in token)
                if token_has_unsafe_char:
                    raise NewickError(f"Invalid character detected in node label '{token}'")
                current_clade.name = token
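To make the idea in the proposal above concrete, here is a minimal standalone sketch that uses only the standard library `re` module and does not touch Biopython itself. The whitespace-excluding pattern `SPLITTING_LABEL` is an assumption about what a label regex that splits on blanks might look like; `LabelError` and `check_unquoted_tokens` are hypothetical names invented for illustration. The sketch also assumes input without quoted labels or bracketed comments, and it strips padding around delimiters so that a tree written as `(A, B);` is still accepted, which the proposal above does not spell out.

```python
import re

# Characters treated as invalid inside an unquoted label (list taken from the proposal).
UNSAFE_CHARS = frozenset("\r\n\t\f\v ")

# Assumed "splitting" pattern: whitespace is excluded, so a label breaks at each blank.
SPLITTING_LABEL = re.compile(r"[^\s\(\)\[\]\'\:\;\,]+")
# Pattern from the proposal: whitespace stays inside the match so it can be checked.
WHOLE_LABEL = re.compile(r"[^\(\)\[\]\'\:\;\,]+")


class LabelError(ValueError):
    """Hypothetical stand-in for Biopython's NewickError."""


def check_unquoted_tokens(newick):
    """Return the unquoted tokens (labels and branch-length text) of a Newick
    string, raising LabelError if any of them contains inner whitespace."""
    found = []
    for match in WHOLE_LABEL.finditer(newick.strip()):
        # Assumption: padding next to a delimiter is harmless, so strip it first.
        token = match.group().strip()
        if not token:
            continue
        if any(c in UNSAFE_CHARS for c in token):
            raise LabelError(f"Invalid character detected in node label {token!r}")
        found.append(token)
    return found


if __name__ == "__main__":
    print(SPLITTING_LABEL.findall("(A B,C);"))  # ['A', 'B', 'C']
    print(WHOLE_LABEL.findall("(A B,C);"))      # ['A B', 'C']
    print(check_unquoted_tokens("(A, B);"))     # ['A', 'B']
```

With the whole-label pattern, a missing comma such as `(A B,C);` surfaces as one token `A B`, so the check can raise instead of letting part of the text go unnoticed. Whether padding should be stripped or rejected is a design choice that the maintainers would need to settle.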
Setup
I am reporting a problem with Biopython version, Python version, and operating
system as follows:
Expected behaviour
Phylo.parse(file, "newick") should raise an exception when a node name contains whitespace in the middle, which is invalid according to the format description: https://evolution.genetics.washington.edu/phylip/newicktree.html
Actual behaviour
No Exception is raised. Instead, anything that follows the first blank is simply ignored. This is a problem, because I'm handwriting a Newick tree and occasionally forget to put a comma. In that case Phylo should complain. But it doesn't.
Steps to reproduce
Result:
This is wrong: parsing should raise an error.