[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

value-less qualifiers get an empty value in EMBL export #902

Open
satta opened this issue Jul 27, 2016 · 12 comments · May be fixed by #905
Open

value-less qualifiers get an empty value in EMBL export #902

satta opened this issue Jul 27, 2016 · 12 comments · May be fixed by #905
Labels

Comments

@satta
Copy link
satta commented Jul 27, 2016

When reading/writing EMBL files, qualifiers without values like /pseudo get an empty value after exporting. Biopython 1.67. Example:

#!/usr/bin/env python
from Bio import SeqIO
import sys

handle = open(sys.argv[1], "rU")
records = []
for record in SeqIO.parse(handle, "embl"):
    records.append(record)

SeqIO.write(records, sys.stdout, "embl")
$ grep pseudo test.embl
FT                   /pseudo
$ python pseudo_test.py test.embl | grep pseudo
FT                   /pseudo=""

on the following example file:

test.embl.zip

This fails validation by ENA's pre-submission validator.

@peterjc
Copy link
Member
peterjc commented Jul 27, 2016

Can you check what the parser did? It should be parsing /pseudo as a dictionary entry of feature.qualifiers['pseudo'] = None (not an empty list), which the writer handles as a special case:

https://github.com/biopython/biopython/blob/master/Bio/SeqIO/InsdcIO.py#L300

@peterjc peterjc added the Bug label Jul 27, 2016
@satta
Copy link
Author
satta commented Jul 27, 2016

It's definitely an empty list in the example:

qualifiers:
    Key: codon_start, Value: ['1']
    ...
    Key: pseudo, Value: ['']
    ---

Edit: Technically, it's not even an empty list, but a list containing only an empty string.

@peterjc
Copy link
Member
peterjc commented Jul 27, 2016

Thanks - looks like something was broken in the parser, and we can't have any tests at the moment which check this explicitly.

@satta
Copy link
Author
satta commented Jul 28, 2016

OK many thanks, for now I will add a workaround. What do you mean with 'we can't have any tests which check this explicitly' -- are you saying there are no tests but there should be or that you aren't intending to test this at all?

@peterjc
Copy link
Member
peterjc commented Jul 28, 2016

Sorry, poor phrasing, I meant that fact this bug occurred implies we currently don't have test checking for this. We should have a test for this.

@peterjc
Copy link
Member
peterjc commented Aug 2, 2016

It looks like the parser was changed a while back. Switching the parser to use feature.qualifiers['pseudo'] =[None] here has some side effects related to the BioSQL tests, and the nasty special case of GenBank/empty_feature_qualifier.gb which does this:

                     /note="This is a correct note, the following one isn't"
                     /note

That is probably invalid, but might be stored as feature.qualifiers['note'] = ['This is a correct note, the following one isn't', None] and the more typical value-less example could be feature.qualifiers['pseudo'] =[None] which would be consistent with always being a list.

peterjc added a commit to peterjc/biopython that referenced this issue Aug 2, 2016
This is to address biopython#902, and a possibly accidental
change to the original parser behaviour. The GenBank
and EMBL writers expect None to write a valueless
qualifier (rather than an empty string).
@peterjc
Copy link
Member
peterjc commented Aug 2, 2016

The good news is for the GenBank/EMBL output, you can use either feature.qualifiers['pseudo'] = None or feature.qualifiers['pseudo'] = [None] which is helpful.

@peterjc
Copy link
Member
peterjc commented Aug 2, 2016

Since this overlaps with BioSQL, I filed biosql/biosql#6

@satta
Copy link
Author
satta commented Aug 2, 2016

Well, this would work for setting new pseudo entries, but what I wanted was keeping the pseudo qualifier untouched in features read from EMBL. At the moment I am writing the EMBL output into a StringIO and then cleaning it up afterwards anyway (not just the /pseudos but also the ID line).

@peterjc
Copy link
Member
peterjc commented Aug 2, 2016

Pull request #905 makes the parser changes I outlined (so we can read and then write /pseudo correctly), but the BioSQL issue blocks this change as is. Further work needed...

@sfehrmann
Copy link

Hi peterjc, I saw this is still handled like this in SeqIO.write and the code http://biopython.org/DIST/docs/api/Bio.GenBank-pysrc.html#_FeatureConsumer.feature_qualifier
Currently I'm fixing this with an additional sed routine, but on the longer term it would of course preferable to have this handled according the the NCBI specs. This will avoid other programs behave unpredictable when they rely on these genbank files. It's only an minor thing, but this is why the bug is easy to miss in the genbank file, too.
Thank you

@peterjc
Copy link
Member
peterjc commented Jun 15, 2018

@Carambakaracho Can you have a look at #905?

peterjc added a commit to peterjc/biopython that referenced this issue Jun 15, 2018
This is to address biopython#902, and a possibly accidental
change to the original parser behaviour. The GenBank
and EMBL writers expect None to write a valueless
qualifier (rather than an empty string).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants