
bio/genbank/scanner.py issue when feature qualifier key wraps to the next line [SUGGESTION] #4694

Open
vmkhot opened this issue Apr 9, 2024 · 3 comments

Comments

vmkhot commented Apr 9, 2024

Hello,

I came across the following issue when using Biopython to parse GenBank files produced by the "pharokka" program. My parsing script fails with a ValueError on this particular record.

This is an example of a problematic entry. The specific problem is with this qualifier, whose key wraps onto a second line:
/resistance-nodulation-cell division (RND) antibiotic efflux pump="true"

     CDS             568..951
                     /ID="ERZ1035062_ERZ1035062.110-NODE-110-length-34883-cov-5.
                     965832_CDS_0002"
                     /transl_table=11
                     /phrog="30333"
                     /top_hit="p76161 VI_12486"
                     /locus_tag="ERZ1035062_ERZ1035062.110-NODE-110-length-34883
                     -cov-5.965832_CDS_0002"
                     /function="unknown function"
                     /product="hypothetical protein"
                     /CARD_short_name="marA"
                     /AMR_Gene_Family="General Bacterial Porin with reduced
                     permeability to beta-lactams"
                     /resistance-nodulation-cell division (RND) antibiotic
                     efflux pump="true"
                     /CARD_species="Escherichia coli str. K-12 substr. W3110"
                     /source="Pyrodigal-gv_0.2.0"
                     /score="40.0"
                     /phase="0"
                     /translation="MSRRNTDAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRM
                     FKKETGHSLGQYIRSRKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPP
                     HKYRMTNMQGESRFLHPLNHYNS*"

The script breaks because scanner.py cannot handle qualifier keys that wrap onto the next line, although it does handle wrapped values.
After a lot of trial and error debugging my own script, my colleague and I changed scanner.py at lines 332-336 to add this code:

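                    if i == -1:
                        # No "=" on this line: either a qualifier with no
                        # value (e.g. /pseudo) or a key that wraps to the next line.
                        key = line[1:]
                        a = next(iterator)
                        if a[0] != "/":
                            # Continuation line: join the wrapped key, then
                            # split key from value at the first "=".
                            j = a.find("=")
                            key = key.strip() + " " + a[:j]
                            qualifiers.append((key, a[j + 1:]))
                        else:
                            qualifiers.append((key, None))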

This effectively solved our issue and we wanted to suggest it or something similar as an improvement to the scanner.py module.
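One caveat with the patch as posted: `next(iterator)` is called before checking whether the following line starts a new qualifier, so in the `else` branch that line appears to be consumed and never parsed. A minimal standalone sketch of an alternative is to join continuation lines first and split at `=` afterwards. The function name `parse_qualifier_lines` is hypothetical and not part of Biopython, and joining with a single space is an assumption that would not suit wrapped identifiers or `/translation` values.

```python
def parse_qualifier_lines(lines):
    """Group qualifier lines into (key, value) pairs.

    A line starting with "/" opens a qualifier; any other line is treated as
    a continuation of the previous one. Continuations are joined before the
    text is split at the first "=", so a wrapped key is handled the same way
    as a wrapped value.
    """
    joined = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("/"):
            joined.append(line)
        elif joined:
            # Assumption: continuation text is joined with one space.
            joined[-1] += " " + line
        # A continuation with no qualifier before it is ignored.

    qualifiers = []
    for text in joined:
        key, sep, value = text[1:].partition("=")
        # No "=" at all means a flag qualifier such as /pseudo.
        qualifiers.append((key, value.strip('"') if sep else None))
    return qualifiers
```

Because nothing is pulled from the iterator ahead of time, a flag qualifier such as `/pseudo` followed by another qualifier keeps both entries.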

Thanks!

Varada

Member
peterjc commented Apr 10, 2024

My gut reaction is the pharokka tool is pushing the GenBank format too far, not just trying to wrap the qualifier keys, but using spaces and brackets in them too. Have you got in touch with them about this issue?

I do not think this should be supported and parsed as if it were valid by Biopython. I'm more open to a warning message, ignoring some lines, and continuing as best we can (either at the next qualifier, or the next feature).
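The "warn, ignore, and continue" behaviour described here could look roughly like the sketch below. It is not Biopython code: the `QualifierWarning` class, the `keep_valid_qualifiers` function, and the key pattern (letters, digits, underscore, hyphen) are all assumptions made for illustration, and the real rule for a valid key may differ.

```python
import re
import warnings

# Hypothetical rule for this sketch: letters, digits, "_" and "-" only.
VALID_KEY = re.compile(r"^[A-Za-z0-9_-]+$")


class QualifierWarning(UserWarning):
    """Hypothetical warning category used only in this sketch."""


def keep_valid_qualifiers(pairs):
    """Return the (key, value) pairs whose key looks valid, warning on the rest."""
    kept = []
    for key, value in pairs:
        if VALID_KEY.match(key):
            kept.append((key, value))
        else:
            # Warn and move on to the next qualifier instead of raising.
            warnings.warn(
                f"Ignoring malformed qualifier key: {key!r}", QualifierWarning
            )
    return kept
```

A caller who would rather fail fast can still turn the warning into an exception with `warnings.simplefilter("error", QualifierWarning)`.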

Member
peterjc commented Apr 15, 2024

See #4703 which will add warnings when trying to write invalid annotations like this. We might consider adding warnings on parsing as well (where the parser currently accepts things like over-long qualifier keys).

Author
vmkhot commented Apr 15, 2024

Thanks for your reply, and thanks for adding the warnings when writing GenBank files. I can certainly understand that Biopython should not accept GenBank files with such long keys, but a warning plus "move to the next qualifier" would be an amazing compromise! My biggest issue was that my entire loop broke at one qualifier in a random file, and I had 200000 records x 3000 gbk files to parse. I would have preferred the record to be thrown away, but this won't be the case for all users and their gbk files.
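Until the parser itself warns and continues, one way to keep a long batch running is to catch the error per file. The sketch below assumes a hypothetical helper `parse_all` and a caller-supplied `parse_file` callable; with Biopython this might be `lambda p: SeqIO.parse(p, "genbank")`. Records that follow the failing one in the same file are lost, on the assumption that the record iterator cannot resume after raising.

```python
def parse_all(paths, parse_file, on_error=None):
    """Yield (path, record) pairs, abandoning any file that fails to parse.

    parse_file is any callable that takes a path and returns an iterable of
    records. on_error, if given, is called with (path, exception).
    """
    for path in paths:
        try:
            for record in parse_file(path):
                yield path, record
        except ValueError as err:
            # Report the failure and carry on with the next file.
            if on_error is not None:
                on_error(path, err)
```

Collecting the failing paths through `on_error` makes it possible to inspect or report the problematic files after the run.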

I will definitely bring this up with the pharokka developers!
