[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MMCIFParser warning and IUPAC extended one letter codes #4202

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

Truman-Xu
Copy link
  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #...

@peterjc
Copy link
Member
peterjc commented Dec 15, 2022

What issue are you trying to fix? Those dictionaries were separate for a reason.

@Truman-Xu
Copy link
Author

4 out of 6 entries in the dictionary exist in structures in PDB (a total of 90 structures), and you would fail when you read the sequence from these structures without the extended entries for 3to1 conversion. I don't think the current implementation is providing a separate dictionary for 3to1 conversion for these 6 residues. Also, what's the reason to separate them for 3to1 conversion?

Below is the structure count for the residues in the dictionary

  • ASX: 6
  • GLX: 5
  • SEC: 75
  • PYL: 4

@peterjc
Copy link
Member
peterjc commented Dec 15, 2022

Many of the 3-letter codes used in the PDB are not defined in the IUPAC listing, thus a separate listing.

@Truman-Xu
Copy link
Author

But what is the harm in including those in the extended 3to1 dict?

@peterjc
Copy link
Member
peterjc commented Dec 15, 2022

There are PDB extensions to the core 3-letter protein names which are not IUPAC extensions.

@Truman-Xu
Copy link
Author
Truman-Xu commented Dec 15, 2022

Sorry, I have trouble understanding this. Isn't the protein_letters_3to1_extended supposed to be everything including the residue names defined by IUPAC? The current one does not include the reverse mapping of the additional entries from protein_letters_1to3_extended from IUPACData.py

i.e. The reverse mapping of these residue names is not included in the extended 3to1 dict

{"B": "Asx", "X": "Xaa", "Z": "Glx", "J": "Xle", "U": "Sec", "O": "Pyl"}

@Truman-Xu
Copy link
Author

For example,

from Bio.PDB.Polypeptide import protein_letters_3to1_extended
protein_letters_3to1_extended['SEC']   #SELENOCYSTEINE (U)

This would raise a KeyError

@Truman-Xu
Copy link
Author
Truman-Xu commented Dec 15, 2022

Moreover, the 20 canonical residue names are indeed in the protein_letters_3to1_extended as expected

from Bio.Data.PDBData import protein_letters_3to1
from Bio.Data.PDBData import protein_letters_3to1_extended

canonical_resnames = set(protein_letters_3to1.keys())
extended_resnames = set(protein_letters_3to1_extended.keys())
canonical_resnames.issubset(extended_resnames)

This would return True, and this is because the 20 canonical residue names are defined in this extended dict through hard coding, but the IUPAC extended ones are not

@JoaoRodrigues
Copy link
Member

As Peter mentioned, these dictionaries refer to different sources. The dict in IUPACData refers to IUPAC nomenclature, while PDBData is a collection of data from structures deposited in the wwPDB. For example, Xle is a IUPAC aminoacid, but it does not match any monomer entry in the PDB.

The ones you did find in structures should be added to PDBData. IIRC, I updated the dictionary by parsing the components-pub.cif file made available by wwPDB, but I might have missed some indeed.

@Truman-Xu
Copy link
Author

Got it! Thank you for the explanations!

@@ -15,13 +15,21 @@
# The 'fmt:off' lines prevent black for formatting the dictionaries.

from Bio.Data.IUPACData import protein_letters_3to1 as _protein_letters_3to1
from Bio.Data.IUPACData import (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this formatted by black?

@@ -215,7 +215,6 @@ def _build_structure(self, structure_id):
except ValueError:
serial = atom_serial_list[i]
warnings.warn(
"PDBConstructionWarning: "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an unrelated change, is it only for formatting? Does it avoid any repetition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would avoid some repetition in the default traceback message.


# fmt: off
protein_letters_3to1_extended = {
**protein_letters_3to1_extended_one_letter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to my last reply in the general comment thread as to why I don't think this is necessarily appropriate. Some of the entries in IUPACData will not be present in PDB files, and as such should not be included here. I'd rather this dictionary be updated manually rather than a bulk update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants