-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MMCIFParser warning and IUPAC extended one letter codes #4202
base: master
Are you sure you want to change the base?
Conversation
What issue are you trying to fix? Those dictionaries were separate for a reason. |
4 out of 6 entries in the dictionary exist in structures in PDB (a total of 90 structures), and you would fail when you read the sequence from these structures without the extended entries for 3to1 conversion. I don't think the current implementation is providing a separate dictionary for 3to1 conversion for these 6 residues. Also, what's the reason to separate them for 3to1 conversion? Below is the structure count for the residues in the dictionary
|
Many of the 3-letter codes used in the PDB are not defined in the IUPAC listing, thus a separate listing. |
But what is the harm in including those in the extended 3to1 dict? |
There are PDB extensions to the core 3-letter protein names which are not IUPAC extensions. |
Sorry, I have trouble understanding this. Isn't the i.e. The reverse mapping of these residue names is not included in the extended 3to1 dict
|
For example, from Bio.PDB.Polypeptide import protein_letters_3to1_extended
protein_letters_3to1_extended['SEC'] #SELENOCYSTEINE (U) This would raise a KeyError |
Moreover, the 20 canonical residue names are indeed in the from Bio.Data.PDBData import protein_letters_3to1
from Bio.Data.PDBData import protein_letters_3to1_extended
canonical_resnames = set(protein_letters_3to1.keys())
extended_resnames = set(protein_letters_3to1_extended.keys())
canonical_resnames.issubset(extended_resnames) This would return |
As Peter mentioned, these dictionaries refer to different sources. The dict in IUPACData refers to IUPAC nomenclature, while PDBData is a collection of data from structures deposited in the wwPDB. For example, The ones you did find in structures should be added to |
Got it! Thank you for the explanations! |
@@ -15,13 +15,21 @@ | |||
# The 'fmt:off' lines prevent black for formatting the dictionaries. | |||
|
|||
from Bio.Data.IUPACData import protein_letters_3to1 as _protein_letters_3to1 | |||
from Bio.Data.IUPACData import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this formatted by black?
@@ -215,7 +215,6 @@ def _build_structure(self, structure_id): | |||
except ValueError: | |||
serial = atom_serial_list[i] | |||
warnings.warn( | |||
"PDBConstructionWarning: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an unrelated change, is it only for formatting? Does it avoid any repetition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would avoid some repetition in the default traceback message.
|
||
# fmt: off | ||
protein_letters_3to1_extended = { | ||
**protein_letters_3to1_extended_one_letter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to my last reply in the general comment thread as to why I don't think this is necessarily appropriate. Some of the entries in IUPACData
will not be present in PDB files, and as such should not be included here. I'd rather this dictionary be updated manually rather than a bulk update.
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the
CONTRIBUTING.rst
file, have runpre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
Closes #...