Fix MMCIFParser warning and IUPAC extended one letter codes #4202

Truman-Xu · 2022-12-15T01:26:42Z

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #...

peterjc · 2022-12-15T09:16:15Z

What issue are you trying to fix? Those dictionaries were separate for a reason.

Truman-Xu · 2022-12-15T15:26:29Z

4 out of 6 entries in the dictionary exist in structures in the PDB (90 structures in total), and reading the sequence from these structures fails without the extended entries for 3to1 conversion. I don't think the current implementation provides a separate dictionary for 3to1 conversion of these 6 residues. Also, what is the reason for keeping them separate for 3to1 conversion?

Below is the structure count for the residues in the dictionary

ASX: 6
GLX: 5
SEC: 75
PYL: 4

peterjc · 2022-12-15T17:40:33Z

Many of the 3-letter codes used in the PDB are not defined in the IUPAC listing, thus a separate listing.

Truman-Xu · 2022-12-15T17:42:34Z

But what is the harm in including those in the extended 3to1 dict?

peterjc · 2022-12-15T17:47:40Z

There are PDB extensions to the core 3-letter protein names which are not IUPAC extensions.

Truman-Xu · 2022-12-15T18:00:42Z

Sorry, I have trouble understanding this. Isn't the protein_letters_3to1_extended supposed to be everything including the residue names defined by IUPAC? The current one does not include the reverse mapping of the additional entries from protein_letters_1to3_extended from IUPACData.py

i.e. The reverse mapping of these residue names is not included in the extended 3to1 dict

{"B": "Asx", "X": "Xaa", "Z": "Glx", "J": "Xle", "U": "Sec", "O": "Pyl"}

Truman-Xu · 2022-12-15T18:03:28Z

For example,

from Bio.PDB.Polypeptide import protein_letters_3to1_extended
protein_letters_3to1_extended['SEC']   #SELENOCYSTEINE (U)

This would raise a KeyError
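A minimal standalone sketch of the gap, using plain dicts instead of the Biopython modules so it runs without the library. The names iupac_extra_3to1, stand_in_3to1 and lookup are hypothetical and only serve the illustration; stand_in_3to1 is a three-entry placeholder, not the real protein_letters_3to1_extended.

```python
# The six extra one-letter codes quoted above from IUPACData.py.
iupac_extra_1to3 = {"B": "Asx", "X": "Xaa", "Z": "Glx", "J": "Xle", "U": "Sec", "O": "Pyl"}

# Invert the mapping and upper-case the names to match PDB residue names.
iupac_extra_3to1 = {three.upper(): one for one, three in iupac_extra_1to3.items()}

# Hypothetical stand-in for a 3to1 dict that lacks these entries.
stand_in_3to1 = {"ALA": "A", "CYS": "C", "GLY": "G"}


def lookup(resname, table):
    """Return the one-letter code, or None when the residue name is missing."""
    return table.get(resname)


# Without the reverse entries the lookup finds nothing; with them it resolves.
missing = lookup("SEC", stand_in_3to1)
found = lookup("SEC", {**stand_in_3to1, **iupac_extra_3to1})
```

Indexing with square brackets instead of .get() is what turns the missing entry into the KeyError described here.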

Truman-Xu · 2022-12-15T18:57:30Z

Moreover, the 20 canonical residue names are indeed in the protein_letters_3to1_extended as expected

from Bio.Data.PDBData import protein_letters_3to1
from Bio.Data.PDBData import protein_letters_3to1_extended

canonical_resnames = set(protein_letters_3to1.keys())
extended_resnames = set(protein_letters_3to1_extended.keys())
canonical_resnames.issubset(extended_resnames)

This returns True because the 20 canonical residue names are hard-coded in this extended dict, whereas the IUPAC extended ones are not.

JoaoRodrigues · 2022-12-15T23:58:15Z

As Peter mentioned, these dictionaries refer to different sources. The dict in IUPACData refers to IUPAC nomenclature, while PDBData is a collection of data from structures deposited in the wwPDB. For example, Xle is an IUPAC amino acid, but it does not match any monomer entry in the PDB.

The ones you did find in structures should be added to PDBData. IIRC, I updated the dictionary by parsing the components-pub.cif file made available by wwPDB, but I might have missed some indeed.
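One way to picture the manual route, as a rough sketch only: keep just the IUPAC extras that were actually counted in structures earlier in this thread, rather than merging all six. The names structure_counts, curated and skipped are hypothetical, and the counts are the ones reported above, not an independent survey of the wwPDB.

```python
# The six extra one-letter codes quoted earlier from IUPACData.py.
iupac_extra_1to3 = {"B": "Asx", "X": "Xaa", "Z": "Glx", "J": "Xle", "U": "Sec", "O": "Pyl"}

# Structure counts as reported in this thread (residue name -> number of structures).
structure_counts = {"ASX": 6, "GLX": 5, "SEC": 75, "PYL": 4}

# Keep only the residue names that were observed in at least one structure.
curated = {
    three.upper(): one
    for one, three in iupac_extra_1to3.items()
    if structure_counts.get(three.upper(), 0) > 0
}

# The remaining IUPAC names have no reported structures and are left out.
skipped = sorted(
    three.upper()
    for three in iupac_extra_1to3.values()
    if three.upper() not in structure_counts
)
```

With these inputs, curated holds ASX, GLX, SEC and PYL, while XAA and XLE end up in skipped.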

Truman-Xu · 2022-12-16T00:01:23Z

Got it! Thank you for the explanations!

JoaoRodrigues · 2023-01-24T22:53:37Z

Bio/Data/PDBData.py

@@ -15,13 +15,21 @@
 # The 'fmt:off' lines prevent black for formatting the dictionaries.

 from Bio.Data.IUPACData import protein_letters_3to1 as _protein_letters_3to1
+from Bio.Data.IUPACData import (


Was this formatted by black?

JoaoRodrigues · 2023-01-24T22:54:40Z

Bio/PDB/MMCIFParser.py

@@ -215,7 +215,6 @@ def _build_structure(self, structure_id):
            except ValueError:
                serial = atom_serial_list[i]
                warnings.warn(
-                    "PDBConstructionWarning: "


This seems like an unrelated change, is it only for formatting? Does it avoid any repetition?

It would avoid some repetition in the default traceback message.
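The repetition can be reproduced with the standard library alone. This is a sketch under assumptions: the class below is a local stand-in for the Biopython warning category, and the message text and the "<sketch>" file name are made up for the illustration.

```python
import warnings


class PDBConstructionWarning(Warning):
    """Local stand-in for the Biopython warning category of the same name."""


def render(message):
    # formatwarning builds the text that the default warning handler prints,
    # which already includes the category name before the message.
    return warnings.formatwarning(message, PDBConstructionWarning, "<sketch>", 1)


# Message that repeats the category name, as in the line removed by the diff.
with_prefix = render("PDBConstructionWarning: hypothetical message")

# Message without the prefix, as after the change.
without_prefix = render("hypothetical message")
```

In the first rendering the category name appears twice, once from the formatter and once from the message itself; in the second it appears once.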

JoaoRodrigues · 2023-01-24T22:57:04Z

Bio/Data/PDBData.py


 # fmt: off
 protein_letters_3to1_extended = {
+    **protein_letters_3to1_extended_one_letter,


Refer to my last reply in the general comment thread as to why I don't think this is necessarily appropriate. Some of the entries in IUPACData will not be present in PDB files, and as such should not be included here. I'd rather this dictionary be updated manually rather than a bulk update.

Truman-Xu and others added 5 commits December 14, 2022 16:59

Add extended IUPAC protein one letter to 3to1 dict

7e99a73

Fix warning messages on MMCIF Parser

33d57a1

Merge branch 'biopython:master' into master

93786c1

Add extended IUPAC protein one letter to 3to1 dict

4c1cb8b

Merge branch 'master' of github.com:Truman-Xu/biopython

baeea51

Truman-Xu requested a review from JoaoRodrigues as a code owner December 15, 2022 01:26

Merge branch 'biopython:master' into master

dcc6a3f

JoaoRodrigues requested changes Jan 24, 2023
