Over 4-char long chain identifiers in the PDB

DNA and RNA are biological macromolecules consisting of long chains of nucleotides. In PDB coordinate files, each DNA/RNA chain is assigned a unique identifier. In the legacy PDB format, the chain identifier is defined as exactly one alphanumeric character. In the mmCIF format, its length is flexible: identifiers are normally up to 4 characters long, but assembly files can contain identifiers longer than 4 characters (as of May 2022, see examples).

Recently, I was approached with the following bug report where DSSR v2.4-2021nov11 was used:

Processing file '8feo-assembly1.cif'

[i] '8feo-assembly1.cif' taken as in .cif format by file extension.

*** buffer overflow detected ***: terminated

Aborted

I ran a newer version of DSSR (including the current release v2.5.2-2025apr03) on 8feo-assembly1.cif without any issue, as shown below:

# x3dna-dssr -i=8feo-assembly1.cif -o=8feo.out
[i] '8feo-assembly1.cif' taken as in .cif format by file extension.

Processing file '8feo-assembly1.cif'
[w] chain id 'AAA-2' > 4 chars
...

# Excerpt from 8feo.out
    no. of DNA/RNA chains: 2 [AAA=16,AAA-2=16]
    no. of nucleotides:    32
...
List of 16 base pairs
     nt1            nt2            bp  name        Saenger   LW   DSSR
   1 AAA.A1         AAA-2.U16      A-U WC          20-XX     cWW  cW-W
   2 AAA.G2         AAA-2.C15      G-C WC          19-XIX    cWW  cW-W
   3 AAA.A3         AAA-2.U14      A-U WC          20-XX     cWW  cW-W

The message [w] chain id 'AAA-2' > 4 chars warns that the chain identifier ‘AAA-2’ exceeds the documented 4-character limit.

In addition to 8feo, similar issues were also fixed for the related PDB entries 5a79, 6a7a, 8fep, 8feq, 7umk, and 4v3p. Note that 4v3p is a eukaryotic polyribosomal assembly that takes several hours to run on a MacBook Pro with 32GB RAM.


Some background information on how DSSR handles chain identifiers for mmCIF format files

When mmCIF support was first added to DSSR in 2013, I hard-coded the chain identifier to 4 characters, following the documentation. In early 2024, when running DSSR on weekly updated PDB entries, I noticed a core dump with PDB entry 8feo for its biological assembly 1. At that time, I was not aware of the update on mmCIF-Formatted Assembly Files and its expansion of chain identifiers for symmetry-related copies: in PDB assembly files, -# is appended to any chain that is generated by a symmetry operation. So if the base chain id has 3 characters (e.g., AAA), the symmetry-related chain will have at least 5 characters (e.g., AAA-2).

That is the case for PDB entry 8feo: its chain AAA-2 is symmetry-related to the asymmetric unit chain AAA. Since AAA-2 (5 characters) exceeds the hard-coded 4-character limit, DSSR wrote past the end of a fixed-size array in C and crashed. After recognizing the issue, I increased the chain identifier limit in DSSR to 8 characters, more than enough for all current PDB entries. Moreover, DSSR now performs a sanity check on chain identifier length: it reports a diagnostic message, as shown above, for identifiers longer than 4 characters (e.g., AAA-2), and automatically shortens any identifier that exceeds the enlarged limit. DSSR is now more robust and user-friendly: instead of simply crashing, it communicates helpful information about unusual cases to draw users' attention.
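The length check described above can be illustrated with a small Python sketch. This is not DSSR's actual code (DSSR is written in C); the function name `check_chain_id`, the wording of the second message, and the choice to shorten by keeping the leading characters are all assumptions made for illustration. Only the 4- and 8-character thresholds and the format of the first warning come from the text above.

```python
# Illustrative sketch only; not DSSR's implementation.
DOCUMENTED_LIMIT = 4  # mmCIF limit quoted in the documentation
ENLARGED_LIMIT = 8    # enlarged limit mentioned in this post


def check_chain_id(chain_id):
    """Return (chain_id, messages), shortening ids beyond the enlarged limit.

    Assumption: shortening keeps the first ENLARGED_LIMIT characters.
    """
    messages = []
    if len(chain_id) > DOCUMENTED_LIMIT:
        # Same format as the warning shown in the DSSR output above.
        messages.append(f"[w] chain id '{chain_id}' > {DOCUMENTED_LIMIT} chars")
    if len(chain_id) > ENLARGED_LIMIT:
        shortened = chain_id[:ENLARGED_LIMIT]
        # Hypothetical message text, made up for this sketch.
        messages.append(f"[w] chain id '{chain_id}' shortened to '{shortened}'")
        chain_id = shortened
    return chain_id, messages


if __name__ == "__main__":
    for cid in ("A", "AAA", "AAA-2"):
        print(check_chain_id(cid))
```

The point of the design is that an unexpected identifier produces a warning and a usable result rather than an out-of-bounds write.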

Taking this opportunity, I have also proactively updated DSSR to support long atom names, residue names, and segment ids, in preparation for future id changes. Tracing issues to their root causes and fixing them systematically is a key part of what makes DSSR a reliable tool for structural bioinformatics. Tests have been added to the quality control infrastructure to ensure that all these new features work as expected.

Nowadays, the vast majority (over 90%) of users’ questions about DSSR can be answered straight away simply because they have already been addressed in advance, as shown in the above example for long chain identifiers. I'm always on the lookout for issues reported on the 3DNA Forum or received by email, Zoom, or in person, and more systematically for issues found by running DSSR on weekly released PDB entries and on files uploaded to the web services (g4.x3dna.org and skmatic.x3dna.org). Every issue is an opportunity to further polish DSSR and make it better. Overall, users’ feedback is invaluable to me: I take it as an asset instead of a burden.


Documentation on chain identifiers in PDB and mmCIF formats

PDB format

The Coordinate Section in the PDB format documentation contains the following for ATOM/HETATM records:

The ATOM records present the atomic coordinates for standard amino acids and nucleotides. ... Non-polymer chemical coordinates use the HETATM record type.

Record Format
COLUMNS        DATA  TYPE    FIELD        DEFINITION
-----------------------------------------------------------
22             Character     chainID      Chain identifier.
**Details**
- Non-blank alphanumerical character is used for chain identifier.

So the chain identifier in PDB format is a single alphanumeric character in column 22 of the ATOM/HETATM records.
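As a minimal sketch of that rule, the chain identifier can be read from a fixed-column record by taking the single character at column 22 (index 21 when counting from zero). The helper name `pdb_chain_id` and the sample record are made up for illustration; the record is assembled field by field so the column positions stay visible.

```python
def pdb_chain_id(line):
    """Return the chain id (column 22) of an ATOM/HETATM record, else None."""
    if line.startswith(("ATOM  ", "HETATM")) and len(line) >= 22:
        return line[21]  # column 22 is index 21 (0-based)
    return None


# A made-up record prefix, built from its fixed-width fields:
record = (
    "ATOM  "   # columns 1-6:   record name
    "    1"    # columns 7-11:  atom serial number
    " "        # column 12:     blank
    " P  "     # columns 13-16: atom name
    " "        # column 17:     alternate location indicator
    "  A"      # columns 18-20: residue name
    " "        # column 21:     blank
    "B"        # column 22:     chain identifier
    "   1"     # columns 23-26: residue sequence number
)

if __name__ == "__main__":
    print(pdb_chain_id(record))
```

Because the field is exactly one column wide, an identifier such as AAA-2 cannot be represented in this format at all.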

mmCIF format

  • Large Structures Represented in mmCIF/PDBx: Chain identifiers of up to 4 characters are permitted. The PDB chain identifier corresponds to the "_atom_site.auth_asym_id" data item.

  • News item on Distributing PDBx/mmCIF-Formatted Assembly Files

  • GitHub repo on Sample assembly files in PDBx/mmCIF Format

    These updated PDBx/mmCIF format assembly files will include all symmetry generated copies of each chain within a single model, with distinct chain IDs assigned to each. Generation of distinct chain IDs in assembly files are based upon the following rules:

    • Chain IDs of the original chains from the atomic coordinate file will be retained (e.g., A)
    • Assign unique chain ID for each symmetry copy within a single model. Rules of chain ID assignments:
      • The applied index of the symmetry operator will be appended to the original chain ID separated by a dash (e.g., A-2, A-3, etc.)
      • If there are more than one type of symmetry operators applied to generate symmetry copy, a dash sign will be used between two operators (e.g., A-12-60, A-60-88, etc.)
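The naming rules quoted above can be illustrated by splitting an assembly chain ID back into its original chain ID and the appended operator indices. This is a minimal sketch, assuming the original chain ID itself contains no dash and that every appended part is an integer; the function name `split_assembly_chain_id` is hypothetical.

```python
def split_assembly_chain_id(chain_id):
    """Split 'A-12-60' into ('A', [12, 60]).

    Assumes the original chain id contains no dash. Raises ValueError
    if an appended part is not an integer.
    """
    base, *operators = chain_id.split("-")
    return base, [int(op) for op in operators]


if __name__ == "__main__":
    for cid in ("A", "A-2", "AAA-2", "A-12-60"):
        base, ops = split_assembly_chain_id(cid)
        print(f"{cid}: base={base}, operators={ops}, length={len(cid)}")
```

Running it on the examples makes the length growth explicit: a 3-character base chain such as AAA already gives a 5-character identifier with one single-digit operator, and a second operator lengthens it further.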

Thank you for printing this article from http://x3dna.org/. Please do not forget to visit back for more 3DNA-related information. — Xiang-Jun Lu