Least-squares fitting procedures with illustrated examples

The least-squares (LS) fitting procedures presented below make use of well known mathematics. Indeed, the methods are so well known and widely used that it is somewhat difficult to locate the original references. In our previous effort to resolve the discrepancies among nucleic acid conformational analysis programs, we came across a variety of LS fitting procedures. Here we provide a detailed description, with step-by-step examples, of our implementation in 3DNA of two LS fitting algorithms based on a covariance matrix and its eigen-system. This post is the revised version of a note first made available in the “Technical Details” section of earlier 3DNA websites.

LS fitting between standard and experimental bases

Three analysis schemes — CompDNA, Curves/Curves+, and RNA — use LS procedures to fit a standard base with an embedded reference frame to an observed base structure. CompDNA and Curves/Curves+ take advantage of the conventional approach of McLachlan [“Least Squares Fitting of Two Structures.” J. Mol. Biol., 128, 74-79 (1979)], while the RNA program implements a closed-form solution of absolute orientation using unit quaternions first introduced by Horn. The two algorithms are mathematically equivalent for the most general cases, since the unit quaternion can be transformed to the rotation matrix given by McLachlan. The Horn method, however, is more straightforward and generally applicable; it can be applied even when one or both of the structures are perfectly planar, whereas the McLachlan approach fails.

Here we use the ideal adenine geometry derived from the high resolution crystal structures of model nucleosides, nucleotides, and bases. The x-, y-, and z-coordinates of the standard base, taken from the NDB, are listed below in the columns labeled sx, sy, and sz, respectively. s_(average) is the geometric center of the base.

              sx      sy      sz   
  1  N9      0.213   0.660   1.287 
  2  C4      0.250   2.016   1.509 
  3  N3      0.016   2.995   0.619 
  4  C2      0.142   4.189   1.194 
  5  N1      0.451   4.493   2.459 
  6  C6      0.681   3.485   3.329 
  7  N6      0.990   3.787   4.592 
  8  C5      0.579   2.170   2.844 
  9  N7      0.747   0.934   3.454 
 10  C8      0.520   0.074   2.491 
------------------------------------
s_(average): 0.4589  2.4803  2.3778 

We similarly describe the coordinates of one of the adenine bases (the fifth nucleotide in the sequence strand) from the high resolution (1.4 Å) self-complementary d(CGCGAATTCGCG) dodecamer duplex determined by Williams and co-workers (PDB id: 355d). The experimental xyz coordinates are listed below in the columns labeled ex, ey, and ez. The geometric center is e_(average). Note that the atomic serial numbers from the PDB (first column) have been rearranged so that the atoms are in the same order as those of the ideal base listed above.

              ex      ey      ez  
 91  N9     16.461  17.015  14.676 
100  C4     15.775  18.188  14.459
 99  N3     14.489  18.449  14.756
 98  C2     14.171  19.699  14.406
 97  N1     14.933  20.644  13.839
 95  C6     16.223  20.352  13.555
 96  N6     16.984  21.297  12.994
 94  C5     16.683  19.056  13.875
 93  N7     17.918  18.439  13.718
 92  C8     17.734  17.239  14.207
------------------------------------
e_(average):16.1371 19.0378 14.0485

We collect the two sets of xyz coordinates in the 10 × 3 matrices S and E corresponding respectively to the standard and experimental bases. We then construct the 3 × 3 covariance matrix C between S and E using the following formula:

        1             1
 C = ------- [S' E - --- S' i i' E]
      n - 1           n
   =
      0.2782    0.2139   -0.1601
     -1.4028    1.9619   -0.2744
      1.0443    0.9712   -0.6610

Here n, the number of atoms in each base, is 10, and i is an n x 1 column vector consisting of only ones. S' and i' are the transpose of matrix S and column vector i, respectively.

From the nine elements of the C matrix, we subsequently generate the 4 × 4 real symmetric matrix M using the expression:

     | c11+c22+c33     c23-c32       c31-c13        c12-c21    | 
 M = |   c23-c32     c11-c22-c33     c12+c21        c31+c13    | 
     |   c31-c13       c12+c21     -c11+c22-c33     c23+c32    | 
     |   c12-c21       c31+c13       c23+c32      -c11-c22+c33 | 
   =
      1.5792   -1.2456    1.2044    1.6167
     -1.2456   -1.0228   -1.1890    0.8842
      1.2044   -1.1890    2.3447    0.6968
      1.6167    0.8842    0.6968   -2.9011

The largest eigenvalue of matrix M is 4.0335, and its corresponding unit eigenvector is:

 [ q0   q1    q2    q3 ] = [ 0.6135   -0.2878    0.7135    0.1780 ]

The rotation matrix R is deduced from the above eigenvector as below:

     | q0q0+q1q1-q2q2-q3q3    2(q1q2-q0q3)        2(q1q3+q0q2)     | 
 R = |    2(q2q1+q0q3)     q0q0-q1q1+q2q2-q3q3    2(q2q3-q0q1)     | 
     |    2(q3q1-q0q2)        2(q3q2+q0q1)     q0q0-q1q1-q2q2+q3q3 | 
   =
     -0.0817   -0.6291    0.7730
     -0.1923    0.7710    0.6072
     -0.9779   -0.0990   -0.1839

Following coordinate transformation with matrix R, the origin of the standard base is found to be displaced from the experimental structure by:

 o = e_(average) - s_(average) R' = [15.8969 15.7701 15.1802]

The least-squares fitted coordinates (F) of the standard base atoms on the experimental structure are then given by:

 F = S R' + i o
   =
     16.4592   17.0194   14.6699
     15.7747   18.1925   14.4586
     14.4899   18.4519   14.7542
     14.1729   19.6974   14.4070
     14.9343   20.6404   13.8420
     16.2222   20.3472   13.5569
     16.9832   21.2875   12.9925
     16.6829   19.0585   13.8760
     17.9183   18.4437   13.7219
     17.7335   17.2396   14.2062

Here S is the (n x 3) matrix of original coordinates of the standard base, and as noted above, i is an n x 1 column vector consisting of only ones.

The difference matrix (D) between F and E, the (n x 3) matrix of original coordinates of the experimental base, and the root-mean-square (RMS) deviation between the two structures are found as:

 D = E - F
   =
      0.0018   -0.0044    0.0061
      0.0003   -0.0045    0.0004
     -0.0009   -0.0029    0.0018
     -0.0019    0.0016   -0.0010
     -0.0013    0.0036   -0.0030
      0.0008    0.0048   -0.0019
      0.0008    0.0095    0.0015
      0.0001   -0.0025   -0.0010
     -0.0003   -0.0047   -0.0039
      0.0005   -0.0006    0.0008

 RMS deviation = 0.0054

It should be noted that if the standard base is already defined in terms of its reference frame, as in 3DNA (e.g., $X3DNA/config/Atomic_A.pdb), the vector o and the matrix R represent the best-fitted coordinate frame of the experimental base. Moreover, the three axes of the frame given by R are guaranteed to be orthonormal. If you want to get an insight of the LS fitting algorithm and a better understanding of how 3DNA derives its base reference frame, it’d be a valuable experience to repeat the above procedure with $X3DNA/config/Atomic_A.pdb.

Note: the algorithm does not apply to a molecule vs its inversion (an improper rotation) — thanks to Boris Averkiev for reporting this subtle point (see comments below). One possible remedy is to treat this edge case separately.

Base normal

Rather than fit a standard base to experimental coordinates, the CEHS, FREEHELIX, and NUPARM analyses perform a fitting of a LS plane to a set of atoms in order to define the base and base-pair normals. The covariance matrix based on the n x 3 matrix of experimental Cartesian coordinates E is diagonalized to find the vector normal to the best plane. Specifically, C is obtained using the above formula with S substituted by E. The normal vector then lies along the eigenvector that corresponds to the smallest eigenvalue. Note that the coefficient 1/(n-1) in the formula for calculating C has no effect on the direction of the eigenvectors but scales the magnitudes of the eigenvalues.

Using the above adenine base from the high resolution dodecamer duplex as an example, the covariance matrix C is:

 C =
     1.6680   -0.5015   -0.3253
    -0.5015    2.0670   -0.5840
    -0.3253   -0.5840    0.3061

The smallest eigenvalue of C, 8.26e-5, indicates that the base is almost perfectly planar. The corresponding unit eigenvector corresponding to the base normal is:

 Base normal: 0.2737    0.3224    0.9062

Related topics:

---

Comment

Thank you for writing this; exactly what I was looking for. Unfortunately, when trying to follow the example, I noticed that the first and second part don’t match up. Using the quaternion [0.6135, -0.2878, 0.7135, 0.1780] given here, I do indeed get the same good fit and low RMS difference. And when I follow the first steps I obtain the same 4×4 matrix M as you do. However, when I calculate the eigenvalues and vectors of this matrix, the eigenvector corresponding to the largest eigenvalue (indeed 4.0335) is [0.6135, -0.3126, 0.7247, 0.0276]! So I guess there must be a mistake somewhere in the construction of the matrix M. However, I’ve been unable to find the mistake, not even with the help of the excellent paper by Berthold K. P. Horn (J. Opt. Soc. Am. A/Vol. 4, No.4/April 1987). Do you have any idea what I need to do to get the right matrix which gives the correct eigenvector/quaternion?

— ST · 2013-05-01 12:31 · #

---

Thanks for your comment.

So you get the same 4×4 M matrix and the largest eigenvalue of 4.0335, but with a different corresponding eigenvector. I get [0.6135 -0.2878, 0.7135, 0.1780], but yours is: [0.6135, -0.3126, 0.7247, 0.0276].

M = [
      1.5792   -1.2456    1.2044    1.6167
     -1.2456   -1.0228   -1.1890    0.8842
      1.2044   -1.1890    2.3447    0.6968
      1.6167    0.8842    0.6968   -2.9011];

Which software package did you use to calculate the eigenvalues/eigenvectors of M? With Octave (version 3.2.4) and Matlab (R2012b), I can reproduce the eigenvector without any problem, as shown below:

>> M = [
      1.5792   -1.2456    1.2044    1.6167
     -1.2456   -1.0228   -1.1890    0.8842
      1.2044   -1.1890    2.3447    0.6968
      1.6167    0.8842    0.6968   -2.9011];

>> [V, D]=eig(M)

V =

   -0.3126   -0.0276    0.7247    0.6135
   -0.4213   -0.8596    0.0292   -0.2878
   -0.1117   -0.2065   -0.6601    0.7135
    0.8440   -0.4666    0.1956    0.1780

D =

   -4.0336         0         0         0
         0   -0.8684         0         0
         0         0    0.8685         0
         0         0         0    4.0336

Note the last column of V, which is [0.6135 -0.2878, 0.7135, 0.1780], the quaternion you are looking for.

Does this clarify your confusion?

— Xiang-Jun Lu · 2013-05-01 12:53 · #

---

Thank you very much for this information! I was trying to implement this procedure in my perl script.
It worked perfect for your example.
Unfortunately it didn’t work for my own molecule. I calculated the same molecule with two DFT methods and wanted to overlap them. After procedure failed I realized, that atomic coordinates in both molecules are very close, but the second molecule is rotated 180 degrees around Y axis. I manually edited second file (changed + and – for X and Z coordinates) and after that my script worked.
So, maybe this algorithm doesn’t work when the angle of rotation is about 180 degrees? This is just question. Maybe I did something wrong :)
Anyway, thank you again for this explanation!

— Boris Averkiev · 2013-07-10 20:30 · #

---

Hi Boris,

Thanks for stopping by and leaving your comment. I am glad that your found the post helpful.

Regarding the issue you experienced with the 180-degree rotation about y-axis, it should not happen in principle. The algorithm presented here should be robust enough to work perfectly for such cases. If you post here (or email me if you prefer) details, I will have a look to see what went wrong.

Best regards,

— Xiang-Jun Lu · 2013-07-10 21:01 · #

---

Sorry, my fault.
I didn’t take the eigenvector corresponding to maximum eigenvalue.
I just started to use PDL for perl and thought that the first calculated eigenvalue is always maximum. But I was wrong. Now it works perfect.
I have another question – can I use this algorithm directly when the second fragment is the mirror reflection of the first fragment or I should, for example, manually invert all coordinates before comparison?
Thank you!

Boris

— Boris Averkiev · 2013-07-10 21:11 · #

---

Hi Boris,

Glad to see that you’ve figured out where the problem was — as always, it is such little details that make a difference in practice.

The algorithm should also be directly applicable to two structure in mirror reflection. It does not know such things — all it cares about are two sets of xyz coordinates. Have a try and post back how it goes.

All the best!

— Xiang-Jun · 2013-07-10 23:18 · #

---

Hello Xiang-Jun,
I inverted my structure and it doesn’t work now. I tested it for all four eigenvectors. I also took your first structure and compared it with itself inverted. In result I get very small RMS (I think because the molecule is planar), but they should be zeros.

Boris

Initial structure [0.213 0.66 1.287] [ 0.25 2.016 1.509] [0.016 2.995 0.619] [0.142 4.189 1.194] [0.451 4.493 2.459] [0.681 3.485 3.329] [ 0.99 3.787 4.592] [0.579 2.17 2.844] [0.747 0.934 3.454] [ 0.52 0.074 2.491] and its inverted
final structure [ -0.214 -0.66 -1.29] [ -0.25 -2.02 -1.51] [ -0.0155 -2.99 -0.619] [ -0.142 -4.19 -1.19] [ -0.452 -4.49 -2.46] [ -0.681 -3.48 -3.33] [ -0.99 -3.79 -4.59] [ -0.579 -2.17 -2.84] [ -0.747 -0.934 -3.45] [ -0.52 -0.074 -2.49]

RMS [ 0.000782 1.07e-05 -0.000194] [ -0.000277 -3.81e-06 6.88e-05] [ -0.000482 -6.62e-06 0.00012] [ -0.000131 -1.79e-06 3.24e-05] [ 0.000782 1.07e-05 -0.000194] [ -7e-05 -9.61e-07 1.74e-05] [ -4e-05 -5.49e-07 9.92e-06] [ -0.000467 -6.41e-06 0.000116] [ -7.26e-05 -9.96e-07 1.8e-05] [ -2.4e-05 -3.29e-07 5.95e-06]

— Boris Averkiev · 2013-07-11 17:58 · #

---

Hi Boris,

Very interesting observation! I can verify your result of the non-zero rmsd (even though very small, ~10e-4) between your example structure and its invert. This appears to be a very special case — the bug is not due to the planarity of the molecule since a non-planar molecule gives a large non-zero erroneous rmsd value.

Luckily in structural bioinformatics, one is unlikely to perform a ls-fitting between a molecule vs its inversion. In my experience, I have never met a problem like this. One possible fix is to check for this edge case (q0 ~ 0), and treat it separately. You may also find other method a better fit.

Thank for letting me know this bug, and I have added a note in this post.

— Xiang-Jun Lu · 2013-07-11 23:38 · #

---

Hello Xiang-Jun,

So, it means that algorithm can perform overlapping of molecules only using rotations (and translations), but not improper rotations. I remember, when I worked as X-ray crystallographer, the program drawing molecules had command that overlap similar fragments of structure (for example, two symmetrically independent molecules) and the program asked for the list of corresponding atoms for both fragments and also if the second fragment should be inverted.
I like this algorithm. I don’t think there are better methods than this. It is not difficult to test two possible overlapping – standard overlapping and overlapping with one molecule inverted. And after that to choose the best.
Thank you again for this page with detailed explanations!

— Boris Averkiev · 2013-07-12 18:58 · #

---

Hi Boris,

Thanks for the info. It is good to know the subtlety of an otherwise robust least-squares fitting algorithm.

— Xiang-Jun · 2013-07-12 20:02 · #

---

I am trying to use this algorithm in coordinate metrology to align similar (xyz) point sets. I have been able to work the algorithm to the point of calculating the R matrix but do not understand the next two steps. I am not a math major and would appreciate a more detailed explanation and examples of the next steps, namely:

Following coordinate transformation with matrix R, the origin of the standard base is found to be displaced from the experimental structure by: o = e_(average) – s_(average) R’ = [15.8969 15.7701 15.1802]

The least-squares fitted coordinates (F) of the standard base atoms on the experimental structure are then given by: F = S R’ + i o

Any help would be greatly appreciated. Thanks.

— James M. Harshaw · 2013-11-13 13:40 · #

---

Thanks for stopping by and asking questions. Unfortunately, I cannot provide you with more insights other than saying that vector o represents a translation to bring S to E, as shown in the ls-fitted coordinates matrix F. Googling around or reading a textbook should help you more.

— Xiang-Jun · 2013-11-13 19:18 · #

---

Hi Xiang-Jun,

I am working on DNA structure flexibility. I am wondering how can you ensure this best plane is parallel to local x- and y-axis (the short and long axis of a base pair)? i.e., x-axis, y-axis, and the best-plane normal vector to form a orthogonal system?

-Xiaojing

— Xiaojing · 2014-05-07 19:36 · #

---

Hi Xiaojing,

Thanks for stopping by. If you are referring to the section on Base normal, then it is not guaranteed to be orthogonal to the local x- and y-axis, unless for a perfectly planar base.

To ensure an orthonormal coordinate system, you could decompose the raw x-axis to get its orthogonal component as the proper x-axis, and then derive the proper y-axis using the right-handed rule. As you could easily image, there are alternative ways to achieve an orthonormal coordinate system. Several of the options have actually been used in various DNA structural analysis programs: see my early paper titled “Overview of Nucleic Acid Analysis Programs” (http://www.ncbi.nlm.nih.gov/pubmed/10217453) for further details.

HTH,

— Xiang-Jun Lu · 2014-05-07 21:03 · #

---

Hi Xiang-Jun,

Thanks for your reply. It really solves my confusion.

— Xiaojing · 2014-05-08 13:44 · #

---

Hi Xiang-Jun,

Can you tell me how do you calculate to get the ref_frames.dat in your 3DNA software? I think that file stores the local coordinates. However, I referred to your paper and Olson’s paper to calculate the coordinate, and get different results from 3DNA.

— Xiaojing · 2014-09-08 23:13 · #

---

Hi Xiaojing,

Thanks for stopping by. The ref_frames.dat in 3DNA is produced just as described in this blog post, except where only base-ring atoms are used: 6 for C/T/U, and 9 for A/G. Have a try again. If you have any reproducibility issue, I’d be happy to help you sort it out.

Xiang-Jun

— Xiang-Jun Lu · 2014-09-08 23:56 · #

---

Hi Xiang-Jun,

To my understanding, after I got atoms’ coordinates from simulation, I also need to refer to an ideal atom coordinate. Using both to generate local coordinate system. Is this correct?

— Xiaojing · 2014-09-10 00:58 · #

---

Yes. The point becomes clear once you’ve followed the example.

Do you still have any problem in reproducing data in ref_frames.dat, as mentioned in your first comment?

Xiang-Jun

— Xiang-Jun Lu · 2014-09-10 13:31 · #

---

OK, then I just want to know why it is necessary to refer to an ideal structure to build local coordinate. By doing this, we are just artificially modifying the results of MD simulation. Will this affect the reliability of analysis? Thanks!

— Xiaojing · 2014-09-11 01:07 · #

---

How is R’ calculated? Please show example.

Mike · 2017-08-17 08:19 · #

---

From the following:

 [ q0   q1    q2    q3 ] = [ 0.6135   -0.2878    0.7135    0.1780 ]

We have:

q0 =  0.6135
q1 = -0.2878
q2 = 0.7135
q3 = 0.1780

Then simply fill in the 3×3 components, you will get R.

For example, the first row and first column is

q0q0+q1q1-q2q2-q3q3 = -0.0816

With round off error considered, it is close to the reported value: -0.0817. Please try another one and report back if you get the desired numerical value.

— Xiang-Jun · 2017-08-17 16:06 · #

---
 
---

·

Thank you for printing this article from http://x3dna.org/. Please do not forget to visit back for more 3DNA-related information. — Xiang-Jun Lu