As described in the Methods, we used manually curated multiple sequence alignments to construct hidden Markov models of the heavy-chain and light-chain variable domains. We used these models to search the entire set of PDB sequences to identify all PDB chains with variable domains. There were a total of 923 PDB entries identified that contain at least one hypervariable loop with all backbone atom positions defined. Since the asymmetric units of many PDB entries contain more than one copy of the same antibody and other PDB entries contain more than one antibody (anti-idiotypes), within those files were 1232 chains with a variable heavy-chain domain, 1304 chains with a variable light-chain domain, and 30 chains with both a heavy- and a light-chain domain within a single chain (scFv fragments). After low-resolution (>2.8Å) and NMR structures were excluded, there were 703 entries left comprising 882 heavy-chain domains, 953 light-chain domains, and 26 Fv chains.
We defined the CDRs differently from the Kabat and Chothia schemes that are most commonly used. We chose definitions such that the anchors of each loop (the residues immediately before and after the loop) contained tightly clustered conformations relative to the framework, using structure alignments obtained by Honegger and Plückthun.10 We also selected positions such that the N- and C-terminal residues were opposite each other in the structure, whether they occurred in neighboring β-strands (CDR2 and CDR3) or in different β-sheets (CDR1). Where possible, we also chose definitions using homologous positions in the VL and VH chains.
The sequence motifs around our CDR starting and ending positions are shown in Figure 1. We started with the positions immediately following the conserved cysteines of the intrachain disulfide bond, and defined these as the N-terminal residues of H1, L1, H3, and L3. For both the CDR1 and CDR3 loops, we chose the C-termini based on the Cα positions with least variance across VL and VH domains, which also turned out to be at about the same depth in the structure as the N-termini. In these four cases, the C-termini were followed by conserved aromatic residues that are part of the hydrophobic core of each domain. We chose the L2 start site as the same one used by Chothia, since it occurs opposite the CDR1 sites we had already chosen and at the end of a β-strand, and used this to put the H2 start site in the same position. We placed the end of H2 in a short β-strand immediately across from the N-terminus of H2. However, this region in VL is not always a β-strand, and there is both sequence and structural diversity for several more residues. We chose the L2 C-terminus that agrees with the Martin-Thornton definitions11. This L2 definition includes three more positions on its C-terminus than the H2 definition. Superpositions of VH and VL (from PDB entry 1MJU12) with the CDRs indicated are shown in Figure 2. Note that the N and C termini of each loop are in homologous positions between VH and VL, with the exception of L2 (Figure 2b).
CDR definitions used in this work. The sequence logos of each loop are shown with the first three and last three residues of the CDR in red and the flanking framework residues in black.
CDRs based on our definitions. a. L1 and H1; b. L2 and H2; c. L3 and H3. L1, L2, and L3 in dark blue; H1, H2, and H3 in magenta. Disulfides in yellow. The structure is PDB entry 1Q9R32.
With loop definitions in hand, we applied a number of criteria to filter out loops of uncertain or indeterminate conformation. These include loops with missing coordinates, loops whose backbone atoms have high B-factors, loops with cis residues that are not proline (including PDB entry 1OCW13 (resolution 2.0 Å), with ten non-Pro cis residues, including four in H1), and loops with high backbone conformational energy, as determined by Ramachandran probability distributions that we have recently published.14 The remaining structures are highly redundant in sequence, since the structures of some antibodies have been determined multiple times. Representing each variable domain structure by the sequences of its six CDRs, we chose the structure with the highest resolution for each sequence. We also removed a small number of loops with conformations that are outliers with respect to all other structures, defined as having at least one backbone dihedral 90° away from every other structure in the data set. The number of loops for each CDR in the data set after applying each of these filters is shown in Table 1. Counts of the different loop lengths for each CDR in the resulting data set are given in Table 2.
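The redundancy and outlier filters described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the structure identifiers, and the sequence keys are hypothetical, and the outlier test encodes one reading of the criterion (some dihedral position differs by at least 90° from the same position in every other loop).

```python
from typing import Dict, List, Sequence, Tuple


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def best_per_sequence(
    entries: List[Tuple[str, str, float]]
) -> Dict[str, Tuple[str, float]]:
    """Keep, for each CDR-sequence key, the structure with the best resolution.

    entries: (structure_id, cdr_sequence_key, resolution_in_angstroms);
    a smaller value in angstroms means a higher-resolution structure.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for struct_id, key, resolution in entries:
        if key not in best or resolution < best[key][1]:
            best[key] = (struct_id, resolution)
    return best


def is_outlier(
    loop: Sequence[float],
    others: List[Sequence[float]],
    cutoff: float = 90.0,
) -> bool:
    """True if some dihedral of `loop` is at least `cutoff` degrees away
    from the same dihedral in every one of the other loops."""
    if not others:
        return False
    for i, angle in enumerate(loop):
        if all(angle_diff(angle, other[i]) >= cutoff for other in others):
            return True
    return False


# Hypothetical entries: two structures share the sequence key "key1".
entries = [("struct_A", "key1", 2.5), ("struct_B", "key1", 1.9), ("struct_C", "key2", 2.0)]
kept = best_per_sequence(entries)

# Hypothetical loops, each a flat list of backbone dihedrals in degrees.
loops = [[-60.0, -45.0], [-65.0, -40.0], [60.0, -45.0]]
flags = [is_outlier(loop, loops[:i] + loops[i + 1:]) for i, loop in enumerate(loops)]
print(kept, flags)
```

Differences are taken on the circle, so that 170° and -170° count as 20° apart rather than 340°.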
Count of loops by CDR and length
Affinity clustering of CDR loop conformations
We ran the affinity clustering algorithm for each combination of CDR, loop length, and cis-trans configuration separately. As an example of the clustering, we show the Ramachandran distributions for the clusters of L1-12 in Figure 3. This CDR-length comprises 12 structures with unique sequences, clustered into 3 conformations of size 5, 5, and 2. We divided the Ramachandran map into labeled regions as shown in Figure 4 in order to label the clusters by conformation. In this definition, B is the β-sheet region, P is polyproline II, A is α-helix, D is the δ region (near α-helix but at more negative values of ϕ), L is left-handed helix, and G is the γ region (ϕ>0° excluding the L and B regions). Using these definitions, the median loop of cluster 1 (blue dots) has conformation BPABPBPAADBB, cluster 2 (magenta dots) has conformation BPABPPPLLPBB, and cluster 3 (green dots) has conformation BPPAADAAPPBB. Cluster 1 differs from cluster 2 primarily at residues 8, 9, and 10, with conformations AAD and LLP respectively.
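Because each median loop is summarized as a string of region labels, two clusters can be compared position by position. The short sketch below (the helper name is ours, not from the paper) lists the 1-based positions at which the L1-12 cluster 1 and cluster 2 strings differ. Besides residues 8, 9, and 10, it also reports position 6, where the labels are B and P, two neighboring regions of the map.

```python
from typing import List, Tuple


def differing_positions(conf_a: str, conf_b: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, label in a, label in b) wherever the
    Ramachandran region labels of two equal-length loops differ."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformation strings must have the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(conf_a, conf_b), start=1)
        if a != b
    ]


cluster1 = "BPABPBPAADBB"  # median loop of L1-12 cluster 1
cluster2 = "BPABPPPLLPBB"  # median loop of L1-12 cluster 2

diffs = differing_positions(cluster1, cluster2)
print(diffs)  # [(6, 'B', 'P'), (8, 'A', 'L'), (9, 'A', 'L'), (10, 'D', 'P')]
```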
Ramachandran maps of clustering of L1-12. The median loop of cluster 1 (blue dots) has conformation BPABPBPAADBB, cluster 2 (magenta dots) has conformation BPABPPPLLPBB, and cluster 3 (green dots) has conformation BPPAADAAPPBB (see Figure 4 for definitions...
Regions of the Ramachandran map.
The clustering results for CDRs L1, L2, L3, H1, and H2 are shown in Tables 3, 4, 5, 6, and 7 respectively. The clustering for the torso region of longer H3 loops is shown in Table 8 (see below). Each table gives the results for each loop length; for each cluster, it lists the structure count and percentage, the unique sequence count, the PDB ID for the median loop structure, the consensus sequence, and the conformation of the median loop in terms of the Ramachandran conformations.
Clustering of CDR Loop H3 Anchors
Before we discuss the results of the clustering for each CDR, we note that the CDR-length combinations fall into three different categories or types.
Type I, One-cluster CDR-lengths
For the first type, loops of a certain CDR-length combination have one conformation that forms all or at least a large majority of the structures. When fed into the affinity algorithm, the result is a single conformational cluster or one large cluster and a small number of outlying conformations. The large cluster must be a fairly tight distribution. This CDR-length therefore has a predictable structure. We consider CDR-lengths to be of this type if there are at least 10 unique sequences with more than 85% of the structures in the largest cluster of conformations.
Type II, Predictable CDR-lengths
The second type of CDR-length combination has multiple possible structures, but each cluster is tightly grouped and each cluster significantly differs from the others in sequence. We include in this Type some loops whose conformational clusters are easily predicted by the identity of certain framework residues, even if the loops in the different clusters do not have significantly different sequences. To be in this Type, a CDR-length had to have two or more clusters, at least 4 unique sequences in each of the larger clusters, and cluster membership that was more than 85% predictable from the sequence of the loop (or the identity of certain framework residues; see below).
Type III, Unpredictable CDR-lengths
For some CDR-lengths, structure prediction is likely to be difficult or statistically uncertain. This may occur for a number of reasons. First, the affinity propagation procedure may put most structures into a small number of highly dispersed clusters, or into a large number of very small clusters. Second, there may be too few structures to have much confidence in the clustering. In some cases, it may be possible to suggest a sequence motif that determines the cluster but the data are insufficient to do this with confidence. For other CDR-lengths, the structures may be well clustered into discrete conformations, but there is little systematic variability in their sequences. For these CDR-lengths, structure prediction for loops of unknown structure may depend on unrecognized interactions with the other CDRs or the framework.
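The three categories can be read as a small decision rule. The sketch below is our own illustration under stated assumptions: the function and parameter names are hypothetical, which clusters count as "larger" is left to the caller, and the 85% predictability value is supplied as an input rather than computed. Anything that meets neither the Type I nor the Type II thresholds is treated as Type III.

```python
from typing import List, Optional


def classify_cdr_length(
    cluster_sizes: List[int],
    unique_sequences: int,
    unique_in_larger_clusters: Optional[List[int]] = None,
    predictability: Optional[float] = None,
) -> str:
    """Assign a CDR-length to Type I, II, or III using the stated thresholds.

    cluster_sizes: number of structures in each cluster.
    unique_sequences: number of unique sequences for the CDR-length.
    unique_in_larger_clusters: unique-sequence counts of the clusters the
        caller regards as "larger" (an assumption; the text does not say
        where the line is drawn).
    predictability: fraction of loops whose cluster is predicted correctly
        from loop sequence or framework residues, if known.
    """
    if not cluster_sizes:
        raise ValueError("at least one cluster is required")
    largest_fraction = max(cluster_sizes) / sum(cluster_sizes)

    # Type I: at least 10 unique sequences, more than 85% in the largest cluster.
    if unique_sequences >= 10 and largest_fraction > 0.85:
        return "Type I"

    # Type II: two or more clusters, at least 4 unique sequences in each of
    # the larger clusters, and membership more than 85% predictable.
    if (
        len(cluster_sizes) >= 2
        and unique_in_larger_clusters
        and all(n >= 4 for n in unique_in_larger_clusters)
        and predictability is not None
        and predictability > 0.85
    ):
        return "Type II"

    return "Type III"


# L1-16: all 68 structures in one cluster, 50 unique sequences.
print(classify_cdr_length([68], 50))
# L1-12: clusters of size 5, 5, and 2 with 12 unique sequences.
print(classify_cdr_length([5, 5, 2], 12))
# Hypothetical CDR-length with two well-populated, predictable clusters.
print(classify_cdr_length([60, 50, 10], 100, [40, 35], 0.95))
```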
We discuss each CDR in turn.
Using our definitions, L1 can have loop lengths from 10 to 17 residues. The majority of L1 loops are of length 11 or 16 with 57 and 50 unique sequences respectively. The results of the clustering analysis of L1 are shown in Table 3.
Several L1-lengths are of Type I, meaning that a single conformation strongly predominates. CDR-length L1-10 is one of these, with 20 out of 22 total structures (all mouse κ) belonging to a single conformation. The median conformations of the two clusters are BBABPBABBB and BBABPBPGPB, which differ primarily in residue positions 7 and 8, involving a flip of the peptide bond between these residues. This is a common and relatively minor difference between two homologous structures. L1-16 also belongs to Type I, with all 68 structures belonging to a single cluster. L1-17 is also a single-cluster CDR-length, with all 21 structures having a similar conformation. These loops have normalized average distances from their median structures of 10° per dihedral angle (see Table 3). These small values indicate tight clustering.
L1-11 belongs to Type II, having three alternate conformations that are easily predictable by sequence of the CDR or the identities of certain framework residues. We refer to these clusters as L1-11-1, L1-11-2, and L1-11-3. We looked first at the sequence logos16 derived from the unique sequences in each cluster to determine if sequence can differentiate the clusters; these are shown in Figure 5. Cluster L1-11-3 has a very different amino acid distribution at positions 5 and 6, where clusters L1-11-1 and L1-11-2 have [SDNE][IV] while L1-11-3 has [ILA][GPS]. The L1-11-3 sequences all come from human Vλ chains, while L1-11-1 and L1-11-2 have very similar amino acid distributions, coming from human and mouse Vκ chains. As has been noted by Al-Lazikani et al. based on only four structures,2 the structural difference between L1-11-1 and L1-11-2 is due to a difference in the framework at position 71 (Chothia numbering; 18 residues prior to the start of CDR-L3; residue 89 in the Honegger-Plückthun numbering system10). When position 71 is Phe, 63 out of 67 such structures (94%) are in cluster L1-11-1. All 8 structures with Thr at 71 and both structures with Gly at 71 are in L1-11-1. Of 50 structures with Tyr at position 71, 48 of them (96%) are in cluster L1-11-2. Loops in cluster L1-11-1 form a hydrogen bond from the carbonyl oxygen of residue 7 of the CDR to the amide hydrogen atom of residue 68 (21 residues prior to L3). In loops belonging to cluster L1-11-2, the orientation of the amide bond between residues 7 and 8 of the CDR is reversed. This directs the amide hydrogen atom of residue 8 towards the hydroxyl oxygen atom of the tyrosine residue at position 71, forming a hydrogen bond. These interactions are shown in Figure 6.
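The sequence and framework patterns for L1-11 amount to a simple rule of thumb, sketched below. The function name and the two example sequences are illustrative assumptions, and the rule is a simplification for demonstration rather than a validated predictor. The tally at the end pools the counts quoted above for Phe, Thr, Gly, and Tyr at position 71.

```python
def predict_l1_11_cluster(cdr_seq: str, residue71: str) -> str:
    """Rule-of-thumb cluster assignment for an 11-residue L1 loop.

    cdr_seq: the L1 sequence under the CDR definition used in this work.
    residue71: one-letter code of framework position 71 (Chothia numbering).
    """
    if len(cdr_seq) != 11:
        raise ValueError("expected an 11-residue L1 loop")
    # Positions 5 and 6 (1-based) matching [ILA][GPS] mark the lambda cluster.
    if cdr_seq[4] in {"I", "L", "A"} and cdr_seq[5] in {"G", "P", "S"}:
        return "L1-11-3"
    if residue71 == "Y":
        return "L1-11-2"
    if residue71 in {"F", "T", "G"}:
        return "L1-11-1"
    return "undetermined"


# Illustrative sequences (assumptions, not taken from the tables).
kappa_like = "RASQDISNYLN"   # positions 5-6 are D, I, matching [SDNE][IV]
lambda_like = "SGDALPKQYAY"  # positions 5-6 are L, P, matching [ILA][GPS]

print(predict_l1_11_cluster(kappa_like, "F"))
print(predict_l1_11_cluster(kappa_like, "Y"))
print(predict_l1_11_cluster(lambda_like, "F"))

# (correct, total) for each residue at position 71, as quoted in the text.
counts = {"F": (63, 67), "T": (8, 8), "G": (2, 2), "Y": (48, 50)}
correct = sum(c for c, _ in counts.values())
total = sum(t for _, t in counts.values())
print(f"{correct}/{total} = {100 * correct / total:.1f}%")
```

Pooled over these four residue types, the position-71 rule places 121 of 127 structures in the expected cluster.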
Sequence logos for the three clusters of L1-11-1, L1-11-2, and L1-11-3 from top to bottom. The logos were drawn with the program Weblogo16.
The median structures of clusters L1-11-1 (yellow) and L1-11-2 (magenta). The hydrogen bond of Tyr71 to the NH of residue 8 in cluster 2 is shown. The sequence and residue numbering given are from the L1-11-1 median structure, PDB-chain 1P7KL39.
The remaining L1-lengths have only a small number of available structures and sequences, including L1-12 (12 structures, 12 sequences, 3 clusters), L1-13 (11 structures, 11 sequences, 2 clusters), L1-14 (18 structures, 12 sequences, 2 clusters), and L1-15 (13 structures, 11 sequences, 2 clusters). Even here, though, there are some residues that differentiate these clusters, but because of the small numbers we cannot be confident that these features will always be predictive. We therefore define them as being of Type III. For instance, cluster L1-12-3 (mouse Vλ) has very different sequences than L1-12-1 (mouse Vκ) and L1-12-2 (human and mouse Vκ). Four out of five L1-12-1 members have Tyr71 while all five L1-12-2 members have Phe71. The two clusters of L1-13, all human Vλ, are easily distinguishable by sequence at positions 2 and 5, with the first five residues of L1-13-1 having sequence motif [ST]G[ST][SAT][ST] and L1-13-2 having TRSSG. The Gly at position 5 of L1-13-2 presumably allows the γ conformation for this residue (ϕ,ψ = +70°,+160°). The two clusters of L1-14 have quite different sequences; the human sequences in cluster L1-14-1 have consensus sequence RSStGavTtsNYAN (completely conserved residues in upper case) and the mouse sequences in L1-14-2 have consensus sequence TgtssnvgGynyVs. The Gly at position 5 of L1-14-1 presumably favors the γ conformation for this residue. Finally, cluster L1-15-2 has only two mouse Vκ members that differ by only one residue from each other. The conformations of L1-15-1 and L1-15-2 differ at positions 7-9 with sequences [DE][YSFN][YFD] and STS respectively.
The results of the clustering analysis for L2 are shown in Table 4. L2 loops of known structure come in only two lengths, L2-8 and L2-12. There are 308 structures for L2-8; of these, 290 (94%), consisting of 159 unique sequences, belong to the majority cluster with a conformation of BLLDPPPP. The next most common cluster, with 9 structures, has a median structure with a conformation of BLLDPPPA, which differs from the main conformation only at the last residue. There are also three additional very small clusters. We consider L2-8 to be of Type I, that is, effectively having only one conformation.
L2-12 contains only 4 structures in 2 clusters, each with only a single unique sequence. The first is the structure of the human pre B-cell receptor, while the second is a mouse Vλ structure. With so few sequences, this loop is of Type III.
The results of the clustering analysis for L3 are shown in Table 5. L3 loops come in lengths 7 through 13, and 85% of L3 loops are of length 9. The largest cluster of L3-9, representing 83% of this loop length, is one that contains a cis proline at position 7, which we designate L3-9-cis7-1. There are two additional, very small clusters with cis-7, two clusters that are all trans, and one cluster that has cis-6. The structure of an L3-9 loop can be predicted fairly well merely by the positions of proline residues, if any. If all L3-9 loops with Pro7 are predicted to be in cluster L3-9-cis7-1, then this prediction is correct 219/235 times, or 93.2% of the time (and 93.8% for unique sequences). Of the remainder, 10 are in the other cis7 clusters and 6 are in all-trans cluster L3-9-2. If Pro is entirely absent from L3-9, then 22 of 25, or 88% are in cluster L3-9-1. L3-9 is therefore of Type I, and generally predictable in structure. See Figure 7 for superpositions of representative structures of each of the largest four clusters.
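The proline-based prediction for L3-9 can be written as a two-branch rule. The sketch below uses a hypothetical function name and an illustrative sequence; loops that contain proline only at other positions are left undetermined because the text gives no single rule for them. The accuracy figures are recomputed from the counts quoted above.

```python
def predict_l3_9_cluster(cdr_seq: str) -> str:
    """Rule-of-thumb cluster assignment for a 9-residue L3 loop."""
    if len(cdr_seq) != 9:
        raise ValueError("expected a 9-residue L3 loop")
    if cdr_seq[6] == "P":      # proline at position 7 (1-based)
        return "L3-9-cis7-1"
    if "P" not in cdr_seq:     # no proline anywhere in the loop
        return "L3-9-1"
    return "undetermined"


# Illustrative sequences (assumptions, not taken from Table 5).
print(predict_l3_9_cluster("QQYNSYPLT"))  # Pro at position 7
print(predict_l3_9_cluster("QQYNSYALT"))  # no Pro

pro7_accuracy = round(100 * 219 / 235, 1)   # loops with Pro7 in L3-9-cis7-1
no_pro_accuracy = round(100 * 22 / 25, 1)   # proline-free loops in L3-9-1
print(pro7_accuracy, no_pro_accuracy)
```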
The median structures of the largest clusters of L3-9. a. L3-9-cis7-1 (yellow) + L3-9-cis7-2 (magenta). b. L3-9-cis7-1 (yellow) + L3-9-1 (blue) c. L3-9-cis7-1 (yellow) + L3-9-2 (green). The sequence of L3-9-cis7-1 from PDB entry-chain 1J1PL is marked...
There are three additional CDR-lengths for L3 that contain more than one cluster, and all three are of Type III (that is, having small numbers): L3-8, L3-10, and L3-11. All three L3-8 loops with Pro at position 6 belong to the L3-8-cis6-1 cluster. There are two all-trans clusters, but these have no sequence features that distinguish them from each other. For L3-10, all loops with no prolines belong to the all-trans cluster, L3-10-1. The two clusters L3-10-cis8-1 and L3-10-cis7,8-1 both contain two prolines at positions 7 and 8. The single L3-11-cis7-1 structure has Pro at positions 7 and 8, while none of the all-trans L3-11-1 structures do.
Three loop lengths, L3-7, L3-12 and L3-13 have only one conformation and one or two unique sequences, and are therefore of Type III. The latter two CDR-lengths are λ sequences.
The results of the clustering analysis for H1 are shown in Table 6. CDR H1 comes in lengths 12 through 16 and also length 10. The shortest and longest H1 sequences come from camelid antibodies. CDR-length H1-13 represents 92% of the H1 loops and is dominated by a single conformation. Cluster H1-13-1 comprises 267 out of the 306 structures, or 87%, with a conformation of PPBLBPAAABPBB and a minimum normalized median angle of 13° (see Table 6). It is therefore of Type I. The remaining 39 structures are distributed over eleven different clusters with a wide range of possible structures. No obvious sequence differences exist among them, except that three of them occur only for camelid antibodies. The other CDR-lengths for H1 all exist in single clusters; however, they each contain fewer than 10 unique sequences and therefore these CDR-lengths are of Type III.
The results of the clustering analysis for H2 are shown in Table 7. For H2, there are two common loop lengths, H2-9 and H2-10, each with multiple clusters, as well as three loop lengths with only one cluster each, H2-8, H2-12 and H2-15. For H2-9, 77 out of 81 structures, or 95% belong to cluster H2-9-1 with a minimum normalized median angle of 10° (see Table 7). It is therefore of Type I. All of the H2-9 human sequences are in H2-9-1. Clusters H2-9-1 and H2-9-3 both have an L conformation at position 6, while cluster H2-9-2 has a D conformation. Consistent with this, H2-9-1 and H2-9-3 have mostly Gly at this position (and a few Asp in H2-9-1), while H2-9-2 has Phe and Val.
CDR H2-10 represents 67% of all H2 loops. It is grouped into two large clusters, 68% and 19% of structures, and seven much smaller clusters. We examined the sequence logos for the top 4 clusters and found that there are different patterns of the positions of Gly and Pro in the middle of the loop at several positions, as shown in Figure 8. There are left-handed L or G conformations at positions 7, 6, 5, and 5+6 for the top four clusters respectively. No one position was completely predictive so we created hidden Markov models with HMMER17 based on the unique sequences in each cluster and then assigned each loop to the cluster with which it scored the highest. For cluster H2-10-1, with a conformation of BBPAADLPBB, 130 out of 155 structures or 84% are predicted correctly. For cluster H2-10-2 with a conformation of BBPAALABBB, 30 out of 42 or 71% of its structures are correctly predicted to be in the cluster. H2-10-3 and H2-10-4 are not as well predicted, but are much smaller in population. H2-10-3, with a conformation of BBBPGALPBB, has 6 structures out of 11 predicted correctly. Finally, H2-10-4, with a conformation of BBPPLLABBB, has only 2 out of 7 structures predicted correctly. Overall, however, the scores of loop sequences of H2-10 against the HMMs of its clusters are good at predicting the cluster membership of the sequences.
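The assignment step (score a loop against every cluster model and take the best) and the resulting accuracy can be illustrated without running HMMER. In the sketch below, the per-cluster scores are hypothetical placeholders standing in for HMM scores, while the correct/total counts are those quoted above for the four largest H2-10 clusters.

```python
from typing import Dict, Tuple


def assign_by_best_score(scores: Dict[str, float]) -> str:
    """Return the cluster whose model gives the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores of one loop sequence against the four cluster models.
example_scores = {"H2-10-1": 12.4, "H2-10-2": 9.8, "H2-10-3": 3.1, "H2-10-4": 1.7}
print(assign_by_best_score(example_scores))

# (correctly predicted, total) per cluster, as reported in the text.
hmm_results: Dict[str, Tuple[int, int]] = {
    "H2-10-1": (130, 155),
    "H2-10-2": (30, 42),
    "H2-10-3": (6, 11),
    "H2-10-4": (2, 7),
}

for cluster, (n_correct, n_total) in hmm_results.items():
    print(f"{cluster}: {100 * n_correct / n_total:.0f}%")

total_correct = sum(c for c, _ in hmm_results.values())
total_loops = sum(t for _, t in hmm_results.values())
overall = round(100 * total_correct / total_loops, 1)
print(f"overall: {total_correct}/{total_loops} = {overall}%")
```

Pooling the four clusters gives 168 of 215 loops, or about 78%, assigned to their own cluster.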
Sequence logos for clusters H2-10-1, H2-10-2, H2-10-3, and H2-10-4 (top to bottom respectively).
Additionally for H2, Tramontano, Chothia and Lesk18 noted the effect of framework residues in determining the conformation of the loop, particularly the identity of residue 71 (Chothia numbering; 25 residues before the start of H3; Honegger and Plückthun10 number 82). In terms of our CDR definitions, they analyzed H2-9, H2-10, and H2-12 (their lengths 3, 4, and 6), but in 1990 they had only 2, 3 and 2 structures respectively. We decided to investigate whether their observation holds up with a much larger data set. For H2-9, they found only one conformation regardless of position 71 (Val and Arg). We also found effectively only one structure (H2-9-1 = 77/81 structures). Position 71 was not helpful in distinguishing H2-9-2 and H2-9-3 (data not shown) from H2-9-1. For H2-10, Tramontano et al. found two conformations, two structures with Arg71 similar to our cluster H2-10-2 and one structure with Ala similar to our cluster H2-10-1. In Table 9, we show a contingency table for H2-10 with the different residues at position 71 in columns and the different clusters of H2-10 in rows. We have a total of 227 structures and 196 unique sequences for H2-10; we also have 9 conformational clusters instead of just two, although only the first two are highly populated. If we predict the cluster a structure belongs to merely from position 71, we would assign the cluster with the highest number in each column of Table 9. For example, if position 71 is Ala, we would predict cluster H2-10-1 and we would get 67 correct assignments and 13 incorrect assignments. If position 71 is Arg, we would predict cluster H2-10-2 and get 38 out of 58 assignments correct. If we add the largest numbers in each column, we correctly predict 186 of the loops, or 80%, which is comparable to the hidden Markov models discussed above (78% of the loops in clusters 1-4).
As the table shows, the major determinant is the type of residue at position 71. If it is a small hydrophobic residue (A, I, L, V), a small polar residue (S, T), or Q, the loop mostly belongs to cluster H2-10-1 (143 of 161 times, or 90%); if it is R or D, the loop mostly belongs to cluster H2-10-2 (39 of 59 times, or 66%). Superpositions of the median structure of cluster H2-10-1 with clusters 2, 3, and 4 are shown in Figure 9. Both clusters H2-10-2 and H2-10-4 have Arg at position 71, which forms a hydrogen bond to the carbonyl oxygen of residue 3 of the CDR.
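Predicting the cluster from position 71 is a majority rule applied to each column of a contingency table. The sketch below shows the mechanics on the two columns whose counts are quoted in the text (Ala and Arg). It is only an illustration: the function name is ours, and the row labelled "other" is a pooled placeholder for all non-majority clusters, whereas the full table has one row per cluster.

```python
from typing import Dict, Tuple

# table[residue at position 71][cluster] = number of structures
Table = Dict[str, Dict[str, int]]


def majority_rule(table: Table) -> Tuple[Dict[str, str], int, int]:
    """Predict, for each residue type, the cluster with the highest count.

    Returns (predictions, number correct, total number of structures).
    """
    predictions: Dict[str, str] = {}
    correct = 0
    total = 0
    for residue, column in table.items():
        best_cluster = max(column, key=column.get)
        predictions[residue] = best_cluster
        correct += column[best_cluster]
        total += sum(column.values())
    return predictions, correct, total


# Counts quoted in the text; "other" pools every remaining cluster.
partial_table: Table = {
    "A": {"H2-10-1": 67, "other": 13},
    "R": {"H2-10-2": 38, "other": 20},
}

predictions, n_correct, n_total = majority_rule(partial_table)
print(predictions)
print(f"{n_correct}/{n_total} = {100 * n_correct / n_total:.0f}%")
```

On these two columns alone the rule assigns 105 of 138 structures correctly; the overall figure reported for H2-10 comes from summing the column maxima over every residue type in Table 9.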
The median structures of the largest clusters of H2-10. a. Cluster H2-10-1 (yellow) and H2-10-2 (magenta), b. H2-10-1 (yellow) and H2-10-3 (blue) and c. H2-10-1 (yellow) and H2-10-4 (green). The side chain of Arg71 of Clusters H2-10-2 and H2-10-4 are...
Residue 71 and H2-10 Contingency Table
Finally, for H2-12 all 26 structures belong to a single tight cluster with a minimum normalized median angle of 8°, qualifying this loop as Type I. The two CDR-lengths with very small populations, H2-8 and H2-15, also have only one cluster each (Type III).
The known loop structures for H3 are very diverse in length, ranging from length 5 to 26, with the majority (86%) between 7 and 16. The shorter loops can be clustered fairly well but these are low in population (Table 2). The longer loops form a few large clusters with higher self-similarity values but the clusters have very large distances to the median. Some clusters have residues in different bins of the Ramachandran map (e.g., A and L regions). At low self-similarity, the number of clusters becomes very large and the cluster sizes become rather small. They are therefore not likely to have predictive value.
Because of these difficulties, a number of analyses have split H3 into a “torso” or anchor region corresponding to its N- and C-terminal ends and a “head” or apex region at the turn of the loop,3; 25 dividing the torso region into two groups, “bulged” and “non-bulged”4 or “kinked” and “extended.”19 We performed affinity propagation clustering on a set of seven residues comprising the first three residues of H3 (in red for the N-terminal region in Figure 1) and the last four residues of H3 (those in red for the C-terminal region in Figure 1 plus one more to the left). The clustering results for these seven-residue discontinuous peptides are shown in Table 8. For the H3 torso clustering, a total of eight clusters are apparent. Cluster H3-anchor-1 covers about two thirds of the structures, and the top four clusters about 95%. The first three clusters are shown in Figure 10. Contingency tables on individual residue positions did not demonstrate predictability of the H3 torso clusters (data not shown) much beyond the 65% that are in the first cluster.
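Extracting the seven-residue discontinuous torso peptide from an H3 loop is a matter of joining the first three and last four residues. The sketch below is illustrative: the function name and the example sequence are hypothetical, and the same slicing would apply equally to a list of per-residue dihedral pairs.

```python
def h3_anchor(h3_seq: str) -> str:
    """Return the 7-residue torso peptide: first three plus last four residues."""
    if len(h3_seq) < 7:
        raise ValueError("H3 loop must have at least 7 residues")
    return h3_seq[:3] + h3_seq[-4:]


# Hypothetical 10-residue H3 loop under the CDR definition used in this work.
print(h3_anchor("ARGGYYAMDY"))  # ARGAMDY
# For a 7-residue loop the torso is the whole loop.
print(h3_anchor("ARGAMDY"))     # ARGAMDY
```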
The median structures of the H3-anchor regions. a. Clusters H3-anchor-1 (yellow) and H3-anchor-2 (magenta); b. Clusters H3-anchor-1 (yellow) and H3-anchor-3 (blue/green). Clusters H3-anchor-1 and H3-anchor-3 are bulged and H3-anchor-2 is non-bulged.
We examined the distribution of these clusters for different length H3 loops. The results are shown in Table 10. We included H3-7 loops in the H3-anchor clustering, even though these would not be expected to cluster well with the torso regions of the longer loops. Indeed, these loops clustered predominantly into three clusters, separately from the others: H3-anchor-4, H3-anchor-6, and a cluster with cis4. A small number of H3-7 structures were placed in cluster H3-anchor-1. Interestingly, for the other lengths, the distribution is somewhat dependent on length. For H3-8 (only 5 structures), 2, 1 and 2 of the structures are in H3-anchor-1, H3-anchor-2, and H3-anchor-5 respectively. H3-9 (26 structures) is the only H3 CDR-length for which the non-bulged H3-anchor-2 cluster predominates. For H3 lengths from 10 to 14, 74-79% of structures belong to H3-anchor-1. However, for lengths 15 and 16, 92% of structures belong to H3-anchor-1, while the remainder are in cluster H3-anchor-5. For H3 loops longer than 16, 71% belong to H3-anchor-1 while all of the remainder belong to H3-anchor-3. These frequencies are consistent across loop lengths from 17 to 26 (data not shown).
H3-anchor cluster frequencies (in %) for each H3 loop length
Comparison to Chothia and Martin-Thornton clustering
There are several previous studies on the categorization of antibody loop structures.1; 4; 5; 7 The clustering results in this study recapitulate many of the canonical conformations found by both Chothia et al.2 and Martin and Thornton11. However, our conformational clustering approach and more recent structure database have produced a few significant differences with the Chothia and Martin-Thornton results. The correspondences between our clustering and those of Chothia et al and Martin and Thornton are given in Tables 11, 12, and 13.
Conformational clusters of Martin and Thornton
Clusters in this work and those of Chothia et al. and Martin-Thornton
We used the 1997 paper by Al-Lazikani et al.2 to define the Chothia canonical conformations, since this is the most recent and comprehensive of their previous analyses of antibody CDR structures.1; 18; 20; 21; 22 Chothia et al. designated canonical classes for each CDR by integers (1,2,3, etc.) regardless of the length of the loop, and in no particular order. Different designations might be loops of different length or loops of the same length but of different conformations. CDRs of λ light chains were analyzed and numbered separately from κ chains, and following Martin and Thornton we call them 1λ, 2λ, etc. Some classes were broken down into sub-classes, usually because of a flip of a two-amino acid segment within the loop between one structure and another. They designated these A, B, etc., and we append these to the Chothia class name, e.g., L1-2A, L1-2B. For each canonical class, they provided one or more PDB entries that fit that class and the CDR sequences of those loops and their ϕ,ψ values. For some loops, they provided only the names of antibodies and we located the corresponding PDB entries from these names. Their clustering, based on a total of 17 high-resolution structures, was performed manually and visually, not computationally.
Martin and Thornton11 performed a clustering in dihedral-angle space (using vectors of sines and cosines), similar to the one performed here, followed by merging of clusters based on coordinate RMSD. They designated their clusters by the CDR, the length, and then letters for each different conformation, viz. L1-11A, L1-11B, L1-12A, etc. They provided PDB IDs for a representative of each cluster as well as a table of assignments of their clusters to 57 PDB entries.
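Representing each dihedral by its sine and cosine removes the discontinuity at ±180°, which is why angle vectors of this kind are convenient for clustering. The sketch below is a generic illustration of that idea; the exact vector layout and distance used by Martin and Thornton are not specified here, so the function names and the squared Euclidean distance are assumptions.

```python
import math
from typing import List, Sequence


def embed(dihedrals_deg: Sequence[float]) -> List[float]:
    """Map each angle (degrees) to a (sin, cos) pair and concatenate the pairs."""
    vector: List[float] = []
    for angle in dihedrals_deg:
        radians = math.radians(angle)
        vector.extend((math.sin(radians), math.cos(radians)))
    return vector


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedded vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# 179 and -179 degrees are only 2 degrees apart on the circle.
near = squared_distance(embed([179.0]), embed([-179.0]))
# Angles 90 degrees apart give a squared distance of 2 - 2*cos(90 deg) = 2.
quarter_turn = squared_distance(embed([0.0]), embed([90.0]))
print(near, quarter_turn)
```

For a single angle the squared distance equals 2 - 2cos(Δθ), so it depends only on the angular difference and not on where the angles lie relative to ±180°.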
Our CDR definitions differ somewhat from those of Chothia et al. and Martin and Thornton. A comparison of these definitions applied to example κ, λ, and heavy chain sequences is given in Figure 11. For Chothia, we use the example sequences given in the paper by Al-Lazikani et al. These are the regions within the Kabat-defined CDRs that they observe to vary in conformation, usually with one extra amino acid on each end for good measure. The regions described in this paper do not always coincide with what others take to be the “Chothia definitions” of the CDRs10; 23. As shown in Figure 11, Chothia et al. define their κ and λ CDR1s differently from each other. Their κ definition is two amino acids shorter than our L1 definition at both the N and C termini. Their λ definition is only one amino acid shorter on each end. Their L2 definition is three residues shorter than ours on the C-terminus, and their L3 definition is one residue shorter on the N-terminus than ours. Similarly to L1 κ, our H1 definition is two residues longer on both ends than the Chothia definition, as is our H2 definition. Martin and Thornton used the same CDR definitions as we do for L1, L3, and H2. Their L2 begins one residue after ours, and their H1 begins three residues after ours (ours begins as our L1 does immediately after Cys, while theirs begins after Cys-Xxx-Xxx-Xxx).
Comparison of our CDR definitions with those of Al-Lazikani et al.2 and Martin-Thornton3 with the numbering scheme proposed by Honegger and Plückthun10.
For both Chothia and Martin, we used the PDB IDs given in their papers to match their clusters to ours. In many cases, the same PDB chains are present in our filtered data and we can make a one-to-one correspondence. In some cases, we excluded some PDB entries or particular loops because of low resolution, high B-factors, high conformational energies, or removing redundant sequences. In these cases, we calculated our distance function D between the loop in the PDB entry cited by either paper and the median of our clusters for the same CDR and same length. We normalized D by two times the number of residues in the loop (to account for ϕ and ψ) and then inverted Eq. 2 to calculate an average difference in ϕ or ψ in degrees.
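The normalization and inversion described above can be made concrete under one explicit assumption. Eq. 2 is given in the Methods and is not reproduced in this section; the sketch below assumes it has the form D(θ1, θ2) = 2(1 − cos(θ1 − θ2)) for each dihedral, summed over ϕ and ψ of every residue. If Eq. 2 differs, only `dihedral_term` and the inversion line need to change. All function names are ours.

```python
import math
from typing import List, Tuple

Loop = List[Tuple[float, float]]  # one (phi, psi) pair in degrees per residue


def dihedral_term(a_deg: float, b_deg: float) -> float:
    """Assumed form of Eq. 2 for one dihedral: 2 * (1 - cos(a - b))."""
    return 2.0 * (1.0 - math.cos(math.radians(a_deg - b_deg)))


def loop_distance(loop_a: Loop, loop_b: Loop) -> float:
    """Sum the per-dihedral terms over phi and psi of every residue."""
    if len(loop_a) != len(loop_b):
        raise ValueError("loops must have the same number of residues")
    return sum(
        dihedral_term(phi_a, phi_b) + dihedral_term(psi_a, psi_b)
        for (phi_a, psi_a), (phi_b, psi_b) in zip(loop_a, loop_b)
    )


def average_angle_difference(d: float, n_residues: int) -> float:
    """Normalize D by 2 * n_residues, then invert the assumed Eq. 2
    to obtain an effective average angular difference in degrees."""
    d_norm = d / (2 * n_residues)
    cos_value = max(-1.0, min(1.0, 1.0 - d_norm / 2.0))  # guard against rounding
    return math.degrees(math.acos(cos_value))


# Hypothetical two-residue loops in which every dihedral differs by 30 degrees.
loop_a: Loop = [(-60.0, -45.0), (-120.0, 130.0)]
loop_b: Loop = [(-30.0, -15.0), (-90.0, 160.0)]

d = loop_distance(loop_a, loop_b)
print(average_angle_difference(d, len(loop_a)))
```

When every dihedral differs by the same amount, the recovered value is exactly that amount; for mixed differences it is an effective average weighted toward the larger deviations.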
The results of these comparisons are given in Table 11 for the Chothia data and in Table 12 for the Martin-Thornton data. The tables provide some or all of the PDBs mentioned in these papers for each of their loop designations. If the chain is listed along with our cluster designation, then that loop was in our clustering data and present in that loop cluster. If a distance is given in parentheses after the PDB chain, then that is the mean absolute difference in ϕ and ψ angles from the median of our loop cluster. In some cases, this distance is larger than 25°, and we list these in italic bold type. These correspondences are then less certain and may be the result of low resolution or high B-factors of that loop in the PDB. This is noted in some cases.
Chothia et al. list 25 canonical classes over 20 CDR-length combinations in their 1997 paper (Table 11); if we consider their alternate conformations within a class as separate classes, then there are 32 classes. It should be noted that only 3 of these 32 classes were based on more than five structures in the PDB, and 15 of 32 (nearly half) were based on only one structure. For most of the canonical classes, we can make a clear one-to-one assignment to our clusters via the PDB chains given by Chothia et al. For instance, their L1-2A, L1-2B, and L1-4λ are our L1-11-1, L1-11-2, and L1-11-3 clusters. As noted above, L1-11-3 is easily distinguishable by sequence from L1-11-1 and L1-11-2, while L1-11-1 and L1-11-2 differ from each other because of the residue at position 71 of VL.
In three cases, the PDB chains given by Chothia et al. for a canonical class fall into more than one of our clusters. This happens for their largest clusters, L2-1, L3-1, and H1-1. In all three cases, most of the structures given by Chothia et al. fall into one of our clusters, while a small number fall into another. Since our loops were longer in these cases, the structural differences may occur outside of the region analyzed by Chothia et al. In four cases, the structures in more than one Chothia canonical class for a given CDR fall into one of our clusters. This occurs for the subclasses L1-3λ, L3-1λ, and H2-3A and C, which we put into single clusters.
There are also a few cases when the Chothia representatives do not appear in our data set and are relatively far away from our median structures. In these cases, the assignments to our clusters are uncertain. For instance, their L1-6 is a low-resolution (3 Å) structure that is 52° away from our L1-12-3 cluster. Their L1-2λ cluster is far away (43°) from its closest neighbor in our data, the L1-14-2 cluster, although its sequence (PDB entry 7FABL, TGSSSNIGAGHNVK) clearly fits our L1-14-2 pattern. Their L3-1λB structure from PDB entry 7FAB is also not very close (46°) to our L3-9-1 cluster.
Interestingly, only 4 out of 20 Chothia CDR-length combinations comprise more than one canonical class: L1-11 (L1-2A,B and L1-4λ); L1-14 (L1-2λ and L1-3λ); L3-9 (L3-1, L3-3 and L3-1λA,B,C); H2-10 (H2-2A,B and H2-3A,B,C). We recapitulate these results, at least at the level of the Chothia classes if not all the subclasses (e.g., L3-1λ).
The Martin-Thornton clusters are listed in Table 12. Their paper listed 49 clusters for L1, L2, L3, H1 and H2. Only 8 of these clusters (16%) are observed in 5 or more PDB entries, and 28 of them (57%) are observed in only one PDB entry. Many of the latter are far away from any of the median structures of our clusters, and these are highlighted in Table 12. It is noted if they are low resolution or have high conformational energies, which casts some doubt on whether they should be listed as separate clusters. These include L1-14C,D,E,F, L3-9E,F, H1-10C,D, and H2-10D,E,F. In some cases, the Martin-Thornton clusters are divided into more than one cluster in our analysis. This may be in part because our loops are sometimes longer (L2 and H2) or because of the RMSD step used by Martin and Thornton. For instance, their L1-11A is split about evenly between our L1-11-1 and L1-11-2 clusters. Martin and Thornton merge the two structures into the same cluster due to the small RMSD difference between the main chain atoms of the two structures. Our algorithm keeps the two clusters separate due to the large difference in ϕ and ψ angles at loop positions 7 and 8. Chothia et al. list them as A and B conformations of the same canonical class. Most of the Martin-Thornton cluster L2-7A corresponds to our L2-8-1, although several structures are members of, or are closer to, our L2-8-2, L2-8-4, and L2-8-5 clusters. Similarly, their L3-9A cluster corresponds to our L3-9-cis7-1, but one of their cluster members is an all-trans structure corresponding to our L3-9-2. We also split their H2-9A and H2-10A clusters.
We examined the CDR-length combinations in the Martin-Thornton analysis, and found that effectively only six of them have more than one conformational cluster that can be validated with our data: L1-11, L1-14, L3-8, L3-9, L3-11, and H2-10 (in our definitions). Several other CDR-lengths have multiple clusters in the Martin-Thornton analysis but rely on very low-resolution structures or structures with high conformational energy. For instance, their L3-10 loops consist of four clusters, but all of these are low resolution or high in conformational energy.
Finally, we examine the results the other way around by listing our clusters in Table 13 along with the number of PDB chain loops that overlap with the Chothia and Martin-Thornton data. We have a total of 72 clusters, each of which has at least two members, since we removed singleton outliers, except when there was only one structure for a given CDR and length (e.g. H2-15-1) or cis-trans configuration. Thirty-one of our clusters have 5 or more members.
A total of 41 of our clusters do not have a corresponding canonical class in the Chothia analysis. Thus we have more than twice as many clusters as present in the Chothia analysis. Many of these are for CDR lengths not present in the PDB available to Chothia et al. These include L2-12, L3-12, L3-13, H1-10, H1-12, H1-16, H2-8 and H2-15. In a small number of cases, our clusters comprise more than one Chothia canonical class, usually when there are small differences in structure, e.g. L3-1λA and L3-1λC are both in our L3-9-1.
A total of 32 of our clusters do not have a corresponding cluster in the Martin-Thornton analysis, and an additional 10 have only distant relationships to their clusters (in italic bold type in Table 13), for a total of 42. Some loop lengths were not represented in their data set, mostly the same as those not present in the Chothia data, since the analyses were performed around the same time (1996-1997). Some of our clusters comprise more than one Martin-Thornton cluster but in almost all cases, these consist of conformations that are quite distant from our median structures and were excluded from our data set, often due to low resolution or high conformational energy.
Comparison of H3 torso analysis to Morea et al.
Morea et al.4 presented rules for the prediction of the bulged and non-bulged conformations of the torso on the basis of the residue types at positions 94 and 101 in the Chothia numbering (Honegger-Plückthun numbers 108 and 137 respectively). These are positions 2 and 6 of the seven-residue segments shown in Table 8. Bulged conformations are those with conformations –AB for the last two residues of the loop in our definition, predominantly cluster H3-anchor-1. Non-bulged have conformations –BB, consisting predominantly of cluster H3-anchor-2. In the Morea et al. analysis, bulged torsos have either lysine or arginine at position 94, while at position 101 usually (but not always) aspartic acid is present. For our data, we summarize the number of structures with Lys/Arg94 and Asp101 present or absent and the state of the loop as bulged or non-bulged in a contingency table shown in Table 14.
Chothia rules for bulged or non-bulged H3 torso
According to Morea et al., if position 94 is Lys/Arg and position 101 is Asp, the structure is bulged. A total of 155 structures have this sequence and end in the bulged conformation –AB, while 11 have that sequence but are not bulged and so are counterexamples to the Morea et al. rules. According to Morea et al., if Lys/Arg is present at residue 94 but Asp is absent at 101, the structure should still be bulged. This is true for 36 of the structures with that sequence but is not true for the remaining 5 structures. If Lys/Arg is not present at residue 94 but Asp is present at 101, the structure is supposed to be non-bulged. However, we find 39 bulged examples and only 16 non-bulged structures. Finally, in their study, no structures lacking both the Lys/Arg at position 94 and the Asp at position 101 were observed. In our data set there are 44 examples, of which 27 are bulged and 17 are not. Six structures do not seem to fit either the bulged or non-bulged conformations and so are not considered. Thus, regardless of Lys/Arg or other residues at position 94 or Asp or other residues at position 101, the majority of the H3 torso structures are bulged. However, with Lys/Arg at position 94, 92% of the structures are bulged (191 of 207). Without Lys/Arg, 67% of the structures are bulged (66 of 99).
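The percentages quoted above follow directly from the four rows of the contingency table. The short sketch below recomputes them from the counts given in the text; the dictionary layout and function name are illustrative choices only.

```python
# Counts taken from the text above:
# (Lys/Arg at 94, Asp at 101) -> (bulged, non-bulged)
counts = {
    (True, True): (155, 11),
    (True, False): (36, 5),
    (False, True): (39, 16),
    (False, False): (27, 17),
}

def bulged_fraction(has_lys_arg_94):
    """Fraction of bulged torsos among structures with or without Lys/Arg94."""
    bulged = 0
    total = 0
    for (lys_arg_94, _asp_101), (n_bulged, n_non_bulged) in counts.items():
        if lys_arg_94 == has_lys_arg_94:
            bulged += n_bulged
            total += n_bulged + n_non_bulged
    return bulged / total

print(f"With Lys/Arg94:    {100 * bulged_fraction(True):.0f}% bulged")
print(f"Without Lys/Arg94: {100 * bulged_fraction(False):.0f}% bulged")
```

With these counts, the two fractions are 191/207 and 66/99, which round to the 92% and 67% reported above.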
Canonicals - Chothia Canonical Assignment
This page allows you to identify the Chothia canonical classes of CDRs L1, L2, L3, H1 and H2 from an antibody sequence. When a CDR does not match a canonical class, the most similar class will be displayed and mismatching residues which caused the assignment to fail will be displayed.
Three sets of results will be given. The key residue requirements for each set are defined in a datafile. In these datafiles, the key residue positions are shown with the allowed amino acids at each position. The sequence which you supply is first aligned with a consensus antibody sequence to assign the Kabat numbering scheme. (Some of the datafiles use the Chothia numbering scheme; in these cases conversion between the numbering schemes is handled automatically.) The method used to do this is the same as that used by the sequence testing facility.
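As an illustration of how key residue templates of this kind can be applied, the sketch below checks a numbered sequence against a set of templates and reports the closest class together with the positions that failed. It is a minimal sketch under stated assumptions, not the server's actual code or datafile format: the function names, the class names, and the allowed residues in the usage example are all hypothetical, and the numbering step is assumed to have been done already.

```python
def mismatches(numbered, template):
    """Return (position, observed, allowed) for each key position that fails.

    numbered: dict mapping a position label (e.g. "L25") to a one-letter residue.
    template: dict mapping a position label to a string of allowed residues.
    A position missing from the numbered sequence counts as a mismatch.
    """
    failed = []
    for position, allowed in template.items():
        observed = numbered.get(position)
        if observed is None or observed not in allowed:
            failed.append((position, observed, allowed))
    return failed

def assign_class(numbered, templates):
    """Return (class_name, mismatch_list) for the template with fewest mismatches.

    An empty mismatch list means the sequence satisfies every key residue
    requirement of that class; ties go to the first template listed.
    """
    best = min(templates, key=lambda name: len(mismatches(numbered, templates[name])))
    return best, mismatches(numbered, templates[best])
```

For example, with two made-up templates `{"class-A": {"L2": "I", "L25": "A", "L33": "LM"}, "class-B": {"L2": "S", "L25": "G", "L33": "V"}}`, a sequence numbered as `{"L2": "I", "L25": "S", "L33": "L"}` is closest to `class-A`, with position L25 reported as the single mismatch. A real implementation would presumably also compare loop lengths before checking key residues.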
The first set of results uses a set of key residue templates derived by a new automated method (Martin and Thornton, (1996) Structural families of loops in homologous proteins: Automatic classification, modelling and application to antibodies. J. Mol. Biol., 263, 800-815). These results are likely to be the most accurate, but since the templates are more restrictive, there are likely to be more sequences which cannot be predicted. Datafile.
The second set of results comes from a set of key residues used by Oxford Molecular's AbM software. These are based on the key residues presented by Chothia et al., but have a few additional required and allowed residues. A few additional classes have also been defined. Datafile.
The final set of results comes from strict application of the templates which appear in a number of papers by Chothia et al. Definitions from various papers have been merged to create this template set. There is some slight confusion over the numbering of classes as Chothia has described classes without giving them "official" numbers. Currently, classes are numbered in the order in which they are described, with a note where appropriate if a later paper uses a different number. This may change soon. Datafile.
If you get any error or warning messages, please check you have entered your sequence correctly. Strange sequence features may cause the alignment stage to fail. Loops longer than anything observed in the current Kabat database will also cause the alignment to fail. If, after checking your sequence, you still get errors or warnings, please email me and I'll see if the programs can be modified to accommodate your sequence. Alternatively, your antibody may just be very strange!
Test your sequence against the Kabat sequence database.
Enter the amino acid sequence (1-letter code) of your Fv (optionally you may include the whole Fab fragment, but only the Fv portion will be tested).
Companies may use this public server, but need to be aware that data are not encrypted and it is not secure.
After trialing the system, companies should consider Abysis. A commercial licence will enable you to install a local version of this code together with an integrated database which can also store and analyse proprietary sequence and structure data.
For information on commercial licences, please contact the distributor Ebisu.
Copyright (c) 1995, Andrew C.R. Martin, UCL