Return to the Spring'00 CONFCHEM.

Deconvolution of 2-component NMR spectra

Martin Porter
Physical and Theoretical Chemistry Laboratory
Oxford University, England

martin.porter@christ-church.oxford.ac.uk


The aim of my research is to separate the individual spectra of two unknown proteins from NMR data recorded on a solution containing both. The component protein molecules may then be deduced from the resulting spectra by further analysis.

There are infinitely many ways in which a single NMR spectrum may be decomposed into two component spectra. It is therefore not possible to determine a unique solution to the problem unless more information is available. This may be provided by an experimental technique which produces a series of NMR spectra of the solution, with the intensity of each component spectrum varying from one spectrum to the next. The variation of intensity is different for each component, and is found to decay in a Gaussian fashion.
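A series of this kind can be imitated with synthetic data, which is useful for testing. The sketch below assumes that each spectrum in the series is a weighted sum of two fixed lineshapes, and that each weight decays as exp(-rate * g^2) in some stepped experimental variable g. The variable g, the rates, the class name `SyntheticSeries` and the toy lineshapes are all assumptions made for illustration; the text only states that the decay is Gaussian.

```java
import java.util.Arrays;

// Sketch only: the Gaussian dependence on a stepped variable g is an assumption.
final class SyntheticSeries {

    /** Gaussian attenuation exp(-rate * g^2) at experimental setting g. */
    static double weight(double rate, double g) {
        return Math.exp(-rate * g * g);
    }

    /** Row k holds weight(rateA, g[k]) * a + weight(rateB, g[k]) * b. */
    static double[][] mix(double[] a, double[] b,
                          double rateA, double rateB, double[] g) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("lineshapes must have equal length");
        }
        double[][] series = new double[g.length][a.length];
        for (int k = 0; k < g.length; k++) {
            double wa = weight(rateA, g[k]);
            double wb = weight(rateB, g[k]);
            for (int j = 0; j < a.length; j++) {
                series[k][j] = wa * a[j] + wb * b[j];
            }
        }
        return series;
    }

    public static void main(String[] args) {
        double[] a = {0.0, 1.0, 0.2, 0.0};  // toy lineshape, component A
        double[] b = {0.0, 0.3, 1.0, 0.5};  // toy lineshape, component B
        double[] g = {0.0, 1.0, 2.0};       // hypothetical experimental settings
        for (double[] row : mix(a, b, 0.1, 0.6, g)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because the two rates differ, the ratio of A to B changes from row to row, and it is this change that makes a unique decomposition possible in principle.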

Until recently, the solution to the problem has been sought using a mathematical approach.

To improve accuracy I am tackling the problem using an artificial intelligence technique, the genetic algorithm (GA). The goal is to produce a GA that will evolve the two component spectra given a series of experimental NMR spectra as previously outlined.

A fundamental question when designing a GA is how to encode the problem in a string that is suitable for manipulation by the GA. I have found that for this problem a 2-dimensional string of the form shown below is best.




The advantage of this representation is its simplicity. The string contains only a single type of data for a single component. The string represents the lineshape function of component A, in each of the experimental spectra, as a fraction of the total intensity function. For example, the string describing the red component below would contain the following data:
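In Java, a string of this kind could be held as a two-dimensional array of fractions, with one row per experimental spectrum and one column per data point. The sketch below is only an assumed layout filled with random values; it does not reproduce the data for the red component, and the class and method names (`FractionString`, `random`, `isValid`) are hypothetical.

```java
import java.util.Random;

// Sketch only: one possible in-memory layout for the two-dimensional string.
final class FractionString {

    /** fractions[k][j]: share of the intensity at point j of spectrum k assigned to A. */
    static double[][] random(int numSpectra, int numPoints, Random rng) {
        double[][] fractions = new double[numSpectra][numPoints];
        for (double[] row : fractions) {
            for (int j = 0; j < row.length; j++) {
                row[j] = rng.nextDouble();  // uniform in [0, 1)
            }
        }
        return fractions;
    }

    /** A string is usable only if it is rectangular and every entry lies in [0, 1]. */
    static boolean isValid(double[][] fractions) {
        if (fractions.length == 0) {
            return false;
        }
        int width = fractions[0].length;
        for (double[] row : fractions) {
            if (row.length != width) {
                return false;
            }
            for (double f : row) {
                if (!(f >= 0.0 && f <= 1.0)) {
                    return false;  // also rejects NaN
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] s = random(5, 50, new Random(1L));
        System.out.println(s.length + " x " + s[0].length + ", valid = " + isValid(s));
    }
}
```

Storing fractions rather than absolute intensities keeps every entry in the same range, so a single random-number generator can initialise and mutate any position in the string.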


The fitness of the string may be determined in the following way:

1. The fractions stored in the string are multiplied by the experimental data to give the absolute lineshape of component A in each spectrum.

2. The lineshape for each trial spectrum is then normalised. If the string were a correct solution, the normalised spectra would be identical. The discrepancy between the various normalised spectra can therefore be used to determine the quality of the current solution.





3. From the string which represents component A, a string representing component B is calculated (a one-off string, not manipulated by the GA). This simply requires subtracting the component A spectra from the experimental spectra.

4. The calculated component B string then undergoes the same normalisation and error-determining process as the component A string. The errors from the two components are then summed to give an overall error for the string.

5. Fitness is evaluated as:
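Steps 1 to 4, up to the overall error that step 5 converts into a fitness, might be sketched in Java as follows. Two details are assumptions, since the text does not specify them: each spectrum is normalised to unit total intensity, and the discrepancy is measured as the summed squared deviation of each normalised spectrum from their mean. The class name `StringError` and its method names are hypothetical, and the sketch stops at the overall error rather than guessing the fitness expression.

```java
// Sketch only: unit-area normalisation and a squared-deviation error are assumptions.
final class StringError {

    /** Step 1: absolute lineshape of A = fraction * experimental intensity. */
    static double[][] componentA(double[][] frac, double[][] expt) {
        double[][] a = new double[expt.length][expt[0].length];
        for (int k = 0; k < expt.length; k++) {
            for (int j = 0; j < expt[k].length; j++) {
                a[k][j] = frac[k][j] * expt[k][j];
            }
        }
        return a;
    }

    /** Step 3: component B = experimental spectra minus component A. */
    static double[][] componentB(double[][] expt, double[][] a) {
        double[][] b = new double[expt.length][expt[0].length];
        for (int k = 0; k < expt.length; k++) {
            for (int j = 0; j < expt[k].length; j++) {
                b[k][j] = expt[k][j] - a[k][j];
            }
        }
        return b;
    }

    /** Scales a spectrum to unit total intensity; an all-zero spectrum is left as zeros. */
    static double[] normalise(double[] spectrum) {
        double sum = 0.0;
        for (double v : spectrum) {
            sum += v;
        }
        double[] out = spectrum.clone();
        if (sum != 0.0) {
            for (int j = 0; j < out.length; j++) {
                out[j] /= sum;
            }
        }
        return out;
    }

    /** Steps 2 and 4: squared deviation of each normalised spectrum from their mean. */
    static double discrepancy(double[][] component) {
        int rows = component.length;
        int cols = component[0].length;
        double[][] norm = new double[rows][];
        double[] mean = new double[cols];
        for (int k = 0; k < rows; k++) {
            norm[k] = normalise(component[k]);
            for (int j = 0; j < cols; j++) {
                mean[j] += norm[k][j] / rows;
            }
        }
        double error = 0.0;
        for (int k = 0; k < rows; k++) {
            for (int j = 0; j < cols; j++) {
                double d = norm[k][j] - mean[j];
                error += d * d;
            }
        }
        return error;
    }

    /** Overall error for a string: error of component A plus error of component B. */
    static double totalError(double[][] frac, double[][] expt) {
        double[][] a = componentA(frac, expt);
        double[][] b = componentB(expt, a);
        return discrepancy(a) + discrepancy(b);
    }
}
```

With this measure, a string holding the true fractions gives an overall error of zero apart from rounding, because the normalised spectra of each component then coincide across the series.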

The stochastic remainder method is used to select strings for reproduction, and the selected strings are paired at random. The next generation is produced by a 2-point crossover. The UNBLOX crossover operator is used to ensure that all data contained in the strings are sampled equally.
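A rough Java sketch of these reproduction steps is given here under stated assumptions. Selection follows one common form of remainder stochastic sampling: each string receives the integer part of its expected number of copies outright, and the remaining places are filled by Bernoulli trials on the fractional parts. Pairing is done by shuffling the selected indices. The crossover swaps a rectangular block that wraps around both edges of the string, which is one way of giving every position the same chance of being exchanged; it is an assumed stand-in and should not be read as the definition of UNBLOX. All names (`Reproduction`, `select`, `shuffle`, `blockCrossover`) are hypothetical.

```java
import java.util.Random;

// Sketch only: one common variant of each operator, not necessarily the one used here.
final class Reproduction {

    /** Remainder stochastic sampling: returns the indices of the strings chosen to reproduce. */
    static int[] select(double[] fitness, Random rng) {
        int n = fitness.length;
        double mean = 0.0;
        for (double f : fitness) {
            if (f < 0.0) {
                throw new IllegalArgumentException("fitness must be non-negative");
            }
            mean += f / n;
        }
        if (!(mean > 0.0)) {
            throw new IllegalArgumentException("mean fitness must be positive");
        }
        int[] chosen = new int[n];
        double[] remainder = new double[n];
        int count = 0;
        for (int i = 0; i < n; i++) {
            double expected = fitness[i] / mean;     // expected number of copies
            int whole = (int) Math.floor(expected);  // copies awarded outright
            remainder[i] = expected - whole;
            for (int c = 0; c < whole && count < n; c++) {
                chosen[count++] = i;
            }
        }
        // Fill the remaining places by Bernoulli trials on the fractional parts.
        for (int i = 0; count < n; i = (i + 1) % n) {
            if (rng.nextDouble() < remainder[i]) {
                chosen[count++] = i;
            }
        }
        return chosen;
    }

    /** Fisher-Yates shuffle, so that neighbours (0,1), (2,3), ... form random pairs. */
    static void shuffle(int[] indices, Random rng) {
        for (int i = indices.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = indices[i];
            indices[i] = indices[j];
            indices[j] = t;
        }
    }

    /** Swaps a rectangular block, wrapped around both edges, between two strings in place. */
    static void blockCrossover(double[][] p, double[][] q, Random rng) {
        int rows = p.length;
        int cols = p[0].length;
        int r0 = rng.nextInt(rows);          // first cut point in each dimension
        int c0 = rng.nextInt(cols);
        int height = 1 + rng.nextInt(rows);  // block size fixes the second cut point
        int width = 1 + rng.nextInt(cols);
        for (int dr = 0; dr < height; dr++) {
            for (int dc = 0; dc < width; dc++) {
                int k = (r0 + dr) % rows;
                int j = (c0 + dc) % cols;
                double t = p[k][j];
                p[k][j] = q[k][j];
                q[k][j] = t;
            }
        }
    }
}
```

Without the wrap-around, positions near the edges of the string would fall inside the swapped block less often than positions near the centre.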

The mutation operator consists of two parts:

1. Data points in the string are selected at random and replaced by a random number.

2. A whole column of data, corresponding to a single data point across all experimental spectra, is modified together. The modification is an increase or decrease of 5% in the value of each data point.
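The two parts of the mutation operator could be sketched in Java as follows. Several details are assumptions: each entry is mutated independently with a per-entry probability `rate`, the replacement is a uniform random fraction in [0, 1), the 5% change is relative and applied in the same direction to the whole column, and values are capped at 1 so that they remain valid fractions. The class and method names are hypothetical.

```java
import java.util.Random;

// Sketch only: mutation rate, 5% as a relative change, and the cap at 1 are assumptions.
final class Mutation {

    /** Part 1: each entry is replaced, with probability rate, by a fresh random fraction. */
    static void pointMutate(double[][] s, double rate, Random rng) {
        for (double[] row : s) {
            for (int j = 0; j < row.length; j++) {
                if (rng.nextDouble() < rate) {
                    row[j] = rng.nextDouble();
                }
            }
        }
    }

    /** Part 2: one random column is scaled up or down by 5% in every spectrum. */
    static int columnMutate(double[][] s, Random rng) {
        int j = rng.nextInt(s[0].length);
        double factor = rng.nextBoolean() ? 1.05 : 0.95;
        for (double[] row : s) {
            row[j] = Math.min(1.0, row[j] * factor);
        }
        return j;  // index of the modified column
    }

    public static void main(String[] args) {
        Random rng = new Random(3L);
        double[][] s = new double[5][50];
        pointMutate(s, 1.0, rng);  // rate 1.0 randomises every entry
        int j = columnMutate(s, rng);
        System.out.println("modified column " + j);
    }
}
```

The column move changes one data point consistently across all spectra, which a series of independent point mutations would be unlikely to do.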

At present this method gives relatively good, but not perfect, solutions on an artificial and much simplified set of trial spectra. Tens of thousands of generations are required to produce these results from a set of 5 experimental NMR spectra, each containing 50 data points. This equates to around 10-15 minutes of computer time on a PIII-450 PC. The programme is written in Java.


Martin R Porter

