The aim of my research is to separate out the unique spectra of proteins from NMR data derived from a solution containing two unknown proteins. From the resulting spectra the component protein molecules may be deduced by further analysis.
There are infinitely many ways in which a single NMR spectrum may be decomposed into two component spectra. It is, therefore, not possible to determine a unique solution to the problem unless more information is available. This may be provided by an experimental technique which produces a series of NMR spectra of the solution, with the intensity of each component spectrum varying from one spectrum to the next. The variation of intensity is different for each component and is found to decay in a Gaussian fashion.
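The argument above can be written compactly as follows. This is only a sketch: the symbols S_k, A, B, a_k, b_k, the decay constants and the experimental variable q_k are notation assumed here for illustration, and the exact form of the Gaussian decay depends on the experiment.

```latex
% Spectrum k of the series as a weighted sum of two fixed component lineshapes
S_k(x) = a_k\,A(x) + b_k\,B(x), \qquad k = 1, \dots, K

% Assumed Gaussian decay of each weight in a hypothetical experimental variable q_k
a_k = a_0 \exp\!\left(-\alpha\, q_k^{2}\right), \qquad
b_k = b_0 \exp\!\left(-\beta\, q_k^{2}\right), \qquad \alpha \neq \beta
```

With a single spectrum (K = 1), any non-negative split of S_1(x) into two parts satisfies the first equation, which is why the decomposition is not unique. With several spectra, the same A(x) and B(x) must reproduce every S_k(x) with different weights, and this is the extra information that constrains the solution.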
Until recently, the solution to the problem has
been sought using a mathematical approach.
To improve accuracy I am tackling the problem using
an artificial intelligence technique, the
genetic algorithm (GA). The goal is to produce a GA that
will evolve the two component spectra given a series of
experimental NMR spectra as previously outlined.
A fundamental question when designing a GA is how to encode the problem in a string that is suitable for manipulation by the GA. I have found that a 2-dimensional string of the form shown below works best for this problem: each row corresponds to one experimental spectrum and each column to a single data point.

The advantage of this representation is its simplicity. The string contains only a single type of data for a single component. It represents the lineshape function of component A, in each of the experimental spectra, as a fraction of the total intensity function. For example, the string describing the red component below would contain the following data:

The fitness of the string may be determined in the following way:
1. The fractions stored in the string are multiplied by the experimental data to give the absolute lineshape of component A in each spectrum.
2. The lineshape for each trial spectrum is then normalised. If the string were a correct solution, the normalised spectra would be identical. The discrepancy between the various normalised spectra can therefore be used to determine the quality of the current solution.
3. From the string which represents component A, a string representing component B is calculated (a one-off string, not manipulated by the GA). This is a simple matter of calculating the difference between the experimental spectra and the component A spectra.
4. The calculated component B string then undergoes the same normalisation and error-determining process as the component A string. The errors from the two components are then totalled to give an overall error for the string.
5. Fitness is evaluated as:
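Steps 1 to 4 above might be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the actual program: the class and method names are hypothetical, normalisation is assumed to mean scaling each spectrum to unit sum, and the discrepancy is assumed to be the squared deviation of each normalised spectrum from their mean. The sketch stops at the overall error and makes no assumption about how that error is converted to fitness in step 5.

```java
final class FitnessSketch {

    // Step 1: fraction of A times experimental intensity gives the absolute lineshape of A.
    static double[][] componentA(double[][] fractions, double[][] experimental) {
        double[][] a = new double[experimental.length][];
        for (int k = 0; k < experimental.length; k++) {
            a[k] = new double[experimental[k].length];
            for (int i = 0; i < experimental[k].length; i++) {
                a[k][i] = fractions[k][i] * experimental[k][i];
            }
        }
        return a;
    }

    // Step 3: component B is whatever remains of each experimental spectrum.
    static double[][] componentB(double[][] experimental, double[][] a) {
        double[][] b = new double[experimental.length][];
        for (int k = 0; k < experimental.length; k++) {
            b[k] = new double[experimental[k].length];
            for (int i = 0; i < experimental[k].length; i++) {
                b[k][i] = experimental[k][i] - a[k][i];
            }
        }
        return b;
    }

    // Steps 2 and 4 (assumed form): scale each spectrum so that its points sum to 1.
    static double[][] normalise(double[][] spectra) {
        double[][] out = new double[spectra.length][];
        for (int k = 0; k < spectra.length; k++) {
            double sum = 0.0;
            for (double v : spectra[k]) sum += v;
            out[k] = new double[spectra[k].length];
            for (int i = 0; i < spectra[k].length; i++) {
                out[k][i] = (sum == 0.0) ? 0.0 : spectra[k][i] / sum;
            }
        }
        return out;
    }

    // Assumed error measure: squared deviation of every normalised spectrum from their mean.
    static double discrepancy(double[][] normalised) {
        double error = 0.0;
        for (int i = 0; i < normalised[0].length; i++) {
            double mean = 0.0;
            for (double[] s : normalised) mean += s[i];
            mean /= normalised.length;
            for (double[] s : normalised) error += (s[i] - mean) * (s[i] - mean);
        }
        return error;
    }

    // Steps 1 to 4 combined: overall error = error of component A + error of component B.
    static double overallError(double[][] fractions, double[][] experimental) {
        double[][] a = componentA(fractions, experimental);
        double[][] b = componentB(experimental, a);
        return discrepancy(normalise(a)) + discrepancy(normalise(b));
    }

    public static void main(String[] args) {
        // Made-up two-spectrum example built from two known lineshapes.
        double[][] experimental = {{4.0, 3.0, 4.0}, {1.1, 1.2, 1.7}};
        double[][] exact = {{0.25, 2.0 / 3.0, 0.75}, {0.5 / 1.1, 1.0 / 1.2, 1.5 / 1.7}};
        double[][] flat = {{0.5, 0.5, 0.5}, {0.5, 0.5, 0.5}};
        System.out.println("exact string error: " + overallError(exact, experimental));
        System.out.println("flat string error:  " + overallError(flat, experimental));
    }
}
```

In the made-up example, the exact string should give an error very close to zero, while the flat guess gives a clearly larger one.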
The stochastic remainder method is used to select strings for reproduction, and the selected strings are paired at random. The next generation is produced by a 2-point crossover. The UNBLOX crossover operator is used to ensure that all data contained in the strings are sampled equally.
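The selection and pairing step might look like the following Java sketch. It shows one common variant of stochastic remainder selection, in which each string first receives the integer part of its expected number of copies and the remaining slots are filled by a roulette wheel over the fractional parts; whether the actual program uses this variant is an assumption. The class and method names are hypothetical, fitness values are assumed to be positive, and crossover (including UNBLOX) is not covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class StochasticRemainderSketch {

    // Returns the indices of the strings chosen for reproduction, one per population slot.
    // After shuffling, neighbours (0,1), (2,3), ... can be taken as the random pairs.
    static List<Integer> select(double[] fitness, Random rng) {
        int n = fitness.length;
        double mean = 0.0;
        for (double f : fitness) mean += f;
        mean /= n;

        List<Integer> selected = new ArrayList<>();
        double[] remainder = new double[n];
        double remainderSum = 0.0;
        for (int i = 0; i < n; i++) {
            double expected = fitness[i] / mean;       // expected number of copies
            int copies = (int) Math.floor(expected);   // deterministic part
            for (int c = 0; c < copies && selected.size() < n; c++) {
                selected.add(i);
            }
            remainder[i] = expected - copies;          // fractional part
            remainderSum += remainder[i];
        }

        // Fill the remaining slots by roulette wheel over the fractional parts.
        while (selected.size() < n) {
            double r = rng.nextDouble() * remainderSum;
            double acc = 0.0;
            int pick = n - 1;
            for (int i = 0; i < n; i++) {
                acc += remainder[i];
                if (r < acc) {
                    pick = i;
                    break;
                }
            }
            selected.add(pick);
        }

        Collections.shuffle(selected, rng);            // random pairing order
        return selected;
    }

    public static void main(String[] args) {
        double[] fitness = {4.0, 2.0, 1.0, 1.0};
        System.out.println(select(fitness, new Random()));
    }
}
```

For the fitness values in `main`, the expected copy counts are 2, 1, 0.5 and 0.5, so string 0 is always selected twice and string 1 once, while the last slot goes to string 2 or string 3 at random.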
The mutation operator consists of two parts:
1. Data points in the string are selected at random and replaced by a random number.
2. A whole column of data, corresponding to a single data point across all experimental spectra, is modified together. The modification is a 5% increase or decrease in the value of the data point.
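The two parts of the mutation operator could be sketched in Java as below. The names and the mutation rate are hypothetical. Drawing replacement values from [0, 1), applying the same 5% factor to every row of the chosen column, and capping the result at 1.0 are all assumptions, made here because the string stores fractions.

```java
import java.util.Arrays;
import java.util.Random;

final class MutationSketch {

    // Part 1: each entry is replaced, with probability `rate`, by a random number.
    // The range [0, 1) for the replacement is an assumption.
    static void pointMutation(double[][] fractions, double rate, Random rng) {
        for (double[] row : fractions) {
            for (int i = 0; i < row.length; i++) {
                if (rng.nextDouble() < rate) {
                    row[i] = rng.nextDouble();
                }
            }
        }
    }

    // Part 2: one column (the same data point in every spectrum) is raised or
    // lowered by 5% in all rows together. Capping at 1.0 is an assumption.
    static void columnMutation(double[][] fractions, int column, boolean increase) {
        double factor = increase ? 1.05 : 0.95;
        for (double[] row : fractions) {
            row[column] = Math.min(1.0, row[column] * factor);
        }
    }

    public static void main(String[] args) {
        Random rng = new Random();
        double[][] fractions = new double[5][50];      // 5 spectra, 50 data points
        for (double[] row : fractions) Arrays.fill(row, 0.5);

        pointMutation(fractions, 0.01, rng);           // hypothetical 1% point rate
        columnMutation(fractions, rng.nextInt(50), rng.nextBoolean());
        System.out.println(Arrays.toString(fractions[0]));
    }
}
```

Mutating a whole column at once keeps the change consistent across all spectra, which matters because the same data point of component A appears in every row.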
At present this method gives relatively good, but not perfect, solutions on an artificial, and much simplified, set of trial spectra. Tens of thousands of generations are required to produce these results from a set of 5 experimental NMR spectra, each containing 50 data points. This equates to around 10-15 minutes of computer time on a PIII-450 PC. The program is written in Java.