rstoolbox.components.DesignSeries.generate_mutants_from_matrix

DesignSeries.generate_mutants_from_matrix(seqID, matrix, count, key_residues=None, limit_refseq=False)

From a provided positional frequency matrix, generates count random variants.

It takes into account the individual frequency assigned to each residue type at each position. It does not generate the highest-scoring sequence possible according to the matrix; instead, it picks a residue type at random at each position, weighted by the frequencies for that position.
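The frequency-weighted pick described above can be sketched as follows. This is a minimal illustration, not the rstoolbox implementation: the helper name `sample_variant`, the toy four-letter alphabet, and the assumption that matrix rows are ordered from sequence position 1 are all hypothetical.

```python
import numpy as np
import pandas as pd


def sample_variant(reference, matrix, key_residues, rng):
    """Draw one variant of ``reference`` (hypothetical helper).

    Positions in ``key_residues`` are 1-based; row ``i`` of ``matrix``
    is assumed to hold the frequencies for position ``i + 1``.
    """
    seq = list(reference)
    residue_types = matrix.columns.to_numpy()
    for pos in key_residues:
        freqs = matrix.iloc[pos - 1].to_numpy(dtype=float)
        probs = freqs / freqs.sum()  # normalise so the weights sum to 1
        # Weighted random pick: not necessarily the most frequent residue.
        seq[pos - 1] = rng.choice(residue_types, p=probs)
    return "".join(seq)


# Toy matrix: columns are residue types, index is the sequence position.
alphabet = list("ACDE")
matrix = pd.DataFrame(
    [[0.70, 0.10, 0.10, 0.10],
     [0.25, 0.25, 0.25, 0.25],
     [0.00, 0.00, 1.00, 0.00]],
    columns=alphabet, index=[1, 2, 3])

rng = np.random.default_rng(0)
variant = sample_variant("AAA", matrix, [2, 3], rng)
```

Position 1 is left untouched because it is not listed as a key residue, position 2 can become any of the four residue types, and position 3 can only become `D` because that is the only type with a non-zero frequency.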

For each DesignSeries, it will generate a DesignFrame in which the original sequence becomes the reference_sequence, inheriting the reference_shift.

Warning

This is a computationally expensive function. Take this into consideration before running it.

Each DesignFrame will have the following structure:

Column              Data Content
description         Identifier of the mutant
sequence_<seqID>    Sequence content
pssm_score_<seqID>  Score obtained by applying matrix
Parameters:
  • seqID (str) – Identifier of the sequence of interest
  • matrix (DataFrame) – Positional frequency matrix. column: residue type; index: sequence position.
  • count (int) – Expected number of unique generated combinations. If this number exceeds the number of possible combinations, it defaults to the total number of possible combinations.
  • key_residues (Union[int, list() of int, str, Selection]) – Residues of interest.
  • limit_refseq (bool) – When True, pick only residue types whose probability is equal to or higher than that of the residue in the source sequence.
Returns:

list() of DesignFrame - New set of design sequences.

Raises:
ValueError: if matrix rows do not match sequence length.
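How count and limit_refseq interact with the matrix can be sketched roughly as below. This is an assumption-laden illustration rather than the library's code: the helpers `allowed_residues` and `max_unique_variants` are hypothetical, rows are assumed to be ordered from position 1, and treating zero-frequency residue types as unavailable is a guess about the behaviour.

```python
import math

import pandas as pd


def allowed_residues(matrix, reference, pos, limit_refseq=False):
    """Residue types that may be picked at 1-based ``pos`` (hypothetical helper)."""
    row = matrix.iloc[pos - 1]
    if limit_refseq:
        # Keep only types at least as frequent as the source residue.
        row = row[row >= row[reference[pos - 1]]]
    # Assumption: a residue type with zero frequency can never be drawn.
    return row[row > 0]


def max_unique_variants(matrix, reference, key_residues, limit_refseq=False):
    """Upper bound on distinct variants over the selected positions."""
    return math.prod(
        len(allowed_residues(matrix, reference, pos, limit_refseq))
        for pos in key_residues
    )


# Toy matrix: columns are residue types, index is the sequence position.
matrix = pd.DataFrame(
    [[0.70, 0.10, 0.10, 0.10],
     [0.25, 0.25, 0.25, 0.25],
     [0.00, 0.00, 1.00, 0.00]],
    columns=list("ACDE"), index=[1, 2, 3])

reference = "AAA"
total = max_unique_variants(matrix, reference, [1, 2, 3])
total_limited = max_unique_variants(matrix, reference, [1, 2, 3], limit_refseq=True)

# A request larger than the number of options falls back to the total.
requested = 100
effective_count = min(requested, total)
```

Under these assumptions the toy matrix allows 4 x 4 x 1 = 16 distinct variants, or 1 x 4 x 1 = 4 with the limit_refseq filter, so a request for 100 variants would be reduced to 16.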

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.tests.helper import random_frequency_matrix
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score', 'description'], 'sequence': 'B'})
   ...: df.add_reference_sequence('B', df.get_sequence('B').values[0])
   ...: matrix = random_frequency_matrix(len(df.get_reference_sequence('B')), 0)
   ...: key_res = [3,5,8,12,15,19,25,27]
   ...: mutants = df.iloc[1].generate_mutants_from_matrix('B', matrix, 5, key_res)
   ...: mutants[0].identify_mutants('B')
   ...: 
Out[1]: 
                            description                                                                                                            sequence_B  pssm_score_B                             mutants_B    mutant_positions_B  mutant_count_B
0  test_3lhp_binder_labeled_00002_v0001  PKMEDAMYEAYSLIMKYMHKAQKEGQMEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  6.699162      P3M,E5D,R8Y,K12S,K15M,L19H,A25G,E27M  3,5,8,12,15,19,25,27  8             
1  test_3lhp_binder_labeled_00002_v0002  PKTEEAMIEAYDLIEKYMCKAQKEVQPEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  6.663401      P3T,R8I,K12D,K15E,L19C,A25V,E27P      3,8,12,15,19,25,27    7             
2  test_3lhp_binder_labeled_00002_v0003  PKMEDAMGEAYTLIRKYMDKAQKEKQTEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  6.607991      P3M,E5D,R8G,K12T,K15R,L19D,A25K,E27T  3,5,8,12,15,19,25,27  8             
3  test_3lhp_binder_labeled_00002_v0004  PKSEYAMDEAYDLIIKYMAKAQKECQEEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  6.294339      P3S,E5Y,R8D,K12D,K15I,L19A,A25C       3,5,8,12,15,19,25     7             
4  test_3lhp_binder_labeled_00002_v0005  PKKEEAMSEAYQLIVKYMVKAQKEHQEEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  6.644130      P3K,R8S,K12Q,K15V,L19V,A25H           3,8,12,15,19,25       6