rstoolbox.analysis.label_sequence

rstoolbox.analysis.label_sequence(df, seqID, label, complete=False)

Gets the portion of a sequence that is covered by a label.

Requires that label data is available for the given seqID.

Adds a new column to the data container:

New Column             Data Content
<label>_<seqID>_seq    Sequence trimmed to the positions covered by the label.
Parameters:
  • df (Union[DesignFrame, DesignSeries]) – Data container.
  • seqID (str) – Identifier of the sequence of interest.
  • label (str) – Label identifier.
  • complete (bool) – Only applies when input is a DesignFrame. Generates a gapped alignment considering the matches of label as those of the highest matching decoy.
Returns:

Union[DesignFrame, DesignSeries]

Raises:
  • NotImplementedError – if the data passed is not in Union[DesignFrame, DesignSeries].
  • KeyError – if there is no label information for chain seqID of the decoys.

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.analysis import label_sequence
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score'], 'sequence': '*',
   ...:                          'labels': ['MOTIF']})
   ...: df = label_sequence(df, 'B', 'MOTIF')
   ...: df.head()
   ...: 
Out[1]: 
     score lbl_MOTIF                                                                                                                                                     sequence_A                                                                                                            sequence_B             MOTIF_B_seq
0 -206.678  [B, A]    AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEEIIERNSETFTETWNRFITHTEHVDLDFNSVFLEIFHRGDPSLGRALAWMAWCMHACRTLCCNQSTPYYVVDLSVRGMLEASEGLDGWIHQQGGWSTLIEDNI  TRPEEARERAWRLAEIAMRKGWEEHEREWEWWKRASKGREERDMLPERMIAAALRAIGEIFNAEWQMRLEMEKERKNPNAGEEKMKEQKKEAWKIAYYWGLMAAYWIKQHREKERK  DMLPERMIAAALRAIGEIFNAE
1 -214.362  [B, A]    AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEEIIERNSETFTETWNRFITHTEHVDLDFNSVFLEIFHRGDPSLGRALAWMAWCMHACRTLCCNQSTPYYVVDLSVRGMLEASEGLDGWIHQQGGWSTLIEDNI  PKPEEAMREAYKLIKKYMLKAQKEAQEEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  DMFPEKMIAQALRAIGEIFNAY
2 -203.582  [B, A]    AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEEIIERNSETFTETWNRFITHTEHVDLDFNSVFLEIFHRGDPSLGRALAWMAWCMHACRTLCCNQSTPYYVVDLSVRGMLEASEGLDGWIHQQGGWSTLIEDNI  TKPEEMAREAYKRMLKALKQGEEEMKRMYEQMKKGVDSKEERDMEPEKMIAIALRAIGELFNAWMKALRHMKELRKLGTSGPKEEEKHWRWIFELHRWAGEEIQRAAEIQERKARW  DMEPEKMIAIALRAIGELFNAW
3 -213.779  [B, A]    AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEEIIERNSETFTETWNRFITHTEHVDLDFNSVFLEIFHRGDPSLGRALAWMAWCMHACRTLCCNQSTPYYVVDLSVRGMLEASEGLDGWIHQQGGWSTLIEDNI  TKPEEWARWAYKEHLKMAEKHRKEMEIEWEELKRRDGKEEEKDMWPERMIAMALRAIGELFNHHMYAEMRAKEEKKKPEAKTEEARRARREIMKYHHEAGRLIEEAMRRLMERHKK  DMWPERMIAMALRAIGELFNHH
4 -213.972  [B, A]    AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEEIIERNSETFTETWNRFITHTEHVDLDFNSVFLEIFHRGDPSLGRALAWMAWCMHACRTLCCNQSTPYYVVDLSVRGMLEASEGLDGWIHQQGGWSTLIEDNI  KKWEEMMREAERQGKEYAQKAWKEALLEWKWMRKRPVTEEMKDMAPEWMIAAALRAIGEHFNIYWQQKLEHEKLRKIPNVPEEELEKGKEELKRIEEEAARMAEKYMQELRKKMES  DMAPEWMIAAALRAIGEHFNIY
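
To make the idea of a sequence "trimmed by the label" concrete, the following is a minimal, library-independent sketch. It is not how rstoolbox implements label_sequence. The helper name trim_by_positions is hypothetical, and it assumes a label can be reduced to a set of 1-based residue positions on the chain.

```python
def trim_by_positions(sequence, positions):
    """Return the residues of `sequence` found at the given 1-based positions.

    Hypothetical helper for illustration only; it is not part of rstoolbox.
    """
    # Sort and de-duplicate so the output follows the order of the sequence.
    ordered = sorted(set(positions))
    return "".join(sequence[p - 1] for p in ordered)


# Toy chain and a toy "label" covering residues 4 to 7 (inclusive).
chain = "TRPEEARERAW"
motif_positions = range(4, 8)

print(trim_by_positions(chain, motif_positions))  # EEAR
```

In the example output above, MOTIF_B_seq plays the same role: each value is the stretch of sequence_B that the MOTIF label covers for that decoy.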