When analysing the outcome of a protein design set, it can be useful to retrieve the data from the source structure (template) in order to assess, on a sequence level, which changes have happened.
For the purpose of this example, we will use a domain from a putative formate dehydrogenase accessory protein.
To load the data, we will use get_sequence_and_structure(). We provide this function with the PDB file 2pw9C.pdb, which contains only chain C of that crystal structure, and it will generate a file named 2pw9C.dssp.minisilent. The returned object will contain all the data that Rosetta generated for that structure, including the sequence.
Note
Throughout this process, the chainID of the decoy of interest is requested repeatedly. This is because the library can handle decoys with multiple chains (designed or not), so each analysis must be directed at the sequences of interest.
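To illustrate why a chain identifier is needed, note that the per-chain columns in the output of this example carry the chain as a suffix (sequence_C, structure_C, phi_C, psi_C). The following is a minimal sketch in plain Python, not part of the library: chain_columns is a hypothetical helper, and the two-chain column list is made up for illustration.

```python
def chain_columns(columns, chain_id):
    """Return the column names that end with the '_<chain_id>' suffix.

    Hypothetical helper for illustration only; it is not an rstoolbox function.
    """
    suffix = "_" + chain_id
    return [name for name in columns if name.endswith(suffix)]


# Made-up column names for a decoy with two chains, A and B.
columns = ["score", "description",
           "sequence_A", "structure_A",
           "sequence_B", "structure_B"]

print(chain_columns(columns, "B"))  # ['sequence_B', 'structure_B']
```

With a real data frame, the same helper could be given list(df.columns).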
Warning
To generate this file, the function will run Rosetta. If Rosetta is not locally installed, the documentation of get_sequence_and_structure() provides the RosettaScript that it will run. One can run it on a different computer/cluster and place the resulting silent file in the same directory.
As long as the file naming scheme is maintained, the function will find the file and skip the Rosetta execution.
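Before calling the function on a machine without Rosetta, it may help to check that the precomputed silent file is where it is expected. The sketch below uses only the standard library and rests on two assumptions drawn from this example: that the file name is the PDB stem followed by .dssp.minisilent, and that "the same directory" means the directory of the PDB file. expected_minisilent is a hypothetical helper and does not reproduce the library's own lookup logic.

```python
from pathlib import Path


def expected_minisilent(pdb_file):
    """Return the path where the silent file is assumed to live.

    Assumed scheme (inferred from the 2pw9C example, not from the library code):
    <PDB stem>.dssp.minisilent, in the same directory as the PDB file.
    """
    pdb_path = Path(pdb_file)
    return pdb_path.with_name(pdb_path.stem + ".dssp.minisilent")


pdb_file = "../rstoolbox/tests/data/2pw9C.pdb"
silent_file = expected_minisilent(pdb_file)

if silent_file.is_file():
    print(f"Found {silent_file.name}: the Rosetta run should be skipped.")
else:
    print(f"{silent_file.name} not found: Rosetta would need to run, "
          "or the file must be copied next to the PDB.")
```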
In [1]: import rstoolbox as rs
...: import pandas as pd
...: import matplotlib.pyplot as plt
...: import seaborn as sns
...: plt.rcParams['svg.fonttype'] = 'none' # When plt.savefig to 'svg', text is still text object
...: sns.set_style('whitegrid')
...: pd.set_option('display.width', 1000)
...: pd.set_option('display.max_columns', 500)
...: pd.set_option("display.max_seq_items", 3)
...: baseline = rs.io.get_sequence_and_structure('../rstoolbox/tests/data/2pw9C.pdb')
...: baseline
...:
Out[1]:
score fa_atr fa_rep fa_sol fa_intra_rep fa_intra_sol_xover4 lk_ball_wtd fa_elec pro_close hbond_sr_bb hbond_lr_bb hbond_bb_sc hbond_sc dslf_fa13 omega fa_dun p_aa_pp yhh_planarity ref rama_prepro time description sequence_C structure_C phi_C psi_C
0 73.639 -283.433 74.006 150.946 0.639 11.541 -5.366 -65.591 15.155 -15.177 -14.377 -6.115 0.0 0.0 18.662 171.168 -13.001 0.0 25.828 8.754 2.0 2pw9C_0001 ETPYAIALNDRVIGSSMVLPVDLEEFGAGFLFGQGYIKKAEEIREILVCPQGRISVYA LEEEEEEELLEEEEEEEELLLLHHHHHHHHHHHHLLLLLLLLLLLEEEELLLEEEELL [0.0, -141.589, -72.8185, ...] [121.777, 140.739, 148.274, ...]
In [2]: baseline.get_sequence('C')