rstoolbox.utils.adapt_length

rstoolbox.utils.adapt_length(seqlist, start, stop, inclusive=False)

Pick only the sequence between the provided pattern tags.

When inclusive is False and either boundary tag is not found, the original sequence is returned unchanged, on the assumption that the tags lie outside the boundaries of the retrieved sequence.

When inclusive is True and either boundary tag is not found, an empty sequence is returned for that position: the caller asked for the tags to be part of the result, and they could not be retrieved.

Parameters:
  • seqlist (list of str) – list of protein sequences
  • start (str) – start pattern (not included in the final sequence unless inclusive is True)
  • stop (str) – stop pattern (not included in the final sequence unless inclusive is True)
  • inclusive (bool) – If False, retrieve sequence between the protein tags, otherwise include the protein tags in the returned sequence.
Returns:

list() of str, with one entry per input sequence

Example

In [1]: from rstoolbox.io import read_fastq
   ...: from rstoolbox.utils import translate_dna_sequence
   ...: from rstoolbox.utils import adapt_length
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = read_fastq("../rstoolbox/tests/data/cdk2_rand_001.fasq.gz")
   ...: df['sequence_A'] = df.apply(lambda row: translate_dna_sequence(row['sequence_A']),
   ...:                             axis=1)
   ...: bounds = ['GAS', 'FFG']
   ...: df['sequence_A'].values[:5]
   ...: 
Out[1]: 
array(['GASYFMQIPHRRMSVFGIAKVHARHKHLTGEVVALKKIRLFQPEPGPIMVKPNMCPYYYEWIGKRNQLDSFAPCISCKIKKRDTKVRGVCFHNSAIHCKSYRCVDQIFCGCIKWMMMGRDCEGQGESQNNTDIGGPTGCDINWRTCHFTELRHDCENWQSVICSTHHICTMGHIDQTSASETQDWDSFQWVMLRYIHGEQKKYSIQLGNWDAKQAVNMHRQELKVLVKKRHEEGKICACCVMSHIGVEISFFGKRSQRFQSEFMQHWVANFAMKFKFRNIGWPHTSWTQLAALGGWEGWHKPGT',
       'HGMPITNCPSDRYDRLEHMCVRTYLTGEVVALKKIRLHVQDMAHTLDHTLDHMKWAQSFRNGLMYSEHRGHCAYPVCSLRSSTVVRWTMVVEYPFWHTALWKPIQGTKVLMIGTRKNCVIQMLMRFETRANENTACPNTNFTDGGERCWCCACRFCKHEMLQHIEEKQIDITDWCLFMSQRQVRFKWVVLRLWLDTPIKTSSAVGIGSTNGATDNFEGCSWDTMALEYGSQEHNNCPVDIRDRLEFQDDGGLRNLNPSTDIYPYEMTLFLFMIKKYTFVRCEVNLDCQMRPEWIGDAL',
       'GASKYCPRARIQCEHYQEAFVCQTIITLTGEVVALKKIRLMFFEQSAEMLKQRMHGHHMGDDRRGWEYVSCWWCYAIHRWIHHSHFHEIRQETVTILGEYIRITCDQYLCKFKFAEVIRDAFVGMECITAKKKSQNKRNGIQYMTTASVALTQWHQVGLFTNVNQLDINQMTDSAREANFTPIYWIKQDCFLKTPYQNYEATVFQTADIWCRHEAECWDHQTWDWPNPLTQFCEEHKPSDVNGLENYRVFYFDWAFHKAILCHVKDMAQPFALRVFDEGCFWRCQVEQDYTLIPESWKCVTPGT',
       'MANNCDPAMEEVMLRCFGLDRCGLLTGEVVALKKIRLHEYIHQLWMSSYQSHNAHKTRYSTCHSQEQVCWQCDVFICAWCDQTFLVYTVNAYDYCCWWRKRCLNEGTTFKMPVAVPNWPYQLTHLCEEAISMDFGMGNWMCEHHEEWLHYSHMNMCFCFFTQWIQEYEDYQPDMCIDVNQQCTVMGDYKEIPIECKVQAYIARCLIYPIVKAGTHTFHGVGFPPGWGQGDKFHWNHKMSKFPGGEIMKTVYVVCQISGPPMQNEYRYLKPSNTLQNMSWYDGNHTSLSVASGWEDFFT',
       'GASFATLQKDKPPVMDTPKHCDAKKTWLTGEVVALKKIRLRPPCTSLGQRYVAIDIKHAVYSHKQHRSVMDIFHTLYFGKKWAIRVKEADSVREQTAWDFWWNWKHINQTCGFEDEVIHPNQMCMRILNNTKRDLLFQWPVVSCVKRVHIRIQTAFRFYACIVGFPYEPKMDIRTQICRGTIEEWFRFDVYRERIWRNEMYSQSEKSHNCWNNKNTQMCAPNMLKGSHNACKQSRHHWNAMDIRIHDELRIVGSPYYQTISYRVYNLNTPVKKSRNSGTARGHWHVGNHLKLRHDIYDCCAPGT'],
      dtype=object)

In [2]: adapt_length(df['sequence_A'].values[:5], bounds[0], bounds[1])
Out[2]: 
['YFMQIPHRRMSVFGIAKVHARHKHLTGEVVALKKIRLFQPEPGPIMVKPNMCPYYYEWIGKRNQLDSFAPCISCKIKKRDTKVRGVCFHNSAIHCKSYRCVDQIFCGCIKWMMMGRDCEGQGESQNNTDIGGPTGCDINWRTCHFTELRHDCENWQSVICSTHHICTMGHIDQTSASETQDWDSFQWVMLRYIHGEQKKYSIQLGNWDAKQAVNMHRQELKVLVKKRHEEGKICACCVMSHIGVEIS',
 'HGMPITNCPSDRYDRLEHMCVRTYLTGEVVALKKIRLHVQDMAHTLDHTLDHMKWAQSFRNGLMYSEHRGHCAYPVCSLRSSTVVRWTMVVEYPFWHTALWKPIQGTKVLMIGTRKNCVIQMLMRFETRANENTACPNTNFTDGGERCWCCACRFCKHEMLQHIEEKQIDITDWCLFMSQRQVRFKWVVLRLWLDTPIKTSSAVGIGSTNGATDNFEGCSWDTMALEYGSQEHNNCPVDIRDRLEFQDDGGLRNLNPSTDIYPYEMTLFLFMIKKYTFVRCEVNLDCQMRPEWIGDAL',
 'GASKYCPRARIQCEHYQEAFVCQTIITLTGEVVALKKIRLMFFEQSAEMLKQRMHGHHMGDDRRGWEYVSCWWCYAIHRWIHHSHFHEIRQETVTILGEYIRITCDQYLCKFKFAEVIRDAFVGMECITAKKKSQNKRNGIQYMTTASVALTQWHQVGLFTNVNQLDINQMTDSAREANFTPIYWIKQDCFLKTPYQNYEATVFQTADIWCRHEAECWDHQTWDWPNPLTQFCEEHKPSDVNGLENYRVFYFDWAFHKAILCHVKDMAQPFALRVFDEGCFWRCQVEQDYTLIPESWKCVTPGT',
 'MANNCDPAMEEVMLRCFGLDRCGLLTGEVVALKKIRLHEYIHQLWMSSYQSHNAHKTRYSTCHSQEQVCWQCDVFICAWCDQTFLVYTVNAYDYCCWWRKRCLNEGTTFKMPVAVPNWPYQLTHLCEEAISMDFGMGNWMCEHHEEWLHYSHMNMCFCFFTQWIQEYEDYQPDMCIDVNQQCTVMGDYKEIPIECKVQAYIARCLIYPIVKAGTHTFHGVGFPPGWGQGDKFHWNHKMSKFPGGEIMKTVYVVCQISGPPMQNEYRYLKPSNTLQNMSWYDGNHTSLSVASGWEDFFT',
 'GASFATLQKDKPPVMDTPKHCDAKKTWLTGEVVALKKIRLRPPCTSLGQRYVAIDIKHAVYSHKQHRSVMDIFHTLYFGKKWAIRVKEADSVREQTAWDFWWNWKHINQTCGFEDEVIHPNQMCMRILNNTKRDLLFQWPVVSCVKRVHIRIQTAFRFYACIVGFPYEPKMDIRTQICRGTIEEWFRFDVYRERIWRNEMYSQSEKSHNCWNNKNTQMCAPNMLKGSHNACKQSRHHWNAMDIRIHDELRIVGSPYYQTISYRVYNLNTPVKKSRNSGTARGHWHVGNHLKLRHDIYDCCAPGT']

In [3]: adapt_length(df['sequence_A'].values[:5], bounds[0], bounds[1], True)
Out[3]: 
['GASYFMQIPHRRMSVFGIAKVHARHKHLTGEVVALKKIRLFQPEPGPIMVKPNMCPYYYEWIGKRNQLDSFAPCISCKIKKRDTKVRGVCFHNSAIHCKSYRCVDQIFCGCIKWMMMGRDCEGQGESQNNTDIGGPTGCDINWRTCHFTELRHDCENWQSVICSTHHICTMGHIDQTSASETQDWDSFQWVMLRYIHGEQKKYSIQLGNWDAKQAVNMHRQELKVLVKKRHEEGKICACCVMSHIGVEISFFG',
 '',
 '',
 '',
 '']
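
The two fallback rules seen in Out[2] and Out[3] can be reproduced with a small stand-alone sketch. This is not the rstoolbox source: adapt_length_sketch is a hypothetical name, and the sketch assumes that the tags are literal strings (not regular expressions), that the first occurrence of start is used, and that stop is searched for only after it. The toy sequences are made up for illustration.

```python
def adapt_length_sketch(seqlist, start, stop, inclusive=False):
    """Illustrative approximation of the documented behavior (hypothetical helper)."""
    out = []
    for seq in seqlist:
        # Assumption: literal matching, first start tag, first stop tag after it.
        i = seq.find(start)
        j = seq.find(stop, i + len(start)) if i != -1 else -1
        if i == -1 or j == -1:
            # A tag is missing: empty string if the tags were requested,
            # otherwise the untouched input sequence.
            out.append("" if inclusive else seq)
        elif inclusive:
            out.append(seq[i:j + len(stop)])   # keep both tags
        else:
            out.append(seq[i + len(start):j])  # drop both tags
    return out


# Toy data: both tags present, start tag missing, stop tag missing.
seqs = ["GASYFMQFFGKR", "HGMPITFFG", "GASKYCPR"]
print(adapt_length_sketch(seqs, "GAS", "FFG"))
# ['YFMQ', 'HGMPITFFG', 'GASKYCPR']
print(adapt_length_sketch(seqs, "GAS", "FFG", inclusive=True))
# ['GASYFMQFFG', '', '']
```

As in the real example above, a sequence that carries only one of the two tags is treated the same way as a sequence that carries neither.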