Strings¶
What you’ll learn
How to store a sequence of character data in a string
How to extract substrings by slicing
How to use string methods
Example: RNA Sequences¶
In biology, an RNA sequence consists of a chain of the nucleotides Adenine, Uracil, Cytosine and Guanine in a specific order. We can represent an RNA sequence using the four letters A, U, C and G.
In the body, the RNA sequence is used to produce a protein in a process called translation. The sequence is first divided into three-character subsequences termed ‘codons’. For example, the 15-character RNA sequence AUGAGACUCUGAGAC is divided into the codons AUG, AGA, CUC, UGA, and GAC.
Each of the codons identifies a specific amino acid, as shown in the partial amino acid translation table on the right. The RNA sequence AUGAGACUCUGAGAC would therefore be translated by the body into the amino acid sequence methionine, arginine, leucine, (stop), aspartic acid. Using the one-letter abbreviations, this could be written as MRL.D.
Finally, the body chains together these amino acids into a protein. The stop codon represents the end of the chain, so the RNA sequence would be translated into a protein comprising a chain of three amino acids: methionine-arginine-leucine.
RNA Translation
RNA sequence: AUGAGACUCUGAGAC
Codons: AUG, AGA, CUC, UGA, GAC
Amino Acids: MRL.D
Protein sequence: MRL
Bioinformatics
Bioinformatics is the application of tools of computation and analysis to the capture and interpretation of biological data. One of the main applications of bioinformatics is the analysis of genome sequence data, such as that undertaken by the Human Genome Project.
We’ll now see how we can implement this procedure programmatically. In Python, a sequence of character data is termed a string.
rna_seq = "AUGAGACUCUGAGAC"
print("RNA sequence:", rna_seq)
RNA sequence: AUGAGACUCUGAGAC
First, let’s write a function translate which takes a three-letter string and returns a single letter representing the corresponding amino acid.
def translate(codon):
    codon_list = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG", "AUG", "AGA", "AGG", "CGA", "CGU", "CGG", "CGC", "GAU", "GAC", "UAA", "UAG", "UGA"]
    amino_acids = ["L", "L", "L", "L", "L", "L", "M", "R", "R", "R", "R", "R", "R", "D", "D", ".", ".", "."]
    i = codon_list.index(codon)
    aa = amino_acids[i]
    return aa
# Test the function using the CGG codon
codon = "CGG"
aa = translate(codon)
print("Codon:", codon)
print("Amino acid:", aa)
Codon: CGG
Amino acid: R
The list codon_list contains the three-letter codons, and the list amino_acids contains the corresponding single-letter amino acid abbreviations. Translating from codon to amino acid is simply a case of finding the index position of the codon in codon_list and identifying the character in the same position in amino_acids:
i = codon_list.index(codon)
aa = amino_acids[i]
i is an integer representing the index of the string codon in codon_list.
Finding items in a list
See the previous section Lists and Plotting for how to find items in a list.
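The pairing of the two lists by index position can be seen in isolation with a much smaller table. The following is a minimal sketch using a shortened, three-entry version of the lists; it also shows that the index method raises a ValueError when the item is not in the list, which is what would happen if translate were given a codon missing from the partial table. The variable name found is introduced here purely for illustration.

```python
# Two parallel lists: the codon at position k pairs with the amino acid at position k.
codon_list = ["AUG", "AGA", "UGA"]
amino_acids = ["M", "R", "."]

i = codon_list.index("AGA")   # position of "AGA" in codon_list
print(i, amino_acids[i])      # 1 R

# index raises a ValueError if the item is not in the list.
try:
    codon_list.index("XYZ")
    found = True
except ValueError:
    found = False
print(found)                  # False
```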
Next, we would like to split the string rna_seq into three-character codons, then use our function to determine the amino acid for each.
n = len(rna_seq)
for i in range(0, n, 3):
    codon = rna_seq[i:i+3]
    print(translate(codon))
M
R
L
.
D
First we use the len function to determine the number of characters in the string rna_seq. Next we generate a loop from 0 up to (but not including) n in steps of 3:
for i in range(0, n, 3):
The expression rna_seq[i:i+3] extracts a 3-character substring from rna_seq beginning at the character at index i.
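To see the slicing step on its own, the sketch below collects the codons in a list instead of translating them. The names codons, short_seq and last are introduced here for illustration only. The final lines show what happens when the length of the sequence is not a multiple of three: a slice that runs past the end of the string simply returns the characters that are available.

```python
rna_seq = "AUGAGACUCUGAGAC"

codons = []
for i in range(0, len(rna_seq), 3):
    codons.append(rna_seq[i:i+3])   # characters i, i+1 and i+2
print(codons)  # ['AUG', 'AGA', 'CUC', 'UGA', 'GAC']

# A slice that runs past the end of the string is shortened, not an error.
short_seq = "AUGAG"
last = short_seq[3:6]
print(last)    # AG
```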
Finally, what if we’d like to stop processing the sequence once we reach the ‘stop’ codon? Python has a useful keyword break which allows us to do exactly that:
n = len(rna_seq)
for i in range(0, n, 3):
    codon = rna_seq[i:i+3]
    if codon == "UGA":
        break
    print(translate(codon))
M
R
L
As soon as the break keyword is reached, the enclosing for loop is exited, even if this means aborting the loop early.
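The pieces above can be combined into a single function that returns the protein as a string. This is one possible sketch rather than a definitive implementation: the function name translate_sequence is an assumption made for this example, and instead of comparing against "UGA" it stops whenever translate returns ".", so that all three stop codons in the partial table (UAA, UAG and UGA) end the chain.

```python
def translate(codon):
    # Partial codon table, as in the translate function above.
    codon_list = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG", "AUG", "AGA", "AGG",
                  "CGA", "CGU", "CGG", "CGC", "GAU", "GAC", "UAA", "UAG", "UGA"]
    amino_acids = ["L", "L", "L", "L", "L", "L", "M", "R", "R",
                   "R", "R", "R", "R", "D", "D", ".", ".", "."]
    return amino_acids[codon_list.index(codon)]


def translate_sequence(rna_seq):
    # Hypothetical helper: build the protein string one amino acid at a time.
    protein = ""
    for i in range(0, len(rna_seq), 3):
        aa = translate(rna_seq[i:i+3])
        if aa == ".":        # any stop codon ends the chain
            break
        protein = protein + aa
    return protein


print(translate_sequence("AUGAGACUCUGAGAC"))  # MRL
print(translate_sequence("AUGGAUUAA"))        # MD
```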
String Variables¶
A string is a data type representing character data. In Python, string literals are surrounded either by double quote " or single quote ' characters.
greeting_start = "Season's"
greeting_end = 'greetings'
print(greeting_start, greeting_end)
Season's greetings
String Concatenation¶
Use the + symbol to concatenate strings:
greeting = greeting_start + " " + greeting_end
print(greeting)
Season's greetings
But it is not possible to concatenate a string and a number:
id = greeting + 55
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_2761/2560201398.py in <module>
----> 1 id = greeting + 55
TypeError: can only concatenate str (not "int") to str
Converting between strings and numbers¶
The functions str, int and float are available to convert between strings and other data types.
# convert from integer to string
id = 1729
new_id = str(id) + "_NEW"
print(new_id)
1729_NEW
# convert from string to floating-point number
price = "12.99"
total_price = float(price) * 1.2
print(total_price)
15.588
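A few further conversions, sketched under the assumption that the input text may not always be a valid number. The variable names are chosen for this example only. Note that int only accepts text that looks like a whole number, so a string such as "12.99" has to go through float first.

```python
# convert from string to integer
count = int("42") + 1
print(count)      # 43

# convert from floating-point number to string
label = "Total: " + str(3.5)
print(label)      # Total: 3.5

# int raises a ValueError if the text is not a whole number
try:
    int("12.99")
    converted = True
except ValueError:
    converted = False
print(converted)  # False

# converting via float works; int then discards the fractional part
whole = int(float("12.99"))
print(whole)      # 12
```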
String Methods¶
A string is an object: a data type with methods attached to it, which are called in much the same way as functions. The upper method returns a copy of the string converted to uppercase, and lower returns a copy converted to lowercase:
name = "Jeremy Bentham"
name_uppercase = name.upper()
print(name_uppercase)
name_lowercase = name.lower()
print(name_lowercase)
JEREMY BENTHAM
jeremy bentham
Other useful methods are split, join and strip. split splits the string at white space into individual words and returns them as a list:
text = "The time has come"
word_list = text.split()
print(word_list)
['The', 'time', 'has', 'come']
join does the reverse, combining a list of strings into a single string. s1.join(word_list) joins the strings in word_list, placing the string s1 between each pair:
", ".join(word_list)
'The, time, has, come'
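split also accepts a separator, which makes split and join a natural pair for taking delimited text apart and putting it back together. The sketch below uses a made-up comma-separated string; the variable names are for illustration only.

```python
record = "adenine,uracil,cytosine,guanine"

# split at each comma instead of at white space
names = record.split(",")
print(names)     # ['adenine', 'uracil', 'cytosine', 'guanine']

# join the pieces again using a different separator
rejoined = " / ".join(names)
print(rejoined)  # adenine / uracil / cytosine / guanine
```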
strip removes any white space characters (spaces, tabs or newlines) at the start or end of the string:
text = " too much space! "
text2 = text.strip()
print(text2)
too much space!
Strings and Characters¶
A string is composed of a sequence of characters, and most of the operations that can be performed on lists can also be performed on strings. For example, individual characters can be accessed using square brackets enclosing the index position.
text = "Natural Sciences"
# first character is at index 0
first_initial = text[0]
# last character is at index -1
final_character = text[-1]
print(first_initial, final_character)
N s
Use the len function to find the length of a string.
s = "Mighty"
x = len(s)
print(x)
6
Note
An important difference between lists and strings: whereas it is possible to change the value of an individual list item, it is not possible to change an individual string character. We say that strings are immutable.
x = [4, 5, 6]
x[0] = 10 # this is OK
s = "ABC"
s[0] = "X" # this will result in an error
Likewise, it is not possible to append a character to a string. Instead, use string concatenation.
s.append("D") # Error
s = s + "D" # This is OK
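Because a string cannot be changed in place, "changing" a character means building a new string from pieces of the old one. The following sketch (variable names chosen for this example) first confirms that item assignment raises a TypeError, then replaces the first character using slicing and concatenation.

```python
s = "ABC"

# Assigning to a single character raises a TypeError.
try:
    s[0] = "X"
    changed_in_place = True
except TypeError:
    changed_in_place = False
print(changed_in_place)  # False

# Build a new string instead: "X" followed by everything after index 0.
t = "X" + s[1:]
print(t)  # XBC
print(s)  # ABC (the original string is unchanged)
```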
Slicing Lists and Strings¶
Given a list or string, we can access a single element using square brackets:
x = [4, 5, 6, 7, 8, 9]
y = x[0]
print(y)
4
If we want to access a sublist, we can use slicing. Given integers a and b, x[a:b] returns a new list which contains the elements of x from index a to b - 1 (i.e. including element a but excluding element b).
z = x[0:3]
print(z)
[4, 5, 6]
x[a:b:c] returns a list containing items a to b - 1 with a step size of c (this is very similar to the range function):
w = x[0:9:2]
print(w)
[4, 6, 8]
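The same slicing syntax applies to strings, where the result is a new string rather than a new list. A short sketch follows; it also uses two standard features of Python slicing not shown above, namely that a bound can be omitted (meaning "from the start" or "to the end") and that negative indices count from the end.

```python
text = "Natural Sciences"

first_word = text[0:7]     # characters at index 0 to 6
print(first_word)          # Natural

second_word = text[8:]     # from index 8 to the end
print(second_word)         # Sciences

every_other = text[0:7:2]  # index 0, 2, 4 and 6
print(every_other)         # Ntrl

last_three = text[-3:]     # the final three characters
print(last_three)          # ces
```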
Example¶
Natural Sciences modules are identified by an 8-character code consisting of NSCI followed by a four-digit number. The following paragraph of text contains Natural Sciences module codes mixed up with other data. We will write Python code to extract a list of module codes from the text.
text = "Surrounded NSCI0007 me occasional pianoforte NSCI0011 alteration unaffected impossible ye. For saw half than cold. arrived adapted. Numerous ladyship so raillery humoured goodness received an. So NSCI0004 formal length my highly NSCI0005 afford oh. Tall neat he make or at dull ye."
n = len(text)  # determine the number of characters in the text
module_list = []  # create an empty list
for i in range(n):  # i loops over all index positions in text
    if text[i:i + 4] == "NSCI":  # extract a 4-character substring and check if it is equal to "NSCI"
        module_list.append(text[i:i + 8])  # add the 8-character code to the list
print(module_list)
['NSCI0007', 'NSCI0011', 'NSCI0004', 'NSCI0005']
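The loop above accepts any eight characters that begin with NSCI. If the text might contain other words starting with those letters, one possible refinement is to check that the last four characters are digits, using the string method isdigit. This is a sketch under that assumption; the sample text and the variable name candidate are made up for this example.

```python
# Hypothetical sample text containing a false match ("NSCIENCE")
text = "See NSCI0007 and NSCIENCE, then NSCI0011."

module_list = []
for i in range(len(text)):
    candidate = text[i:i + 8]  # up to 8 characters starting at index i
    # keep only full-length candidates of the form NSCI + four digits
    if len(candidate) == 8 and candidate[0:4] == "NSCI" and candidate[4:8].isdigit():
        module_list.append(candidate)
print(module_list)  # ['NSCI0007', 'NSCI0011']
```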
Note
String comparison is case-sensitive, so "S" == "s" is False. Remember to use a double equals to check for equality.
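When case should not matter, a common approach is to convert both strings to the same case before comparing them, for example with the lower method described earlier. A minimal sketch, with variable names chosen for illustration:

```python
a = "Bentham"
b = "BENTHAM"

same_exact = (a == b)                         # case-sensitive comparison
same_ignoring_case = (a.lower() == b.lower()) # compare lowercase copies

print(same_exact)          # False
print(same_ignoring_case)  # True
```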
Escape Sequences¶
If you want to include special characters in a string, use an escape sequence. Precede the character you want to escape with a backslash character \.
quote = '"The time has come", the Walrus said.'
print(quote)
"The time has come", the Walrus said.
quote = "\"The time has come\", the Walrus said."
print(quote)
"The time has come", the Walrus said.
This can also be used to include a backslash character in the string.
quote = "A\\B"
print(quote)
A\B
A very useful escape sequence is \n, which denotes a newline character.
print("A\nAB\nABC")
A
AB
ABC
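An escape sequence is written with two characters but is stored as a single character in the string, which can be checked with len. The sketch below also uses \t, the standard escape sequence for a tab character; the variable names are for illustration only.

```python
# \t inserts a tab character between the two words
row = "name\tscore"
print(row)

# Each escape sequence counts as one character
newline_length = len("A\nB")     # A, newline, B
backslash_length = len("A\\B")   # A, backslash, B
print(newline_length)    # 3
print(backslash_length)  # 3
```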
Multi-line Strings¶
String literals can span multiple lines, using triple quotes: """…""" or '''…'''.
long_string = """Twas the night before Christmas,
when all through the house
Not a creature was stirring,
not even a mouse."""
print(long_string)
Twas the night before Christmas,
when all through the house
Not a creature was stirring,
not even a mouse.
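A triple-quoted string is an ordinary string in which each line break is stored as a newline character, so it is equal to the same text written on one line with \n. The sketch below uses a two-line excerpt and the string methods count and split to show this; the variable names are chosen for this example.

```python
verse = """Twas the night before Christmas,
when all through the house"""

# The line break is stored as a single newline character
print(verse.count("\n"))  # 1

# Splitting at the newline gives one string per line
lines = verse.split("\n")
print(lines)  # ['Twas the night before Christmas,', 'when all through the house']

# The same string written with an escape sequence
same = (verse == "Twas the night before Christmas,\nwhen all through the house")
print(same)   # True
```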