Strings

What you’ll learn

  • How to store a sequence of character data in a string

  • How to extract substrings by slicing

  • How to use string methods

Example: RNA Sequences

In biology, an RNA sequence consists of a chain of the nucleotides Adenine, Uracil, Cytosine and Guanine in a specific order. We can represent an RNA sequence using the four letters A, U, C and G.


Fig. 1 An RNA sequence represented by the string AUGAGACUCUGAGAC. The sequence is composed of three-character subsequences called codons, each of which identifies a specific amino acid.

In the body, the RNA sequence is used to produce a protein in a process called translation. The sequence is first divided into three-character subsequences termed ‘codons’. For example, the 15-character RNA sequence AUGAGACUCUGAGAC is divided into the codons AUG, AGA, CUC, UGA, and GAC.

Each of the codons identifies a specific amino acid, as shown in the partial amino acid translation table on the right. The RNA sequence AUGAGACUCUGAGAC would therefore be translated by the body into the amino acid sequence methionine, arginine, leucine, (stop), aspartic acid. Using the abbreviated one-letter characters, this could be written as MRL.D.

Finally, the body chains together these amino acids into a protein. The stop codon represents the end of the chain, so the RNA sequence would be translated into a protein comprising a chain of three amino acids: methionine-arginine-leucine.

RNA Translation

  1. RNA sequence: AUGAGACUCUGAGAC

  2. Codons: AUG AGA CUC UGA GAC

  3. Amino Acids: MRL.D

  4. Protein sequence: MRL

Bioinformatics

Bioinformatics is the application of tools of computation and analysis to the capture and interpretation of biological data. One of the main applications of bioinformatics is the analysis of genome sequence data, such as that undertaken by the Human Genome Project.

We’ll now see how we can implement this translation procedure programmatically. In Python, a sequence of character data is termed a string.

rna_seq = "AUGAGACUCUGAGAC"
print("RNA sequence:", rna_seq)
RNA sequence: AUGAGACUCUGAGAC

First, let’s write a function translate which takes a three-letter string and returns a single letter representing the corresponding amino acid.

def translate(codon):
    codon_list = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG", "AUG", "AGA", "AGG", "CGA", "CGU", "CGG", "CGC", "GAU", "GAC", "UAA", "UAG", "UGA"]
    amino_acids = ["L", "L", "L", "L", "L", "L", "M", "R", "R", "R", "R", "R", "R", "D", "D", ".",  ".",  "."]
    
    i = codon_list.index(codon)
    aa = amino_acids[i]
    return aa


# Test the function using the CGG codon
codon = "CGG"
aa = translate(codon)
print("Codon:", codon)
print("Amino acid:", aa)
   
Codon: CGG
Amino acid: R

The list codon_list contains the three-letter codons, and the list amino_acids contains the corresponding single-letter amino acid abbreviations. Translating from codon to amino acid is simply a case of finding the index position of the codon in codon_list and identifying the character in the same position in amino_acids:

i = codon_list.index(codon)
aa = amino_acids[i]

i is an integer representing the index of the string codon in codon_list.
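
One thing to be aware of: list.index raises a ValueError if the item is not in the list, so translate fails for any codon that is missing from the partial table. The sketch below shows one possible way to handle this. The function name translate_or_unknown and the "?" placeholder are illustrative assumptions, not part of the original example.

```python
def translate_or_unknown(codon):
    # Same partial lookup table as in translate
    codon_list = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG", "AUG", "AGA", "AGG",
                  "CGA", "CGU", "CGG", "CGC", "GAU", "GAC", "UAA", "UAG", "UGA"]
    amino_acids = ["L", "L", "L", "L", "L", "L", "M", "R", "R",
                   "R", "R", "R", "R", "D", "D", ".", ".", "."]

    # list.index raises ValueError for a missing item, so check membership first
    if codon in codon_list:
        return amino_acids[codon_list.index(codon)]
    return "?"  # hypothetical placeholder for codons not in the partial table


print(translate_or_unknown("CGG"))  # codon present in the table
print(translate_or_unknown("GGG"))  # codon absent from the partial table
```

This prints R for the first call and ? for the second.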

Finding items in a list

See the previous section Lists and Plotting for how to find items in a list.

Next, we would like to split the string rna_seq into three-character codons, then use our function to determine the amino acid for each.

n = len(rna_seq)
for i in range(0, n, 3):
    codon = rna_seq[i:i+3]
    print(translate(codon))
M
R
L
.
D

First we use the len function to determine the number of characters in the string rna_seq. Next we generate a loop from 0 to n in steps of 3:

for i in range(0, n, 3):

The expression rna_seq[i:i+3] extracts a 3-character substring from rna_seq beginning at the character at index i.
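
To see what the slice produces at each step, the sketch below collects the codons into a list instead of translating them. The names codons and short_seq are introduced here for illustration only. It also shows that a slice running past the end of a string is cut short, not reported as an error, so a sequence whose length is not a multiple of 3 would yield a shorter final piece.

```python
rna_seq = "AUGAGACUCUGAGAC"

# Collect each three-character slice in a list
codons = []
for i in range(0, len(rna_seq), 3):
    codons.append(rna_seq[i:i+3])
print(codons)

# A slice that runs past the end of the string is truncated, not an error
short_seq = "AUGAG"
print(short_seq[3:6])
```

The first print shows the five codons AUG, AGA, CUC, UGA and GAC; the second shows the two-character remainder AG.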

Finally, what if we’d like to stop processing the sequence once we reach the ‘stop’ codon? Python has a useful keyword break which allows us to do exactly that:

n = len(rna_seq)
for i in range(0, n, 3):
    codon = rna_seq[i:i+3]
    if codon == "UGA":
        break
    print(translate(codon))
M
R
L

As soon as the break keyword is reached, the enclosing for loop is exited, even if this means aborting the loop early.
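
The loop above checks only for UGA, but the table in translate marks three codons (UAA, UAG and UGA) as stops. A minimal sketch of a variant that stops at any of them and returns the protein as a single string is given below. The function name translate_sequence is hypothetical, the lookup table is repeated so that the block runs without earlier cells, and the sequence length is assumed to be a multiple of 3.

```python
def translate(codon):
    codon_list = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG", "AUG", "AGA", "AGG",
                  "CGA", "CGU", "CGG", "CGC", "GAU", "GAC", "UAA", "UAG", "UGA"]
    amino_acids = ["L", "L", "L", "L", "L", "L", "M", "R", "R",
                   "R", "R", "R", "R", "D", "D", ".", ".", "."]
    return amino_acids[codon_list.index(codon)]


def translate_sequence(rna_seq):
    protein = ""  # start with an empty string and build it up
    for i in range(0, len(rna_seq), 3):
        aa = translate(rna_seq[i:i+3])
        if aa == ".":  # any of the three stop codons ends the chain
            break
        protein = protein + aa
    return protein


print(translate_sequence("AUGAGACUCUGAGAC"))
```

For the example sequence this prints MRL, matching the protein sequence given in the RNA Translation summary.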

String Variables

A string is a data type representing character data. In Python, string literals are surrounded either by double quote " or single quote ' characters.

greeting_start = "Season's"
greeting_end = 'greetings'

print(greeting_start, greeting_end)
Season's greetings

String Concatenation

Use the + symbol to concatenate strings:

greeting = greeting_start + " " + greeting_end
print(greeting)
Season's greetings

But it is not possible to concatenate a string and a number:

id = greeting + 55
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_2761/2560201398.py in <module>
----> 1 id = greeting + 55

TypeError: can only concatenate str (not "int") to str

Converting between strings and numbers

The functions str, int and float are available to convert between strings and other data types.

# convert from integer to string
id = 1729
new_id = str(id) + "_NEW"
print(new_id)
1729_NEW
# convert from string to floating-point number
price = "12.99"
total_price = float(price) * 1.2
print(total_price)
15.588
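
The examples above use str and float. As a brief sketch of the third function, int, the block below converts a string of digits to an integer and shows that int rejects a string containing a decimal point. The variable names count and total are illustrative.

```python
# convert from string to integer
count = "42"
total = int(count) + 1
print(total)

# int does not accept a string containing a decimal point
try:
    int("12.99")
except ValueError as err:
    print("ValueError:", err)

# converting via float first works; int then discards the fractional part
print(int(float("12.99")))
```

The first print gives 43 and the last gives 12; the middle one reports the ValueError raised by int("12.99").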

String Methods

A string is an object: a data type that has methods attached to it, which are called in much the same way as functions. The upper method converts a string to upper case, and lower converts it to lower case:

name = "Jeremy Bentham"
name_uppercase = name.upper()
print(name_uppercase)
name_lowercase = name.lower()
print(name_lowercase)
JEREMY BENTHAM
jeremy bentham

Other useful methods are split, join and strip. split splits the string into individual words and returns them as a list:

text = "The time has come"
word_list = text.split()
print(word_list)
['The', 'time', 'has', 'come']

join does the reverse, combining a list of strings into a single string. s1.join(word_list) joins the strings in word_list, inserting the string s1 between each pair:

", ".join(word_list)
'The, time, has, come'
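
split also accepts a separator argument, which is useful for data that is not separated by spaces. The sketch below uses a made-up comma-separated string (record is an illustrative name) to show that joining with the same separator reverses the split.

```python
record = "AUG,AGA,CUC"

# split on commas instead of white space
codons = record.split(",")
print(codons)

# joining with the same separator reverses the split
rejoined = ",".join(codons)
print(rejoined == record)

# joining with an empty string concatenates the items directly
print("".join(codons))
```

This prints the list of three codons, then True, then AUGAGACUC.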

strip removes any white space characters (spaces, tabs or newlines) at the start or end of the string:

text = "  too much space!   "
text2 = text.strip()
print(text2)
too much space!

Strings and Characters

A string is composed of a sequence of characters, and most of the operations that can be performed on lists can also be performed on strings. For example, individual characters can be accessed using square brackets enclosing the index position.

text = "Natural Sciences"
# first character is at index 0
first_initial = text[0]
# last character is at index -1
final_character = text[-1]
print(first_initial, final_character)
N s

Use the len function to find the length of a string.

s = "Mighty"
x = len(s)
print(x)
6

Note

An important difference between lists and strings: whereas it is possible to change the value of an individual list item, it is not possible to change an individual string character. We say that strings are immutable.

x = [4, 5, 6]
x[0] = 10 # this is OK
s = "ABC"
s[0] = "X" # this will result in an error

Likewise, it is not possible to append a character to a string. Instead, use string concatenation.

s.append("D") # Error
s = s + "D" # This is OK
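
As a rough sketch of how to work around immutability, the block below catches the error raised by the assignment and then builds a new string from a replacement character and a slice of the old one.

```python
s = "ABC"

# assigning to a single character raises a TypeError
try:
    s[0] = "X"
except TypeError as err:
    print("TypeError:", err)

# build a new string instead: the replacement character plus the rest
s = "X" + s[1:]
print(s)
```

After the last line, s refers to the new string XBC; the original string "ABC" was never modified.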

Slicing Lists and Strings

Given a list or string, we can access a single element using square brackets:

x = [4, 5, 6, 7, 8, 9]
y = x[0]
print(y)
4

If we want to access a sublist, we can use slicing. Given integers a and b, x[a:b] returns a new list which contains the elements of x from index a to b - 1 (i.e. including element a but excluding element b).

z = x[0:3]
print(z)
[4, 5, 6]

x[a:b:c] returns a list containing items a to b - 1 with a step size of c (this is very similar to the range function):

w = x[0:9:2]
print(w)
[4, 6, 8]
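
The same slicing syntax applies to strings, where the result is a new string. A short sketch using the string from the Strings and Characters section follows. It also uses the fact that a start or end index can be omitted, in which case the slice runs from the beginning or to the end.

```python
text = "Natural Sciences"

print(text[0:7])   # characters at index 0 to 6
print(text[8:])    # from index 8 to the end
print(text[:3])    # from the start up to index 2
print(text[::2])   # every second character
```

These print Natural, Sciences, Nat and NtrlSine respectively.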

Example

Natural Sciences modules are identified by an 8-character code consisting of NSCI followed by a four-digit number. The following paragraph of text contains Natural Sciences module codes mixed up with other data. We will write Python code to extract a list of module codes from the text.

text = "Surrounded NSCI0007 me occasional pianoforte NSCI0011 alteration unaffected impossible ye. For saw half than cold.  arrived adapted. Numerous ladyship so raillery humoured goodness received an. So NSCI0004 formal length my highly NSCI0005 afford oh. Tall neat he make or at dull ye."

n = len(text) # determine the number of characters in the text
module_list = [] # create an empty list
for i in range(n): # i loops over all index positions in text
    if text[i:i + 4] == "NSCI": # extract a 4-character substring and check if it is equal to "NSCI"
        module_list.append(text[i:i + 8]) # add 8 characters to the list
print(module_list)
    
['NSCI0007', 'NSCI0011', 'NSCI0004', 'NSCI0005']
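
An alternative to testing every index position is the string method find, which returns the index of the next occurrence of a substring, or -1 if there is none. The sketch below is one possible variant, not the approach used above. It uses a shortened version of the text to keep the block compact, and the variable name start is illustrative.

```python
text = "Surrounded NSCI0007 me occasional pianoforte NSCI0011 alteration"

module_list = []
start = text.find("NSCI")  # index of the first occurrence, or -1
while start != -1:
    module_list.append(text[start:start + 8])  # add the 8-character code
    start = text.find("NSCI", start + 1)       # search again, after this match
print(module_list)
```

For the shortened text this prints a list containing NSCI0007 and NSCI0011.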

Note

  • String comparison is case-sensitive so "S" == "s" is False.

  • Remember to use a double equals to check for equality.
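
If the comparison should ignore case, one common approach is to convert the string to a single case before comparing. A small sketch, with code as an illustrative variable name:

```python
code = "nsci0007"

print(code == "NSCI0007")          # case-sensitive comparison
print(code.upper() == "NSCI0007")  # compare after converting to upper case
```

The first comparison prints False and the second prints True.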

Escape Sequences

If you want to include special characters in a string, use an escape sequence. Precede the character you want to escape by a backslash character \.

quote = '"The time has come", the Walrus said.'
print(quote)
"The time has come", the Walrus said.
quote = "\"The time has come\", the Walrus said."
print(quote)
"The time has come", the Walrus said.

This can also be used to include a backslash character in the string.

quote = "A\\B"
print(quote)
A\B

A very useful escape sequence is \n, which denotes a newline character.

print("A\nAB\nABC")
A
AB
ABC
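
An escape sequence is written with two characters in the source code but is stored as a single character in the string. The sketch below checks this with len, and also shows \t, the escape sequence for a tab character.

```python
print(len("A\\B"))   # A, backslash, B
print(len("A\nB"))   # A, newline, B
print("name\tvalue") # \t inserts a tab between the two words
```

Both calls to len print 3.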

Multi-line Strings

String literals can span multiple lines if they are enclosed in triple quotes, either """ or '''.

long_string = """Twas the night before Christmas,
when all through the house
Not a creature was stirring,
not even a mouse."""
print(long_string)
Twas the night before Christmas,
when all through the house
Not a creature was stirring,
not even a mouse.
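
The line breaks in a multi-line string are stored as newline characters, the same \n introduced under Escape Sequences. As a brief sketch, the string method splitlines can be used to recover the individual lines as a list (lines is an illustrative variable name):

```python
long_string = """Twas the night before Christmas,
when all through the house
Not a creature was stirring,
not even a mouse."""

print("\n" in long_string)  # the line breaks are newline characters

lines = long_string.splitlines()
print(len(lines))
print(lines[-1])
```

This prints True, then 4, then the final line of the verse.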