8 Text manipulation

Author

Laurent Modolo

Objective: Learn simple ways to work with text file in Unix

One of the great things with command line tools is that they are simple and fast. Which means that they are great for handling large files. And as bioinformaticians you have to handle large files, so you need to use command line tools for that.

8.1 Text search

The file hg38.ncbiRefSeq.gtf.gz contains the RefSeq annotation for hg38 in GFT format

We can download files with the wget command. Here the annotation is in gz format which is a compressed format, you can use the gzip tool to handle gz files.

On useful command to check large text file is the headcommand.

wget -q -O - http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz | gzip -dc | head

You can change the number of lines displayed with the option -n number_of_line. The command tail has the same function as head but starting from the end of the file.

Try the tail for the same number of lines displayed, does the computation take the same time ?

Download the hg38.ncbiRefSeq.gtf.gz file in your ~/.

The program grep string allows you to search for string through a file or stream line by line.

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep "chr2" | head

What is the last annotation on the chromosome 1 (to write a tabulation character you can type \t) ?

You can count things in text file with the command wc read the wc manual to see how you can count lines in a file.

Does the number of 3UTR match the number of 5UTR ?

How many transcripts does the gene CCR7 have ?

8.2 Regular expression

When you do a loot text search, you will encounter regular expressions (regexp), which allow you to perform fuzzy search. To run grep in regexp mode you can use the switch. -E

The most basic form fo regexp si the exact match:

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "gene_id"

You can use the . wildcard character to match anything

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "...._id"

There are different special characters in regexp, but you can use \ to escape them. For example, if you search for . you can use the following regexp

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "\."

8.2.1 Character classes and alternatives

There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

\d: matches any digit.
\s: matches any whitespace (e.g. space, tab, newline).
\S: matches any non-whitespace.
[abc]: matches a, b, or c.
[^abc]: matches anything except a, b, or c.
[a-z]: Match any letter of the alphabet

Search for two digits followed by an uppercase letter and one digit.

Solution

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | perl -E "\d\d[A-Z]\d"

8.2.2 Anchors

By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use

^ to match the start of the string.
$ to match the end of the string.

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";$"

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "c"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "^c"

8.2.3 Repetition

The next step up in power involves controlling how many times a pattern matches

?: 0 or 1
+: 1 or more
*: 0 or more

What is the following regexp going to match ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "[a-z]*_[a-z]*\s\"[1-3]\""

You can also specify the number of matches precisely:

{n}: exactly n
{n,}: n or more
{,m}: at most m
{n,m}: between n and m

What is the following regexp going to match ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "^[a-z]{3}[2-3]\s.*exon\s\d{4,5}\s\d{4,5}.*"

How many gene names of more than 16 characters does the annotation contain ?

Solution

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l

8.2.4 Grouping and back references

You can group match using (), for example the following regexp match doublet of 12 .

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "(12){2}"

Grouping is also used for back references in the case of text replacement. You can use the command sed for text replacement. The syntax of sed for replacement is the following: sed -E 's|regexp|\n|g where n is the grouping number. s stand for substitute and g stand for global (which means that is they are different substitutions per line sed won’t stop at the first one).

Try the following replacement regexp

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|(transcript_).{2}|\1number|g'

Try to write a sed command to replace ncbiRefSeq with transcript_id .

Solution

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|ncbiRefSeq(.*)(transcript_id "([A-Z_0-9.]*))|\3\1\2|g'

Regexp can be very complex see for example a regex to validate an email on starckoverflow. When you start you can always use for a given regexp to a more experienced used (just give him the kind of text you want to match and not match). You can test your regex easily with the regex101 website.

8.3 Sorting

GTF files should be sorted by chromosome, starting position and end position. But you can change that with the command sort to select the column to sort on you can use the option -k n,n where n is the column number.

You need to specify where sort keys start and where they end, otherwise (as in when you use -k 3 instead of -k 3,3) they end at the end of the line.

For example, the following command sort on the 4th column and then on the 5th when the values of the 4th column are equal.

gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 4,4 -k 5,5 | head

Sorting operations are complex and can take a long time

You can add more option to the sorting of each column, for example r for reverse order d for dictionary order or n for numeric order.

What will the following command do ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -k 4,4n | head

Use the -u option to count the number of different annotation type based on the 3rd column

Solution

gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -u | wc -l

You can check if a file is already sorted with the -c switch. Check if the gtf file is sorted by chromosome, start and end position.

Solution

gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 1,1 -k 4,4n -k 5,5n -c

8.4 Field extractor

Sometime rather than using complex regexp, we want to extract a particular column from a file. You can use the command cut to do that.

The following command extracts the 3rd column of the annotation

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | cut -f 3

You can change the field separator with the option -d, set it to ";" to extract the transcript_id and gene_name from the information column.

Solution

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | cut -f 2 -f 5 -d ";"

8.5 Concatenation

There are different tools to concatenate files from the command line cat for vertical concatenation and paste for horizontal concatenation.

cat .bashrc .bashrc | wc -l

What will be the results of the following command ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | head |  paste - -

8.6 Text editor

You often have access to different text editors from the common line, two of the most popular ones are vim and nano.

nano is more friendly to use than vim but also very limited.

To open a text file you can type editor file_path.

In nano everything is written at the bottom, so you only have to remember that ^is the symbol for the key Ctrl.

Open you .bashrc file and delete any comment line (starting with the # character).

vim is a child of the project vi (which should also be available on your system), and which bring him more functionality. The workings of vim can be a little strange at first, but you have to understand that on a US keyboard the distance that your finger have to travel while using vim is minimal.

You have 3 modes in vim:

The normal mode, where you can navigate the file and enter command with the : key. You can come back to this mode by pressing Esc
The insert mode, where you can write things. You enter this mode with the i key or any other key insertion key (for example a to insert after the cursor or A to insert at the end of the line)
The visual mode where you can select text for copy/paste action. You can enter this mode with the v key

If you want to learn more about vim, you can start with the https://vim-adventures.com/ website. Once you master vim everything is faster but you will have to practice a lot.

We have used the following commands:

head / tail to display head or tail of a file

wget to download files

gzip to extract tar.gz files

grep to search text files

wc to count things

sed to search and replace string of text

sort to sort files on specific field

cut to extract a specific field

cat / paste for concatenation

nano / vim for text edition

In the next session, we are going to apply the logic of pipes and text manipulation to batch processing.

License: Creative Commons CC-BY-SA-4.0.
Made with Quarto.