Monday, February 3, 2014

Counting repetitive elements in NGS data

Here I will present my approach on finding out the number of a certain insertion sequence in a bacterial genome using next-gen sequencing data. This approach isn't novel, it is used here and could be graphically illustrated like this:



The basic idea is to identify all the different flanking sequences of IS6110 and count them or map them to the genome.
So my goal is to count all IS6110 sequences found in the sequencing data against Mycobacterium tuberculosis genome. The workflow can be divided in three steps:

  1. extract all reads containing start or end fragments of IS6110;
  2. remove these fragments from reads;
  3. group processed reads containing start or end fragments separately;
  4. cluster grouped reads by their similarity and count the clusters;
If cluster number from starts and ends is equal, then we have completed our task successfully.
To accomplish 1-3. I created a simple program in C. To find out the number of repeat sequences I used python to sort reads, calculate and plot Levenshtein distances of extracted reads. As a result all similar reads were clustered together. 

By counting clusters I will get the number of insertion elements.

As an example I obtained Illumina sequencing data of M. tuberculosis from European Bioinformatics Institute and counted the number of IS6110s in the genome. 
Here are the plots I acquired:

And it is obvious that data contain 19 insertion sequences.

The code and IS6110 is available at GitHub

P. S. This is my first blog post, but I hope you find it useful anyway :)
P. S. S. I don't have much programming experience so there could be bugs and more efficient ways to achieve this :D