Transcriptome assembly

Transcriptome assembly aims to reconstruct the complete set of transcripts expressed in a given sample. Assembly methods fall into two categories: de novo assembly and reference-guided (genome-guided) assembly. The former uses only the RNA-seq reads to generate the assembly, while the latter uses a reference genome or transcriptome to aid the assembly process. Reference-guided assembly is particularly useful when a reference genome of a closely related species exists. In these cases, aligning reads to paralogs or other shared sequences can help partition the reads prior to assembly, which can improve assembly quality and completeness. For highly fragmented draft genomes, a de novo transcriptome assembly strategy is recommended.

Workflow

Step 1

Pre-assembly quality control and filtering

The first step is a quality assessment of the raw read data, evaluating the base quality distribution, k-mer frequencies and adapter contamination. These parameters give an indication of the underlying data quality, and any problems they reveal should be addressed before the assembly step to help increase the overall quality of the assembled transcriptome. After this cleaning process, ribosomal RNA (rRNA) reads, which can be present in sizable quantities, should be removed.
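As an illustration of what a base quality distribution is, the sketch below computes the mean Phred score at each read position. It is a minimal example and not a replacement for a dedicated quality control tool. It assumes four-line FASTQ records and Phred+33 quality encoding; the function name `mean_quality_per_position` and the toy reads are hypothetical.

```python
from io import StringIO


def mean_quality_per_position(fastq_handle, offset=33):
    """Return the mean Phred score at each read position.

    Assumes four-line FASTQ records and Phred+33 encoding (offset=33).
    """
    totals = []  # sum of Phred scores seen at each position
    counts = []  # number of reads that cover each position
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 3:  # the quality string is every 4th line
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first read long enough to reach pos
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy input: two made-up reads; 'I' encodes Phred 40 and '#' encodes Phred 2.
example = StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACG\n+\nII#\n"
)
profile = mean_quality_per_position(example)
print(profile)
```

A drop in the mean score towards the 3' end of the reads, like the one at the third position of this toy profile, is the kind of signal that usually motivates quality trimming before assembly.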

Step 2

Transcriptome read assembly

For the transcriptome assembly step, the literature recommends comparing at least two different assembly tools and multiple k-mer lengths. The choice of k-mer length is a major trade-off in the assembly process. For example, shorter k-mer lengths contribute to the recovery of lowly expressed transcripts, but increase the number of incorrect or non-existent contigs. Longer k-mer lengths reduce the total number of contigs assembled, but also suppress the recovery of lowly expressed transcripts.
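One way to see why longer k-mers penalise lowly expressed transcripts is to count the k-mers shared by two overlapping reads. With few reads per transcript, overlaps between neighbouring reads tend to be short, and two reads can only be linked through a shared k-mer if their overlap is at least k bases long. The sketch below is a toy illustration under that simplification, not how any particular assembler works; the transcript sequence and the helper names `kmers` and `shared_kmers` are made up.

```python
def kmers(sequence, k):
    """Return the set of all k-mers found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(read_a, read_b, k):
    """Return the k-mers present in both reads."""
    return kmers(read_a, k) & kmers(read_b, k)


# Hypothetical transcript covered by only two reads that overlap by 8 bases.
transcript = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
read_a = transcript[:18]
read_b = transcript[10:]

# Number of k-mers linking the two reads for increasing values of k.
shared_by_k = {k: len(shared_kmers(read_a, read_b, k)) for k in (5, 7, 9)}
for k, n_shared in shared_by_k.items():
    print(f"k={k}: {n_shared} shared k-mers")
```

In this toy case the two reads remain connected at k=5 and k=7 but share nothing at k=9, so the transcript would be split or lost. The opposite risk is not shown here: short k-mers are more likely to occur in unrelated transcripts, which is one source of the incorrect contigs mentioned above.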

Step 3

Assembly quality control and completeness

The quality of an assembly is assessed from several perspectives: 1) sequence length (N50/ExN50) and fragmentation; 2) the fraction of all reads that map back to the assembly (recommended > 80% of total reads); and 3) assembly composition, which consists of testing for the presence of orthologs of genes that are universal, persistently expressed and occur almost exclusively as single copies in the genome.
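The first two metrics are simple to compute once contig lengths and read counts are available. The sketch below shows N50 (the length of the contig at which the cumulative length, counted from the longest contig down, reaches half of the total assembly length) and the read mapping rate. ExN50 is not covered because it also requires expression estimates. The contig lengths and read counts are invented for illustration, and the function names are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:  # reached half of the assembly length
            return length
    raise ValueError("contig_lengths must not be empty")


def mapping_rate(mapped_reads, total_reads):
    """Return the percentage of reads that map back to the assembly."""
    return 100 * mapped_reads / total_reads


# Made-up values for illustration only.
contig_lengths = [100, 200, 300, 400, 500]
assembly_n50 = n50(contig_lengths)
rate = mapping_rate(mapped_reads=8_500_000, total_reads=10_000_000)

print(f"N50: {assembly_n50}")
print(f"Mapping rate: {rate:.1f}% (above 80%: {rate > 80})")
```

Here the two longest contigs (500 and 400) together reach 900 of the 1500 total bases, so the N50 is 400.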

Step 4

Annotation

Annotation of the assembled transcriptome can be achieved by a combination of different approaches. The most common involve 1) searching for homologs based on sequence similarity, 2) identifying functional sequence features (e.g. protein domains, motifs, transmembrane helices) and 3) assigning gene ontology (GO) terms to the sequences.

Bioinformatic Workflows