The principles of transrate
There is little consensus on how transcriptomes should be assessed for quality. Typically, papers use extremely basic contig metrics (e.g. number of contigs), genome assembly metrics (e.g. N50, which is not even a good genome metric), or, more promisingly, metrics that compare the assembly to a related reference.
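For readers unfamiliar with N50, the following is a minimal sketch of how it is commonly computed from contig lengths. It is an illustration only, not transrate's implementation; the function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all bases in the assembly.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")

    total_bases = sum(contig_lengths)
    running_total = 0
    # Walk from the longest contig to the shortest, accumulating bases
    # until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total * 2 >= total_bases:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))  # prints 8
```

Because N50 depends only on contig lengths, it carries no information about whether the contigs are supported by the reads or resemble real transcripts.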
Transrate attempts to raise the bar for transcriptome assembly by implementing a full suite of metrics that should be informative for most organisms and biological questions.
Three basic principles guide transrate:
- an assembly is a model of the set of transcripts from a (set of) biological sample(s) at a fixed point in time
- the sequenced reads are the direct experimental evidence we have for what the set of transcripts was, and therefore how good our model (transcriptome) is
- previous experiments in molecular biology give us evidence about what transcripts look like, and we can use our understanding of evolution combined with this evidence to gain insight into the quality of our model (transcriptome assembly)
It is important to realise that:
- researchers perform transcriptome assembly with one or more biological questions in mind
- regardless of the biological question, there is an absolute truth about what the real set of transcripts was
A transcriptome assembly that perfectly models the real set of transcripts will also be optimal for answering all biological questions. Unfortunately, such an assembly is not achievable with current technology.
That leaves us with a choice: we could optimise for a transcriptome model that best matches the absolute truth, or we could optimise for one that best allows us to answer the biological question at hand. Which option is right depends on what we want the transcriptome for, and the choice in turn determines which metrics are important.
In general, if you are generating a transcriptome for a single experiment to answer a single question and you are unlikely to use the transcriptome again for anything afterwards, you should optimise for the question at hand.
However, in many cases, a research group will perform transcriptome sequencing and assembly for a specific experiment - say, differential expression between two conditions - but the assembly will be useful for a multitude of other things afterwards. The fact that an assembly is a model of the set of transcripts makes it a suitable dataset for asking all sorts of questions (e.g. about transcript expression, sequence, structure, or evolution). In these cases, you should optimise for the best reconstruction of the true transcriptome.