Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3)

Background and purpose

We have generated a new, comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3) based on extensive Iso-seq data (79% of transcripts) from a broad range of plant samples. We developed novel methods to accurately determine splice junctions and transcription start and end sites (TSS and TES). Mismatch profiles around splice junctions provided a powerful feature for distinguishing false from correct splice junctions, allowing effective removal of spurious junctions. Stratified approaches identified high-confidence TSS and TES and removed fragmentary transcripts arising from degradation, while taking expression abundance into account. AtRTD3 is a major improvement over existing transcriptomes, as demonstrated by analysis of an extensive RNA-seq time-series dataset from Arabidopsis plants exposed to cold. AtRTD3 provided higher resolution of transcript expression profiling and identified cold- and light-induced differential transcription start and polyadenylation site usage.

Annotation of AtRTD3

To annotate AtRTD3, we examined the overlaps between AtRTD3 transcripts and Araport11 gene annotations using bedtools (intersect -wao). Transcripts are assigned to an Araport11 gene if they overlap on the same strand, with the overlap covering >30% of both transcripts. A transcript that overlaps two Araport11 genes on the same strand is assigned a gene ID made of the two concatenated gene names (e.g. AT1G18020-AT1G18030). This allows the identification of biological chimeric transcripts that run through two or more genes. The origin of each transcript (AtIso, AtRTD2 or Araport11) is also recorded in the bed annotation, so that users can distinguish high-confidence transcripts derived from long-read assemblies from lower-confidence transcripts derived from short-read assemblies. AtRTD3 also contains 1,541 novel genes relative to Araport11; their names start with G (e.g. G12636).
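The assignment rule described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build AtRTD3: it assumes the overlap is measured on genomic spans (BED-style, 0-based, end-exclusive), reads ">30% of both" as a reciprocal overlap threshold, and uses hypothetical names (Interval, overlap_bp, assign_gene_id) and made-up coordinates. Only the two gene names come from the example above.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    """A genomic span in BED-style coordinates (0-based, end-exclusive)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str


def overlap_bp(a: Interval, b: Interval) -> int:
    """Overlapping bases between a and b; 0 if chromosome or strand differ."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def assign_gene_id(transcript: Interval, genes: List[Interval],
                   min_frac: float = 0.30) -> Optional[str]:
    """Return the gene ID for a transcript, or None if no gene qualifies.

    Assumption: a gene qualifies when the same-strand overlap covers more
    than min_frac of BOTH the transcript span and the gene span. Several
    qualifying genes are joined with '-' in genomic order.
    """
    tx_len = transcript.end - transcript.start
    hits = []
    for gene in sorted(genes, key=lambda g: (g.chrom, g.start)):
        ov = overlap_bp(transcript, gene)
        if ov == 0:
            continue
        gene_len = gene.end - gene.start
        if ov / tx_len > min_frac and ov / gene_len > min_frac:
            hits.append(gene.name)
    return "-".join(hits) if hits else None


# Illustrative coordinates only; they are NOT the real Araport11 positions.
genes = [
    Interval("AT1G18020", "Chr1", 1000, 2000, "+"),
    Interval("AT1G18030", "Chr1", 2100, 3000, "+"),
]
read_through = Interval("tx_a", "Chr1", 1100, 2900, "+")
antisense = Interval("tx_b", "Chr1", 1100, 2900, "-")
fragment = Interval("tx_c", "Chr1", 1000, 1200, "+")

print(assign_gene_id(read_through, genes))
print(assign_gene_id(antisense, genes))
print(assign_gene_id(fragment, genes))
```

In this sketch a transcript with no qualifying gene returns None; how such transcripts receive their G-prefixed novel gene names is not specified here and is left to the caller.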

Available transcriptomes

atRTD3_29122021.fa contains the sequences of all transcripts, and atRTD3_07082020.bed contains the corresponding transcript information in gene annotation (BED) format.

atRTD3_TS_21Feb22_transfix.gtf contains transcript structure and coordinate information together with translations generated using TranSuite (https://github.com/anonconda/TranSuite). AtRTD3_gene_transcript.csv contains the mapping between gene names and transcript names; it is a required input file for running 3D RNA-seq (https://3drnaseq.hutton.ac.uk/).
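For readers who want to inspect the gene-to-transcript relationship directly, the following Python sketch collects unique (gene, transcript) pairs from GTF lines. It assumes the GTF carries the standard gene_id and transcript_id attributes in its ninth column; the function name, the example lines and the transcript IDs are hypothetical, and the exact column layout of AtRTD3_gene_transcript.csv is not reproduced here.

```python
import re
from typing import Iterable, List, Tuple

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_transcript_pairs(gtf_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect unique (gene_id, transcript_id) pairs in first-seen order."""
    seen = set()
    pairs = []
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene = attrs.get("gene_id")
        transcript = attrs.get("transcript_id")
        if gene and transcript and (gene, transcript) not in seen:
            seen.add((gene, transcript))
            pairs.append((gene, transcript))
    return pairs


# Hypothetical GTF records for illustration; not taken from the dataset.
example_gtf = [
    "# example header\n",
    'Chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "AT1G18020"; transcript_id "AT1G18020.1";\n',
    'Chr1\tsrc\texon\t100\t400\t.\t+\t.\tgene_id "AT1G18020"; transcript_id "AT1G18020.1";\n',
    'Chr1\tsrc\texon\t500\t900\t.\t+\t.\tgene_id "AT1G18020"; transcript_id "AT1G18020.2";\n',
]

for gene, transcript in gene_transcript_pairs(example_gtf):
    print(gene, transcript)
```

To run it on the real file, one would pass an open file handle for atRTD3_TS_21Feb22_transfix.gtf instead of the example list.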

Download atRTD3_29122021.fa
Download atRTD3_07082020.bed
Download atRTD3_TS_21Feb22_transfix.gtf
Download AtRTD3_gene_transcript.csv