Sanger and 454 assembly
Our March 30 reference EST set consists of 16312 contiguous sequences (contigs). It was assembled from available Sanger and 454 EST sequences generated for H. annuus, from both wild and cultivated individuals spanning the whole distribution of the species and from several different tissue types (Heesacker et al. 2008; Lai et al. 2012). Individual genotypes were first assembled with MIRA version 3.0 (Chevreux et al. 2004), using the flags "accurate,est,denovo,454" for the 454 data and "accurate,est,denovo,Sanger" for the Sanger sequences. All MIRA contigs and singletons were then reassembled with CAP3 at 94% identity (Huang and Madan 1999), as in Lai et al. 2012. From this final merged assembly, only contigs found in at least two different genotypes and with a total length greater than 500 bp were kept. The resulting set has gone through strict quality trimming and should represent the majority of genes expressed in young seedlings. Based on comparisons to genetic map data (J. Bowers, pers. comm.), 783 contigs contained sequences that map to multiple locations; these were removed, as they likely represent families of close paralogs that cannot be resolved. Of the 16312 remaining sequences, 8226 map to single locations on genetic maps (J. Bowers, pers. comm.) and appear to be single copy, while the remainder were not assessed. In addition to the mapped and unmapped nuclear loci, this reference set also contains the whole chloroplast genome, which spans nearly 127 kb.
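The contig retention rule described above (found in at least two genotypes, total length greater than 500 bp) can be illustrated with a minimal Python sketch. This is not the script used to build the reference set. The function name `filter_contigs` and the two input dictionaries are hypothetical, and the sketch assumes that the genotype membership of each merged contig has already been extracted from the CAP3 output by some other means.

```python
from typing import Dict, Set

# Thresholds taken from the text above.
MIN_GENOTYPES = 2   # contig must be found in at least two different genotypes
MIN_LENGTH = 500    # contig must be strictly longer than 500 bp


def filter_contigs(
    sequences: Dict[str, str],
    genotypes: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Return only the contigs that pass both retention criteria.

    sequences: hypothetical mapping of contig name -> nucleotide sequence.
    genotypes: hypothetical mapping of contig name -> set of genotype labels
               whose reads or MIRA contigs were merged into that contig.
    """
    kept = {}
    for name, seq in sequences.items():
        # A contig with no genotype record is treated as having zero genotypes.
        n_genotypes = len(genotypes.get(name, set()))
        if n_genotypes >= MIN_GENOTYPES and len(seq) > MIN_LENGTH:
            kept[name] = seq
    return kept


if __name__ == "__main__":
    # Toy data for illustration only; names and sequences are made up.
    toy_sequences = {"contig_a": "A" * 650, "contig_b": "C" * 650}
    toy_genotypes = {"contig_a": {"wild_1", "cultivar_1"}, "contig_b": {"wild_1"}}
    print(sorted(filter_contigs(toy_sequences, toy_genotypes)))
```

In this toy run only `contig_a` passes, because `contig_b` is supported by a single genotype. The length test is strict, so a contig of exactly 500 bp would be dropped.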
Trinity assembly of Illumina ESTs
One lane of Illumina sequence was generated for each of the following four tissue or treatment libraries:
- Flowers & Leaves
- Roots & Stems
Reads were quality trimmed and assembled de novo using Trinity. Raw sequence data are available on GenBank.