This step removes artifacts and contaminate sequences from you data in an effort to speed up and improve the quality of the resulting assemblies.
The script will generate a makefile in the specified output directory called PREFIX_preprocess.mk.
The PreProcessing consists of the following steps. After each step, FastQC is run to generate a summary of the result.
1. Run Cutadapt to trim adapter sequences from the ends of reads. (Intermediate, output is discarded). 2. Run Fastx Toolkit‘s artifact filter to remove low complexity sequences. 3. If paired reads are used, match up paired reads and set unpaired reads aside. 4. Map reads using Bowtie to the specified host reference, keeping only those reads (or pairs) that did not map.
This step uses the SSAKE short read assembler to build contigs from the preprocessed reads. It does this by running SSAKE with various parameter settings and combining the results.
1. multi_ssake.py 2. Run clean_multi_ssake.sh to cleanup intermediate fasta files (interactive, do NOT use qsub)
TODO - Additional Detail Needed
The resulting contigs from the previous SSAKE assemblies are often redundant and can be stitched together using a long-read assembler. We have had success using Gap4 from the Staden package.
TODO - Additional Detail Needed
Once you have generated the best contigs you can with SSAKE and Gap4, the next step is to put them together into a scaffold. To do this, VAMP utilizes the reference genome of a similar strain to determine how to order the contigs.
The first step is to align contigs to a reference genome and output the result in a MAF formatted file. There are many options for alignment tools, however, we have had success with Mugsy, a very fast multiple whole genome alignment tool.
Once you have aligned the contigs to the reference, the next step is to stitch together the various alignment blocks into a scaffold. The maf_net.py utility does this by reassembling the reference sequence from the MAF blocks and using the highest scoring block for each location in the genome to assemble a scaffold genome.
Consensus Sequence - SPECIES.consensus.fasta A FASTA file containing the consensus sequence for this species. N’s in the sequence represent sections where no contigs mapped to a section of the reference (i.e. potential gaps in the scaffold).
Consensus Contig Composition GFF - SPECIES.consensus_contig_composition.gff GFF formatted file describes intervals in the SPECIES genome. The attributes contain information about the contigs used to determine the sequence in this interval. The attributes are:
- src_seq
- src_seq_start
- src_seq_end
- src_strand
- src_size
- maf_block
- block_start
- block_end
- ref_src
- ref_start
- ref_end
- ref_strand
Consensus Contig Composition Summary - SPECIES.consensus_contig_composition_summary.txt Tab delimited file with the following columns that describes intervals in the SPECIES genome and the contigs that were used for the sequence.
- seq - sequence id of the interval in the SPECIES genome
- start - start position of the interval
- end - end position of the interval
- contig - contig id that was used to “build” this interval. If None, that means no contig was found for the analagous region in the reference.
- contig_start - the start position of the contig that aligned to this start interval
- contig_end - the end position of the contig that aligned to the end position of this interval
- contig_strand - the direction that the contig aligned to the reference (if ‘-‘, the reverse complement of the contig aligned to the reference in this interval)
- contig_size - the full size of the contig (including those bases that did not aligned to this interval)
Once a draft of a genome has been completed, it can be useful to migrate annotations from an annotated reference to the new genome. In addition, this step generates a summary of the changes at the nucleic acid as well as amino acid level.
Run compare_genomes.py to migrate annotations and generate a list of differences between two species. The script requires an aligned fasta file (typically use the one generated from the previous scaffold stitching step) and a GFF file of features (genes, exons, etc.) to migrate.