Commands

compare_genomes.py

Compares genomes using an aligned fasta file and migrates annotations from a reference to the other sequences in the alignment

Usage:

compare_genomes.py [-r REFERENCE] [--align_format FORMAT] [-o PREFIX]
                      [--gff_feature_types GFF_FEATURE_TYPES]
                      [--gff_attributes GFF_ATTRIBUTES] [-v] [--version]
                      [-h]
                      aligned_fasta gene_gff

Required Arguments:

aligned_fasta         An aligned fasta file
gene_gff              A GFF file with features to be migrated
-r REFERENCE, --reference REFERENCE
                      Sequence id of reference sequence in aligned fasta
                      file

Optional Arguments:

--align_format FORMAT
                      Alignment format (default: fasta)
-o PREFIX, --output PREFIX
                      Output prefix (default: compare_genomes_output/)
--gff_feature_types GFF_FEATURE_TYPES
                      Comma separated list of gff feature types to parse
                      (default: CDS,exon,gene,mRNA,stem_loop)
--gff_attributes GFF_ATTRIBUTES
                      Comma separated list of feature attributes to carry
                      over (default: ID,Parent,Note,gene,function,product)
-v, --verbose         verbose output
--version             show program's version number and exit
-h, --help            show this help message and exit
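
Migrating an annotation through an alignment comes down to translating coordinates from the reference to another sequence. The following is a minimal sketch of that idea, not the code used by compare_genomes.py; the function names build_coordinate_map and migrate_interval are hypothetical, and it assumes gaps are written as "-" and coordinates are 1-based and inclusive, as in GFF.

```python
def build_coordinate_map(ref_aligned, target_aligned, gap="-"):
    """Map 1-based ungapped reference positions to 1-based target positions.

    A reference position whose alignment column is a gap in the target
    maps to None.
    """
    if len(ref_aligned) != len(target_aligned):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    ref_pos = 0
    target_pos = 0
    for ref_base, target_base in zip(ref_aligned, target_aligned):
        if target_base != gap:
            target_pos += 1
        if ref_base != gap:
            ref_pos += 1
            mapping[ref_pos] = target_pos if target_base != gap else None
    return mapping


def migrate_interval(mapping, start, end):
    """Return the target (start, end) covered by a reference interval, or None."""
    hits = [mapping[p] for p in range(start, end + 1)
            if mapping.get(p) is not None]
    return (min(hits), max(hits)) if hits else None
```

A feature that falls entirely inside a deletion in the target has no counterpart, so migrate_interval returns None for it.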

fastq_to_fasta.py

Convert a FASTQ file to a FASTA file

Usage:

fastq_to_fasta.py [-h] [-w WRAP] [-v] [--version] fastq_file fasta_file

Required Arguments:

fastq_file
fasta_file

Optional Arguments:

-h, --help            show this help message and exit
-w WRAP, --wrap WRAP  Maximum length of lines, 0 means do not wrap (default:
                      0)
-v, --verbose         verbose output
--version             show program's version number and exit
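
The conversion and the effect of the WRAP option can be sketched as follows. This is an illustration rather than the script's actual code: the function name fastq_to_fasta is hypothetical, and it assumes plain four-line FASTQ records (no multi-line sequences).

```python
def fastq_to_fasta(fastq_handle, fasta_handle, wrap=0):
    """Copy four-line FASTQ records to FASTA; return the number of records.

    wrap is the maximum sequence line length; 0 means do not wrap.
    """
    count = 0
    for header in iter(fastq_handle.readline, ""):
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: " + header.strip())
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].strip() + "\n")
        if wrap > 0:
            for i in range(0, len(sequence), wrap):
                fasta_handle.write(sequence[i:i + wrap] + "\n")
        else:
            fasta_handle.write(sequence + "\n")
        count += 1
    return count
```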

find_contig_deletions.py

Find contigs with deletions from the contig composition file output from compare_genomes.py

Usage:

find_contig_deletions.py [-h] [-o OUTPUT_DIR] [-q] [-v] [--version]
                         contig_composition aligned_fasta contigs_fasta

Required Arguments:

contig_composition    Contig composition file output from compare_genomes.py
aligned_fasta         Aligned FASTA file
contigs_fasta         Contigs FASTA file

Optional Arguments:

-h, --help            show this help message and exit
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
                      Directory to store output files, default is
                      aligned_fasta directory
-q, --quiet           Quiet, replace all deletions found, no prompts
-v, --verbose         verbose output
--version             show program's version number and exit

gff2gtf_simple.py

Simple conversion of GFF files to GTF files.

Usage:

gff2gtf_simple.py [-h] [-v] [--version] gff_file

Required Arguments:

gff_file       GFF file to convert

Optional Arguments:

-h, --help     show this help message and exit
-v, --verbose  verbose output
--version      show program's version number and exit

maf_net.py

Output an aligned fasta file by stitching together a specified reference sequence in the MAF file and using the highest scoring block for each section.

Usage:

maf_net.py [-r REFERENCE] [-c CHROMOSOME] [-s SPECIES] [-o OUTPUT_DIR]
           [--consensus_sequence] [--reference_fasta REFERENCE_FASTA]
           [-v] [--version] [-h]
           maf_file

Required Arguments:

maf_file              MAF file to stitch together
-r REFERENCE, --reference REFERENCE
                      Reference species (e.g. scerevisiae)
-c CHROMOSOME, --chromosome CHROMOSOME
                      Sequence ID of the chromosome for which to generate
                      the alignment net (e.g. chrI)
-s SPECIES, --species SPECIES
                      List of species to include, comma separated (e.g.
                      scerevisiae,sbayanus)

Optional Arguments:

-o OUTPUT_DIR, --output_dir OUTPUT_DIR
                      Directory to store output file, default is maf file
                      directory
--consensus_sequence  Output "consensus sequence" for each species in files
                      named [species].[chromosome].consensus.fasta
--reference_fasta REFERENCE_FASTA
                      Check MAF file against this fasta (for
                      troubleshooting, debugging)
-v, --verbose         verbose output
--version             show program's version number and exit
-h, --help            show this help message and exit

Output:

  • Aligned Fasta File: BASENAME.net.afa

    This file contains an aligned fasta file created by stitching together MAF blocks based on the reference sequence. Where two blocks overlap, the higher scoring block is used.
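
The rule that the higher scoring block wins where two blocks overlap can be sketched as below. This is a simplified illustration under stated assumptions, not the maf_net.py implementation: blocks are reduced to hypothetical (score, start, end) tuples with 0-based, half-open reference coordinates, and the function name stitch_blocks is made up for the example.

```python
def stitch_blocks(blocks, ref_length):
    """Choose the highest-scoring covering block for every reference position.

    blocks: list of (score, start, end) tuples on the reference.
    Returns a list of (start, end, block_index) runs; block_index is None
    where no block covers the reference.
    """
    best = [None] * ref_length
    # Paint blocks from lowest to highest score so that higher scores
    # overwrite lower ones wherever they overlap.
    for index in sorted(range(len(blocks)), key=lambda i: blocks[i][0]):
        _score, start, end = blocks[index]
        for pos in range(max(start, 0), min(end, ref_length)):
            best[pos] = index
    # Collapse consecutive positions that use the same block into runs.
    runs = []
    for pos, index in enumerate(best):
        if runs and runs[-1][2] == index:
            runs[-1] = (runs[-1][0], pos + 1, index)
        else:
            runs.append((pos, pos + 1, index))
    return runs
```

Painting per position is easy to follow but scales with chromosome length; an interval-based approach would be a better fit for large genomes.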

Optional Output (one per species):

  • Consensus Sequence: SPECIES.consensus.fasta

    A FASTA file containing the consensus sequence for this species. N’s in the sequence represent sections where no contigs mapped to a section of the reference (i.e. potential gaps in the scaffold).

  • Consensus Contig Composition GFF: SPECIES.consensus_contig_composition.gff

    GFF-formatted file describing intervals in the SPECIES genome. The attributes contain information about the contigs used to determine the sequence in each interval. The attributes are:

    • src_seq
    • src_seq_start
    • src_seq_end
    • src_strand
    • src_size
    • maf_block
    • block_start
    • block_end
    • ref_src
    • ref_start
    • ref_end
    • ref_strand
  • Consensus Contig Composition Summary: SPECIES.consensus_contig_composition_summary.txt

    Tab-delimited file that describes intervals in the SPECIES genome and the contigs used for the sequence. It has the following columns:

    • seq - sequence id of the interval in the SPECIES genome
    • start - start position of the interval
    • end - end position of the interval
    • contig - contig id that was used to “build” this interval. If None, that means no contig was found for the analogous region in the reference.
    • contig_start - the start position of the contig that aligned to the start position of this interval
    • contig_end - the end position of the contig that aligned to the end position of this interval
    • contig_strand - the direction that the contig aligned to the reference (if ‘-’, the reverse complement of the contig aligned to the reference in this interval)
    • contig_size - the full size of the contig (including those bases that did not align to this interval)
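
One use of the summary file is listing the intervals where no contig was found. The sketch below assumes the columns appear in the order listed above and that a missing contig is written as the literal text None; whether the file starts with a header row is not stated here, so the code skips one if present. The function name uncovered_intervals is hypothetical.

```python
import csv

COLUMNS = ["seq", "start", "end", "contig", "contig_start",
           "contig_end", "contig_strand", "contig_size"]


def uncovered_intervals(handle):
    """Yield (seq, start, end) for rows whose contig column is 'None'."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if row["seq"] == "seq":
            continue  # assumed header row, skip it if the file has one
        if row["contig"] == "None":
            yield row["seq"], int(row["start"]), int(row["end"])
```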

makePairedOutput2EQUALfiles_vamp.pl

Modified versions of scripts provided by SSAKE. They are used to prepare two separate paired-end fastq files for use by SSAKE. The modifications accommodate the new Illumina-style sequence identifiers introduced with CASAVA 1.8.

Usage: makePairedOutput2EQUALfiles_vamp.pl <fasta file 1> <fasta file 2> <library insert size>
       --- ** Both files must have the same number of records & arranged in the same order
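
The identifier change matters because mates are paired by template id. Older Illumina identifiers end in /1 or /2, while CASAVA 1.8 identifiers put the read number after a space. The Python sketch below illustrates telling the two apart; it is not a port of the Perl script, the regular expressions are an assumption about typical identifiers, and the name split_read_id is hypothetical.

```python
import re

# Pre-1.8 style: "<template>/<read>"
OLD_STYLE = re.compile(r"^(?P<template>\S+)/(?P<read>[12])$")
# CASAVA 1.8 style: "<template> <read>:<filtered>:<control bits>:<index>"
NEW_STYLE = re.compile(r"^(?P<template>\S+) (?P<read>[12]):[YN]:\d+:\S*$")


def split_read_id(identifier):
    """Return (template_id, read_number) for either identifier style."""
    identifier = identifier.strip().lstrip("@>")
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(identifier)
        if match:
            return match.group("template"), int(match.group("read"))
    raise ValueError("unrecognised read identifier: " + identifier)
```

Two reads are mates when their template ids are equal and their read numbers are 1 and 2.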

makePairedOutput2UNEQUALfiles_vamp.pl

See makePairedOutput2EQUALfiles_vamp.pl:

Usage: makePairedOutput2UNEQUALfiles_vamp.pl <fasta file 1> <fasta file 2> <library insert size>
           --- files could have different # of records & arranged in different order but template ids must match

TQSfastq_vamp.py

Performs quality trimming as per the original SSAKE script. It was modified to accommodate larger, zipped fastq files.

Usage:

TQSfastq_vamp.py [options]

Optional Arguments:

-h, --help            show this help message and exit
-f FASTQFILE, --fastq file=FASTQFILE
                      Sanger encoded fastq file - PHRED quality scores,
                      ASCII+33
-t THRESHOLD, --Phred quality threshold=THRESHOLD
                      Base intensity threshold value (Phred quality scores 0
                      to 40, default=10)
-c CONSEC, --consec=CONSEC
                      Minimum number of consecutive bases passing threshold
                      values (default=20)
-v, --verbose         Runs in Verbose mode.
-q, --qualities       Outputs Qualities to FASTQ file (default is FASTA)
-z, --zip             Compress output with gzip
-o OUTPUT_BASE, --output=OUTPUT_BASE
                      Output filename base
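
How THRESHOLD and CONSEC interact can be sketched as follows. This is an assumption-laden illustration, not the SSAKE algorithm: it keeps the longest run of consecutive bases whose Phred score meets the threshold and discards the read if that run is shorter than CONSEC, whereas the actual script may choose the run differently. The function name trim_read is hypothetical.

```python
def trim_read(sequence, quality, threshold=10, consec=20, offset=33):
    """Return the longest run of bases with Phred >= threshold.

    quality is a Sanger-encoded string (ASCII code minus offset gives Phred).
    Returns '' when the longest passing run is shorter than consec.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have equal length")
    passing = [ord(char) - offset >= threshold for char in quality]
    best_start = best_len = 0
    run_start = 0
    # The trailing False is a sentinel that closes a run ending at the last base.
    for i, ok in enumerate(passing + [False]):
        if ok:
            continue
        if i - run_start > best_len:
            best_start, best_len = run_start, i - run_start
        run_start = i + 1
    if best_len < consec:
        return ""
    return sequence[best_start:best_start + best_len]
```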

translate_cds.py

Extracts the coding sequences (CDS) regions from a fasta reference and gff file and translates them into amino acid sequences, output in FASTA format to STDOUT

Usage:

translate_cds.py [--notrans] [-i IDATTR] [-t FEATURETYPE]
                 [--table TABLE] [-v] [--version] [-h]
                 gff_file fasta_file

Required Arguments:

gff_file              GFF file containing CDS records to be translated
fasta_file            FASTA file containing the nucleotide sequences
                      referenced in the GFF file

Optional Arguments:

--notrans             Do not translate to amino acid sequence, output DNA
-i IDATTR, --idattr IDATTR
                      GFF attribute to use as gene ID. Features with the
                      same ID will be considered parts of the same gene. The
                      default "gene_id" is suitable for GTF files.
-t FEATURETYPE, --featuretype FEATURETYPE
                      GFF feature type(s) (3rd column) to be used. Specify
                      the option multiple times for multiple feature types.
                      The default is "CDS" for GFF files and "CDS" and
                      "stop_codon" for GTF files.
--table TABLE         NCBI Translation table to use when translating DNA
                      (see
                      http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
                      Default: 1.
-v, --verbose         verbose output
--version             show program's version number and exit
-h, --help            show this help message and exit
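
The extract-and-translate step can be sketched without external libraries. This is a minimal illustration, not the script's code: it supports only NCBI translation table 1 (the standard code), ignores the GFF phase column, and takes the CDS segments of one gene as a hypothetical list of 1-based inclusive (start, end) pairs. The function name translate_cds is used here only for the example.

```python
from itertools import product

# Standard genetic code (NCBI table 1) in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def translate_cds(chromosome, segments, strand="+"):
    """Join the CDS segments of one gene and translate them.

    segments are 1-based inclusive (start, end) pairs, as in GFF.
    Codons containing characters outside ACGT translate to 'X'.
    """
    coding = "".join(chromosome[start - 1:end]
                     for start, end in sorted(segments))
    if strand == "-":
        coding = coding.translate(COMPLEMENT)[::-1]  # reverse complement
    coding = coding.upper()
    codons = (coding[i:i + 3] for i in range(0, len(coding) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Returning the joined coding string instead of the translation would correspond to the behaviour described for --notrans.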