Modules

vamp.utils

Utilities for working with multiple sequence alignments and MAF objects

Copyright 2012, 2013, 2014 Lance Parsons <lparsons@princeton.edu> All rights reserved.

BSD 2-Clause License http://www.opensource.org/licenses/BSD-2-Clause

class vamp.utils.ContigComposition

Represents the composition of one interval by another interval.

Association of two genomic intervals, used to represent the composition of one interval by another.

seq str

Sequence id of the interval being described

start str

The start position of the interval being described (1-based)

end str

The end position of the interval being described (1-based)

contig str

The sequence id of the second interval

contig_start str

The start position of the second interval (1-based)

contig_end str

The end position of the second interval (1-based)

strand str

The strand (‘+’ or ‘-‘) of the interval being described

contig_size str

The complete length of the sequence of the second interval.

static tab_headings()

Static method that retuns a tab delimited string of header names

Returns:A tab delimited string representing the headers in the order used by the to_tab() method.
Return type:string
to_tab()

Return a tab delimited string of the ContigComposition

Returns:A tab delimited string representing the ContigComposition.
Return type:string
vamp.utils.find_deletions(contig_composition_list, verbose=False)

Find contigs with deletions.

From a contig composition list, find contigs that have deleted sections. When a contig has deleted sections, the pieces of the contig may be replaced by a contiguous section. This function returns tuples containing the indices of the contigs pieces to be replaced along with a replacement ContigComposition object consisting of the contiguous section.

e.g.:

([2, 3],
 {'seq': 'chr', 'start': 3, 'end': 5, 'contig': 'contig1',
  'contig_start': 5, 'contig_end': 10, 'strand': '+',
  'contig_size': 20})

indicates that we may wish to replace contig_composition_list[2:3] with the new ContigComposition specified.

Parameters:
  • contig_composition_list (list) – A list of ContigComposition objects.
  • verbose (bool, optional) – If true, output additional debug info (default is False).
Returns:

A list of tuples with the indices of ContigComposition objects in the list to be replaced along with replacement ContigComposition objects.

Return type:

list

vamp.utils.get_block_by_label(maf_filename, label)

Return the MAF block with the specified label

Parameters:
  • maf_filename (string) – The name of the MAF file.
  • label (string) – The label of the block in the MAF file to search for.
Returns:

The first block found in the MAF file that has the given label

Return type:

block

vamp.utils.get_sequence_length_from_maf(maf_file, reference_species, sequence_id)

Return length of the reference_species.sequence_id

Parameters:
  • maf_filename – The filename of the MAF file.
  • reference_species – The name species used as the reference.
  • sequence_id – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
Returns:

The length of the specified sequence in the first component containing that sequence in the MAF file, or None if no matching componenets were found in the MAF file.

Return type:

integer

vamp.utils.get_sequence_net_alignment(maf_filename, reference_species, sequence_id, species, verbose=False)

Return the alignment created by stitching MAF blocks together

Stitches MAF blocks together along an entire reference sequence (including gaps). For regions covered by more than one block, the highest scoring block is used.

Parameters:
  • maf_filename – The filename of the MAF file.
  • reference_species – The name species used as the reference.
  • sequence_id – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
  • species – A list of the species names to be returned
  • verbose (bool, optional) – If True, print debug information (default: False)
Returns:

A tuple containing a Bio.Align.MutlipleSeqAlignment object and a list of intervals. The multiple sequence alignments contains each the alignment of each species from the MAF file created by stitching blocks together based on the specified reference sequence The list of intervals is relative to the alignment that indicate the MAF block, block start, and block end of the source of that piece of the alignment.

Return type:

tuple

vamp.utils.get_vamp_home()

Return the directory where the VAMP module is installed

vamp.utils.read_contig_composition_summary(filename)
Generator that reads a contig composition summary file and returns
attributes.
Parameters:filename (string) – The name of contig composition summary file as output by compare_genomes.py.
Yields:ContigComposition – A ContigComposition object for each line in the contig composition summary file.
vamp.utils.replace_alignment_with_block(alignment, block, reference_species, sequence_id, verbose=False)

Update the multiple sequence alignment with the specified MAF block

Use the MAF block alignment to replace the appropriate section of the given multiple sequence alignment by using the specified reference species and sequence as guide

Parameters:
  • block (maf block) – MAF block
  • reference_species (str) – The name species used as the reference.
  • sequence_id (str) – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
  • verbose (bool, optional) – If True, print debug information (default: False)
Returns:

The updated alignment and a Pybedtools interval of the section of the alignment that was replaced. The interval contains the following attributes: maf_block; block_start; block_end which indicate the MAF block label and start and end position on the block used in the replacement

Return type:

tuple

vamp.utils.subtract_intervals(interval1, interval2)

Subtract two pybedtools intervals, return list of resulting intervals

Parameters:
  • interval1 – A pybedtools interval
  • interval2 – A pybedtools interval to subtract from interval1
Returns:

A list of pybedtools intervals that contain the region(s) of interval1 that are not overlapped by interval2

Return type:

list

vamp.utils.summarize_contig_composition(interval_list, src_tag, start_tag, end_tag, strand_tag, source_size_tag)

Summarize the contig composition of a stitched MAF file.

Parameters:
  • interval_list (list) – A list of Pybedtools interval objects
  • src_tag (string) – The attribute containing the contig name
  • start_tag (string) – The attribute containing the start position in the contig
  • end_tag (string) – The attribute containing the end position in the contig
  • strand_tag (string) – The attribute containing the strand
  • source_size_tag (string) – The attribute containing the contig size
Returns:

A list of dictionaries with the following keys: (seq, start, end, contig, contig_start, contig_end, strand, contig_size)

Return type:

list

vamp.utils.update_contig_composition_summary(contig_composition_summary, replacements)

Update list of ContigComposition objects with replacements.

Replacements are a list of tuples containing a list of indices of contigs to be replaced along with replacements. The replacements must be non-overlapping and sorted.

Parameters:
  • contig_compostion_summary (list) – List of dictionaries as returned by summarize_contig_composition()
  • replacements (list) – A list of ContigCompostion objects
Retuns:
list: A updated contig composition summary
vamp.utils.update_sequence_with_replacements(seq, replacements, replacement_seq_dict)

Update Seq object with replacements.

The replacements specified by ContigComposition objects and must be non-overlapping and sorted.

Parameters:
  • seq (Bio.Seq) – The sequence object to be updated.
  • replacements (list) – A list ContigComposition objects
  • replacement_seq_dict (dict) – A dictionary to the replacement sequences.
Returns:

Bio.Seq: An updated sequence object with replacements made

vamp.utils.verify_maf_fasta(maf_filename, reference_species, fasta_filename, verbose=False)

Verify the consistency between the sequence in a MAF and a FASTA file

Checks all compenents in all blocks of the MAF file for the specified species and checks that the sequence matches that in the FASTA file.

Parameters:
  • maf_filename (string) – The name of the MAF file.
  • reference_species (string) – The species to select from the MAF file.
  • fasta_filename (string) – The name of the FASTA file to check against.
  • verbose (bool, optional) – If true, output additional debugging info (default is False).
Returns:

Prints to STDOUT if there is a mismatch.

Return type:

None

seq_utils.convert_coordinates

Convert coordinates from GFF or BED file using multi-fasta alignments

seq_utils.convert_coordinates.find_aligned_position(gap_positions, pos)

Update position by adding preceeding gaps

Parameters:
  • gap_positions (list) – list of gaps (must include start and end methods to return the start and end of a gap, typically they are re.MatchObjects)
  • pos (int) – the position to adjust by adding preceding gaps
Returns:

The new position, accounting for preceeding gaps

Return type:

int

seq_utils.fasta_from_gff

Extract fasta sequences from regions defined in GFF/BED file and output fasta to stdout

seq_utils.summarize_alignments

Summarize the differences between sequences in an aligned FASTA file.

This script will output summarize the differences between sequences in an aligned FASTA file.

Usage:

summarize_alignments.py aligned_fasta reference_sequence [-h,--help]
    [-v,--verbose] [--version]
seq_utils.summarize_alignments.main()

Runs summary_of_alignment function on input files from the command line.

seq_utils.summarize_alignments.mismatch_string(mismatches)

Generate a string from a list of mismatches.

Parameters:mismatches (list) – A list of mismatches. A mismatch is a dictionary with a position (pos), reference genotype (ref), and alternate genotype (alt).
Returns:A comma separated string of the mismatches
Return type:string
seq_utils.summarize_alignments.parse_event(event, reference_sequence, alternate_sequence)

Parse an event (sequence of differences) for VCF output.

Parse a simple event with reference_position, reference_base, and new_base and determine the type and add padding if necessary (for VCF compatibility)

Parameters:
  • event (dictonary) – An event has at least a position (pos), reference genotype (ref), and aternate genotype (alt). May also have a flag indicating if it is a snp (snp).
  • reference_sequence (str) – The complete reference sequence
  • alternate_sequence (str) – The complete alternate sequence
Returns:

An event with additional padding to the start of the variant and an added type attribute, for VCF compatibility

Return type:

dictonary

seq_utils.summarize_alignments.summary_of_alignment(alignment, reference_sequence_id)

Summarizes changes in given alignment

Parameters:
  • alignment (Bio.AlignIO object) – Alignment object
  • reference_index (int) – index of the reference sequence in alignment (default is 1)
Returns:

A dctionary with a key for each non-reference sequence

in the alignment

Each entry is another dictionary with the following keys:

  • match_count: The number of matching bases

  • mismatch_count: The number of mismatching bases, including indels

  • mismatches: list of mismatches by base: RefBase(RefPos)NewBase

  • contiguous_change_count: the number of contiguous change

    “events”

Return type:

dictionary

seq_utils.utils

Utility classes and methods for working with sequence data

seq_utils.utils.convert_interval_gapped_to_nongapped(seq, start, end)

Take position with gaps and return position without gaps

Uses 0-based positions

Parameters:
  • seq (str) – sequence string (with gaps included)
  • start (int) – starting position of interval (including gaps)
  • end (int) – ending postion of interval (including gaps)
Returns:

(start, end) the start and end positions after removing gaps in the sequence

Return type:

tuple

seq_utils.utils.convert_interval_nongapped_to_gapped(seq, start, end, include_end_gaps=False)

Take position without gaps and return position with gaps

Uses 0-based positions

Parameters:
  • seq (str) – sequence string (with gaps added)
  • start (int) – starting position of interval (excluding gaps)
  • end (int) – ending postion of interval (excluding gaps)
  • include_end_gaps (bool, optional) – if true, include gap positions that directly follow the end positions in the new interval, default is False and such end positions are not included
Returns:

(start, end) the start and end positions after accouting for gaps in the sequence

Return type:

tuple