Overview

The Pine Alignment and SNP Identification Pipeline (PineSAP) provides a high-throughput solution to single nucleotide polymorphism (SNP) prediction using multiple sequence alignments from re-sequencing data. This pipeline integrates a hybrid of customized scripting, existing utilities and machine learning in order to increase the speed and accuracy of SNP calls. The implementation of this pipeline results in significantly improved multiple sequence alignments and SNP identifications when compared with existing solutions. The use of machine learning in the SNP identifications extends the pipeline’s application to any eukaryotic species where full genome sequence information is unavailable.

 

Custom phredPhrap

The following Phred options have been implemented for use in a modified phredPhrap run:

-phred -sd ../seq_dir/ -qd ../qual_dir/ -trim_alt 0.01 -trim_cutoff 0.01 -trim_fasta -st fasta -trim_phd

 

Ace2Fasta (ace_fasta_contigs.pl) will convert the ACE format to FASTA:

OUTPUTS:

  • [amplicon_name]_contigs.fasta
  • Multiple FASTA files: individual contig files holding the reads aligned to each contig
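
The conversion step can be illustrated with a minimal sketch. This is not the actual ace_fasta_contigs.pl; it only shows how contig consensus sequences could be pulled from an ACE file, assuming the standard layout in which each contig starts with a "CO" header line and its consensus ends at a blank line. The function name and the choice to drop pad characters ("*") are assumptions made for illustration.

```python
def ace_contigs_to_fasta(ace_text, strip_pads=True):
    """Return FASTA text holding the consensus of every contig in an ACE file.

    Hypothetical helper: assumes 'CO <name> ...' headers followed by
    consensus lines that end at the first blank line.
    """
    records = []
    name, seq_lines = None, []

    def flush():
        seq = "".join(seq_lines)
        if strip_pads:
            seq = seq.replace("*", "")  # '*' marks alignment pads in ACE files
        records.append(f">{name}\n{seq}\n")

    for line in ace_text.splitlines():
        if line.startswith("CO "):
            # Header: CO <name> <n_bases> <n_reads> <n_segments> <U|C>
            name, seq_lines = line.split()[1], []
        elif name is not None:
            if line.strip() == "":
                flush()  # a blank line closes the consensus block
                name = None
            else:
                seq_lines.append(line.strip())
    if name is not None:  # file ended without a trailing blank line
        flush()
    return "".join(records)
```

Passing strip_pads=False keeps the padded consensus, which may be preferable if read coordinates are later looked up against the padded contig.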

 

ProbconsRNA

probcons [amplicon_name]_contigs.fasta > [amplicon_name]_contigs.fasta_align

ProbconsRNA Information

 

alignedcontgi2readfasta.pl reads [amplicon_name]_contigs.fasta_align, then reads the reads aligned to each contig from the individual contig files produced by Ace2Fasta.

OUTPUTS:

  • [amplicon_name]_aligned.fasta

A single multi-FASTA file in which every read is placed according to its alignment to its contig (from phredPhrap) and that contig's alignment to the other contigs (from ProbconsRNA).
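
The idea of carrying a read through two alignments can be sketched as follows. This is an illustrative assumption about the mechanics, not the actual script: a read is represented as a row with one character per ungapped contig base, and wherever the contig received a gap in the contig-to-contig alignment, the read receives a gap in the same column. The function name and the "-" gap symbol are hypothetical.

```python
def project_read(gapped_contig, read_on_contig, gap="-"):
    """Spread a read over the columns of its contig's gapped alignment.

    gapped_contig:  the contig as aligned to the other contigs (with gaps).
    read_on_contig: the read as aligned to the contig, one character per
                    ungapped contig base (gap where the read has no base).
    """
    ungapped_len = sum(1 for c in gapped_contig if c != gap)
    if len(read_on_contig) != ungapped_len:
        raise ValueError("read row must match the ungapped contig length")
    bases = iter(read_on_contig)
    # Copy the contig's gap columns; otherwise consume the next read character.
    return "".join(gap if c == gap else next(bases) for c in gapped_contig)


# A read covering the last three bases of contig ACGT, where the contig
# gained two gap columns when aligned to the other contigs.
aligned_read = project_read("AC--GT", "-CGT")
```

Applying this to every read of every contig yields rows of equal length that can be written out as one multi-FASTA alignment.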

 

Fasta2Ace will convert back to Ace format:

Fasta2Ace.pl [amplicon_name]_aligned.fasta ../phd_dir/ > [amplicon_name]_aligned.ace.2

 

Polybayes Parameters

polybayes -maskAmbiguousMatches -reportOut pb_[amplicon_name].out -aceIn [amplicon_name]_aligned.ace.2 -readPhdFiles -phdFilePathIn ../phd_dir -inputFormat ace -thresholdSnp .1 -screenSnps -preScreenSnpsMinimumBaseQuality 20 -priorPoly .01333 -priorPoly2 .99666 -priorPoly3 .00333 -priorPoly4 .00001 -priorPolyAC .1666 -priorPolyAT .1666 -priorPolyAG .1666 -priorPolyCG .1666 -priorPolyACG .25 -priorPolyACT 25 -maxTerms 60 -displayQuality

Polybayes

 

PolyPhred Parameters

polyphred -snp hom -f 50 -indel -o pp_[amplicon_name].out

PolyPhred

 

Phrap Assembly

Number of contigs    Percentage
1                    23.67%
2                    27.00%
3                    19.67%
4                    14.00%
5                     6.00%
6                     4.33%
7                     2.67%
8                     2.33%

 

polybayes_parse.pl and polyphred_parse.pl extract SNP locations, surrounding bases, and probability scores. These two scripts are currently being wrapped into pb_pp_parser.pl to facilitate both the extraction and a quick comparison between the two sets of calls.
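
A quick comparison between the two call sets could look like the sketch below. It is not pb_pp_parser.pl; it assumes, purely for illustration, that each parser's output has already been loaded into a dictionary mapping an alignment position to that tool's score. The function name and the dictionary layout are hypothetical.

```python
def compare_calls(polybayes_calls, polyphred_calls):
    """Group SNP positions by which tool reported them.

    Both arguments map a position (int) to the tool's score for that call.
    """
    pb, pp = set(polybayes_calls), set(polyphred_calls)
    return {
        "both": sorted(pb & pp),            # called by both tools
        "polybayes_only": sorted(pb - pp),  # called only by Polybayes
        "polyphred_only": sorted(pp - pb),  # called only by PolyPhred
    }


# Made-up positions and scores, for illustration only.
summary = compare_calls({101: 0.98, 250: 0.40}, {101: 99, 312: 70})
```

Positions reported by only one tool are natural candidates for manual inspection.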

 

SNP Identification DataSets

  • Training set is composed of a total of 300 validated sequences.
    • Divided to represent the relative percentage of each sequence source:
      • 66% UGA, 12% UMN, and 22% Agencourt
      • Total of 198 UGA sequences, 36 UMN, and 66 Agencourt
  • Testing set is composed of a total of 120 validated sequences.
  • Validation: SNP calls were manually classified as false positive (FP), false negative (FN), true positive (TP), or true negative (TN) by inspecting trace files in Consed.
  • Cross-validation: the data were divided into ten parts, and the ML classifier was repeatedly trained on nine parts and tested on the remaining part.
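
The ten-part cross-validation and the FP/FN/TP/TN bookkeeping described above can be sketched as follows. This is a generic illustration rather than the pipeline's own code: the function names, the shuffling seed, and the way folds are formed are assumptions, and the classifier itself is left out.

```python
import random


def ten_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps folds reproducible
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint parts
    for i, test in enumerate(folds):
        # Train on the other k - 1 folds, test on the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(tp, tn, fp, fn):
    """Fraction of calls classified correctly: (TP + TN) / all calls."""
    return (tp + tn) / (tp + tn + fp + fn)


# With 300 training sequences, each round trains on 270 and tests on 30.
for train, test in ten_fold_splits(300):
    assert len(train) == 270 and len(test) == 30
```

Each sequence lands in the test fold exactly once, so the per-fold counts can be pooled before computing accuracy.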

feature_extract.pl gathers sequence information derived from phredPhrap and the alignments. In addition, it extracts information from the polybayes parameter calculations, polybayes probabilities, and polyphred scores.

All of the following 14 metrics will be considered for classification and/or learning. Two types of classification and one feed-forward back-propagation neural network (NN) will be used for evaluation. The classification tree should give a better indication of the critical parameters.

 

Features                              Representation
Sequence Depth                        Continuous
Variation type                        Categorical
Polybayes Score                       Continuous
Polyphred Score                       Continuous
Freq of Major/Minor Alleles           Continuous
Max Quality of Major/Minor Alleles    Continuous
Local Average Quality                 Continuous
Overall Average Quality               Continuous
Alignment Quality                     Continuous
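
One way to hold these features for a classifier is sketched below. It is an assumption-laden illustration, not the output format of feature_extract.pl: the field names, the split of the "Major/Minor" rows into separate fields, and the category labels for the variation type are all hypothetical. Continuous features are passed through as floats, and the categorical feature is one-hot encoded.

```python
from dataclasses import dataclass

# Hypothetical category labels; the source does not list the actual values.
VARIATION_TYPES = ("substitution", "indel")


@dataclass
class SnpFeatures:
    sequence_depth: float
    variation_type: str  # the only categorical feature
    polybayes_score: float
    polyphred_score: float
    major_allele_freq: float
    minor_allele_freq: float
    major_allele_max_quality: float
    minor_allele_max_quality: float
    local_average_quality: float
    overall_average_quality: float
    alignment_quality: float

    def to_vector(self):
        """Return continuous values followed by a one-hot variation type."""
        if self.variation_type not in VARIATION_TYPES:
            raise ValueError(f"unknown variation type: {self.variation_type}")
        continuous = [
            float(value)
            for field, value in vars(self).items()
            if field != "variation_type"
        ]
        one_hot = [1.0 if self.variation_type == t else 0.0 for t in VARIATION_TYPES]
        return continuous + one_hot
```

A classification tree can consume the categorical value directly, whereas a neural network generally needs a numeric encoding such as the one-hot vector used here.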

 

Summary

  • PineSAP improves on:
    • Inaccuracies introduced by using Phrap to align sequences.
    • The time that would be required by using an aligner such as ProbconsRNA or ClustalW on its own.
  • PineSAP has a 98% success rate when used to align loblolly resequencing data.
  • PineSAP identified a successful list of features to enhance polymorphism predictions.
  • PineSAP obtained an overall prediction accuracy of 93% in SNP Identification.
  • PineSAP provided a full alignment and polymorphism detection system that can be adapted to specific genomes.