diff --git a/dev/_index/index.html b/dev/_index/index.html index e9e6adc..d007045 100644 --- a/dev/_index/index.html +++ b/dev/_index/index.html @@ -1,2 +1,2 @@ -- · GeneFinder.jl

<p align="center"> <img src="../assets/logo.svg" height="150"><br/> <i>A Gene Finder framework for Julia.</i><br/><br/> <a href="https://www.repostatus.org/#wip"> <img src="https://www.repostatus.org/badges/latest/wip.svg"> </a> <a href="https://codecov.io/gh/camilogarciabotero/GeneFinder.jl"> <img src="https://img.shields.io/codecov/c/github/camilogarciabotero/GeneFinder.jl?logo=codecov&logoColor=white"> </a> <a href="https://camilogarciabotero.github.io/GeneFinder.jl/dev/"> <img src="https://img.shields.io/badge/documentation-online-blue.svg?logo=Julia&logoColor=white"> </a> <a href="https://travis-ci.com/camilogarciabotero/GeneFinder.jl"> <img src="https://travis-ci.com/camilogarciabotero/GeneFinder.jl.svg?branch=main"> <a href="https://github.com/camilogarciabotero/GeneFinder.jl/blob/main/LICENSE"> <img src="https://img.shields.io/badge/license-MIT-green.svg"> </a> </p>


Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia language.

The main goal is to create a versatile module that makes it possible to apply different implemented algorithms to DNA sequences. See, for instance, the BioAlignments implementations of different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter Pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Algorithms

Coding genes (CDS - ORFs)

  • Simple finder
  • ☐ EasyGene
  • ☐ GLIMMER
  • ☐ Prodigal - Pyrodigal
  • ☐ PHANOTATE
  • ☐ k-mer based gene finders (?)
  • ☐ Augustus (?)

Non-coding genes (RNA)

  • ☐ Infernal
  • ☐ tRNAscan

Other features

  • ☐ parallelism SIMD ?
  • ☐ memory management (?)
  • ☐ specialized types
    • ☒ Gene
    • ☒ ORF
    • ☒ CDS
    • ☐ EukaryoticGene (?)
    • ☐ ProkaryoticGene (?)
    • ☐ Codon
    • ☐ Intron
    • ☐ Exon
    • ☐ GFF -> See other packages
    • ☐ FASTX -> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl

Contributing

Citing

See CITATION.bib for the relevant reference(s).

Logo: gene analysis by Vector Points from the Noun Project

+- · GeneFinder.jl

<p align="center"> <img src="../assets/logo.svg" height="150"><br/> <i>A Gene Finder framework for Julia.</i><br/><br/> <a href="https://www.repostatus.org/#wip"> <img src="https://www.repostatus.org/badges/latest/wip.svg"> </a> <a href="https://codecov.io/gh/camilogarciabotero/GeneFinder.jl"> <img src="https://img.shields.io/codecov/c/github/camilogarciabotero/GeneFinder.jl?logo=codecov&logoColor=white"> </a> <a href="https://camilogarciabotero.github.io/GeneFinder.jl/dev/"> <img src="https://img.shields.io/badge/documentation-online-blue.svg?logo=Julia&logoColor=white"> </a> <a href="https://travis-ci.com/camilogarciabotero/GeneFinder.jl"> <img src="https://travis-ci.com/camilogarciabotero/GeneFinder.jl.svg?branch=main"> <a href="https://github.com/camilogarciabotero/GeneFinder.jl/blob/main/LICENSE"> <img src="https://img.shields.io/badge/license-MIT-green.svg"> </a> </p>


Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia language.

The main goal is to create a versatile module that makes it possible to apply different implemented algorithms to DNA sequences. See, for instance, the BioAlignments implementations of different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter Pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Algorithms

Coding genes (CDS - ORFs)

  • Simple finder
  • ☐ EasyGene
  • ☐ GLIMMER
  • ☐ Prodigal - Pyrodigal
  • ☐ PHANOTATE
  • ☐ k-mer based gene finders (?)
  • ☐ Augustus (?)

Non-coding genes (RNA)

  • ☐ Infernal
  • ☐ tRNAscan

Other features

  • ☐ parallelism SIMD ?
  • ☐ memory management (?)
  • ☐ specialized types
    • ☒ Gene
    • ☒ ORF
    • ☒ CDS
    • ☐ EukaryoticGene (?)
    • ☐ ProkaryoticGene (?)
    • ☐ Codon
    • ☐ Intron
    • ☐ Exon
    • ☐ GFF -> See other packages
    • ☐ FASTX -> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl

Contributing

Citing

See CITATION.bib for the relevant reference(s).

Logo: gene analysis by Vector Points from the Noun Project

diff --git a/dev/api/index.html b/dev/api/index.html index 0a62ec4..30219b5 100644 --- a/dev/api/index.html +++ b/dev/api/index.html @@ -2,7 +2,7 @@ API · GeneFinder.jl
GeneFinder.CDSType
struct CDS
     orf::ORF
     sequence::LongSubSeq{DNAAlphabet{4}}
-end

The CDS struct represents a coding sequence in a DNA sequence. It has two fields:

  • orf: the basic composable type (location::UnitRange{Int}, strand::Char) giving the location of the ORF and the associated strand: forward ('+') or reverse ('-')
  • sequence: a LongSubSeq{DNAAlphabet{4}} representing the actual sequence of the CDS
source
GeneFinder.GeneFeaturesType
struct GeneFeatures
+end

The CDS struct represents a coding sequence in a DNA sequence. It has two fields:

  • orf: the basic composable type (location::UnitRange{Int}, strand::Char) giving the location of the ORF and the associated strand: forward ('+') or reverse ('-')
  • sequence: a LongSubSeq{DNAAlphabet{4}} representing the actual sequence of the CDS
source
GeneFinder.GeneFeaturesType
struct GeneFeatures
     seqname::String
     start::Int64
     stop::Int64
@@ -10,16 +10,16 @@
     strand::Char
     frame::'.'
     attribute::
-end

This is the main Gene struct, based on the fields found in a GFF3 file. It still needs to be defined correctly: the idea is to fix the frame and attribute fields, the latter holding something like a list (id=Char;name=;locus_tag). The write and get functions should have a dedicated method for this struct.

source
GeneFinder.ORFType
struct ORF
+end

This is the main Gene struct, based on the fields found in a GFF3 file. It still needs to be defined correctly: the idea is to fix the frame and attribute fields, the latter holding something like a list (id=Char;name=;locus_tag). The write and get functions should have a dedicated method for this struct.

source
GeneFinder.ORFType
struct ORF
     location::UnitRange{Int64}
     strand::Char
-end

The ORF struct represents an open reading frame in a DNA sequence. It has two fields:

  • location: which is a UnitRange{Int64} indicating the start and end locations of the ORF in the sequence
  • strand: is a Char type indicating whether the ORF is on the forward ('+') or reverse ('-') strand of the sequence.
source
GeneFinder.ProteinType
struct Protein
+end

The ORF struct represents an open reading frame in a DNA sequence. It has two fields:

  • location: which is a UnitRange{Int64} indicating the start and end locations of the ORF in the sequence
  • strand: is a Char type indicating whether the ORF is on the forward ('+') or reverse ('-') strand of the sequence.
source
GeneFinder.ProteinType
struct Protein
     sequence::LongSequence
     orf::ORF
-end

Similarly to the CDS struct, the Protein struct represents a protein sequence encoded in a DNA sequence. It has two fields:

  • orf: the basic composable type (location::UnitRange{Int}, strand::Char) of the sequence
  • sequence: a LongSequence representing the actual translated sequence of the CDS
source
GeneFinder.TCMType
TCM(alphabet::Vector{DNA})

A data structure for storing a DNA Transition Count Matrix (TCM). The TCM is a square matrix where each row and column corresponds to a nucleotide in the given alphabet. The value at position (i, j) in the matrix represents the number of times that nucleotide i is immediately followed by nucleotide j in a DNA sequence.

Fields:

  • order::Dict{DNA, Int64}: A dictionary that maps each nucleotide in the alphabet to its corresponding index in the matrix.
  • counts::Matrix{Int64}: The actual matrix of counts.

Internal function:

  • TCM(alphabet::Vector{DNA}): Constructs a new TCM object with the given alphabet. This function initializes the order field by creating a dictionary that maps each nucleotide in the alphabet to its corresponding index in the matrix. It also initializes the counts field to a matrix of zeros with dimensions len x len, where len is the length of the alphabet.

Example usage:

alphabet = [DNA_A, DNA_C, DNA_G, DNA_T]
-dtcm = TCM(alphabet)
source
GeneFinder.TPMType
TPM(alphabet::Vector{DNA})

A data structure for storing a DNA Transition Probability Matrix (TPM). The TPM is a square matrix where each row and column corresponds to a nucleotide in the given alphabet. The value at position (i, j) in the matrix represents the probability of transitioning from nucleotide i to nucleotide j in a DNA sequence.

Fields:

  • order::Dict{DNA, Int64}: A dictionary that maps each nucleotide in the alphabet to its corresponding index in the matrix.
  • probabilities::Matrix{Float64}: The actual matrix of probabilities.

Example usage:

alphabet = [DNA_A, DNA_C, DNA_G, DNA_T]
-dtpm = TPM(alphabet)
source
GeneFinder.TransitionModelType
struct TransitionModel

The TransitionModel struct represents a transition model used in a sequence analysis. It consists of a transition probability matrix (tpm) and initial distribution probabilities.

Fields

  • tpm::Matrix{Float64}: The transition probability matrix, a matrix of type Float64 representing the probabilities of transitioning from one state to another.
  • initials::Matrix{Float64}: The initial distribution probabilities, a matrix of type Float64 representing the probabilities of starting in each state.
  • n: the order of the transition model, in other words the order of the resulting Markov chain.

Constructors

  • TransitionModel(tpm::Matrix{Float64}, initials::Matrix{Float64}): Constructs a TransitionModel object with the provided transition probability matrix tpm and initial distribution probabilities initials.
  • TransitionModel(sequence::LongSequence{DNAAlphabet{4}}): Constructs a TransitionModel object based on a given DNA sequence. The transition probability matrix is calculated using transition_probability_matrix(sequence).probabilities, and the initial distribution probabilities are calculated using initial_distribution(sequence).
source
GeneFinder.cdsgeneratorMethod
cdsgenerator(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)
-cdsgenerator(sequence::String; kwargs...)

A function to generate CDS sequences from a DNA sequence.

The cdsgenerator is a generator function that takes a LongSequence{DNAAlphabet{4}} sequence and returns an iterator over the coding sequences (CDSs) found in it, each paired with its ORF. It uses the findorfs function to find open reading frames (ORFs) in the sequence, and then extracts the actual CDS sequence from each ORF, returning both. The function also searches the reverse complement of the sequence, so it finds CDSs on both strands.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.dinucleotidesMethod
dinucleotides(sequence::LongNucOrView{4})

Compute the transition counts of each dinucleotide in a given DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: a LongSequence{DNAAlphabet{4}} object representing the DNA sequence.

Keywords

  • extended_alphabet::Bool=false: If true, the extended DNA alphabet is used in the search.

Returns

A dictionary with keys being LongSequence{DNAAlphabet{4}} objects representing the dinucleotides, and values being the number of occurrences of each dinucleotide in the sequence.

Example

seq = dna"AGCTAGCTAGCT"
+end

Similarly to the CDS struct, the Protein struct represents a protein sequence encoded in a DNA sequence. It has two fields:

  • orf: the basic composable type (location::UnitRange{Int}, strand::Char) of the sequence
  • sequence: a LongSequence representing the actual translated sequence of the CDS
source
GeneFinder.TCMType
TCM(alphabet::Vector{DNA})

A data structure for storing a DNA Transition Count Matrix (TCM). The TCM is a square matrix where each row and column corresponds to a nucleotide in the given alphabet. The value at position (i, j) in the matrix represents the number of times that nucleotide i is immediately followed by nucleotide j in a DNA sequence.

Fields:

  • order::Dict{DNA, Int64}: A dictionary that maps each nucleotide in the alphabet to its corresponding index in the matrix.
  • counts::Matrix{Int64}: The actual matrix of counts.

Internal function:

  • TCM(alphabet::Vector{DNA}): Constructs a new TCM object with the given alphabet. This function initializes the order field by creating a dictionary that maps each nucleotide in the alphabet to its corresponding index in the matrix. It also initializes the counts field to a matrix of zeros with dimensions len x len, where len is the length of the alphabet.

Example usage:

alphabet = [DNA_A, DNA_C, DNA_G, DNA_T]
+dtcm = TCM(alphabet)
source
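As a rough illustration of what the counts described above look like, the following is a minimal sketch that tallies first-order transitions over a plain String. It does not use GeneFinder or BioSequences types; the function name `count_transitions` and the fixed A, C, G, T ordering are assumptions made only for this example.

```julia
# Hypothetical helper, not part of GeneFinder: count how often nucleotide i
# is immediately followed by nucleotide j in a plain DNA string.
function count_transitions(seq::AbstractString; alphabet = ['A', 'C', 'G', 'T'])
    # Map each nucleotide to its row/column index, like the `order` field.
    order = Dict(nt => i for (i, nt) in enumerate(alphabet))
    counts = zeros(Int, length(alphabet), length(alphabet))
    chars = collect(seq)
    for t in 1:length(chars)-1
        i = order[chars[t]]      # row: current nucleotide
        j = order[chars[t+1]]    # column: next nucleotide
        counts[i, j] += 1
    end
    return order, counts
end

order, counts = count_transitions("AGCTAGCTAGCT")
```

For this input the rows for C, G and T agree with the matrix printed in the transition_count_matrix example further down.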
GeneFinder.TPMType
TPM(alphabet::Vector{DNA})

A data structure for storing a DNA Transition Probability Matrix (TPM). The TPM is a square matrix where each row and column corresponds to a nucleotide in the given alphabet. The value at position (i, j) in the matrix represents the probability of transitioning from nucleotide i to nucleotide j in a DNA sequence.

Fields:

  • order::Dict{DNA, Int64}: A dictionary that maps each nucleotide in the alphabet to its corresponding index in the matrix.
  • probabilities::Matrix{Float64}: The actual matrix of probabilities.

Example usage:

alphabet = [DNA_A, DNA_C, DNA_G, DNA_T]
+dtpm = TPM(alphabet)
source
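One common way to obtain such probabilities is to divide each row of a transition count matrix by its row total. The sketch below assumes that approach; `row_normalize` is a hypothetical name, and leaving all-zero rows untouched is a choice made for this example rather than documented GeneFinder behavior.

```julia
# Hypothetical helper: turn a matrix of transition counts into a matrix of
# transition probabilities by normalizing each row.
function row_normalize(counts::AbstractMatrix{<:Integer})
    probs = zeros(Float64, size(counts))
    for i in axes(counts, 1)
        total = sum(counts[i, :])
        # A nucleotide that never occurs as a predecessor keeps a zero row.
        total == 0 && continue
        probs[i, :] = counts[i, :] ./ total
    end
    return probs
end

# Counts for "AGCTAGCTAGCT" in A, C, G, T order.
counts = [0 0 3 0; 0 0 0 3; 0 3 0 0; 2 0 0 0]
probs = row_normalize(counts)
```

Each non-empty row of `probs` sums to 1, which is what makes entry (i, j) readable as the probability of moving from nucleotide i to nucleotide j.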
GeneFinder.TransitionModelType
struct TransitionModel

The TransitionModel struct represents a transition model used in a sequence analysis. It consists of a transition probability matrix (tpm) and initial distribution probabilities.

Fields

  • tpm::Matrix{Float64}: The transition probability matrix, a matrix of type Float64 representing the probabilities of transitioning from one state to another.
  • initials::Matrix{Float64}: The initial distribution probabilities, a matrix of type Float64 representing the probabilities of starting in each state.
  • n: the order of the transition model, in other words the order of the resulting Markov chain.

Constructors

  • TransitionModel(tpm::Matrix{Float64}, initials::Matrix{Float64}): Constructs a TransitionModel object with the provided transition probability matrix tpm and initial distribution probabilities initials.
  • TransitionModel(sequence::LongSequence{DNAAlphabet{4}}): Constructs a TransitionModel object based on a given DNA sequence. The transition probability matrix is calculated using transition_probability_matrix(sequence).probabilities, and the initial distribution probabilities are calculated using initial_distribution(sequence).
source
GeneFinder.cdsgeneratorMethod
cdsgenerator(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)
+cdsgenerator(sequence::String; kwargs...)

A function to generate CDS sequences from a DNA sequence.

The cdsgenerator is a generator function that takes a LongSequence{DNAAlphabet{4}} sequence and returns an iterator over the coding sequences (CDSs) found in it, each paired with its ORF. It uses the findorfs function to find open reading frames (ORFs) in the sequence, and then extracts the actual CDS sequence from each ORF, returning both. The function also searches the reverse complement of the sequence, so it finds CDSs on both strands.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.dinucleotidesMethod
dinucleotides(sequence::LongNucOrView{4})

Compute the transition counts of each dinucleotide in a given DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: a LongSequence{DNAAlphabet{4}} object representing the DNA sequence.

Keywords

  • extended_alphabet::Bool=false: If true, the extended DNA alphabet is used in the search.

Returns

A dictionary with keys being LongSequence{DNAAlphabet{4}} objects representing the dinucleotides, and values being the number of occurrences of each dinucleotide in the sequence.

Example

seq = dna"AGCTAGCTAGCT"
 
 dinucleotides(seq)
 
@@ -39,10 +39,10 @@
   CA => 0
   AT => 0
   AA => 0
-  TG => 0
source
GeneFinder.fasta_to_dnaMethod
fasta_to_dna(input::String)

Converts a FASTA formatted file (even if it is a multi-fasta) to an array of LongSequence{DNAAlphabet{4}} objects.

source
GeneFinder.findorfsMethod
findorfs(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)::Vector{ORF}
-findorfs(sequence::String; kwargs...)::Vector{ORF}

A simple implementation that finds ORFs in a DNA sequence.

The findorfs function takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It matches each entire CDS with a regular expression, adding each ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Setting alternative_start = true extends the start codons, so the search covers ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % of the time, respectively.

Note

This function has no ORF scoring scheme. Thus it might consider aa"M*" a possible encoded protein from the resulting ORFs.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.getcdsMethod
getcds(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
-getcds(input::String, output::String; kwargs...) ## for strings per se

This function takes a LongSequence{DNAAlphabet{4}} or String sequence and, by means of the findorfs() function, pushes each LongSubSeq{DNAAlphabet{4}} into a Vector{}.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.getproteinsMethod
getproteins(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
-getproteins(input::String, output::String; kwargs...)

Similar to the getcds() function, it takes a LongSequence{DNAAlphabet{4}} or String sequence and, by means of the findorfs() and translate() functions, pushes LongAAs into a Vector.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.hasprematurestopMethod
hasprematurestop(sequence::LongNucOrView{4})::Bool

Determine whether the sequence of type LongSequence{DNAAlphabet{4}} contains a premature stop codon.

Returns a boolean indicating whether the sequence has more than one stop codon.

source
GeneFinder.fasta_to_dnaMethod
fasta_to_dna(input::String)

Converts a FASTA formatted file (even if it is a multi-fasta) to an array of LongSequence{DNAAlphabet{4}} objects.

source
GeneFinder.findorfsMethod
findorfs(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)::Vector{ORF}
+findorfs(sequence::String; kwargs...)::Vector{ORF}

A simple implementation that finds ORFs in a DNA sequence.

The findorfs function takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It matches each entire CDS with a regular expression, adding each ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Setting alternative_start = true extends the start codons, so the search covers ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % of the time, respectively.

Note

This function has no ORF scoring scheme. Thus it might consider aa"M*" a possible encoded protein from the resulting ORFs.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
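To make the regular-expression idea concrete, here is a minimal sketch on plain Strings. It is not the GeneFinder implementation: the names `find_orfs` and `revcomp_string`, the exact pattern, and the tuple return type are all assumptions for illustration. Locations on the '-' strand are reported relative to the reverse-complemented string.

```julia
# Reverse complement of a plain DNA string (hypothetical helper).
function revcomp_string(seq::AbstractString)
    comp = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(comp[c] for c in reverse(seq))
end

# Hypothetical ORF search: a start codon, then as few whole codons as
# possible, then the first in-frame stop codon.
function find_orfs(seq::AbstractString; alternative_start::Bool = false, min_len::Int = 6)
    start = alternative_start ? "(?:ATG|GTG|TTG)" : "ATG"
    pattern = Regex(start * "(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")
    orfs = Tuple{UnitRange{Int}, Char}[]
    # Search the sequence as given ('+') and its reverse complement ('-').
    for (strand, s) in (('+', seq), ('-', revcomp_string(seq)))
        for m in eachmatch(pattern, s)
            loc = m.offset:(m.offset + length(m.match) - 1)
            length(loc) >= min_len && push!(orfs, (loc, strand))
        end
    end
    return orfs
end

orfs = find_orfs("CCATGAAATAGCC")
```

The lazy quantifier `*?` is what makes each match stop at the first in-frame stop codon instead of running on to a later one.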
GeneFinder.getcdsMethod
getcds(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
+getcds(input::String, output::String; kwargs...) ## for strings per se

This function will take a LongSequence{DNAAlphabet{4}} or String sequence and by means of the findorfs() function will push LongSubSeq{DNAAlphabet{4}} into a Vector{}

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.getproteinsMethod
getproteins(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
+getproteins(input::String, output::String; kwargs...)

Similar to getcds() function, it will take a LongSequence{DNAAlphabet{4}} or String sequence and by means of the findorfs() and the translate() function will push LongAAs into a Vector

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.hasprematurestopMethod
hasprematurestop(sequence::LongNucOrView{4})::Bool

Determine whether the sequence of type LongSequence{DNAAlphabet{4}} contains a premature stop codon.

Returns a boolean indicating whether the sequence has more than one stop codon.

source
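The "more than one stop codon" check can be sketched as follows, assuming the sequence is read in frame from its first nucleotide. That reading frame, the plain String input, and the name `has_premature_stop` are assumptions for this example only.

```julia
# Hypothetical check: does the sequence contain more than one in-frame stop codon?
function has_premature_stop(seq::AbstractString)
    stops = ("TAA", "TAG", "TGA")
    # Walk the sequence codon by codon, starting at position 1.
    n = count(i -> seq[i:i+2] in stops, 1:3:length(seq)-2)
    return n > 1
end

ok = has_premature_stop("ATGAAATAG")      # single terminal stop
bad = has_premature_stop("ATGTAAGGGTAG")  # TAA appears before the final TAG
```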
GeneFinder.iscodingFunction
iscoding(
     sequence::LongSequence{DNAAlphabet{4}}, 
     codingmodel::TransitionModel, 
     noncodingmodel::TransitionModel,
@@ -50,7 +50,7 @@
     )

Check if a given DNA sequence is likely to be coding based on a log-odds ratio. The log-odds ratio is a statistical measure used to assess the likelihood of a sequence being coding or non-coding. It compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model. If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. It is formally described as a decision rule:

\[S(X) = \log \left( \frac{{P_C(X_1=i_1, \ldots, X_T=i_T)}}{{P_N(X_1=i_1, \ldots, X_T=i_T)}} \right) \begin{cases} > \eta & \Rightarrow \text{{coding}} \\ < \eta & \Rightarrow \text{{noncoding}} \end{cases}\]

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: The DNA sequence to be evaluated.
  • codingmodel::TransitionModel: The transition model for coding regions.
  • noncodingmodel::TransitionModel: The transition model for non-coding regions.
  • η::Float64 = 1e-5: The threshold value for the log-odds ratio (default: 1e-5).

Returns

  • true if the sequence is likely to be coding.
  • false if the sequence is likely to be non-coding.

Raises

  • ErrorException: if the length of the sequence is not divisible by 3.
  • ErrorException: if the sequence contains a premature stop codon.

Example

sequence = LongDNA{4}("ATGGCATCTAG")
 codingmodel = TransitionModel()
 noncodingmodel = TransitionModel()
-iscoding(sequence, codingmodel, noncodingmodel)  # Returns: true
source
GeneFinder.locationiteratorMethod
locationiterator(sequence::LongSequence{DNAAlphabet{4}}; alternative_start::Bool=false)

This is an iterator function that uses regular expressions to search the entire CDS (instead of start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that will find the first regularly expressed CDS. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.

source
GeneFinder.nucleotidefreqsMethod
nucleotidefreqs(sequence::LongSequence{DNAAlphabet{4}}) -> Dict{DNA, Float64}

Calculate the frequency of each nucleotide in a DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: A LongSequence{DNAAlphabet{4}} sequence.

Returns

A dictionary with each nucleotide in the sequence as a key, and its frequency as a value.

Example

seq = dna"CCTCCCGGACCCTGGGCTCGGGAC"
+iscoding(sequence, codingmodel, noncodingmodel)  # Returns: true
source
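The decision rule above can be sketched in log space, which avoids numerical underflow on long sequences. This is an illustration on plain Strings, not the iscoding implementation: the models are hypothetical named tuples with `tpm` and `initials` fields in A, C, G, T order, and the numbers in them are made up.

```julia
# Hypothetical log-odds score S(X) for a first-order Markov chain,
# accumulated as a sum of log ratios.
function log_odds(seq::AbstractString, coding, noncoding)
    idx = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    chars = collect(seq)
    i1 = idx[chars[1]]
    score = log(coding.initials[i1]) - log(noncoding.initials[i1])
    for t in 1:length(chars)-1
        i, j = idx[chars[t]], idx[chars[t+1]]
        score += log(coding.tpm[i, j]) - log(noncoding.tpm[i, j])
    end
    return score
end

# Decision rule: likely coding when the score exceeds the threshold η.
is_likely_coding(seq, coding, noncoding; η = 1e-5) = log_odds(seq, coding, noncoding) > η

# Made-up models: the "coding" one favors A->T, T->G and G->A.
coding = (tpm = [0.1 0.1 0.1 0.7;
                 0.25 0.25 0.25 0.25;
                 0.7 0.1 0.1 0.1;
                 0.1 0.1 0.7 0.1],
          initials = fill(0.25, 4))
uniform = (tpm = fill(0.25, 4, 4), initials = fill(0.25, 4))

likely = is_likely_coding("ATGATG", coding, uniform)
unlikely = is_likely_coding("ACACAC", coding, uniform)
```

Here "ATGATG" follows the transitions the made-up coding model favors, so its score is positive, while "ACACAC" uses transitions that model considers unlikely and scores below the threshold.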
GeneFinder.locationiteratorMethod
locationiterator(sequence::LongSequence{DNAAlphabet{4}}; alternative_start::Bool=false)

This is an iterator function that uses regular expressions to search the entire CDS (instead of start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that will find the first regularly expressed CDS. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.

source
GeneFinder.nucleotidefreqsMethod
nucleotidefreqs(sequence::LongSequence{DNAAlphabet{4}}) -> Dict{DNA, Float64}

Calculate the frequency of each nucleotide in a DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: A LongSequence{DNAAlphabet{4}} sequence.

Returns

A dictionary with each nucleotide in the sequence as a key, and its frequency as a value.

Example

seq = dna"CCTCCCGGACCCTGGGCTCGGGAC"
 
 nucleotidefreqs(seq)
 
@@ -58,8 +58,8 @@
 DNA_T => 0.125
 DNA_A => 0.0833333
 DNA_G => 0.333333
-DNA_C => 0.458333
source
GeneFinder.proteingeneratorMethod
proteingenerator(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)
-proteingenerator(sequence::String; kwargs...)

As its name suggests, this generator function iterates over the sequence to find proteins directly from a DNA sequence. The proteingenerator function takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{CDS} containing the coding sequences (CDSs) found in the sequence and the associated ORF.

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.sequenceprobabilityMethod
sequenceprobability(sequence::LongNucOrView{4}, tpm::Matrix{Float64}, initials=Vector{Float64})

Compute the probability of a given sequence using a transition probability matrix and the initial probabilities distributions.

\[P(X_1 = i_1, \ldots, X_T = i_T) = \pi_{i_1} \prod_{t=1}^{T-1} a_{i_t, i_{t+1}}\]

Arguments

  • sequence::LongNucOrView{4}: The input sequence of nucleotides.
  • tm::TransitionModel: the data structure composed of tpm::Matrix{Float64}, the transition probability matrix, and initials=Vector{Float64}, the initial state probabilities.

Returns

  • probability::Float64: The probability of the input sequence.

Example

mainseq = LongDNA{4}("CCTCCCGGACCCTGGGCTCGGGAC")
+DNA_C => 0.458333
source
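The frequencies in this example can be reproduced with a few lines of plain Julia. The sketch below works on a String and returns a Dict keyed by Char rather than DNA symbols; `nucleotide_freqs` is a hypothetical name, not the package function.

```julia
# Hypothetical helper: relative frequency of each character in a DNA string.
function nucleotide_freqs(seq::AbstractString)
    counts = Dict{Char, Int}()
    for nt in seq
        counts[nt] = get(counts, nt, 0) + 1
    end
    total = length(seq)
    return Dict(nt => c / total for (nt, c) in counts)
end

freqs = nucleotide_freqs("CCTCCCGGACCCTGGGCTCGGGAC")
```

The sequence has 24 nucleotides (11 C, 8 G, 3 T, 2 A), which gives the values shown in the example output.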
GeneFinder.proteingeneratorMethod
proteingenerator(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)
+proteingenerator(sequence::String; kwargs...)

As its name suggests, this generator function iterates over the sequence to find proteins directly from a DNA sequence. The proteingenerator function takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{CDS} containing the coding sequences (CDSs) found in the sequence and the associated ORF.

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended set of start codons is used in the search. This increases the execution time 3x.
  • min_len::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein from the resulting ORFs.
source
GeneFinder.sequenceprobabilityMethod
sequenceprobability(sequence::LongNucOrView{4}, tpm::Matrix{Float64}, initials=Vector{Float64})

Compute the probability of a given sequence using a transition probability matrix and the initial probabilities distributions.

\[P(X_1 = i_1, \ldots, X_T = i_T) = \pi_{i_1} \prod_{t=1}^{T-1} a_{i_t, i_{t+1}}\]

Arguments

  • sequence::LongNucOrView{4}: The input sequence of nucleotides.
  • tm::TransitionModel: the data structure composed of tpm::Matrix{Float64}, the transition probability matrix, and initials=Vector{Float64}, the initial state probabilities.

Returns

  • probability::Float64: The probability of the input sequence.

Example

mainseq = LongDNA{4}("CCTCCCGGACCCTGGGCTCGGGAC")
 
 tpm = transition_probability_matrix(mainseq)
     
@@ -92,7 +92,7 @@
 
 sequenceprobability(newseq, tm)
     
-    0.0217
source
GeneFinder.transition_count_matrixMethod
transition_count_matrix(sequence::LongSequence{DNAAlphabet{4}})

Compute the transition count matrix (TCM) of a given DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: a LongSequence{DNAAlphabet{4}} object representing the DNA sequence.

Keywords

  • extended_alphabet::Bool=false: If true, the extended DNA alphabet is used in the search.

Returns

A TCM object representing the transition count matrix of the sequence.

Example

seq = LongDNA{4}("AGCTAGCTAGCT")
+    0.0217
source
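Reading the formula for sequenceprobability as the usual first-order Markov chain factorization (the initial probability of the first nucleotide times the T-1 transition probabilities, which is an assumption about the intended meaning), the computation can be sketched on a plain String as below. The function name, the Vector-typed initials and the uniform model are all hypothetical.

```julia
# Hypothetical helper: probability of a DNA string under a first-order
# Markov chain with rows and columns in A, C, G, T order.
function sequence_probability(seq::AbstractString,
                              tpm::AbstractMatrix{<:Real},
                              initials::AbstractVector{<:Real})
    idx = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    chars = collect(seq)
    p = initials[idx[chars[1]]]                    # π for the first nucleotide
    for t in 1:length(chars)-1
        p *= tpm[idx[chars[t]], idx[chars[t+1]]]   # a for each consecutive pair
    end
    return p
end

# Uniform model: every start and every transition has probability 0.25.
tpm = fill(0.25, 4, 4)
initials = fill(0.25, 4)
p = sequence_probability("ACGT", tpm, initials)
```

Under the uniform model a sequence of length T has probability 0.25^T, so the four-nucleotide input gives 0.25^4.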
GeneFinder.transition_count_matrixMethod
transition_count_matrix(sequence::LongSequence{DNAAlphabet{4}})

Compute the transition count matrix (TCM) of a given DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: a LongSequence{DNAAlphabet{4}} object representing the DNA sequence.

Keywords

  • extended_alphabet::Bool=false: If true, the extended DNA alphabet is used in the search.

Returns

A TCM object representing the transition count matrix of the sequence.

Example

seq = LongDNA{4}("AGCTAGCTAGCT")
 
 tcm = transition_count_matrix(seq)
 
@@ -102,7 +102,7 @@
 C  0 0 0 3
 G  0 3 0 0
 T  2 0 0 0
-
source
GeneFinder.transition_modelFunction
transition_model(sequence::LongNucOrView{4}, n::Int64=1)

Constructs a transition model based on the given DNA sequence and transition order.

Arguments

  • sequence::LongNucOrView{4}: A DNA sequence represented as a LongNucOrView{4} object.
  • n::Int64 (optional): The transition order (default: 1).

Returns

A TransitionModel object representing the transition model.


transition_model(tpm::Matrix{Float64}, initials::Matrix{Float64}, n::Int64=1)

Builds a transition model based on the transition probability matrix and the initial distributions. It can also calculate higher orders of the model if n is changed.

Arguments

  • tpm::Matrix{Float64}: the transition probability matrix TPM
  • initials::Matrix{Float64}: the initial distributions of the model.
  • n::Int64 (optional): The transition order (default: 1).

Returns

A TransitionProbabilityMatrix object representing the transition probability matrix.

Example

sequence = LongDNA{4}("ACTACATCTA")
+
source
GeneFinder.transition_modelFunction
transition_model(sequence::LongNucOrView{4}, n::Int64=1)

Constructs a transition model based on the given DNA sequence and transition order.

Arguments

  • sequence::LongNucOrView{4}: A DNA sequence represented as a LongNucOrView{4} object.
  • n::Int64 (optional): The transition order (default: 1).

Returns

A TransitionModel object representing the transition model.


transition_model(tpm::Matrix{Float64}, initials::Matrix{Float64}, n::Int64=1)

Builds a transition model based on the transition probability matrix and the initial distributions. It can also calculate higher orders of the model if n is changed.

Arguments

  • tpm::Matrix{Float64}: the transition probability matrix TPM
  • initials::Matrix{Float64}: the initial distributions of the model.
  • n::Int64 (optional): The transition order (default: 1).

Returns

A TransitionProbabilityMatrix object representing the transition probability matrix.

Example

sequence = LongDNA{4}("ACTACATCTA")
 
 model = transition_model(sequence, 2)
 TransitionModel:
@@ -113,7 +113,7 @@
     0.111	0.444	0.0	0.444
   - Initials (Size: 1 × 4):
     0.333	0.333	0.0	0.333
-  - order: 2
source
GeneFinder.transition_probability_matrixFunction
transition_probability_matrix(sequence::LongSequence{DNAAlphabet{4}})

Compute the transition probability matrix (TPM) of a given DNA sequence. Formally, it constructs $\hat{A}$, where:

\[a_{ij} = P(X_t = j \mid X_{t-1} = i) = \frac{{P(X_{t-1} = i, X_t = j)}}{{P(X_{t-1} = i)}}\]

Arguments

  • sequence::LongNucOrView{4}: a LongNucOrView{4} object representing the DNA sequence.
  • n::Int64=1: The order of the Markov model, that is, the power $n$ in $\hat{A}^{n}$.

Keywords

  • extended_alphabet::Bool=false: If true, uses the extended DNA alphabet.

Returns

A TPM object representing the transition probability matrix of the sequence.

Example

seq = dna"AGCTAGCTAGCT"
+  - order: 2
source
GeneFinder.transition_probability_matrixFunction
transition_probability_matrix(sequence::LongSequence{DNAAlphabet{4}})

Compute the transition probability matrix (TPM) of a given DNA sequence. Formally, it constructs $\hat{A}$, where:

\[a_{ij} = P(X_t = j \mid X_{t-1} = i) = \frac{{P(X_{t-1} = i, X_t = j)}}{{P(X_{t-1} = i)}}\]
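
As a hedged illustration, one common way to estimate these probabilities is to divide each transition count $c_{ij}$ by its row total. The estimator and the symbol $c_{ij}$ are assumptions here, since the text above does not say how the probabilities are estimated:

\[\hat{a}_{ij} = \frac{c_{ij}}{\sum_{k} c_{ik}}\]

For the sequence AGCTAGCTAGCT, the only transition leaving T is T→A, observed twice, so

\[\hat{a}_{TA} = \frac{c_{TA}}{\sum_{k} c_{Tk}} = \frac{2}{2} = 1.0\]

which agrees with the T row of the TPM in the example for this function.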

Arguments

  • sequence::LongNucOrView{4}: a LongNucOrView{4} object representing the DNA sequence.
  • n::Int64=1: The order of the Markov model, that is, the power $n$ in $\hat{A}^{n}$.

Keywords

  • extended_alphabet::Bool=false: If true, uses the extended DNA alphabet.

Returns

A TPM object representing the transition probability matrix of the sequence.

Example

seq = dna"AGCTAGCTAGCT"
 
 tpm = transition_probability_matrix(seq)
 
@@ -122,7 +122,7 @@
 A  0.0 0.0 1.0 0.0
 C  0.0 0.0 0.0 1.0
 G  0.0 1.0 0.0 0.0
-T  1.0 0.0 0.0 0.0
source
GeneFinder.write_bedMethod
write_bed(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
-write_bed(input::String, output::String; kwargs...)

Write BED data to a file.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is searched. This increases execution time 3x.
  • min_len::Int64=6: Minimum allowed ORF length. The default allows aa"M*" as a possible protein encoded by the resulting ORFs.
source
GeneFinder.write_cdsMethod
write_cds(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
-write_cds(input::String, output::String; kwargs...)

Write a file containing the coding sequences (CDSs) of a given DNA sequence to the specified file.

Keywords

  • alternative_start: A boolean value indicating whether alternative start codons should be used when identifying CDSs. Default is false.
  • min_len: An integer representing the minimum length that a CDS must have in order to be included in the output file. Default is 6.
source
GeneFinder.write_gffMethod
write_gff(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
-write_gff(input::String, output::String; kwargs...)

Write GFF data to a file.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is searched. This increases execution time 3x.
  • min_len::Int64=6: Minimum allowed ORF length. The default allows aa"M*" as a possible protein encoded by the resulting ORFs.
source
GeneFinder.write_proteinsMethod
write_proteins(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)

Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended set of start codons is searched. This increases execution time 3x.
  • min_len::Int64=6: Minimum allowed ORF length. The default allows aa"M*" as a possible protein encoded by the resulting ORFs.
source
+T  1.0 0.0 0.0 0.0
source
GeneFinder.write_bedMethod
write_bed(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
+write_bed(input::String, output::String; kwargs...)

Write BED data to a file.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is searched. This increases execution time 3x.
  • min_len::Int64=6: Minimum allowed ORF length. The default allows aa"M*" as a possible protein encoded by the resulting ORFs.
source
GeneFinder.write_cdsMethod
write_cds(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
+write_cds(input::String, output::String; kwargs...)

Write a file containing the coding sequences (CDSs) of a given DNA sequence to the specified file.

Keywords

  • alternative_start: A boolean value indicating whether alternative start codons should be used when identifying CDSs. Default is false.
  • min_len: An integer representing the minimum length that a CDS must have in order to be included in the output file. Default is 6.
source
GeneFinder.write_gffMethod
write_gff(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
+write_gff(input::String, output::String; kwargs...)

Write GFF data to a file.

Keywords

  • alternative_start::Bool=false: If true, the extended set of start codons is searched. This increases execution time 3x.
  • min_len::Int64=6: Minimum allowed ORF length. The default allows aa"M*" as a possible protein encoded by the resulting ORFs.
source
GeneFinder.write_proteinsMethod
write_proteins(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)

Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended set of start codons is searched. This increases execution time 3x.
  • min_len::Int64=6: Minimum allowed ORF length. The default allows aa"M*" as a possible protein encoded by the resulting ORFs.
source
diff --git a/dev/index.html b/dev/index.html index 9a8e4bc..cbc402a 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · GeneFinder.jl


Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia Language.

The main goal is to create a versatile module that enables applying different implemented algorithms to DNA sequences. See, for instance, BioAlignment implementations of different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Algorithms

Coding genes (CDS - ORFs)

  • Simple finder
  • ☐ EasyGene
  • ☐ GLIMMER
  • ☐ Prodigal - Pyrodigal
  • ☐ PHANOTATE
  • ☐ k-mer based gene finders (?)
  • ☐ Augustus (?)

Non-coding genes (RNA)

  • ☐ Infernal
  • ☐ tRNAscan

Other features

  • ☐ parallelism SIMD ?
  • ☐ memory management (?)
  • ☐ incorporate Ribosome Binding Sites (RBS)
  • ☐ incorporate Programmed Reading Frame Shifting (PRFS)
  • ☐ specialized types
    • ☒ Gene
    • ☒ ORF
    • ☒ Codon
    • ☒ CDS
    • ☐ EukaryoticGene (?)
    • ☐ ProkaryoticGene (?)
    • ☐ Intron
    • ☐ Exon
    • ☐ GFF --> See other packages
    • ☐ FASTX --> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl
  • Graphs.jl

Contributing

Citing

See CITATION.bib for the relevant reference(s).

+Home · GeneFinder.jl


Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia Language.

The main goal is to create a versatile module that enables applying different implemented algorithms to DNA sequences. See, for instance, BioAlignment implementations of different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Algorithms

Coding genes (CDS - ORFs)

  • Simple finder
  • ☐ EasyGene
  • ☐ GLIMMER
  • ☐ Prodigal - Pyrodigal
  • ☐ PHANOTATE
  • ☐ k-mer based gene finders (?)
  • ☐ Augustus (?)

Non-coding genes (RNA)

  • ☐ Infernal
  • ☐ tRNAscan

Other features

  • ☐ parallelism SIMD ?
  • ☐ memory management (?)
  • ☐ incorporate Ribosome Binding Sites (RBS)
  • ☐ incorporate Programmed Reading Frame Shifting (PRFS)
  • ☐ specialized types
    • ☒ Gene
    • ☒ ORF
    • ☒ Codon
    • ☒ CDS
    • ☐ EukaryoticGene (?)
    • ☐ ProkaryoticGene (?)
    • ☐ Intron
    • ☐ Exon
    • ☐ GFF --> See other packages
    • ☐ FASTX --> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl
  • Graphs.jl

Contributing

Citing

See CITATION.bib for the relevant reference(s).

diff --git a/dev/markovchains/index.html b/dev/markovchains/index.html index 2dea4c4..724562a 100644 --- a/dev/markovchains/index.html +++ b/dev/markovchains/index.html @@ -48,4 +48,4 @@ S(X) = \log \frac{{P_C(X_1=i_1, \ldots, X_T=i_T)}}{{P_N(X_1=i_1, \ldots, X_T=i_T)}} \begin{cases} > \eta & \Rightarrow \text{coding} \\ < \eta & \Rightarrow \text{noncoding} \end{cases} \end{align}\]

Here $P_{C}$ is the probability of the sequence under a CDS model and $P_{N}$ is its probability under a No-CDS model. The decision rule classifies the sequence according to whether the log ratio is greater or less than a given significance threshold η.

GeneFinder implements this rule together with a couple of basic transition probability models of E. coli CDS and No-CDS regions, taken from Axelson-Fisk (2015). To check whether a random sequence could be coding under this decision rule, we use the predicate iscoding with the ECOLICDS and ECOLINOCDS models:

randseq = getcds(randdnaseq(99))[1] # this will retrieve a random coding ORF
 
-iscoding(randseq, ECOLICDS, ECOLINOCDS)
true

References

Axelson-Fisk, Marina. 2015. Comparative Gene Finding. Vol. 20. Computational Biology. London: Springer London. http://link.springer.com/10.1007/978-1-4471-6693-1.

+iscoding(randseq, ECOLICDS, ECOLINOCDS)
true
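
The log-odds decision rule described above can be sketched in plain Julia. This is only an illustration under stated assumptions, not how `iscoding` is implemented: `markov_logprob`, `logodds` and `is_coding_sketch` are hypothetical names, the models are first-order Markov chains over A, C, G, T stored as named tuples, and the numbers are toy values rather than the ECOLICDS and ECOLINOCDS models.

```julia
# Hypothetical helpers for illustration; base order is A, C, G, T.
const BASE_INDEX = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)

# Log-probability of `seq` under a first-order Markov chain with transition
# matrix `tpm` (rows: current base, columns: next base) and initial
# distribution `initials`.
function markov_logprob(seq::AbstractString, tpm::Matrix{Float64}, initials::Vector{Float64})
    idx = [BASE_INDEX[c] for c in seq]
    logp = log(initials[idx[1]])
    for t in 2:length(idx)
        logp += log(tpm[idx[t-1], idx[t]])
    end
    return logp
end

# Log-odds score S(X) = log(P_C(X) / P_N(X)).
logodds(seq, coding, noncoding) =
    markov_logprob(seq, coding.tpm, coding.initials) -
    markov_logprob(seq, noncoding.tpm, noncoding.initials)

# Decision rule: call the sequence coding when the score exceeds η.
is_coding_sketch(seq, coding, noncoding; η = 0.0) = logodds(seq, coding, noncoding) > η

# Toy models with made-up numbers (assumption, not the E. coli models).
coding = (tpm = [0.1 0.2 0.6 0.1;
                 0.3 0.3 0.2 0.2;
                 0.2 0.5 0.2 0.1;
                 0.4 0.1 0.3 0.2],
          initials = [0.25, 0.25, 0.25, 0.25])
noncoding = (tpm = fill(0.25, 4, 4), initials = fill(0.25, 4))

score = logodds("AGCAGC", coding, noncoding)
println(is_coding_sketch("AGCAGC", coding, noncoding))
```

Working in log space turns the product of transition probabilities into a sum, which avoids numerical underflow on long sequences. The default threshold of 0 in this sketch simply means the sequence is more probable under the coding model than under the noncoding one.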

References

Axelson-Fisk, Marina. 2015. Comparative Gene Finding. Vol. 20. Computational Biology. London: Springer London. http://link.springer.com/10.1007/978-1-4471-6693-1.

diff --git a/dev/oldindex/index.html b/dev/oldindex/index.html index 0bfddb9..5fe8ded 100644 --- a/dev/oldindex/index.html +++ b/dev/oldindex/index.html @@ -1,2 +1,2 @@ -- · GeneFinder.jl

engine: knitr cache: true –-

<p align="center"> <img src="../assets/logo.svg" height="150"><br/> <i>A Gene Finder framework for Julia.</i><br/><br/> <a href="https://www.repostatus.org/#wip"> <img src="https://www.repostatus.org/badges/latest/wip.svg"> </a> <a href="https://codecov.io/gh/camilogarciabotero/GeneFinder.jl"> <img src="https://img.shields.io/codecov/c/github/camilogarciabotero/GeneFinder.jl?logo=codecov&logoColor=white"> </a> <a href="https://camilogarciabotero.github.io/GeneFinder.jl/dev/"> <img src="https://img.shields.io/badge/documentation-online-blue.svg?logo=Julia&logoColor=white"> </a> <a href="https://travis-ci.com/camilogarciabotero/GeneFinder.jl"> <img src="https://travis-ci.com/camilogarciabotero/GeneFinder.jl.svg?branch=main"> <a href="https://github.com/camilogarciabotero/GeneFinder.jl/blob/main/LICENSE"> <img src="https://img.shields.io/badge/license-MIT-green.svg"> </a> </p>


Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia Language.

The main goal is to create a versatile module that enables applying different implemented algorithms to DNA sequences. See, for instance, BioAlignment implementations of different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Algorithms

Coding genes (CDS - ORFs)

  • [x] Simple finder
  • [ ] EasyGene
  • [ ] GLIMMER
  • [ ] Prodigal - Pyrodigal
  • [ ] PHANOTATE
  • [ ] k-mer based gene finders (?)
  • [ ] Augustus (?)

Non-coding genes (RNA)

  • [ ] Infernal
  • [ ] tRNAscan

Other features

  • [ ] parallelism SIMD ?
  • [ ] memory management (?)
  • [ ] specialized types
    • [x] Gene
    • [x] ORF
    • [x] CDS
    • [ ] EukaryoticGene (?)
    • [ ] ProkaryoticGene (?)
    • [ ] Codon
    • [ ] Intron
    • [ ] Exon
    • [ ] GFF --> See other packages
    • [ ] FASTX --> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl

Contributing

Citing

See CITATION.bib for the relevant reference(s).

Logo: gene analysis by Vector Points from the Noun Project

+- · GeneFinder.jl

engine: knitr cache: true –-

<p align="center"> <img src="../assets/logo.svg" height="150"><br/> <i>A Gene Finder framework for Julia.</i><br/><br/> <a href="https://www.repostatus.org/#wip"> <img src="https://www.repostatus.org/badges/latest/wip.svg"> </a> <a href="https://codecov.io/gh/camilogarciabotero/GeneFinder.jl"> <img src="https://img.shields.io/codecov/c/github/camilogarciabotero/GeneFinder.jl?logo=codecov&logoColor=white"> </a> <a href="https://camilogarciabotero.github.io/GeneFinder.jl/dev/"> <img src="https://img.shields.io/badge/documentation-online-blue.svg?logo=Julia&logoColor=white"> </a> <a href="https://travis-ci.com/camilogarciabotero/GeneFinder.jl"> <img src="https://travis-ci.com/camilogarciabotero/GeneFinder.jl.svg?branch=main"> <a href="https://github.com/camilogarciabotero/GeneFinder.jl/blob/main/LICENSE"> <img src="https://img.shields.io/badge/license-MIT-green.svg"> </a> </p>


Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia Language.

The main goal is to create a versatile module that enables applying different implemented algorithms to DNA sequences. See, for instance, BioAlignment implementations of different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Algorithms

Coding genes (CDS - ORFs)

  • [x] Simple finder
  • [ ] EasyGene
  • [ ] GLIMMER
  • [ ] Prodigal - Pyrodigal
  • [ ] PHANOTATE
  • [ ] k-mer based gene finders (?)
  • [ ] Augustus (?)

Non-coding genes (RNA)

  • [ ] Infernal
  • [ ] tRNAscan

Other features

  • [ ] parallelism SIMD ?
  • [ ] memory management (?)
  • [ ] specialized types
    • [x] Gene
    • [x] ORF
    • [x] CDS
    • [ ] EukaryoticGene (?)
    • [ ] ProkaryoticGene (?)
    • [ ] Codon
    • [ ] Intron
    • [ ] Exon
    • [ ] GFF --> See other packages
    • [ ] FASTX --> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl

Contributing

Citing

See CITATION.bib for the relevant reference(s).

Logo: gene analysis by Vector Points from the Noun Project

diff --git a/dev/search/index.html b/dev/search/index.html index 87be23d..ef79f49 100644 --- a/dev/search/index.html +++ b/dev/search/index.html @@ -1,2 +1,2 @@ -Search · GeneFinder.jl

Loading search...

    +Search · GeneFinder.jl

    Loading search...

      diff --git a/dev/simplefinder/index.html b/dev/simplefinder/index.html index 8bd2973..1bc282f 100644 --- a/dev/simplefinder/index.html +++ b/dev/simplefinder/index.html @@ -99,4 +99,4 @@ >location=581:601 strand=+ MCPTAA* >location=695:706 strand=+ -MQP* +MQP*