From b6292f8cbcbe31faec51a849d5dc655ca4c8380b Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Wed, 19 Jul 2023 07:08:00 +0000 Subject: [PATCH] build based on c91a193 --- previews/PR119/custom/index.html | 4 ++-- previews/PR119/debugging/index.html | 2 +- previews/PR119/index.html | 2 +- previews/PR119/io/index.html | 8 ++++---- previews/PR119/parser/index.html | 16 ++++++++-------- previews/PR119/reader/index.html | 2 +- previews/PR119/regex/index.html | 4 ++-- previews/PR119/search/index.html | 2 +- previews/PR119/search_index.js | 2 +- previews/PR119/theory/index.html | 2 +- previews/PR119/tokenizer/index.html | 6 +++--- previews/PR119/validators/index.html | 4 ++-- 12 files changed, 27 insertions(+), 27 deletions(-) diff --git a/previews/PR119/custom/index.html b/previews/PR119/custom/index.html index f457dc9e..22cbdc65 100644 --- a/previews/PR119/custom/index.html +++ b/previews/PR119/custom/index.html @@ -104,7 +104,7 @@ )

Create a CodeGenContext (ctx), a struct that stores options for Automa code generation. CodeGenContexts are passed to Automa's various code generator functions. They currently take the following options (more may be added in future versions).

Example

julia> ctx = CodeGenContext(generator=:goto, vars=Variables(buffer=:tbuffer));
 
 julia> generate_code(ctx, compile(re"a+")) isa Expr
-true
source
Automa.VariablesType

Struct used to store variable names used in generated code. Contained in a CodeGenContext. Create a custom Variables for your CodeGenContext if you want to customize the variables used in Automa codegen, typically if you have conflicting variables with the same name.

Automa generates code with the following variables, shown below with their default names:

  • p::Int: current position of data
  • p_end::Int: end position of data
  • is_eof::Bool: whether p_end marks the end of the file or stream
  • cs::Int: current state
  • data::Any: input data
  • mem::SizedMemory: Memory wrapping data
  • byte::UInt8: current byte being read from data
  • buffer::TranscodingStreams.Buffer: (generate_reader only)

Example

julia> ctx = CodeGenContext(vars=Variables(byte=:u8));
+true
source
Automa.VariablesType

Struct used to store variable names used in generated code. Contained in a CodeGenContext. Create a custom Variables for your CodeGenContext if you want to customize the variables used in Automa codegen, typically if you have conflicting variables with the same name.

Automa generates code with the following variables, shown below with their default names:

  • p::Int: current position of data
  • p_end::Int: end position of data
  • is_eof::Bool: whether p_end marks the end of the file or stream
  • cs::Int: current state
  • data::Any: input data
  • mem::SizedMemory: Memory wrapping data
  • byte::UInt8: current byte being read from data
  • buffer::TranscodingStreams.Buffer: (generate_reader only)

Example

julia> ctx = CodeGenContext(vars=Variables(byte=:u8));
 
 julia> ctx.vars.byte
-:u8
source
+:u8source diff --git a/previews/PR119/debugging/index.html b/previews/PR119/debugging/index.html index 51bdc3e1..c883311c 100644 --- a/previews/PR119/debugging/index.html +++ b/previews/PR119/debugging/index.html @@ -65,4 +65,4 @@ println(io, machine2dot(machine)) end # Requires graphviz to be installed -run(pipeline(`dot -Tsvg /tmp/machine.dot`), stdout="/tmp/machine.svg")source +run(pipeline(`dot -Tsvg /tmp/machine.dot`), stdout="/tmp/machine.svg")source diff --git a/previews/PR119/index.html b/previews/PR119/index.html index a18179e6..54d2d269 100644 --- a/previews/PR119/index.html +++ b/previews/PR119/index.html @@ -45,4 +45,4 @@ (headers, reshape(fields, length(headers), :)) end -header, data = parse_tsv("a\tabc\n12\t13\r\nxyc\tz\n\n") +header, data = parse_tsv("a\tabc\n12\t13\r\nxyc\tz\n\n") diff --git a/previews/PR119/io/index.html b/previews/PR119/io/index.html index c9b4e8f4..75cf068b 100644 --- a/previews/PR119/io/index.html +++ b/previews/PR119/io/index.html @@ -107,14 +107,14 @@ mark: ^ p = 9 ^

Finally, when we reach the newline p = 13, the whole header is in the buffer, and so data[@markpos():p-1] will correctly refer to the header (now, 1:12).

content: abcdefghijkl\nA
 mark:    ^
-p = 13               ^

Remember to update the mark, or to clear it with @unmark() in order to be able to flush data from the buffer afterwards.

Reference

Automa.generate_readerFunction
generate_reader(funcname::Symbol, machine::Automa.Machine; kwargs...)

NOTE: This method requires TranscodingStreams to be loaded

Generate a streaming reader function of the name funcname from machine.

The generated function consumes data from a stream passed as the first argument and executes the machine while filling the data buffer.

This function returns an expression object of the generated function. The user needs to evaluate it in the module where the generated function is needed.

Keyword Arguments

  • arguments: Additional arguments funcname will take (default: ()). The default signature of the generated function is (stream::TranscodingStream,), but it is possible to supply more arguments to the signature with this keyword argument.
  • context: Automa's code generator (default: Automa.CodeGenContext()).
  • actions: A dictionary of action code (default: Dict{Symbol,Expr}()).
  • initcode: Initialization code (default: :()).
  • loopcode: Loop code (default: :()).
  • returncode: Return code (default: :(return cs)).
  • errorcode: Executed if cs < 0 after loopcode (default error message)

See the source code of this function to see what the generated code looks like.

source
Automa.@escapeMacro
@escape()

Pseudomacro. When encountered during Machine execution, the machine will stop executing. This is useful to interrupt the parsing process, for example to emit a record during parsing of a larger file. p will be advanced as normal, so if @escape is hit on B during parsing of "ABC", the next byte will be C.

source
Automa.@markMacro
@mark()

Pseudomacro, to be used with IO-parsing Automa functions. This macro will "mark" the position of p in the current buffer. The marked position will not be flushed from the buffer after being consumed. For example, Automa code can call @mark() at the beginning of a large string, then when the string is exited at position p, it is guaranteed that the whole string resides in the buffer at positions markpos():p-1.

source
Automa.@unmarkMacro
unmark()

Pseudomacro. Removes the mark from the buffer. This allows all previous data to be cleared from the buffer.

See also: @mark, @markpos

source
Automa.@markposMacro
markpos()

Pseudomacro. Get the position of the mark in the buffer.

See also: @mark, @markpos

source
Automa.@bufferposMacro
bufferpos()

Pseudomacro. Returns the integer position of the current TranscodingStreams buffer (only used with the generate_reader function).

Example

# Inside some Automa action code
+p = 13               ^

Remember to update the mark, or to clear it with @unmark() in order to be able to flush data from the buffer afterwards.

Reference

Automa.generate_readerFunction
generate_reader(funcname::Symbol, machine::Automa.Machine; kwargs...)

Generate a streaming reader function of the name funcname from machine.

The generated function consumes data from a stream passed as the first argument and executes the machine while filling the data buffer.

This function returns an expression object of the generated function. The user needs to evaluate it in the module where the generated function is needed.

Keyword Arguments

  • arguments: Additional arguments funcname will take (default: ()). The default signature of the generated function is (stream::TranscodingStream,), but it is possible to supply more arguments to the signature with this keyword argument.
  • context: Automa's code generator (default: Automa.CodeGenContext()).
  • actions: A dictionary of action code (default: Dict{Symbol,Expr}()).
  • initcode: Initialization code (default: :()).
  • loopcode: Loop code (default: :()).
  • returncode: Return code (default: :(return cs)).
  • errorcode: Executed if cs < 0 after loopcode (default error message)

See the source code of this function to see what the generated code looks like.

source
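To make the keyword arguments above concrete, here is a minimal, hedged sketch of defining and using a reader. The machine, the :count action, the n_lowercase variable and the function name count_lowercase are all illustrative and not part of Automa's API; TranscodingStreams must be loaded.

using Automa, TranscodingStreams

# Illustrative machine: a stream of lowercase letters, counting each byte
machine = compile(onall!(re"[a-z]*", :count))

eval(generate_reader(
    :count_lowercase,
    machine;
    actions=Dict(:count => :(n_lowercase += 1)),
    initcode=:(n_lowercase = 0),
    returncode=:(return n_lowercase),
))

# Usage (sketch): count_lowercase(NoopStream(IOBuffer("abc")))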
Automa.@escapeMacro
@escape()

Pseudomacro. When encountered during Machine execution, the machine will stop executing. This is useful to interrupt the parsing process, for example to emit a record during parsing of a larger file. p will be advanced as normal, so if @escape is hit on B during parsing of "ABC", the next byte will be C.

source
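As an illustrative sketch (the action name :emit_record and the found_record variable are hypothetical, not from this document), @escape is typically placed in an action so that a generated reader returns one record at a time instead of consuming the whole stream:

# In the actions dictionary passed to a code generator
:emit_record => quote
    found_record = true
    @escape
end

After the machine stops, the surrounding loopcode can check found_record and return the parsed record to the caller.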
Automa.@markMacro
@mark()

Pseudomacro, to be used with IO-parsing Automa functions. This macro will "mark" the position of p in the current buffer. The marked position will not be flushed from the buffer after being consumed. For example, Automa code can call @mark() at the beginning of a large string, then when the string is exited at position p, it is guaranteed that the whole string resides in the buffer at positions markpos():p-1.

source
Automa.@unmarkMacro
unmark()

Pseudomacro. Removes the mark from the buffer. This allows all previous data to be cleared from the buffer.

See also: @mark, @markpos

source
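A hedged sketch of how @mark() and @unmark() pair up in practice (the action names :enter_field and :exit_field, and the field variable, are hypothetical):

# In the actions dictionary
:enter_field => :(@mark()),
:exit_field => quote
    # The mark guarantees the whole field is still in the buffer
    field = data[@markpos():p-1]
    # Release the mark so earlier data can be flushed again
    @unmark()
end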
Automa.@bufferposMacro
bufferpos()

Pseudomacro. Returns the integer position of the current TranscodingStreams buffer (only used with the generate_reader function).

Example

# Inside some Automa action code
 @setbuffer()
 description = sub_parser(stream)
-p = @bufferpos()

See also: @setbuffer

source
Automa.@relposMacro
relpos(p)

Automa pseudomacro. Return the position of p relative to @markpos(). Equivalent to p - @markpos() + 1. This can be used to mark additional points in the stream while the mark is set, after which their actual position can be retrieved using abspos(x)

Example usage:

# In one action
+p = @bufferpos()

See also: @setbuffer

source
Automa.@relposMacro
relpos(p)

Automa pseudomacro. Return the position of p relative to @markpos(). Equivalent to p - @markpos() + 1. This can be used to mark additional points in the stream while the mark is set, after which their actual position can be retrieved using abspos(x)

Example usage:

# In one action
 identifier_pos = @relpos(p)
 
 # Later, in a different action
-identifier = data[@abspos(identifier_pos):p]

See also: @abspos

source
Automa.@absposMacro
abspos(p)

Automa pseudomacro. Used to obtain the actual position of a relative position obtained from @relpos. See @relpos for more details.

source
Automa.@setbufferMacro
setbuffer()

Updates the buffer position to match p. The buffer position is synchronized with p before and after calls to functions generated by generate_reader. @setbuffer() can be used to update the buffer position before calling another parser.

Example

# Inside some Automa action code
+identifier = data[@abspos(identifier_pos):p]

See also: @abspos

source
Automa.@absposMacro
abspos(p)

Automa pseudomacro. Used to obtain the actual position of a relative position obtained from @relpos. See @relpos for more details.

source
Automa.@setbufferMacro
setbuffer()

Updates the buffer position to match p. The buffer position is synchronized with p before and after calls to functions generated by generate_reader. @setbuffer() can be used to update the buffer position before calling another parser.

Example

# Inside some Automa action code
 @setbuffer()
 description = sub_parser(stream)
-p = @bufferpos()

See also: @bufferpos

source
+p = @bufferpos()

See also: @bufferpos

source diff --git a/previews/PR119/parser/index.html b/previews/PR119/parser/index.html index 1e5ae63b..9cad8a8f 100644 --- a/previews/PR119/parser/index.html +++ b/previews/PR119/parser/index.html @@ -81,17 +81,17 @@ julia> regex2 = onenter!(regex, :entering_regex); julia> regex === regex2 -truesource
Automa.RegExp.onexit!Function
onexit!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading the first byte no longer part of regex re, or upon encountering an expected end-of-file. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onall!, onfinal!

Example

julia> regex = re"ab?c*";
+true
source
Automa.RegExp.onexit!Function
onexit!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading the first byte no longer part of regex re, or upon encountering an expected end-of-file. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onall!, onfinal!

Example

julia> regex = re"ab?c*";
 
 julia> regex2 = onexit!(regex, :exiting_regex);
 
 julia> regex === regex2
-true
source
Automa.RegExp.onall!Function
onall!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading any byte part of the regex re. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onexit!, onfinal!

Example

julia> regex = re"ab?c*";
+true
source
Automa.RegExp.onall!Function
onall!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading any byte part of the regex re. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onexit!, onfinal!

Example

julia> regex = re"ab?c*";
 
 julia> regex2 = onall!(regex, :reading_re_byte);
 
 julia> regex === regex2
-true
source
Automa.RegExp.onfinal!Function
onfinal!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading the last byte of regex re. If re does not have a definite final byte, e.g. re"a(bc)*", where more "bc" can always be added, compiling the regex will error after setting a final action. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onall!, onexit!

Example

julia> regex = re"ab?c";
+true
source
Automa.RegExp.onfinal!Function
onfinal!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading the last byte of regex re. If re does not have a definite final byte, e.g. re"a(bc)*", where more "bc" can always be added, compiling the regex will error after setting a final action. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onall!, onexit!

Example

julia> regex = re"ab?c";
 
 julia> regex2 = onfinal!(regex, :entering_last_byte);
 
@@ -99,24 +99,24 @@
 true
 
 julia> compile(onfinal!(re"ab?c*", :does_not_work))
-ERROR: [...]
source
Automa.RegExp.precond!Function
precond!(re::RE, s::Symbol; [when=:enter], [bool=true]) -> re

Set re's precondition to s. Before any state transition into re, or inside re, the precondition s is checked, and the transition is only taken if s evaluates to bool.

when controls if the condition is checked when the regex is entered (if :enter), or at every state transition inside the regex (if :all)

Example

julia> regex = re"ab?c*";
+ERROR: [...]
source
Automa.RegExp.precond!Function
precond!(re::RE, s::Symbol; [when=:enter], [bool=true]) -> re

Set re's precondition to s. Before any state transition into re, or inside re, the precondition s is checked, and the transition is only taken if s evaluates to bool.

when controls if the condition is checked when the regex is entered (if :enter), or at every state transition inside the regex (if :all)

Example

julia> regex = re"ab?c*";
 
 julia> regex2 = precond!(regex, :some_condition);
 
 julia> regex === regex2
-true
source
Automa.generate_codeFunction
generate_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr

Generate init and exec code for machine. This is the default code generator function for creating functions; prefer it over generating init and exec code directly, due to its convenience. Shorthand for producing the concatenated code of:

  • generate_init_code(ctx, machine)
  • generate_action_code(ctx, machine, actions)
  • generate_input_error_code(ctx, machine) [elided if actions == :debug]

Examples

@eval function foo(data)
+true
source
Automa.generate_codeFunction
generate_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr

Generate init and exec code for machine. This is the default code generator function for creating functions; prefer it over generating init and exec code directly, due to its convenience. Shorthand for producing the concatenated code of:

  • generate_init_code(ctx, machine)
  • generate_action_code(ctx, machine, actions)
  • generate_input_error_code(ctx, machine) [elided if actions == :debug]

Examples

@eval function foo(data)
     # Initialize variables used in actions
     data_buffer = UInt8[]
     $(generate_code(machine, actions))
     return data_buffer
-end

See also: generate_init_code, generate_exec_code

source
Automa.generate_init_codeFunction
generate_init_code([::CodeGenContext], machine::Machine)::Expr

Generate variable initialization code, initializing variables such as p and p_end. The names of these variables are set by the CodeGenContext. If not passed, the context defaults to DefaultCodeGenContext.

Prefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.

Example

@eval function foo(data)
+end

See also: generate_init_code, generate_exec_code

source
Automa.generate_init_codeFunction
generate_init_code([::CodeGenContext], machine::Machine)::Expr

Generate variable initialization code, initializing variables such as p and p_end. The names of these variables are set by the CodeGenContext. If not passed, the context defaults to DefaultCodeGenContext.

Prefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.

Example

@eval function foo(data)
     $(generate_init_code(machine))
     p = 2 # maybe I want to start from position 2, not 1
     $(generate_exec_code(machine, actions))
     return cs
-end

See also: generate_code, generate_exec_code

source
Automa.generate_exec_codeFunction
generate_exec_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr

Generate machine execution code with actions. This code should be run after the machine has been initialized with generate_init_code. If not passed, the context defaults to DefaultCodeGenContext.

Prefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.

Examples

@eval function foo(data)
+end

See also: generate_code, generate_exec_code

source
Automa.generate_exec_codeFunction
generate_exec_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr

Generate machine execution code with actions. This code should be run after the machine has been initialized with generate_init_code. If not passed, the context defaults to DefaultCodeGenContext.

Prefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.

Examples

@eval function foo(data)
     $(generate_init_code(machine))
     p = 2 # maybe I want to start from position 2, not 1
     $(generate_exec_code(machine, actions))
     return cs
-end

See also: generate_code, generate_init_code

source
+end

See also: generate_code, generate_init_code

source diff --git a/previews/PR119/reader/index.html b/previews/PR119/reader/index.html index 5db73867..86e2a992 100644 --- a/previews/PR119/reader/index.html +++ b/previews/PR119/reader/index.html @@ -57,4 +57,4 @@ Seq("tag", "GAGATATA") julia> read_record(reader) -ERROR: EOFError: read end of file +ERROR: EOFError: read end of file diff --git a/previews/PR119/regex/index.html b/previews/PR119/regex/index.html index 9abc4ec0..5ab1d3a1 100644 --- a/previews/PR119/regex/index.html +++ b/previews/PR119/regex/index.html @@ -13,5 +13,5 @@ true julia> compile(regex) isa Automa.Machine -true

See also: [@re_str](@ref), [@compile](@ref)

source
Automa.RegExp.@re_strMacro
@re_str -> RE

Construct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re"\\" means a single backslash, and so does re"\\\\", while re"\\\\\"" means a backslash, then a quote character.

Examples:

julia> re"ab?c*[def][^ghi]+" isa RE
-true 

See also: RE

source
+true

See also: [@re_str](@ref), [@compile](@ref)

source
Automa.RegExp.@re_strMacro
@re_str -> RE

Construct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re"\\" means a single backslash, and so does re"\\\\", while re"\\\\\"" means a backslash, then a quote character.

Examples:

julia> re"ab?c*[def][^ghi]+" isa RE
+true 

See also: RE

source
diff --git a/previews/PR119/search/index.html b/previews/PR119/search/index.html index 29ea2555..556fe78b 100644 --- a/previews/PR119/search/index.html +++ b/previews/PR119/search/index.html @@ -1,2 +1,2 @@ -Search · Automa.jl

Loading search...

    +Search · Automa.jl

    Loading search...

      diff --git a/previews/PR119/search_index.js b/previews/PR119/search_index.js index a81a8f97..be030f73 100644 --- a/previews/PR119/search_index.js +++ b/previews/PR119/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"debugging/#Debugging-Automa","page":"Debugging Automa","title":"Debugging Automa","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"danger: Danger\nAll Automa's debugging tools are NOT part of the API and are subject to change without warning. You can use them during development, but do NOT rely on their behaviour in your final code.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Automa is a complicated package, and the process of indirectly designing parsers by first designing a machine can be error prone. Therefore, it's crucial to have good debugging tooling.","category":"page"},{"location":"debugging/#Revise","page":"Debugging Automa","title":"Revise","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Revise is not able to update Automa-generated functions. To make your feedback loop faster, you can manually re-run the code that defines the Automa functions - usually this is much faster than modifying the package and reloading it.","category":"page"},{"location":"debugging/#Ambiguity-check","page":"Debugging Automa","title":"Ambiguity check","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"It is easy to accidentally create a machine where it is undecidable what actions should be taken. 
For example:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"machine = let\n alphabet = re\"BC\"\n band = onenter!(re\"BBA\", :cool_band)\n compile(re\"XYZ A\" * (alphabet | band))\nend\n\n# output\nERROR: Ambiguous NFA.\n[...]","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Consider what the machine should do once it observes the two first bytes AB of the input: Is the B part of alphabet (in which case it should do nothing), or is it part of band (in which case it should do the action :cool_band)? It's impossible to tell.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Automa will not compile this, and will raise the error:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"ERROR: Ambiguous NFA.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Note the error shows an example input which will trigger the ambiguity: XYZ A, then B. 
By simply running the input through in your head, you may discover yourself how the error happens.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"In the example above, the error was obvious, but consider this example:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"fasta_machine = let\n header = re\"[a-z]+\"\n seq_line = re\"[ACGT]+\"\n sequence = seq_line * rep('\\n' * seq_line)\n record = onexit!('>' * header * '\\n' * sequence, :emit_record)\n compile(rep(record * '\\n') * opt(record))\nend\n\n# output\nERROR: Ambiguous NFA.\n[...]","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"It's the same problem: After a sequence line you observe \\n: Is this the end of the sequence, or just a newline before another sequence line?","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"To work around it, consider when you know for sure you are out of the sequence: It's not before you see a new >, or end-of-file. In a sense, the trailing \\n really IS part of the sequence. 
So, really, your machine should regex similar to this","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"fasta_machine = let\n header = re\"[a-z]+\"\n seq_line = re\"[ACGT]+\"\n sequence = rep1(seq_line * '\\n')\n record = onexit!('>' * header * '\\n' * sequence, :emit_record)\n\n # A special record that can avoid a trailing newline, but ONLY if it's the last record\n record_eof = '>' * header * '\\n' * seq_line * rep('\\n' * seq_line) * opt('\\n')\n compile(rep(record * '\\n') * opt(record_eof))\nend\n@assert fasta_machine isa Automa.Machine\n\n# output","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"When all else fails, you can also pass unambiguous=false to the compile function - but beware! Ambiguous machines has undefined behaviour if you get into an ambiguous situation.","category":"page"},{"location":"debugging/#Create-Machine-flowchart","page":"Debugging Automa","title":"Create Machine flowchart","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The function machine2dot(::Machine) will return a string with a Graphviz .dot formatted flowchart of the machine. Graphviz can then convert the dot file to an SVG function.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"On my computer (with Graphviz and Firefox installed), I can use the following Julia code to display a flowchart of a machine. 
Note that dot is the command-line name of Graphviz.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"function display_machine(m::Machine)\n open(\"/tmp/machine.dot\", \"w\") do io\n println(io, Automa.machine2dot(m))\n end\n run(pipeline(`dot -Tsvg /tmp/machine.dot`, stdout=\"/tmp/machine.svg\"))\n run(`firefox /tmp/machine.svg`)\nend","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The following function are Automa internals, but they might help with more advanced debugging:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"re2nfa - create an NFA from an Automa regex\nnfa2dot - create a dot-formatted string from an nfa\nnfa2dfa - create a DFA from an NFA\ndfa2dot - create a dot-formatted string from a DFA","category":"page"},{"location":"debugging/#Running-machines-in-debug-mode","page":"Debugging Automa","title":"Running machines in debug mode","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The function generate_code takes an argument actions. If this is :debug, then all actions in the given Machine will be replaced by :(push!(logger, action_name)). 
Hence, given a FASTA machine, you could create a debugger function:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":" @eval function debug(data)\n logger = []\n $(generate_code(fasta_machine, :debug))\n logger\nend","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Then see all the actions executed in order, by doing:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"julia> debug(\">abc\\nTAG\")\n4-element Vector{Any}:\n :mark\n :header\n :mark\n :seqline\n :record","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Note that if your machine relies on its actions to work correctly, for example by actions modifying p, this kind of debugger will not work, as it replaces all actions.","category":"page"},{"location":"debugging/#More-advanced-debuggning","page":"Debugging Automa","title":"More advanced debuggning","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The file test/debug.jl contains extra debugging functionality and may be included. 
In particular it defines the functions debug_execute and create_debug_function.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The function of create_debug_function(::Machine; ascii=false) is best demonstrated:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"machine = let\n letters = onenter!(re\"[a-z]+\", :enter_letters)\n compile(onexit!(letters * re\",[0-9],\" * letters, :exiting_regex))\nend\neval(create_debug_function(machine; ascii=true))\n(end_state, transitions) = debug_compile(\"abc,5,d!\")\n@show end_state\ntransitions","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Will create the following output:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"end state = -6\n7-element Vector{Tuple{Char, Int64, Vector{Symbol}}}:\n ('a', 2, [:enter_letters])\n ('b', 2, [])\n ('c', 2, [])\n (',', 3, [])\n ('5', 4, [])\n (',', 5, [])\n ('d', 6, [:enter_letters])","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Where each 3-tuple in the input corresponds to the input byte (displayed as a Char if ascii is set to true), the Automa state reached on reading the letter, and the actions executed.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The debug_execute function works the same as the debug_compile, but does not need to be generated first, and can be run directly on an Automa regex:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"julia> debug_execute(re\"[A-z]+\", \"abc1def\"; ascii=true)\n(-3, Tuple{Union{Nothing, Char}, Int64, Vector{Symbol}}[('a', 2, []), ('b', 3, []), ('c', 3, [])])","category":"page"},{"location":"debugging/","page":"Debugging 
Automa","title":"Debugging Automa","text":"machine2dot","category":"page"},{"location":"debugging/#Automa.machine2dot","page":"Debugging Automa","title":"Automa.machine2dot","text":"machine2dot(machine::Machine)::String\n\nReturn a String with a flowchart of the machine in Graphviz (dot) format. Using Graphviz, a command-line tool, the dot file can be converted to various picture formats.\n\nExample\n\nopen(\"/tmp/machine.dot\", \"w\") do io\n println(io, machine2dot(machine))\nend\n# Requires graphviz to be installed\nrun(pipeline(`dot -Tsvg /tmp/machine.dot`), stdout=\"/tmp/machine.svg\")\n\n\n\n\n\n","category":"function"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"tokenizer/#Tokenizers-(lexers)","page":"Tokenizers","title":"Tokenizers (lexers)","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"A tokenizer or a lexer is a program that breaks down an input text into smaller chunks, and classifies them as one of several tokens. For example, consider an imagininary format that only consists of nested tuples of strings containing letters, like this:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"((\"ABC\", \"v\"),((\"x\", (\"pj\",((\"a\", \"k\")), (\"L\")))))","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Any text of this format can be broken down into a sequence of the following tokens:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Left parenthesis: re\"\\(\"\nRight parenthesis: re\"\\)\"\nComma: re\",\"\nQuote: re\"\\\"\"\nSpaces: re\" +\"\nLetters: re\"[A-Za-z]+\"","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Such that e.g. 
(\"XY\", \"A\") can be represented as lparent, quote, XY, quote, comma, space, quote A quote rparens.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Breaking the text down to its tokens is called tokenization or lexing. Note that lexing in itself is not sufficient to parse the format: Lexing is context unaware, so e.g. the test \"((A can be perfectly well tokenized to quote lparens lparens A, even if it's invalid.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"The purpose of tokenization is to make subsequent parsing easier, because each part of the text has been classified. That makes it easier to, for example, to search for letters in the input. Instead of having to muck around with regex to find the letters, you use regex once to classify all text.","category":"page"},{"location":"tokenizer/#Making-and-using-a-tokenizer","page":"Tokenizers","title":"Making and using a tokenizer","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Let's use the example above to create a tokenizer. 
The most basic default tokenizer uses UInt32 as tokens: You pass in a list of regex matching each token, then evaluate the resulting code:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> make_tokenizer(\n [re\"\\(\", re\"\\)\", re\",\", re\"\\\"\", re\" +\", re\"[a-zA-Z]+\"]\n ) |> eval","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Since the default tokenizer uses UInt32 as tokens, you can then obtain a lazy iterator of tokens by calling tokenize(UInt32, data):","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> iterator = tokenize(UInt32, \"\"\"(\"XY\", \"A\")\"\"\"); typeof(iterator)\nTokenizer{UInt32, String, 1}","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"This will return Tuple{Int64, Int32, UInt32} elements, with each element being:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"The start index of the token\nThe length of the token\nThe token itself, in this example UInt32(1) for '(', UInt32(2) for ')' etc: ","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> collect(iterator)\n10-element Vector{Tuple{Int64, Int32, UInt32}}:\n (1, 1, 0x00000001)\n (2, 1, 0x00000004)\n (3, 2, 0x00000006)\n (5, 1, 0x00000004)\n (6, 1, 0x00000003)\n (7, 1, 0x00000005)\n (8, 1, 0x00000004)\n (9, 1, 0x00000006)\n (10, 1, 0x00000004)\n (11, 1, 0x00000002)","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Any data which could not be tokenized is given the error token UInt32(0):","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> collect(tokenize(UInt32, \"XY!!)\"))\n3-element Vector{Tuple{Int64, Int32, UInt32}}:\n (1, 2, 0x00000006)\n (3, 2, 0x00000000)\n (5, 1, 
0x00000002)","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Both tokenize and make_tokenizer take an optional argument version, which is 1 by default. This sets the last parameter of the Tokenizer struct, and as such allows you to create multiple different tokenizers with the same element type.","category":"page"},{"location":"tokenizer/#Using-enums-as-tokens","page":"Tokenizers","title":"Using enums as tokens","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Using UInt32 as tokens is not very convenient - so it's possible to use enums to create the tokenizer:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> @enum Token error lparens rparens comma quot space letters\n\njulia> make_tokenizer((error, [\n lparens => re\"\\(\",\n rparens => re\"\\)\",\n comma => re\",\",\n quot => re\"\\\"\",\n space => re\" +\",\n letters => re\"[a-zA-Z]+\"\n ])) |> eval\n\njulia> collect(tokenize(Token, \"XY!!)\"))\n3-element Vector{Tuple{Int64, Int32, Token}}:\n (1, 2, letters)\n (3, 2, error)\n (5, 1, rparens)","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"To make it even easier, you can define the enum and the tokenizer in one go:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"tokens = [\n :lparens => re\"\\(\",\n :rparens => re\"\\)\",\n :comma => re\",\",\n :quot => re\"\\\"\",\n :space => re\" +\",\n :letters => re\"[a-zA-Z]+\"\n]\n@eval @enum Token error $(first.(tokens)...)\nmake_tokenizer((error, \n [Token(i) => j for (i,j) in enumerate(last.(tokens))]\n)) |> eval","category":"page"},{"location":"tokenizer/#Token-disambiguation","page":"Tokenizers","title":"Token disambiguation","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"It's possible to create a tokenizer where the 
different token regexes overlap:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> make_tokenizer([re\"[ab]+\", re\"ab*\", re\"ab\"]) |> eval","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"In this case, an input like ab will match all three regex. Which tokens are emitted is determined by two rules:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"First, the emitted tokens will be as long as possible. So, the input aa could be emitted as one token of the regex re\"[ab]+\", as two tokens of the same regex, or as two tokens of the regex re\"ab*\". In this case, it will be emitted as a single token of re\"[ab]+\", since that will make the first token as long as possible (2 bytes), whereas the other options would only make it 1 byte long.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Second, tokens with a higher index in the input array beat previous tokens. So, a will be emitted as re\"ab*\", as its index of 2 beats the previous regex re\"[ab]+\" with the index 1, and ab will match the third regex.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"If you don't want emitted tokens to depend on these priority rules, you can set the optional keyword unambiguous=true in the make_tokenizer function, in which case make_tokenizer will error if any input text could be broken down into different tokens. 
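For illustration, the two priority rules can be mimicked in plain Julia. This sketch is hypothetical: it uses Julia's PCRE regexes as stand-ins for the Automa regexes, and is unrelated to Automa's actual generated code:

```julia
# Pick the winning token at the start of `input` under the two rules above:
# longest match wins; on a tie, the pattern with the highest index wins.
function pick_token(patterns::Vector{Regex}, input::AbstractString)
    best_len, best_idx = 0, 0
    for (i, pat) in enumerate(patterns)
        m = match(pat, input)
        if m !== nothing
            len = length(m.match)
            # `>=` (not `>`) implements the "higher index beats" tie-break
            if len > 0 && len >= best_len
                best_len, best_idx = len, i
            end
        end
    end
    (best_idx, best_len)  # index 0 plays the role of the error token
end

patterns = [r"^[ab]+", r"^ab*", r"^ab"]
pick_token(patterns, "aa")  # (1, 2): only the first pattern can take both bytes
pick_token(patterns, "a")   # (2, 1): all matches are 1 byte; highest index wins
pick_token(patterns, "ab")  # (3, 2): all matches are 2 bytes; highest index wins
```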
However, note that this may cause most tokenizers to error when being built, as most tokenization processes are ambiguous.","category":"page"},{"location":"tokenizer/#Reference","page":"Tokenizers","title":"Reference","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Automa.Tokenizer\nAutoma.tokenize\nAutoma.make_tokenizer","category":"page"},{"location":"tokenizer/#Automa.Tokenizer","page":"Tokenizers","title":"Automa.Tokenizer","text":"Tokenizer{E, D, C}\n\nLazy iterator of tokens of type E over data of type D.\n\nTokenizer works on any buffer-like object that defines pointer and sizeof. When iterated, it will return a 3-tuple of integers: * The first is the 1-based starting index of the token in the buffer * The second is the length of the token in bytes * The third is the token kind: The index in the input list tokens.\n\nUn-tokenizable data will be emitted as the \"error token\" with index zero.\n\nThe Int C parameter allows multiple tokenizers to be created with the otherwise same type parameters.\n\nSee also: make_tokenizer\n\n\n\n\n\n","category":"type"},{"location":"tokenizer/#Automa.tokenize","page":"Tokenizers","title":"Automa.tokenize","text":"tokenize(::Type{E}, data, version=1)\n\nCreate a Tokenizer{E, typeof(data), version}, iterating tokens of type E over data.\n\nSee also: Tokenizer, make_tokenizer, compile\n\n\n\n\n\n","category":"function"},{"location":"tokenizer/#Automa.make_tokenizer","page":"Tokenizers","title":"Automa.make_tokenizer","text":"make_tokenizer(\n machine::TokenizerMachine;\n tokens::Tuple{E, AbstractVector{E}}= [ integers ],\n goto=true, version=1\n) where E\n\nCreate code which when evaluated, defines Base.iterate(::Tokenizer{E, D, $version}). 
tokens is a tuple of the error token, which will be emitted for data that cannot be tokenized, and a vector of non-error tokens of length machine.n_tokens.\n\nExample usage\n\njulia> machine = compile([re\"a\", re\"b\"]);\n\njulia> make_tokenizer(machine; tokens=(0x00, [0x01, 0x02])) |> eval\n\njulia> iter = tokenize(UInt8, \"abxxxba\"); typeof(iter)\nTokenizer{UInt8, String, 1}\n\njulia> collect(iter)\n5-element Vector{Tuple{Int64, Int32, UInt8}}:\n (1, 1, 0x01)\n (2, 1, 0x02)\n (3, 3, 0x00)\n (6, 1, 0x02)\n (7, 1, 0x01)\n\nAny actions inside the input regexes will be ignored. If goto (default), use the faster, but more complex goto code generator. The version number will set the last parameter of the Tokenizer, which allows you to create different tokenizers for the same element type.\n\nSee also: Tokenizer, tokenize, compile\n\n\n\n\n\nmake_tokenizer(\n tokens::Union{\n AbstractVector{RE},\n Tuple{E, AbstractVector{Pair{E, RE}}}\n };\n goto::Bool=true,\n version::Int=1,\n unambiguous=false\n) where E\n\nConvenience function for first compiling a tokenizer machine, then running make_tokenizer on it. If tokens is an abstract vector, create an iterator of integer tokens with the error token being zero and the non-error tokens being the index in the vector. Else, tokens is the error token followed by token => regex pairs. 
See the relevant other methods of make_tokenizer, and compile.\n\nExample\n\njulia> make_tokenizer([re\"abc\", re\"def\"]) |> eval\n\njulia> collect(tokenize(UInt32, \"abcxyzdef123\"))\n4-element Vector{Tuple{Int64, Int32, UInt32}}:\n (1, 3, 0x00000001)\n (4, 3, 0x00000000)\n (7, 3, 0x00000002)\n (10, 3, 0x00000000)\n\n\n\n\n\n","category":"function"},{"location":"validators/","page":"Validators","title":"Validators","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"validators/#Text-validators","page":"Validators","title":"Text validators","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"The simplest use of Automa is to simply match a regex. It's unlikely you are going to want to use Automa for this instead of Julia's built-in regex engine PCRE, unless you need the extra performance that Automa brings over PCRE. Nonetheless, it serves as a good starting point to introduce Automa.","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"Suppose we have the FASTA regex from the regex page:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> fasta_regex = let\n header = re\"[a-z]+\"\n seqline = re\"[ACGT]+\"\n record = '>' * header * '\\n' * rep1(seqline * '\\n')\n rep(record)\n end;","category":"page"},{"location":"validators/#Buffer-validator","page":"Validators","title":"Buffer validator","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"Automa comes with a convenience function generate_buffer_validator:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"Given a regex (RE) like the one above, we can do:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> 
eval(generate_buffer_validator(:validate_fasta, fasta_regex));\n\njulia> validate_fasta\nvalidate_fasta (generic function with 1 method)","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"And we now have a function that checks if some data matches the regex:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> validate_fasta(\">hello\\nTAGAGA\\nTAGAG\") # missing trailing newline\n0\n\njulia> validate_fasta(\">helloXXX\") # Error at byte index 7\n7\n\njulia> validate_fasta(\">hello\\nTAGAGA\\nTAGAG\\n\") # nothing; it matches","category":"page"},{"location":"validators/#IO-validators","page":"Validators","title":"IO validators","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"For large files, having to read the data into a buffer to validate it may not be possible. Automa also supports creating IO validators with the generate_io_validator function:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"This works very similarly to generate_buffer_validator, but the generated function takes an IO, and has a different return value:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"If the data matches, still return nothing\nElse, return (byte, (line, column)) where byte is the first errant byte, and (line, column) the position of the byte. If the errant byte is a newline, column is 0. 
If the input reaches unexpected EOF, byte is nothing, and (line, column) points to the last line/column in the IO:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> eval(generate_io_validator(:validate_io, fasta_regex));\n\njulia> validate_io(IOBuffer(\">hello\\nTAGAGA\\n\"))\n\njulia> validate_io(IOBuffer(\">helX\"))\n(0x58, (1, 5))\n\njulia> validate_io(IOBuffer(\">hello\\n\\n\"))\n(0x0a, (3, 0))\n\njulia> validate_io(IOBuffer(\">hello\\nAC\"))\n(nothing, (2, 2))","category":"page"},{"location":"validators/#Reference","page":"Validators","title":"Reference","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"Automa.generate_buffer_validator\nAutoma.generate_io_validator\nAutoma.compile","category":"page"},{"location":"validators/#Automa.generate_buffer_validator","page":"Validators","title":"Automa.generate_buffer_validator","text":"generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)\n\nGenerate code that, when evaluated, defines a function named name, which takes a single argument data, interpreted as a sequence of bytes. The function returns nothing if data matches Machine, else the index of the first invalid byte. If the machine reached unexpected EOF, returns 0. If goto, the function uses the faster but more complicated :goto code. If docstring, automatically create a docstring for the generated function.\n\n\n\n\n\n","category":"function"},{"location":"validators/#Automa.generate_io_validator","page":"Validators","title":"Automa.generate_io_validator","text":"generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)\n\nNOTE: This method requires TranscodingStreams to be loaded\n\nCreate code that, when evaluated, defines a function named funcname. This function takes an IO, and checks if the data in the input conforms to the regex, without executing any actions. If the input conforms, return nothing. 
Else, return (byte, (line, col)), where byte is the first invalid byte, and (line, col) the 1-indexed position of that byte. If the invalid byte is a newline, col is 0 and the line number is incremented. If the input errors due to unexpected EOF, byte is nothing, and the line and column given point to the last byte in the file. If goto, the function uses the faster but more complicated :goto code.\n\n\n\n\n\n","category":"function"},{"location":"validators/#Automa.compile","page":"Validators","title":"Automa.compile","text":"compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine\n\nCompile a finite state machine (FSM) from re. If optimize, attempt to minimize the number of states in the FSM. If unambiguous, disallow creation of FSM where the actions are not deterministic.\n\nExamples\n\nmachine = let\n name = re\"[A-Z][a-z]+\"\n first_last = name * re\" \" * name\n last_first = name * re\", \" * name\n compile(first_last | last_first)\nend\n\n\n\n\n\ncompile(tokens::Vector{RE}; unambiguous=false)::TokenizerMachine\n\nCompile the regex tokens to a tokenizer machine. The machine can be passed to make_tokenizer.\n\nThe keyword unambiguous decides which of multiple matching tokens is emitted: If false (default), the longest token is emitted. If multiple tokens have the same length, the one with the highest index is returned. 
If true, make_tokenizer will error if any possible input text can be ambiguously broken down into tokens.\n\nSee also: Tokenizer, make_tokenizer, tokenize\n\n\n\n\n\n","category":"function"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"reader/#Creating-a-Reader-type","page":"Creating readers","title":"Creating a Reader type","text":"","category":"section"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"The use of generate_reader as we learned in the previous section \"Parsing from an io\" has an issue we need to address: While we were able to read multiple records from the reader by calling read_record multiple times, no state was preserved between these calls, and so no state could be preserved between reading individual records. This is also what made it necessary to clumsily reset p after emitting each record.","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"Imagine you have a format with two kinds of records, A and B types. A records must come before B records in the file. Hence, while a B record can appear at any time, once you've seen a B record, there can't be any more A records. When reading records from the file, you must be able to store whether you've seen a B record.","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"We address this by creating a Reader type which wraps the IO being parsed, and which stores any state we want to preserve between records. 
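As a minimal sketch of that bookkeeping (hypothetical names; no Automa machinery involved), such a reader can carry a seen_b flag next to the wrapped IO:

```julia
# Hypothetical reader for the A-before-B format described above.
mutable struct ABReader{S <: IO}
    io::S
    seen_b::Bool
end
ABReader(io::IO) = ABReader{typeof(io)}(io, false)

# Enforce the ordering rule while recording state between records.
function check_record!(reader::ABReader, kind::Symbol)
    if kind == :B
        reader.seen_b = true
    elseif kind == :A && reader.seen_b
        error("A record encountered after a B record")
    end
    kind
end

reader = ABReader(IOBuffer("..."))
check_record!(reader, :A)  # fine
check_record!(reader, :B)  # fine, and flips seen_b
# check_record!(reader, :A) would now throw
```

The Reader type built in the rest of this section follows the same pattern, except the state it carries is the Automa machine state itself.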
Let's stick to our simplified FASTA format parsing sequences into Seq objects:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"struct Seq\n name::String\n seq::String\nend\n\nmachine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"This time, we use the following Reader type:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"mutable struct Reader{S <: TranscodingStream}\n io::S\n automa_state::Int\nend\n\nReader(io::TranscodingStream) = Reader{typeof(io)}(io, 1)\nReader(io::IO) = Reader(NoopStream(io))","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"The Reader contains an instance of TranscodingStream to read from, and stores the Automa state between records. The beginning state of Automa is always 1. We can now create our reader function like below. 
There are only three differences from the definitions in the previous section:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"I no longer have the code to decrement p in the :record action - because we can store the Automa state between records such that the machine can handle beginning in the middle of a record if necessary, there is no need to reset the value of p in order to restore the IO to the state right before each record.\nI return (cs, seq) instead of just seq, because I want to update the Automa state of the Reader, so when it reads the next record, it begins in the same state where the machine left off after the previous record\nIn the arguments, I add start_state, and in the initcode I set cs to the start state, so the machine begins from the correct state","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"actions = Dict{Symbol, Expr}(\n :mark_pos => :(@mark),\n :header => :(header = String(data[@markpos():p-1])),\n :seqline => :(append!(seqbuffer, data[@markpos():p-1])),\n :record => quote\n seq = Seq(header, String(seqbuffer))\n found_sequence = true\n @escape\n end\n)\n\ngenerate_reader(\n :read_record,\n machine;\n actions=actions,\n arguments=(:(start_state::Int),),\n initcode=quote\n seqbuffer = UInt8[]\n found_sequence = false\n header = \"\"\n cs = start_state\n end,\n loopcode=quote\n if (is_eof && p > p_end) || found_sequence\n @goto __return__\n end\n end,\n returncode=:(found_sequence ? 
(cs, seq) : throw(EOFError()))\n) |> eval","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"We then create a function that reads from the Reader, making sure to update the automa_state of the reader:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"function read_record(reader::Reader)\n (cs, seq) = read_record(reader.io, reader.automa_state)\n reader.automa_state = cs\n return seq\nend","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"Let's test it out:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"julia> reader = Reader(IOBuffer(\">a\\nT\\n>tag\\nGAG\\nATATA\\n\"));\n\njulia> read_record(reader)\nSeq(\"a\", \"T\")\n\njulia> read_record(reader)\nSeq(\"tag\", \"GAGATATA\")\n\njulia> read_record(reader)\nERROR: EOFError: read end of file","category":"page"},{"location":"theory/#Theory-of-regular-expressions","page":"Theory","title":"Theory of regular expressions","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"Most programmers are familiar with regular expressions, or regex, for short. What many programmers don't know is that regex have a deep theoretical underpinning, which is leaned on by regex engines to produce highly efficient code.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Informally, a regular expression can be thought of as any pattern that can be constructed from the following atoms:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"The empty string is a valid regular expression, i.e. re\"\"\nLiteral matching of a single symbol from a finite alphabet, such as a character, i.e. 
re\"p\"","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Atoms can be combined with the following operations, if R and P are two regular expressions:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Alternation, i.e. R | P, meaning either match R or P.\nConcatenation, i.e. R * P, meaning match first R, then P\nRepetition, i.e. R*, meaning match R zero or more times consecutively.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"note: Note\nIn Automa, the alphabet is bytes, i.e. 0x00:0xff, and so each symbol is a single byte. Multi-byte characters such as Æ are interpreted as the concatenation of two symbols, re\"\\xc3\" * re\"\\x86\". The fact that Automa considers one input to be one byte, not one character, can become relevant if you instruct Automa to complete an action \"on every input\".","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Popular regex libraries include more operations like ? and +. These can trivially be constructed from the above mentioned primitives, i.e. R? is \"\" | R, and R+ is RR*.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Some implementations of regular expression engines, such as PCRE which is the default in Julia as of Julia 1.8, also support operations like backreferences and lookbehind. 
These operations can NOT be constructed from the above atoms and axioms, meaning that PCRE expressions are not regular expressions in the theoretical sense.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"The practical importance of theoretically sound regular expressions is that there exist algorithms that can match regular expressions in O(N) time and O(1) space, whereas this is not true for PCRE expressions, which are therefore significantly slower.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"note: Note\nAutoma.jl only supports real regex, and as such does not support e.g. backreferences, in order to guarantee fast runtime performance.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"To match regex to strings, the regex are transformed to finite automata, which are then implemented in code.","category":"page"},{"location":"theory/#Nondeterministic-finite-automata","page":"Theory","title":"Nondeterministic finite automata","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"The programmer Ken Thompson, of Unix fame, devised Thompson's construction, an algorithm to construct a nondeterministic finite automaton (NFA) from a regex. An NFA can be thought of as a flowchart (or a directed graph), where one can move from node to node on directed edges. 
Edges are either labeled ϵ, in which case the machine can freely move through the edge to its destination node, or labeled with one or more input symbols, in which case the machine may traverse the edge upon consuming said input.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"To illustrate, let's look at one of the simplest regex: re\"a\", matching the letter a:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"You begin at the small dot on the right, then immediately go to state 1, the circle marked by a 1. By moving to the next state, state 2, you consume the next symbol from the input string, which must be the symbol marked on the edge from state 1 to state 2 (in this case, an a). Some states are \"accept states\", illustrated by a double circle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Each of the operations that combine regex can also combine NFAs. 
For example, given the two regex a and b, which correspond to the NFAs A and B, the regex a * b can be expressed with the following NFA:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Note the ϵ symbol on the edge - this signifies an \"epsilon transition\", meaning you move directly from A to B without consuming any symbols.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Similarly, a | b corresponds to this NFA structure...","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"...and a* to this:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"For a larger example, re\"(\\+|-)?(0|1)*\" combines alternation, concatenation and repetition and so looks like this:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"ϵ-transitions mean that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 12. That's what makes NFAs nondeterministic.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"In order to match a regex to a string then, the movement through the NFA must be emulated. You begin at state 1. When a non-ϵ edge is encountered, you consume a byte of the input data if it matches. If there are no edges that match your input, the string does not match. If an ϵ-edge is encountered from state A that leads to states B and C, the machine goes from state A to state {B, C}, i.e. 
in both states at once.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"For example, if the regex re\"(\\+|-)?(0|1)*\" visualized above is matched to the string -11, this is what happens:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"NFA starts in state 1\nNFA immediately moves to all states reachable via ϵ transition. It is now in state {3, 5, 7, 9, 10}.\nNFA sees input -. States {3, 5, 7, 10} do not have an edge with - leading out, so these states die. Therefore, the machine is in state 9, consumes the input, and moves to state 2.\nNFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 5, 7}\nNFA sees input 1, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 5, 7}\nThe above point repeats, NFA is still in state {3, 5, 7}\nInput ends. Since state 3 is an accept state, the string matches.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Using only a regex-to-NFA converter, you could create a simple regex engine simply by emulating the NFA as above. The existence of ϵ transitions means the NFA can be in multiple states at once, which adds unwelcome complexity to the emulation and makes it slower. Luckily, every NFA has an equivalent deterministic finite automaton, which can be constructed from the NFA using the so-called powerset construction.","category":"page"},{"location":"theory/#Deterministic-finite-automata","page":"Theory","title":"Deterministic finite automata","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"Or DFAs, as they are called, are similar to NFAs, but do not contain ϵ-edges. This means that a given input string has either zero paths through the DFA (if it does not match the regex), or exactly one unambiguous path. 
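The set-of-live-states NFA emulation walked through above can be sketched in a few lines of plain Julia. The NFA below is hand-written with made-up state numbers (a simplified stand-in for the sign-and-binary-digits regex, not Automa's internal representation):

```julia
# Hand-written toy NFA (illustrative only): state 1 is the start, state 2 accepts.
# The ϵ-edge 1 → 2 makes the sign optional.
eps_edges = Dict(1 => [2])
edges = Dict((1, '+') => 2, (1, '-') => 2, (2, '0') => 2, (2, '1') => 2)
accept = Set([2])

# Follow ϵ-edges until no new states appear.
function closure(states::Set{Int})
    seen = copy(states)
    stack = collect(states)
    while !isempty(stack)
        s = pop!(stack)
        for t in get(eps_edges, s, Int[])
            t in seen || (push!(seen, t); push!(stack, t))
        end
    end
    seen
end

# Consume one input symbol from every live state at once.
function step(states::Set{Int}, c::Char)
    nxt = Set{Int}()
    for s in states
        t = get(edges, (s, c), 0)
        t == 0 || push!(nxt, t)
    end
    closure(nxt)
end

function nfa_match(input::AbstractString)
    states = closure(Set([1]))
    for c in input
        states = step(states, c)
        isempty(states) && return false  # all live states died: no match
    end
    !isempty(intersect(states, accept))
end
```

The need to track a whole set of states per input symbol is exactly the overhead that the powerset construction removes.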
In other words, every input symbol must trigger one unambiguous state transition from one state to one other state.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Let's visualize the DFA equivalent to the larger NFA above:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"It might not be obvious, but the DFA above accepts exactly the same inputs as the previous NFA. DFAs are way simpler to simulate in code than NFAs, precisely because at every state, for every input, there is exactly one action. DFAs can be simulated either using a lookup table of possible state transitions, or by hardcoding GOTO-statements from node to node when the correct input is matched. Code simulating DFAs can be ridiculously fast, with each state transition taking less than 1 nanosecond, if implemented well.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Furthermore, DFAs can be optimised. Two edges between the same nodes with labels A and B can be collapsed to a single edge with labels [AB], and redundant nodes can be collapsed. The optimised DFA equivalent to the one above is simply: ","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Unfortunately, as the name \"powerset construction\" hints, converting an NFA with N nodes may result in a DFA with up to 2^N nodes. This inconvenient fact drives important design decisions in regex implementations. There are basically two approaches:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Automa.jl will just construct the DFA directly, and accept a worst-case complexity of O(2^N). 
This is acceptable (I think) for Automa, because this construction happens in Julia's package precompilation stage (not on package loading or usage), and because the DFAs are assumed to be constants within a package. So, if a developer accidentally writes an NFA which is unacceptably slow to convert to a DFA, it will be caught in development. Luckily, it's pretty rare to have NFAs that result in truly abysmally slow conversions to DFAs: while bad corner cases exist, they are rarely as catastrophic as the O(2^N) bound would suggest. Currently, Automa's regex/NFA/DFA compilation pipeline is very slow and unoptimized, but, since it happens during precompile time, it is insignificant compared to LLVM compile times.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Other implementations, like the popular ripgrep command line tool, use an adaptive approach. They construct the DFA on the fly, as each symbol is being matched, and then cache the DFA. If the DFA size grows too large, the cache is flushed. If the cache is flushed too often, they fall back to simulating the NFA directly. Such an approach is necessary for ripgrep, because the regex -> NFA -> DFA compilation happens at runtime and must be near-instantaneous, unlike Automa, where it happens during package precompilation and can afford to be slow.","category":"page"},{"location":"theory/#Automa-in-a-nutshell","page":"Theory","title":"Automa in a nutshell","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"Automa simulates the DFA by having the DFA create a Julia Expr, which is then used to generate a Julia function using metaprogramming. 
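As a rough sketch of that workflow (the exact functions are covered in the following sections):\n\n# Compile a regex to a Machine (a DFA), then ask Automa for the Expr\n# that simulates it; the Expr can later be spliced into a function with @eval.\nmachine = compile(re\"ab*\")\ncode = generate_code(machine)\n@assert code isa Expr\n\n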
Like all other Julia code, this function is then optimized by Julia and then LLVM, making the DFA simulations very fast.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Because Automa just constructs Julia functions, we can do extra tricks that ordinary regex engines cannot: We can splice arbitrary Julia code into the DFA simulation. Currently, Automa supports two such kinds of code: actions, and preconditions.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Actions are Julia code that is executed during certain state transitions. Preconditions are Julia code that evaluates to a Bool value and is checked before a state transition. If it evaluates to false, the transition is not taken.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"CurrentModule = Automa\nDocTestSetup = quote\n    using TranscodingStreams\n    using Automa\nend","category":"page"},{"location":"custom/#Customizing-Automa's-code-generation","page":"Customizing codegen","title":"Customizing Automa's code generation","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Automa offers a few ways of customising the created code. Note that the precise code generated by Automa is considered an implementation detail, and as such is subject to change without warning. Only the overall behavior, i.e. the \"DFA simulation\", can be considered stable.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Nonetheless, it is instructive to look at the code generated for the machine in the \"parsing from a buffer\" section. 
I present it here cleaned up and with comments for human inspection.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"# Initialize variables used in the code below\nbyte::UInt8 = 0x00\np::Int = 1\np_end::Int = sizeof(data)\np_eof::Int = p_end\ncs::Int = 1\n\n# Turn the input buffer into SizedMemory, to load data from pointer\nGC.@preserve data begin\nmem::Automa.SizedMemory = (Automa.SizedMemory)(data)\n\n# For every input byte:\nwhile p ≤ p_end && cs > 0\n    # Load byte\n    byte = mem[p]\n\n    # Load the action to execute, if any, by looking up in a table\n    # using the current state (cs) and byte\n    @inbounds var\"##292\" = Int((Int8[0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0])[(cs - 1) << 8 + byte + 1])\n\n    # Look up next state. If invalid input, next state is negative current state\n    @inbounds cs = Int((Int8[-1 -2 … -5 -6; -1 -2 … -5 -6; … ; -1 -2 … -5 -6; -1 -2 … -5 -6])[(cs - 1) << 8 + byte + 1])\n\n    # Check each possible action looked up above, and execute it\n    # if it is not zero\n    if var\"##292\" == 1\n        pos = p\n    elseif var\"##292\" == 2\n        header = String(data[pos:p - 1])\n    elseif var\"##292\" == 3\n        append!(buffer, data[pos:p - 1])\n    elseif var\"##292\" == 4\n        seq = Seq(header, String(buffer))\n        push!(seqs, seq)\n    end\n\n    # Increment position by 1\n    p += 1\n\n# If we're at end of input, and the current state is an accept state:\nif p > p_eof ≥ 0 && cs > 0 && (cs < 65) & isodd(0x0000000000000021 >>> ((cs - 1) & 63))\n    # What follows is a list of all possible EOF actions.\n\n    # If state is state 6, execute the appropriate action\n    # tied to reaching end of input at this state\n    if cs == 6\n        seq = Seq(header, String(buffer))\n        push!(seqs, seq)\n        cs = 0\n\n# Else, if the state is < 0, we have taken a bad input (see where cs was updated)\n# move position back by one to leave it stuck where it found bad input\nelseif cs < 0\n    p -= 1\nend\n\n# If cs is not 0, the machine is in an error 
state.\n # Gather some information about machine state, then throw an error\n if cs != 0\n cs = -(abs(cs))\n var\"##291\" = if p_eof > -1 && p > p_eof\n nothing\n else\n byte\n end\n Automa.throw_input_error($machine, -cs, var\"##291\", mem, p)\n end\nend\nend # GC.@preserve","category":"page"},{"location":"custom/#Using-CodeGenContext","page":"Customizing codegen","title":"Using CodeGenContext","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The CodeGenContext (or ctx, for short) struct is a collection of settings used to customize code creation. If not passed to the code generator functions, a default CodeGenContext is used.","category":"page"},{"location":"custom/#Variable-names","page":"Customizing codegen","title":"Variable names","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"One obvious place to customize is variable names. In the code above, for example, the input bytes are named byte. What if you have another variable with that name?","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The ctx contains a .vars field with a Variables object, which is just a collection of names used in generated code. 
For example, to rename byte to u8 in the generated code, you first create the appropriate ctx, then use the ctx to make the code.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"ctx = CodeGenContext(vars=Automa.Variables(byte=:u8))\ncode = generate_code(ctx, machine, actions)","category":"page"},{"location":"custom/#Other-options","page":"Customizing codegen","title":"Other options","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The clean option strips most line number information from the generated code, if set to true.\ngetbyte is a function that is called like getbyte(data, p) to obtain the byte in the main loop. This is usually just Base.getindex, but can be customised to be an arbitrary function.","category":"page"},{"location":"custom/#Code-generator","page":"Customizing codegen","title":"Code generator","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The code shown at the top of this page was made with the table code generator. Automa also supports creating code using the goto code generator instead of the default table generator. 
The goto generator creates code with the following properties:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"It is much harder to read than table code\nThe code is much larger\nIt does not use boundschecking\nIt does not allow customizing getbyte\nIt is much faster than the table generator","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Normally, the table generator is good enough, but for performance sensitive applications, the goto generator can be used.","category":"page"},{"location":"custom/#Optimising-the-previous-example","page":"Customizing codegen","title":"Optimising the previous example","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Let's try optimising the previous FASTA parsing example. My original code did 300 MB/s.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"To recap, the Machine was:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"machine = let\n    header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n    seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n    record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n    compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The first improvement is to the algorithm itself: Instead of parsing to a vector of Seq, I'm simply going to index the input data, filling up an existing vector of:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"struct SeqPos\n    offset::Int\n    hlen::Int32\n    slen::Int32\nend","category":"page"},{"location":"custom/","page":"Customizing 
codegen","title":"Customizing codegen","text":"The idea here is to remove as many allocations as possible. This will more accurately show the speed of the DFA simulation, which is now the bottleneck. The actions will therefore be: ","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"actions = Dict(\n    :mark_pos => :(pos = p),\n    :header => :(hlen = p - pos),\n    :seqline => :(slen += p - pos),\n    :record => quote\n        seqpos = SeqPos(offset, hlen, slen)\n        nseqs += 1\n        seqs[nseqs] = seqpos\n        offset += hlen + slen\n        slen = 0\n    end\n);\n\n@assert actions isa Dict","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"With the new variables such as slen, we need to update the function code as well:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"@eval function parse_fasta(data)\n    pos = slen = hlen = offset = nseqs = 0\n    seqs = Vector{SeqPos}(undef, 400000)\n    $(generate_code(machine, actions))\n    return seqs\nend","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"This parses a 45 MB file in about 100 ms on my laptop, that's 450 MB/s. 
Now let's try the exact same, except with the code being generated by:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"$(generate_code(CodeGenContext(generator=:goto), machine, actions))","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Now the code parses the same 45 MB FASTA file in 11.14 milliseconds, parsing at about 4 GB/s.","category":"page"},{"location":"custom/#Reference","page":"Customizing codegen","title":"Reference","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Automa.CodeGenContext\nAutoma.Variables","category":"page"},{"location":"custom/#Automa.CodeGenContext","page":"Customizing codegen","title":"Automa.CodeGenContext","text":"CodeGenContext(;\n    vars=Variables(:p, :p_end, :is_eof, :cs, :data, :mem, :byte, :buffer),\n    generator=:table,\n    getbyte=Base.getindex,\n    clean=false\n)\n\nCreate a CodeGenContext (ctx), a struct that stores options for Automa code generation. Ctxs are used for Automa's various code generator functions. They currently take the following options (more may be added in future versions)\n\nvars::Variables: variable names used in generated code. See the Variables struct.\ngenerator::Symbol: code generator mechanism (:table or :goto). The table generator creates smaller, simpler code that uses a vector of integers to determine state transitions. The goto-generator uses a maze of @goto-statements, and creates larger, more complex code that is faster.\ngetbyte::Function (table generator only): function f(data, p) to access byte from data. 
Default: Base.getindex.\nclean: Whether to remove some QuoteNodes (line information) from the generated code\n\nExample\n\njulia> ctx = CodeGenContext(generator=:goto, vars=Variables(buffer=:tbuffer));\n\njulia> generate_code(ctx, compile(re\"a+\")) isa Expr\ntrue\n\n\n\n\n\n","category":"type"},{"location":"custom/#Automa.Variables","page":"Customizing codegen","title":"Automa.Variables","text":"Struct used to store variable names used in generated code. Contained in a CodeGenContext. Create a custom Variables for your CodeGenContext if you want to customize the variables used in Automa codegen, typically if you have conflicting variables with the same name.\n\nAutoma generates code with the following variables, shown below with their default names:\n\np::Int: current position of data\np_end::Int: end position of data\nis_eof::Bool: Whether p_end marks end of file stream\ncs::Int: current state\ndata::Any: input data\nmem::SizedMemory: Memory wrapping data\nbyte::UInt8: current byte being read from data\nbuffer::TranscodingStreams.Buffer: (generate_reader only)\n\nExample\n\njulia> ctx = CodeGenContext(vars=Variables(byte=:u8));\n\njulia> ctx.vars.byte\n:u8\n\n\n\n\n\n","category":"type"},{"location":"regex/","page":"Regex","title":"Regex","text":"CurrentModule = Automa\nDocTestSetup = quote\n    using TranscodingStreams\n    using Automa\nend","category":"page"},{"location":"regex/#Regex","page":"Regex","title":"Regex","text":"","category":"section"},{"location":"regex/","page":"Regex","title":"Regex","text":"Automa regex (of the type Automa.RE) are conceptually similar to the Julia built-in regex. They are made using the @re_str macro, like this: re\"ABC[DEF]\".","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"Automa regex matches individual bytes, not characters. 
Hence, re\"Æ\" (with the UTF-8 encoding [0xc3, 0x86]) is equivalent to re\"\xc3\x86\", and is considered the concatenation of two independent input bytes.","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"The @re_str macro supports the following content:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"Literal symbols, such as re\"ABC\", re\"\xfe\xa2\" or re\"Ø\"\n| for alternation, as in re\"A|B\", meaning \"A or B\". \nByte sets with [], like re\"[ABC]\". This means any of the bytes in the brackets, e.g. re\"[ABC]\" is equivalent to re\"A|B|C\".\nInverted byte sets, e.g. re\"[^ABC]\", meaning any byte, except those in re\"[ABC]\".\nRepetition, with X* meaning zero or more repetitions of X\n+, where X+ means XX*, i.e. 1 or more repetitions of X\n?, where X? means X | \"\", i.e. 0 or 1 occurrences of X. It applies to the last element of the regex\nParentheses to group expressions, like in A(B|C)?","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"You can combine regex with the following operations:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"* for concatenation, with re\"A\" * re\"B\" being the same as re\"AB\". Regex can also be concatenated with Chars and Strings, which will cause the chars/strings to be converted to regex first.\n| for alternation, with re\"A\" | re\"B\" being the same as re\"A|B\"\n& for intersection of regex, i.e. for regex A and B, the set of inputs matching A & B is exactly the intersection of the inputs matching A and those matching B. As an example, re\"A[AB]C+D?\" & re\"[ABC]+\" is re\"ABC\".\n\ for difference, such that for regex A and B, A \ B creates a new regex matching all those inputs that match A but not B.\n! for inversion, such that !re\"[A-Z]\" matches all strings other than those which match re\"[A-Z]\". Note that !re\"a\" also matches e.g. 
\"aa\", since this does not match re\"a\".","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"Finally, the functions opt, rep and rep1 are equivalent to the operators ?, * and +, so e.g. opt(re\"a\" * rep(re\"b\") * re\"c\") is equivalent to re\"(ab*c)?\".","category":"page"},{"location":"regex/#Example","page":"Regex","title":"Example","text":"","category":"section"},{"location":"regex/","page":"Regex","title":"Regex","text":"Suppose we want to create a regex that matches a simplified version of the FASTA format. This \"simple FASTA\" format is defined like so:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"The format is a series of zero or more records, concatenated\nA record consists of the concatenation of:\nA leading '>'\nA header, composed of one or more letters in 'a-z',\nA newline symbol '\\n'\nA series of one or more sequence lines\nA sequence line is the concatenation of:\nOne or more symbols from the alphabet [ACGT]\nA newline","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"We can represent this concisely as the regex re\"(>[a-z]+\\n([ACGT]+\\n)+)*\". To make it easier to read, we typically construct regex incrementally, like so:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"fasta_regex = let\n    header = re\"[a-z]+\"\n    seqline = re\"[ACGT]+\"\n    record = '>' * header * '\\n' * rep1(seqline * '\\n')\n    rep(record)\nend\n@assert fasta_regex isa RE","category":"page"},{"location":"regex/#Reference","page":"Regex","title":"Reference","text":"","category":"section"},{"location":"regex/","page":"Regex","title":"Regex","text":"RE\n@re_str","category":"page"},{"location":"regex/#Automa.RegExp.RE","page":"Regex","title":"Automa.RegExp.RE","text":"RE(s::AbstractString)\n\nAutoma regular expression (regex) that is used to match a sequence of input bytes. Regex should preferentially be constructed using the @re_str macro: re\"ab+c?\". 
Regex can be combined with other regex, strings or chars with *, |, & and \:\n\na * b matches inputs that match first a, then b\na | b matches inputs that match a or b\na & b matches inputs that match a and b\na \ b matches inputs that match a but not b\n!a matches all inputs that do not match a.\n\nSet actions to regex with onenter!, onexit!, onall! and onfinal!, and preconditions with precond!.\n\nExample\n\njulia> regex = (re\"a*b?\" | opt('c')) * re\"[a-z]+\";\n\njulia> regex = rep1((regex \ \"aba\") & !re\"ca\");\n\njulia> regex isa RE\ntrue\n\njulia> compile(regex) isa Automa.Machine\ntrue\n\nSee also: [@re_str](@ref), [@compile](@ref)\n\n\n\n\n\n","category":"type"},{"location":"regex/#Automa.RegExp.@re_str","page":"Regex","title":"Automa.RegExp.@re_str","text":"@re_str -> RE\n\nConstruct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re\"\\\" means a single backslash, and so does re\"\\\\\", while re\"\\\\\\\"\" means a backslash, then a quote character.\n\nExamples:\n\njulia> re\"ab?c*[def][^ghi]+\" isa RE\ntrue \n\nSee also: RE\n\n\n\n\n\n","category":"macro"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = Automa\nDocTestSetup = quote\n    using TranscodingStreams\n    using Automa\nend","category":"page"},{"location":"#Automa.jl","page":"Home","title":"Automa.jl","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Automa is a regex-to-Julia compiler. By compiling regex to Julia code in the form of Expr objects, Automa provides facilities to create efficient and robust regex-based lexers, tokenizers and parsers using Julia's metaprogramming capabilities. 
You can view Automa as a regex engine that can insert arbitrary Julia code into its input matching process, which will be executed when certain parts of the regex match an input.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: Schema of Automa.jl)","category":"page"},{"location":"","page":"Home","title":"Home","text":"Automa.jl is designed to generate very efficient code to scan large text data, which is often much faster than handcrafted code. Automa.jl is a regex engine that can insert arbitrary Julia code into its input matching process, which will be executed when certain parts of the regex match an input.","category":"page"},{"location":"#Where-to-start","page":"Home","title":"Where to start","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"If you're not familiar with regex engines, start by reading the theory section, then you might want to read every section from the top. They're structured like a tutorial, beginning from the simplest use of Automa and moving to more advanced uses.","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you'd like to dive straight in, you might want to start by reading the examples below, then go through the examples in the examples/ directory in the Automa repository.","category":"page"},{"location":"#Examples","page":"Home","title":"Examples","text":"","category":"section"},{"location":"#Validate-some-text-only-is-composed-of-ASCII-alphanumeric-characters","page":"Home","title":"Validate some text only is composed of ASCII alphanumeric characters","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"generate_buffer_validator(:validate_alphanumeric, re\"[a-zA-Z0-9]*\") |> eval\n\nfor s in [\"abc\", \"aU81m\", \"!,>\"]\n    println(\"$s is alphanumeric? 
$(isnothing(validate_alphanumeric(s)))\")\nend","category":"page"},{"location":"#Making-a-lexer","page":"Home","title":"Making a lexer","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"tokens = [\n :identifier => re\"[A-Za-z_][0-9A-Za-z_!]*\",\n :lparens => re\"\\(\",\n :rparens => re\"\\)\",\n :comma => re\",\",\n :quot => re\"\\\"\",\n :space => re\"[\\t\\f ]+\",\n];\n@eval @enum Token errortoken $(first.(tokens)...)\nmake_tokenizer((errortoken, \n [Token(i) => j for (i,j) in enumerate(last.(tokens))]\n)) |> eval\n\ncollect(tokenize(Token, \"\"\"(alpha, \"beta15\")\"\"\"))","category":"page"},{"location":"#Make-a-simple-TSV-file-parser","page":"Home","title":"Make a simple TSV file parser","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"machine = let\n name = onexit!(onenter!(re\"[^\\t\\r\\n]+\", :mark), :name)\n field = onexit!(onenter!(re\"[^\\t\\r\\n]+\", :mark), :field)\n nameline = name * rep('\\t' * name)\n record = onexit!(field * rep('\\t' * field), :record)\n compile(nameline * re\"\\r?\\n\" * record * rep(re\"\\r?\\n\" * record) * rep(re\"\\r?\\n\"))\nend\n\nactions = Dict(\n :mark => :(pos = p),\n :name => :(push!(headers, String(data[pos:p-1]))),\n :field => quote\n n_fields += 1\n push!(fields, String(data[pos:p-1]))\n end,\n :record => quote\n n_fields == length(headers) || error(\"Malformed TSV\")\n n_fields = 0\n end\n)\n\n@eval function parse_tsv(data)\n headers = String[]\n fields = String[]\n pos = n_fields = 0\n $(generate_code(machine, actions))\n (headers, reshape(fields, length(headers), :))\nend\n\nheader, data = parse_tsv(\"a\\tabc\\n12\\t13\\r\\nxyc\\tz\\n\\n\")","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"io/#Parsing-from-an-IO","page":"Parsing IOs","title":"Parsing from an 
IO","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Some file types are gigabytes or tens of gigabytes in size. For these files, parsing from a buffer may be impractical, as it requires you to read the entire file into memory at once. Automa enables streaming parsing by hooking into TranscodingStreams.jl, a package that provides a wrapper IO of the type TranscodingStream. Importantly, these streams buffer their input data. Automa is thus able to operate directly on the input buffers of TranscodingStream objects.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Unfortunately, this significantly complicates things compared to parsing from a simple buffer. The main problem is that, when reading from a buffered stream, the byte array visible from Automa is only a small slice of the total input data. Worse, when the end of the stream is reached, data from the buffer is flushed, i.e. removed from the stream. To handle this, Automa must reach deep into the implementation details of TranscodingStreams, and also break some of its own abstractions. It's not pretty, but it's what we have.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Practically speaking, parsing from IO is done with the function Automa.generate_reader. Despite its name, this function is NOT directly used to generate objects like FASTA.Reader. Instead, this function produces Julia code (an Expr object) that, when evaluated, defines a function that can execute an Automa machine on an IO. Let me first show the code generated by generate_reader in pseudocode format:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"function { function name }(stream::TranscodingStream, { args... 
})\n    { init code }\n\n    @label __exec__\n\n    p = current buffer position\n    p_end = final buffer position\n\n    # the eof call below will first flush any used data from buffer,\n    # then load in new data, before checking if it's really eof.\n    is_eof = eof(stream)\n    execute normal automa parsing of the buffer\n    update buffer position to match p\n\n    { loop code }\n\n    if cs < 0 # meaning: erroneous input or erroneous EOF\n        { error code }\n    end\n\n    if machine errored or reached EOF\n        @label __return__\n        { return code }\n    end\n    @goto __exec__\nend","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The content marked { function name }, { args... }, { init code }, { loop code }, { error code } and { return code } are arguments provided to Automa.generate_reader. By providing these, the user can customize the generated function further.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The main difference from the code generated to parse a buffer is the label/GOTO pair __exec__, which causes Automa to repeatedly load data into the buffer, execute the machine, then flush used data from the buffer, then execute the machine, and so on, until interrupted.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Importantly, when parsing from a buffer, p and p_end refer to the position in the current buffer. This may not be the position in the stream, and when the data in the buffer is flushed, it may move the data in the buffer so that p now becomes invalid. This means you can't simply store a variable marked_pos that points to the current value of p and expect that the same data is at that position later. 
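As a hypothetical illustration of the problem:\n\n# Suppose the buffer currently holds stream bytes 1000:2000, so p == 1\n# refers to stream byte 1000, and an action stores marked_pos = p.\n# After the used data is flushed and the buffer is refilled with stream\n# bytes 2000:3000, marked_pos still equals 1, but data[marked_pos] is now\n# stream byte 2000 - the byte that was marked is gone from the buffer.\n\n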
Furthermore, is_eof is set to whether the stream has reached EOF.","category":"page"},{"location":"io/#Example-use","page":"Parsing IOs","title":"Example use","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Let's show the simplest possible example of such a function. We have a Machine (which, recall, is a compiled regex) called machine, and we want to make a function that returns true if a given IO contains data that conforms to the regex format specified by the Machine.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"We will still use the machine from before, just without any actions:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"machine = let\n    header = re\"[a-z]+\"\n    seqline = re\"[ACGT]+\"\n    record = re\">\" * header * '\\n' * rep1(seqline * '\\n')\n    compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"To create our simple IO reader, we simply need to call generate_reader, where the { return code } is a check of iszero(cs), i.e. whether the machine exited at a proper exit state. We also need to set the error code so that, instead of throwing an error on invalid input, the function goes immediately to the return section - we call this section __return__, so we need to @goto __return__. 
Then, we need to evaluate the code created by generate_reader in order to define the function validate_fasta.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"julia> return_code = :(iszero(cs));\n\njulia> error_code = :(@goto __return__);\n\njulia> eval(generate_reader(:validate_fasta, machine; returncode=return_code, errorcode=error_code));","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The generated function validate_fasta has the function signature: validate_fasta(stream::TranscodingStream). If our input IO is not a TranscodingStream, we can wrap it in the relatively lightweight NoopStream, which, as the name suggests, does nothing to the data:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"julia> io = NoopStream(IOBuffer(\">a\\nTAG\\nTA\\n>bac\\nG\\n\"));\n\njulia> validate_fasta(io)\ntrue\n\njulia> validate_fasta(NoopStream(IOBuffer(\"random data\")))\nfalse","category":"page"},{"location":"io/#Reading-a-single-record","page":"Parsing IOs","title":"Reading a single record","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"danger: Danger\nThe following code is only for demonstration purposes. It has one important flaw, which will be addressed in a later section, so do not copy-paste it for serious work.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"There are a few more subtleties related to the generate_reader function. Suppose we instead want to create a function that reads a single FASTA record from an IO. In this case, it's no good that the function created from generate_reader will loop until the IO reaches EOF - we need to find a way to stop it after reading a single record. 
We can do this with the pseudomacro @escape, as shown below.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"We will reuse our Seq struct and our Machine from the \"parsing from a buffer\" section of this tutorial:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"struct Seq\n name::String\n seq::String\nend\n\nmachine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\nend\n@assert machine isa Automa.Machine\n\n# output","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The code below contains @escape in the :record action - meaning: Break out of machine execution.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"actions = Dict{Symbol, Expr}(\n :mark_pos => :(pos = p),\n :header => :(header = String(data[pos:p-1])),\n :seqline => :(append!(seqbuffer, data[pos:p-1])),\n\n # Only this action is different from before!\n :record => quote\n seq = Seq(header, String(seqbuffer))\n found_sequence = true\n # Reset p one byte if we're not at the end\n p -= !(is_eof && p > p_end)\n @escape\n end\n)\n@assert actions isa Dict\n\n# output","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"@escape is not actually a real macro, but what Automa calls a \"pseudomacro\". It is expanded during Automa's own compiler pass before Julia's lowering. 
The @escape pseudomacro is replaced with code that breaks it out of the executing machine, without reaching EOF or an invalid byte.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Let's see how I use generate_reader, then I will explain each part:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"generate_reader(\n :read_record,\n machine;\n actions=actions,\n initcode=quote\n seqbuffer = UInt8[]\n pos = 0\n found_sequence = false\n header = \"\"\n end,\n loopcode=quote\n if (is_eof && p > p_end) || found_sequence\n @goto __return__\n end\n end,\n returncode=:(found_sequence ? seq : nothing)\n) |> eval","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"In the :record action, a few new things happen.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"First, I set the flag found_sequence = true. In the loop code, I look for this flag to signal that the function should return. Remember, the loop code happens after machine execution, which can mean either that the execution was broken out of by @escape, or that the buffer ran out and needs to be refilled. I could just return the sequence directly in the action, but then I would skip a bunch of the code generated by generate_reader which sets the buffer state correctly, so this is never advised. Instead, in the loop code, which executes after the buffer has been flushed, I check for this flag, and go to __return__ if necessary. I could also just return directly in the loopcode, but I prefer only having one place to return from the function.\nI use @escape to break out of the machine, i.e. stop machine execution\nFinally, I decrement p, if and only if the machine has not reached EOF (which happens when is_eof is true, meaning the last part of the IO has been buffered, and p > p_end, meaning the end of the buffer has been reached). 
This is because the first record ends when the IO reads the second > symbol. If I then were to read another record from the same IO, I would have already read the > symbol. I need to reset p by 1, so the > is also read on the next call to read_record.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"I can use the function like this:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"julia> io = NoopStream(IOBuffer(\">a\\nT\\n>tag\\nGAGA\\nTATA\\n\"));\n\njulia> read_record(io)\nSeq(\"a\", \"T\")\n\njulia> read_record(io)\nSeq(\"tag\", \"GAGATATA\")\n\njulia> read_record(io)","category":"page"},{"location":"io/#Preserving-data-by-marking-the-buffer","page":"Parsing IOs","title":"Preserving data by marking the buffer","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"There is a problem with the implementation above: The following code in my actions dict:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"header = String(data[pos:p-1])","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Creates header by accessing the data buffer. However, when reading an IO, how can I know that the data hasn't shifted around in the buffer since I defined pos? For example, suppose we have a short buffer of only 8 bytes, and the following FASTA file: >abcdefghijkl\\nA. Then, the buffer is first filled with >abcdefg. When entering the header, I execute the action :mark_pos at p = 2, so pos = 2. But now, when I reach the end of the header, the used data in the buffer has been flushed, and the data is now: hijkl\\nA, and p = 14. I then try to access data[2:13], which is out of bounds!","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Luckily, the buffers of TranscodingStreams allow us to \"mark\" a position to save it. 
The buffer will not flush the marked position, or any position after the marked position. If necessary, it will resize the buffer to be able to load more data while keeping the marked position.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Inside the function generated by generate_reader, we can use the zero-argument pseudomacro @mark(), which marks the position p. The macro @markpos() can then be used to get the marked position, which will point to the same data in the buffer, even after the data in the buffer has been shifted when it is flushed. This works because the mark is stored inside the TranscodingStream buffer, and the buffer makes sure to update the mark if the content moves. Hence, we can re-write the actions:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"actions = Dict{Symbol, Expr}(\n :mark_pos => :(@mark),\n :header => :(header = String(data[@markpos():p-1])),\n :seqline => :(append!(seqbuffer, data[@markpos():p-1])),\n\n [:record action omitted...]\n)","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"In our example above with the small 8-byte buffer, this is what would happen: First, the buffer contains the first 8 bytes. When p = 2, the mark is set, and the second byte is marked:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: >abcdefg\nmark: ^\np = 2 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Then, when p = 9, the buffer is exhausted, the used data is removed, BUT the mark stays, so byte 2 is preserved, and only the first byte is removed. The code in generate_reader loops around to @label __exec__, which sets p to the current buffer position. 
The buffer now looks like this:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: abcdefgh\nmark: ^\np = 8 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Only 1 byte was cleared, so when p = 9, the buffer will be exhausted again. This time, no data can be cleared, so instead, the buffer is resized to fit more data:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: abcdefghijkl\\nA\nmark: ^\np = 9 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Finally, when we reach the newline p = 13, the whole header is in the buffer, and so data[@markpos():p-1] will correctly refer to the header (now, 1:12).","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: abcdefghijkl\\nA\nmark: ^\np = 13 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Remember to update the mark, or to clear it with @unmark() in order to be able to flush data from the buffer afterwards.","category":"page"},{"location":"io/#Reference","page":"Parsing IOs","title":"Reference","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Automa.generate_reader\nAutoma.@escape\nAutoma.@mark\nAutoma.@unmark\nAutoma.@markpos\nAutoma.@bufferpos\nAutoma.@relpos\nAutoma.@abspos\nAutoma.@setbuffer","category":"page"},{"location":"io/#Automa.generate_reader","page":"Parsing IOs","title":"Automa.generate_reader","text":"generate_reader(funcname::Symbol, machine::Automa.Machine; kwargs...)\n\nNOTE: This method requires TranscodingStreams to be loaded\n\nGenerate a streaming reader function of the name funcname from machine.\n\nThe generated function consumes data from a stream passed as the first argument and executes the machine with filling the data buffer.\n\nThis function returns an expression object of the 
generated function. The user needs to evaluate it in a module in which the generated function is needed.\n\nKeyword Arguments\n\narguments: Additional arguments funcname will take (default: ()). The default signature of the generated function is (stream::TranscodingStream,), but it is possible to supply more arguments to the signature with this keyword argument.\ncontext: Automa's code generation context (default: Automa.CodeGenContext()).\nactions: A dictionary of action code (default: Dict{Symbol,Expr}()).\ninitcode: Initialization code (default: :()).\nloopcode: Loop code (default: :()).\nreturncode: Return code (default: :(return cs)).\nerrorcode: Executed if cs < 0 after loopcode (default error message)\n\nSee the source code of this function to see what the generated code looks like.\n\n\n\n\n\n","category":"function"},{"location":"io/#Automa.@escape","page":"Parsing IOs","title":"Automa.@escape","text":"@escape()\n\nPseudomacro. When encountered during Machine execution, the machine will stop executing. This is useful to interrupt the parsing process, for example to emit a record during parsing of a larger file. p will be advanced as normal, so if @escape is hit on B during parsing of \"ABC\", the next byte will be C.\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@mark","page":"Parsing IOs","title":"Automa.@mark","text":"@mark()\n\nPseudomacro, to be used with IO-parsing Automa functions. This macro will \"mark\" the position of p in the current buffer. The marked position will not be flushed from the buffer after being consumed. For example, Automa code can call @mark() at the beginning of a large string, then when the string is exited at position p, it is guaranteed that the whole string resides in the buffer at positions markpos():p-1.\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@unmark","page":"Parsing IOs","title":"Automa.@unmark","text":"unmark()\n\nPseudomacro. Removes the mark from the buffer. 
This allows all previous data to be cleared from the buffer.\n\nSee also: @mark, @markpos\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@markpos","page":"Parsing IOs","title":"Automa.@markpos","text":"markpos()\n\nPseudomacro. Get the position of the mark in the buffer.\n\nSee also: @mark, @unmark\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@bufferpos","page":"Parsing IOs","title":"Automa.@bufferpos","text":"bufferpos()\n\nPseudomacro. Returns the integer position of the current TranscodingStreams buffer (only used with the generate_reader function).\n\nExample\n\n# Inside some Automa action code\n@setbuffer()\ndescription = sub_parser(stream)\np = @bufferpos()\n\nSee also: @setbuffer\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@relpos","page":"Parsing IOs","title":"Automa.@relpos","text":"relpos(p)\n\nAutoma pseudomacro. Return the position of p relative to @markpos(). Equivalent to p - @markpos() + 1. This can be used to mark additional points in the stream when the mark is set, after which their actual position can be retrieved using @abspos(x).\n\nExample usage:\n\n# In one action\nidentifier_pos = @relpos(p)\n\n# Later, in a different action\nidentifier = data[@abspos(identifier_pos):p]\n\nSee also: @abspos\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@abspos","page":"Parsing IOs","title":"Automa.@abspos","text":"abspos(p)\n\nAutoma pseudomacro. Used to obtain the actual position of a relative position obtained from @relpos. See @relpos for more details.\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@setbuffer","page":"Parsing IOs","title":"Automa.@setbuffer","text":"setbuffer()\n\nUpdates the buffer position to match p. The buffer position is synchronized with p before and after calls to functions generated by generate_reader. 
@setbuffer() can be used to update the buffer position before calling another parser.\n\nExample\n\n# Inside some Automa action code\n@setbuffer()\ndescription = sub_parser(stream)\np = @bufferpos()\n\nSee also: @bufferpos\n\n\n\n\n\n","category":"macro"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"parser/#Parsing-from-a-buffer","page":"Parsing buffers","title":"Parsing from a buffer","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Automa can leverage metaprogramming to combine regex and Julia code to create parsers. This is significantly more difficult than simply using validators or tokenizers, but still simpler than parsing from an IO. Currently, Automa loads data through pointers, and therefore needs data backed by Array{UInt8} or String or similar - it does not work with types such as UnitRange{UInt8}. Furthermore, be careful about passing strided views to Automa - while Automa can extract a pointer from a strided view, it will always advance the pointer one byte at a time, disregarding the view's stride.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"As an example, let's use the simplified FASTA format introduced in the regex section, with the following format: re\"(>[a-z]+\\n([ACGT]+\\n)+)*\". 
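To make the format concrete, here is the same pattern translated by hand into Julia's built-in PCRE syntax. This translation is ours and purely illustrative - Automa does not use PCRE:

```julia
# Hand-translated PCRE equivalent of the Automa regex re"(>[a-z]+\n([ACGT]+\n)+)*",
# only to illustrate which inputs the simplified FASTA format accepts.
fasta_pattern = r"^(?:>[a-z]+\n(?:[ACGT]+\n)+)*$"

println(occursin(fasta_pattern, ">abc\nTAGA\nAAGA\n"))  # a valid record
println(occursin(fasta_pattern, ">abc\nTAGA"))          # missing trailing newline
```

Note the trailing newline requirement after each sequence line; this detail becomes important in the Preconditions section below.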
We want to parse it into a Vector{Seq}, where Seq is defined as:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> struct Seq\n name::String\n seq::String\n end","category":"page"},{"location":"parser/#Adding-actions-to-regex","page":"Parsing buffers","title":"Adding actions to regex","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"To do this, we need to inject Julia code into the regex validator while it is running. The first step is to add actions to our regex: These are simply names of Julia expressions to splice in, where the expressions will be executed when the regex is matched. We can choose the names arbitrarily.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Currently, actions can be added in the following places in a regex:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"With onenter!, meaning it will be executed when reading the first byte of the regex\nWith onfinal!, where it will be executed when reading the last byte of the regex. 
Note that it's not possible to determine the final byte for some regex like re\"X+\", since the machine reads only 1 byte at a time and cannot look ahead.\nWith onexit!, meaning it will be executed on reading the first byte AFTER the regex, or when exiting the regex by encountering the end of input (only for a regex match, not an unexpected end of input)\nWith onall!, where it will be executed when reading every byte that is part of the regex.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"You can set the actions to be a single action name (represented by a Symbol), or a list of action names:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> my_regex = re\"ABC\";\n\njulia> onenter!(my_regex, [:action_a, :action_b]);\n\njulia> onexit!(my_regex, :action_c);","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"In that case, the code named action_a, then the code named action_b, will be executed in order when entering the regex, and the code named action_c will be executed when exiting the regex.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The onenter! 
etc functions return the regex they modify, so the above can be written:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> my_regex = onexit!(onenter!(re\"ABC\", [:action_a, :action_b]), :action_c);\n\njulia> my_regex isa RE\ntrue","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"When the following regex's actions are visualized in its corresponding DFA:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"regex = let\n ab = re\"ab*\"\n onenter!(ab, :enter_ab)\n onexit!(ab, :exit_ab)\n onfinal!(ab, :final_ab)\n onall!(ab, :all_ab)\n c = re\"c\"\n onenter!(c, :enter_c)\n onexit!(c, :exit_c)\n onfinal!(c, :final_c)\n\n ab * c\nend","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The resulting DFA is shown below. Here, the edge labeled 'a'/enter_ab,all_ab,final_ab means that the edge consumes input byte 'a', and executes the three actions enter_ab, all_ab and final_ab, in that order. ","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"(Image: Visualization of regex with actions)","category":"page"},{"location":"parser/#Compiling-regex-to-Machines","page":"Parsing buffers","title":"Compiling regex to Machines","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"In order to create code, the regex must first be compiled to a Machine, which is a struct that represents an optimised DFA. We can do that with compile(regex). Under the hood, this compiles the regex to an NFA, then compiles the NFA to a DFA, and then optimises the DFA to a Machine (see the section on Automa theory).","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Normally, we don't care about the regex directly, but only want the Machine. 
So, it is idiomatic to compile the regex in the same let statement it is being built in:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"machine = let\n header = re\"[a-z]+\"\n seqline = re\"[ACGT]+\"\n record = re\">\" * header * '\\n' * rep1(seqline * '\\n')\n compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Note that, if this code is placed at top level in a package, the regex will be constructed and compiled to a Machine during package precompilation, which greatly helps load times.","category":"page"},{"location":"parser/#Creating-our-parser","page":"Parsing buffers","title":"Creating our parser","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"However, in this case, we don't just need a Machine with the regex, we need a Machine with the regex containing the relevant actions. To parse a simplified FASTA file into a Vector{Seq}, I'm using these four actions:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"When the machine enters the header or a sequence line, I want it to mark the position where it entered the regex. The marked position will be used as the leftmost position where the header or sequence is extracted later.\nWhen exiting the header, I want to extract the bytes from the marked position in the action above, to the last header byte (i.e. 
the byte before the current byte), and use these bytes as the sequence header\nWhen exiting a sequence line, I want to do the same: Extract from the marked position to one before the current position, but this time I want to append the current line to a buffer containing all the lines of the sequence\nWhen exiting a record, I want to construct a Seq object from the header bytes and the buffer with all the sequence lines, then push the Seq to the result.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\n end;","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can now write the code we want executed. When writing this code, we want access to a few variables used by the machine simulation. For example, we might want to know at which byte position the machine is when an action is executed. 
Currently, the following variables are accessible in the code:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"byte: The current input byte as a UInt8\np: The 1-indexed position of byte in the buffer\np_end: The length of the input buffer\nis_eof: Whether the machine has reached the end of the input.\ncs: The current state of the machine, as an integer\ndata: The input buffer\nmem: The memory being read from, an Automa.SizedMemory object containing a pointer and a length","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We place the actions we want executed in a Dict{Symbol, Expr}:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> actions = Dict(\n :mark_pos => :(pos = p),\n :header => :(header = String(data[pos:p-1])),\n :seqline => :(append!(buffer, data[pos:p-1])),\n :record => :(push!(seqs, Seq(header, String(buffer))))\n );","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"For multi-line Expr, you can construct them with quote ... end blocks.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can now construct a function that parses our data. In the code written in the action dict above, besides the variables defined for us by Automa, we also refer to the variables buffer, header, pos and seqs. Some of these variables are defined in the code above (for example, in the :(pos = p) expression), but we can't necessarily control the order in which Automa will insert these expressions into our final function. 
Hence, let's initialize these variables at the top of the function we generate, such that we know for sure they are defined whenever they are used.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The code itself is generated using generate_code:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> @eval function parse_fasta(data)\n pos = 0\n buffer = UInt8[]\n seqs = Seq[]\n header = \"\"\n $(generate_code(machine, actions))\n return seqs\n end\nparse_fasta (generic function with 1 method)","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can now use it:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> parse_fasta(\">abc\\nTAGA\\nAAGA\\n>header\\nAAAG\\nGGCG\\n\")\n2-element Vector{Seq}:\n Seq(\"abc\", \"TAGAAAGA\")\n Seq(\"header\", \"AAAGGGCG\")","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"If we give our function a bad input - for example, if we forget the trailing newline - it throws an error:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> parse_fasta(\">abc\\nTAGA\\nAAGA\\n>header\\nAAAG\\nGGCG\")\nERROR: Error during FSM execution at buffer position 33.\nLast 32 byte(s) were:\n\n\">abc\\nTAGA\\nAAGA\\n>header\\nAAAG\\nGGCG\"\n\nObserved input: EOF at state 5. Outgoing edges:\n * '\\n'/seqline\n * [ACGT]\n\nInput is not in any outgoing edge, and machine therefore errored.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The code above parses with about 300 MB/s on my laptop. 
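(The throughput figure above is the author's own measurement. As a rough sketch, you can measure it yourself, assuming parse_fasta has been defined as above:)

```julia
# Rough throughput measurement of the parser generated above.
# Assumes `parse_fasta` has already been defined via @eval as shown.
data = ">abc\nTAGA\nAAGA\n" ^ 1_000_000   # ~15 MB of valid input
parse_fasta(data)                         # warm up, so compilation is not timed
t = @elapsed parse_fasta(data)
println("throughput ≈ ", round(sizeof(data) / t / 1e6; digits=1), " MB/s")
```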
Not bad, but Automa can do better - read on to learn how to customize codegen.","category":"page"},{"location":"parser/#Preconditions","page":"Parsing buffers","title":"Preconditions","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"You might have noticed a peculiar detail about our FASTA format: It demands a trailing newline after each record. In other words, >a\\nA is not a valid FASTA record.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can easily rewrite the regex such that the last record does not need a trailing \\n. But look what happens when we try that:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * seqline * rep('\\n' * seqline), :record)\n compile(opt(record) * rep('\\n' * record) * rep(re\"\\n\"))\n end;\nERROR: Ambiguous NFA.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Why does this error? Well, remember that Automa processes one byte at a time, and at each byte, makes a decision on what actions to execute. Hence, if it sees the input >a\\nA\\n, it does not know what to do when encountering the second \\n. If the next byte is e.g. A, then it would need to execute the :seqline action. If the byte is >, it would need to execute first :seqline, then :record. Automa can't read ahead, so the regex is ambiguous, and the behaviour when reading the input >a\\nA\\n is undefined. 
Therefore, Automa refuses to compile it.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"There are several ways to solve this:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"First, you can rewrite the regex to not be ambiguous. This is usually the preferred option: After all, if the regex is ambiguous, you probably made a mistake with the regex.\nYou can manually disable the ambiguity check by passing the keyword unambiguous=false to compile. This will cause the machine to exhibit undefined behaviour if an input like >a\\nA\\n is seen, so this is usually a poor idea.\nYou can rewrite the actions, such that the action itself uses an if-statement to check what to do. In the example above, you could remove the :record action and have the :seqline action conditionally emit a record if the next byte was >.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Finally, you can use preconditions. A precondition is a symbol, attached to a regex, just like an action. Just like an action, the symbol is attached to an Expr object, but for preconditions this must evaluate to a Bool. If false, the regex is not entered.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Let's have an example. The following machine is obviously ambiguous:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n a = onenter!(re\"XY\", :a)\n b = onenter!(re\"XZ\", :b)\n compile('A' * (a | b))\n end;\nERROR: Ambiguous NFA.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can add a precondition with precond!. Below, precond!(regex, label) is equivalent to precond!(regex, label; when=:enter, bool=true). 
This means \"only enter regex when the boolean expression label evaluates to bool (true)\":","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n a = precond!(onenter!(re\"XY\", :a), :test)\n b = precond!(onenter!(re\"XZ\", :b), :test; bool=false)\n compile('A' * (a | b))\n end;\n\njulia> machine isa Automa.Machine\ntrue","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Here, re\"XY\" can only be entered when :test is true, and re\"XZ\" only when :test is false. So, there can be no ambiguous behaviour and the regex compiles fine.","category":"page"},{"location":"parser/#Reference","page":"Parsing buffers","title":"Reference","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Automa.onenter!\nAutoma.onexit!\nAutoma.onall!\nAutoma.onfinal!\nAutoma.precond!\nAutoma.generate_code\nAutoma.generate_init_code\nAutoma.generate_exec_code","category":"page"},{"location":"parser/#Automa.RegExp.onenter!","page":"Parsing buffers","title":"Automa.RegExp.onenter!","text":"onenter!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading the first byte of regex re. If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onexit!, onall!, onfinal!\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = onenter!(regex, :entering_regex);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.onexit!","page":"Parsing buffers","title":"Automa.RegExp.onexit!","text":"onexit!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading the first byte no longer part of regex re, or if experiencing an expected end-of-file. 
If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onenter!, onall!, onfinal!\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = onexit!(regex, :exiting_regex);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.onall!","page":"Parsing buffers","title":"Automa.RegExp.onall!","text":"onall!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading any byte part of the regex re. If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onenter!, onexit!, onfinal!\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = onall!(regex, :reading_re_byte);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.onfinal!","page":"Parsing buffers","title":"Automa.RegExp.onfinal!","text":"onfinal!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading the last byte of regex re. If re does not have a definite final byte, e.g. re\"a(bc)*\", where more \"bc\" can always be added, compiling the regex will error after setting a final action. If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onenter!, onall!, onexit!\n\nExample\n\njulia> regex = re\"ab?c\";\n\njulia> regex2 = onfinal!(regex, :entering_last_byte);\n\njulia> regex === regex2\ntrue\n\njulia> compile(onfinal!(re\"ab?c*\", :does_not_work))\nERROR: [...]\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.precond!","page":"Parsing buffers","title":"Automa.RegExp.precond!","text":"precond!(re::RE, s::Symbol; [when=:enter], [bool=true]) -> re\n\nSet re's precondition to s. 
Before any state transitions to re, or inside re, the precondition code s is checked to be bool before the transition is taken.\n\nwhen controls if the condition is checked when the regex is entered (if :enter), or at every state transition inside the regex (if :all)\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = precond!(regex, :some_condition);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.generate_code","page":"Parsing buffers","title":"Automa.generate_code","text":"generate_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr\n\nGenerate init and exec code for machine. The default code generator function for creating functions, preferentially use this over generating init and exec code directly, due to its convenience. Shorthand for producing the concatenated code of:\n\ngenerate_init_code(ctx, machine)\ngenerate_action_code(ctx, machine, actions)\ngenerate_input_error_code(ctx, machine) [elided if actions == :debug]\n\nExamples\n\n@eval function foo(data)\n # Initialize variables used in actions\n data_buffer = UInt8[]\n $(generate_code(machine, actions))\n return data_buffer\nend\n\nSee also: generate_init_code, generate_exec_code\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.generate_init_code","page":"Parsing buffers","title":"Automa.generate_init_code","text":"generate_init_code([::CodeGenContext], machine::Machine)::Expr\n\nGenerate variable initialization code, initializing variables such as p, and p_end. The names of these variables are set by the CodeGenContext. If not passed, the context defaults to DefaultCodeGenContext\n\nPrefer using the more generic generate_code over this function where possible. 
This function should be used if the initialized data should be modified before the execution code.\n\nExample\n\n@eval function foo(data)\n $(generate_init_code(machine))\n p = 2 # maybe I want to start from position 2, not 1\n $(generate_exec_code(machine, actions))\n return cs\nend\n\nSee also: generate_code, generate_exec_code\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.generate_exec_code","page":"Parsing buffers","title":"Automa.generate_exec_code","text":"generate_exec_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr\n\nGenerate machine execution code with actions. This code should be run after the machine has been initialized with generate_init_code. If not passed, the context defaults to DefaultCodeGenContext.\n\nPrefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.\n\nExamples\n\n@eval function foo(data)\n $(generate_init_code(machine))\n p = 2 # maybe I want to start from position 2, not 1\n $(generate_exec_code(machine, actions))\n return cs\nend\n\nSee also: generate_code, generate_init_code\n\n\n\n\n\n","category":"function"}] +[{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"debugging/#Debugging-Automa","page":"Debugging Automa","title":"Debugging Automa","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"danger: Danger\nAll Automa's debugging tools are NOT part of the API and are subject to change without warning. 
You can use them during development, but do NOT rely on their behaviour in your final code.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Automa is a complicated package, and the process of indirectly designing parsers by first designing a machine can be error-prone. Therefore, it's crucial to have good debugging tooling.","category":"page"},{"location":"debugging/#Revise","page":"Debugging Automa","title":"Revise","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Revise is not able to update Automa-generated functions. To make your feedback loop faster, you can manually re-run the code that defines the Automa functions - usually this is much faster than modifying the package and reloading it.","category":"page"},{"location":"debugging/#Ambiguity-check","page":"Debugging Automa","title":"Ambiguity check","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"It is easy to accidentally create a machine where it is undecidable what actions should be taken. For example:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"machine = let\n alphabet = re\"BC\"\n band = onenter!(re\"BBA\", :cool_band)\n compile(re\"XYZ A\" * (alphabet | band))\nend\n\n# output\nERROR: Ambiguous NFA.\n[...]","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Consider what the machine should do once it observes the input XYZ A followed by the byte B: Is the B part of alphabet (in which case it should do nothing), or is it part of band (in which case it should do the action :cool_band)? 
It's impossible to tell.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Automa will not compile this, and will raise the error:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"ERROR: Ambiguous NFA.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Note the error shows an example input which will trigger the ambiguity: XYZ A, then B. By simply running the input through in your head, you may discover yourself how the error happens.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"In the example above, the error was obvious, but consider this example:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"fasta_machine = let\n header = re\"[a-z]+\"\n seq_line = re\"[ACGT]+\"\n sequence = seq_line * rep('\\n' * seq_line)\n record = onexit!('>' * header * '\\n' * sequence, :emit_record)\n compile(rep(record * '\\n') * opt(record))\nend\n\n# output\nERROR: Ambiguous NFA.\n[...]","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"It's the same problem: After a sequence line you observe \\n: Is this the end of the sequence, or just a newline before another sequence line?","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"To work around it, consider when you know for sure you are out of the sequence: It's not before you see a new >, or end-of-file. In a sense, the trailing \\n really IS part of the sequence. 
So, really, your machine should use a regex similar to this","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"fasta_machine = let\n header = re\"[a-z]+\"\n seq_line = re\"[ACGT]+\"\n sequence = rep1(seq_line * '\\n')\n record = onexit!('>' * header * '\\n' * sequence, :emit_record)\n\n # A special record that can avoid a trailing newline, but ONLY if it's the last record\n record_eof = '>' * header * '\\n' * seq_line * rep('\\n' * seq_line) * opt('\\n')\n compile(rep(record * '\\n') * opt(record_eof))\nend\n@assert fasta_machine isa Automa.Machine\n\n# output","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"When all else fails, you can also pass unambiguous=false to the compile function - but beware! Ambiguous machines have undefined behaviour if you get into an ambiguous situation.","category":"page"},{"location":"debugging/#Create-Machine-flowchart","page":"Debugging Automa","title":"Create Machine flowchart","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The function machine2dot(::Machine) will return a string with a Graphviz .dot formatted flowchart of the machine. Graphviz can then convert the dot file to an SVG image.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"On my computer (with Graphviz and Firefox installed), I can use the following Julia code to display a flowchart of a machine. 
Note that dot is the command-line name of Graphviz.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"function display_machine(m::Machine)\n open(\"/tmp/machine.dot\", \"w\") do io\n println(io, Automa.machine2dot(m))\n end\n run(pipeline(`dot -Tsvg /tmp/machine.dot`, stdout=\"/tmp/machine.svg\"))\n run(`firefox /tmp/machine.svg`)\nend","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The following functions are Automa internals, but they might help with more advanced debugging:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"re2nfa - create an NFA from an Automa regex\nnfa2dot - create a dot-formatted string from an NFA\nnfa2dfa - create a DFA from an NFA\ndfa2dot - create a dot-formatted string from a DFA","category":"page"},{"location":"debugging/#Running-machines-in-debug-mode","page":"Debugging Automa","title":"Running machines in debug mode","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The function generate_code takes an argument actions. If this is :debug, then all actions in the given Machine will be replaced by :(push!(logger, action_name)). 
Hence, given a FASTA machine, you could create a debugger function:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":" @eval function debug(data)\n logger = []\n $(generate_code(fasta_machine, :debug))\n logger\nend","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Then see all the actions executed in order by doing:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"julia> debug(\">abc\\nTAG\")\n5-element Vector{Any}:\n :mark\n :header\n :mark\n :seqline\n :record","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Note that if your machine relies on its actions to work correctly, for example by actions modifying p, this kind of debugger will not work, as it replaces all actions.","category":"page"},{"location":"debugging/#More-advanced-debuggning","page":"Debugging Automa","title":"More advanced debugging","text":"","category":"section"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The file test/debug.jl contains extra debugging functionality and may be included. 
In particular it defines the functions debug_execute and create_debug_function.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The function create_debug_function(::Machine; ascii=false) is best demonstrated by example:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"machine = let\n letters = onenter!(re\"[a-z]+\", :enter_letters)\n compile(onexit!(letters * re\",[0-9],\" * letters, :exiting_regex))\nend\neval(create_debug_function(machine; ascii=true))\n(end_state, transitions) = debug_compile(\"abc,5,d!\")\n@show end_state\ntransitions","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Will create the following output:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"end_state = -6\n7-element Vector{Tuple{Char, Int64, Vector{Symbol}}}:\n ('a', 2, [:enter_letters])\n ('b', 2, [])\n ('c', 2, [])\n (',', 3, [])\n ('5', 4, [])\n (',', 5, [])\n ('d', 6, [:enter_letters])","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"Where each 3-tuple in the output corresponds to the input byte (displayed as a Char if ascii is set to true), the Automa state reached on reading the byte, and the actions executed.","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"The debug_execute function works the same as debug_compile, but does not need to be generated first, and can be run directly on an Automa regex:","category":"page"},{"location":"debugging/","page":"Debugging Automa","title":"Debugging Automa","text":"julia> debug_execute(re\"[A-z]+\", \"abc1def\"; ascii=true)\n(-3, Tuple{Union{Nothing, Char}, Int64, Vector{Symbol}}[('a', 2, []), ('b', 3, []), ('c', 3, [])])","category":"page"},{"location":"debugging/","page":"Debugging 
Automa","title":"Debugging Automa","text":"machine2dot","category":"page"},{"location":"debugging/#Automa.machine2dot","page":"Debugging Automa","title":"Automa.machine2dot","text":"machine2dot(machine::Machine)::String\n\nReturn a String with a flowchart of the machine in Graphviz (dot) format. Using Graphviz, a command-line tool, the dot file can be converted to various picture formats.\n\nExample\n\nopen(\"/tmp/machine.dot\", \"w\") do io\n println(io, machine2dot(machine))\nend\n# Requires graphviz to be installed\nrun(pipeline(`dot -Tsvg /tmp/machine.dot`, stdout=\"/tmp/machine.svg\"))\n\n\n\n\n\n","category":"function"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"tokenizer/#Tokenizers-(lexers)","page":"Tokenizers","title":"Tokenizers (lexers)","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"A tokenizer or a lexer is a program that breaks down an input text into smaller chunks, and classifies them as one of several tokens. For example, consider an imaginary format that only consists of nested tuples of strings containing letters, like this:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"((\"ABC\", \"v\"),((\"x\", (\"pj\",((\"a\", \"k\")), (\"L\")))))","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Any text of this format can be broken down into a sequence of the following tokens:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Left parenthesis: re\"\\(\"\nRight parenthesis: re\"\\)\"\nComma: re\",\"\nQuote: re\"\\\"\"\nSpaces: re\" +\"\nLetters: re\"[A-Za-z]+\"","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Such that e.g. 
(\"XY\", \"A\") can be represented as lparens, quote, XY, quote, comma, space, quote, A, quote, rparens.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Breaking the text down to its tokens is called tokenization or lexing. Note that lexing in itself is not sufficient to parse the format: Lexing is context unaware, so e.g. the text \"((A can be perfectly well tokenized to quote lparens lparens A, even if it's invalid.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"The purpose of tokenization is to make subsequent parsing easier, because each part of the text has been classified. That makes it easier, for example, to search for letters in the input. Instead of having to muck around with regex to find the letters, you use regex once to classify all text.","category":"page"},{"location":"tokenizer/#Making-and-using-a-tokenizer","page":"Tokenizers","title":"Making and using a tokenizer","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Let's use the example above to create a tokenizer. 
The most basic default tokenizer uses UInt32 as tokens: You pass in a list of regex matching each token, then evaluate the resulting code:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> make_tokenizer(\n [re\"\\(\", re\"\\)\", re\",\", re\"\\\"\", re\" +\", re\"[a-zA-Z]+\"]\n ) |> eval","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Since the default tokenizer uses UInt32 as tokens, you can then obtain a lazy iterator of tokens by calling tokenize(UInt32, data):","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> iterator = tokenize(UInt32, \"\"\"(\"XY\", \"A\")\"\"\"); typeof(iterator)\nTokenizer{UInt32, String, 1}","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"This will return Tuple{Int64, Int32, UInt32} elements, with each element being:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"The start index of the token\nThe length of the token\nThe token itself, in this example UInt32(1) for '(', UInt32(2) for ')' etc: ","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> collect(iterator)\n10-element Vector{Tuple{Int64, Int32, UInt32}}:\n (1, 1, 0x00000001)\n (2, 1, 0x00000004)\n (3, 2, 0x00000006)\n (5, 1, 0x00000004)\n (6, 1, 0x00000003)\n (7, 1, 0x00000005)\n (8, 1, 0x00000004)\n (9, 1, 0x00000006)\n (10, 1, 0x00000004)\n (11, 1, 0x00000002)","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Any data which could not be tokenized is given the error token UInt32(0):","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> collect(tokenize(UInt32, \"XY!!)\"))\n3-element Vector{Tuple{Int64, Int32, UInt32}}:\n (1, 2, 0x00000006)\n (3, 2, 0x00000000)\n (5, 1, 
0x00000002)","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Both tokenize and make_tokenizer take an optional argument version, which is 1 by default. This sets the last parameter of the Tokenizer struct, and as such allows you to create multiple different tokenizers with the same element type.","category":"page"},{"location":"tokenizer/#Using-enums-as-tokens","page":"Tokenizers","title":"Using enums as tokens","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Using UInt32 as tokens is not very convenient - so it's possible to use enums to create the tokenizer:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> @enum Token error lparens rparens comma quot space letters\n\njulia> make_tokenizer((error, [\n lparens => re\"\\(\",\n rparens => re\"\\)\",\n comma => re\",\",\n quot => re\"\\\"\",\n space => re\" +\",\n letters => re\"[a-zA-Z]+\"\n ])) |> eval\n\njulia> collect(tokenize(Token, \"XY!!)\"))\n3-element Vector{Tuple{Int64, Int32, Token}}:\n (1, 2, letters)\n (3, 2, error)\n (5, 1, rparens)","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"To make it even easier, you can define the enum and the tokenizer in one go:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"tokens = [\n :lparens => re\"\\(\",\n :rparens => re\"\\)\",\n :comma => re\",\",\n :quot => re\"\\\"\",\n :space => re\" +\",\n :letters => re\"[a-zA-Z]+\"\n]\n@eval @enum Token error $(first.(tokens)...)\nmake_tokenizer((error, \n [Token(i) => j for (i,j) in enumerate(last.(tokens))]\n)) |> eval","category":"page"},{"location":"tokenizer/#Token-disambiguation","page":"Tokenizers","title":"Token disambiguation","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"It's possible to create a tokenizer where the 
different token regexes overlap:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"julia> make_tokenizer([re\"[ab]+\", re\"ab*\", re\"ab\"]) |> eval","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"In this case, an input like ab will match all three regexes. Which tokens are emitted is determined by two rules:","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"First, the emitted tokens will be as long as possible. So, the input aa could be emitted as one token of the regex re\"[ab]+\", two tokens of the same regex, or as two tokens of the regex re\"ab*\". In this case, it will be emitted as a single token of re\"[ab]+\", since that will make the first token as long as possible (2 bytes), whereas the other options would only make it 1 byte long.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Second, tokens with a higher index in the input array beat previous tokens. So, a will be emitted as re\"ab*\", as its index of 2 beats the previous regex re\"[ab]+\" with the index 1, and ab will match the third regex.","category":"page"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"If you don't want emitted tokens to depend on these priority rules, you can set the optional keyword unambiguous=true in the make_tokenizer function, in which case make_tokenizer will error if any input text could be broken down into different tokens. 
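For example, the overlapping tokenizer from above errors if built with this flag (a sketch - the exact error message may differ):\n\njulia> make_tokenizer([re\"[ab]+\", re\"ab*\", re\"ab\"]; unambiguous=true) |> eval\nERROR: Ambiguous NFA.\n[...]\n\n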
However, note that this may cause most tokenizers to error when being built, as most tokenization processes are ambiguous.","category":"page"},{"location":"tokenizer/#Reference","page":"Tokenizers","title":"Reference","text":"","category":"section"},{"location":"tokenizer/","page":"Tokenizers","title":"Tokenizers","text":"Automa.Tokenizer\nAutoma.tokenize\nAutoma.make_tokenizer","category":"page"},{"location":"tokenizer/#Automa.Tokenizer","page":"Tokenizers","title":"Automa.Tokenizer","text":"Tokenizer{E, D, C}\n\nLazy iterator of tokens of type E over data of type D.\n\nTokenizer works on any buffer-like object that defines pointer and sizeof. When iterated, it will return a 3-tuple of integers: * The first is the 1-based starting index of the token in the buffer * The second is the length of the token in bytes * The third is the token kind: The index in the input list tokens.\n\nUn-tokenizable data will be emitted as the \"error token\" with index zero.\n\nThe Int C parameter allows multiple tokenizers to be created with the otherwise same type parameters.\n\nSee also: make_tokenizer\n\n\n\n\n\n","category":"type"},{"location":"tokenizer/#Automa.tokenize","page":"Tokenizers","title":"Automa.tokenize","text":"tokenize(::Type{E}, data, version=1)\n\nCreate a Tokenizer{E, typeof(data), version}, iterating tokens of type E over data.\n\nSee also: Tokenizer, make_tokenizer, compile\n\n\n\n\n\n","category":"function"},{"location":"tokenizer/#Automa.make_tokenizer","page":"Tokenizers","title":"Automa.make_tokenizer","text":"make_tokenizer(\n machine::TokenizerMachine;\n tokens::Tuple{E, AbstractVector{E}}= [ integers ],\n goto=true, version=1\n) where E\n\nCreate code which when evaluated, defines Base.iterate(::Tokenizer{E, D, $version}). 
tokens is a tuple of the error token, which will be emitted for data that cannot be tokenized, and a vector of non-error tokens of length machine.n_tokens.\n\nExample usage\n\njulia> machine = compile([re\"a\", re\"b\"]);\n\njulia> make_tokenizer(machine; tokens=(0x00, [0x01, 0x02])) |> eval\n\njulia> iter = tokenize(UInt8, \"abxxxba\"); typeof(iter)\nTokenizer{UInt8, String, 1}\n\njulia> collect(iter)\n5-element Vector{Tuple{Int64, Int32, UInt8}}:\n (1, 1, 0x01)\n (2, 1, 0x02)\n (3, 3, 0x00)\n (6, 1, 0x02)\n (7, 1, 0x01)\n\nAny actions inside the input regexes will be ignored. If goto (default), use the faster, but more complex goto code generator. The version number will set the last parameter of the Tokenizer, which allows you to create different tokenizers for the same element type.\n\nSee also: Tokenizer, tokenize, compile\n\n\n\n\n\nmake_tokenizer(\n tokens::Union{\n AbstractVector{RE},\n Tuple{E, AbstractVector{Pair{E, RE}}}\n };\n goto::Bool=true,\n version::Int=1,\n unambiguous=false\n) where E\n\nConvenience function for first compiling a tokenizer, then running make_tokenizer on it. If tokens is an abstract vector, create an iterator of integer tokens with the error token being zero and the non-error tokens being the index in the vector. Else, tokens is the error token followed by token => regex pairs. 
See the relevant other methods of make_tokenizer, and compile.\n\nExample\n\njulia> make_tokenizer([re\"abc\", re\"def\"]) |> eval\n\njulia> collect(tokenize(UInt32, \"abcxyzdef123\"))\n4-element Vector{Tuple{Int64, Int32, UInt32}}:\n (1, 3, 0x00000001)\n (4, 3, 0x00000000)\n (7, 3, 0x00000002)\n (10, 3, 0x00000000)\n\n\n\n\n\n","category":"function"},{"location":"validators/","page":"Validators","title":"Validators","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"validators/#Text-validators","page":"Validators","title":"Text validators","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"The simplest use of Automa is to simply match a regex. It's unlikely you are going to want to use Automa for this instead of Julia's built-in regex engine PCRE, unless you need the extra performance that Automa brings over PCRE. Nonetheless, it serves as a good starting point to introduce Automa.","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"Suppose we have the FASTA regex from the regex page:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> fasta_regex = let\n header = re\"[a-z]+\"\n seqline = re\"[ACGT]+\"\n record = '>' * header * '\\n' * rep1(seqline * '\\n')\n rep(record)\n end;","category":"page"},{"location":"validators/#Buffer-validator","page":"Validators","title":"Buffer validator","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"Automa comes with a convenience function generate_buffer_validator:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"Given a regex (RE) like the one above, we can do:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> 
eval(generate_buffer_validator(:validate_fasta, fasta_regex));\n\njulia> validate_fasta\nvalidate_fasta (generic function with 1 method)","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"And we now have a function that checks if some data matches the regex:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> validate_fasta(\">hello\\nTAGAGA\\nTAGAG\") # missing trailing newline\n0\n\njulia> validate_fasta(\">helloXXX\") # Error at byte index 7\n7\n\njulia> validate_fasta(\">hello\\nTAGAGA\\nTAGAG\\n\") # nothing; it matches","category":"page"},{"location":"validators/#IO-validators","page":"Validators","title":"IO validators","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"For large files, having to read the data into a buffer to validate it may not be possible. Automa also supports creating IO validators with the generate_io_validator function:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"This works very similar to generate_buffer_validator, but the generated function takes an IO, and has a different return value:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"If the data matches, still return nothing\nElse, return (byte, (line, column)) where byte is the first errant byte, and (line, column) the position of the byte. If the errant byte is a newline, column is 0. 
If the input reaches unexpected EOF, byte is nothing, and (line, column) points to the last line/column in the IO:","category":"page"},{"location":"validators/","page":"Validators","title":"Validators","text":"julia> eval(generate_io_validator(:validate_io, fasta_regex));\n\njulia> validate_io(IOBuffer(\">hello\\nTAGAGA\\n\"))\n\njulia> validate_io(IOBuffer(\">helX\"))\n(0x58, (1, 5))\n\njulia> validate_io(IOBuffer(\">hello\\n\\n\"))\n(0x0a, (3, 0))\n\njulia> validate_io(IOBuffer(\">hello\\nAC\"))\n(nothing, (2, 2))","category":"page"},{"location":"validators/#Reference","page":"Validators","title":"Reference","text":"","category":"section"},{"location":"validators/","page":"Validators","title":"Validators","text":"Automa.generate_buffer_validator\nAutoma.generate_io_validator\nAutoma.compile","category":"page"},{"location":"validators/#Automa.generate_buffer_validator","page":"Validators","title":"Automa.generate_buffer_validator","text":"generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)\n\nGenerate code that, when evaluated, defines a function named name, which takes a single argument data, interpreted as a sequence of bytes. The function returns nothing if data matches Machine, else the index of the first invalid byte. If the machine reached unexpected EOF, returns 0. If goto, the function uses the faster but more complicated :goto code. If docstring, automatically create a docstring for the generated function.\n\n\n\n\n\n","category":"function"},{"location":"validators/#Automa.generate_io_validator","page":"Validators","title":"Automa.generate_io_validator","text":"generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)\n\nNOTE: This method requires TranscodingStreams to be loaded\n\nCreate code that, when evaluated, defines a function named funcname. This function takes an IO, and checks if the data in the input conforms to the regex, without executing any actions. If the input conforms, return nothing. 
Else, return (byte, (line, col)), where byte is the first invalid byte, and (line, col) the 1-indexed position of that byte. If the invalid byte is a \n byte, col is 0 and the line number is incremented. If the input errors due to unexpected EOF, byte is nothing, and the line and column given are those of the last byte in the file. If goto, the function uses the faster but more complicated :goto code.\n\n\n\n\n\n","category":"function"},{"location":"validators/#Automa.compile","page":"Validators","title":"Automa.compile","text":"compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine\n\nCompile a finite state machine (FSM) from re. If optimize, attempt to minimize the number of states in the FSM. If unambiguous, disallow creation of FSM where the actions are not deterministic.\n\nExamples\n\nmachine = let\n name = re\"[A-Z][a-z]+\"\n first_last = name * re\" \" * name\n last_first = name * re\", \" * name\n compile(first_last | last_first)\nend\n\n\n\n\n\ncompile(tokens::Vector{RE}; unambiguous=false)::TokenizerMachine\n\nCompile the regex tokens to a tokenizer machine. The machine can be passed to make_tokenizer.\n\nThe keyword unambiguous decides which of multiple matching tokens is emitted: If false (default), the longest token is emitted. If multiple tokens have the same length, the one with the highest index is returned. 
If true, make_tokenizer will error if any possible input text can be broken down ambiguously into tokens.\n\nSee also: Tokenizer, make_tokenizer, tokenize\n\n\n\n\n\n","category":"function"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"reader/#Creating-a-Reader-type","page":"Creating readers","title":"Creating a Reader type","text":"","category":"section"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"The use of generate_reader as we learned in the previous section \"Parsing from an io\" has an issue we need to address: While we were able to read multiple records from the reader by calling read_record multiple times, no state was preserved between these calls, and so, no state can be preserved between reading individual records. This is also what made it necessary to clumsily reset p after emitting each record.","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"Imagine you have a format with two kinds of records, A and B types. A records must come before B records in the file. Hence, while a B record can appear at any time, once you've seen a B record, there can't be any more A records. When reading records from the file, you must be able to store whether you've seen a B record.","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"We address this by creating a Reader type which wraps the IO being parsed, and which stores any state we want to preserve between records. 
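For the A/B record format above, such a wrapper might look like this hypothetical sketch (ABReader and seen_b are illustrative names, not part of Automa):\n\nmutable struct ABReader{S <: TranscodingStream}\n io::S\n automa_state::Int\n seen_b::Bool\nend\n\n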
Let's stick to our simplified FASTA format parsing sequences into Seq objects:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"struct Seq\n name::String\n seq::String\nend\n\nmachine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"This time, we use the following Reader type:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"mutable struct Reader{S <: TranscodingStream}\n io::S\n automa_state::Int\nend\n\nReader(io::TranscodingStream) = Reader{typeof(io)}(io, 1)\nReader(io::IO) = Reader(NoopStream(io))","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"The Reader contains an instance of TranscodingStream to read from, and stores the Automa state between records. The beginning state of Automa is always 1. We can now create our reader function like below. 
There are only three differences from the definitions in the previous section:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"I no longer have the code to decrement p in the :record action - because we can store the Automa state between records such that the machine can handle beginning in the middle of a record if necessary, there is no need to reset the value of p in order to restore the IO to the state right before each record.\nI return (cs, seq) instead of just seq, because I want to update the Automa state of the Reader, so when it reads the next record, it begins in the same state where the machine left off after the previous record.\nIn the arguments, I add start_state, and in the initcode I set cs to the start state, so the machine begins from the correct state.","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"actions = Dict{Symbol, Expr}(\n :mark_pos => :(@mark),\n :header => :(header = String(data[@markpos():p-1])),\n :seqline => :(append!(seqbuffer, data[@markpos():p-1])),\n :record => quote\n seq = Seq(header, String(seqbuffer))\n found_sequence = true\n @escape\n end\n)\n\ngenerate_reader(\n :read_record,\n machine;\n actions=actions,\n arguments=(:(start_state::Int),),\n initcode=quote\n seqbuffer = UInt8[]\n found_sequence = false\n header = \"\"\n cs = start_state\n end,\n loopcode=quote\n if (is_eof && p > p_end) || found_sequence\n @goto __return__\n end\n end,\n returncode=:(found_sequence ? 
(cs, seq) : throw(EOFError()))\n) |> eval","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"We then create a function that reads from the Reader, making sure to update the automa_state of the reader:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"function read_record(reader::Reader)\n (cs, seq) = read_record(reader.io, reader.automa_state)\n reader.automa_state = cs\n return seq\nend","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"Let's test it out:","category":"page"},{"location":"reader/","page":"Creating readers","title":"Creating readers","text":"julia> reader = Reader(IOBuffer(\">a\\nT\\n>tag\\nGAG\\nATATA\\n\"));\n\njulia> read_record(reader)\nSeq(\"a\", \"T\")\n\njulia> read_record(reader)\nSeq(\"tag\", \"GAGATATA\")\n\njulia> read_record(reader)\nERROR: EOFError: read end of file","category":"page"},{"location":"theory/#Theory-of-regular-expressions","page":"Theory","title":"Theory of regular expressions","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"Most programmers are familiar with regular expressions, or regex, for short. What many programmers don't know is that regex have a deep theoretical underpinning, which is leaned on by regex engines to produce highly efficient code.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Informally, a regular expression can be thought of as any pattern that can be constructed from the following atoms:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"The empty string is a valid regular expression, i.e. re\"\"\nLiteral matching of a single symbol from a finite alphabet, such as a character, i.e. 
re\"p\"","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Atoms can be combined with the following operations, if R and P are two regular expressions:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Alternation, i.e. R | P, meaning either match R or P.\nConcatenation, i.e. R * P, meaning match first R, then P.\nRepetition, i.e. R*, meaning match R zero or more times consecutively.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"note: Note\nIn Automa, the alphabet is bytes, i.e. 0x00:0xff, and so each symbol is a single byte. Multi-byte characters such as Æ are interpreted as the concatenation of two symbols, re\"\\xc3\" * re\"\\x86\". The fact that Automa considers one input to be one byte, not one character, can become relevant if you instruct Automa to complete an action \"on every input\".","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Popular regex libraries include more operations like ? and +. These can trivially be constructed from the above-mentioned primitives, i.e. R? is \"\" | R, and R+ is RR*.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Some implementations of regular expression engines, such as PCRE, which is the default in Julia as of Julia 1.8, also support operations like backreferences and lookbehind. 
These operations can NOT be constructed from the above atoms and axioms, meaning that PCRE expressions are not regular expressions in the theoretical sense.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"The practical importance of theoretically sound regular expressions is that there exist algorithms that can match regular expressions in O(N) time and O(1) space, whereas this is not true for PCRE expressions, which are therefore significantly slower.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"note: Note\nAutoma.jl only supports real regex, and as such does not support e.g. backreferences, in order to guarantee fast runtime performance.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"To match regex to strings, the regex are transformed to finite automata, which are then implemented in code.","category":"page"},{"location":"theory/#Nondeterministic-finite-automata","page":"Theory","title":"Nondeterministic finite automata","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"The programmer Ken Thompson, of Unix fame, devised Thompson's construction, an algorithm to construct a nondeterministic finite automaton (NFA) from a regex. An NFA can be thought of as a flowchart (or a directed graph), where one can move from node to node on directed edges. 
Edges are either labeled ϵ, in which case the machine can freely move through the edge to its destination node, or labeled with one or more input symbols, in which case the machine may traverse the edge upon consuming said input.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"To illustrate, let's look at one of the simplest regex: re\"a\", matching the letter a:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"You begin at the small dot on the right, then immediately go to state 1, the circle marked by a 1. By moving to the next state, state 2, you consume the next symbol from the input string, which must be the symbol marked on the edge from state 1 to state 2 (in this case, an a). Some states are \"accept states\", illustrated by a double circle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Each of the operations that combine regex can also combine NFAs. 
For example, given the two regex a and b, which correspond to the NFAs A and B, the regex a * b can be expressed with the following NFA:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Note the ϵ symbol on the edge - this signifies an \"epsilon transition\", meaning you move directly from A to B without consuming any symbols.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Similarly, a | b corresponds to this NFA structure...","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"...and a* to this:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"For a larger example, re\"(\\+|-)?(0|1)*\" combines alternation, concatenation and repetition and so looks like this:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"ϵ-transitions mean that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 12. That's what makes NFAs nondeterministic.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"In order to match a regex to a string then, the movement through the NFA must be emulated. You begin at state 1. When a non-ϵ edge is encountered, you consume a byte of the input data if it matches. If there are no edges that match your input, the string does not match. If an ϵ-edge is encountered from state A that leads to states B and C, the machine goes from state A to state {B, C}, i.e. 
in both states at once.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"For example, if the regex re\"(\\+|-)?(0|1)*\" visualized above is matched to the string -11, this is what happens:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"NFA starts in state 1\nNFA immediately moves to all states reachable via ϵ transition. It is now in state {3, 5, 7, 9, 10}.\nNFA sees input -. States {3, 5, 7, 10} do not have an edge with - leading out, so these states die. Therefore, the machine is in state 9, consumes the input, and moves to state 2.\nNFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 5, 7}\nNFA sees input 1, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 5, 7}\nThe above point repeats, and the NFA is still in state {3, 5, 7}\nInput ends. Since state 3 is an accept state, the string matches.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Using only a regex-to-NFA converter, you could create a simple regex engine just by emulating the NFA as above. The existence of ϵ transitions means the NFA can be in multiple states at once, which adds unwelcome complexity to the emulation and makes it slower. Luckily, every NFA has an equivalent deterministic finite automaton, which can be constructed from the NFA using the so-called powerset construction.","category":"page"},{"location":"theory/#Deterministic-finite-automata","page":"Theory","title":"Deterministic finite automata","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"Or DFAs, as they are called, are similar to NFAs, but do not contain ϵ-edges. This means that a given input string has either zero paths through the DFA (if it does not match the regex), or exactly one unambiguous path. 
In other words, every input symbol must trigger one unambiguous state transition from one state to one other state.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Let's visualize the DFA equivalent to the larger NFA above:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"It might not be obvious, but the DFA above accepts exactly the same inputs as the previous NFA. DFAs are way simpler to simulate in code than NFAs, precisely because at every state, for every input, there is exactly one action. DFAs can be simulated either using a lookup table of possible state transitions, or by hardcoding GOTO-statements from node to node when the correct input is matched. Code simulating DFAs can be ridiculously fast, with each state transition taking less than 1 nanosecond, if implemented well.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Furthermore, DFAs can be optimised. Two edges between the same nodes with labels A and B can be collapsed to a single edge with labels [AB], and redundant nodes can be collapsed. The optimised DFA equivalent to the one above is simply: ","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"(Image: )","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Unfortunately, as the name \"powerset construction\" hints, converting an NFA with N nodes may result in a DFA with up to 2^N nodes. This inconvenient fact drives important design decisions in regex implementations. There are basically two approaches:","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Automa.jl will just construct the DFA directly, and accept a worst-case complexity of O(2^N). 
This is acceptable (I think) for Automa, because this construction happens in Julia's package precompilation stage (not on package loading or usage), and because the DFAs are assumed to be constants within a package. So, if a developer accidentally writes an NFA which is unacceptably slow to convert to a DFA, it will be caught in development. Luckily, it's pretty rare to have NFAs that result in truly abysmally slow conversions to DFAs: While bad corner cases exist, they are rarely as catastrophic as the O(2^N) bound would suggest. Currently, Automa's regex/NFA/DFA compilation pipeline is very slow and unoptimized, but, since it happens during precompile time, it is insignificant compared to LLVM compile times.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Other implementations, like the popular ripgrep command line tool, use an adaptive approach. It constructs the DFA on the fly, as each symbol is being matched, and then caches the DFA. If the DFA size grows too large, the cache is flushed. If the cache is flushed too often, it falls back to simulating the NFA directly. Such an approach is necessary for ripgrep, because the regex -> NFA -> DFA compilation happens at runtime and must be near-instantaneous, unlike Automa, where it happens during package precompilation and can afford to be slow.","category":"page"},{"location":"theory/#Automa-in-a-nutshell","page":"Theory","title":"Automa in a nutshell","text":"","category":"section"},{"location":"theory/","page":"Theory","title":"Theory","text":"Automa simulates the DFA by having the DFA create a Julia Expr, which is then used to generate a Julia function using metaprogramming. 
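As a minimal sketch using only the documented API (the function name match_ab is made up here, and the exact shape of the generated code is an implementation detail):\n\ncode = generate_code(CodeGenContext(), compile(re\"ab*\")) # returns an Expr\n@eval function match_ab(data)\n # splice the generated DFA simulation into the function body;\n # it throws an error if data does not match re\"ab*\"\n $(code)\nend\n\n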
Like all other Julia code, this function is then optimized by Julia and then LLVM, making the DFA simulations very fast.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Because Automa just constructs Julia functions, we can do extra tricks that ordinary regex engines cannot: We can splice arbitrary Julia code into the DFA simulation. Currently, Automa supports two such kinds of code: actions and preconditions.","category":"page"},{"location":"theory/","page":"Theory","title":"Theory","text":"Actions are Julia code that is executed during certain state transitions. Preconditions are Julia code that evaluates to a Bool value and is checked before a state transition. If it evaluates to false, the transition is not taken.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"custom/#Customizing-Automa's-code-generation","page":"Customizing codegen","title":"Customizing Automa's code generation","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Automa offers a few ways of customising the created code. Note that the precise code generated by Automa is considered an implementation detail, and as such is subject to change without warning. Only the overall behavior, i.e. the \"DFA simulation\", can be considered stable.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Nonetheless, it is instructive to look at the code generated for the machine in the \"parsing from a buffer\" section. 
I present it here cleaned up and with comments for human inspection.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"# Initialize variables used in the code below\nbyte::UInt8 = 0x00\np::Int = 1\np_end::Int = sizeof(data)\np_eof::Int = p_end\ncs::Int = 1\n\n# Turn the input buffer into SizedMemory, to load data from pointer\nGC.@preserve data begin\nmem::Automa.SizedMemory = (Automa.SizedMemory)(data)\n\n# For every input byte:\nwhile p ≤ p_end && cs > 0\n # Load byte\n byte = mem[p]\n\n # Load the action to execute, if any, by looking up in a table\n # using the current state (cs) and byte\n @inbounds var\"##292\" = Int((Int8[0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0])[(cs - 1) << 8 + byte + 1])\n\n # Look up next state. If invalid input, next state is negative current state\n @inbounds cs = Int((Int8[-1 -2 … -5 -6; -1 -2 … -5 -6; … ; -1 -2 … -5 -6; -1 -2 … -5 -6])[(cs - 1) << 8 + byte + 1])\n\n # Check each possible action looked up above, and execute it\n # if it is not zero\n if var\"##292\" == 1\n pos = p\n elseif var\"##292\" == 2\n header = String(data[pos:p - 1])\n elseif var\"##292\" == 3\n append!(buffer, data[pos:p - 1])\n elseif var\"##292\" == 4\n seq = Seq(header, String(buffer))\n push!(seqs, seq)\n end\n\n # Increment position by 1\n p += 1\n\n # If we're at end of input, and the current state is an accept state:\n if p > p_eof ≥ 0 && cs > 0 && (cs < 65) & isodd(0x0000000000000021 >>> ((cs - 1) & 63))\n # What follows is a list of all possible EOF actions.\n\n # If state is state 6, execute the appropriate action\n # tied to reaching end of input at this state\n if cs == 6\n seq = Seq(header, String(buffer))\n push!(seqs, seq)\n cs = 0\n\n # Else, if the state is < 0, we have taken a bad input (see where cs was updated)\n # move position back by one to leave it stuck where it found bad input\n elseif cs < 0\n p -= 1\n end\n\n # If cs is not 0, the machine is in an error 
state.\n # Gather some information about machine state, then throw an error\n if cs != 0\n cs = -(abs(cs))\n var\"##291\" = if p_eof > -1 && p > p_eof\n nothing\n else\n byte\n end\n Automa.throw_input_error($machine, -cs, var\"##291\", mem, p)\n end\nend\nend # GC.@preserve","category":"page"},{"location":"custom/#Using-CodeGenContext","page":"Customizing codegen","title":"Using CodeGenContext","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The CodeGenContext (or ctx, for short) struct is a collection of settings used to customize code creation. If not passed to the code generator functions, a default CodeGenContext is used.","category":"page"},{"location":"custom/#Variable-names","page":"Customizing codegen","title":"Variable names","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"One obvious place to customize is variable names. In the code above, for example, the input bytes are named byte. What if you have another variable with that name?","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The ctx contains a .vars field with a Variables object, which is just a collection of names used in generated code. 
For example, to rename byte to u8 in the generated code, you first create the appropriate ctx, then use the ctx to make the code.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"ctx = CodeGenContext(vars=Automa.Variables(byte=:u8))\ncode = generate_code(ctx, machine, actions)","category":"page"},{"location":"custom/#Other-options","page":"Customizing codegen","title":"Other options","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The clean option strips most linenumber information from the generated code, if set to true.\ngetbyte is a function that is called as getbyte(data, p) to obtain the byte in the main loop. This is usually just Base.getindex, but can be customised to be an arbitrary function.","category":"page"},{"location":"custom/#Code-generator","page":"Customizing codegen","title":"Code generator","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The code shown at the top of this page is code made with the table code generator. Automa also supports creating code using the goto code generator instead of the default table generator. 
The goto generator creates code with the following properties:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"It is much harder to read than table code\nThe code is much larger\nIt does not use bounds checking\nIt does not allow customizing getbyte\nIt is much faster than the table generator","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Normally, the table generator is good enough, but for performance-sensitive applications, the goto generator can be used.","category":"page"},{"location":"custom/#Optimising-the-previous-example","page":"Customizing codegen","title":"Optimising the previous example","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Let's try optimising the previous FASTA parsing example. My original code did 300 MB/s.","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"To recap, the Machine was:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"machine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"The first improvement is to the algorithm itself: Instead of parsing to a vector of Seq, I'm simply going to index the input data, filling up an existing vector of:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"struct SeqPos\n offset::Int\n hlen::Int32\n slen::Int32\nend","category":"page"},{"location":"custom/","page":"Customizing 
codegen","title":"Customizing codegen","text":"The idea here is to remove as many allocations as possible. This will more accurately show the speed of the DFA simulation, which is now the bottleneck. The actions will therefore be: ","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"actions = Dict(\n :mark_pos => :(pos = p),\n :header => :(hlen = p - pos),\n :seqline => :(slen += p - pos),\n :record => quote\n seqpos = SeqPos(offset, hlen, slen)\n nseqs += 1\n seqs[nseqs] = seqpos\n offset += hlen + slen\n slen = 0\n end\n);\n\n@assert actions isa Dict","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"With the new variables such as slen, we need to update the function code as well:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"@eval function parse_fasta(data)\n pos = slen = hlen = offset = nseqs = 0\n seqs = Vector{SeqPos}(undef, 400000)\n $(generate_code(machine, actions))\n return seqs\nend","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"This parses a 45 MB file in about 100 ms on my laptop, that's 450 MB/s. 
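(Such throughput numbers are machine-dependent; assuming the BenchmarkTools package is available, a sketch for reproducing them is using BenchmarkTools; @btime parse_fasta($data), where data holds the file contents as a String or byte vector.) 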
Now let's try the exact same thing, except with the code being generated by:","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"$(generate_code(CodeGenContext(generator=:goto), machine, actions))","category":"page"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Now the code parses the same 45 MB FASTA file in 11.14 milliseconds, parsing at about 4 GB/s.","category":"page"},{"location":"custom/#Reference","page":"Customizing codegen","title":"Reference","text":"","category":"section"},{"location":"custom/","page":"Customizing codegen","title":"Customizing codegen","text":"Automa.CodeGenContext\nAutoma.Variables","category":"page"},{"location":"custom/#Automa.CodeGenContext","page":"Customizing codegen","title":"Automa.CodeGenContext","text":"CodeGenContext(;\n vars=Variables(:p, :p_end, :is_eof, :cs, :data, :mem, :byte, :buffer),\n generator=:table,\n getbyte=Base.getindex,\n clean=false\n)\n\nCreate a CodeGenContext (ctx), a struct that stores options for Automa code generation. Ctxs are used for Automa's various code generator functions. They currently take the following options (more may be added in future versions)\n\nvars::Variables: variable names used in generated code. See the Variables struct.\ngenerator::Symbol: code generator mechanism (:table or :goto). The table generator creates smaller, simpler code that uses a vector of integers to determine state transitions. The goto-generator uses a maze of @goto-statements, and creates larger, more complex code that is faster.\ngetbyte::Function (table generator only): function f(data, p) to access byte from data. 
Default: Base.getindex.\nclean: Whether to remove some QuoteNodes (line information) from the generated code\n\nExample\n\njulia> ctx = CodeGenContext(generator=:goto, vars=Variables(buffer=:tbuffer));\n\njulia> generate_code(ctx, compile(re\"a+\")) isa Expr\ntrue\n\n\n\n\n\n","category":"type"},{"location":"custom/#Automa.Variables","page":"Customizing codegen","title":"Automa.Variables","text":"Struct used to store variable names used in generated code. Contained in a CodeGenContext. Create a custom Variables for your CodeGenContext if you want to customize the variables used in Automa codegen, typically if you have conflicting variables with the same name.\n\nAutoma generates code with the following variables, shown below with their default names:\n\np::Int: current position of data\np_end::Int: end position of data\nis_eof::Bool: Whether p_end marks the end of the file stream\ncs::Int: current state\ndata::Any: input data\nmem::SizedMemory: Memory wrapping data\nbyte::UInt8: current byte being read from data\nbuffer::TranscodingStreams.Buffer: (generate_reader only)\n\nExample\n\njulia> ctx = CodeGenContext(vars=Variables(byte=:u8));\n\njulia> ctx.vars.byte\n:u8\n\n\n\n\n\n","category":"type"},{"location":"regex/","page":"Regex","title":"Regex","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"regex/#Regex","page":"Regex","title":"Regex","text":"","category":"section"},{"location":"regex/","page":"Regex","title":"Regex","text":"Automa regex (of the type Automa.RE) are conceptually similar to the Julia built-in regex. They are made using the @re_str macro, like this: re\"ABC[DEF]\".","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"Automa regex match individual bytes, not characters. 
Hence, re\"Æ\" (with the UTF-8 encoding [0xc3, 0x86]) is equivalent to re\"\\xc3\\x86\", and is considered the concatenation of two independent input bytes.","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"The @re_str macro supports the following content:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"Literal symbols, such as re\"ABC\", re\"\\xfe\\xa2\" or re\"Ø\"\n| for alternation, as in re\"A|B\", meaning \"A or B\". \nByte sets with [], like re\"[ABC]\". This means any of the bytes in the brackets, e.g. re\"[ABC]\" is equivalent to re\"A|B|C\".\nInverted byte sets, e.g. re\"[^ABC]\", meaning any byte except those in re\"[ABC]\".\nRepetition, with X* meaning zero or more repetitions of X\n+, where X+ means XX*, i.e. 1 or more repetitions of X\n?, where X? means X | \"\", i.e. 0 or 1 occurrences of X. It applies to the last element of the regex\nParentheses to group expressions, like in A(B|C)?","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"You can combine regex with the following operations:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"* for concatenation, with re\"A\" * re\"B\" being the same as re\"AB\". Regex can also be concatenated with Chars and Strings, which will cause the chars/strings to be converted to regex first.\n| for alternation, with re\"A\" | re\"B\" being the same as re\"A|B\"\n& for intersection of regex, i.e. for regex A and B, the set of inputs matching A & B is exactly the intersection of the inputs matching A and those matching B. As an example, re\"A[AB]C+D?\" & re\"[ABC]+\" is re\"ABC\".\n\\ for difference, such that for regex A and B, A \\ B creates a new regex matching all those inputs that match A but not B.\n! for inversion, such that !re\"[A-Z]\" matches all strings other than those which match re\"[A-Z]\". Note that !re\"a\" also matches e.g. 
\"aa\", since this does not match re\"a\".","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"Finally, the functions opt, rep and rep1 are equivalent to the operators ?, * and +, so e.g. opt(re\"a\" * rep(re\"b\") * re\"c\") is equivalent to re\"(ab*c)?\".","category":"page"},{"location":"regex/#Example","page":"Regex","title":"Example","text":"","category":"section"},{"location":"regex/","page":"Regex","title":"Regex","text":"Suppose we want to create a regex that matches a simplified version of the FASTA format. This \"simple FASTA\" format is defined like so:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"The format is a series of zero or more records, concatenated\nA record consists of the concatenation of:\nA leading '>'\nA header, composed of one or more letters in 'a-z',\nA newline symbol '\\n'\nA series of one or more sequence lines\nA sequence line is the concatenation of:\nOne or more symbols from the alphabet [ACGT]\nA newline","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"We can represent this concisely as a regex: re\"(>[a-z]+\\n([ACGT]+\\n)+)*\". To make it easier to read, we typically construct regex incrementally, like so:","category":"page"},{"location":"regex/","page":"Regex","title":"Regex","text":"fasta_regex = let\n header = re\"[a-z]+\"\n seqline = re\"[ACGT]+\"\n record = '>' * header * '\\n' * rep1(seqline * '\\n')\n rep(record)\nend\n@assert fasta_regex isa RE","category":"page"},{"location":"regex/#Reference","page":"Regex","title":"Reference","text":"","category":"section"},{"location":"regex/","page":"Regex","title":"Regex","text":"RE\n@re_str","category":"page"},{"location":"regex/#Automa.RegExp.RE","page":"Regex","title":"Automa.RegExp.RE","text":"RE(s::AbstractString)\n\nAutoma regular expression (regex) that is used to match a sequence of input bytes. Regex should preferentially be constructed using the @re_str macro: re\"ab+c?\". 
Regex can be combined with other regex, strings or chars with *, |, & and \\:\n\na * b matches inputs that match first a, then b\na | b matches inputs that match a or b\na & b matches inputs that match a and b\na \\ b matches inputs that match a but not b\n!a matches all inputs that do not match a.\n\nSet actions to regex with onenter!, onexit!, onall! and onfinal!, and preconditions with precond!.\n\nExample\n\njulia> regex = (re\"a*b?\" | opt('c')) * re\"[a-z]+\";\n\njulia> regex = rep1((regex \\ \"aba\") & !re\"ca\");\n\njulia> regex isa RE\ntrue\n\njulia> compile(regex) isa Automa.Machine\ntrue\n\nSee also: [@re_str](@ref), [@compile](@ref)\n\n\n\n\n\n","category":"type"},{"location":"regex/#Automa.RegExp.@re_str","page":"Regex","title":"Automa.RegExp.@re_str","text":"@re_str -> RE\n\nConstruct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re\"\\\\\" means a single backslash, and so does re\"\\\\\\\\\", while re\"\\\\\\\\\\\"\" means a backslash, then a quote character.\n\nExamples:\n\njulia> re\"ab?c*[def][^ghi]+\" isa RE\ntrue \n\nSee also: RE\n\n\n\n\n\n","category":"macro"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"#Automa.jl","page":"Home","title":"Automa.jl","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Automa is a regex-to-Julia compiler. By compiling regex to Julia code in the form of Expr objects, Automa provides facilities to create efficient and robust regex-based lexers, tokenizers and parsers using Julia's metaprogramming capabilities. 
You can view Automa as a regex engine that can insert arbitrary Julia code into its input matching process, which will be executed when certain parts of the regex match an input.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: Schema of Automa.jl)","category":"page"},{"location":"","page":"Home","title":"Home","text":"Automa.jl is designed to generate very efficient code to scan large text data, which is often much faster than handcrafted code. It is a regex engine that can insert arbitrary Julia code into its input matching process, which will be executed when certain parts of the regex match an input.","category":"page"},{"location":"#Where-to-start","page":"Home","title":"Where to start","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"If you're not familiar with regex engines, start by reading the theory section, then you might want to read every section from the top. They're structured like a tutorial, beginning from the simplest use of Automa and moving to more advanced uses.","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you'd like to dive straight in, you might want to start by reading the examples below, then go through the examples in the examples/ directory in the Automa repository.","category":"page"},{"location":"#Examples","page":"Home","title":"Examples","text":"","category":"section"},{"location":"#Validate-some-text-only-is-composed-of-ASCII-alphanumeric-characters","page":"Home","title":"Validate some text only is composed of ASCII alphanumeric characters","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"generate_buffer_validator(:validate_alphanumeric, re\"[a-zA-Z0-9]*\") |> eval\n\nfor s in [\"abc\", \"aU81m\", \"!,>\"]\n println(\"$s is alphanumeric? 
$(isnothing(validate_alphanumeric(s)))\")\nend","category":"page"},{"location":"#Making-a-lexer","page":"Home","title":"Making a lexer","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"tokens = [\n :identifier => re\"[A-Za-z_][0-9A-Za-z_!]*\",\n :lparens => re\"\\(\",\n :rparens => re\"\\)\",\n :comma => re\",\",\n :quot => re\"\\\"\",\n :space => re\"[\\t\\f ]+\",\n];\n@eval @enum Token errortoken $(first.(tokens)...)\nmake_tokenizer((errortoken, \n [Token(i) => j for (i,j) in enumerate(last.(tokens))]\n)) |> eval\n\ncollect(tokenize(Token, \"\"\"(alpha, \"beta15\")\"\"\"))","category":"page"},{"location":"#Make-a-simple-TSV-file-parser","page":"Home","title":"Make a simple TSV file parser","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"machine = let\n name = onexit!(onenter!(re\"[^\\t\\r\\n]+\", :mark), :name)\n field = onexit!(onenter!(re\"[^\\t\\r\\n]+\", :mark), :field)\n nameline = name * rep('\\t' * name)\n record = onexit!(field * rep('\\t' * field), :record)\n compile(nameline * re\"\\r?\\n\" * record * rep(re\"\\r?\\n\" * record) * rep(re\"\\r?\\n\"))\nend\n\nactions = Dict(\n :mark => :(pos = p),\n :name => :(push!(headers, String(data[pos:p-1]))),\n :field => quote\n n_fields += 1\n push!(fields, String(data[pos:p-1]))\n end,\n :record => quote\n n_fields == length(headers) || error(\"Malformed TSV\")\n n_fields = 0\n end\n)\n\n@eval function parse_tsv(data)\n headers = String[]\n fields = String[]\n pos = n_fields = 0\n $(generate_code(machine, actions))\n (headers, reshape(fields, length(headers), :))\nend\n\nheader, data = parse_tsv(\"a\\tabc\\n12\\t13\\r\\nxyc\\tz\\n\\n\")","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"io/#Parsing-from-an-IO","page":"Parsing IOs","title":"Parsing from an 
IO","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Some file types are gigabytes or tens of gigabytes in size. For these files, parsing from a buffer may be impractical, as it requires reading the entire file into memory at once. Automa solves this by hooking into TranscodingStreams.jl, a package that provides a wrapper IO of the type TranscodingStream. Importantly, these streams buffer their input data. Automa is thus able to operate directly on the input buffers of TranscodingStream objects.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Unfortunately, this significantly complicates things compared to parsing from a simple buffer. The main problem is that, when reading from a buffered stream, the byte array visible from Automa is only a small slice of the total input data. Worse, when the end of the stream is reached, data from the buffer is flushed, i.e. removed from the stream. To handle this, Automa must reach deep into the implementation details of TranscodingStreams, and also break some of its own abstractions. It's not pretty, but it's what we have.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Practically speaking, parsing from IO is done with the function Automa.generate_reader. Despite its name, this function is NOT directly used to generate objects like FASTA.Reader. Instead, this function produces Julia code (an Expr object) that, when evaluated, defines a function that can execute an Automa machine on an IO. Let me first show the code generated by generate_reader in pseudocode format:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"function { function name }(stream::TranscodingStream, { args... 
})\n { init code }\n\n @label __exec__\n\n p = current buffer position\n p_end = final buffer position\n\n # the eof call below will first flush any used data from buffer,\n # then load in new data, before checking if it's really eof.\n is_eof = eof(stream)\n execute normal automa parsing of the buffer\n update buffer position to match p\n\n { loop code }\n\n if cs < 0 # meaning: erroneous input or erroneous EOF\n { error code }\n end\n\n if machine errored or reached EOF\n @label __return__\n { return code }\n end\n @goto __exec__\nend","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The parts marked { function name }, { args... }, { init code }, { loop code }, { error code } and { return code } are arguments provided to Automa.generate_reader. By providing these, the user can customize the generated function further.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The main difference from the code generated to parse a buffer is the label/GOTO pair __exec__, which causes Automa to repeatedly load data into the buffer, execute the machine, flush used data from the buffer, and so on, until interrupted.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Importantly, when parsing from an IO, p and p_end refer to positions in the current buffer. These may not be positions in the stream, and when the data in the buffer is flushed, the data in the buffer may move, so that p now becomes invalid. This means you can't simply store a variable marked_pos that points to the current value of p and expect that the same data is at that position later. 
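To make the pitfall concrete, here is a sketch of the unsafe pattern (the actions_unsafe dict and the variable marked_pos are invented for illustration; this is what not to do):\n\n# WRONG: sketch of the unsafe pattern described above.\nactions_unsafe = Dict{Symbol, Expr}(\n # Stores a raw buffer index. If the stream's buffer is flushed and\n # refilled between this action and :header, the bytes that were at\n # `marked_pos` may have moved or been discarded.\n :mark_pos => :(marked_pos = p),\n # This read can then be wrong, or even out of bounds:\n :header => :(header = String(data[marked_pos:p-1])),\n)\n\nThe @mark() and @markpos() pseudomacros described later in this section exist precisely to avoid this problem. 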
Furthermore, is_eof is set to whether the stream has reached EOF.","category":"page"},{"location":"io/#Example-use","page":"Parsing IOs","title":"Example use","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Let's show the simplest possible example of such a function. We have a Machine (which, recall, is a compiled regex) called machine, and we want to make a function that returns true if a given IO contains data that conforms to the regex format specified by the Machine.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"We will still use the machine from before, just without any actions:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"machine = let\n header = re\"[a-z]+\"\n seqline = re\"[ACGT]+\"\n record = re\">\" * header * '\\n' * rep1(seqline * '\\n')\n compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"To create our simple IO reader, we simply need to call generate_reader, where the { return code } is a check of iszero(cs), i.e. whether the machine exited at a proper exit state. We also need to set error_code to a custom expression in order to prevent throwing an error on invalid input. Instead, we want it to go immediately to return - we call this section __return__, so we need to @goto __return__. 
Then, we need to evaluate the code created by generate_reader in order to define the function validate_fasta:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"julia> return_code = :(iszero(cs));\n\njulia> error_code = :(@goto __return__);\n\njulia> eval(generate_reader(:validate_fasta, machine; returncode=return_code, errorcode=error_code));","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The generated function validate_fasta has the function signature: validate_fasta(stream::TranscodingStream). If our input IO is not a TranscodingStream, we can wrap it in the relatively lightweight NoopStream, which, as the name suggests, does nothing to the data:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"julia> io = NoopStream(IOBuffer(\">a\\nTAG\\nTA\\n>bac\\nG\\n\"));\n\njulia> validate_fasta(io)\ntrue\n\njulia> validate_fasta(NoopStream(IOBuffer(\"random data\")))\nfalse","category":"page"},{"location":"io/#Reading-a-single-record","page":"Parsing IOs","title":"Reading a single record","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"danger: Danger\nThe following code is only for demonstration purposes. It has one important flaw, which will be addressed in a later section, so do not copy-paste it for serious work.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"There are a few more subtleties related to the generate_reader function. Suppose we instead want to create a function that reads a single FASTA record from an IO. In this case, it's no good that the function created from generate_reader will loop until the IO reaches EOF - we need to find a way to stop it after reading a single record. 
We can do this with the pseudomacro @escape, as shown below.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"We will reuse our Seq struct and our Machine from the \"parsing from a buffer\" section of this tutorial:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"struct Seq\n name::String\n seq::String\nend\n\nmachine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\nend\n@assert machine isa Automa.Machine\n\n# output","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"The code below contains @escape in the :record action - meaning: Break out of machine execution.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"actions = Dict{Symbol, Expr}(\n :mark_pos => :(pos = p),\n :header => :(header = String(data[pos:p-1])),\n :seqline => :(append!(seqbuffer, data[pos:p-1])),\n\n # Only this action is different from before!\n :record => quote\n seq = Seq(header, String(seqbuffer))\n found_sequence = true\n # Reset p one byte if we're not at the end\n p -= !(is_eof && p > p_end)\n @escape\n end\n)\n@assert actions isa Dict\n\n# output","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"@escape is not actually a real macro, but what Automa calls a \"pseudomacro\". It is expanded during Automa's own compiler pass before Julia's lowering. 
The @escape pseudomacro is replaced with code that breaks out of the executing machine, without reaching EOF or an invalid byte.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Let's see how I use generate_reader, then I will explain each part:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"generate_reader(\n :read_record,\n machine;\n actions=actions,\n initcode=quote\n seqbuffer = UInt8[]\n pos = 0\n found_sequence = false\n header = \"\"\n end,\n loopcode=quote\n if (is_eof && p > p_end) || found_sequence\n @goto __return__\n end\n end,\n returncode=:(found_sequence ? seq : nothing)\n) |> eval","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"In the :record action, a few new things happen.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"First, I set the flag found_sequence = true. In the loop code, I look for this flag to signal that the function should return. Remember, the loop code happens after machine execution, which can mean either that the execution was broken out of by @escape, or that the buffer ran out and needs to be refilled. I could just return the sequence directly in the action, but then I would skip a bunch of the code generated by generate_reader which sets the buffer state correctly, so this is never advised. Instead, in the loop code, which executes after the buffer has been flushed, I check for this flag, and go to __return__ if necessary. I could also just return directly in the loopcode, but I prefer only having one place to return from the function.\nI use @escape to break out of the machine, i.e. stop machine execution\nFinally, I decrement p, if and only if the machine has not reached EOF (which happens when is_eof is true, meaning the last part of the IO has been buffered, and p > p_end, meaning the end of the buffer has been reached). 
This is because the first record ends when the IO reads the second > symbol. If I then were to read another record from the same IO, I would have already read the > symbol. I need to reset p by 1, so the > is also read on the next call to read_record.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"I can use the function like this:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"julia> io = NoopStream(IOBuffer(\">a\\nT\\n>tag\\nGAGA\\nTATA\\n\"));\n\njulia> read_record(io)\nSeq(\"a\", \"T\")\n\njulia> read_record(io)\nSeq(\"tag\", \"GAGATATA\")\n\njulia> read_record(io)","category":"page"},{"location":"io/#Preserving-data-by-marking-the-buffer","page":"Parsing IOs","title":"Preserving data by marking the buffer","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"There is an important problem with the implementation above. The following code in my actions dict:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"header = String(data[pos:p-1])","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Creates header by accessing the data buffer. However, when reading an IO, how can I know that the data hasn't shifted around in the buffer since I defined pos? For example, suppose we have a short buffer of only 8 bytes, and the following FASTA file: >abcdefghijkl\\nA. Then, the buffer is first filled with >abcdefg. When entering the header, I execute the action :mark_position at p = 2, so pos = 2. But now, when I reach the end of the header, the used data in the buffer has been flushed, and the data is now: hijkl\\nA, and p = 14. I then try to access data[2:13], which is out of bounds!","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Luckily, the buffers of TranscodingStreams allow us to \"mark\" a position to save it. 
The buffer will not flush the marked position, or any position after the marked position. If necessary, it will resize the buffer to be able to load more data while keeping the marked position.","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Inside the function generated by generate_reader, we can use the zero-argument pseudomacro @mark(), which marks the position p. The macro @markpos() can then be used to get the marked position, which will point to the same data in the buffer, even after the data in the buffer has been shifted by a flush. This works because the mark is stored inside the TranscodingStream buffer, and the buffer makes sure to update the mark if the content moves. Hence, we can re-write the actions:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"actions = Dict{Symbol, Expr}(\n :mark_position => :(@mark),\n :header => :(header = String(data[@markpos():p-1])),\n :seqline => :(append!(buffer, data[@markpos():p-1])),\n\n [:record action omitted...]\n)","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"In our example above with the small 8-byte buffer, this is what would happen: First, the buffer contains the first 8 bytes. When p = 2, the mark is set, and the second byte is marked:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: >abcdefg\nmark: ^\np = 2 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Then, when p = 9, the buffer is exhausted and the used data is removed, BUT the mark stays, so byte 2 is preserved, and only the first byte is removed. The code in generate_reader loops around to @label __exec__, which sets p to the current buffer position. 
The buffer now looks like this:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: abcdefgh\nmark: ^\np = 8 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Only 1 byte was cleared, so when p = 9, the buffer will be exhausted again. This time, no data can be cleared, so instead, the buffer is resized to fit more data:","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: abcdefghijkl\\nA\nmark: ^\np = 9 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Finally, when we reach the newline at p = 13, the whole header is in the buffer, and so data[@markpos():p-1] will correctly refer to the header (now, 1:12).","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"content: abcdefghijkl\\nA\nmark: ^\np = 13 ^","category":"page"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Remember to update the mark, or to clear it with @unmark() in order to be able to flush data from the buffer afterwards.","category":"page"},{"location":"io/#Reference","page":"Parsing IOs","title":"Reference","text":"","category":"section"},{"location":"io/","page":"Parsing IOs","title":"Parsing IOs","text":"Automa.generate_reader\nAutoma.@escape\nAutoma.@mark\nAutoma.@unmark\nAutoma.@markpos\nAutoma.@bufferpos\nAutoma.@relpos\nAutoma.@abspos\nAutoma.@setbuffer","category":"page"},{"location":"io/#Automa.generate_reader","page":"Parsing IOs","title":"Automa.generate_reader","text":"generate_reader(funcname::Symbol, machine::Automa.Machine; kwargs...)\n\nGenerate a streaming reader function of the name funcname from machine.\n\nThe generated function consumes data from a stream passed as the first argument and executes the machine while filling the data buffer.\n\nThis function returns an expression object of the generated function. 
The user needs to evaluate it in a module in which the generated function is needed.\n\nKeyword Arguments\n\narguments: Additional arguments funcname will take (default: ()). The default signature of the generated function is (stream::TranscodingStream,), but it is possible to supply more arguments to the signature with this keyword argument.\ncontext: Automa's code generator context (default: Automa.CodeGenContext()).\nactions: A dictionary of action code (default: Dict{Symbol,Expr}()).\ninitcode: Initialization code (default: :()).\nloopcode: Loop code (default: :()).\nreturncode: Return code (default: :(return cs)).\nerrorcode: Executed if cs < 0 after loopcode (default error message)\n\nSee the source code of this function to see what the generated code looks like.\n\n\n\n\n\n","category":"function"},{"location":"io/#Automa.@escape","page":"Parsing IOs","title":"Automa.@escape","text":"@escape()\n\nPseudomacro. When encountered during Machine execution, the machine will stop executing. This is useful to interrupt the parsing process, for example to emit a record during parsing of a larger file. p will be advanced as normal, so if @escape is hit on B during parsing of \"ABC\", the next byte will be C.\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@mark","page":"Parsing IOs","title":"Automa.@mark","text":"@mark()\n\nPseudomacro, to be used with IO-parsing Automa functions. This macro will \"mark\" the position of p in the current buffer. The marked position will not be flushed from the buffer after being consumed. For example, Automa code can call @mark() at the beginning of a large string, then when the string is exited at position p, it is guaranteed that the whole string resides in the buffer at positions markpos():p-1.\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@unmark","page":"Parsing IOs","title":"Automa.@unmark","text":"unmark()\n\nPseudomacro. Removes the mark from the buffer. 
This allows all previous data to be cleared from the buffer.\n\nSee also: @mark, @markpos\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@markpos","page":"Parsing IOs","title":"Automa.@markpos","text":"markpos()\n\nPseudomacro. Get the position of the mark in the buffer.\n\nSee also: @mark, @unmark\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@bufferpos","page":"Parsing IOs","title":"Automa.@bufferpos","text":"bufferpos()\n\nPseudomacro. Returns the integer position of the current TranscodingStreams buffer (only used with the generate_reader function).\n\nExample\n\n# Inside some Automa action code\n@setbuffer()\ndescription = sub_parser(stream)\np = @bufferpos()\n\nSee also: @setbuffer\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@relpos","page":"Parsing IOs","title":"Automa.@relpos","text":"relpos(p)\n\nAutoma pseudomacro. Return the position of p relative to @markpos(). Equivalent to p - @markpos() + 1. This can be used to mark additional points in the stream when the mark is set, after which their absolute position can be retrieved using @abspos(x).\n\nExample usage:\n\n# In one action\nidentifier_pos = @relpos(p)\n\n# Later, in a different action\nidentifier = data[@abspos(identifier_pos):p]\n\nSee also: @abspos\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@abspos","page":"Parsing IOs","title":"Automa.@abspos","text":"abspos(p)\n\nAutoma pseudomacro. Used to obtain the actual position of a relative position obtained from @relpos. See @relpos for more details.\n\n\n\n\n\n","category":"macro"},{"location":"io/#Automa.@setbuffer","page":"Parsing IOs","title":"Automa.@setbuffer","text":"setbuffer()\n\nUpdates the buffer position to match p. The buffer position is synchronized with p before and after calls to functions generated by generate_reader. 
@setbuffer() can be used to update the buffer position before calling another parser.\n\nExample\n\n# Inside some Automa action code\n@setbuffer()\ndescription = sub_parser(stream)\np = @bufferpos()\n\nSee also: @bufferpos\n\n\n\n\n\n","category":"macro"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"CurrentModule = Automa\nDocTestSetup = quote\n using TranscodingStreams\n using Automa\nend","category":"page"},{"location":"parser/#Parsing-from-a-buffer","page":"Parsing buffers","title":"Parsing from a buffer","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Automa can leverage metaprogramming to combine regex and Julia code to create parsers. This is significantly more difficult than simply using validators or tokenizers, but still simpler than parsing from an IO. Currently, Automa loads data through pointers, and therefore needs data backed by Array{UInt8} or String or similar - it does not work with types such as UnitRange{UInt8}. Furthermore, be careful about passing strided views to Automa - while Automa can extract a pointer from a strided view, it will always advance the pointer one byte at a time, disregarding the view's stride.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"As an example, let's use the simplified FASTA format introduced in the regex section, with the following format: re\"(>[a-z]+\\n([ACGT]+\\n)+)*\". 
We want to parse it into a Vector{Seq}, where Seq is defined as:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> struct Seq\n name::String\n seq::String\n end","category":"page"},{"location":"parser/#Adding-actions-to-regex","page":"Parsing buffers","title":"Adding actions to regex","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"To do this, we need to inject Julia code into the regex validator while it is running. The first step is to add actions to our regex: These are simply names of Julia expressions to splice in, where the expressions will be executed when the regex is matched. We can choose the names arbitrarily.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Currently, actions can be added in the following places in a regex:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"With onenter!, meaning it will be executed when reading the first byte of the regex\nWith onfinal!, where it will be executed when reading the last byte of the regex. 
Note that it's not possible to determine the final byte for some regex like re\"X+\", since the machine reads only 1 byte at a time and cannot look ahead.\nWith onexit!, meaning it will be executed on reading the first byte AFTER the regex, or when exiting the regex by encountering the end of input (only for a regex match, not an unexpected end of input)\nWith onall!, where it will be executed when reading every byte that is part of the regex.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"You can set the actions to be a single action name (represented by a Symbol), or a list of action names:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> my_regex = re\"ABC\";\n\njulia> onenter!(my_regex, [:action_a, :action_b]);\n\njulia> onexit!(my_regex, :action_c);","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"In which case the code named action_a, then that named action_b, will be executed in order when entering the regex, and the code named action_c will be executed when exiting the regex.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The onenter! 
etc functions return the regex they modify, so the above can be written:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> my_regex = onexit!(onenter!(re\"ABC\", [:action_a, :action_b]), :action_c);\n\njulia> my_regex isa RE\ntrue","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"When the following regex's actions are visualized in its corresponding DFA:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"regex = let\n ab = re\"ab*\"\n onenter!(ab, :enter_ab)\n onexit!(ab, :exit_ab)\n onfinal!(ab, :final_ab)\n onall!(ab, :all_ab)\n c = re\"c\"\n onenter!(c, :enter_c)\n onexit!(c, :exit_c)\n onfinal!(c, :final_c)\n\n ab * c\nend","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The resulting DFA is shown below. Here, the edge labeled 'a'/enter_ab,all_ab,final_ab means that the edge consumes input byte 'a', and executes the three actions enter_ab, all_ab and final_ab, in that order.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"(Image: Visualization of regex with actions)","category":"page"},{"location":"parser/#Compiling-regex-to-Machines","page":"Parsing buffers","title":"Compiling regex to Machines","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"In order to create code, the regex must first be compiled to a Machine, which is a struct that represents an optimised DFA. We can do that with compile(regex). Under the hood, this compiles the regex to an NFA, then compiles the NFA to a DFA, and then optimises the DFA to a Machine (see the section on Automa theory).","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Normally, we don't care about the regex directly, but only want the Machine. 
So, it is idiomatic to compile the regex in the same let statement it is being built in:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"machine = let\n header = re\"[a-z]+\"\n seqline = re\"[ACGT]+\"\n record = re\">\" * header * '\\n' * rep1(seqline * '\\n')\n compile(rep(record))\nend\n@assert machine isa Automa.Machine","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Note that, if this code is placed at top level in a package, the regex will be constructed and compiled to a Machine during package precompilation, which greatly helps load times.","category":"page"},{"location":"parser/#Creating-our-parser","page":"Parsing buffers","title":"Creating our parser","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"However, in this case, we don't just need a Machine with the regex, we need a Machine with the regex containing the relevant actions. To parse a simplified FASTA file into a Vector{Seq}, I'm using these four actions:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"When the machine enters the header, or a sequence line, I want it to mark the position where it entered the regex. The marked position will be used as the leftmost position where the header or sequence is extracted later.\nWhen exiting the header, I want to extract the bytes from the marked position in the action above, to the last header byte (i.e. 
the byte before the current byte), and use these bytes as the sequence header\nWhen exiting a sequence line, I want to do the same: Extract from the marked position to one before the current position, but this time I want to append the current line to a buffer containing all the lines of the sequence\nWhen exiting a record, I want to construct a Seq object from the header bytes and the buffer with all the sequence lines, then push the Seq to the result.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * rep1(seqline * '\\n'), :record)\n compile(rep(record))\n end;","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can now write the code we want executed. When writing this code, we want access to a few variables used by the machine simulation. For example, we might want to know at which byte position the machine is when an action is executed. 
Currently, the following variables are accessible in the code:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"byte: The current input byte as a UInt8\np: The 1-indexed position of byte in the buffer\np_end: The length of the input buffer\nis_eof: Whether the machine has reached the end of the input.\ncs: The current state of the machine, as an integer\ndata: The input buffer\nmem: The memory being read from, an Automa.SizedMemory object containing a pointer and a length","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The actions we want executed, we place in a Dict{Symbol, Expr}:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> actions = Dict(\n :mark_pos => :(pos = p),\n :header => :(header = String(data[pos:p-1])),\n :seqline => :(append!(buffer, data[pos:p-1])),\n :record => :(push!(seqs, Seq(header, String(buffer))))\n );","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"For multi-line Expr, you can construct them with quote ... end blocks.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can now construct a function that parses our data. In the code written in the action dict above, besides the variables defined for us by Automa, we also refer to the variables buffer, header, pos and seqs. Some of these variables are defined in the code above (for example, in the :(pos = p) expression), but we can't necessarily control the order in which Automa will insert these expressions into our final function. 
Hence, let's initialize these variables at the top of the function we generate, such that we know for sure they are defined whenever they are used.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The code itself is generated using generate_code:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> @eval function parse_fasta(data)\n pos = 0\n buffer = UInt8[]\n seqs = Seq[]\n header = \"\"\n $(generate_code(machine, actions))\n return seqs\n end\nparse_fasta (generic function with 1 method)","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can now use it:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> parse_fasta(\">abc\\nTAGA\\nAAGA\\n>header\\nAAAG\\nGGCG\\n\")\n2-element Vector{Seq}:\n Seq(\"abc\", \"TAGAAAGA\")\n Seq(\"header\", \"AAAGGGCG\")","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"If we give our function a bad input - for example, if we forget the trailing newline, it throws an error:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> parse_fasta(\">abc\\nTAGA\\nAAGA\\n>header\\nAAAG\\nGGCG\")\nERROR: Error during FSM execution at buffer position 33.\nLast 32 byte(s) were:\n\n\">abc\\nTAGA\\nAAGA\\n>header\\nAAAG\\nGGCG\"\n\nObserved input: EOF at state 5. Outgoing edges:\n * '\\n'/seqline\n * [ACGT]\n\nInput is not in any outgoing edge, and machine therefore errored.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"The code above parses with about 300 MB/s on my laptop. 
Not bad, but Automa can do better - read on to learn how to customize codegen.","category":"page"},{"location":"parser/#Preconditions","page":"Parsing buffers","title":"Preconditions","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"You might have noticed a peculiar detail about our FASTA format: It demands a trailing newline after each record. In other words, >a\\nA is not a valid FASTA record.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can easily rewrite the regex such that the last record does not need a trailing \\n. But look what happens when we try that:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n header = onexit!(onenter!(re\"[a-z]+\", :mark_pos), :header)\n seqline = onexit!(onenter!(re\"[ACGT]+\", :mark_pos), :seqline)\n record = onexit!(re\">\" * header * '\\n' * seqline * rep('\\n' * seqline), :record)\n compile(opt(record) * rep('\\n' * record) * rep(re\"\\n\"))\n end;\nERROR: Ambiguous NFA.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Why does this error? Well, remember that Automa processes one byte at a time, and at each byte, makes a decision on what actions to execute. Hence, if it sees the input >a\\nA\\n, it does not know what to do when encountering the second \\n. If the next byte is, e.g., A, it would need to execute the :seqline action. If the byte is >, it would need to execute first :seqline, then :record. Automa can't read ahead, so the regex is ambiguous and the true behaviour when reading the input >a\\nA\\n is undefined. 
Therefore, Automa refuses to compile it.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"There are several ways to solve this:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"First, you can rewrite the regex to not be ambiguous. This is usually the preferred option: After all, if the regex is ambiguous, you probably made a mistake with the regex.\nYou can manually disable the ambiguity check by passing the keyword unambiguous=false to compile. This will cause the machine to exhibit undefined behaviour if an input like >a\\nA\\n is seen, so this is usually a poor idea.\nYou can rewrite the actions, such that the action itself uses an if-statement to check what to do. In the example above, you could remove the :record action and have the :seqline action conditionally emit a record if the next byte was >.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Finally, you can use preconditions. A precondition is a symbol, attached to a regex, just like an action. Just like an action, the symbol is attached to an Expr object, but for preconditions this must evaluate to a Bool. If false, the regex is not entered.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Let's have an example. The following machine is obviously ambiguous:","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n a = onenter!(re\"XY\", :a)\n b = onenter!(re\"XZ\", :b)\n compile('A' * (a | b))\n end;\nERROR: Ambiguous NFA.","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"We can add a precondition with precond!. Below, precond!(regex, label) is equivalent to precond!(regex, label; when=:enter, bool=true). 
This means \"only enter regex when the boolean expression label evaluates to bool (true)\":","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"julia> machine = let\n a = precond!(onenter!(re\"XY\", :a), :test)\n b = precond!(onenter!(re\"XZ\", :b), :test; bool=false)\n compile('A' * (a | b))\n end;\n\njulia> machine isa Automa.Machine\ntrue","category":"page"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Here, re\"XY\" can only be entered when :test is true, and re\"XZ\" only when :test is false. So, there can be no ambiguous behaviour and the regex compiles fine.","category":"page"},{"location":"parser/#Reference","page":"Parsing buffers","title":"Reference","text":"","category":"section"},{"location":"parser/","page":"Parsing buffers","title":"Parsing buffers","text":"Automa.onenter!\nAutoma.onexit!\nAutoma.onall!\nAutoma.onfinal!\nAutoma.precond!\nAutoma.generate_code\nAutoma.generate_init_code\nAutoma.generate_exec_code","category":"page"},{"location":"parser/#Automa.RegExp.onenter!","page":"Parsing buffers","title":"Automa.RegExp.onenter!","text":"onenter!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading the first byte of regex re. If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onexit!, onall!, onfinal!\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = onenter!(regex, :entering_regex);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.onexit!","page":"Parsing buffers","title":"Automa.RegExp.onexit!","text":"onexit!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading the first byte no longer part of regex re, or if experiencing an expected end-of-file. 
If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onenter!, onall!, onfinal!\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = onexit!(regex, :exiting_regex);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.onall!","page":"Parsing buffers","title":"Automa.RegExp.onall!","text":"onall!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading any byte part of the regex re. If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onenter!, onexit!, onfinal!\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = onall!(regex, :reading_re_byte);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.onfinal!","page":"Parsing buffers","title":"Automa.RegExp.onfinal!","text":"onfinal!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re\n\nSet action(s) a to occur when reading the last byte of regex re. If re does not have a definite final byte, e.g. re\"a(bc)*\", where more \"bc\" can always be added, compiling the regex will error after setting a final action. If multiple actions are set by passing a vector, execute the actions in order.\n\nSee also: onenter!, onall!, onexit!\n\nExample\n\njulia> regex = re\"ab?c\";\n\njulia> regex2 = onfinal!(regex, :entering_last_byte);\n\njulia> regex === regex2\ntrue\n\njulia> compile(onfinal!(re\"ab?c*\", :does_not_work))\nERROR: [...]\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.RegExp.precond!","page":"Parsing buffers","title":"Automa.RegExp.precond!","text":"precond!(re::RE, s::Symbol; [when=:enter], [bool=true]) -> re\n\nSet re's precondition to s. 
Before any state transitions to re, or inside re, the precondition code s is checked to be bool before the transition is taken.\n\nwhen controls if the condition is checked when the regex is entered (if :enter), or at every state transition inside the regex (if :all)\n\nExample\n\njulia> regex = re\"ab?c*\";\n\njulia> regex2 = precond!(regex, :some_condition);\n\njulia> regex === regex2\ntrue\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.generate_code","page":"Parsing buffers","title":"Automa.generate_code","text":"generate_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr\n\nGenerate init and exec code for machine. The default code generator function for creating functions, preferentially use this over generating init and exec code directly, due to its convenience. Shorthand for producing the concatenated code of:\n\ngenerate_init_code(ctx, machine)\ngenerate_action_code(ctx, machine, actions)\ngenerate_input_error_code(ctx, machine) [elided if actions == :debug]\n\nExamples\n\n@eval function foo(data)\n # Initialize variables used in actions\n data_buffer = UInt8[]\n $(generate_code(machine, actions))\n return data_buffer\nend\n\nSee also: generate_init_code, generate_exec_code\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.generate_init_code","page":"Parsing buffers","title":"Automa.generate_init_code","text":"generate_init_code([::CodeGenContext], machine::Machine)::Expr\n\nGenerate variable initialization code, initializing variables such as p, and p_end. The names of these variables are set by the CodeGenContext. If not passed, the context defaults to DefaultCodeGenContext\n\nPrefer using the more generic generate_code over this function where possible. 
This function should be used if the initialized data should be modified before the execution code.\n\nExample\n\n@eval function foo(data)\n $(generate_init_code(machine))\n p = 2 # maybe I want to start from position 2, not 1\n $(generate_exec_code(machine, actions))\n return cs\nend\n\nSee also: generate_code, generate_exec_code\n\n\n\n\n\n","category":"function"},{"location":"parser/#Automa.generate_exec_code","page":"Parsing buffers","title":"Automa.generate_exec_code","text":"generate_exec_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr\n\nGenerate machine execution code with actions. This code should be run after the machine has been initialized with generate_init_code. If not passed, the context defaults to DefaultCodeGenContext\n\nPrefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.\n\nExamples\n\n@eval function foo(data)\n $(generate_init_code(machine))\n p = 2 # maybe I want to start from position 2, not 1\n $(generate_exec_code(machine, actions))\n return cs\nend\n\nSee also: generate_init_code, generate_exec_code\n\n\n\n\n\n","category":"function"}] } diff --git a/previews/PR119/theory/index.html b/previews/PR119/theory/index.html index 75904526..863af533 100644 --- a/previews/PR119/theory/index.html +++ b/previews/PR119/theory/index.html @@ -1,2 +1,2 @@ -Theory · Automa.jl

      Theory of regular expressions

      Most programmers are familiar with regular expressions, or regex, for short. What many programmers don't know is that regex have a deep theoretical underpinning, which is leaned on by regex engines to produce highly efficient code.

      Informally, a regular expression can be thought of as any pattern that can be constructed from the following atoms:

      • The empty string is a valid regular expression, i.e. re""
      • Literal matching of a single symbol from a finite alphabet, such as a character, i.e. re"p"

      Atoms can be combined with the following operations, if R and P are two regular expressions:

      • Alternation, i.e. R | P, meaning either match R or P.
      • Concatenation, i.e. R * P, meaning match first R, then P.
      • Repetition, i.e. R*, meaning match R zero or more times consecutively.
      Note

      In Automa, the alphabet is bytes, i.e. 0x00:0xff, and so each symbol is a single byte. Multi-byte characters such as Æ are interpreted as the concatenation of two symbols, re"\xc3" * re"\x86". The fact that Automa considers one input to be one byte, not one character, can become relevant if you instruct Automa to complete an action "on every input".

      Popular regex libraries include more operations like ? and +. These can trivially be constructed from the above mentioned primitives, i.e. R? is "" | R, and R+ is RR*.

      Some implementations of regular expression engines, such as PCRE which is the default in Julia as of Julia 1.8, also support operations like backreferences and lookbehind. These operations can NOT be constructed from the above atoms and axioms, meaning that PCRE expressions are not regular expressions in the theoretical sense.

      The practical importance of theoretically sound regular expressions is that there exist algorithms that can match regular expressions in O(N) time and O(1) space, whereas this is not true for PCRE expressions, which are therefore significantly slower.

      Note

      Automa.jl only supports real regex, and as such does not support e.g. backreferences, in order to guarantee fast runtime performance.

      To match regex to strings, the regex are transformed to finite automata, which are then implemented in code.

      Nondeterministic finite automata

      The programmer Ken Thompson, of Unix fame, devised Thompson's construction, an algorithm to construct a nondeterministic finite automaton (NFA) from a regex. An NFA can be thought of as a flowchart (or a directed graph), where one can move from node to node on directed edges. Edges are either labeled ϵ, in which case the machine can freely move through the edge to its destination node, or labeled with one or more input symbols, in which case the machine may traverse the edge upon consuming said input.

      To illustrate, let's look at one of the simplest regex: re"a", matching the letter a:

      You begin at the small dot on the right, then immediately go to state 1, the circle marked by a 1. By moving to the next state, state 2, you consume the next symbol from the input string, which must be the symbol marked on the edge from state 1 to state 2 (in this case, an a). Some states are "accept states", illustrated by a double circle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.

      Each of the operations that combine regex can also combine NFAs. For example, given the two regex a and b, which correspond to the NFAs A and B, the regex a * b can be expressed with the following NFA:

      Note the ϵ symbol on the edge - this signifies an "epsilon transition", meaning you move directly from A to B without consuming any symbols.

      Similarly, a | b corresponds to this NFA structure...

      ...and a* to this:

      For a larger example, re"(\+|-)?(0|1)*" combines alternation, concatenation and repetition and so looks like this:

      ϵ-transitions mean that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 12. That's what makes NFAs nondeterministic.

      In order to match a regex to a string then, the movement through the NFA must be emulated. You begin at state 1. When a non-ϵ edge is encountered, you consume a byte of the input data if it matches. If there are no edges that match your input, the string does not match. If an ϵ-edge is encountered from state A that leads to states B and C, the machine goes from state A to state {B, C}, i.e. in both states at once.

      For example, if the regex re"(\+|-)?(0|1)*" visualized above is matched to the string -11, this is what happens:

      • NFA starts in state 1
      • NFA immediately moves to all states reachable via ϵ transition. It is now in state {3, 5, 7, 9, 10}.
      • NFA sees input -. States {3, 5, 7, 10} do not have an edge with - leading out, so these states die. Therefore, the machine is in state 9, consumes the input, and moves to state 2.
      • NFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 5, 7}
      • NFA sees input 1, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 5, 7}
      • The above point repeats, NFA is still in state {3, 5, 7}
      • Input ends. Since state 3 is an accept state, the string matches.
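The walkthrough above can be sketched as a small ϵ-NFA simulator in plain Julia. This is an illustration only: the state numbering and data layout here are invented and do not match the figure or Automa's internals.

```julia
# Illustrative ϵ-NFA simulator (state numbers are arbitrary and do not
# match the figure). The machine tracks the *set* of possible states.
struct NFA
    eps::Dict{Int, Vector{Int}}                 # ϵ-edges: state => states
    delta::Dict{Tuple{Int, Char}, Vector{Int}}  # labeled edges
    start::Int
    accept::Set{Int}
end

# All states reachable from `states` via ϵ-edges alone
function eps_closure(nfa::NFA, states::Set{Int})
    closure, stack = copy(states), collect(states)
    while !isempty(stack)
        s = pop!(stack)
        for t in get(nfa.eps, s, Int[])
            t in closure && continue
            push!(closure, t)
            push!(stack, t)
        end
    end
    closure
end

function accepts(nfa::NFA, input::AbstractString)
    current = eps_closure(nfa, Set([nfa.start]))
    for c in input
        nxt = Set{Int}()
        for s in current
            union!(nxt, get(nfa.delta, (s, c), Int[]))
        end
        isempty(nxt) && return false  # all states died: no match
        current = eps_closure(nfa, nxt)
    end
    !isempty(intersect(current, nfa.accept))
end

# A two-state NFA for (\+|-)?(0|1)*: optional sign, then any number of digits
nfa = NFA(
    Dict(1 => [2]),  # ϵ-edge skipping the sign
    Dict((1, '+') => [2], (1, '-') => [2], (2, '0') => [2], (2, '1') => [2]),
    1,
    Set([2]),
)

@assert accepts(nfa, "-11") && accepts(nfa, "") && !accepts(nfa, "1+1")
```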

      Using only a regex-to-NFA converter, you could create a simple regex engine by emulating the NFA as above. The existence of ϵ transitions means the NFA can be in multiple states at once, which adds unwelcome complexity to the emulation and makes it slower. Luckily, every NFA has an equivalent deterministic finite automaton, which can be constructed from the NFA using the so-called powerset construction.
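A minimal sketch of the powerset construction, assuming a tiny hand-written NFA for re"(\+|-)?(0|1)*" with ϵ-edges already eliminated (all names and the state layout are invented for illustration):

```julia
# Illustrative subset construction: DFA states are *sets* of NFA states.
# The NFA below (ϵ-free) matches (\+|-)?(0|1)*; states 1 and 2 both accept.
nfa_delta = Dict(
    (1, '+') => [2], (1, '-') => [2], (1, '0') => [2], (1, '1') => [2],
    (2, '0') => [2], (2, '1') => [2],
)
nfa_accept = Set([1, 2])
alphabet = ['+', '-', '0', '1']

function powerset(delta, start::Set{Int}, accept, alphabet)
    dfa = Dict{Tuple{Set{Int}, Char}, Set{Int}}()
    seen = Set([start])
    queue = [start]
    while !isempty(queue)
        S = pop!(queue)
        for c in alphabet
            T = Set{Int}()  # all NFA states reachable from S on input c
            for s in S
                union!(T, get(delta, (s, c), Int[]))
            end
            isempty(T) && continue  # dead transition: omit from the DFA
            dfa[(S, c)] = T
            T in seen || (push!(seen, T); push!(queue, T))
        end
    end
    dfa, Set(S for S in seen if !isempty(intersect(S, accept)))
end

dfa, dfa_accept = powerset(nfa_delta, Set([1]), nfa_accept, alphabet)
# The worst case is 2^2 = 4 subset-states, but only {1} and {2} are reachable:
@assert dfa_accept == Set([Set([1]), Set([2])])
@assert dfa[(Set([1]), '-')] == Set([2])
```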

      Deterministic finite automata

      Or DFAs, as they are called, are similar to NFAs, but do not contain ϵ-edges. This means that a given input string has either zero paths through the DFA (if it does not match the regex), or exactly one unambiguous path. In other words, every input symbol must trigger one unambiguous state transition from one state to one other state.

      Let's visualize the DFA equivalent to the larger NFA above:

      It might not be obvious, but the DFA above accepts exactly the same inputs as the previous NFA. DFAs are way simpler to simulate in code than NFAs, precisely because at every state, for every input, there is exactly one action. DFAs can be simulated either using a lookup table of possible state transitions, or by hardcoding GOTO-statements from node to node when the correct input is matched. Code simulating DFAs can be ridiculously fast, with each state transition taking less than 1 nanosecond, if implemented well.
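As a hedged illustration of the lookup-table approach, here is a hand-built transition table for re"(\+|-)?(0|1)*" (two states, both accepting; the state numbering is invented, not taken from the figures):

```julia
# Illustrative table-driven DFA for (\+|-)?(0|1)*. Row = state,
# column = byte value + 1; entry 0 marks "no outgoing edge" (reject).
const table = zeros(Int, 2, 256)
const accept = [true, true]  # "" and "+" both match, so both states accept
table[1, UInt8('+') + 1] = 2
table[1, UInt8('-') + 1] = 2
for state in 1:2, digit in ('0', '1')
    table[state, UInt8(digit) + 1] = 2
end

function dfa_match(data::AbstractString)
    state = 1
    for byte in codeunits(data)
        state = table[state, byte + 1]
        state == 0 && return false  # no edge for this input: reject
    end
    accept[state]  # matched iff we ended in an accept state
end

@assert dfa_match("-11") && dfa_match("") && !dfa_match("1+1")
```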

      Furthermore, DFAs can be optimised. Two edges between the same nodes with labels A and B can be collapsed to a single edge with labels [AB], and redundant nodes can be collapsed. The optimised DFA equivalent to the one above is simply:

      Unfortunately, as the name "powerset construction" hints, converting an NFA with N nodes may result in a DFA with up to 2^N nodes. This inconvenient fact drives important design decisions in regex implementations. There are basically two approaches:

      Automa.jl will just construct the DFA directly, and accept a worst-case complexity of O(2^N). This is acceptable (I think) for Automa, because this construction happens in Julia's package precompilation stage (not on package loading or usage), and because the DFAs are assumed to be constants within a package. So, if a developer accidentally writes an NFA which is unacceptably slow to convert to a DFA, it will be caught in development. Luckily, it's pretty rare to have NFAs that result in truly abysmally slow conversions to DFAs: While bad corner cases exist, they are rarely as catastrophic as the O(2^N) would suggest. Currently, Automa's regex/NFA/DFA compilation pipeline is very slow and unoptimized, but, since it happens during precompile time, it is insignificant compared to LLVM compile times.

      Other implementations, like the popular ripgrep command line tool, use an adaptive approach. They construct the DFA on the fly, as each symbol is being matched, and then cache the DFA. If the DFA size grows too large, the cache is flushed. If the cache is flushed too often, they fall back to simulating the NFA directly. Such an approach is necessary for ripgrep, because the regex -> NFA -> DFA compilation happens at runtime and must be near-instantaneous, unlike Automa, where it happens during package precompilation and can afford to be slow.

      Automa in a nutshell

      Automa simulates the DFA by having the DFA create a Julia Expr, which is then used to generate a Julia function using metaprogramming. Like all other Julia code, this function is then optimized by Julia and then LLVM, making the DFA simulations very fast.
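A toy illustration of the metaprogramming idea (not Automa's actual codegen, which is far more elaborate): build an Expr holding hardcoded DFA transitions, then splice it into a function definition with @eval.

```julia
# Toy version of the idea: the "DFA" for (\+|-)?(0|1)* is hardcoded as an
# Expr, then spliced into a function. Names here are invented for the demo.
transitions = quote
    for byte in codeunits(data)
        if state == 1 && byte in UInt8.(('+', '-', '0', '1'))
            state = 2
        elseif state == 2 && (byte == UInt8('0') || byte == UInt8('1'))
            state = 2
        else
            return false  # no valid transition: reject
        end
    end
end

@eval function toy_match(data)
    state = 1
    $transitions  # the "generated" transition code is spliced in here
    return true   # both states accept, so surviving the loop is a match
end

@assert toy_match("-101") && toy_match("") && !toy_match("1-1")
```

Like Automa's real output, the resulting function contains only branches and integer comparisons, so Julia and LLVM can optimize it as ordinary code.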

      Because Automa just constructs Julia functions, we can do extra tricks that ordinary regex engines cannot: We can splice arbitrary Julia code into the DFA simulation. Currently, Automa supports two such kinds of code: actions, and preconditions.

      Actions are Julia code that is executed during certain state transitions. Preconditions are Julia code that evaluates to a Bool value and is checked before a state transition. If it evaluates to false, the transition is not taken.

      +Theory · Automa.jl

      Theory of regular expressions

      Most programmers are familiar with regular expressions, or regex, for short. What many programmers don't know is that regex have a deep theoretical underpinning, which is leaned on by regex engines to produce highly efficient code.

      Informally, a regular expression can be thought of as any pattern that can be constructed from the following atoms:

      • The empty string is a valid regular expression, i.e. re""
      • Literal matching of a single symbol from a finite alphabet, such as a character, i.e. re"p"

      Atoms can be combined with the following operations, if R and P are two regular expressions:

      • Alternation, i.e R | P, meaning either match R or P.
      • Concatenation, i.e. R * P, meaning match first R, then P
      • Repetition, i.e. R*, meaning match R zero or more times consecutively.
      Note

      In Automa, the alphabet is bytes, i.e. 0x00:0xff, and so each symbol is a single byte. Multi-byte characters such as Æ is interpreted as the two concatenated of two symbols, re"\xc3" * re"\x86". The fact that Automa considers one input to be one byte, not one character, can become relevant if you instruct Automa to complete an action "on every input".

      Popular regex libraries include more operations like ? and +. These can trivially be constructed from the above mentioned primitives, i.e. R? is "" | R, and R+ is RR*.

      Some implementations of regular expression engines, such as PCRE which is the default in Julia as of Julia 1.8, also support operations like backreferences and lookbehind. These operations can NOT be constructed from the above atoms and axioms, meaning that PCRE expressions are not regular expressions in the theoretical sense.

      The practical importance of theoretically sound regular expressions is that there exists algorithms that can match regular expressions on O(N) time and O(1) space, whereas this is not true for PCRE expressions, which are therefore significantly slower.

      Note

      Automa.jl only supports real regex, and as such does not support e.g. backreferences, in order to gurantee fast runtime performance.

      To match regex to strings, the regex are transformed to finite automata, which are then implemented in code.

      Nondeterministic finite automata

      The programmer Ken Thompson, of Unix fame, deviced Thompson's construction, an algorithm to constuct a nondeterministic finite automaton (NFA) from a regex. An NFA can be thought of as a flowchart (or a directed graph), where one can move from node to node on directed edges. Edges are either labeled ϵ, in which the machine can freely move through the edge to its destination node, or labeled with one or more input symbols, in which the machine may traverse the edge upon consuming said input.

      To illustrate, let's look at one of the simplest regex: re"a", matching the letter a:

      You begin at the small dot on the right, then immediately go to state 1, the cirle marked by a 1. By moving to the next state, state 2, you consume the next symbol from the input string, which must be the symbol marked on the edge from state 1 to state 2 (in this case, an a). Some states are "accept states", illustrated by a double cirle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.

      Each of the operaitons that combine regex can also combine NFAs. For example, given the two regex a and b, which correspond to the NFAs A and B, the regex a * b can be expressed with the following NFA:

      Note the ϵ symbol on the edge - this signifies an "epsilon transition", meaning you move directly from A to B without consuming any symbols.

      Similarly, a | b correspond to this NFA structure...

      ...and a* to this:

      For a larger example, re"(\+|-)?(0|1)*" combines alternation, concatenation and repetition and so looks like this:

      ϵ-transitions means that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 12. That's what makes NFAs nondeterministic.

      In order to match a regex to a string then, the movement through the NFA must be emulated. You begin at state 1. When a non-ϵ edge is encountered, you consume a byte of the input data if it matches. If there are no edges that match your input, the string does not match. If an ϵ-edge is encountered from state A that leads to states B and C, the machine goes from state A to state {B, C}, i.e. in both states at once.

      For example, if the regex re"(\+|-)?(0|1)* visualized above is matched to the string -11, this is what happens:

      • NFA starts in state 1
      • NFA immediately moves to all states reachable via ϵ transition. It is now in state {3, 5, 7, 9, 10}.
      • NFA sees input -. States {5, 7, 9, 10} do not have an edge with - leading out, so these states die. Therefore, the machine is in state 9, consumes the input, and moves to state 2.
      • NFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 5, 7}
      • NFA sees input 1, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 5, 7}
      • The above point repeats, NFA is still in state {3, 5, 7}
      • Input ends. Since state 3 is an accept state, the string matches.

      Using only a regex-to-NFA converter, you could create a simple regex engine simply by emulating the NFA as above. The existence of ϵ transitions means the NFA can be in multiple states at once which adds unwelcome complexity to the emulation and makes it slower. Luckily, every NFA has an equivalent determinisitic finite automaton, which can be constructed from the NFA using the so-called powerset construction.

      Deterministic finite automata

      Or DFAs, as they are called, are similar to NFAs, but do not contain ϵ-edges. This means that a given input string has either zero paths (if it does not match the regex), one, unambiguous path, through the DFA. In other words, every input symbol must trigger one unambiguous state transition from one state to one other state.

      Let's visualize the DFA equivalent to the larger NFA above:

      It might not be obvious, but the DFA above accepts exactly the same inputs as the previous NFA. DFAs are way simpler to simulate in code than NFAs, precisely because at every state, for every input, there is exactly one action. DFAs can be simulated either using a lookup table, of possible state transitions, or by hardcoding GOTO-statements from node to node when the correct input is matched. Code simulating DFAs can be ridicuously fast, with each state transition taking less than 1 nanosecond, if implemented well.

      Furthermore, DFAs can be optimised. Two edges between the same nodes with labels A and B can be collapsed to a single edge with labels [AB], and redundant nodes can be collapsed. The optimised DFA equivalent to the one above is simply:

      Unfortunately, as the name "powerset construction" hints, convering an NFA with N nodes may result in a DFA with up to 2^N nodes. This inconvenient fact drives important design decisions in regex implementations. There are basically two approaches:

      Automa.jl will just construct the DFA directly, and accept a worst-case complexity of O(2^N). This is acceptable (I think) for Automa, because this construction happens in Julia's package precompilation stage (not on package loading or usage), and because the DFAs are assumed to be constants within a package. So, if a developer accidentally writes an NFA which is unacceptably slow to convert to a DFA, it will be caught in development. Luckily, it's pretty rare to have NFAs that result in truly abysmally slow conversions to DFA's: While bad corner cases exist, they are rarely as catastrophic as the O(2^N) would suggest. Currently, Automa's regex/NFA/DFA compilation pipeline is very slow and unoptimized, but, since it happens during precompile time, it is insignificant compared to LLVM compile times.

      Other implementations, like the popular ripgrep command line tool, uses an adaptive approach. It constructs the DFA on the fly, as each symbol is being matched, and then caches the DFA. If the DFA size grows too large, the cache is flushed. If the cache is flushed too often, it falls back to simulating the NFA directly. Such an approach is necessary for ripgrep, because the regex -> NFA -> DFA compilation happens at runtime and must be near-instantaneous, unlike Automa, where it happens during package precompilation and can afford to be slow.

      Automa in a nutshell

      Automa simulates the DFA by generating a Julia Expr from it, which is then used to build a Julia function using metaprogramming. Like all other Julia code, this function is optimized by Julia and then LLVM, making the DFA simulations very fast.
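      A minimal sketch of this: generate_code turns a compiled machine into an ordinary Expr, ready to be spliced into a function with @eval.

```julia
using Automa

machine = compile(re"ab*")
code = generate_code(CodeGenContext(), machine)

code isa Expr  # true: plain Julia syntax, like any other metaprogramming
```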

      Because Automa just constructs Julia functions, we can do extra tricks that ordinary regex engines cannot: We can splice arbitrary Julia code into the DFA simulation. Currently, Automa supports two such kinds of code: actions, and preconditions.

      Actions are Julia code that is executed during certain state transitions. Preconditions are Julia code that evaluates to a Bool and is checked before a state transition; if it evaluates to false, the transition is not taken.
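      As a sketch of how actions look in practice: onenter! and onexit! attach action names to a regex, and the Julia code for each name is supplied to generate_code as a Dict. The function name digit_range and the variables start and found below are our own, not part of Automa; p is Automa's current-position variable.

```julia
using Automa

# Record the byte range of a run of digits.
digits = re"[0-9]+"
onenter!(digits, :mark)    # fires on the transition into the digit run
onexit!(digits, :record)   # fires on the transition out of it

actions = Dict(
    :mark   => :(start = p),
    :record => :(push!(found, start:p-1)),
)

@eval function digit_range(data)
    start = 0
    found = UnitRange{Int}[]
    $(generate_code(CodeGenContext(), compile(digits), actions))
    return found
end
```

A precondition is attached analogously with precond!(regex, :name), where :name must map to an expression evaluating to a Bool.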

      @eval @enum Token error $(first.(tokens)...)
      make_tokenizer((error,
          [Token(i) => j for (i,j) in enumerate(last.(tokens))]
      )) |> eval

      Token disambiguation

      It's possible to create a tokenizer where the different token regexes overlap:

      julia> make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval

      In this case, an input like ab will match all three regexes. Which tokens are emitted is determined by two rules:

      First, the emitted tokens will be as long as possible. So, the input aa could be emitted as one token of the regex re"[ab]+", as two tokens of the same regex, or as two tokens of the regex re"ab*". In this case, it will be emitted as a single token of re"[ab]+", since that makes the first token as long as possible (2 bytes), whereas the other options would only make it 1 byte long.

      Second, tokens with a higher index in the input array beat tokens with a lower index. So, a will be emitted as re"ab*", since its index of 2 beats the index 1 of re"[ab]+", and ab will match the third regex.

      If you don't want emitted tokens to depend on these priority rules, you can set the optional keyword unambiguous=true in the make_tokenizer function, in which case make_tokenizer will error if any input text could be broken down into different tokens. However, note that this may cause most tokenizers to error when being built, as most tokenization processes are ambiguous.
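      These two rules can be checked directly. The example below assumes the default UInt32 token type used when make_tokenizer is given a plain vector of regexes:

```julia
using Automa

make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval

# "ab" matches all three regexes at length 2; the highest index (3) wins.
collect(tokenize(UInt32, "ab"))   # [(1, 2, 0x00000003)]

# "a" matches regexes 1 and 2 at length 1; index 2 wins.
collect(tokenize(UInt32, "a"))    # [(1, 1, 0x00000002)]

# "aa" is longest as a single 2-byte token of regex 1.
collect(tokenize(UInt32, "aa"))   # [(1, 2, 0x00000001)]
```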

      Reference

      Automa.TokenizerType
      Tokenizer{E, D, C}

      Lazy iterator of tokens of type E over data of type D.

      Tokenizer works on any buffer-like object that defines pointer and sizeof. When iterated, it will return a 3-tuple of integers:
      * The first is the 1-based starting index of the token in the buffer
      * The second is the length of the token in bytes
      * The third is the token kind: the index in the input list tokens

      Un-tokenizable data will be emitted as the "error token" with index zero.

      The Int C parameter allows multiple tokenizers to be created with the otherwise same type parameters.

      See also: make_tokenizer

      source
      Automa.tokenizeFunction
      tokenize(::Type{E}, data, version=1)

      Create a Tokenizer{E, typeof(data), version}, iterating tokens of type E over data.

      See also: Tokenizer, make_tokenizer, compile

      source
      Automa.make_tokenizerFunction
      make_tokenizer(
           machine::TokenizerMachine;
           tokens::Tuple{E, AbstractVector{E}}= [ integers ],
           goto=true, version=1
        (2, 1, 0x02)
        (3, 3, 0x00)
        (6, 1, 0x02)
        (7, 1, 0x01)

      Any actions inside the input regexes will be ignored. If goto (default true), the faster but more complex goto code generator is used. The version number will set the last parameter of the Tokenizer, which allows you to create different tokenizers for the same element type.

      See also: Tokenizer, tokenize, compile

      source
      make_tokenizer(
           tokens::Union{
               AbstractVector{RE},
               Tuple{E, AbstractVector{Pair{E, RE}}}
        (1, 3, 0x00000001)
        (4, 3, 0x00000003)
        (7, 3, 0x00000002)
        (10, 3, 0x00000003)
      source

      Reference

      Automa.generate_buffer_validatorFunction
      generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)

      Generate code that, when evaluated, defines a function named name, which takes a single argument data, interpreted as a sequence of bytes. The function returns nothing if data matches regexp, else the index of the first invalid byte. If the machine reached unexpected EOF, it returns 0. If goto, the function uses the faster but more complicated :goto code. If docstring, automatically create a docstring for the generated function.
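      A hypothetical use (the validator name validate_ab and the regex are our own, for illustration):

```julia
using Automa

# Define a validator for one or more "ab" repeats.
eval(generate_buffer_validator(:validate_ab, re"(ab)+"))

validate_ab("abab")  # nothing: the data matches
validate_ab("abx")   # 3: index of the first invalid byte ('x')
validate_ab("aba")   # 0: unexpected EOF in the middle of a match
```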

      source
      Automa.generate_io_validatorFunction
      generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)

      NOTE: This method requires TranscodingStreams to be loaded

      Create code that, when evaluated, defines a function named funcname. This function takes an IO, and checks if the data in the input conforms to the regex, without executing any actions. If the input conforms, return nothing. Else, return (byte, (line, col)), where byte is the first invalid byte, and (line, col) the 1-indexed position of that byte. If the invalid byte is a \n byte, col is 0 and the line number is incremented. If the input errors due to unexpected EOF, byte is nothing, and the line and column given is the last byte in the file. If goto, the function uses the faster but more complicated :goto code.
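      A hypothetical use, assuming TranscodingStreams is loaded (the name validate_ab_io and the regex are our own, for illustration):

```julia
using Automa, TranscodingStreams

eval(generate_io_validator(:validate_ab_io, re"(ab)+"))

validate_ab_io(IOBuffer("abab"))  # nothing: the stream matches
validate_ab_io(IOBuffer("abx"))   # (0x78, (1, 3)): byte 'x' invalid at line 1, column 3
```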

      source
      Automa.compileFunction
      compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine

      Compile a finite state machine (FSM) from re. If optimize, attempt to minimize the number of states in the FSM. If unambiguous, disallow creation of FSM where the actions are not deterministic.

      Examples

      machine = let
           name = re"[A-Z][a-z]+"
           first_last = name * re" " * name
           last_first = name * re", " * name
           compile(first_last | last_first)
      end
      source
      compile(tokens::Vector{RE}; unambiguous=false)::TokenizerMachine

      Compile the regex tokens to a tokenizer machine. The machine can be passed to make_tokenizer.

      The keyword unambiguous decides which of multiple matching tokens is emitted: If false (default), the longest token is emitted. If multiple tokens have the same length, the one with the highest index is returned. If true, make_tokenizer will error if any possible input text can be broken down into tokens ambiguously.

      See also: Tokenizer, make_tokenizer, tokenize

      source