From ee9d2834ac6d1e31e5952e327a8a76769debdd54 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Sun, 2 Jun 2024 05:48:31 +0000 Subject: [PATCH] build based on 587913e --- dev/.documenter-siteinfo.json | 2 +- dev/api/index.html | 64 +++++++++++++++++------------------ dev/example/index.html | 2 +- dev/index.html | 2 +- dev/term/index.html | 2 +- 5 files changed, 36 insertions(+), 36 deletions(-) diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index b166f2c..60a1c4b 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.3","generation_timestamp":"2024-06-01T15:55:47","documenter_version":"1.4.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.3","generation_timestamp":"2024-06-02T05:48:26","documenter_version":"1.4.1"}} \ No newline at end of file diff --git a/dev/api/index.html b/dev/api/index.html index 3f2edca..452dba2 100644 --- a/dev/api/index.html +++ b/dev/api/index.html @@ -1,19 +1,19 @@ -API Reference · NeuralAttentionlib.jl

API Reference

Functional

NeuralAttentionlib.alibi_position_embeddingFunction
alibi_position_embedding(mask::Union{AbstractMask, Nothing}, score, args...)

Add the non-trainable ALiBi position embedding to the attention score. The ALiBi bias differs for each head, which assumes the attention is a multi-head variant: the first dimension of the batch dimensions of the attention score is treated as the head dimension. mask can either be an attention mask or nothing. It is usually needed when there are gaps or prefix paddings in the samples.
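
For intuition, a rough plain-Julia sketch of the usual ALiBi bias (head slopes 2^(-8h/H), scaled by the key/query distance). This is an illustration only; the library's exact slope ordering and sign conventions may differ:

# illustration only: an ALiBi-style bias for H heads over a (key length, query length) score
H, klen, qlen = 4, 6, 6
slopes = [2.0f0^(-8h / H) for h in 1:H]
bias = [-slopes[h] * abs(i - j) for i in 1:klen, j in 1:qlen, h in 1:H]
# conceptually, this bias is added to the (key length, query length, head, batch...) attention score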

source
NeuralAttentionlib.biased_scoreFunction
biased_score(bias, score, args...)

Add a precomputed bias to the attention score. bias should have shape (key length, query length, ...), and size(bias, 1) == size(s, 1) && size(bias, 2) == size(s, 2) && ndims(bias) <= ndims(s) must hold, where s = score(args...).
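
A minimal conceptual sketch of the shape contract above (the score function here is a hypothetical stand-in, not the library implementation):

score_fn(q, k) = randn(5, 7)           # stand-in score returning (key length, query length)
bias = randn(5, 7)                     # same leading shape; broadcasts over trailing batch dimensions
biased(q, k) = score_fn(q, k) .+ bias  # conceptually what biased_score(bias, score_fn, q, k) computes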

source
NeuralAttentionlib.layer_normFunction
layer_norm([epsilon = 1e-5,] alpha, beta, x)

Function that performs layer normalization on x. alpha and beta can be a Vector, a Number, or Nothing.

$layer_norm(α, β, x) = α\frac{(x - μ)}{σ} + β$

If both alpha and beta are Nothing, this is just standardization applied along the first dimension.
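
For intuition, a plain-Julia sketch of the formula above (a reference implementation only, assuming the first dimension is the feature dimension; the library's epsilon handling may differ):

using Statistics

function layer_norm_ref(alpha, beta, x; epsilon = 1e-5)
    mu = mean(x; dims = 1)
    sigma = std(x; dims = 1, mean = mu, corrected = false)
    return alpha .* (x .- mu) ./ (sigma .+ epsilon) .+ beta
end

x = randn(Float32, 4, 3)     # (feature, length)
layer_norm_ref(1f0, 0f0, x)  # with α = 1 and β = 0 this is plain standardization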

source
NeuralAttentionlib.masked_scoreFunction
masked_score(mask) = masked_score $ mask
NeuralAttentionlib.move_head_dim_in_permFunction
move_head_dim_in_perm(x::AbstractArray{T, N}, nobatch=false)
 move_head_dim_in_perm(N::Int, nobatch=false)

Dimension order for permutedims to move the head dimension (created by split_head) from the batch dimension to the feature dimension (for merge_head). Returns a tuple of integers of length N. nobatch specifies whether x is a batch of data.

Example

julia> Functional.move_head_dim_in_perm(5, false)
 (1, 4, 2, 3, 5)
 
 julia> Functional.move_head_dim_in_perm(5, true)
 (1, 5, 2, 3, 4)

See also: merge_head, move_head_dim_in

source
NeuralAttentionlib.move_head_dim_out_permFunction
move_head_dim_out_perm(x::AbstractArray{T, N}, nobatch=false)
 move_head_dim_out_perm(N::Int, nobatch=false)

Dimension order for permutedims to move the head dimension (created by split_head) to the batch dimension. Returns a tuple of integers of length N. nobatch specifies whether x is a batch of data.

Example

julia> Functional.move_head_dim_out_perm(5, false)
 (1, 3, 4, 2, 5)
 
 julia> Functional.move_head_dim_out_perm(5, true)
 (1, 3, 4, 5, 2)

See also: split_head, move_head_dim_out

source
NeuralAttentionlib.naive_qkv_attentionFunction
naive_qkv_attention(q, k, v, mask=nothing)

The scaled dot-product attention of a regular transformer layer.

$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$

It's equivalent to generic_qkv_attention(weighted_sum_mixing, normalized_score(NNlib.softmax) $ masked_score(GenericMaskOp(), mask) $ scaled_dot_product_score, q, k, v).

Example

julia> fdim, ldim, bdim = 32, 10, 4;
 
 julia> x = randn(fdim, ldim, bdim);
 
@@ -24,30 +24,30 @@
 
 julia> y ≈ z
 true

See also: generic_qkv_attention

source
NeuralAttentionlib.normalized_scoreFunction
normalized_score(norm) = normalized_score $ norm
normalized_score(norm, score, args...)

Normalized attention score API. norm is the normalization function (like softmax) and score is the function that computes the attention score from args....
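
For example, the softmax-normalized scaled dot-product score used by naive_qkv_attention can be written with the $ chaining below (a sketch; assumes NNlib is loaded and the Functional symbols are imported):

softmax_score = normalized_score(NNlib.softmax) $ scaled_dot_product_score
# softmax_score(q, k) applies softmax over the scaled dot-product attention score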

See also: naive_qkv_attention

source
NeuralAttentionlib.rms_layer_normFunction
rms_layer_norm([epsilon = 1e-5,] alpha, x)

Function that performs root-mean-square layer normalization on x. alpha can be a Vector, a Number, or Nothing.

$rms_layer_norm(α, x) = α\frac{x}{\sqrt{\sum_{i=1}^{N} x^2 / N}}$

If alpha is Nothing, this is just root-mean-square normalization applied along the first dimension.
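
A plain-Julia sketch of the formula above (reference only, assuming the first dimension is the feature dimension; the library's epsilon handling may differ):

rms_layer_norm_ref(alpha, x; epsilon = 1f-5) =
    alpha .* x ./ sqrt.(sum(abs2, x; dims = 1) ./ size(x, 1) .+ epsilon)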

source
NeuralAttentionlib.scalar_relative_position_embeddingFunction
scalar_relative_position_embedding(relative_position_id_func, embedding_table, score, args...)

A relative position embedding that produces a trainable scalar bias for each value in the attention score. relative_position_id_func is a function that takes the attention score and returns a relative_position_id matrix with the same size as the attention score without the batch dimensions (normally (key length, query length)). This relative_position_id is used to index (or gather) the embedding_table. embedding_table is an array with multiple dimensions, where the first dimension is the number of possible "id"s and the remaining dimensions give a different value to each head. By default we treat the last dimension of the attention score as the batch dimension and the dimensions between the last dimension and the "length" dimensions as the head dimensions.

source
NeuralAttentionlib.scaled_dot_product_scoreFunction
 scaled_dot_product_score(q, k, s = sqrt(inv(size(k, 1))))

The scaled dot-product attention score function of a regular transformer layer.

$Score(Q, K) = \frac{QK^T}{\sqrt{d_k}}$

scaled_dot_product_score(f, q, k)

Apply a transform function f on q/k before dot-product.
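
A minimal 2-D sketch of the score computation (the library handles batched, higher-dimensional inputs through CollapsedDimsArray/matmul):

q = randn(32, 10)                          # (feature, query length)
k = randn(32, 12)                          # (feature, key length)
score = (k' * q) .* sqrt(inv(size(k, 1)))  # (key length, query length), using the default scale s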

See also: naive_qkv_attention

source
NeuralAttentionlib.split_headFunction
split_head(head::Int, x)

Split the first dimension into head pieces of smaller vectors. Equivalent to reshape(x, :, head, tail(size(x))...).
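
For example, following the stated equivalence:

julia> x = randn(30, 7, 2);

julia> size(reshape(x, :, 5, Base.tail(size(x))...))  # what split_head(5, x) produces
(6, 5, 7, 2)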

source
NeuralAttentionlib.with_rotary_position_embeddingFunction
with_rotary_position_embedding([size,] x)

Apply rotary position embedding to x. Can take a size argument, in which case the rotary position embedding is only applied to x[1:size, :, ...]. Should be used with scaled_dot_product_score/dot_product_score.

source
Structure for holding the parameters of multihead_qkv_attention.

(op::MultiheadQKVAttenOp)(q, k, v, mask = nothing)

Perform multihead attention.

source
NeuralAttentionlib.PrefixedFunctionType
PrefixedFunction(f, args::NTuple{N}) <: Function

A type representing a partially applied version of the function f, with the first N arguments fixed to the values args. In other words, PrefixedFunction(f, args) behaves similarly to (xs...)->f(args..., xs...).
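
A small sketch of that closure equivalence (illustrative only):

pf = NeuralAttentionlib.PrefixedFunction(+, (1, 2))  # roughly (xs...) -> +(1, 2, xs...)
pf(3)                                                # 6, same as (+)(1, 2, 3)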

See also NeuralAttentionlib.:$.

source

Mask

NeuralAttentionlib.apply_maskMethod
apply_mask(op::GenericMaskOp, mask::AbstractMask, score)

Equivalent to op.apply(score, op.scale .* (op.flip ? .! mask : mask)).

Example

julia> x = randn(10, 10);
 
 julia> m = CausalMask()
 CausalMask()
 
 julia> apply_mask(GenericMaskOp(.+, true, -1e9), m, x) ==  @. x + (!m * -1e9)
 true
source
NeuralAttentionlib.BandPartMaskType
BandPartMask(l::Int, u::Int) <: AbstractAttenMask{DATALESS}

Attention mask that only allows band_part values to pass.

Example

julia> trues(10, 10) .* BandPartMask(3, 5)
 10×10 BitMatrix:
  1  1  1  1  1  1  0  0  0  0
  1  1  1  1  1  1  1  0  0  0
@@ -58,7 +58,7 @@
  0  0  0  1  1  1  1  1  1  1
  0  0  0  0  1  1  1  1  1  1
  0  0  0  0  0  1  1  1  1  1
 0  0  0  0  0  0  1  1  1  1
source
NeuralAttentionlib.BatchedMaskType
BatchedMask(mask::AbstractMask) <: AbstractWrapperMask

Attention mask wrapper over an array mask for applying the same mask within the same batch.

Example

julia> m = SymLengthMask([2,3])
 SymLengthMask{1, Vector{Int32}}(Int32[2, 3])
 
 julia> trues(3,3, 2) .* m
@@ -99,7 +99,7 @@
  1  1  1
  1  1  1
  1  1  1
source
NeuralAttentionlib.BiLengthMaskType
BiLengthMask(q_len::A, k_len::A) where {A <: AbstractArray{Int, N}} <: AbstractAttenMask{ARRAYDATA}

Attention mask specified by two arrays of integers that indicate the length dimension size.

Example

julia> bm = BiLengthMask([2,3], [3, 5])
 BiLengthMask{1, Vector{Int32}}(Int32[2, 3], Int32[3, 5])
 
 julia> trues(5,5, 2) .* bm
@@ -117,7 +117,7 @@
  1  1  1  0  0
  1  1  1  0  0
  1  1  1  0  0

See also: SymLengthMask, BiSeqMask, BatchedMask, RepeatMask

source
NeuralAttentionlib.BiSeqMaskType
BiSeqMask(qmask::A1, kmask::A2) where {A1 <: AbstractSeqMask, A2 <: AbstractSeqMask} <: AbstractAttenMask

Take two sequence masks and construct an attention mask.

Example

julia> trues(7, 7, 2) .* Masks.BiSeqMask(Masks.LengthMask([3, 5]), Masks.RevLengthMask([3, 5]))
 7×7×2 BitArray{3}:
 [:, :, 1] =
  0  0  0  0  0  0  0
@@ -135,7 +135,7 @@
  1  1  1  1  1  0  0
  1  1  1  1  1  0  0
  1  1  1  1  1  0  0
 1  1  1  1  1  0  0

See also: BiLengthMask, RevBiLengthMask

source
NeuralAttentionlib.CausalMaskType
CausalMask() <: AbstractAttenMask{DATALESS}

Attention mask that blocks future values.

Similar to applying LinearAlgebra.triu! on the score matrix

Example

julia> trues(10, 10) .* CausalMask()
 10×10 BitMatrix:
  1  1  1  1  1  1  1  1  1  1
  0  1  1  1  1  1  1  1  1  1
@@ -146,7 +146,7 @@
  0  0  0  0  0  0  1  1  1  1
  0  0  0  0  0  0  0  1  1  1
  0  0  0  0  0  0  0  0  1  1
 0  0  0  0  0  0  0  0  0  1
source
NeuralAttentionlib.GenericAttenMaskType
GenericAttenMask <: AbstractAttenMask{ARRAYDATA}

Generic attention mask. Just a wrapper over AbstractArray{Bool} for dispatch.

Example

julia> bitmask = rand(Bool, 10, 10)
 10×10 Matrix{Bool}:
  1  0  1  1  0  0  1  0  1  1
  0  0  1  1  0  0  0  1  1  1
@@ -170,7 +170,7 @@
  0  0  0  1  1  1  0  1  1  1
  1  0  1  0  1  1  1  0  0  1
  0  1  0  1  0  0  1  1  0  1
 0  0  0  1  0  1  0  0  0  1
source
NeuralAttentionlib.GenericSeqMaskType
GenericSeqMask(mask::AbstractArray{Bool}) <: AbstractSeqMask{ARRAYDATA}

Create a sequence mask from an array of Bool.

Example

julia> m = GenericSeqMask(rand(Bool, 10, 2))
 GenericSeqMask{3, Array{Bool, 3}}([0 1 … 0 0;;; 1 0 … 1 0])
 
 julia> trues(7, 10, 2) .* m
@@ -200,8 +200,8 @@
 
 [:, :, 2] =
  1  0  1  1  0  1  1  1  1  0
source
NeuralAttentionlib.IndexerType
Indexer(m::AbstractMask, size::Dims{N}) <: AbstractArray{Bool, N}
Indexer(m::AbstractMask, size::Dims{N}, scale::T) <: AbstractArray{T, N}

A lazy array-like object that "materializes" the mask m with the given size and an optional scale, without size checks.

See also: GetIndexer

source
NeuralAttentionlib.LengthMaskType
LengthMask(len::AbstractArray{Int, N}) <: AbstractSeqMask{ARRAYDATA}

A sequence mask specified by an array of integers that indicate the length dimension size. Can be converted to an attention mask (SymLengthMask, BiLengthMask) with AttenMask.

Example

julia> ones(7, 7, 2) .* LengthMask([3, 5])
 7×7×2 Array{Float64, 3}:
 [:, :, 1] =
  1.0  1.0  1.0  0.0  0.0  0.0  0.0
@@ -220,7 +220,7 @@
  1.0  1.0  1.0  1.0  1.0  0.0  0.0
  1.0  1.0  1.0  1.0  1.0  0.0  0.0
  1.0  1.0  1.0  1.0  1.0  0.0  0.0
source
NeuralAttentionlib.LocalMaskType
LocalMask(width::Int) <: AbstractAttenMask{DATALESS}

Attention mask that only allows local (diagonal-like) values to pass.

width should be ≥ 0, and A .* LocalMask(1) is similar to Diagonal(A).

Example

julia> trues(10, 10) .* LocalMask(3)
 10×10 BitMatrix:
  1  1  1  0  0  0  0  0  0  0
  1  1  1  1  0  0  0  0  0  0
@@ -231,7 +231,7 @@
  0  0  0  0  1  1  1  1  1  0
  0  0  0  0  0  1  1  1  1  1
  0  0  0  0  0  0  1  1  1  1
 0  0  0  0  0  0  0  1  1  1
source
NeuralAttentionlib.RandomMaskType
RandomMask(p::Float32) <: AbstractAttenMask{DATALESS}

Attention mask that blocks values randomly.

p specifies the fraction of values to block, e.g. A .* RandomMask(0) is equivalent to identity(A) and A .* RandomMask(1) is equivalent to zero(A).

Example

julia> trues(10, 10) .* RandomMask(0.5)
 10×10 BitMatrix:
  1  1  1  1  1  1  0  1  1  1
  0  0  1  0  1  0  0  0  1  0
@@ -255,7 +255,7 @@
  1  1  1  0  1  1  1  0  0  0
  0  0  1  1  0  0  1  1  1  0
  0  1  1  1  1  0  1  0  1  0
 0  0  1  0  0  0  0  1  1  1
source
NeuralAttentionlib.RepeatMaskType
RepeatMask(mask::AbstractMask, num::Int) <: AbstractWrapperMask

Attention mask wrapper over an array mask that inner-repeats the mask num times along the last dimension.

Example

julia> m = SymLengthMask([2,3])
 SymLengthMask{1, Vector{Int32}}(Int32[2, 3])
 
 julia> trues(3,3, 2) .* m
@@ -296,7 +296,7 @@
  1  1  1
  1  1  1
  1  1  1
source
NeuralAttentionlib.RevBiLengthMaskType
RevBiLengthMask(q_len::A, k_len::A) where {A <: AbstractArray{Int, N}} <: AbstractAttenMask{ARRAYDATA}

BiLengthMask, but counting from the end of the array; used for left padding.

Example

julia> bm = RevBiLengthMask([2,3], [3, 5])
source
NeuralAttentionlib.RevLengthMaskType
RevLengthMask(len::AbstractArray{Int, N}) <: AbstractSeqMask{ARRAYDATA}

LengthMask, but counting from the end of the array; used for left padding. Can be converted to an attention mask (RevSymLengthMask, RevBiLengthMask) with AttenMask.

Example

julia> ones(7, 7, 2) .* RevLengthMask([3, 5])
 7×7×2 Array{Float64, 3}:
 [:, :, 1] =
  0.0  0.0  0.0  0.0  1.0  1.0  1.0
@@ -333,7 +333,7 @@
  0.0  0.0  1.0  1.0  1.0  1.0  1.0
  0.0  0.0  1.0  1.0  1.0  1.0  1.0
  0.0  0.0  1.0  1.0  1.0  1.0  1.0
source
NeuralAttentionlib.SymLengthMaskType
SymLengthMask(len::AbstractArray{Int, N}) <: AbstractAttenMask{ARRAYDATA}

Attention mask specified by an array of integers that indicate the length dimension size, assuming the query length and the key length are the same.

Example

julia> m = SymLengthMask([2,3])
 SymLengthMask{1, Vector{Int32}}(Int32[2, 3])
 
 julia> trues(3,3, 2) .* m
@@ -361,7 +361,7 @@
  1  1  1
  1  1  1
  1  1  1

See also: LengthMask, BiLengthMask, BatchedMask, RepeatMask

source
Base.:!Method
!m::AbstractMask

Boolean not of an attention mask

source
Base.:&Method
m1::AbstractMask & m2::AbstractMask

Logical and of two attention masks.

source
Base.:|Method
m1::AbstractMask | m2::AbstractMask

Logical or of two attention masks.

source
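
Because dataless masks are plain Julia objects, combining them with &/| just builds another mask and allocates nothing. A small sketch, with the output shown under the broadcast semantics illustrated in the mask examples above:

julia> combined = CausalMask() & LocalMask(2);

julia> trues(5, 5) .* combined
5×5 BitMatrix:
 1  1  0  0  0
 0  1  1  0  0
 0  0  1  1  0
 0  0  0  1  1
 0  0  0  0  1
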
NeuralAttentionlib.AttenMaskFunction
AttenMask(m::AbstractMask)

Convert a mask into the corresponding attention mask.

AttenMask(q_mask::AbstractSeqMask, k_mask::AbstractSeqMask)

Create an attention mask from two sequence masks specifying the sequence masks for the "query" and the "key".

source
NeuralAttentionlib.getmaskFunction
getmask(m::AbstractMask, score, scale = 1)

Convert m into a mask array (an AbstractArray) for score, with the unmasked entries set to scale.

Example

julia> getmask(CausalMask(), randn(7,7), 2)
 7×7 Matrix{Float64}:
  2.0  2.0  2.0  2.0  2.0  2.0  2.0
  0.0  2.0  2.0  2.0  2.0  2.0  2.0
@@ -370,7 +370,7 @@
  0.0  0.0  0.0  0.0  2.0  2.0  2.0
  0.0  0.0  0.0  0.0  0.0  2.0  2.0
  0.0  0.0  0.0  0.0  0.0  0.0  2.0
source

Matmul

NeuralAttentionlib.collapsed_sizeFunction
collapsed_size(x, ni, nj [, n])::Dim{3}

Collapse the dimensionality of x to 3 according to ni and nj, where ni and nj specify how many dimensions are taken by the second and third collapsed dimensions.

(X1, X2, ..., Xk, Xk+1, Xk+2, ..., Xk+ni, Xk+ni+1, ..., Xn)
  |______dim1___|  |_________ni_________|  |______nj______|

Example

julia> x = randn(7,6,5,4,3,2);
 
 julia> collapsed_size(x, 2, 2, 1)
@@ -384,7 +384,7 @@
 
 julia> collapsed_size(x, 2, 2)
 (42, 20, 6)

See also: noncollapsed_size

source
NeuralAttentionlib.matmulFunction
matmul(a::AbstractArray, b::AbstractArray, s::Number = 1)

Equivalent to s .* (a * b) if a and b are a Vector or Matrix. For arrays with higher dimensions, it converts a and b to CollapsedDimsArrays, performs batched matrix multiplication, and returns the result as a CollapsedDimsArray. This is useful for preserving the dimensionality. If the batch dimensions of a and b have different shapes, it picks the shape of b for the batch dimension. Works with NNlib.batched_transpose and NNlib.batched_adjoint.

Example

# b-dim shape: (6,)
 julia> a = CollapsedDimsArray(randn(3,4,2,3,6), 2, 1); size(a)
 (12, 6, 6)
 
@@ -402,7 +402,7 @@
 # equivalent to `batched_mul` but preserves shape
 julia> NNlib.batched_mul(collapseddims(a), collapseddims(b)) == collapseddims(matmul(a, b))
 true

See also: CollapsedDimsArray, unwrap_collapse, collapseddims

source
NeuralAttentionlib.noncollapsed_sizeFunction
noncollapsed_size(x, ni, nj [, n])

Collapse the dimensionality of x into 3 according to ni and nj.

(X1, X2, ..., Xk, Xk+1, Xk+2, ..., Xk+ni, Xk+ni+1, ..., Xn)
  |______dim1___|  |_________ni_________|  |______nj______|

But returns the sizes before collapsing, e.g. noncollapsed_size(x, ni, nj, 2) will be (Xk+1, Xk+2, ..., Xk+ni).

Example

julia> x = randn(7,6,5,4,3,2);
 
 julia> noncollapsed_size(x, 2, 2, 1)
@@ -416,4 +416,4 @@
 
 julia> noncollapsed_size(x, 2, 2)
 ((7, 6), (5, 4), (3, 2))

See also: collapsed_size

source
NeuralAttentionlib.scaled_matmulFunction
scaled_matmul(a::AbstractArray, b::AbstractArray, s::Number = 1)

Basically equivalent to unwrap_collapse(matmul(a, b, s)), but not differentiable w.r.t. s.

source
diff --git a/dev/example/index.html b/dev/example/index.html index 865b278..fafdf96 100644 --- a/dev/example/index.html +++ b/dev/example/index.html @@ -1,2 +1,2 @@ -Example · NeuralAttentionlib.jl
+Example · NeuralAttentionlib.jl
diff --git a/dev/index.html b/dev/index.html index 716da0c..3b8c62e 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · NeuralAttentionlib.jl

NeuralAttentionlib

Reusable functionality for defining custom attention/transformer layers.

NeuralAttentionlib.jl aims to provide highly extendable and reusable functions for implementing attention variants. It will be powering Transformers.jl.

Design

overview

The core idea of this package is to make the attention operation composable, so that most attention variants can be defined without rewriting the other parts. For example, normal attention uses softmax on the attention score to normalize the weight of each entry. If you want to replace softmax with another normalization function, such as an L2-norm, the problem is that they require different ways to mask specific entries such as paddings. With this package, this is done by providing a different AbstractMaskOp to masked_score, so no copy-paste is needed. As another example, some position embeddings add values to the attention scores; with this package, you can directly chain the position embedding function (or use biased_score) with other score functions. Moreover, the same definitions can be used directly for higher-dimensional attention, such as images or video.

This package contains 3 submodules: Matmul, Masks, and Functional.

  1. Matmul defines an array wrapper CollapsedDimsArray{T}(array, ni::Integer, nj::Integer) which treats an n-dimensional array as a 3-dimensional array while preserving the original shape. By explicitly specifying which dimensions are the "batch" and "length" dimensions, the implementations of attention do not need to worry about the input dimensions.
  2. Masks provides an interface for defining non-allocating masks with support for both CPU and GPU (using Julia's broadcast interface), plus many pre-defined masks. For example, CausalMask() is just a Julia object and it does NOT allocate an n^2 attention score mask on either CPU or GPU. These masks are also composable: you can use &/| to combine, for example, a causal mask and a padding mask without extra allocation or extra code.
  3. Functional contains the implementations of the "attention score"s, "mixing"s, and "attention operation"s. The "attention score" interface allows you to chain different score functions together, such as normalized_score, masked_score, and biased_score, and the "attention operation" interface allows you to provide different score and mixing functions. The other parts, such as reshaping for multi-head, are handled automatically (a short sketch of this composition follows the list).
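
As a concrete sketch of this composition, the call below follows the equivalence given in the naive_qkv_attention docstring of the API reference, here instantiated with a causal mask (assumes NNlib and the Functional/Masks symbols are imported); swapping the normalizer, the mask, or the score function only changes the corresponding piece:

q, k, v = randn(32, 10, 4), randn(32, 12, 4), randn(32, 12, 4)  # (feature, length, batch)
score_fn = normalized_score(NNlib.softmax) $ masked_score(GenericMaskOp(), CausalMask()) $ scaled_dot_product_score
y = generic_qkv_attention(weighted_sum_mixing, score_fn, q, k, v)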

Outline

diff --git a/dev/term/index.html b/dev/term/index.html index 7e15dd1..90ce834 100644 --- a/dev/term/index.html +++ b/dev/term/index.html @@ -25,4 +25,4 @@ | | | +--------------------|----------------------------------------------------------+ Attentive Value - (main output)

The attention operation is actually a special way to "mix" (or "pick", as it is sometimes called) the input information. In (probably) the first attention paper, attention is defined as a weighted sum of the input sequence given a word embedding. The idea was further generalized to QKV attention in the first transformer paper.

1. Attention Score

The attention score is used to decide how much each piece of input information will contribute to the output value, and also how many entries the attention operation will output. Operations that modify the attention score matrix should be considered part of this block, for example: different attention masks (local attention, random attention, ...), normalization (softmax, l2-norm, ...), and special attention variants that take other inputs (transformer decoder, relative position encoding, ...).

2. Mixing

We refer to the operation that takes the attention score and the input values as "mixing". Usually it is just a weighted sum over the input values, using the attention score as the weights.
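
A bare-bones sketch of these two steps for a single 2-D example (softmax-normalized scaled dot-product score, weighted-sum mixing; shapes follow the (feature, length) layout used in the API examples):

using NNlib: softmax

q, k, v = randn(32, 10), randn(32, 12), randn(32, 12)
score = softmax((k' * q) .* sqrt(inv(size(k, 1))))  # 1. attention score, (key length, query length)
y = v * score                                       # 2. mixing: weighted sum of the values, (feature, query length)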

3. Attention Operation

The whole scoring + mixing pipeline, together with other pre/post-processing, makes up an attention operation. Things like handling multiple heads should happen at this level.

Attention Mask

Attention masks are a family of operations that modify the attention score.

1. Dataless mask

We use "dataless" to refer to masks that are independent to the input. For example, CausalMask works the same on each data regardless of the batch size or the data content.

2. Array mask

We call a mask that depends on the input an "array mask". For example, SymLengthMask is used to prevent padding tokens from being considered in the attention operation, so each batch of data might have different mask values.
