A configuration class for the Bigram language model. Inherits from `PretrainedConfig`.
- Attributes:
  - `vocab_size`: int, default=2 - The size of the vocabulary.
  - `max_length`: int, default=2 - The maximum length of the input sequences.
A simple Bigram Language Model.
- Attributes:
  - `config_class`: `BigramLMConfig` - Configuration class for the model.
  - `vocab_size`: int - The size of the vocabulary.
  - `max_length`: int - The maximum length of the input sequences.
  - `logits`: `torch.nn.Parameter` - The logits for bigram predictions.
- Methods:
  - `forward(input_ids, labels=None, **kwargs)`: Computes the forward pass.
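To make the shapes concrete, here is a minimal sketch of how a bigram model of this kind can be written on top of `PretrainedConfig`/`PreTrainedModel`. The class and attribute names follow the description above, but the defaults, return type, and loss handling in the actual implementation may differ.

```python
import torch
from torch.nn import functional as F
from transformers import PretrainedConfig, PreTrainedModel


class BigramLMConfig(PretrainedConfig):
    model_type = "bigram"

    def __init__(self, vocab_size=2, max_length=2, **kwargs):
        self.vocab_size = vocab_size
        self.max_length = max_length
        super().__init__(**kwargs)


class BigramLM(PreTrainedModel):
    config_class = BigramLMConfig

    def __init__(self, config):
        super().__init__(config)
        # One row of next-token logits for every token in the vocabulary.
        self.logits = torch.nn.Parameter(torch.zeros(config.vocab_size, config.vocab_size))

    def forward(self, input_ids, labels=None, **kwargs):
        # Row i of self.logits scores the token that follows token i.
        logits = self.logits[input_ids]
        loss = None
        if labels is not None:
            # Causal shift: position t predicts the token at position t + 1.
            shift_logits = logits[:, :-1, :]
            shift_labels = labels[:, 1:]
            loss = F.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1),
            )
        return {"loss": loss, "logits": logits}
```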
A configuration class for the Dummy language model. Inherits from `PretrainedConfig`.
A dummy model for testing purposes.
- Attributes:
  - `config_class`: `DummyLMConfig` - Configuration class for the model.
- Methods:
  - `load_model(cls, config, pre_trained=False, model_name_or_path=None)`: Loads the model.
  - `forward(input_ids, labels=None, **kwargs)`: Computes the forward pass.
An implementation of attention with ALiBi (Attention with Linear Biases).
Configuration class for the model with ALiBi. Inherits from `OPTConfig`.
A decoder layer for the OPT model with ALiBi.
A full decoder stack for the OPT model with ALiBi.
The full OPT model with ALiBi support.
A class for causal language modeling with OPT and ALiBi.
A class for sequence classification with OPT and ALiBi.
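ALiBi removes learned position embeddings and instead adds a fixed, head-specific linear penalty to the attention scores based on query-key distance. A minimal sketch of how the bias tensor is typically constructed is shown below (the helper names `alibi_slopes` and `alibi_bias` are illustrative, and the exact wiring into the OPT attention classes above is not shown):

```python
import torch


def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper
    # (exact for power-of-two head counts).
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Signed distance j - i between key position j and query position i;
    # tokens further in the past receive an increasingly negative bias.
    positions = torch.arange(seq_len)
    distances = positions[None, :] - positions[:, None]   # (seq, seq)
    slopes = alibi_slopes(num_heads)[:, None, None]       # (heads, 1, 1)
    return slopes * distances[None, :, :]                 # (heads, seq, seq)


# The bias is added to the raw attention scores before the causal mask and softmax:
#   scores = (q @ k.transpose(-1, -2)) / head_dim**0.5 + alibi_bias(num_heads, seq_len)
```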
Configuration class for the RNN language model. Inherits from `PretrainedConfig`.
- Attributes:
  - `vocab_size`: int, default=10000 - The size of the vocabulary.
  - `block_size`: int, default=128 - The maximum length of the input sequences.
  - `embedding_dim`: int, default=256 - The dimension of the embeddings.
  - `hidden_dim`: int, default=256 - The dimension of the hidden states.
  - `num_layers`: int, default=4 - The number of layers.
  - `cell_type`: str, default="lstm" - The type of RNN cell.
  - `add_bias`: bool, default=True - Whether to add a bias term.
  - `embedding_dropout`: float, default=0.1 - The dropout rate for embeddings.
  - `dropout`: float, default=0.1 - The dropout rate.
A Recurrent Neural Network language model.
- Attributes:
  - `config_class`: `RnnLMConfig` - Configuration class for the model.
  - `wte`: `nn.Embedding` - The embedding layer.
  - `encoder`: `nn.Module` - The encoder module (an RNN, GRU, or LSTM).
  - `lm_head`: `nn.Linear` - The output layer.
- Methods:
  - `_init_hidden()`: Initializes the hidden states.
  - `_expand_hidden(batch_size)`: Expands the initial hidden state to match the batch size.
  - `forward(input_ids, labels=None, hidden_state=None, pad_id=-100, reduction="mean", return_dict=True, **kwargs)`: Computes the forward pass.
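A rough sketch of how the pieces described above fit together (embedding, recurrent encoder, linear head). The attribute names mirror the list above, but the hidden-state initialization, return type, and loss handling of the actual implementation may differ.

```python
import torch.nn as nn
from torch.nn import functional as F


class RnnLM(nn.Module):
    """Skeleton of the RNN language model described above (illustrative only)."""

    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.embedding_dim)
        self.drop = nn.Dropout(config.embedding_dropout)
        rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[config.cell_type]
        self.encoder = rnn_cls(
            config.embedding_dim,
            config.hidden_dim,
            num_layers=config.num_layers,
            bias=config.add_bias,
            dropout=config.dropout,
            batch_first=True,
        )
        self.lm_head = nn.Linear(config.hidden_dim, config.vocab_size)

    def forward(self, input_ids, labels=None, hidden_state=None, pad_id=-100,
                reduction="mean", **kwargs):
        embeds = self.drop(self.wte(input_ids))
        output, hidden_state = self.encoder(embeds, hidden_state)
        logits = self.lm_head(output)
        loss = None
        if labels is not None:
            # Shift so that position t predicts token t + 1; padding is ignored via pad_id.
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=pad_id,
                reduction=reduction,
            )
        return {"loss": loss, "logits": logits, "hidden_state": hidden_state}
```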
An abstract base class for character-level tokenization functions.
- Attributes:
  - `name`: str - The name of the tokenization function.
- Methods:
  - `__call__(text: str) -> List[str]`: Tokenizes the input text.
  - `get_config() -> Dict[str, str]`: Returns the configuration of the tokenization function.
  - `from_config(config: Dict[str, str]) -> "CharTokenizationFunction"`: Creates an instance from a configuration.
A tokenization function using regular expressions.
- Attributes:
  - `pattern`: `re.Pattern` - The compiled regex pattern.
- Methods:
  - `__call__(text: str) -> List[str]`: Tokenizes the input text using the regex pattern.
  - `get_config() -> Dict[str, str]`: Returns the regex pattern as configuration.
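To illustrate the interface, here is a minimal sketch of the abstract base class together with a regex-based implementation. `CharTokenizationFunction` is named in the documentation above; the concrete class name `RegexTokenizationFunction` and the serialization details are assumptions.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List


class CharTokenizationFunction(ABC):
    """Abstract interface for character-level tokenization functions."""

    name: str = "base"

    @abstractmethod
    def __call__(self, text: str) -> List[str]:
        ...

    @abstractmethod
    def get_config(self) -> Dict[str, str]:
        ...

    @classmethod
    def from_config(cls, config: Dict[str, str]) -> "CharTokenizationFunction":
        # Reconstruct an instance from its serialized configuration.
        return cls(**config)


class RegexTokenizationFunction(CharTokenizationFunction):
    """Splits text into tokens with a compiled regular expression (illustrative)."""

    name = "regex"

    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def __call__(self, text: str) -> List[str]:
        return self.pattern.findall(text)

    def get_config(self) -> Dict[str, str]:
        return {"pattern": self.pattern.pattern}


# Example: split a string into single characters.
print(RegexTokenizationFunction(r".")("hello"))   # ['h', 'e', 'l', 'l', 'o']
```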
A tokenization function using IPA (International Phonetic Alphabet).
- Methods:
  - `__call__(text: str) -> List[str]`: Tokenizes the input text using IPA.
  - `get_config() -> Dict[str, str]`: Returns an empty configuration.
A character-level tokenizer.
- Attributes:
  - `vocab_files_names`: `Dict[str, str]` - The names of the vocabulary files.
  - `_vocab_int_to_str`: `Dict[int, str]` - Mapping from token IDs to strings.
  - `_vocab_str_to_int`: `Dict[str, int]` - Mapping from strings to token IDs.
- Methods:
  - `vocab_size -> int`: Returns the size of the vocabulary.
  - `_tokenize(text: str) -> List[str]`: Tokenizes the input text.
  - `convert_tokens_to_ids(tokens: Union[str, List[str]]) -> int`: Converts tokens to their corresponding IDs.
  - `convert_ids_to_tokens(indices: Union[int, List[int]]) -> str`: Converts IDs to their corresponding tokens.
  - `tokens_to_string(tokens)`: Converts a list of tokens to a string.
  - `build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]`: Builds inputs with special tokens.
  - `train(files)`: Trains the tokenizer on a list of files.
  - `encode_batch(input, add_special_tokens=False, padding='max_length')`: Encodes a batch of inputs.
  - `_add_items_to_encodings(encodings)`: Adds items to encodings.
  - `get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) -> List[int]`: Returns a mask for special tokens.
  - `create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]`: Creates token type IDs from sequences.
  - `save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]`: Saves the vocabulary.
  - `set_tokenization_function(tokenization_function: CharTokenizationFunction)`: Sets the tokenization function.
A tokenizer derived from the HuggingFace `BertTokenizer`, specialized for the SaGe model.
- Attributes:
  - `vocab_files_names`: `Dict[str, str]` - The names of the vocabulary files.
  - `vocab`: `Dict[str, int]` - The vocabulary.
  - `ids_to_tokens`: `Dict[int, str]` - Mapping from token IDs to tokens.
  - `do_basic_tokenize`: bool - Whether to do basic tokenization.
  - `basic_tokenizer`: `BasicTokenizer` - The basic tokenizer.
  - `wordpiece_tokenizer`: `WordpieceTokenizer` - The WordPiece tokenizer.
- Methods:
  - `do_lower_case`: Property indicating whether lowercasing is applied.
  - `vocab_size`: Property returning the size of the vocabulary.
  - `get_vocab() -> Dict[str, int]`: Returns the vocabulary.
  - `_tokenize(text: str)`: Tokenizes the input text.
  - `_convert_token_to_id(token: str) -> int`: Converts a token to an ID.
  - `_convert_id_to_token(index: int) -> str`: Converts an ID to a token.
  - `convert_tokens_to_string(tokens: List[str]) -> str`: Converts a list of tokens to a string.
  - `get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) -> List[int]`: Returns a mask for special tokens.
  - `save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]`: Saves the vocabulary.
A tokenizer for WordPiece tokenization.
- Attributes:
  - `vocab`: `Dict[str, int]` - The vocabulary.
  - `unk_token`: str - The unknown token.
  - `max_input_chars_per_word`: int - The maximum number of input characters per word.
- Methods:
  - `tokenize(text: str) -> List[str]`: Tokenizes the input text into WordPiece tokens.
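For context, WordPiece tokenization is a greedy longest-match-first search over the vocabulary, marking word-internal pieces with a continuation prefix (`##` in BERT-style vocabularies). A simplified, self-contained sketch of that procedure (the actual class stores the vocabulary and tokens on the instance):

```python
from typing import Dict, List


def wordpiece_tokenize(word: str, vocab: Dict[str, int], unk_token: str = "[UNK]",
                       max_input_chars_per_word: int = 100) -> List[str]:
    # Words longer than the limit map to the unknown token.
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end, cur_substr = len(word), None
        # Greedily take the longest vocabulary entry starting at `start`.
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr   # continuation piece
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:
            return [unk_token]           # no piece matches: the whole word is unknown
        tokens.append(cur_substr)
        start = end
    return tokens


vocab = {"un": 0, "##aff": 1, "##able": 2}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']
```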
A trainer class for RNN language models.
- Methods:
  - `compute_loss(model, inputs, return_outputs=False)`: Computes the loss for the model.
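Overriding `compute_loss` is the standard way to customize how a HuggingFace `Trainer` extracts the loss from a model's outputs. A minimal sketch of such an override for the RNN model above (the class name is assumed, and the actual trainer may additionally carry the hidden state across batches and repackage it):

```python
from transformers import Trainer


class RnnTrainer(Trainer):
    """Illustrative trainer override for the RNN language model."""

    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs["loss"]
        return (loss, outputs) if return_outputs else loss
```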
- block_size (`Optional[int]`): The input size of the model. Default is `None`.
- auto_model_class (`Literal["causal", "masked", "seq2seq"]`): Specifies which automodel class to use. Default is `"causal"`.
- input_files_path (`str`): Path to the train, validation, and test files. Default file names are `train.txt`, `validation.txt`, and `test.txt`.
- train_file_name (`str`): Filename of the train data file. Default is `train.txt`.
- validation_file_name (`str`): Filename of the validation data file. Default is `validation.txt`.
- test_file_name (`str`): Filename of the test data file. Default is `test.txt`.
- tokenizer_name_or_path (`str`): Name of a HuggingFace tokenizer or path to a self-trained tokenizer.
- model_type (`str`): Specifies the type of AutoModel class to use.
- model_name_or_path (`Optional[str]`): Name of a HuggingFace model or path to a self-trained model.
- config_name_or_path (`Optional[str]`): Path to a model config file.
- use_early_stopping (`bool`): Whether to stop training if the validation loss does not improve for a certain number of steps. Default is `False`.
- early_stopping_patience (`int`): Number of eval steps to wait before stopping training if early stopping is used. Default is `1`.
- lossy_context (`bool`): Whether to break sequences randomly before concatenation. Default is `False`.
- lossy_context_token (`str`): Token used for sequence breaking with lossy context. Default is `<b>`.
- wandb_group_name (`str`): Name of the wandb group.
- wandb_project_name (`str`): Name of the wandb project.
- input_files_path (`str`): Path to the file with the evaluation data.
- eval_file_name (`Optional[str]`): Filename of the evaluation data file.
- eval_string (`Optional[str]`): JSON string with strings to be annotated for surprisal.
- tokenizer_name_or_path (`str`): Name of a HuggingFace tokenizer or path to a self-trained tokenizer.
- model_type (`str`): Specifies the type of AutoModel class to use.
- model_name_or_path (`str`): Name of a HuggingFace model or path to a self-trained model.
- batch_size (`int`): Batch size for evaluation. Default is `8`.
- device (`Optional[str]`): Device to use for evaluation. Default is `"cpu"`.
- output_dir (`str`): Path to save the evaluation results.
- output_file_name (`str`): Filename used to save evaluation results. Default is `eval_results.tsv`.
- log_level (`Literal["info", "warning", "error", "debug"]`): Log level for evaluation. Default is `"warning"`.
- sum_subword_surprisal (`bool`): Sum surprisals based on `--subword_prefix`. Default is `False`.
- subword_prefix (`str`): Prefix of words for subword tokenization. Default is `"Ġ"`.
- prepend_token (`bool`): Prepend an EOS token to each batch of sequences. Default is `False`.
- input_files_path (`str`): Path to the directory containing the training files for the tokenizer.
- input_files_type (`str`): File type of the input files, e.g., '.txt'. Default is `"txt"`.
- tokenizer_type (`Literal["bpe", "unigram", "word-level"]`): Algorithm of the tokenizer. Default is `None`.
- vocab_size (`int`): Vocabulary size of the tokenizer. Default is `None`.
- output_dir (`str`): Where to save the tokenizer.
- pad_token (`str`): Padding token. Default is `<pad>`.
- eos_token (`str`): End of sequence token. Default is `</s>`.
- unk_token (`str`): Unknown token. Default is `<unk>`.
- lossy_context (`bool`): Whether to add the lossy context token to the tokenizer. Default is `False`.
- lossy_context_token (`str`): Sequence breaking token used with lossy context. Default is `<b>`.
- annotation_style (`Literal["misc", "column"]`): How annotation should take place. Default is `"misc"`.
- tag (`str`): Tag for the surprisal column.
Returns a list of argument data classes based on the parser type.
get_automodel(model_type: str) -> Type[AutoModelForCausalLM | AutoModelForMaskedLM | AutoModelForSeq2SeqLM]
Returns the appropriate AutoModel class based on the model type.
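A sketch of how this mapping presumably looks, assuming the keys match the `"causal"`/`"masked"`/`"seq2seq"` literals used in the arguments above:

```python
from typing import Type, Union
from transformers import (AutoModelForCausalLM, AutoModelForMaskedLM,
                          AutoModelForSeq2SeqLM)


def get_automodel(model_type: str) -> Type[Union[AutoModelForCausalLM,
                                                 AutoModelForMaskedLM,
                                                 AutoModelForSeq2SeqLM]]:
    # Assumed mapping from model-type strings to HuggingFace Auto classes.
    mapping = {
        "causal": AutoModelForCausalLM,
        "masked": AutoModelForMaskedLM,
        "seq2seq": AutoModelForSeq2SeqLM,
    }
    return mapping[model_type]
```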
repackage_hidden(hidden_state: torch.Tensor) -> torch.Tensor
Detaches hidden states from their history to avoid memory issues.
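A sketch under the assumption that this follows the classic PyTorch pattern for truncated backpropagation through time: detach each tensor, recursing into tuples because LSTMs return an `(h, c)` pair.

```python
import torch


def repackage_hidden(hidden_state):
    """Detach hidden states from the graph so backprop stops at the batch boundary."""
    if isinstance(hidden_state, torch.Tensor):
        return hidden_state.detach()
    # LSTMs return a (h, c) tuple; detach each element recursively.
    return tuple(repackage_hidden(h) for h in hidden_state)
```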
Returns the number of times a value occurs in a tensor.
Tokenizes the given sequences.
Prefixes each batch of sequences with an EOS token.
preprocess_function_eval(sequences: dict, tokenizer: AutoTokenizer, model_max_length: int, stride: int, prefix_eos_token: bool = False) -> dict
Processes sequences for evaluation.
preprocess_function(sequences: dict, tokenizer: AutoTokenizer, model_max_length: int, stride: int) -> dict
Processes sequences with optional padding and truncation.
Processes sequences using a sliding window approach.
compute_batch_surprisal(batch_input_ids: torch.Tensor, batch_mask: torch.Tensor, batch_logits: torch.Tensor, sequence_ids: list, tokenizer: AutoTokenizer) -> dict
Computes the surprisal for each token in the batch.
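Surprisal is the negative log probability of each observed token given its left context, computed from the model's shifted logits. A simplified sketch of the core computation; the full function additionally uses the attention mask and `sequence_ids` to align tokens with their source sequences, and whether the repository reports surprisal in bits or nats is not specified here.

```python
import torch
import torch.nn.functional as F


def batch_surprisal(batch_input_ids: torch.Tensor, batch_logits: torch.Tensor) -> torch.Tensor:
    """Per-token surprisal in bits, shape (batch, seq_len - 1)."""
    # Logits at position t are predictions for the token at position t + 1.
    log_probs = F.log_softmax(batch_logits[:, :-1, :], dim=-1)
    targets = batch_input_ids[:, 1:].unsqueeze(-1)
    neg_log_prob = -log_probs.gather(-1, targets).squeeze(-1)   # natural log
    return neg_log_prob / torch.log(torch.tensor(2.0))          # convert nats to bits
```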
Computes the surprisal for a cloze task.
Loads a CoNLL-U formatted file.
Saves data frames in CoNLL-U format.
get_word_surprisal(surprisal: List[float], tokens: List[str], words: List[str], tokenizer: AutoTokenizer, subword_prefix: str) -> List[float]
Calculates word-level surprisal.
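Because surprisals are additive in log space, a word's surprisal is the sum of the surprisals of its subword pieces. A simplified sketch of regrouping by a GPT-2-style word-start prefix; the actual function also takes the `words` list and the tokenizer into account, so its grouping logic may differ.

```python
from typing import List


def word_surprisal(surprisal: List[float], tokens: List[str],
                   subword_prefix: str = "Ġ") -> List[float]:
    """Sum subword surprisals into word-level surprisals."""
    word_surprisals: List[float] = []
    for value, token in zip(surprisal, tokens):
        if token.startswith(subword_prefix) or not word_surprisals:
            word_surprisals.append(value)      # token starts a new word
        else:
            word_surprisals[-1] += value       # continuation piece of the current word
    return word_surprisals
```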
Applies a lossy context to the sequences.
Sets the seed for reproducibility.
Returns the current timestamp.
Creates a directory and returns its path.
- Methods:
  - `get_argparser(cls, parser_type: str)`: Returns the argument parser for the specified type.
- Methods:
  - `from_pretrained(cls, model_type: str, model_name_or_path: Optional[str] = None)`: Loads a model from a pre-trained model or path.
  - `from_config(cls, model_type: str, config_name_or_path: Optional[str] = None)`: Loads a model from a configuration file.
- Methods:
  - `get_trainer(cls, model_type: str, **trainer_args)`: Returns a trainer instance based on the model type.