Core classes
These are the core classes and functions which make up the library.
- class python_mg.Lexicon(s)
A minimalist grammar (MG) that can be used to generate SyntacticStructures or to parse strings into SyntacticStructures.
- continuations(prefix, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)
Compute the valid next strings for a prefix string.
- Parameters:
prefix (str) – A prefix string to be continued
category (str) – The syntactic category of the parsed string
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Returns:
Set indicating each possible next word or affixed word, and whether the sentence can end here.
- Return type:
set of Continuation
- detokenize(s)
Convert a sequence of tokens to their corresponding strings.
- Parameters:
s (Sequence[int] or npt.NDArray[np.uint]) – A sequence or array of token IDs to be converted to strings.
- Returns:
List of strings corresponding to the input tokens.
- Return type:
list[str]
- detokenize_batch(batch)
Convert a batch of token sequences to their corresponding strings.
- Parameters:
batch (Sequence[Sequence[int]], npt.NDArray[np.uint] or list[npt.NDArray[np.uint]]) – A batch of sequences or arrays of token IDs to be converted to strings.
- Returns:
List of lists of strings corresponding to the input tokens.
- Return type:
list[list[str]]
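The detokenizing step can be pictured as a reverse lookup in the word-to-token-ID dictionary that tokens() is documented to return. The following is a minimal sketch in plain Python, not the library's implementation: the vocabulary, the function names ending in `_sketch`, and the absence of any special tokens are all assumptions made for illustration.

```python
# Hypothetical word-to-ID mapping, shaped like the dictionary that
# tokens() is documented to return. Words and IDs are made up.
vocab = {"the": 0, "dog": 1, "barks": 2}

# Invert the mapping so that token IDs can be looked up.
id_to_word = {token_id: word for word, token_id in vocab.items()}


def detokenize_sketch(token_ids):
    """Map one sequence of token IDs to a list of strings."""
    return [id_to_word[t] for t in token_ids]


def detokenize_batch_sketch(batch):
    """Map each sequence in a batch to its own list of strings."""
    return [detokenize_sketch(seq) for seq in batch]


single = detokenize_sketch([0, 1, 2])
batched = detokenize_batch_sketch([[0, 1], [0, 1, 2]])
```

The batch version returns one inner list per input sequence, which matches the documented list[list[str]] return type.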
- generate_grammar(category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_strings=None)
Generates all syntactic structures for the lexicon.
- Parameters:
category (str) – The syntactic category to be generated.
min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Return type:
an iterator which yields all parses as they are found
- generate_unique_strings(category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_strings=None)
Generates all strings for the lexicon, without paying attention to their SyntacticStructure. This differs from python_mg.Lexicon.generate_grammar() in that different parses of the same string are collapsed and only strings are returned.
- Parameters:
category (str) – The syntactic category to be generated.
min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Returns:
The list of all strings along with their log probability
- Return type:
list[tuple[list[str], float]]
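One way to picture "collapsing" parses is to group them by their surface string and combine the log probabilities within each group. The sketch below assumes that the probabilities of distinct parses of the same string are summed (done stably in log space); the documentation above does not say how the library combines them, so treat that choice, the helper name `collapse_parses`, and the toy data as assumptions.

```python
import math
from collections import defaultdict


def collapse_parses(parses):
    """Group (words, log_prob) pairs by string and sum their probabilities.

    ASSUMPTION: probabilities of different parses of one string are added.
    Log probabilities are expected to be finite.
    """
    grouped = defaultdict(list)
    for words, log_prob in parses:
        grouped[tuple(words)].append(log_prob)

    collapsed = []
    for words, log_probs in grouped.items():
        # Log-sum-exp: subtract the maximum before exponentiating.
        top = max(log_probs)
        total = top + math.log(sum(math.exp(lp - top) for lp in log_probs))
        collapsed.append((list(words), total))
    return collapsed


# Made-up parses: the string "a b" has two derivations, "c" has one.
toy = [
    (["a", "b"], math.log(0.25)),
    (["a", "b"], math.log(0.25)),
    (["c"], math.log(0.5)),
]
result = collapse_parses(toy)
```

The result has the same shape as the documented return type: a list of (list of words, log probability) pairs with one entry per distinct string.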
- mdl(n_phonemes)
Gets the model description length of this lexicon. The precise calculation is described in "Deconstructing syntactic generalizations with minimalist grammars" (Ermolaeva, CoNLL 2021).
- Returns:
the MDL of the lexicon.
- Return type:
float
- parse(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)
Parses a string and returns all found parses in a list. The string, s, should use spaces to separate words and hyphens to join the parts of multi-word expressions produced by head movement.
- Parameters:
s (str) – A string to be parsed
category (str) – The syntactic category of the parsed string
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Returns:
All found parses of the string.
- Return type:
list of SyntacticStructure
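A common use of parse() is to pick the most probable parse of an ambiguous string. The sketch below relies only on the documented parse(s, category) and log_prob() calls and is written against any object that offers them. `best_parse`, `FakeLexicon`, and `FakeStructure` are hypothetical names, the stand-ins exist only so the example runs without the library installed, and the empty-list result for an unparseable string is an assumption carried over from the description of parse_tokens.

```python
def best_parse(lexicon, sentence, category):
    """Return the parse with the highest log probability, or None."""
    # Words are separated by spaces; hyphens join the parts of a
    # multi-word expression, as described for parse().
    parses = lexicon.parse(sentence, category)
    if not parses:
        return None
    return max(parses, key=lambda p: p.log_prob())


# Stand-in objects (NOT part of python_mg) used only to run the sketch.
class FakeStructure:
    def __init__(self, log_p):
        self._log_p = log_p

    def log_prob(self):
        return self._log_p


class FakeLexicon:
    def parse(self, s, category):
        if s == "the dog bark-s":
            return [FakeStructure(-3.0), FakeStructure(-1.5)]
        return []


chosen = best_parse(FakeLexicon(), "the dog bark-s", "C")
missing = best_parse(FakeLexicon(), "dog the", "C")
```

With a real Lexicon, the optional arguments such as max_steps and n_beams could be passed through to parse() in the same way.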
- parse_tokens(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)
Converts a sequence of tokens into a list of SyntacticStructure. Raises a ValueError if the tokens are not formatted properly; if the tokens are well formed but have no parse, the returned list is empty.
- Parameters:
s (ndarray of uint, shape (L,)) – Input token sequence, where L is the sequence length
category (str) – The syntactic category of the parsed strings
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Returns:
List of all parses of the token string.
- Return type:
list of python_mg.SyntacticStructure
- static random_lexicon(lemmas)
Generates a random lexicon with random categories.
- Returns:
a random Lexicon
- Return type:
Lexicon
- token_continuations(x, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)
Compute valid next token continuations for grammar sequences.
Takes an array of token sequences in a grammar and returns a boolean mask indicating which tokens are valid continuations at each position.
- Parameters:
x (ndarray of uint, shape (..., N, L)) – Input token sequences where N is the number of sequences and L is the maximum sequence length.
category (str) – The syntactic category of the parsed strings
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Returns:
Boolean mask indicating valid next tokens for each position, where C is the number of tokens in the grammar vocabulary.
- Return type:
ndarray of bool, shape (…, N, L, C)
Notes
The output dimensions correspond to:
…: Any leading batch dimensions (preserved from input)
N: Number of sequences
L: Maximum sequence length
C: Grammar vocabulary size
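To make the (N, L, C) layout concrete, the sketch below builds such a mask with nested lists for a toy language given as a finite set of token sequences. It is not the library's parser. It assumes that the row at position l describes which tokens may follow the prefix ending at position l, and it ignores padding and any end-of-sentence token; the documentation above does not spell out those details.

```python
# Toy language over token IDs (made up): two sentences, vocabulary size 3.
SENTENCES = [(0, 1), (0, 2, 1)]
VOCAB_SIZE = 3  # C


def is_prefix(prefix):
    """True if `prefix` starts at least one sentence of the toy language."""
    return any(s[:len(prefix)] == tuple(prefix) for s in SENTENCES)


def continuation_mask(batch):
    """Return a nested list of booleans with shape (N, L, C)."""
    mask = []
    for seq in batch:  # N sequences
        rows = []
        for end in range(1, len(seq) + 1):  # L positions
            prefix = list(seq[:end])
            # ASSUMPTION: entry c says whether token c may follow `prefix`.
            rows.append([is_prefix(prefix + [c]) for c in range(VOCAB_SIZE)])
        mask.append(rows)
    return mask


mask = continuation_mask([[0, 2], [0, 1]])
```

Here mask[0][0] is [False, True, True], because after token 0 the toy language allows token 1 or token 2, while mask[1][1] is all False because nothing can follow the complete sentence (0, 1).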
- tokens()
Gets the word-to-token-ID mapping of this lexicon as a dictionary.
- Returns:
Dictionary with string to token ID mapping.
- Return type:
dict[str, int]
- class python_mg.SyntacticStructure
The representation of a syntactic structure generated by a grammar, or alternatively the result of parsing a string.
- contains_lexical_entry(s)
- contains_word(s)
Checks whether a given word is present in this SyntacticStructure.
- Parameters:
s (str or None) – The word (or empty word) that may or may not be present
- Returns:
whether the word is present in the structure
- Return type:
bool
- latex()
Turns the SyntacticStructure into a tree that can be rendered with LaTeX. Requires including latex-commands.tex in the LaTeX preamble.
- Returns:
A LaTeX representation of the parse tree
- Return type:
str
- log_prob()
The log probability of generating this SyntacticStructure using its associated Lexicon.
- Returns:
the log probability
- Return type:
float
- max_memory_load()
The maximum number of moving elements stored in memory at one time.
- Returns:
the maximum number of moved items held in memory in the derivation
- Return type:
int
- n_steps()
The number of derivational steps necessary to derive this SyntacticStructure using its Lexicon
- Returns:
the number of steps
- Return type:
int
- prob()
The probability of generating this SyntacticStructure using its associated Lexicon.
- Returns:
the probability of the structure
- Return type:
float
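The two accessors prob() and log_prob() presumably describe the same quantity on different scales. The sketch below shows the usual relationship with made-up numbers, assuming a natural logarithm; the documentation above does not state the base, so that is an assumption.

```python
import math

# Made-up log probability, standing in for the value of log_prob().
log_p = math.log(0.125)

# ASSUMPTION: prob() equals exp(log_prob()), i.e. natural log.
p = math.exp(log_p)

# Probabilities of independent choices multiply, so their logs add.
step_log_probs = [math.log(0.5), math.log(0.25)]
joint_log_prob = sum(step_log_probs)
joint_prob = math.exp(joint_log_prob)
```

Working in log space keeps long derivations numerically stable, which is why thresholds such as min_log_prob are expressed as log probabilities.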
- to_tree() -> ParseTree
Converts a SyntacticStructure to a ParseTree.
- Returns:
The SyntacticStructure as a ParseTree.
- Return type:
python_mg.ParseTree
- tokens()
Converts the SyntacticStructure to a tokenized representation of its string.
- Returns:
the tokenized string.
- Return type:
ndarray of uint
- class python_mg.Continuation(s)
A class to represent a possible continuation of a string according to some grammar.
- static EOS()
Get the EndOfSentence marker
- Returns:
A continuation marking the end of a sentence. Equivalent to Continuation("[EOS]").
- Return type:
Continuation
- is_end_of_string()
Checks if a continuation is the end of string marker.
- Returns:
True if the continuation is the end of string marker, else False.
- Return type:
bool
- is_multi_word()
Checks if a continuation is a multi-word, or an affixed string (as the result of head movement).
- Returns:
True if the continuation is a multi-word, else False.
- Return type:
bool
- is_word()
Checks if a continuation is a word.
- Returns:
True if the continuation is a word, else False.
- Return type:
bool
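The Continuation methods are most useful together with Lexicon.continuations(). The sketch below checks whether a prefix can already stand as a complete sentence, using only the documented continuations(prefix, category) and is_end_of_string() calls. `can_end_here`, `FakeLexicon`, and `FakeContinuation` are hypothetical names; the stand-ins and their hard-coded behaviour exist only so the example runs without the library installed.

```python
def can_end_here(lexicon, prefix, category):
    """True if the end-of-string marker is among the continuations."""
    return any(
        c.is_end_of_string()
        for c in lexicon.continuations(prefix, category)
    )


# Stand-in objects (NOT part of python_mg) used only to run the sketch.
class FakeContinuation:
    def __init__(self, text):
        self.text = text

    def is_end_of_string(self):
        return self.text == "[EOS]"


class FakeLexicon:
    def continuations(self, prefix, category):
        # Hard-coded toy behaviour: only "the dog" may end the sentence.
        if prefix == "the dog":
            return {FakeContinuation("[EOS]"), FakeContinuation("barks")}
        return {FakeContinuation("dog")}


fake = FakeLexicon()
complete = can_end_here(fake, "the dog", "C")
incomplete = can_end_here(fake, "the", "C")
```

The same pattern with is_word() or is_multi_word() would filter the set down to ordinary words or to affixed strings.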