Core classes

These are the core classes and functions which make up the library.

class python_mg.Lexicon(s)

A Minimalist Grammar (MG) that can be used to generate SyntacticStructures or to parse strings into SyntacticStructures.

continuations(prefix, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)

Computes the valid continuations of a prefix string.

Parameters:

prefix (str) – A prefix string to be continued
category (str) – The syntactic category of the parsed string
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

Set of the possible next words or affixed words, together with an end-of-sentence marker if the sentence can be ended at this point.

Return type:

set of Continuation

detokenize(s)

Convert a sequence of tokens to their corresponding strings.

Parameters:: s (Sequence[int] or npt.NDArray[np.uint]) – A sequence or array of token IDs to be converted to strings.
Returns:: List of strings corresponding to the input tokens.
Return type:: list[str]

detokenize_batch(batch)

Convert a batch of token sequences to their corresponding strings.

Parameters:: batch (Sequence[Sequence[int]], npt.NDArray[np.uint] or list[npt.NDArray[np.uint]]) – A batch of sequences or arrays of token IDs to be converted to strings.
Returns:: List of lists of strings corresponding to the input tokens.
Return type:: list[list[str]]

generate_grammar(category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_strings=None)

Generates all syntactic structures for the lexicon.

Parameters:

category (str) – The syntactic category to be generated.
min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
max_strings (int or None, optional) – Maximum number of results to generate. If None, will not be limited. Default is None.

Returns:

An iterator which yields all parses as they are found.

Return type:

iterator of SyntacticStructure
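Because the result is an iterator that yields parses as they are found, a caller can stop early without waiting for the full enumeration. The sketch below shows one way to do that with itertools.islice. The take helper and the fake_parses generator are hypothetical stand-ins so the example runs without the library; in real use the iterator would come from generate_grammar.

```python
from itertools import islice


def take(iterator, n):
    """Consume at most n items from an iterator and return them as a list."""
    return list(islice(iterator, n))


def fake_parses():
    # Hypothetical stand-in for the iterator returned by generate_grammar:
    # it yields results one at a time and never ends on its own.
    i = 0
    while True:
        yield f"parse-{i}"
        i += 1


# Only the first three results are ever computed.
first_three = take(fake_parses(), 3)
print(first_three)
```

Stopping early this way only limits how much of the iterator is consumed; it does not change the search parameters such as max_steps or n_beams.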

generate_unique_strings(category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_strings=None)

Generates all strings for the lexicon, without regard to their SyntacticStructure. This differs from python_mg.Lexicon.generate_grammar() in that different parses of the same string are collapsed, and only strings are returned.

Parameters:

category (str) – The syntactic category to be generated.
min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
max_strings (int or None, optional) – Maximum number of strings to generate. If None, will not be limited. Default is None.

Returns:

The list of all strings along with their log probability

Return type:

list[tuple[list[str], float]]
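The returned list pairs each string (as a list of words) with its log probability, so it can be ranked and printed with plain Python. The following is a minimal sketch: top_strings is a hypothetical helper, the data is made up in the documented shape, and converting with math.exp assumes the log probabilities are natural logarithms, which this page does not state.

```python
import math


def top_strings(results, k=3):
    """Return the k most probable strings as (sentence, probability) pairs.

    `results` has the documented shape list[tuple[list[str], float]].
    Assumption: log probabilities are natural logs, hence math.exp.
    """
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [(" ".join(words), math.exp(log_p)) for words, log_p in ranked[:k]]


# Made-up data standing in for the output of generate_unique_strings.
results = [
    (["the", "dog", "barks"], math.log(0.25)),
    (["the", "cat", "sleeps"], math.log(0.5)),
    (["dogs", "bark"], math.log(0.125)),
]

best = top_strings(results, k=2)
for sentence, prob in best:
    print(f"{prob:.3f}  {sentence}")
```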

mdl(n_phonemes)

Gets the minimum description length (MDL) of this lexicon. The precise calculation is described in Deconstructing syntactic generalizations with minimalist grammars (Ermolaeva, CoNLL 2021).

Returns:: the MDL of the lexicon.
Return type:: float

parse(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)

Parses a string and returns all found parses in a list. The string, s, should be delimited by spaces for words and by hyphens for multi-word expressions resulting from head movement.

Parameters:

s (str) – A string to be parsed
category (str) – The syntactic category of the parsed string
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
max_parses (int or None, optional) – Maximum number of parses to return. If None, will not be limited. Default is None.

Returns:

All found parses of the string.

Return type:

list of SyntacticStructure
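The input format described for parse (spaces between words, hyphens inside a multi-word expression) can be built from already-segmented words with a small helper. This is only a sketch: format_for_parse is a hypothetical helper, not part of python_mg, and the example words and morphemes are made up.

```python
def format_for_parse(words):
    """Build an input string in the format parse expects.

    `words` is a list of words, each given as a list of its parts.
    Parts of one multi-word expression are joined with hyphens,
    and the resulting words are joined with spaces.
    """
    return " ".join("-".join(parts) for parts in words)


# "bark" + "s" is an illustrative multi-word expression, not taken from a real lexicon.
sentence = format_for_parse([["the"], ["dog"], ["bark", "s"]])
print(sentence)
```

The resulting string would then be passed as the s argument of parse, together with a category.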

parse_tokens(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)

Converts a sequence of tokens into a list of SyntacticStructure. Raises a ValueError if the tokens are not formatted properly (if the tokens are well formed but there is no parse, the returned list is empty).

Parameters:

s (ndarray of uint, shape (L,)) – Input token sequence, where L is the sequence length
category (str) – The syntactic category of the parsed strings
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

List of all parses of the token sequence.

Return type:

list of SyntacticStructure

static random_lexicon(lemmas)

Generates a random lexicon with random categories.

Returns:: a random Lexicon
Return type:: python_mg.Lexicon()

token_continuations(x, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)

Compute valid next token continuations for grammar sequences.

Takes an array of token sequences in a grammar and returns a boolean mask indicating which tokens are valid continuations at each position.

Parameters:

x (ndarray of uint, shape (..., N, L)) – Input token sequences where N is the number of sequences and L is the maximum sequence length.
category (str) – The syntactic category of the parsed strings
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

Boolean mask indicating valid next tokens for each position, where C is the number of tokens in the grammar vocabulary.

Return type:

ndarray of bool, shape (..., N, L, C)

Notes

The output dimensions correspond to:

...: Miscellaneous batch dimensions (preserved from input)

N: Number of sequences

L: Maximum sequence length

C: Grammar vocabulary size
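To make the output layout concrete, the sketch below reads the allowed token IDs for one sequence and one position out of a mask of shape (N, L, C). Plain nested lists stand in for the boolean ndarray so the example runs without NumPy; valid_token_ids is a hypothetical helper and the mask values are made up.

```python
def valid_token_ids(mask, n, pos):
    """Return the token IDs marked valid at position `pos` of sequence `n`.

    `mask` is indexed as mask[n][pos][token_id], i.e. shape (N, L, C)
    with no extra leading batch dimensions.
    """
    return [token_id for token_id, ok in enumerate(mask[n][pos]) if ok]


# Made-up mask with N=1 sequence, L=2 positions, C=4 vocabulary entries.
mask = [
    [
        [False, True, True, False],   # position 0: tokens 1 and 2 are allowed
        [False, False, False, True],  # position 1: only token 3 is allowed
    ],
]

print(valid_token_ids(mask, 0, 0))
print(valid_token_ids(mask, 0, 1))
```

The same indexing works on a real boolean array, since iterating over its last axis yields one truth value per vocabulary entry.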

tokens()

Gets the word-to-token-ID mapping of this lexicon.

Returns:: Dictionary with string to token ID mapping.
Return type:: dict[str, int]
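Since the mapping goes from word to token ID, looking up words by ID requires inverting it. A minimal sketch follows; invert_vocab is a hypothetical helper and the vocabulary shown is made up rather than taken from a real lexicon. For ordinary decoding, detokenize already covers this case.

```python
def invert_vocab(word_to_id):
    """Turn a word -> token ID mapping into a token ID -> word mapping."""
    id_to_word = {token_id: word for word, token_id in word_to_id.items()}
    if len(id_to_word) != len(word_to_id):
        # Two words sharing an ID would silently overwrite each other.
        raise ValueError("token IDs are not unique")
    return id_to_word


# Made-up vocabulary in the documented shape (str -> int).
vocab = {"the": 3, "dog": 4, "barks": 5}
id_to_word = invert_vocab(vocab)
decoded = [id_to_word[i] for i in [3, 4, 5]]
print(decoded)
```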

class python_mg.SyntacticStructure

The representation of a syntactic structure generated by a grammar, or alternatively the result of parsing a string.

contains_lexical_entry(s)

contains_word(s)

Checks whether a word is present in this SyntacticStructure.

Parameters:: s (str or None) – The word (or empty word) that may or may not be present
Returns:: whether the word is present in the structure
Return type:: bool

latex()

Turns the SyntacticStructure into a tree that can be rendered with LaTeX. Requires including latex-commands.tex in the LaTeX preamble.

Returns:: A LaTeX representation of the parse tree
Return type:: str

log_prob()

The log probability of generating this SyntacticStructure using its associated Lexicon.

Returns:: the log probability
Return type:: float
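When a string has several parses, log_prob gives a natural way to rank them. The sketch below picks the most probable one; most_probable is a hypothetical helper, and the SimpleNamespace objects are stand-ins that only mimic the documented log_prob() method so the example runs without the library.

```python
from types import SimpleNamespace


def most_probable(parses):
    """Return the parse with the highest log probability, or None if there are none."""
    return max(parses, key=lambda p: p.log_prob(), default=None)


# Stand-ins for SyntacticStructure objects; only log_prob() is imitated.
demo_parses = [
    SimpleNamespace(name="a", log_prob=lambda: -2.0),
    SimpleNamespace(name="b", log_prob=lambda: -0.5),
    SimpleNamespace(name="c", log_prob=lambda: -3.25),
]

best_parse = most_probable(demo_parses)
print(best_parse.name)
```

Comparing log probabilities rather than probabilities avoids underflow for long derivations and gives the same ordering.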

max_memory_load()

The maximum number of moving elements stored in memory at one time.

Returns:: the maximum number of moved items held in memory in the derivation
Return type:: int

n_steps()

The number of derivational steps necessary to derive this SyntacticStructure using its Lexicon.

Returns:: the number of steps
Return type:: int

prob()

The probability of generating this SyntacticStructure using its associated Lexicon.

Returns:: the probability of the structure
Return type:: float

to_tree() → ParseTree

Converts a SyntacticStructure to a ParseTree

Returns:: The SyntacticStructure as a ParseTree.
Return type:: python_mg.ParseTree()

tokens()

Converts the SyntacticStructure to a tokenized representation of its string.

Returns:: the tokenized string.
Return type:: ndarray of uint

class python_mg.Continuation(s)

A class to represent a possible continuation of a string according to some grammar.

static EOS()

Gets the end-of-sentence marker.

Returns:: A continuation marking the end of a sentence. Equivalent to Continuation("[EOS]")
Return type:: python_mg.Continuation()

is_end_of_string()

Checks if a continuation is the end of string marker.

Returns:: True if the continuation is the end of string marker, else False.
Return type:: bool

is_multi_word()

Checks if a continuation is a multi-word, that is, an affixed string resulting from head movement.

Returns:: True if the continuation is a multi-word, else False.
Return type:: bool

is_word()

Checks if a continuation is a word.

Returns:: True if the continuation is a word, else False.
Return type:: bool
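The three predicates above can be used to sort a set of continuations into plain words, multi-words, and a flag for whether the sentence may end. The following is a sketch under assumptions: group_continuations and _stand_in are hypothetical, and the stand-in objects only mimic the documented methods so the example runs without the library. The checks are done in a fixed order, so the code does not rely on the predicates being mutually exclusive.

```python
from types import SimpleNamespace


def group_continuations(continuations):
    """Split continuations into (words, multi_words, can_end)."""
    words, multi_words, can_end = [], [], False
    for c in continuations:
        if c.is_end_of_string():
            can_end = True
        elif c.is_multi_word():
            multi_words.append(c)
        elif c.is_word():
            words.append(c)
    return words, multi_words, can_end


def _stand_in(kind):
    # Hypothetical stand-in exposing the three documented predicates.
    return SimpleNamespace(
        kind=kind,
        is_word=lambda: kind == "word",
        is_multi_word=lambda: kind == "multi",
        is_end_of_string=lambda: kind == "eos",
    )


demo = [_stand_in("word"), _stand_in("eos"), _stand_in("multi"), _stand_in("word")]
words, multi_words, can_end = group_continuations(demo)
print(len(words), len(multi_words), can_end)
```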