Core classes

These are the core classes and functions which make up the library.

class python_mg.Lexicon(s)

A Minimalist Grammar (MG) that can be used to generate SyntacticStructures or to parse strings into SyntacticStructures.

continuations(prefix, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)

Compute the valid next strings for a prefix string.

Parameters:
  • prefix (str) – A prefix string to be continued

  • category (str) – The syntactic category of the parsed string

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

Set indicating the next possible word, affixed word or whether the sentence can be ended.

Return type:

set of Continuation

detokenize(s)

Convert a sequence of tokens to their corresponding strings.

Parameters:

s (Sequence[int] or npt.NDArray[np.uint]) – A sequence or array of token IDs to be converted to strings.

Returns:

List of strings corresponding to the input tokens.

Return type:

list[str]

detokenize_batch(batch)

Convert a batch of token sequences to their corresponding strings.

Parameters:

batch (Sequence[Sequence[int]], npt.NDArray[np.uint] or list[npt.NDArray[np.uint]]) – A batch of sequences or arrays of token IDs to be converted to strings.

Returns:

List of lists of strings corresponding to the input tokens.

Return type:

list[list[str]]

generate_grammar(category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_strings=None)

Generates all syntactic structures for the lexicon.

Parameters:
  • category (str) – The syntactic category to be generated.

  • min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Return type:

an iterator which yields all parses as they are found
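Because the result is an iterator, parses can be consumed lazily instead of being collected all at once. The following is a minimal sketch, assuming only that the returned object follows the standard Python iterator protocol. The helper `first_parses` and the generator `fake_parses` are hypothetical stand-ins and are not part of python_mg.

```python
from itertools import islice


def first_parses(parse_iter, k):
    """Take at most k items from an iterator without exhausting it."""
    return list(islice(parse_iter, k))


def fake_parses():
    # Hypothetical stand-in for lexicon.generate_grammar(category):
    # yields placeholder labels instead of real SyntacticStructure objects.
    n = 0
    while True:
        yield f"parse-{n}"
        n += 1


print(first_parses(fake_parses(), 3))
```

With a real lexicon, the same helper could be applied to the iterator returned by generate_grammar, which avoids waiting for the full search to finish.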

generate_unique_strings(category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_strings=None)

Generates all strings for the lexicon, without paying attention to their SyntacticStructure. This differs from python_mg.Lexicon.generate_grammar() as different parses will be collapsed, and only strings will be returned.

Parameters:
  • category (str) – The syntactic category to be generated.

  • min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

The list of all strings along with their log probability

Return type:

list[tuple[list[str], float]]
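The documented return shape, list[tuple[list[str], float]], can be post-processed with plain Python. The sketch below ranks hypothetical results by log probability and joins each word list with spaces. The data in `results` is invented for illustration, and `ranked_sentences` is a hypothetical helper, not a library function.

```python
# Hypothetical results in the documented shape list[tuple[list[str], float]].
results = [
    (["the", "dog", "barks"], -2.5),
    (["the", "dog"], -1.0),
    (["dogs", "bark"], -4.0),
]


def ranked_sentences(results):
    """Return (sentence, log_prob) pairs, most probable first."""
    ordered = sorted(results, key=lambda item: item[1], reverse=True)
    return [(" ".join(words), log_prob) for words, log_prob in ordered]


for sentence, log_prob in ranked_sentences(results):
    print(f"{log_prob:6.2f}  {sentence}")
```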

mdl(n_phonemes)

Gets the model description length of this lexicon. The precise calculation is described in Deconstructing syntactic generalizations with minimalist grammars (Ermolaeva, CoNLL 2021)

Returns:

the MDL of the lexicon.

Return type:

float

parse(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)

Parses a string and returns all found parses in a list. The string, s, should be delimited by spaces for words and hyphens for multi-word expressions from head-movement.

Parameters:
  • s (str) – A string to be parsed

  • category (str) – The syntactic category of the parsed string

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

All found parses of the string.

Return type:

list of SyntacticStructure
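Since an ambiguous string may yield several parses, a common follow-up is to pick the most probable one. This is a minimal sketch that relies only on the documented log_prob() method of SyntacticStructure. `best_parse` and `FakeParse` are hypothetical names introduced for illustration, with `FakeParse` standing in for real parse objects.

```python
def best_parse(parses):
    """Return the parse with the highest log_prob(), or None if there are none."""
    return max(parses, key=lambda p: p.log_prob(), default=None)


class FakeParse:
    # Hypothetical stand-in exposing only the documented log_prob() method.
    def __init__(self, name, log_prob):
        self.name = name
        self._log_prob = log_prob

    def log_prob(self):
        return self._log_prob


parses = [FakeParse("a", -3.2), FakeParse("b", -1.1)]
print(best_parse(parses).name)
print(best_parse([]))
```

Returning None for an empty list mirrors the documented behaviour that the list of parses is empty when no parse is found.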

parse_tokens(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)

Converts a sequence of tokens into a list of SyntacticStructure. Will throw a ValueError if the tokens are not formatted properly (but the list will be empty if there is no parse).

Parameters:
  • s (ndarray of uint, shape (L,)) – Input token sequence, where L is the sequence length

  • category (str) – The syntactic category of the parsed strings

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

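Returns:

All found parses of the token sequence. The list is empty if there is no parse.

Return type:

list of SyntacticStructure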

static random_lexicon(lemmas)

Generates a random lexicon with random categories.

Returns:

a random Lexicon

Return type:

python_mg.Lexicon()

token_continuations(x, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)

Compute valid next token continuations for grammar sequences.

Takes an array of token sequences in a grammar and returns a boolean mask indicating which tokens are valid continuations at each position.

Parameters:
  • x (ndarray of uint, shape (..., N, L)) – Input token sequences where N is the number of sequences and L is the maximum sequence length.

  • category (str) – The syntactic category of the parsed strings

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

Boolean mask indicating valid next tokens for each position, where C is the number of tokens in the grammar vocabulary.

Return type:

ndarray of bool, shape (…, N, L, C)

Notes

The output dimensions correspond to:

  • ...: Miscellaneous batch dimensions (preserved from input)

  • N: Number of sequences

  • L: Maximum sequence length

  • C: Grammar vocabulary size
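The shape relationship above can be sketched in plain Python. The example assumes that the index along the final axis C is the token ID from tokens(); that correspondence is an assumption and is not stated in this reference. `mask_shape`, `allowed_words`, and the sample vocabulary are hypothetical.

```python
def mask_shape(input_shape, vocab_size):
    """Output shape (..., N, L, C) for an input of shape (..., N, L)."""
    return tuple(input_shape) + (vocab_size,)


def allowed_words(mask_row, vocab):
    """List the words whose token ID is marked True in one mask row.

    mask_row: sequence of C booleans (the last axis of the mask).
    vocab: word -> token ID mapping (assumed to index the last axis).
    """
    return sorted(word for word, token_id in vocab.items() if mask_row[token_id])


vocab = {"the": 0, "dog": 1, "barks": 2}  # hypothetical vocabulary
mask_row = [False, True, True]  # hypothetical mask for one position

print(mask_shape((4, 10), len(vocab)))
print(allowed_words(mask_row, vocab))
```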

tokens()

Gets a dictionary of the word to token ID mapping of this lexicon

Returns:

Dictionary with string to token ID mapping.

Return type:

dictionary of (str, int)
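The mapping runs from word to token ID, so looking up words by ID requires inverting it. A minimal sketch follows, using a hypothetical vocabulary in place of a real tokens() result. `invert_vocab` is not a library function, and it does not replace detokenize, which may treat special tokens differently.

```python
def invert_vocab(vocab):
    """Build a token ID -> word mapping from a word -> token ID mapping."""
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    if len(id_to_word) != len(vocab):
        # Two words sharing one ID would make the inverse ambiguous.
        raise ValueError("token IDs are not unique")
    return id_to_word


vocab = {"the": 0, "dog": 1, "barks": 2}  # hypothetical tokens() result
id_to_word = invert_vocab(vocab)
print([id_to_word[i] for i in [0, 1, 2]])
```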

class python_mg.SyntacticStructure

The representation of a syntactic structure generated by a grammar, or alternatively the result of parsing a string.

contains_lexical_entry(s)
contains_word(s)

Checks whether a word is present in this SyntacticStructure.

Parameters:

s (str or None) – The word (or empty word) that may or may not be present

Returns:

whether the word is present in the structure

Return type:

bool

latex()

Turns the SyntacticStructure into a tree that can be rendered with LaTeX. Requires including latex-commands.tex in the LaTeX preamble.

Returns:

A LaTeX representation of the parse tree

Return type:

str

log_prob()

The log probability of generating this SyntacticStructure using its associated Lexicon.

Returns:

the log probability

Return type:

float

max_memory_load()

The maximum number of moving elements stored in memory at one time.

Returns:

the maximum number of moved items held in memory in the derivation

Return type:

int

n_steps()

The number of derivational steps necessary to derive this SyntacticStructure using its Lexicon.

Returns:

the number of steps

Return type:

int

prob()

The probability of generating this SyntacticStructure using its associated Lexicon.

Returns:

the probability of the structure

Return type:

float
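When a string has several parses, their probabilities can be combined in log space to avoid underflow. The sketch below assumes that log_prob() uses natural logarithms and that the parses are distinct derivations whose probabilities may be summed. Neither assumption is confirmed by this reference, and `total_log_prob` is a hypothetical helper.

```python
import math


def total_log_prob(log_probs):
    """Log of the summed probabilities, computed stably (log-sum-exp)."""
    if not log_probs:
        return float("-inf")
    peak = max(log_probs)
    if peak == float("-inf"):
        # Every parse has probability zero.
        return peak
    return peak + math.log(sum(math.exp(lp - peak) for lp in log_probs))


# Hypothetical log probabilities of two parses of the same string.
log_probs = [math.log(0.2), math.log(0.05)]
print(math.exp(total_log_prob(log_probs)))
```

Subtracting the largest value before exponentiating keeps every exponent at or below zero, which is what makes the computation stable for very small probabilities.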

to_tree() → ParseTree

Converts a SyntacticStructure to a ParseTree

Return type:

The SyntacticStructure as a python_mg.ParseTree()

tokens()

Converts the SyntacticStructure to a tokenized representation of its string.

Returns:

the tokenized string.

Return type:

ndarray of uint

class python_mg.Continuation(s)

A class to represent a possible continuation of a string according to some grammar.

static EOS()

Get the EndOfSentence marker

Returns:

A continuation marking the end of a sentence. Equivalent to Continuation("[EOS]")

Return type:

python_mg.Continuation()

is_end_of_string()

Checks if a continuation is the end of string marker.

Returns:

True if the continuation is the end of string marker, else False.

Return type:

bool

is_multi_word()

Checks if a continuation is a multi-word, or an affixed string (as the result of head movement)

Returns:

True if the continuation is a multi-word, else False.

Return type:

bool

is_word()

Checks if a continuation is a word.

Returns:

True if the continuation is a word, else False.

Return type:

bool
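The three predicates can be used together to sort a set of continuations by kind. This is a minimal sketch that calls only the documented methods is_end_of_string(), is_multi_word() and is_word(). `summarize` and `FakeContinuation` are hypothetical, with `FakeContinuation` standing in for real Continuation objects.

```python
def summarize(continuations):
    """Count continuations by kind using the documented predicates."""
    counts = {"end": 0, "word": 0, "multi_word": 0}
    for c in continuations:
        if c.is_end_of_string():
            counts["end"] += 1
        elif c.is_multi_word():
            counts["multi_word"] += 1
        elif c.is_word():
            counts["word"] += 1
    return counts


class FakeContinuation:
    # Hypothetical stand-in; kind is one of "end", "word", "multi_word".
    def __init__(self, kind):
        self.kind = kind

    def is_end_of_string(self):
        return self.kind == "end"

    def is_multi_word(self):
        return self.kind == "multi_word"

    def is_word(self):
        return self.kind == "word"


sample = [FakeContinuation(k) for k in ("word", "word", "multi_word", "end")]
print(summarize(sample))
```

The elif chain counts each continuation once even if the predicates are not mutually exclusive, which this reference does not specify.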