Core classes

These are the core classes and functions which make up the library.

class python_mg.Lexicon(grammar)

An MG grammar that can be used to generate SyntacticStructures or to parse strings into SyntacticStructures.

Lexical entries may optionally include semantic interpretations. A Lexicon can also generate all valid sentences of the grammar.

Parameters:

grammar (str) – A string specifying the lexicon, with one lexical entry per line.

Raises:

ValueError – If the string is not a valid lexicon.

Examples

Generating all sentences of a grammar.

grammar = """John::d
runs::=d v
Mary::d
likes::d= =d v"""
lexicon = Lexicon(grammar)
strings = [str(p) for p in lexicon.generate_grammar("v")]
assert strings == [
    "John runs",
    "Mary runs",
    "Mary likes John",
    "John likes John",
    "John likes Mary",
    "Mary likes Mary",
]

Creating a lexicon with interpretations and getting the interpretation of a sentence.

from python_mg import Lexicon

grammar = """John::d::a_John
run::=d v::lambda a x some_e(e, pe_run(e), AgentOf(x,e))
Mary::d::a_Mary
likes::d= =d v::lambda a x lambda a y some_e(e, pe_likes(e), AgentOf(y,e) & PatientOf(x, e))"""
semantic_lexicon = Lexicon(grammar)
assert semantic_lexicon.is_semantic()
s = semantic_lexicon.parse("John likes Mary", "v")
assert len(s) == 1
parse = s[0]
assert parse.meaning is not None
assert parse.meaning == [
    "some_e(x, pe_likes(x), AgentOf(a_John, x) & PatientOf(a_Mary, x))"
]
continuations(prefix, category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None)

Compute the valid next strings for a prefix string.

Parameters:
  • prefix (str) – A prefix string to be continued

  • category (str) – The syntactic category of the parsed string

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability. Default is None.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.

Returns:

Set of possible continuations, each indicating the next possible word, an affixed word, or that the sentence can be ended.

Return type:

set of Continuation

detokenize(s)

Convert a sequence of tokens to their corresponding strings.

Parameters:

s (Sequence[int] or npt.NDArray[np.uint]) – A sequence or array of token IDs to be converted to strings.

Returns:

List of strings corresponding to the input tokens.

Return type:

list[str]

detokenize_batch(batch)

Convert a batch of token sequences to their corresponding strings.

Parameters:

batch (Sequence[Sequence[int]], npt.NDArray[np.uint] or list[npt.NDArray[np.uint]]) – A batch of sequences or arrays of token IDs to be converted to strings.

Returns:

List of lists of strings corresponding to the input tokens.

Return type:

list[list[str]]

generate_grammar(category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None, max_strings=None)

Generates all syntactic structures for the lexicon.

Parameters:
  • category (str) – The syntactic category to be generated.

  • min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability. Default is None.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.

  • max_strings (int or None, optional) – Number of strings to generate before stopping. If None, generation is unbounded. Default is None.

Return type:

An iterator that yields each parse (a SyntacticStructure) as it is found.

generate_unique_strings(category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None, max_strings=None)

Generates all strings of the lexicon, ignoring their SyntacticStructures. This differs from python_mg.Lexicon.generate_grammar() in that distinct parses of the same string are collapsed, and only strings are returned.

Parameters:
  • category (str) – The syntactic category to be generated.

  • min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability. Default is None.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.

  • max_strings (int or None, optional) – Number of strings to generate before stopping. If None, generation is unbounded. Default is None.

Returns:

The list of all strings along with their log probabilities.

Return type:

list[tuple[list[str], float]]

is_semantic()

Check if this lexicon has semantic interpretations.

Returns:

True if the lexicon has semantics, else False.

Return type:

bool

mdl(n_phonemes)

Gets the minimum description length (MDL) of this lexicon. The precise calculation is described in Deconstructing syntactic generalizations with minimalist grammars (Ermolaeva, CoNLL 2021).

Parameters:

n_phonemes (int) – The number of phonemes that are possible in the phonology of the grammar (e.g. the number of letters).

Returns:

the MDL of the lexicon.

Return type:

float

parse(s, category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None, max_parses=None)

Parses a string and returns all found parses in a list. The string s should delimit words with spaces, with hyphens joining multi-word expressions formed by head movement.

Parameters:
  • s (str) – A string to be parsed

  • category (str) – The syntactic category of the parsed string

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability. Default is None.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.

  • max_parses (int or None, optional) – Maximum number of parses to return. If None, will not be limited. Default is None.

Returns:

All found parses of the string.

Return type:

list of SyntacticStructure

parse_tokens(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)

Converts a sequence of tokens into a list of SyntacticStructure. Raises a ValueError if the tokens are not formatted properly (the list will be empty if there is no parse).

Parameters:
  • s (ndarray of uint, shape (L,)) – Input token sequences where L is the sequence length

  • category (str) – The syntactic category of the parsed strings

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

  • max_parses (int or None, optional) – Maximum number of parses to return. If None, will not be limited. Default is None.

Returns:

All found parses of the token sequence.

Return type:

list of SyntacticStructure

static random_lexicon(lemmas)

Generates a random lexicon with random categories.

Returns:

a random Lexicon

Return type:

python_mg.Lexicon()

token_continuations(x, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)

Compute valid next token continuations for grammar sequences.

Takes an array of token sequences in a grammar and returns a boolean mask indicating which tokens are valid continuations at each position.

Parameters:
  • x (ndarray of uint, shape (..., N, L)) – Input token sequences where N is the number of sequences and L is the maximum sequence length.

  • category (str) – The syntactic category of the parsed strings

  • min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.

Returns:

Boolean mask indicating valid next tokens for each position, where C is the number of tokens in the grammar vocabulary.

Return type:

ndarray of bool, shape (..., N, L, C)

Notes

The output dimensions correspond to:

  • ...: Any leading batch dimensions (preserved from input)

  • N: Number of sequences

  • L: Maximum sequence length

  • C: Grammar vocabulary size

tokens()

Gets a dictionary mapping each word of this lexicon to its token ID.

Returns:

Dictionary with string to token ID mapping.

Return type:

dictionary of (str, int)

class python_mg.SyntacticStructure

The representation of a syntactic structure generated by a grammar, or alternatively the result of parsing a string.

contains_lexical_entry(s)

Check whether a string (representing a lexical entry) is used in this tree.

Parameters:

s (str) – The lexical entry, written in the same format as a line of the grammar.

Returns:

Whether the lexical entry is used

Return type:

bool

Raises:

ValueError – If the string is not parseable as a lexical entry.

contains_word(s)

Checks whether a word is present in a syntactic structure.

Parameters:

s (str or None) – The word (or empty word) that may or may not be present

Returns:

whether the word is present in the structure

Return type:

bool

latex()

Turns the SyntacticStructure into a tree that can be rendered with LaTeX. Requires including latex-commands.tex in the LaTeX preamble.

Returns:

A LaTeX representation of the parse tree

Return type:

str

log_prob()

The log probability of generating this SyntacticStructure using its associated Lexicon.

Returns:

the log probability

Return type:

float

max_memory_load()

The maximum number of moving elements stored in memory at one time.

Returns:

the maximum number of moved items held in memory in the derivation

Return type:

int

meaning

Returns the interpretation of this SyntacticStructure, provided that its associated Lexicon has semantics.

Returns:

The language of thought expression associated with this syntactic structure (if it exists)

Return type:

Meaning or None

n_steps()

The number of derivational steps necessary to derive this SyntacticStructure using its Lexicon.

Returns:

the number of steps

Return type:

int

prob()

The probability of generating this SyntacticStructure using its associated Lexicon.

Returns:

the probability of the structure

Return type:

float

pronunciation()

The pronunciation of this SyntacticStructure.

Returns:

A list of strings, one per word. Morphemes within multi-morphemic words are separated by hyphens (-).

Return type:

list[str]

to_tree() → ParseTree

Convert a SyntacticStructure to a ParseTree.

Returns:

The SyntacticStructure as a python_mg.ParseTree()

Return type:

python_mg.ParseTree()

tokens()

Converts the SyntacticStructure to a tokenized representation of its string.

Returns:

the tokenized string.

Return type:

ndarray of uint

class python_mg.Continuation(s)

A class to represent a possible continuation of a string according to some grammar.

static EOS()

Get the end-of-sentence (EOS) marker.

Returns:

A continuation marking the end of a sentence. Equivalent to Continuation("[EOS]").

Return type:

python_mg.Continuation()

is_end_of_string()

Checks if a continuation is the end of string marker.

Returns:

True if the continuation is the end of string marker, else False.

Return type:

bool

is_multi_word()

Checks if a continuation is a multi-word, i.e. an affixed string resulting from head movement.

Returns:

True if the continuation is a multi-word, else False.

Return type:

bool

is_word()

Checks if a continuation is a word.

Returns:

True if the continuation is a word, else False.

Return type:

bool