Core classes
These are the core classes and functions which make up the library.
- class python_mg.Lexicon(grammar)
An MG grammar that can be used to generate SyntacticStructures or to parse strings into SyntacticStructures.
Semantic interpretations may optionally be included. The lexicon can also generate all valid sentences of the grammar.
- Parameters:
grammar (str)
- Raises:
ValueError – If the string is not a valid lexicon.
Examples
Generating all sentences of a grammar.
grammar = """John::d runs::=d v Mary::d likes::d= =d v""" lexicon = Lexicon(grammar) strings = [str(p) for p in lexicon.generate_grammar("v")] assert strings == [ "John runs", "Mary runs", "Mary likes John", "John likes John", "John likes Mary", "Mary likes Mary", ]
Creating a lexicon with interpretations and getting the interpretation of a sentence.
grammar = """John::d::a_John run::=d v::lambda a x some_e(e, pe_run(e), AgentOf(x,e)) Mary::d::a_Mary likes::d= =d v::lambda a x lambda a y some_e(e, pe_likes(e), AgentOf(y,e) & PatientOf(x, e))""" semantic_lexicon = Lexicon(grammar) assert semantic_lexicon.is_semantic() s = semantic_lexicon.parse("John likes Mary", "v") assert len(s) == 1 parse = s[0] assert parse.meaning is not None assert parse.meaning == [ "some_e(x, pe_likes(x), AgentOf(a_John, x) & PatientOf(a_Mary, x))" ]
- continuations(prefix, category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None)
Compute the valid next strings for a given prefix string.
- Parameters:
prefix (str) – A prefix string to be continued.
category (str) – The syntactic category of the parsed string.
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability. Default is None.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.
- Returns:
Set of possible continuations: the next word, an affixed word, or the end-of-sentence marker.
- Return type:
set of Continuation
- detokenize(s)
Convert a sequence of tokens to their corresponding strings.
- Parameters:
s (Sequence[int] or npt.NDArray[np.uint]) – A sequence or array of token IDs to be converted to strings.
- Returns:
List of strings corresponding to the input tokens.
- Return type:
list[str]
- detokenize_batch(batch)
Convert a batch of token sequences to their corresponding strings.
- Parameters:
batch (Sequence[Sequence[int]], npt.NDArray[np.uint] or list[npt.NDArray[np.uint]]) – A batch of sequences or arrays of token IDs to be converted to strings.
- Returns:
List of lists of strings corresponding to the input tokens.
- Return type:
list[list[str]]
- generate_grammar(category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None, max_strings=None)
Generates all syntactic structures of the given category for the lexicon.
- Parameters:
category (str) – The syntactic category to be generated.
min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability. Default is None.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.
max_strings (int or None, optional) – Number of strings to generate before stopping. If None, will not be limited. Default is None.
- Return type:
An iterator that yields all parses as they are found.
- generate_unique_strings(category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None, max_strings=None)
Generates all strings for the lexicon, without regard to their SyntacticStructure. This differs from python_mg.Lexicon.generate_grammar() in that different parses of the same string are collapsed, and only the strings are returned.
- Parameters:
category (str) – The syntactic category to be generated.
min_log_prob (float or None, optional) – Minimum log probability threshold to be generated. If None, there is no limit on log probability. Default is None.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.
max_strings (int or None, optional) – Number of strings to generate before stopping. If None, will not be limited. Default is None.
- Returns:
The list of all strings along with their log probabilities.
- Return type:
list[tuple[list[str], float]]
- is_semantic()
Check whether this lexicon has semantic interpretations.
- mdl(n_phonemes)
Gets the minimum description length (MDL) of this lexicon. The precise calculation is described in "Deconstructing syntactic generalizations with minimalist grammars" (Ermolaeva, CoNLL 2021).
- Parameters:
n_phonemes (int) – The number of phonemes possible in the phonology of the grammar (e.g., the number of letters).
- Returns:
the MDL of the lexicon.
- Return type:
float
- parse(s, category, min_log_prob=None, move_prob=0.5, max_steps=64, n_beams=None, max_parses=None)
Parses a string and returns all found parses in a list. The string, s, should use spaces to delimit words and hyphens to join multi-word expressions produced by head movement.
- Parameters:
s (str) – A string to be parsed.
category (str) – The syntactic category of the parsed string.
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability. Default is None.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is None.
max_parses (int or None, optional) – Number of parses to find before stopping. If None, will not be limited. Default is None.
- Returns:
All found parses of the string.
- Return type:
list of SyntacticStructure
- parse_tokens(s, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256, max_parses=None)
Converts a sequence of tokens into a list of SyntacticStructure. Raises a ValueError if the tokens are not formatted properly (the returned list is empty if there is no parse).
- Parameters:
s (ndarray of uint, shape (L,)) – Input token sequences where L is the sequence length
category (str) – The syntactic category of the parsed strings
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
max_parses (int or None, optional) – Number of parses to find before stopping. If None, will not be limited. Default is None.
- Returns:
List of all parses of the token string.
- Return type:
list of SyntacticStructure
- static random_lexicon(lemmas)
Generates a random lexicon with random categories.
- Returns:
A random Lexicon.
- Return type:
Lexicon
- token_continuations(x, category, min_log_prob=Ellipsis, move_prob=0.5, max_steps=64, n_beams=256)
Compute valid next token continuations for grammar sequences.
Takes an array of token sequences in a grammar and returns a boolean mask indicating which tokens are valid continuations at each position.
- Parameters:
x (ndarray of uint, shape (..., N, L)) – Input token sequences where N is the number of sequences and L is the maximum sequence length.
category (str) – The syntactic category of the parsed strings
min_log_prob (float or None, optional) – Minimum log probability threshold for the parser to consider. If None, there is no limit on log probability.
move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.
max_steps (int or None, optional) – Maximum number of derivation steps. If None, will not be limited. Default is 64.
n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, will not be limited. Default is 256.
- Returns:
Boolean mask indicating valid next tokens for each position, where C is the number of tokens in the grammar vocabulary.
- Return type:
ndarray of bool, shape (…, N, L, C)
Notes
The output dimensions correspond to:
…: Misc batch dimensions (preserved from input)
N: Number of sequences
L: Maximum sequence length
C: Grammar vocabulary size
- tokens()
Gets a dictionary mapping each word of this lexicon to its token ID.
- Returns:
Dictionary with string to token ID mapping.
- Return type:
dictionary of (str, int)
- class python_mg.SyntacticStructure
The representation of a syntactic structure generated by a grammar, or alternatively the result of parsing a string.
- contains_lexical_entry(s)
Check whether this string (representing a lexical entry) is used in this tree.
- Returns:
Whether the lexical entry is used
- Return type:
bool
- Raises:
ValueError – If the string is not parseable as a lexical entry.
- contains_word(s)
Checks whether a word is present in a syntactic structure.
- Parameters:
s (str or None) – The word (or empty word) that may or may not be present
- Returns:
whether the word is present in the structure
- Return type:
bool
- latex()
Turns the SyntacticStructure into a tree that can be rendered with LaTeX. Requires including latex-commands.tex in the LaTeX preamble.
- Returns:
A LaTeX representation of the parse tree
- Return type:
str
- log_prob()
The log probability of generating this SyntacticStructure using its associated Lexicon.
- Returns:
the log probability
- Return type:
float
- max_memory_load()
The maximum number of moving elements stored in memory at one time.
- Returns:
the maximum number of moved items held in memory in the derivation
- Return type:
int
- meaning
Returns the interpretation of this SyntacticStructure, provided that its associated Lexicon has semantics.
- Returns:
The language-of-thought expression associated with this syntactic structure (if it exists).
- Return type:
Meaning or None
- n_steps()
The number of derivational steps necessary to derive this SyntacticStructure using its Lexicon.
- Returns:
the number of steps
- Return type:
int
- prob()
The probability of generating this SyntacticStructure using its associated Lexicon.
- Returns:
the probability of the structure
- Return type:
float
- pronunciation()
The pronunciation of this SyntacticStructure.
- Returns:
A list of strings, one per word. Morphemes within multi-morphemic words are separated by "-".
- Return type:
list[str]
- to_tree() → ParseTree
Convert a SyntacticStructure to a ParseTree.
- Returns:
The SyntacticStructure as a python_mg.ParseTree.
- Return type:
python_mg.ParseTree
- tokens()
Converts the SyntacticStructure to a tokenized representation of its string.
- Returns:
the tokenized string.
- Return type:
ndarray of uint
- class python_mg.Continuation(s)
A class to represent a possible continuation of a string according to some grammar.
- static EOS()
Get the end-of-sentence (EOS) marker.
- Returns:
A continuation marking the end of a sentence. Equivalent to Continuation("[EOS]").
- Return type:
Continuation
- is_end_of_string()
Checks if a continuation is the end-of-string marker.
- Returns:
True if the continuation is the end-of-string marker, else False.
- Return type:
bool
- is_multi_word()
Checks if a continuation is a multi-word or an affixed string (resulting from head movement).
- Returns:
True if the continuation is a multi-word, else False.
- Return type:
bool
- is_word()
Checks if a continuation is a word.
- Returns:
True if the continuation is a word, else False.
- Return type:
bool