Metrics

This module provides metrics for computing the grammar F1 of an autoregressive model trained on a Minimalist Grammar.

python_mg.metrics.grammar_f1(preds: ndarray[tuple[Any, ...], dtype[float64]], correct: ndarray[tuple[Any, ...], dtype[bool]]) → dict[str, ndarray[tuple[Any, ...], dtype[float64]]]

Compute grammar F1 scores from boolean arrays of valid next moves and predictions. The metric is described in Meta-Learning Neural Mechanisms rather than Bayesian Priors (Goodale et al., ACL 2025).

Parameters:
  • preds (ndarray of float64) – Predicted log probabilities for each token. Shape (…, seq_length, vocab_size).

  • correct (ndarray of bool) – Boolean array marking, for each position in the sequence, which tokens are valid continuations. Shape (…, seq_length, vocab_size).

Returns:

dict of str – Dictionary containing numpy arrays with keys:

  • ‘precision’: Precision scores

  • ‘recall’: Recall scores

  • ‘f1’: F1 scores

Return type:

dict of str to ndarray of float64
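
A minimal usage sketch (not taken from the library's documentation): the arrays below are random placeholders, assuming preds holds log probabilities (e.g., a log-softmax over model logits) and correct is the boolean validity mask derived from your grammar.

    import numpy as np
    from python_mg.metrics import grammar_f1

    rng = np.random.default_rng(0)

    # Placeholder shapes: batch of 2 sequences, length 5, vocabulary of 10 tokens.
    logits = rng.normal(size=(2, 5, 10))
    preds = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)  # log-softmax

    # Boolean mask of which tokens the grammar allows at each position (random here).
    correct = rng.random(size=(2, 5, 10)) < 0.3

    scores = grammar_f1(preds, correct)
    print(scores["precision"].shape, scores["recall"].shape, scores["f1"].shape)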

python_mg.metrics.grammar_f1_from_strings(lexicon: Lexicon, tokens: ndarray[tuple[Any, ...], dtype[int64]], preds: ndarray[tuple[Any, ...], dtype[float64]], category: str, min_log_prob: float | None = -128.0, move_prob: float = 0.5, max_steps: int | None = 64, n_beams: int | None = 256, reduction: Literal['none', 'sentence_mean', 'length_mean'] = 'sentence_mean') → dict[str, ndarray[tuple[Any, ...], dtype[float64]]]

Compute grammar F1 scores from token sequences and predictions. The metric is described in Meta-Learning Neural Mechanisms rather than Bayesian Priors (Goodale et al., ACL 2025).

Parameters:
  • lexicon (Lexicon) – The Minimalist Grammar lexicon used to parse the token sequences.

  • tokens (ndarray of int) – Token IDs representing the input sequences. Shape (…, seq_length).

  • preds (ndarray of float64) – Predicted log probabilities for each token. Shape (…, seq_length, vocab_size).

  • category (str) – The syntactic category of the parsed strings.

  • min_log_prob (float or None, optional) – Minimum log probability threshold for parses the parser will consider. Default is -128.0.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, the number of steps is not limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, the number of beams is not limited. Default is 256.

  • reduction ({'none', 'sentence_mean', 'length_mean'}, optional) –

    Method for reducing F1 scores across sequences:

    • ‘none’: Return individual scores per sequence

    • ‘sentence_mean’: Average over all sequences, ignoring padded tokens

    • ‘length_mean’: Average over sequence lengths, ignoring padded tokens

    Default is ‘sentence_mean’.

Returns:

dict of str – Dictionary containing numpy arrays with keys:

  • ‘precision’: Precision scores

  • ‘recall’: Recall scores

  • ‘f1’: F1 scores

Return type:

dict of str to ndarray of float64
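
A hedged sketch of calling this function (not taken from the library's documentation): the import location of Lexicon, the category value "C", and the evaluate_model wrapper are assumptions made for illustration; construct the lexicon however your grammar is defined.

    import numpy as np
    from python_mg import Lexicon  # assumed import location for Lexicon
    from python_mg.metrics import grammar_f1_from_strings

    def evaluate_model(lexicon: Lexicon, tokens: np.ndarray, log_probs: np.ndarray) -> dict[str, np.ndarray]:
        # tokens: int64 token IDs, shape (..., seq_length); padded sequences are allowed.
        # log_probs: float64 log probabilities, shape (..., seq_length, vocab_size).
        return grammar_f1_from_strings(
            lexicon,
            tokens,
            log_probs,
            category="C",              # example value; use your grammar's start category
            min_log_prob=-128.0,
            move_prob=0.5,
            max_steps=64,
            n_beams=256,
            reduction="sentence_mean",
        )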