Metrics

This module provides metrics for computing the grammar F1 of an autoregressive model trained on a Minimalist Grammar.

python_mg.metrics.grammar_f1(preds: ndarray[tuple[Any, ...], dtype[float64]], correct: ndarray[tuple[Any, ...], dtype[bool]]) → dict[str, ndarray[tuple[Any, ...], dtype[float64]]]

Compute grammar F1 scores from boolean arrays of valid next moves and predictions. The metric is described in Meta-Learning Neural Mechanisms rather than Bayesian Priors (Goodale et al., ACL 2025).

Parameters:
  • preds (ndarray of float64) – Predicted log probabilities for each token. Shape (…, seq_length, vocab_size).

  • correct (ndarray of bool) – Boolean array marking, for each position in the sequence, which tokens are valid continuations. Shape (…, seq_length, vocab_size).

Returns:

dict of str – Dictionary containing numpy arrays with keys:

  • ‘precision’: Precision scores

  • ‘recall’: Recall scores

  • ‘f1’: F1 scores

Return type:

dict of str to ndarray of float64
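
A minimal usage sketch (not taken from the library's documentation): the arrays below are random placeholders, assuming preds holds log probabilities (e.g., a log-softmax over model logits) and correct is the boolean validity mask derived from your grammar.

    import numpy as np
    from python_mg.metrics import grammar_f1

    rng = np.random.default_rng(0)

    # Placeholder shapes: batch of 2 sequences, length 5, vocabulary of 10 tokens.
    logits = rng.normal(size=(2, 5, 10))
    preds = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)  # log-softmax

    # Boolean mask of which tokens the grammar allows at each position (random here).
    correct = rng.random(size=(2, 5, 10)) < 0.3

    scores = grammar_f1(preds, correct)
    print(scores["precision"].shape, scores["recall"].shape, scores["f1"].shape)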

python_mg.metrics.grammar_f1_from_strings(lexicon: Lexicon, tokens: ndarray[tuple[Any, ...], dtype[int64]], preds: ndarray[tuple[Any, ...], dtype[float64]], category: str, min_log_prob: float | None = -128.0, move_prob: float = 0.5, max_steps: int | None = 64, n_beams: int | None = 256, reduction: Literal['none', 'sentence_mean', 'length_mean'] = 'sentence_mean') → dict[str, ndarray[tuple[Any, ...], dtype[float64]]]

Compute grammar F1 scores from token sequences and predictions. The metric is described in Meta-Learning Neural Mechanisms rather than Bayesian Priors (Goodale et al., ACL 2025).

Parameters:
  • lexicon (Lexicon) – The Minimalist Grammar lexicon used to parse the token sequences.

  • tokens (ndarray of int) – Token IDs representing the input sequences. Shape (…, seq_length).

  • preds (ndarray of float64) – Predicted log probabilities for each token. Shape (…, seq_length, vocab_size).

  • category (str) – The syntactic category of the parsed strings.

  • min_log_prob (float or None, optional) – Minimum log probability threshold for parses the parser will consider. Default is -128.0.

  • move_prob (float, optional) – Probability of preferring a move over a merge when parsing. Default is 0.5.

  • max_steps (int or None, optional) – Maximum number of derivation steps. If None, the number of steps is not limited. Default is 64.

  • n_beams (int or None, optional) – Number of beams to maintain while parsing. If None, the number of beams is not limited. Default is 256.

  • reduction ({'none', 'sentence_mean', 'length_mean'}, optional) –

    Method for reducing F1 scores across sequences:

    • ‘none’: Return individual scores per sequence

    • ‘sentence_mean’: Average over all sequences, ignoring padded tokens

    • ‘length_mean’: Average over sequence lengths, ignoring padded tokens

    Default is ‘sentence_mean’.

Returns:

dict of str – Dictionary containing numpy arrays with keys:

  • ‘precision’: Precision scores

  • ‘recall’: Recall scores

  • ‘f1’: F1 scores

Return type:

dict of str to ndarray of float64
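
A hedged sketch of calling this function (not taken from the library's documentation): the import location of Lexicon, the category value "C", and the evaluate_model wrapper are assumptions made for illustration; construct the lexicon however your grammar is defined.

    import numpy as np
    from python_mg import Lexicon  # assumed import location for Lexicon
    from python_mg.metrics import grammar_f1_from_strings

    def evaluate_model(lexicon: Lexicon, tokens: np.ndarray, log_probs: np.ndarray) -> dict[str, np.ndarray]:
        # tokens: int64 token IDs, shape (..., seq_length); padded sequences are allowed.
        # log_probs: float64 log probabilities, shape (..., seq_length, vocab_size).
        return grammar_f1_from_strings(
            lexicon,
            tokens,
            log_probs,
            category="C",              # example value; use your grammar's start category
            min_log_prob=-128.0,
            move_prob=0.5,
            max_steps=64,
            n_beams=256,
            reduction="sentence_mean",
        )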