Optimizing LLM Classifiers with DSPy ¶
Introduction ¶
Let's be candid: prompt engineering is one of the biggest pains in building LLM-based systems. Slight changes in the phrasing of a prompt can yield drastically different outputs, and ensuring consistent output formatting can be a significant challenge, especially in pipeline applications where downstream steps depend on validating the format of the LLM response.
This is where DSPy comes into play. The modules and signatures provided by DSPy abstract away the pains of prompt engineering by providing a programmatic way of implementing prompting techniques like chain-of-thought and ReAct. Further, DSPy provides a set of optimizers that allow developers to bootstrap samples for few-shot prediction and optimize the language of the prompts themselves.
In this project, we'll explore DSPy's core functionalities and take a look at some of the optimizations it provides in the context of a classification problem.
Package Imports ¶
This project uses a pretty light set of package dependencies, with DSPy being the main one. We'll also use pandas for reading and manipulating our dataset, and the classification_report from scikit-learn. Apart from those, we'll use tqdm and the typing package as utilities.
import pandas as pd
import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2
from sklearn.metrics import classification_report
from tqdm import tqdm
from typing import Literal
Data Preparation ¶
The dataset we'll use is the Stanford Natural Language Inference Corpus, publicly available on Kaggle. The data consist of pairs of English sentences, a premise and a hypothesis, in which either the premise entails the hypothesis, the hypothesis contradicts the premise, or the sentences are unrelated (neutral).
We'll be using a local instance of Llama3 to predict contradiction, entailment, or neutrality for each sentence pair. The labels for the data are integer-encoded, so our first step in preparing the data is to map the encoded labels to English labels. Then we'll take a sample of 100 pairs as our dataset, from which we'll draw the bootstrap samples for few-shot learning.
data_path = './data/translated_train_dataset.csv'
label_map = {
    0: 'entailment',
    1: 'neutral',
    2: 'contradiction'
}
inv_label_map = {v: k for k, v in label_map.items()}
def read_dataset(data_path):
    df = pd.read_csv(data_path)
    # Keep only the English sentence pairs.
    output_df = df.query('lang_abv == "en"')
    output_df = output_df[['premise', 'hypothesis', 'label']]
    # Map the integer labels to their English names.
    output_df['agreement'] = output_df['label'].map(label_map)
    return output_df.reset_index(drop=True)
df = read_dataset(data_path)
df.head()
| | premise | hypothesis | label | agreement |
|---|---|---|---|---|
| 0 | and these comments were considered in formulat... | The rules developed in the interim were put to... | 0 | entailment |
| 1 | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | 2 | contradiction |
| 2 | you know they can't really defend themselves l... | They can't defend themselves because of their ... | 0 | entailment |
| 3 | From Cockpit Country to St. Ann's Bay | From St. Ann's Bay to Cockpit Country. | 2 | contradiction |
| 4 | Look, it's your skin, but you're going to be i... | The boss will fire you if he sees you slacking... | 1 | neutral |
We'll reserve 40 samples as our testing dataset to evaluate the performance of our prompts and check our label distribution to see if we need to stratify our bootstrap samples.
sample = df.sample(100, random_state=42)
test_size = 40
train_df = sample[:-test_size]
test_df = sample[-test_size:]
train_df['agreement'].value_counts()
agreement
entailment       24
neutral          20
contradiction    16
Name: count, dtype: int64
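The training labels here are reasonably balanced, so the notebook uses a plain slice. If the distribution were more skewed, a stratified split would keep each label's share roughly equal in both splits. The following is a minimal standard-library sketch of that idea; `stratified_split` and the toy rows are hypothetical and not part of the original notebook (scikit-learn's `train_test_split` offers a `stratify` argument for the same purpose).

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction, seed=42):
    """Split rows into (train, test) so each label keeps a similar share in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy rows standing in for the sampled data
# (for a DataFrame, something like sample.to_dict('records')).
rows = (
    [{'agreement': 'entailment'}] * 40
    + [{'agreement': 'neutral'}] * 30
    + [{'agreement': 'contradiction'}] * 30
)
train_rows, test_rows = stratified_split(rows, 'agreement', test_fraction=0.4)
```

Each label contributes 40% of its rows to the test split, so the class proportions of the full sample carry over to both parts.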
In order to be consistent with DSPy's API, we'll convert our data into Examples, which isn't strictly necessary, but will make passing them through the Modules and Optimizers much simpler and allow us to specify which fields are inputs versus outputs.
def create_example(row):
    example = dspy.Example(
        premise=row['premise'],
        hypothesis=row['hypothesis'],
        agreement=row['agreement']
    )
    return example.with_inputs('premise', 'hypothesis')
train_examples = [create_example(row) for _, row in train_df.iterrows()]
test_examples = [create_example(row) for _, row in test_df.iterrows()]
train_examples[0]
Example({'premise': '3) Dare you rise to the occasion, like Raskolnikov, and reject the petty rules that govern lesser men?', 'hypothesis': 'Would you rise up and defeaat all evil lords in the town?', 'agreement': 'neutral'}) (input_keys={'hypothesis', 'premise'})
Simple Zero-Shot vs. Chain of Thought ¶
As mentioned above, we'll be using a local installation of Llama3 with Ollama as our LLM, but we can very easily swap it out for any other LLM just by editing the LM configuration below.
We'll start our analysis with a baseline comparison of a simple zero-shot predictor and chain-of-thought. We won't have to do any prompt engineering to build these predictors; instead we'll rely on the programmatic approach that DSPy provides. For each approach, we'll define a Signature that declares the input and output fields along with their types and descriptions. Then we'll pass the Signatures into Modules that will make the actual predictions. We don't have to compose our chain-of-thought prompt by hand; we'll simply instantiate a ChainOfThought object from DSPy's selection of Modules.
lm = dspy.LM('ollama_chat/llama3', api_base='http://localhost:11434')
dspy.configure(lm=lm)
The code below defines the signature and module for our simple zero-shot prediction.
class AgreementSignature(dspy.Signature):
    # inputs
    premise: str = dspy.InputField(desc='The premise to which the hypothesis will be compared.')
    hypothesis: str = dspy.InputField(desc='Statement to be compared to the premise for contradiction or entailment')
    # outputs
    agreement: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField(
        desc='entailment/contradiction/neutral indicating whether the premise entails the hypothesis, the hypothesis contradicts the premise, or neither (neutral)'
    )
    explanation: str = dspy.OutputField(
        desc='Explanation or reason why the result was chosen.'
    )
class AgreementPredictor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.Predict(AgreementSignature)  # simple prompting; zero-shot until demos are added
    def forward(self, premise, hypothesis):
        prediction = self.prog(premise=premise, hypothesis=hypothesis)
        return prediction
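The `Literal` annotation on `agreement` is what restricts the output to three labels. As a rough illustration of what such a constraint means for downstream code, the sketch below reads the allowed values back out of the type and checks a raw string against them. It is a standalone example, not a description of how DSPy validates outputs internally; `AgreementLabel` and `normalize_label` are hypothetical names.

```python
from typing import Literal, get_args

AgreementLabel = Literal['entailment', 'contradiction', 'neutral']
ALLOWED_LABELS = get_args(AgreementLabel)  # tuple of the three allowed strings


def normalize_label(raw: str) -> str:
    """Return a canonical label, or raise ValueError if the text is not allowed."""
    # Tolerate surrounding whitespace, quotes, a trailing period, and casing.
    cleaned = raw.strip().strip('."\'').lower()
    if cleaned not in ALLOWED_LABELS:
        raise ValueError(f'unexpected label: {raw!r}')
    return cleaned
```

A check like this is useful at the boundary of a pipeline, where a free-text response that is not one of the expected labels should fail loudly rather than flow downstream.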
The code below sets up our chain-of-thought prediction pipeline, again zero-shot.
class AgreementCoTSignature(dspy.Signature):
    # inputs
    premise: str = dspy.InputField(desc='The premise to which the hypothesis will be compared.')
    hypothesis: str = dspy.InputField(desc='Statement to be compared to the premise for contradiction or entailment')
    # outputs
    agreement: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField(
        desc='entailment/contradiction/neutral indicating whether the premise entails the hypothesis, the hypothesis contradicts the premise, or neither (neutral)'
    )
class AgreementCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(AgreementCoTSignature)  # chain-of-thought implementation
    def forward(self, premise, hypothesis):
        prediction = self.prog(premise=premise, hypothesis=hypothesis)
        return prediction
Now all we have to do is instantiate our predictors and call them on our Examples and analyze the results.
cls = AgreementPredictor()
cot_cls = AgreementCoT()
cls(premise=test_examples[0].premise, hypothesis=test_examples[0].hypothesis)
Prediction(
    agreement='neutral',
    explanation='The premise is about serving food to oneself or others, while the hypothesis is about forgetting something. These two statements are unrelated, so there is no entailment or contradiction between them.'
)
cot_cls(premise=test_examples[0].premise, hypothesis=test_examples[0].hypothesis)
Prediction(
    reasoning="The premise suggests that you are considering serving food yourself or for a family, implying a personal or social context. The hypothesis expresses a desire to forget about something, which could be related to the premise if it's something specific to the situation.",
    agreement='neutral'
)
DSPy provides an Evaluate class that makes it easy to score a program on the test data. We'll set this up below, but for most of our analysis we'll rely on the classification_report familiar from traditional ML pipelines.
Let's do a quick initial evaluation of our classifiers to see how they match up as baseline predictors.
def accuracy(example, pred, trace=None):
    # Case-insensitive exact match between the gold label and the prediction.
    true = example.agreement.lower()
    prediction = pred.agreement.lower()
    return true == prediction
eval = dspy.Evaluate(
    devset=test_examples,
    metric=accuracy,
    display_progress=True
)
eval(cls)
Average Metric: 26.00 / 40 (65.0%): 100%|██████████| 40/40 [00:00<00:00, 198.17it/s]
2025/02/15 10:33:41 INFO dspy.evaluate.evaluate: Average Metric: 26 / 40 (65.0%)
65.0
eval(cot_cls)
Average Metric: 26.00 / 40 (65.0%): 100%|██████████| 40/40 [00:00<00:00, 159.03it/s]
2025/02/15 10:33:41 INFO dspy.evaluate.evaluate: Average Metric: 26 / 40 (65.0%)
65.0
The standard zero-shot prompt and chain-of-thought achieved the same accuracy on this particular experiment run, though on average across varying samples, I found that the chain-of-thought prompt achieves slightly higher accuracy.
We'll be using the chain-of-thought prompt as we move through our bootstrap and prompt optimizations.
Let's take a look at the classification report to get a more detailed view of how our zero-shot predictors performed. To do this, we'll first create a predict function that we can use to produce batch predictions.
def predict(cls, examples):
    return [cls(premise=ex.premise, hypothesis=ex.hypothesis).agreement for ex in tqdm(examples)]
test_df['zero_shot_preds'] = predict(cls, test_examples)
test_df['cot_zero_shot_preds'] = predict(cot_cls, test_examples)
100%|██████████| 40/40 [00:00<00:00, 1635.70it/s]
/tmp/ipykernel_55012/3272087609.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['zero_shot_preds'] = predict(cls, test_examples)
100%|██████████| 40/40 [00:00<00:00, 3104.59it/s]
/tmp/ipykernel_55012/3272087609.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['cot_zero_shot_preds'] = predict(cot_cls, test_examples)
print(classification_report(test_df['agreement'], test_df['zero_shot_preds']))
               precision    recall  f1-score   support

contradiction       0.75      0.71      0.73        17
   entailment       0.83      0.67      0.74        15
      neutral       0.33      0.50      0.40         8

     accuracy                           0.65        40
    macro avg       0.64      0.62      0.62        40
 weighted avg       0.70      0.65      0.67        40
print(classification_report(test_df['agreement'], test_df['cot_zero_shot_preds']))
               precision    recall  f1-score   support

contradiction       0.68      0.76      0.72        17
   entailment       0.75      0.80      0.77        15
      neutral       0.20      0.12      0.15         8

     accuracy                           0.65        40
    macro avg       0.54      0.56      0.55        40
 weighted avg       0.61      0.65      0.63        40
We can see above that there are some differences in performance by class even though the overall accuracy is the same. For example, the simple zero-shot predictor does a better job (in terms of F1-score) on the "neutral" class than chain-of-thought, and chain-of-thought performs better on "entailment."
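To make the per-class numbers less of a black box, here is a small sketch of how precision, recall, and F1 can be computed by hand from two lists of labels. It is meant only to illustrate the definitions behind a classification report; `per_class_scores` and the toy labels are hypothetical and are not the notebook's predictions.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Compute precision, recall, F1, and support per label from parallel lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # this label was predicted, but wrongly
            fn[truth] += 1  # the true label was missed

    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {
            'precision': precision, 'recall': recall, 'f1': f1, 'support': actual,
        }
    return scores


# Hypothetical toy labels for illustration only.
y_true = ['entailment', 'entailment', 'neutral', 'contradiction', 'neutral']
y_pred = ['entailment', 'neutral', 'neutral', 'contradiction', 'entailment']
scores = per_class_scores(y_true, y_pred)
```

Because accuracy only counts matches, two classifiers can share the same accuracy while distributing their errors across classes very differently, which is what the two reports above show.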
Let's see how we can improve our chain-of-thought model using few-shot examples with bootstrap and MIPROv2 optimizations.
Optimized Few-shot Prediction ¶
In few-shot prediction, we provide a few samples of data within our prompt to serve as example classifications for the LLM to gain context from. We can think of this analogously with a training process in traditional ML because we need to be sure to pick our few-shot samples in a way that optimizes our objective function (in this case, maximizing accuracy). Bootstrapping will allow us to do just that.
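As a rough picture of what "providing a few samples within the prompt" means, the sketch below assembles a plain-text few-shot prompt from an instruction, some labeled demonstrations, and an unlabeled query. The layout and the `build_few_shot_prompt` helper are assumptions made for illustration; DSPy builds its own chat-style prompt with field markers, as the inspected history later in this notebook shows.

```python
def build_few_shot_prompt(instruction, demos, premise, hypothesis):
    """Assemble instruction, labeled demos, and the unlabeled query into one prompt."""
    parts = [instruction, '']
    for demo in demos:
        parts += [
            f"Premise: {demo['premise']}",
            f"Hypothesis: {demo['hypothesis']}",
            f"Agreement: {demo['agreement']}",
            '',
        ]
    # The query repeats the demo layout but leaves the label for the model to fill in.
    parts += [f'Premise: {premise}', f'Hypothesis: {hypothesis}', 'Agreement:']
    return '\n'.join(parts)


# Demonstrations borrowed from sentence pairs that appear in this notebook's data.
demos = [
    {
        'premise': 'well i think i got to agree with you there',
        'hypothesis': 'I could not agree with you.',
        'agreement': 'contradiction',
    },
    {
        'premise': 'Exhibitions are often held in the splendid entrance hall.',
        'hypothesis': 'The exhibitions in the entrance hall are usually the most exciting.',
        'agreement': 'neutral',
    },
]
prompt = build_few_shot_prompt(
    'Label the pair as entailment, contradiction, or neutral.',
    demos,
    premise='From Cockpit Country to St. Ann\'s Bay',
    hypothesis='From St. Ann\'s Bay to Cockpit Country.',
)
```

Which demonstrations go into `demos` is exactly the choice the optimizers in this section automate.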
The BootstrapFewShot optimizer in DSPy takes a labeled training dataset and draws from it the examples to include in our classification prompt. It runs the program on training examples, scores each prediction against the example's label using our accuracy metric, and keeps the passing traces as demonstrations. It then returns a Module whose prompt includes those few-shot examples.
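The general idea of bootstrapping demonstrations can be sketched in a few lines: run a predictor over the training examples and keep the ones whose output passes the metric, up to a cap. This is a simplified illustration under stated assumptions, not DSPy's implementation; `bootstrap_demos`, `exact_match`, and the `toy_predict` stand-in for an LLM call are all hypothetical.

```python
def bootstrap_demos(predict_fn, trainset, metric, max_demos):
    """Keep training examples whose prediction passes the metric, up to max_demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = predict_fn(example['premise'], example['hypothesis'])
        if metric(example, prediction):
            # Store the predictor's own output (including reasoning) as a demonstration.
            demos.append({**example, **prediction})
    return demos


def exact_match(example, prediction):
    return example['agreement'] == prediction['agreement']


def toy_predict(premise, hypothesis):
    """Hypothetical stand-in for an LLM call: a crude negation heuristic."""
    negated = ' not ' in f' {hypothesis.lower()} '
    label = 'contradiction' if negated else 'entailment'
    return {'agreement': label, 'reasoning': 'toy heuristic'}


# Hypothetical toy training set.
trainset = [
    {'premise': 'The cat sat on the mat.', 'hypothesis': 'A cat is sitting.', 'agreement': 'entailment'},
    {'premise': 'The shop is open.', 'hypothesis': 'The shop is not open.', 'agreement': 'contradiction'},
    {'premise': 'She bought a car.', 'hypothesis': 'The car is red.', 'agreement': 'neutral'},
    {'premise': 'He runs daily.', 'hypothesis': 'He runs.', 'agreement': 'entailment'},
]
capped = bootstrap_demos(toy_predict, trainset, exact_match, max_demos=2)
uncapped = bootstrap_demos(toy_predict, trainset, exact_match, max_demos=10)
```

Only examples the predictor already gets right become demonstrations, so each kept demo carries reasoning that led to a correct label.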
Another approach is to optimize the prompt itself. DSPy provides a newer optimization technique called Multiprompt Instruction Proposal Optimizer Version 2 (MIPROv2), a prompt optimizer capable of optimizing both instructions and few-shot examples jointly. According to the documentation, "It does this by bootstrapping few-shot example candidates, proposing instructions grounded in different dynamics of the task, and finding an optimized combination of these options using Bayesian Optimization."
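To give a feel for what "finding an optimized combination" involves, the sketch below searches over pairs of candidate instructions and candidate demo sets and keeps the best-scoring pair. Note the substitution: it uses plain random search where MIPROv2 uses Bayesian Optimization, and the scoring function is a made-up placeholder. `search_prompt_config` and `toy_score` are hypothetical names.

```python
import random


def search_prompt_config(instructions, demo_sets, score_fn, num_trials, seed=0):
    """Try random (instruction, demo set) pairs; return (best score, (i, j))."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        i = rng.randrange(len(instructions))
        j = rng.randrange(len(demo_sets))
        score = score_fn(instructions[i], demo_sets[j])
        if best is None or score > best[0]:
            best = (score, (i, j))
    return best


# Two candidate instructions of the kind an instruction proposer might emit.
instructions = [
    'Given the fields `premise`, `hypothesis`, produce the fields `agreement`.',
    'Analyze the given premise and hypothesis to determine if the premise entails, '
    'contradicts, or is neutral with respect to the hypothesis.',
]
# Placeholder demo sets of different sizes.
demo_sets = [[], ['demo A'], ['demo A', 'demo B', 'demo C']]


def toy_score(instruction, demos):
    """Hypothetical scorer; a real one would run the program on a validation set."""
    return len(demos) + (1 if 'entails' in instruction else 0)


best_score, (best_i, best_j) = search_prompt_config(
    instructions, demo_sets, toy_score, num_trials=25
)
```

In a real run each trial costs a full pass over the validation set, which is why a sample-efficient search strategy matters more there than in this toy.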
We'll see how each of these optimizers performs in the following sections.
Bootstrap Few-shot¶
We'll start with a simple implementation of the bootstrapping optimization on its own, allowing up to 10 bootstrapped demonstrations (max_bootstrapped_demos) and up to 10 rounds of attempts (max_rounds). The right values depend on the time and prompting costs you are willing to spend; a more exhaustive search of the dataset is more likely to turn up strong demonstrations.
As with classical ML, we have withheld a "test set" of data that will not be sampled from during the bootstrap optimization. The resulting prompt's performance will be measured against this test set.
opt = BootstrapFewShot(
    metric=accuracy,
    max_bootstrapped_demos=10,
    max_rounds=10
)
optimized_cls = opt.compile(
    cot_cls,
    trainset=train_examples
)
18%|█▊ | 11/60 [00:00<00:01, 42.22it/s]
Bootstrapped 10 full traces after 11 examples for up to 10 rounds, amounting to 34 attempts.
optimized_cls(test_examples[0].premise, test_examples[0].hypothesis)
Prediction(
    reasoning='The premise implies that someone will need to remind you about something, which is not supported by the hypothesis where you think you will forget.',
    agreement='contradiction'
)
test_df['bootstrap_fs_preds'] = predict(optimized_cls, test_examples)
100%|██████████| 40/40 [00:00<00:00, 217.29it/s]
/tmp/ipykernel_55012/1763673296.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['bootstrap_fs_preds'] = predict(optimized_cls, test_examples)
print(classification_report(test_df['agreement'], test_df['bootstrap_fs_preds']))
               precision    recall  f1-score   support

contradiction       0.83      0.88      0.86        17
   entailment       0.79      0.73      0.76        15
      neutral       0.38      0.38      0.38         8

     accuracy                           0.72        40
    macro avg       0.66      0.66      0.66        40
 weighted avg       0.72      0.72      0.72        40
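We can see above that including few-shot examples has increased accuracy from 65% to 72%. On a 40-example test set that is roughly three additional correct predictions, with most of the gain coming from the "contradiction" class.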
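With only 40 test examples, it helps to put an uncertainty range around each accuracy figure before reading too much into the difference. The sketch below computes an approximate 95% Wilson score interval for both results. The count of 29 correct predictions is inferred from the rounded 0.72 in the report, so treat it as an assumption; `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion correct/total."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


baseline = wilson_interval(26, 40)  # 26 / 40 as reported by dspy.Evaluate
few_shot = wilson_interval(29, 40)  # assumed count, inferred from the rounded 0.72

for name, (low, high) in [('baseline', baseline), ('few-shot', few_shot)]:
    print(f'{name}: {low:.3f} to {high:.3f}')
```

The two intervals overlap considerably, so a larger test set would be needed to confirm that the improvement holds up.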
After performing the optimization, we can inspect the resulting prompt by inspecting the history of our LM object, as below. In theory, this could allow us to copy the prompt and manually tweak it as much as we want, though it probably is not necessary.
dspy.inspect_history(n=1)
[2025-02-15T10:33:42.017304] System message: Your input fields are: 1. `premise` (str): The premise to which the hypothesis will be compared. 2. `hypothesis` (str): Statement to be compared to the premise for contradiction or entailment Your output fields are: 1. `reasoning` (str) 2. `agreement` (Literal[entailment, contradiction, neutral]): entailment/contradiction/neutral indicating whether the premise entails the hypothesis, the hypothesis contradicts the premise, or neither (neutral) All interactions will be structured in the following way, with the appropriate values filled in. [[ ## premise ## ]] {premise} [[ ## hypothesis ## ]] {hypothesis} [[ ## reasoning ## ]] {reasoning} [[ ## agreement ## ]] {agreement} # note: the value you produce must be one of: entailment; contradiction; neutral [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `premise`, `hypothesis`, produce the fields `agreement`. User message: This is an example of the task, though some input or output fields are not supplied. [[ ## premise ## ]] In the stock market, however, the damage can get much worse. [[ ## hypothesis ## ]] The stock market can experience much worse damage. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] entailment [[ ## completed ## ]] User message: This is an example of the task, though some input or output fields are not supplied. [[ ## premise ## ]] In addition, Dublin Tourism has devised and signposted three self-guided walking tours of the city, which you can follow using the booklets provided. [[ ## hypothesis ## ]] Dublin's self-guided tours are not easy to follow. 
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] neutral [[ ## completed ## ]] User message: This is an example of the task, though some input or output fields are not supplied. [[ ## premise ## ]] I shan't stop you." [[ ## hypothesis ## ]] I will stop you. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] contradiction [[ ## completed ## ]] User message: This is an example of the task, though some input or output fields are not supplied. [[ ## premise ## ]] uh-huh well it's good that she does that i mean bring it to people's attention [[ ## hypothesis ## ]] It is good that she brings it to people's attention. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] entailment [[ ## completed ## ]] User message: This is an example of the task, though some input or output fields are not supplied. [[ ## premise ## ]] Exhibitions are often held in the splendid entrance hall. [[ ## hypothesis ## ]] The exhibitions in the entrance hall are usually the most exciting. 
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] neutral [[ ## completed ## ]] User message: This is an example of the task, though some input or output fields are not supplied. [[ ## premise ## ]] I think it behooves Slate, in its effort to take over the public-opinion industry, to make a thorough effort to uncover the truth behind this unnatural connection. [[ ## hypothesis ## ]] Slate has no interest in the public-opinion industry. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] contradiction [[ ## completed ## ]] User message: [[ ## premise ## ]] 3) Dare you rise to the occasion, like Raskolnikov, and reject the petty rules that govern lesser men? [[ ## hypothesis ## ]] Would you rise up and defeaat all evil lords in the town? Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] The premise encourages boldness and rejection of petty rules, while the hypothesis asks about defeating evil lords. The two are not directly related. 
[[ ## agreement ## ]] neutral [[ ## completed ## ]] User message: [[ ## premise ## ]] He married Dona Filipa Moniz (Perestrelo), the daughter of Porto Santo's first governor, and lived on the island for a period, fathering a son there. [[ ## hypothesis ## ]] He landed on the island but soon left for greener pastures, before later dying alone and childless. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] contradiction [[ ## completed ## ]] User message: [[ ## premise ## ]] The contrast between the landscape of the central highlands and the south coast could not be more marked. [[ ## hypothesis ## ]] There was a beautiful artist who painted the landscape of the central highlands. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] neutral [[ ## completed ## ]] User message: [[ ## premise ## ]] i don't know i i do i can think of all the uh the biblical things about it too where what did they say to uh i can't think of the scripture Render unto Caesar's what is Caesar's so [[ ## hypothesis ## ]] I know this because I own a bible. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. 
Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] neutral [[ ## completed ## ]] User message: [[ ## premise ## ]] They returned to live in the Galilee village of Nazareth, making pilgrimages to Jerusalem. [[ ## hypothesis ## ]] They would make pilgrimages to Jerusalem. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] entailment [[ ## completed ## ]] User message: [[ ## premise ## ]] An Indian traveler described the prosperous Bujang Valley settlement as the seat of all felicities. [[ ## hypothesis ## ]] A traveler said the settlement was prospering. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] The premise describes the Bujang Valley settlement as prosperous, which is supported by the hypothesis that a traveler said the settlement was prospering. [[ ## agreement ## ]] entailment [[ ## completed ## ]] User message: [[ ## premise ## ]] TIG funds support the Technology Evaluation Project, an initiative of the Legal Aid Society of Cincinnati. [[ ## hypothesis ## ]] TIG funds are used to support the Technology Evolution project, a legal aid society in Cincinnati. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. 
Assistant message: [[ ## reasoning ## ]] The two statements are identical except for the word "Evaluation" instead of "Evaluation Project". This means that the premise is very close to the hypothesis. [[ ## agreement ## ]] entailment [[ ## completed ## ]] User message: [[ ## premise ## ]] well i think i got to agree with you there [[ ## hypothesis ## ]] I could not agree with you. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] The premise states that you agree with someone, while the hypothesis states that you cannot agree. This is an obvious contradiction. [[ ## agreement ## ]] contradiction [[ ## completed ## ]] User message: [[ ## premise ## ]] Say, man, don't you know you've been given up for dead? [[ ## hypothesis ## ]] You were thought to be dead! Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] The premise implies that someone has been given up for dead, which is equivalent to saying they are thought to be dead. [[ ## agreement ## ]] entailment [[ ## completed ## ]] User message: [[ ## premise ## ]] Rather, kids today are not only little bundles of joy but also are perhaps the ultimate symbols of worldly success and status. [[ ## hypothesis ## ]] Kids today are not bundles of joy, and are symbols of failure. 
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Assistant message: [[ ## reasoning ## ]] Not supplied for this particular example. [[ ## agreement ## ]] contradiction [[ ## completed ## ]] User message: [[ ## premise ## ]] uh-huh so do you have to get a shade tolerant grass is that what you're [[ ## hypothesis ## ]] If i want to grow grass in the shade, do i need a special seed that will grow in the shade? Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## reasoning ## ]] The premise is unclear and doesn't directly relate to the hypothesis. The hypothesis asks about growing grass in the shade, while the premise seems to be discussing something else. [[ ## agreement ## ]] neutral [[ ## completed ## ]]
Prompt Optimization with MIPROv2¶
MIPROv2 combines the bootstrapping shown above with additional optimizations of the prompt language itself. This training process requires more time and more cost to perform, but it should provide us with stronger prediction results.
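mipro_opt = MIPROv2(
    metric=accuracy,
    auto='medium',
    max_bootstrapped_demos=10
)
mipro_cls = mipro_opt.compile(
    cot_cls,
    # Optimize on the training split so the held-out test set stays unseen.
    # The log below came from a run that passed test_examples here,
    # which is why it reports a valset size of 32 (out of 40 examples).
    trainset=train_examples,
    max_bootstrapped_demos=3,
    max_labeled_demos=4
)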
2025/02/15 12:58:23 INFO dspy.teleprompt.mipro_optimizer_v2: RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: False
num_candidates: 19
valset size: 32
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...
Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19
38%|███▊ | 3/8 [00:00<00:00, 893.23it/s]
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts. Bootstrapping set 4/19
12%|█▎ | 1/8 [00:00<00:00, 1037.94it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 5/19
25%|██▌ | 2/8 [00:00<00:00, 1564.75it/s]
Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts. Bootstrapping set 6/19
38%|███▊ | 3/8 [00:00<00:00, 604.95it/s]
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts. Bootstrapping set 7/19
12%|█▎ | 1/8 [00:00<00:00, 638.50it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 8/19
38%|███▊ | 3/8 [00:00<00:00, 748.23it/s]
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts. Bootstrapping set 9/19
38%|███▊ | 3/8 [00:00<00:00, 942.61it/s]
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts. Bootstrapping set 10/19
38%|███▊ | 3/8 [00:00<00:00, 1211.29it/s]
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts. Bootstrapping set 11/19
12%|█▎ | 1/8 [00:00<00:00, 1100.87it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 12/19
25%|██▌ | 2/8 [00:00<00:00, 744.46it/s]
Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts. Bootstrapping set 13/19
12%|█▎ | 1/8 [00:00<00:00, 433.39it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 14/19
12%|█▎ | 1/8 [00:00<00:00, 489.59it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 15/19
12%|█▎ | 1/8 [00:00<00:00, 679.79it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 16/19
38%|███▊ | 3/8 [00:00<00:00, 930.00it/s]
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts. Bootstrapping set 17/19
12%|█▎ | 1/8 [00:00<00:00, 1152.60it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 18/19
12%|█▎ | 1/8 [00:00<00:00, 825.98it/s]
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts. Bootstrapping set 19/19
38%|███▊ | 3/8 [00:00<00:00, 1287.25it/s]
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: Proposing instructions...
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `premise`, `hypothesis`, produce the fields `agreement`.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Analyze the relationship between the premise and hypothesis to determine whether the premise entails the hypothesis, contradicts it, or has neither an entailment nor contradiction (neutral), providing a logical reasoning explanation for your conclusion.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Analyze the given premise and hypothesis to determine if the premise entails, contradicts, or is neutral with respect to the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 3: Using the provided task demos as examples, please generate an instruction for the program to predict the agreement between a premise and a hypothesis based on logical reasoning. The input should be two statements: one representing the premise and another representing the hypothesis. The output should be a classification of the relationship between the premise and hypothesis as either entailment, contradiction, or neutral.
Please provide the proposed instruction below:
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 4: Provide a detailed step-by-step analysis of the relationship between the given premise and hypothesis, concluding with an agreement indicating whether the premise entails, contradicts, or has no logical connection to the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 5: Analyze the relationship between the provided premise and hypothesis, generating a reasoning statement that explains why the premise supports or contradicts the hypothesis. Output an agreement label indicating whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 6: To generate an agreement indicating whether a given premise entails, contradicts, or is neutral with respect to a hypothesis, please provide a natural language inference task by specifying a premise and a hypothesis. The program will then generate a reasoning step-by-step and output an agreement based on the relationship between the premise and the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 7: Generate a logical argument structure based on the input premise and hypothesis, providing a step-by-step explanation of why the premise entails or contradicts the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 8: In a world where the fate of humanity hangs in the balance, you are tasked with analyzing a series of logical arguments to determine whether they support or contradict each other. Given the fields `premise`, `hypothesis`, produce the fields `agreement`. Will your decisions lead to the salvation or destruction of humanity?
The choice is
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 9: Analyze the logical relationship between the given `premise` and `hypothesis`, providing a detailed `reasoning` explanation for your conclusion, and output the resulting `agreement`.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 10: Generate a step-by-step explanation of the reasoning process for the given premise and hypothesis, then determine whether the premise entails, contradicts, or has no bearing on the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 11: Analyze the given premise and hypothesis, then determine whether they entail, contradict, or are neutral with respect to each other.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 12: Analyze the logical relationship between the premise and hypothesis, considering factors such as tense, tone, and logical implications. Provide a detailed explanation of your thought process and conclusion.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 13: Produce an agreement label indicating whether the premise entails, contradicts, or has no logical connection to the hypothesis. This requires generating a reasoning statement that explains how the premise relates to the hypothesis, considering factors such as time frames, descriptions of past events, and comparisons between different entities.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 14: In a world where facial recognition technology is crucial for national security, a team of experts has discovered that a previously unknown genetic trait can significantly alter face shape over generations. Given this premise and the hypothesis that this trait does not affect face shape, produce an agreement indicating whether the premise entails or contradicts the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 15: Given a premise and hypothesis, use a chain of thought approach to determine whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 16: Assume a conversational tone and generate an agreement based on the premise and hypothesis. Given the premise `premise` and the hypothesis `hypothesis`, determine whether they are equivalent, contradictory, or neutral.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 17: Analyze the given premise and hypothesis, using logical reasoning strategies to determine whether the premise entails, contradicts, or is neutral with respect to the hypothesis. Provide a step-by-step explanation of your thought process, followed by an agreement indicating the relationship between the premise and hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: 18: You are an argument analyst, tasked with evaluating the logical relationship between two statements. Given the fields `premise` and `hypothesis`, use your reasoning skills to determine whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2:
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the default program...
Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 19.00 / 32 (59.4%): 100%|██████████| 32/32 [00:00<00:00, 1476.80it/s]
2025/02/15 12:58:26 INFO dspy.evaluate.evaluate: Average Metric: 19 / 32 (59.4%)
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 59.38
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.
/home/nastory/repos/dspy_test/dspy_env/lib/python3.10/site-packages/optuna/_experimental.py:31: ExperimentalWarning: Argument ``multivariate`` is an experimental feature. The interface can change in the future.
  warnings.warn(
2025/02/15 12:58:26 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 1 / 25 =====
Average Metric: 23.00 / 32 (71.9%): 100%|██████████| 32/32 [01:57<00:00, 3.66s/it]
2025/02/15 13:00:23 INFO dspy.evaluate.evaluate: Average Metric: 23 / 32 (71.9%)
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 71.88
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88]
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 71.88
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: ========================
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [01:43<00:00, 3.24s/it]
2025/02/15 13:02:07 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%)
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 78.12
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12]
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: ========================
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 25 =====
Average Metric: 21.00 / 32 (65.6%): 100%|██████████| 32/32 [01:54<00:00, 3.58s/it]
2025/02/15 13:04:01 INFO dspy.evaluate.evaluate: Average Metric: 21 / 32 (65.6%) 2025/02/15 13:04:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.62 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18']. 2025/02/15 13:04:01 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62] 2025/02/15 13:04:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:04:01 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:04:01 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 25 =====
Average Metric: 22.00 / 32 (68.8%): 100%|██████████| 32/32 [01:40<00:00, 3.13s/it]
2025/02/15 13:05:42 INFO dspy.evaluate.evaluate: Average Metric: 22 / 32 (68.8%) 2025/02/15 13:05:42 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.75 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2']. 2025/02/15 13:05:42 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75] 2025/02/15 13:05:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:05:42 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:05:42 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 25 =====
Average Metric: 21.00 / 32 (65.6%): 100%|██████████| 32/32 [01:51<00:00, 3.49s/it]
2025/02/15 13:07:33 INFO dspy.evaluate.evaluate: Average Metric: 21 / 32 (65.6%) 2025/02/15 13:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.62 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18']. 2025/02/15 13:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62] 2025/02/15 13:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 25 =====
Average Metric: 22.00 / 32 (68.8%): 100%|██████████| 32/32 [01:51<00:00, 3.49s/it]
2025/02/15 13:09:25 INFO dspy.evaluate.evaluate: Average Metric: 22 / 32 (68.8%) 2025/02/15 13:09:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.75 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1']. 2025/02/15 13:09:25 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75] 2025/02/15 13:09:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:09:25 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:09:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 25 =====
Average Metric: 18.00 / 32 (56.2%): 100%|██████████| 32/32 [01:53<00:00, 3.55s/it]
2025/02/15 13:11:19 INFO dspy.evaluate.evaluate: Average Metric: 18 / 32 (56.2%) 2025/02/15 13:11:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 12']. 2025/02/15 13:11:19 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25] 2025/02/15 13:11:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:11:19 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:11:19 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 25 =====
Average Metric: 22.00 / 32 (68.8%): 100%|██████████| 32/32 [01:50<00:00, 3.46s/it]
2025/02/15 13:13:09 INFO dspy.evaluate.evaluate: Average Metric: 22 / 32 (68.8%) 2025/02/15 13:13:09 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.75 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 13']. 2025/02/15 13:13:09 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75] 2025/02/15 13:13:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:13:09 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:13:09 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 25 =====
Average Metric: 22.00 / 32 (68.8%): 100%|██████████| 32/32 [01:46<00:00, 3.31s/it]
2025/02/15 13:14:55 INFO dspy.evaluate.evaluate: Average Metric: 22 / 32 (68.8%) 2025/02/15 13:14:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.75 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4']. 2025/02/15 13:14:55 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75] 2025/02/15 13:14:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:14:55 INFO dspy.teleprompt.mipro_optimizer_v2: ======================== 2025/02/15 13:14:55 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 25 =====
Average Metric: 23.00 / 32 (71.9%): 100%|██████████| 32/32 [01:30<00:00, 2.84s/it]
2025/02/15 13:16:26 INFO dspy.evaluate.evaluate: Average Metric: 23 / 32 (71.9%) 2025/02/15 13:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 1']. 2025/02/15 13:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88] 2025/02/15 13:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [01:48<00:00, 3.38s/it]
2025/02/15 13:18:14 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 17']. 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12] 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 12 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [00:00<00:00, 2069.95it/s]
2025/02/15 13:18:14 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 17']. 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12] 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [01:42<00:00, 3.19s/it]
2025/02/15 13:19:57 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:19:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 7']. 2025/02/15 13:19:57 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12] 2025/02/15 13:19:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:19:57 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:19:57 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 14 / 25 =====
Average Metric: 23.00 / 32 (71.9%): 100%|██████████| 32/32 [01:47<00:00, 3.37s/it]
2025/02/15 13:21:44 INFO dspy.evaluate.evaluate: Average Metric: 23 / 32 (71.9%) 2025/02/15 13:21:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 15']. 2025/02/15 13:21:44 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88] 2025/02/15 13:21:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:21:44 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:21:44 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 15 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [01:47<00:00, 3.35s/it]
2025/02/15 13:23:32 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 14']. 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12] 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 16 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [00:00<00:00, 1475.13it/s]
2025/02/15 13:23:32 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7']. 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12] 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:23:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 17 / 25 =====
Average Metric: 23.00 / 32 (71.9%): 100%|██████████| 32/32 [01:35<00:00, 2.97s/it]
2025/02/15 13:25:07 INFO dspy.evaluate.evaluate: Average Metric: 23 / 32 (71.9%) 2025/02/15 13:25:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 17']. 2025/02/15 13:25:07 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88] 2025/02/15 13:25:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:25:07 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:25:07 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 18 / 25 =====
Average Metric: 23.00 / 32 (71.9%): 100%|██████████| 32/32 [01:44<00:00, 3.28s/it]
2025/02/15 13:26:52 INFO dspy.evaluate.evaluate: Average Metric: 23 / 32 (71.9%) 2025/02/15 13:26:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6']. 2025/02/15 13:26:52 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88] 2025/02/15 13:26:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:26:52 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:26:52 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 25 =====
Average Metric: 22.00 / 32 (68.8%): 100%|██████████| 32/32 [01:36<00:00, 3.03s/it]
2025/02/15 13:28:29 INFO dspy.evaluate.evaluate: Average Metric: 22 / 32 (68.8%) 2025/02/15 13:28:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.75 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 3']. 2025/02/15 13:28:29 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75] 2025/02/15 13:28:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:28:29 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:28:29 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 20 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [01:31<00:00, 2.86s/it]
2025/02/15 13:30:00 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:30:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 16']. 2025/02/15 13:30:00 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75, 78.12] 2025/02/15 13:30:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:30:00 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:30:00 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 21 / 25 =====
Average Metric: 23.00 / 32 (71.9%): 100%|██████████| 32/32 [01:45<00:00, 3.30s/it]
2025/02/15 13:31:46 INFO dspy.evaluate.evaluate: Average Metric: 23 / 32 (71.9%) 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5']. 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75, 78.12, 71.88] 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 22 / 25 =====
Average Metric: 25.00 / 32 (78.1%): 100%|██████████| 32/32 [00:00<00:00, 1579.20it/s]
2025/02/15 13:31:46 INFO dspy.evaluate.evaluate: Average Metric: 25 / 32 (78.1%) 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 17']. 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75, 78.12, 71.88, 78.12] 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:31:46 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 23 / 25 =====
Average Metric: 22.00 / 32 (68.8%): 100%|██████████| 32/32 [01:44<00:00, 3.26s/it]
2025/02/15 13:33:31 INFO dspy.evaluate.evaluate: Average Metric: 22 / 32 (68.8%) 2025/02/15 13:33:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.75 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 17']. 2025/02/15 13:33:31 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75, 78.12, 71.88, 78.12, 68.75] 2025/02/15 13:33:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:33:31 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:33:31 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 24 / 25 =====
Average Metric: 20.00 / 32 (62.5%): 100%|██████████| 32/32 [01:48<00:00, 3.41s/it]
2025/02/15 13:35:20 INFO dspy.evaluate.evaluate: Average Metric: 20 / 32 (62.5%) 2025/02/15 13:35:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.5 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0']. 2025/02/15 13:35:20 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75, 78.12, 71.88, 78.12, 68.75, 62.5] 2025/02/15 13:35:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:35:20 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:35:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 25 =====
Average Metric: 21.00 / 32 (65.6%): 100%|██████████| 32/32 [03:03<00:00, 5.72s/it]
2025/02/15 13:38:23 INFO dspy.evaluate.evaluate: Average Metric: 21 / 32 (65.6%) 2025/02/15 13:38:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.62 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 0']. 2025/02/15 13:38:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [59.38, 71.88, 78.12, 65.62, 68.75, 65.62, 68.75, 56.25, 68.75, 68.75, 71.88, 78.12, 78.12, 78.12, 71.88, 78.12, 78.12, 71.88, 71.88, 68.75, 78.12, 71.88, 78.12, 68.75, 62.5, 65.62] 2025/02/15 13:38:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.12 2025/02/15 13:38:23 INFO dspy.teleprompt.mipro_optimizer_v2: ========================= 2025/02/15 13:38:23 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 78.12!
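The trial log above is long, so it can help to pull the scores out programmatically and see which instruction and few-shot combinations tied for the best result. The snippet below is a minimal sketch that assumes the log text has been captured into a string; the regular expression and the helper name `parse_trials` are my own and not part of DSPy, and the sample lines are copied from four of the trials above.

```python
import re

# Matches lines such as:
#   ... Score: 78.12 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
# The pattern is an assumption based on the log format shown above.
TRIAL_PATTERN = re.compile(
    r"Score: (?P<score>[\d.]+) with parameters "
    r"\['Predictor 0: Instruction (?P<instruction>\d+)', "
    r"'Predictor 0: Few-Shot Set (?P<few_shot>\d+)'\]"
)


def parse_trials(log_text):
    """Return a list of (score, instruction_index, few_shot_set_index) tuples."""
    return [
        (float(m["score"]), int(m["instruction"]), int(m["few_shot"]))
        for m in TRIAL_PATTERN.finditer(log_text)
    ]


# Four lines copied from the log above (trials 1, 2, 3 and 13).
sample_log = """
2025/02/15 13:00:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.88 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/02/15 13:02:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/02/15 13:04:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.62 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/02/15 13:19:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.12 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 7'].
"""

trials = parse_trials(sample_log)
best_score = max(score for score, _, _ in trials)
best_combos = [(instr, demos) for score, instr, demos in trials if score == best_score]

print(f"Best score: {best_score}")
print(f"Combinations reaching it (instruction, few-shot set): {best_combos}")
```

Run over the full log, a tally like this shows that several different combinations reach the same top score of 78.12 on the 32-example validation batch, so the choice among them is partly a matter of which was found first.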
mipro_cls(test_examples[0].premise, test_examples[0].hypothesis)
Prediction( reasoning="The premise suggests that one should keep something in mind or serve it themselves, implying that it's important to remember or take care of. This contradicts the hypothesis that the speaker will forget about it.", agreement='contradiction' )
test_df['mipro_preds'] = predict(mipro_cls, test_examples)
100%|██████████| 40/40 [00:24<00:00, 1.60it/s]
/tmp/ipykernel_55012/807575071.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['mipro_preds'] = predict(mipro_cls, test_examples)
print(classification_report(test_df['agreement'], test_df['mipro_preds']))
               precision    recall  f1-score   support

contradiction       0.83      0.88      0.86        17
   entailment       0.88      0.93      0.90        15
      neutral       0.67      0.50      0.57         8

     accuracy                           0.82        40
    macro avg       0.79      0.77      0.78        40
 weighted avg       0.82      0.82      0.82        40
These results are quite impressive: accuracy rises from 72% with bootstrapping alone to 82% with MIPROv2. The neutral class remains the weakest, with a recall of 0.50 on its 8 test examples.
Optimization Technique | Accuracy % |
---|---|
Zero-Shot | 65 |
Bootstrap Few-Shot | 72 |
MIPROv2 Few-Shot | 82 |
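To put the numbers in the table in perspective, here is a quick back-of-the-envelope sketch. It assumes that all three accuracies were measured on the same 40-example test set (the support shown in the classification report), and it uses the normal approximation for the standard error of a proportion, which is only rough at this sample size. The variable and function names are mine.

```python
import math

# Accuracies from the table above, as fractions.
accuracy = {
    "Zero-Shot": 0.65,
    "Bootstrap Few-Shot": 0.72,
    "MIPROv2 Few-Shot": 0.82,
}

# Assumption: every technique was scored on the same 40 test examples.
N_TEST = 40


def standard_error(p, n):
    """Normal-approximation standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)


baseline = accuracy["Zero-Shot"]
for name, acc in accuracy.items():
    gain_points = (acc - baseline) * 100
    se_points = standard_error(acc, N_TEST) * 100
    print(
        f"{name:<20} accuracy {acc:.0%}, "
        f"gain vs zero-shot {gain_points:+.0f} pts, "
        f"std. error ~{se_points:.0f} pts"
    )
```

With a standard error of roughly six points per estimate, the ten-point gap between bootstrapping and MIPROv2 is encouraging, and a larger test set would pin it down more tightly.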
And again, we are able to inspect the optimized prompt itself, as seen below.
dspy.inspect_history(n=1)
[2025-02-15T13:38:52.193812]

System message:

Your input fields are:
1. `premise` (str): The premise to which the hypothesis will be compared.
2. `hypothesis` (str): Statement to be compared to the premise for contradiction or entailment

Your output fields are:
1. `reasoning` (str)
2. `agreement` (Literal[entailment, contradiction, neutral]): entailment/contradiction/neutral indicating whether the premise entails the hypothesis, the hypothesis contradicts the premise, or neither (neutral)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## premise ## ]]
{premise}

[[ ## hypothesis ## ]]
{hypothesis}

[[ ## reasoning ## ]]
{reasoning}

[[ ## agreement ## ]]
{agreement} # note: the value you produce must be one of: entailment; contradiction; neutral

[[ ## completed ## ]]

In adhering to this structure, your objective is:
Generate a step-by-step explanation of the reasoning process for the given premise and hypothesis, then determine whether the premise entails, contradicts, or has no bearing on the hypothesis.

User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## premise ## ]]
Personal Communication with P. Croteau, Babcock Borsig Power, August 2001.

[[ ## hypothesis ## ]]
In August 2001, there was personal communication between P. Croteau and Babcock Borsig Power.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example.

[[ ## agreement ## ]]
entailment

[[ ## completed ## ]]

User message:

[[ ## premise ## ]]
The entire setup has an anti-competitive, anti-entrepreneurial flavor that rewards political lobbying rather than good business practices.

[[ ## hypothesis ## ]]
The setup rewards political lobbying.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
The premise implies that the setup rewards political lobbying rather than good business practices. This is because it specifically states that the setup has an anti-competitive flavor that rewards political lobbying.

[[ ## agreement ## ]]
entailment

[[ ## completed ## ]]

User message:

[[ ## premise ## ]]
If the face has been getting longer at the bottom over the generations, it has been getting shorter (and broader) on top.

[[ ## hypothesis ## ]]
The shape of the face doesn't change at all over the span of generations.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
The premise states that the face has been changing shape over generations, specifically getting longer at the bottom and shorter on top. This contradicts the hypothesis that the shape of the face doesn't change.

[[ ## agreement ## ]]
contradiction

[[ ## completed ## ]]

User message:

[[ ## premise ## ]]
i've yeah i've done it before and when i was in high in high school and college and thoroughly enjoyed it and and it's really a a blast my wife hates it but that's the way life is i guess

[[ ## hypothesis ## ]]
I've never done it and don't think I would like it.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
The speaker has already mentioned that they have done it before and enjoyed it, which contradicts the hypothesis.

[[ ## agreement ## ]]
contradiction

[[ ## completed ## ]]

User message:

[[ ## premise ## ]]
uh-huh so do you have to get a shade tolerant grass is that what you're

[[ ## hypothesis ## ]]
If i want to grow grass in the shade, do i need a special seed that will grow in the shade?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## reasoning ## ]]
The premise is asking about shade-tolerant grass, which implies that regular grass may not grow well in shaded areas. This suggests that yes, a special seed or type of grass is needed to grow in the shade, which supports the hypothesis.

[[ ## agreement ## ]]
entailment

[[ ## completed ## ]]
Cost Monitoring ¶
Any automated LLM procedure raises the question of cost. Luckily, DSPy records each call in the LM's history attribute, including the input and output token counts and a cost field (which is None for a local model like ours). This is convenient for tracking the cost of current experiments, and the token counts can be used to estimate what the same implementation would cost across different paid models.
In this case, suppose I'm impressed with the performance gains from working locally with Llama3, and I'd like to repeat the same process with a higher-performance paid model. Using the token counts in the DSPy history, I could estimate what that run would cost before committing to it.
lm.history[0]
{'prompt': None, 'messages': [{'role': 'system', 'content': 'Your input fields are:\n1. `premise` (str): The premise to which the hypothesis will be compared.\n2. `hypothesis` (str): Statement to be compared to the premise for contradiction or entailment\n\nYour output fields are:\n1. `agreement` (Literal[entailment, contradiction, neutral]): entailment/contradiction/neutral indicating whether the premise entails the hypothesis, the hypothesis contradicts the premise, or neither (neutral)\n2. `explanation` (str): Explanation or reason why the result was chosen.\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## premise ## ]]\n{premise}\n\n[[ ## hypothesis ## ]]\n{hypothesis}\n\n[[ ## agreement ## ]]\n{agreement} # note: the value you produce must be one of: entailment; contradiction; neutral\n\n[[ ## explanation ## ]]\n{explanation}\n\n[[ ## completed ## ]]\n\nIn adhering to this structure, your objective is: \n Given the fields `premise`, `hypothesis`, produce the fields `agreement`, `explanation`.'}, {'role': 'user', 'content': "[[ ## premise ## ]]\nokay i'll keep that in mind yeah you serve that yourself or the for a family\n\n[[ ## hypothesis ## ]]\nI think I will forget about that. You will need to remind me.\n\nRespond with the corresponding output fields, starting with the field `[[ ## agreement ## ]]` (must be formatted as a valid Python Literal[entailment, contradiction, neutral]), then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`."}], 'kwargs': {'temperature': 0.0, 'max_tokens': 1000}, 'response': ModelResponse(id='chatcmpl-5ea12d97-860a-485b-943c-f4a33208ea41', choices=[Choices(finish_reason='stop', index=0, message=Message(content='[[ ## agreement ## ]]\nneutral\n\n[[ ## explanation ## ]]\nThe premise is about serving food to oneself or others, while the hypothesis is about forgetting something. 
These two statements are unrelated, so there is no entailment or contradiction between them.\n\n[[ ## completed ## ]]', role='assistant', tool_calls=None, function_call=None))], created=1739631669, model='ollama_chat/llama3', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=54, prompt_tokens=101, total_tokens=155, completion_tokens_details=None, prompt_tokens_details=None)), 'outputs': ['[[ ## agreement ## ]]\nneutral\n\n[[ ## explanation ## ]]\nThe premise is about serving food to oneself or others, while the hypothesis is about forgetting something. These two statements are unrelated, so there is no entailment or contradiction between them.\n\n[[ ## completed ## ]]'], 'usage': {'completion_tokens': 54, 'prompt_tokens': 101, 'total_tokens': 155, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'cost': None, 'timestamp': '2025-02-15T10:33:40.907002', 'uuid': '7a8c0a22-93a5-4047-9512-7c2467572616', 'model': 'ollama_chat/llama3', 'model_type': 'chat'}
def summarize_lm_history(lm):
    """Summarize the costs and token counts
    for current LM session.
    """
    stats = pd.DataFrame(
        [
            [1, d['usage']['completion_tokens'], d['usage']['prompt_tokens'], d['usage']['total_tokens'], d['cost']]
            for d in lm.history
        ],
        columns=['n_prompts', 'completion_tokens', 'prompt_tokens', 'total_tokens', 'cost']
    )
    return stats.sum()
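Because the history record shown earlier reports `'cost': None` for the local model, it can help to treat a missing cost as zero explicitly. The following is a minimal sketch of that idea, not part of DSPy: `summarize_history_records` is a hypothetical helper, and it assumes only the `usage` and `cost` keys visible in that record. The second sample record is made up for illustration.

```python
import pandas as pd


def summarize_history_records(records):
    """Sum token counts and costs over history-like dicts (hypothetical helper)."""
    rows = [
        {
            "n_prompts": 1,
            "completion_tokens": r["usage"]["completion_tokens"],
            "prompt_tokens": r["usage"]["prompt_tokens"],
            "total_tokens": r["usage"]["total_tokens"],
            # Local models report cost as None; count that as 0.0.
            "cost": r.get("cost") or 0.0,
        }
        for r in records
    ]
    return pd.DataFrame(rows).sum()


# Records mirroring the shape of lm.history entries.
# The first copies the usage shown above; the second is invented.
sample_records = [
    {"usage": {"completion_tokens": 54, "prompt_tokens": 101, "total_tokens": 155}, "cost": None},
    {"usage": {"completion_tokens": 40, "prompt_tokens": 120, "total_tokens": 160}, "cost": None},
]

summary = summarize_history_records(sample_records)
print(summary)
```

Since it takes a plain list, the same helper could be called as `summarize_history_records(lm.history)`, assuming every entry carries a populated `usage` dictionary.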
Below, I can quickly see the number of prompts and tokens sent during my experimentation, and with a quick Google and a little arithmetic, I can estimate how much the experiments would have cost on some of the newer OpenAI models: gpt-4o, gpt-4o-mini, and o3-mini.
lm_stats = summarize_lm_history(lm)
lm_stats
n_prompts                 1453
completion_tokens       139915
prompt_tokens           357104
total_tokens            497019
cost                         0
dtype: object
# Prices in USD per token (listed per million tokens, hence the division)
gpt_costs = {
    'gpt_4o_mini': {'prompt': 0.15 / 1e6, 'completion': 0.6 / 1e6},
    'gpt_o3_mini': {'prompt': 1.1 / 1e6, 'completion': 4.4 / 1e6},
    'gpt_4o': {'prompt': 2.5 / 1e6, 'completion': 10 / 1e6}
}

def total_cost(prompt, completion):
    return (lm_stats['prompt_tokens'] * prompt) + (lm_stats['completion_tokens'] * completion)

print(
    'Cost per model:\n' +
    '\n'.join(
        [
            f"{model}:{' '*(15 - len(model))}${total_cost(**gpt_costs[model]): 0.4f}"
            for model in gpt_costs
        ]
    )
)
Cost per model:
gpt_4o_mini:    $ 0.1375
gpt_o3_mini:    $ 1.0084
gpt_4o:         $ 2.2919
With these costs visible, I could now compare the value of my AI product against the training costs to make an informed decision.
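The same arithmetic can be written as a function that does not depend on a global `lm_stats`, which makes it easier to reuse with other token totals. This is a sketch under stated assumptions: `estimate_cost` is a hypothetical name, the token totals are copied from the session summary in this section, and the per-million-token rates are the ones used in `gpt_costs`, which may no longer match current pricing.

```python
def estimate_cost(prompt_tokens, completion_tokens, prompt_rate, completion_rate):
    """Estimate spend in USD from token counts and per-million-token rates."""
    return (prompt_tokens * prompt_rate + completion_tokens * completion_rate) / 1e6


# Token totals copied from the session summary in this section.
PROMPT_TOKENS = 357_104
COMPLETION_TOKENS = 139_915

# (prompt, completion) rates in USD per million tokens, copied from gpt_costs.
# Assumption: these may be out of date; check the provider's pricing page.
rates = {
    "gpt_4o_mini": (0.15, 0.60),
    "gpt_o3_mini": (1.10, 4.40),
    "gpt_4o": (2.50, 10.00),
}

estimates = {
    model: estimate_cost(PROMPT_TOKENS, COMPLETION_TOKENS, prompt_rate, completion_rate)
    for model, (prompt_rate, completion_rate) in rates.items()
}

for model, cost in estimates.items():
    print(f"{model:<15} ${cost:.4f}")
```

Keeping the rates per million tokens mirrors how providers usually publish them, so the table can be updated without dividing each number by hand.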
Conclusion ¶
DSPy still has some growing to do, but it already shows a ton of promise. Designing experiments similar to the ones here using another framework, like LangChain, would have required a lot more work and a lot more prompt engineering. So in its mission to abstract away prompt engineering, I call DSPy a success. This abstraction does come with drawbacks in flexibility and customization, but for most AI applications, particularly those that mirror typical data science workstreams, I think DSPy provides a fairly robust and easy-to-use set of tools.