FIM Evaluation Part 2
Adding Claude into the mix
Yesterday I started evaluating off-the-shelf LLMs on their usefulness as an intelligent autocomplete model (for use in Satyrn). I’m using the human-eval-infilling tool from OpenAI to benchmark these models.
This first round of evaluations was done with the same system prompt across all 3 models.
I’m also using the same prompt format, following the “Fill In the Middle” (FIM) approach reported by OpenAI. It’s not clear this is the best prompting method for these particular models, but it’s where I’m starting.
<PRE>{prefix}<SUF>{suffix}<MID>
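Concretely, each request to the hosted models looked roughly like the sketch below, where system_prompt stands in for the shared system prompt (not reproduced here) and prefix/suffix come from the benchmark task:

```python
# Sketch of a single request; `system_prompt` is a placeholder for the shared
# system prompt, and `prefix`/`suffix` come from the benchmark task.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"<PRE>{prefix}<SUF>{suffix}<MID>"},
]
```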
The results for running 100 “single-line” completion tasks with 5 attempts for each task are below:
| model | haiku 3.5 | gpt-4o-mini | gemini-1.5-flash |
|---|---|---|---|
| pass@1 | 0.17 | 0.39 | 0.96 |
| pass@3 | 0.21 | 0.41 | 0.98 |
| pass@5 | 0.24 | 0.43 | 0.99 |
| cost ($) | 0.39 | 0.06 | 0.03 |
Here “pass@k” is an estimate of the probability that the model will get a correct answer if it’s given k attempts. pass@5 is the fraction of tasks that had at least 1 correct answer across the 5 attempts.
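For reference, the human-eval tooling computes pass@k per task with the unbiased estimator from the Codex paper and then averages over all tasks. A minimal version of that estimator looks like this:

```python
import numpy as np

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for a single task: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```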
As you can see, Gemini blows the other two out of the water. It’s interesting that Claude performs so poorly even though it’s by far the most expensive model.
I have not been rigorous with my prompting methods, and I think there is probably a lot of room to improve these scores, but first I’m going to dive into open-source alternatives to get a sense of what they’re capable of.
Code Llama
Meta released Code Llama in August 2023. It was specifically trained on the FIM task, so I decided to test it out.
To make Code Llama they took the base Llama2 model and continued training it on lots of code to form the base “Code Llama”. They also created two other variations of this base code model called “Code Llama - Python” and “Code Llama - Instruct”, but we’ll mainly be interested in the base code model for now.
Of the 4 sizes of Code Llama available (7B, 13B, 34B, and 70B), 3 include FIM training (all except 34B), so those are the ones we’re going to test.
I’ve only got 36GB of memory on my MacBook, so we’ll need to see how far we can push it, and then switch to a cloud-hosted GPU to run the larger models.
Running Llama
Initially I decided to run Llama via ollama. This simplifies matters enormously because I can use the OpenAI python sdk to interact with the model, and does not require messing around with installing the correct dependencies.
I simply start up ollama, download the model (codellama:7b-code), and then call it from within my Python code:
```python
from openai import OpenAI

# Point the OpenAI client at the local ollama server.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # dummy key, required but not used
)

# Ask Code Llama to fill in the middle between the prefix and suffix.
response = client.chat.completions.create(
    model="codellama:7b-code",
    messages=[
        {"role": "user", "content": f"<PRE> {prefix} <SUF>{suffix} <MID>"},
    ],
    temperature=0.0,
)
```
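The generated middle comes back as the assistant message content. One caveat (an assumption on my part rather than something I’ve verified across servers): infilling models like Code Llama can emit an end-of-text marker such as <EOT>, so I strip it before scoring:

```python
# Extract the generated middle; stripping a trailing "<EOT>" marker is an
# assumption about how the model terminates infills, and is harmless if absent.
completion = (response.choices[0].message.content or "").replace("<EOT>", "").rstrip()
```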
I found setting the temperature to zero was a good starting point, but I’ll experiment with this later.
Prompting Llama
Unlike Gemini, Claude, and GPT in my earlier tests, the base Code Llama model has been heavily trained on the FIM task, so it’s not necessary (and probably harmful) to include a system prompt. That’s why you don’t see one in the code snippet above.
Another difference I needed to introduce to get this working was to tweak the prompt itself. I was rather surprised by the nuances of prompting required for Code Llama. Fortunately the Ollama docs specify the prompt format clearly:
<PRE> {prefix} <SUF>{suffix} <MID>
At first glance that might look the same as the prompt I used for Gemini, Claude, and GPT, but it actually includes additional spaces around the prefix and after the suffix.
The referenced paper does not indicate there should be spaces between the tags, but I’ve found that if I don’t include them ollama will hang without providing a response. I can’t explain this yet, but it makes me wonder whether I should use this format when prompting Gemini, Claude, and GPT (I assume it’s not too important for those models but I’m not sure).
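To make the difference explicit, here are the two formats side by side (prefix and suffix are just a toy example here):

```python
prefix, suffix = "def add(a, b):\n    return ", "\n"

# Format I used for the hosted models (no extra spaces):
hosted_prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"

# Format from the ollama docs for Code Llama: spaces around the prefix and
# after the suffix. Without these my requests hung.
codellama_prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"
```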
Results for 7B
The results show it’s better than Claude and GPT, but worse than Gemini: we got 76% accuracy on the 100 test cases. I had the temperature set to zero, so pass@1 was the same as pass@5.
Most of the failed cases are where the cursor is at the end of the code and the model does not generate anything.
There are also some failure cases where the model generates valid Python code but does not quite fulfill the perceived intent of the test case. For example, the test case below provides context where it is clear that we need to set maxlen to a sensible value in this function.
```python
def longest(strings: List[str]) -> Optional[str]:
    """ Out of list of strings, return the longest one. Return the first one in case of multiple
    strings of the same length. Return None in case the input list is empty.
    >>> longest([])
    >>> longest(['a', 'b', 'c'])
    'a'
    >>> longest(['a', 'bb', 'ccc'])
    'ccc'
    """
    if not strings:
        return None
    |
    for s in strings:
        if len(s) == maxlen:
            return s
```
When trying to fill in the missing line, the model makes a mistake and generates maxlen = 0, which would result in the function returning None for essentially any input, which is clearly incorrect.
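For comparison, a fill that matches the intent would derive maxlen from the input strings, something like:

```python
    maxlen = max(len(s) for s in strings)
```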
I assume larger versions of Llama will be able to handle this kind of test case better.
Next up
Tomorrow I’m going to parametrize the temperature of Llama and test out some larger versions of the model.