Introduction
I previously wrote about writing React code with the DeepSeek-Coder 33B model, and whether we could address some of its shortcomings with the latest research in the LLM space.
But to really measure and mark progress, we would need to build a benchmark to test various hypotheses.
So in this post, I’m going to evaluate existing benchmarks that specifically measure the coding capabilities of LLMs.
My goal is to build a benchmark that can test their React/TypeScript coding capabilities.
What we need
Unit Test Evaluations
In this method, we’ll require the LLM to write the code, and then run unit tests to measure the outcome.
We will then evaluate pass@1, pass@k, and strict-accuracy metrics.
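As a refresher, pass@k is typically computed with the unbiased estimator introduced alongside HumanEval: given n generated samples per problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Below is a minimal TypeScript sketch of that formula (my own rendering, not code from any of the benchmarks reviewed here):

    // Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    // where n = samples generated per problem and c = samples that pass the tests.
    function passAtK(n: number, c: number, k: number): number {
      // If fewer than k samples fail, any draw of k must include a passing one.
      if (n - c < k) return 1.0;
      let failAll = 1.0;
      // C(n - c, k) / C(n, k) == product over i = n - c + 1 .. n of (1 - k / i)
      for (let i = n - c + 1; i <= n; i++) {
        failAll *= 1 - k / i;
      }
      return 1 - failAll;
    }

    console.log(passAtK(10, 3, 1)); // ≈ 0.3 (3 of 10 samples passed, k = 1)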
Visual verification
In this method, we want to test style replication: we ask the LLM to produce a component from a given specification.
We’ll then verify its output against a known ground truth of correct visual output.
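As a rough sketch, Jest snapshots can stand in for that ground-truth store: the first run records the rendered output, and later runs fail if it drifts. The Card component and its props below are hypothetical:

    // Card.test.tsx: a structural "visual" check via Jest snapshots.
    import * as React from 'react';
    import { create } from 'react-test-renderer';
    import { Card } from './Card'; // hypothetical component under test

    test('Card matches the approved visual output', () => {
      const tree = create(<Card title="Pricing" highlighted />).toJSON();
      expect(tree).toMatchSnapshot();
    });

A structural snapshot is the cheapest version of this; pixel-level screenshot diffing would be stricter, but the idea is the same: compare against a stored, approved output.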
Ease of writing, and similarity to real-life
I’d also want this to be similar to how we write code practically.
A file where some code is written, and a corresponding .test file that imports the code and runs a set of evaluations.
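A hypothetical example of that layout, just to illustrate the shape (the Counter component and its test are mine):

    // Counter.tsx: the code the LLM is asked to produce
    import * as React from 'react';

    export function Counter() {
      const [count, setCount] = React.useState(0);
      return (
        <button onClick={() => setCount(count + 1)}>Count: {count}</button>
      );
    }

    // Counter.test.tsx: the evaluation, written the way we normally write tests
    import * as React from 'react';
    import { render, screen, fireEvent } from '@testing-library/react';
    import { Counter } from './Counter';

    test('increments on click', () => {
      render(<Counter />);
      fireEvent.click(screen.getByText('Count: 0'));
      expect(screen.getByText('Count: 1')).toBeTruthy();
    });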
How the rest of the post is structured
Review of existing benchmarks and how they are setup
OpenAI Evals
HumanEval
CanAiCode
MultiPL-E
RepoBench
In a future post, I intend to cover
Details on Test based method
Details on Visual verification
Benchmark results for three open-source LLMs.
1) OpenAI Evals
This is probably the most renowned of all evaluation frameworks. https://github.com/openai/evals
However, they don’t accept “custom code” evals, meaning only simple matches (Exact, Includes, Fuzzy Match) are possible test evaluations to run.
Even though OpenAI doesn’t accept these evals, it’s worth noting that we can simply fork the repo and write our own custom evals.
The framework lets you build a custom eval as well as a custom completion function, and it comes with a nice cookbook tutorial.
Pros
Mature framework.
A ton of existing sample benchmarks. Once this is set up, it also makes it easy to get results on other interesting benchmarks.
Enables custom evals and custom completions
Cons
Doesn’t accept new custom evals.
It’s a bit heavy to set up, with Git LFS and lots of dependencies that have been added over time.
Doesn’t have many code-related benchmarks.
Verdict
👍 - This could work for building a React benchmark. It might be a bit hard to get off the ground though, and may limit customization.
2) APPS
Paper: Measuring Coding Challenge Competence With APPS
Repository: https://github.com/hendrycks/apps
10,000 code generation problems of varying difficulty, covering simple introductory problems, interview-level problems, and coding competition challenges.
Pros
Simple codebase. See the evaluation guide in the repo.
A ton of coding-specific evaluations, with multiple difficulty levels.
Cons
Most of the benchmark problems are in Python, so it may not work too well for other languages.
Isn’t written with extensibility in mind; it’s mostly coded for testing Python codebases.
Verdict
👎 - Not something to use for custom, real-world “app”-related benchmarking.
3) HumanEval
From OpenAI again, a hand-written set of evaluations.
Repo: https://github.com/openai/human-eval
Paper: Evaluating Large Language Models Trained on Code
We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem
Pros
- Pretty simple codebase, and good examples
Cons
- Mostly Python evaluations
Verdict
If not testing Python, this one is a 👎
4) CanAiCode
Repo: https://github.com/the-crypt-keeper/can-ai-code/blob/main/prompts/codellama-input-v2.txt
Leaderboard: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
Pros
Supports JavaScript test cases, not just Python.
Template-based generation of test cases. See the template prompt for StarCoder:
{% if language == "python" %}<fim_prefix>def {{Signature}}:
'''a function {{Input}} that returns {{Output}}{% if Fact %} given {{Fact}}{% endif %}'''
<fim_suffix>
# another function{% endif %}
{% if language == "javascript" %}<fim_prefix>// a function {{Input}} that returns {{Output}}{% if Fact %} given {{Fact}}{% endif %}
function {{Signature}} {
<fim_suffix>
}
// another function{% endif %}<fim_middle>
Combined with YAML for the tests:
.Checks: &Checks
FactorialZeroShot:
  Signature: "factorial(n)"
  Input: "with input n"
  Output: "the factorial of n using iteration"
  Description: "See if the model can implement a well known function"
  Checks:
    one_argument:
      assert: "len(f.args)"
      eq: 1
    returns_list:
      assert: "isinstance(f.call(1),int)"
      eq: true
    value_0:
      assert: "f.call(1)"
      eq: 1
    value_5:
      assert: "f.call(5)"
      eq: 120
Cons
Unfortunately, it is not customizable beyond simple input-output testing.
5) MultiPL-E
Meant to test code LLMs on multiple programming languages.
A system for translating unit test-driven neural code generation benchmarks to new languages. We have used MultiPL-E to translate two popular Python benchmarks (HumanEval and MBPP) to 18 other programming languages.
Examples shown here: https://nuprl.github.io/MultiPL-E/
Paper: MultiPL-E
Repo: https://github.com/nuprl/MultiPL-E
Pros
Examples on running JS tests: https://github.com/nuprl/MultiPL-E/blob/main/prompts/humaneval-js-keep.jsonl
Enables writing tests as functions, so it is not limited to simple input-output comparisons.
Adding new tests seems simple: See https://nuprl.github.io/MultiPL-E/new_benchmark.html
Cons
While the tutorial makes it sound like writing test cases is really simple, this doesn’t seem to be the case:
https://github.com/nuprl/MultiPL-E/blob/main/prompts/humaneval-r-remove.jsonl
Each test case needs to be encoded in a particular JSONL format, with escape characters handled, and so on.
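For instance, a multi-line prompt and its tests have to be collapsed into a single escaped line. Here is a rough TypeScript sketch of that packaging step; the field names are illustrative, not MultiPL-E’s exact schema:

    // Each benchmark problem becomes one escaped line in a .jsonl file.
    // JSON.stringify handles the newline and quote escaping that would
    // otherwise need to be fixed by hand.
    const problem = {
      name: 'sum_list',
      prompt: '// Return the sum of an array of numbers\nfunction sumList(xs: number[]): number {\n',
      tests: 'if (sumList([1, 2, 3]) !== 6) { throw new Error("sum_list failed"); }\n',
    };

    console.log(JSON.stringify(problem)); // one line, ready to append to the .jsonl file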
Verdict
👍 - This could work for building a React benchmark, but it may not be easy to add new test cases.
6) RepoBench
Paper: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Repo: https://github.com/Leolty/repobench
Validates LLMs on 3 tasks:
1 - Retrieval Task: Ability to retrieve the right contextual files.
2 - Completion Task: Ability to complete next line, given the context files.
3 - Combined Task: Retrieval + Completion
Some interesting points noted in the paper:
Python Retrieval Shows Higher Accuracy Than Java: The language-specific results show that Python tasks typically show higher accuracy than Java across all retrieval methods. This discrepancy might be attributed to Python’s simpler syntax and less verbose nature, potentially reducing the variability of similar code snippets.
Pronounced Performance Differences in Java for RepoBenchC-2k: The evaluation on Java showcases a marked differentiation in model performance: Codex notably stands out as the superior model, followed by StarCoder, while CodeGen largely lags behind.
While there are some intuitive reasons cited, this clearly shows that benchmarks on Python may not directly apply to React/TypeScript codebases.
Interesting bits
The project is easy to read, and some interesting files are:
1 - Metrics: ExactMatch, Similarity, and Accuracy@K https://github.com/Leolty/repobench/blob/main/evaluation/metrics.py. Note: their accuracy@k is not a probabilistic calculation like the pass@k metric introduced in HumanEval, and refers to the number of accurate codes retrieved out of correct codes.
2 - Retriever: https://github.com/Leolty/repobench/blob/main/retriever/retriever.py
3 - Similarity (Jaccard, Edit, Cosine): https://github.com/Leolty/repobench/blob/main/retriever/similarity.py (a small Jaccard sketch follows this list)
4 - Prompt constructor: https://github.com/Leolty/repobench/blob/main/data/utils.py
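To make the retrieval scoring concrete, here is a minimal TypeScript sketch of Jaccard similarity over code tokens; it mirrors the general technique, not RepoBench’s exact implementation, and the tokenizer is deliberately crude:

    // Jaccard similarity between two code snippets treated as sets of tokens:
    // |A ∩ B| / |A ∪ B|.
    function tokenize(code: string): Set<string> {
      return new Set(code.split(/[^A-Za-z0-9_]+/).filter(Boolean));
    }

    function jaccard(a: string, b: string): number {
      const ta = tokenize(a);
      const tb = tokenize(b);
      const intersection = [...ta].filter((t) => tb.has(t)).length;
      const union = new Set([...ta, ...tb]).size;
      return union === 0 ? 0 : intersection / union;
    }

    // Rank candidate context snippets by similarity to the code near the cursor.
    const query = 'const total = items.reduce((sum, item) => sum + item.price, 0);';
    const candidates = [
      'function sumPrices(items) { return items.reduce((s, i) => s + i.price, 0); }',
      'function formatDate(d) { return d.toISOString(); }',
    ];
    candidates
      .map((code) => ({ code, score: jaccard(query, code) }))
      .sort((a, b) => b.score - a.score)
      .forEach(({ code, score }) => console.log(score.toFixed(2), code));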
Pros
1 - Easy to understand
2 - Repo level context understanding.
3 - Usage of Google Drive for the dataset.
4 - Multiple languages supported, with various similarity metrics on next-line prediction.
Cons
The question this benchmark is trying to answer is different from what we need: we require unit-test and visual accuracy, assuming the right context is already given.
Verdict
Not applicable.
Conclusion
So far, the only ones that meet what I’m looking for are OpenAI Evals and the MultiPL-E benchmark.
Ideally, if these benchmarks were easier to prepare and mimicked the way we actually write code and test cases, they would be much easier to extend.
So after this research, I believe the best answer is to build a new “ReactBench”: a benchmark that mimics how React code is structured and is geared towards accuracy on TypeScript/React, with unit testing and snapshotting.