About LLM ELO Ranking

How ELO Ranking Works

ELO is a rating system for calculating the relative skill levels of players in zero-sum games, originally developed for chess. In our case, we use it to rank the performance of Large Language Models (LLMs).

When two LLMs compete, the winner takes points from the loser. The number of points exchanged depends on the relative ratings of the competitors: beating a higher-rated LLM earns more points than beating a lower-rated one.

Each model starts with 1000 ELO points. The ranking evolves as more competitions take place.
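
For concreteness, here is a minimal sketch of the update in Python. It uses the standard Elo expected-score formula; the K-factor of 32 is a common default and an assumption on our part, while the 1000-point starting rating comes from above.

    def expected_score(rating_a, rating_b):
        # Probability that A beats B under the standard Elo model.
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    def update_elo(winner, loser, k=32):
        # k caps how many points change hands per match; 32 is an
        # assumed value, not one documented by this project.
        delta = k * (1 - expected_score(winner, loser))
        return winner + delta, loser - delta

    # An upset by a 1000-rated model over a 1100-rated one
    # transfers about 20 points.
    new_winner, new_loser = update_elo(1000, 1100)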

Our Methodology

For each pair of models, we:

  1. Present both models with the same question/prompt
  2. Have a judge model evaluate which response is better
  3. Update the ELO rankings based on the outcome (sketched in code below)
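
A hedged sketch of this loop, reusing update_elo from the snippet above; ask and judge are hypothetical callables standing in for our model and judge APIs:

    def run_match(ratings, name_a, name_b, prompt, ask, judge):
        # ask(model_name, prompt) -> response text, and
        # judge(prompt, a, b) -> "a" or "b", are hypothetical stand-ins.
        answer_a = ask(name_a, prompt)
        answer_b = ask(name_b, prompt)
        verdict = judge(prompt, answer_a, answer_b)
        winner, loser = (name_a, name_b) if verdict == "a" else (name_b, name_a)
        ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser])
        return ratings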

The judge model is a separate LLM that evaluates responses based on the following criteria (an example call is sketched after the list):

  • Correctness and accuracy
  • Helpfulness and relevance
  • Clarity and coherence
  • Safety and adherence to guidelines
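
One way such a judge call might look is sketched below; the prompt wording, the make_judge helper, and the one-letter verdict format are illustrative assumptions rather than our production setup:

    JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the
    question below on correctness, helpfulness, clarity, and safety. Reply with
    exactly "a" or "b" for the better answer.

    Question: {question}
    Answer a: {answer_a}
    Answer b: {answer_b}"""

    def make_judge(call_llm):
        # call_llm(prompt) -> text is a hypothetical wrapper around the
        # judge model's API.
        def judge(question, answer_a, answer_b):
            reply = call_llm(JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b))
            return "a" if reply.strip().lower().startswith("a") else "b"
        return judge

In practice, pairwise judges of this kind are typically also run with the answer order swapped or randomized to counter position bias.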

Database Schema

Below is an overview of our database schema for tracking models, questions, answers, and votes.

[Diagram: database schema]
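
As a textual stand-in for the diagram, here is a minimal sketch of such a schema using SQLite from Python. The table and column names are assumptions inferred from the entities named above, not our actual production schema.

    import sqlite3

    conn = sqlite3.connect("elo.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS models (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        elo  REAL NOT NULL DEFAULT 1000  -- starting rating, per above
    );
    CREATE TABLE IF NOT EXISTS questions (
        id     INTEGER PRIMARY KEY,
        prompt TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS answers (
        id          INTEGER PRIMARY KEY,
        model_id    INTEGER NOT NULL REFERENCES models(id),
        question_id INTEGER NOT NULL REFERENCES questions(id),
        text        TEXT NOT NULL
    );
    -- One row per judged comparison between two answers.
    CREATE TABLE IF NOT EXISTS votes (
        id               INTEGER PRIMARY KEY,
        winner_answer_id INTEGER NOT NULL REFERENCES answers(id),
        loser_answer_id  INTEGER NOT NULL REFERENCES answers(id)
    );
    """)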

About maxsim.ai

maxsim.ai is a platform for evaluating, benchmarking, and comparing AI models. We aim to provide objective measurements of AI capabilities to help users and developers make informed decisions.

This ELO ranking system is one of our projects to quantify and track the performance of different LLMs over time.