About LLM ELO Ranking

How ELO Ranking Works

ELO is a rating system for calculating the relative skill levels of players in zero-sum games, originally developed for chess. In our case, we use it to rank the performance of Large Language Models (LLMs).

When two LLMs compete, the winner takes points from the loser. The number of points exchanged depends on the relative ratings of the competitors: beating a higher-rated LLM earns more points than beating a lower-rated one.

Each model starts with 1000 ELO points. The ranking evolves as more competitions take place.
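
For concreteness, here is a minimal sketch of the update in Python. It uses the standard Elo expected-score formula; the K-factor of 32 is a common default and an assumption on our part, while the 1000-point starting rating comes from above.

    def expected_score(rating_a, rating_b):
        # Probability that A beats B under the standard Elo model.
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    def update_elo(winner, loser, k=32):
        # k caps how many points change hands per match; 32 is an
        # assumed value, not one documented by this project.
        delta = k * (1 - expected_score(winner, loser))
        return winner + delta, loser - delta

    # An upset by a 1000-rated model over a 1100-rated one
    # transfers about 20 points.
    new_winner, new_loser = update_elo(1000, 1100)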

Our Methodology

For each pair of models, we:

  1. Present both models with the same question/prompt
  2. Have a judge model evaluate which response is better
  3. Update the ELO rankings based on the outcome (sketched in code below)
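
A hedged sketch of this loop, reusing update_elo from the snippet above; ask and judge are hypothetical callables standing in for our model and judge APIs:

    def run_match(ratings, name_a, name_b, prompt, ask, judge):
        # ask(model_name, prompt) -> response text, and
        # judge(prompt, a, b) -> "a" or "b", are hypothetical stand-ins.
        answer_a = ask(name_a, prompt)
        answer_b = ask(name_b, prompt)
        verdict = judge(prompt, answer_a, answer_b)
        winner, loser = (name_a, name_b) if verdict == "a" else (name_b, name_a)
        ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser])
        return ratings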

The judge model is a separate LLM that evaluates responses based on the following criteria (an example call is sketched after the list):

  • Correctness and accuracy
  • Helpfulness and relevance
  • Clarity and coherence
  • Safety and adherence to guidelines
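
One way such a judge call might look is sketched below; the prompt wording, the make_judge helper, and the one-letter verdict format are illustrative assumptions rather than our production setup:

    JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the
    question below on correctness, helpfulness, clarity, and safety. Reply with
    exactly "a" or "b" for the better answer.

    Question: {question}
    Answer a: {answer_a}
    Answer b: {answer_b}"""

    def make_judge(call_llm):
        # call_llm(prompt) -> text is a hypothetical wrapper around the
        # judge model's API.
        def judge(question, answer_a, answer_b):
            reply = call_llm(JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b))
            return "a" if reply.strip().lower().startswith("a") else "b"
        return judge

In practice, pairwise judges of this kind are typically also run with the answer order swapped or randomized to counter position bias.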

Database Schema

Below is an overview of our database schema for tracking models, questions, answers, and votes.

[Diagram: database schema]
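
As a textual stand-in for the diagram, here is a minimal sketch of such a schema using SQLite from Python. The table and column names are assumptions inferred from the entities named above, not our actual production schema.

    import sqlite3

    conn = sqlite3.connect("elo.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS models (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        elo  REAL NOT NULL DEFAULT 1000  -- starting rating, per above
    );
    CREATE TABLE IF NOT EXISTS questions (
        id     INTEGER PRIMARY KEY,
        prompt TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS answers (
        id          INTEGER PRIMARY KEY,
        model_id    INTEGER NOT NULL REFERENCES models(id),
        question_id INTEGER NOT NULL REFERENCES questions(id),
        text        TEXT NOT NULL
    );
    -- One row per judged comparison between two answers.
    CREATE TABLE IF NOT EXISTS votes (
        id               INTEGER PRIMARY KEY,
        winner_answer_id INTEGER NOT NULL REFERENCES answers(id),
        loser_answer_id  INTEGER NOT NULL REFERENCES answers(id)
    );
    """)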

About maxsim.ai

maxsim.ai is a platform for evaluating, benchmarking, and comparing AI models. We aim to provide objective measurements of AI capabilities to help users and developers make informed decisions.

This ELO ranking system is one of our projects to quantify and track the performance of different LLMs over time.