Model Performance — Space Invaders

Current Model

Architecture: DINOv2 + FAISS
Inference: Modal (serverless GPU)
Candidates per task: Top-5
Version: —
Model Label: —
Training Epoch: —
Trainer Commit: —
Reference Commit: —

Overall Performance

Match rate: … (confirmed / decided)
Top-1 accuracy: … (1st candidate was correct)
Avg score — confirmed: … (FAISS similarity)
Avg score — rejected: … (FAISS similarity)
Tasks evaluated: … (… pending)
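
As a rough illustration of how the figures above might be derived, the sketch below computes them from a list of task records. The `Task` type, its field names, and the status values are hypothetical. It also assumes that "decided" means confirmed plus rejected and that top-1 accuracy uses the same denominator, neither of which this page states.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Task:
    # All fields are assumptions made for illustration only.
    status: str                  # "confirmed", "rejected", or "pending"
    score: float                 # FAISS similarity of the top candidate
    correct_rank: Optional[int]  # 1-based rank of the correct candidate, None if absent


def overall_metrics(tasks):
    confirmed = [t for t in tasks if t.status == "confirmed"]
    rejected = [t for t in tasks if t.status == "rejected"]
    pending = [t for t in tasks if t.status == "pending"]
    decided = len(confirmed) + len(rejected)  # assumed definition of "decided"
    top1 = sum(1 for t in confirmed if t.correct_rank == 1)
    return {
        "match_rate": len(confirmed) / decided if decided else None,
        "top1_accuracy": top1 / decided if decided else None,
        "avg_score_confirmed": mean([t.score for t in confirmed]) if confirmed else None,
        "avg_score_rejected": mean([t.score for t in rejected]) if rejected else None,
        "tasks_evaluated": decided,
        "pending": len(pending),
    }
```

Metrics with an empty denominator are returned as `None` rather than 0, so an unevaluated model is not mistaken for one that never matched.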

Confidence Calibration

For each FAISS score range, how often did the model's top-5 candidates include the correct answer? A well-calibrated model shows higher match rates at higher scores.
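
The calibration breakdown described above can be sketched as a simple bucketing step. This is a minimal example, not the dashboard's actual implementation: the bucket edges are hypothetical, the input is assumed to be `(score, matched)` pairs where `matched` says whether the top-5 candidates included the correct answer, and a higher FAISS score is assumed to mean a closer match.

```python
from bisect import bisect_right


def calibration_table(records, edges=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return (low, high, task_count, match_rate) for each non-empty score bucket.

    records: iterable of (score, matched) pairs (assumed input shape).
    edges:   ascending bucket boundaries (hypothetical values).
    Bucket i covers the half-open range [bounds[i], bounds[i + 1]).
    """
    n = len(edges) + 1
    total = [0] * n
    matched = [0] * n
    for score, hit in records:
        i = bisect_right(edges, score)  # index of the bucket containing score
        total[i] += 1
        matched[i] += 1 if hit else 0
    bounds = [float("-inf"), *edges, float("inf")]
    return [
        (bounds[i], bounds[i + 1], total[i], matched[i] / total[i])
        for i in range(n)
        if total[i]
    ]
```

If the match rate in the returned rows rises from the lowest bucket to the highest, the scores are behaving as a useful confidence signal. Explicit edges are used instead of dividing by a bin width to avoid floating-point surprises at bucket boundaries.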

Version History

Each time the model is retrained on new confirmed labels, a new version is added here. Every task keeps the version tag of the model that generated it.

Version	Tasks	Match rate	Avg score	Period	Status