Model Performance

Current Model

Architecture: DINOv2 + FAISS
Inference: Modal (serverless GPU)
Candidates per task: Top-5
Version
Model Label
Training Epoch
Trainer Commit
Reference Commit
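
The retrieval step implied by "DINOv2 + FAISS" and "Top-5" can be sketched roughly as follows. This is a pure-Python stand-in, not the deployed pipeline: it assumes embeddings are compared by cosine similarity (one common FAISS setup, not confirmed here), and the names `normalize` and `top_k` are hypothetical. In practice the embeddings would come from DINOv2 and the search would run through a FAISS index.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k(query, gallery, k=5):
    """Return the k most similar gallery entries as (index, score), best first.

    Assumption: similarity is cosine similarity between embeddings.
    Ties are broken by the lower gallery index.
    """
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(gallery):
        e = normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, e)), idx))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(idx, score) for score, idx in scored[:k]]
```

Each task would then carry up to five `(index, score)` candidates for a reviewer to confirm or reject.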

Overall Performance

Match rate: confirmed / decided
Top-1 accuracy: 1st candidate was correct
Avg score (confirmed): FAISS similarity
Avg score (rejected): FAISS similarity
Tasks evaluated: … pending
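
A minimal sketch of how these summary figures could be derived from reviewed tasks. The `Task` record and its fields are hypothetical, and two choices are assumptions rather than documented behavior: Top-1 accuracy is taken over decided tasks, and the average score uses each task's top candidate score.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    top_score: float              # similarity score of the task's best candidate (assumed)
    correct_rank: Optional[int]   # 0-based rank of the confirmed candidate; None if rejected
    decided: bool = True          # False while the task is still pending review

def _mean(values):
    return sum(values) / len(values) if values else None

def summarize(tasks):
    """Compute the overall performance figures from a list of Task records."""
    decided = [t for t in tasks if t.decided]
    confirmed = [t for t in decided if t.correct_rank is not None]
    rejected = [t for t in decided if t.correct_rank is None]
    n = len(decided)
    return {
        "match_rate": len(confirmed) / n if n else None,
        "top1_accuracy": sum(t.correct_rank == 0 for t in confirmed) / n if n else None,
        "avg_score_confirmed": _mean([t.top_score for t in confirmed]),
        "avg_score_rejected": _mean([t.top_score for t in rejected]),
        "tasks_evaluated": n,
        "pending": len(tasks) - n,
    }
```

A wide gap between the confirmed and rejected averages suggests the similarity score separates good matches from bad ones.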

Confidence Calibration

For each FAISS score range, how often did the model's top-5 candidates include the correct answer? A well-calibrated model shows higher match rates at higher scores.
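
The calibration check described above can be sketched as a simple bucketing pass. The bucket edges, the `(score, matched)` record shape, and the function names are illustrative assumptions. The actual score range depends on how the FAISS index is configured, and scores outside the given edges are skipped here.

```python
def calibration_table(records, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Group (score, matched) records into score ranges and compute match rates.

    `matched` is True when the task's top-5 candidates included the correct answer.
    Returns rows of (low, high, task_count, match_rate); match_rate is None for
    empty ranges. The last range includes its upper edge.
    """
    last = len(edges) - 2
    counts = [[0, 0] for _ in range(len(edges) - 1)]  # [tasks, matches] per range
    for score, matched in records:
        for i in range(len(edges) - 1):
            low, high = edges[i], edges[i + 1]
            if low <= score < high or (i == last and score == high):
                counts[i][0] += 1
                counts[i][1] += int(matched)
                break
    return [
        (edges[i], edges[i + 1], n, (hits / n if n else None))
        for i, (n, hits) in enumerate(counts)
    ]

def rates_non_decreasing(rows):
    """True when match rate never drops as the score range increases."""
    rates = [r[3] for r in rows if r[3] is not None]
    return all(a <= b for a, b in zip(rates, rates[1:]))
```

If `rates_non_decreasing` returns False, higher scores are not reliably indicating better matches, which is the pattern this section is meant to surface.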

Version History

Each time the model is retrained on new confirmed labels, a new version is created here. All tasks retain the version tag of the model that generated them.

Version | Tasks | Match rate | Avg score | Period | Status