Percentage of annotated spans where each annotator filled in the subcategory, impact level, and span-level comment fields. Values below 80% are highlighted in orange.
| task_id | annotator_id | total_spans | subcategory_% | impact_% | span_comments_% |
|---|---|---|---|---|---|
| 26 | 1 | 8 | 100.0 | 100.0 | 100.0 |
| 26 | 110 | 8 | 100.0 | 100.0 | 100.0 |
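These fill rates can be reproduced from a flat span table. A minimal pandas sketch, where the column names (`subcategory`, `impact`, `span_comment`) are assumptions rather than the project's actual schema:

```python
import pandas as pd

def field_fill_rates(spans: pd.DataFrame) -> pd.DataFrame:
    """Per (task_id, annotator_id): percent of spans with each optional field filled."""
    fields = ["subcategory", "impact", "span_comment"]  # assumed column names
    flags = pd.DataFrame({
        # Treat NaN and empty/whitespace-only strings as unfilled.
        f: spans[f].notna() & spans[f].astype(str).str.strip().ne("")
        for f in fields
    })
    flags[["task_id", "annotator_id"]] = spans[["task_id", "annotator_id"]]
    grouped = flags.groupby(["task_id", "annotator_id"])
    rates = (grouped[fields].mean() * 100).round(1)
    rates.columns = [f + "_%" for f in fields]
    rates.insert(0, "total_spans", grouped.size())
    return rates.reset_index()
```

Empty strings are treated the same as missing values here, since free-text fields are often saved as `""` rather than null.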
Whether each annotator completed all five document-level fields: document issues, correspondence score, correspondence comment, readability score, and readability comment. Missing fields are shown in red.
| task_id | annotator_id | document_issues | correspond_score | correspond_comment | readable_score | readable_comment |
|---|---|---|---|---|---|---|
| 26 | 1 | ✓ | ✓ | ✗ | ✓ | ✗ |
| 26 | 110 | ✓ | ✓ | ✗ | ✓ | ✗ |
Outliers in both timing measures below are flagged using the 1.5×IQR rule, i.e. values falling outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
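A minimal sketch of that check (NumPy quartiles with linear interpolation; the function name is ours):

```python
import numpy as np

def iqr_outliers(values, k: float = 1.5):
    """Return the values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return v[(v < lo) | (v > hi)]
```

With only two measurements per table, as here, the interpolated fences always contain both values, so the rule can flag nothing; it only becomes informative at larger sample sizes.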
Active time spent working on the task as recorded by Label Studio, excluding time the task was open but idle.
| Statistic | Value |
|---|---|
| Measurements | 2 |
| Median (min) | 5.9 |
| Mean (min) | 5.9 |
| Q1 – Q3 (min) | 4.9 – 7.0 |
| Min (min) | 3.8 |
| Max (min) | 8.1 |
Total elapsed time from when the annotator first opened the task to submission, which may include breaks or interruptions.
| Statistic | Value |
|---|---|
| Measurements | 2 |
| Median (min) | 0.8 |
| Mean (min) | 0.8 |
| Q1 – Q3 (min) | 0.4 – 1.2 |
| Min (min) | 0.0 |
| Max (min) | 1.6 |
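Both timing measures can be recovered from a Label Studio JSON export. A sketch under the assumption of the default export layout, where each annotation carries a `lead_time` field (seconds of active work), `created_at`/`updated_at` timestamps, and `completed_by`; field shapes vary across Label Studio versions, so treat this as approximate:

```python
import json
from datetime import datetime

def timing_minutes(export_path: str):
    """Yield (task_id, annotator_id, active_min, elapsed_min) per annotation."""
    with open(export_path) as f:
        tasks = json.load(f)
    for task in tasks:
        for ann in task.get("annotations", []):
            active = ann.get("lead_time", 0.0) / 60.0  # lead_time is in seconds
            # Strip the trailing "Z" so fromisoformat works on Python < 3.11.
            created = datetime.fromisoformat(ann["created_at"].replace("Z", "+00:00"))
            updated = datetime.fromisoformat(ann["updated_at"].replace("Z", "+00:00"))
            elapsed = (updated - created).total_seconds() / 60.0
            # completed_by may be an int user id or a nested object, depending on version.
            yield task["id"], ann.get("completed_by"), active, elapsed
```

Because `lead_time` accumulates active editing time (possibly across draft sessions) while the timestamp delta only spans the stored annotation record, the two measures can disagree in either direction, which may explain why the elapsed figures above are shorter than the active ones.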
Each document is shown with annotated spans highlighted. Darker shading indicates higher annotator agreement on a given span.
Distribution of error labels, subcategories, and impact ratings across all documents in the project.
Distribution of annotator ratings on overall translation correspondence (accuracy) and readability, on a 1–4 scale.
Percentage of error spans where two or more annotators identified identical start and end character positions.
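A sketch of that computation, assuming spans arrive as `(task_id, annotator_id, start, end)` tuples (names ours):

```python
from collections import defaultdict

def exact_match_pct(spans):
    """Percent of spans whose exact (start, end) was also marked by another annotator.

    `spans`: iterable of (task_id, annotator_id, start, end) tuples.
    """
    spans = list(spans)
    marked_by = defaultdict(set)  # (task_id, start, end) -> annotator ids
    for task_id, annotator_id, start, end in spans:
        marked_by[(task_id, start, end)].add(annotator_id)
    matched = sum(
        1 for task_id, _, start, end in spans
        if len(marked_by[(task_id, start, end)]) >= 2
    )
    return 100.0 * matched / len(spans) if spans else 0.0
```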
F1 scores measuring partial overlap between annotator spans. Unlike exact matching, partial credit is awarded when spans overlap, making this a more lenient measure of boundary agreement. Production deployment target: ≥0.70.
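The report does not spell out its matching scheme here; one common lenient variant, sketched below, counts any character overlap between two annotators' spans as a match and computes F1 from the matched fraction on each side:

```python
def span_overlap_f1(spans_a, spans_b):
    """Lenient span-boundary F1 between two annotators for one document.

    spans_a, spans_b: lists of (start, end) pairs; any overlap counts as a match.
    (One common definition; the report's exact scheme may differ.)
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    # Fraction of each annotator's spans matched by at least one span of the other.
    hits_a = sum(any(overlaps(a, b) for b in spans_b) for a in spans_a)
    hits_b = sum(any(overlaps(b, a) for a in spans_a) for b in spans_b)
    precision = hits_a / len(spans_a) if spans_a else 0.0
    recall = hits_b / len(spans_b) if spans_b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```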
Pairwise Cohen's Kappa scores measuring agreement on error category labels between each pair of annotators, accounting for chance agreement. Kappa values: ≤0.20 poor · 0.21–0.40 fair · 0.41–0.60 moderate · 0.61–0.80 substantial · >0.80 almost perfect. Production deployment target: ≥0.70.
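Given aligned category labels (one label per annotator for each matched span, e.g. via the overlap matching sketched above), pairwise Kappa reduces to a scikit-learn call:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels_by_annotator):
    """Pairwise Cohen's Kappa over aligned label sequences.

    labels_by_annotator maps annotator_id -> list of category labels,
    where index i refers to the same matched span for every annotator.
    """
    return {
        (a, b): cohen_kappa_score(labels_by_annotator[a], labels_by_annotator[b])
        for a, b in combinations(sorted(labels_by_annotator), 2)
    }
```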