QUALITY REVIEWS – Inter-Rater Reliability Report

Round date: 2026-03-19  |  Annotators: 2  |  Annotations: 2 across 1 task

Annotator Completeness — All Documents

Span-level Completeness (% filled)

Percentage of annotated spans where each annotator filled in the subcategory, impact level, and span-level comment fields. Values below 80% are highlighted in orange.

| task_id | annotator_id | total_spans | subcategory_% | impact_% | span_comments_% |
|---------|--------------|-------------|---------------|----------|-----------------|
| 26      | 1            | 8           | 100.0         | 100.0    | 100.0           |
| 26      | 110          | 8           | 100.0         | 100.0    | 100.0           |
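For reference, a minimal sketch of how completeness percentages like these can be computed from a flat span export, assuming a pandas DataFrame with one row per span and columns named task_id, annotator_id, subcategory, impact, and span_comment (the column names are assumptions, not the actual export schema):

```python
import pandas as pd

def pct_filled(series: pd.Series) -> float:
    """Percent of values that are non-null and non-empty."""
    filled = series.notna() & (series.astype(str).str.strip() != "")
    return 100.0 * filled.mean()

def span_completeness(spans: pd.DataFrame) -> pd.DataFrame:
    """Per-annotator field completeness, grouped by (task_id, annotator_id)."""
    return (
        spans.groupby(["task_id", "annotator_id"])
        .agg(
            total_spans=("subcategory", "size"),
            subcategory_pct=("subcategory", pct_filled),
            impact_pct=("impact", pct_filled),
            span_comments_pct=("span_comment", pct_filled),
        )
        .reset_index()
    )
```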

Document-level Assessment Completeness

Whether each annotator completed all five document-level fields: document issues, correspondence score, correspondence comment, readability score, and readability comment. Missing fields are shown in red.

| task_id | annotator_id | document_issues | correspond_score | correspond_comment | readable_score | readable_comment |
|---------|--------------|-----------------|------------------|--------------------|----------------|------------------|
| 26      | 1            |                 |                  |                    |                |                  |
| 26      | 110          |                 |                  |                    |                |                  |

Annotator Timing — All Documents

Outliers in both measures are identified using the 1.5×IQR rule.
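The 1.5×IQR (Tukey) rule flags any measurement more than 1.5 interquartile ranges below Q1 or above Q3. A minimal illustration, using made-up sample values rather than this round's raw timings:

```python
import numpy as np

def iqr_outlier_bounds(values, k: float = 1.5):
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

times_min = [3.8, 5.9, 8.1, 42.0]          # illustrative values only
low, high = iqr_outlier_bounds(times_min)
outliers = [t for t in times_min if t < low or t > high]
```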

Lead Time

Active time spent working on the task as recorded by Label Studio, excluding time the task was open but idle.

| Statistic      | Value     |
|----------------|-----------|
| Measurements   | 2         |
| Median (min)   | 5.9       |
| Mean (min)     | 5.9       |
| Q1 – Q3 (min)  | 4.9 – 7.0 |
| Min (min)      | 3.8       |
| Max (min)      | 8.1       |

Review Time

Total elapsed time from when the annotator first opened the task to submission, which may include breaks or interruptions.

| Statistic      | Value     |
|----------------|-----------|
| Measurements   | 2         |
| Median (min)   | 0.8       |
| Mean (min)     | 0.8       |
| Q1 – Q3 (min)  | 0.4 – 1.2 |
| Min (min)      | 0.0       |
| Max (min)      | 1.6       |

Annotation Overlap Visualization — All Documents

Each document is shown with annotated spans highlighted. Darker shading indicates higher annotator agreement on a given span.


Document ID: 26

Number of unique annotators: 2  |  Total annotations: 16  |  Maximum overlap: 2 annotators
Quality reviews are scheduled after common production tasks within translation and localization workflows as a check that the developing product meets the minimum quality standards for that stage, and to prevent unnecessary downstream issues and the losses that come with rework. The issues and expectation-exceeding performance that quality reviewers flag help quality managers understand where to encourage good performance and where to focus root-cause analysis to prevent issues, and those root-cause analyses help project managers optimize processes and resources to build high-performing productions over time.

## Quality Reviewer - Primary Duties

- **Review translated technical documentation** for accuracy, consistency, and completeness, including correct measurement conversions and proper application of language-specific conventions

---

*Source: Adapted from Iverson Language Associates Quality Reviewer Training Materials*

Legend:

1 annotator
2 annotators
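One way shading intensity like this can be derived is by counting, for each character offset, how many distinct annotators marked a span covering it. A sketch under that assumption (the report's actual overlap computation may differ):

```python
def overlap_depth(doc_length: int, spans) -> list[int]:
    """Number of distinct annotators covering each character position.
    `spans` is an iterable of (annotator_id, start, end), end exclusive."""
    covered = [set() for _ in range(doc_length)]
    for annotator_id, start, end in spans:
        for pos in range(max(start, 0), min(end, doc_length)):
            covered[pos].add(annotator_id)
    return [len(annotators) for annotators in covered]

# depth[i] == 2 -> both annotators marked position i (darkest shading);
# depth[i] == 1 -> a single annotator; 0 -> unannotated text.
```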

Error Type Distribution — All Documents

Distribution of error labels, subcategories, and impact ratings across all documents in the project.

Correspondence & Readability Ratings — All Documents

Distribution of annotator ratings on overall translation correspondence (accuracy) and readability, on a 1–4 scale.

Exact Span Matching — All Documents

Percentage of error spans where two or more annotators identified identical start and end character positions.
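The report does not spell out the exact formula; a common formulation counts a span as matched when another annotator marked identical character offsets, for example:

```python
def exact_match_pct(spans_a, spans_b) -> float:
    """Percent of annotator A's spans whose (start, end) offsets were also
    marked by annotator B. Spans are (start, end) tuples, end exclusive."""
    if not spans_a:
        return 0.0
    other = set(spans_b)
    return 100.0 * sum(1 for span in spans_a if span in other) / len(spans_a)
```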

Span F1 — Partial Match Agreement — All Documents

F1 scores measuring partial overlap between annotator spans. Unlike exact matching, partial credit is awarded when spans overlap, making this a more lenient measure of boundary agreement. Production deployment target: ≥0.70.
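A character-level sketch of partial-overlap F1 between two annotators' span sets; this is one common formulation, not necessarily the exact metric implemented here:

```python
def span_overlap_f1(reference_spans, candidate_spans) -> float:
    """F1 over character offsets covered by each annotator's spans,
    so partially overlapping spans still earn credit."""
    ref_chars = {i for start, end in reference_spans for i in range(start, end)}
    cand_chars = {i for start, end in candidate_spans for i in range(start, end)}
    if not ref_chars or not cand_chars:
        return 0.0
    overlap = len(ref_chars & cand_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)
```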

Cohen's Kappa — Error Category Agreement — All Documents

Pairwise Cohen's Kappa scores measuring agreement on error category labels between each pair of annotators, accounting for chance agreement. Kappa values: ≤0.20 poor · 0.21–0.40 fair · 0.41–0.60 moderate · 0.61–0.80 substantial · >0.80 almost perfect. Production deployment target: ≥0.70.
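Pairwise kappa on category labels can be computed with scikit-learn once each pair's spans are aligned; a minimal sketch with illustrative labels (not this round's data):

```python
from sklearn.metrics import cohen_kappa_score

# Error-category labels the two annotators assigned to the same aligned spans.
labels_annotator_1   = ["accuracy", "terminology", "accuracy", "style"]
labels_annotator_110 = ["accuracy", "terminology", "fluency",  "style"]

kappa = cohen_kappa_score(labels_annotator_1, labels_annotator_110)
print(f"Cohen's kappa: {kappa:.2f}")   # chance-corrected category agreement
```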