Percentage of annotated spans where each annotator filled in the subcategory, impact level, and span-level comment fields. Values below 80% are highlighted in orange.
| task_id | annotator_id | total_spans | subcategory_% | impact_% | span_comments_% |
|---|---|---|---|---|---|
| 26 | 1 | 8 | 100.0 | 100.0 | 100.0 |
| 26 | 110 | 8 | 100.0 | 100.0 | 100.0 |
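These fill rates can be reproduced from a flat span table. A minimal pandas sketch, where the column names (`subcategory`, `impact`, `span_comment`) are assumptions rather than the project's actual schema:

```python
import pandas as pd

def field_fill_rates(spans: pd.DataFrame) -> pd.DataFrame:
    """Per (task_id, annotator_id): percent of spans with each optional field filled."""
    fields = ["subcategory", "impact", "span_comment"]  # assumed column names
    flags = pd.DataFrame({
        # Treat NaN and empty/whitespace-only strings as unfilled.
        f: spans[f].notna() & spans[f].astype(str).str.strip().ne("")
        for f in fields
    })
    flags[["task_id", "annotator_id"]] = spans[["task_id", "annotator_id"]]
    grouped = flags.groupby(["task_id", "annotator_id"])
    rates = (grouped[fields].mean() * 100).round(1)
    rates.columns = [f + "_%" for f in fields]
    rates.insert(0, "total_spans", grouped.size())
    return rates.reset_index()
```

Empty strings are treated the same as missing values here, since free-text fields are often saved as `""` rather than null.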
Whether each annotator completed all five document-level fields: document issues, correspondence score, correspondence comment, readability score, and readability comment. Missing fields are shown in red.
| task_id | annotator_id | document_issues | correspond_score | correspond_comment | readable_score | readable_comment |
|---|---|---|---|---|---|---|
| 26 | 1 | ✓ | ✓ | ✗ | ✓ | ✗ |
| 26 | 110 | ✓ | ✓ | ✗ | ✓ | ✗ |
Outliers in both timing measures below are flagged using the 1.5×IQR rule, i.e. values falling outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
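A minimal sketch of that check (NumPy quartiles with linear interpolation; the function name is ours):

```python
import numpy as np

def iqr_outliers(values, k: float = 1.5):
    """Return the values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return v[(v < lo) | (v > hi)]
```

With only two measurements per table, as here, the interpolated fences always contain both values, so the rule can flag nothing; it only becomes informative at larger sample sizes.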
Active time spent working on the task as recorded by Label Studio, excluding time the task was open but idle.
| Statistic | Value |
|---|---|
| Measurements | 2 |
| Median (min) | 5.9 |
| Mean (min) | 5.9 |
| Q1 – Q3 (min) | 4.9 – 7.0 |
| Min (min) | 3.8 |
| Max (min) | 8.1 |
Total elapsed time from when the annotator first opened the task to submission, which may include breaks or interruptions.
| Statistic | Value |
|---|---|
| Measurements | 2 |
| Median (min) | 0.8 |
| Mean (min) | 0.8 |
| Q1 – Q3 (min) | 0.4 – 1.2 |
| Min (min) | 0.0 |
| Max (min) | 1.6 |
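Both timing measures can be recovered from a Label Studio JSON export. A sketch under the assumption of the default export layout, where each annotation carries a `lead_time` field (seconds of active work), `created_at`/`updated_at` timestamps, and `completed_by`; field shapes vary across Label Studio versions, so treat this as approximate:

```python
import json
from datetime import datetime

def timing_minutes(export_path: str):
    """Yield (task_id, annotator_id, active_min, elapsed_min) per annotation."""
    with open(export_path) as f:
        tasks = json.load(f)
    for task in tasks:
        for ann in task.get("annotations", []):
            active = ann.get("lead_time", 0.0) / 60.0  # lead_time is in seconds
            # Strip the trailing "Z" so fromisoformat works on Python < 3.11.
            created = datetime.fromisoformat(ann["created_at"].replace("Z", "+00:00"))
            updated = datetime.fromisoformat(ann["updated_at"].replace("Z", "+00:00"))
            elapsed = (updated - created).total_seconds() / 60.0
            # completed_by may be an int user id or a nested object, depending on version.
            yield task["id"], ann.get("completed_by"), active, elapsed
```

Because `lead_time` accumulates active editing time (possibly across draft sessions) while the timestamp delta only spans the stored annotation record, the two measures can disagree in either direction, which may explain why the elapsed figures above are shorter than the active ones.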
Each document is shown with annotated spans highlighted. Darker shading indicates higher annotator agreement on a given span.
Distribution of error labels, subcategories, and impact ratings across all documents in the project.
Distribution of annotator ratings on overall translation correspondence (accuracy) and readability, on a 1–4 scale.
Percentage of error spans where two or more annotators identified identical start and end character positions.
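A sketch of that computation, assuming spans arrive as `(task_id, annotator_id, start, end)` tuples (names ours):

```python
from collections import defaultdict

def exact_match_pct(spans):
    """Percent of spans whose exact (start, end) was also marked by another annotator.

    `spans`: iterable of (task_id, annotator_id, start, end) tuples.
    """
    spans = list(spans)
    marked_by = defaultdict(set)  # (task_id, start, end) -> annotator ids
    for task_id, annotator_id, start, end in spans:
        marked_by[(task_id, start, end)].add(annotator_id)
    matched = sum(
        1 for task_id, _, start, end in spans
        if len(marked_by[(task_id, start, end)]) >= 2
    )
    return 100.0 * matched / len(spans) if spans else 0.0
```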
F1 scores measuring partial overlap between annotator spans. Unlike exact matching, partial credit is awarded when spans overlap, making this a more lenient measure of boundary agreement. Production deployment target: ≥0.70.
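The report does not spell out its matching scheme here; one common lenient variant, sketched below, counts any character overlap between two annotators' spans as a match and computes F1 from the matched fraction on each side:

```python
def span_overlap_f1(spans_a, spans_b):
    """Lenient span-boundary F1 between two annotators for one document.

    spans_a, spans_b: lists of (start, end) pairs; any overlap counts as a match.
    (One common definition; the report's exact scheme may differ.)
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    # Fraction of each annotator's spans matched by at least one span of the other.
    hits_a = sum(any(overlaps(a, b) for b in spans_b) for a in spans_a)
    hits_b = sum(any(overlaps(b, a) for a in spans_a) for b in spans_b)
    precision = hits_a / len(spans_a) if spans_a else 0.0
    recall = hits_b / len(spans_b) if spans_b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```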
Pairwise Cohen's Kappa scores measuring agreement on error category labels between each pair of annotators, accounting for chance agreement. Kappa values: ≤0.20 poor · 0.21–0.40 fair · 0.41–0.60 moderate · 0.61–0.80 substantial · >0.80 almost perfect. Production deployment target: ≥0.70.
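Given aligned category labels (one label per annotator for each matched span, e.g. via the overlap matching sketched above), pairwise Kappa reduces to a scikit-learn call:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels_by_annotator):
    """Pairwise Cohen's Kappa over aligned label sequences.

    labels_by_annotator maps annotator_id -> list of category labels,
    where index i refers to the same matched span for every annotator.
    """
    return {
        (a, b): cohen_kappa_score(labels_by_annotator[a], labels_by_annotator[b])
        for a, b in combinations(sorted(labels_by_annotator), 2)
    }
```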