Written for the practitioner, not the press release.
Specific articles on Arabic NLP gaps, bilingual alignment failure modes, and cultural-context annotation. Where current datasets fall short, we say so.


When token-level agreement masks meaning-level failure
High BLEU scores can coexist with annotations that fail on idiom, register, and cultural reference. This article maps the specific failure modes our review process catches.
Arabic NLP · 12 min read
Recent articles
Gulf vs. Levantine: labeling dialect in training data
Why collapsing Arabic dialect variants into a single label produces classifiers that generalize poorly across regions, and how granular dialect tagging changes model behavior.
8 min read

What makes a joke survive Arabic localization
Pragmatic humor depends on shared cultural reference frames. We examine three localization failure cases and the annotation decisions that would have prevented them.
10 min read

Building evaluation sets that test cultural reasoning
Standard benchmarks rarely probe whether an Arabic LLM understands cultural implication. We outline a methodology for constructing evaluation sets that do.
9 min read
The same rigor behind these articles shapes every dataset we build. See the work in the Datasets & Services catalogue.
