
Validation at the meaning layer, not the token layer.
We built SinoArabic because dialect, register, and cultural frame cannot be resolved by token-level alignment. Every dataset we produce is reviewed by native bilingual specialists before it leaves our pipeline.
Dialect Classification
Arabic is not monolithic, and alignment cannot begin until that is resolved.
Before a single pair enters our alignment pipeline, each Arabic segment is classified by dialect, register, and linguistic intent. Modern Standard Arabic, Gulf, Levantine, and Egyptian each carry assumptions that break quietly when ignored. Ambiguous cases are escalated to specialist review.
Cultural Frame Annotation
Idioms, humor, cultural references, and localization intent are annotated as distinct axes, independently of fluency scoring, so that meaning is preserved across languages. A sentence can be fluent and still fail at the concept it was supposed to carry.
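To make the idea of separate axes concrete, here is a minimal Python sketch of what a per-pair annotation record could look like. It is an illustration only: the class name, field names, register labels, and the 0.8 fluency threshold are all hypothetical and are not taken from our production schema. The point it shows is that a high fluency score cannot compensate for a failed cultural reference.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Dialect(Enum):
    # The four varieties named in the text; a real label set may be larger.
    MSA = "Modern Standard Arabic"
    GULF = "Gulf"
    LEVANTINE = "Levantine"
    EGYPTIAN = "Egyptian"


@dataclass
class PairAnnotation:
    """Hypothetical record: every axis lives in its own field."""
    dialect: Optional[Dialect]       # None marks an ambiguous segment
    register: str                    # assumed labels, e.g. "formal", "colloquial"
    fluency: float                   # 0.0 to 1.0, scored on its own
    cultural_reference_ok: bool      # did the idiom or reference carry over?
    localization_intent_ok: bool     # does the target serve the intended use?

    def needs_escalation(self) -> bool:
        # Ambiguous dialect is sent to specialist review instead of guessed.
        return self.dialect is None

    def passes(self, min_fluency: float = 0.8) -> bool:
        # Each axis must pass independently; nothing is averaged together.
        return (
            not self.needs_escalation()
            and self.fluency >= min_fluency
            and self.cultural_reference_ok
            and self.localization_intent_ok
        )


# A fluent sentence whose cultural reference did not survive translation.
fluent_but_wrong = PairAnnotation(
    dialect=Dialect.EGYPTIAN,
    register="colloquial",
    fluency=0.95,
    cultural_reference_ok=False,
    localization_intent_ok=True,
)
```

In this sketch, `fluent_but_wrong.passes()` is false even though its fluency is 0.95, which is the behavior a single blended score would hide.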


Specialists, not crowd workers.
Native bilingual specialists review each dataset, matched to it by domain expertise: NLP, localization, or multilingual evaluation workflows.
Each aligned pair passes three review stages: dialect and register confirmation, cultural-reference validation, and localization-intent sign-off. Rejection at any stage triggers re-annotation, not a fluency patch.
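The three-stage gate can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the stage keys, the dictionary fields, and the `review` function are hypothetical names, and the boolean lookups stand in for decisions that human specialists make. What it demonstrates is the ordering of the stages and that a rejection routes the pair back to re-annotation at the stage that failed.

```python
from typing import Callable, Dict, List, Tuple

Pair = Dict[str, object]
Check = Callable[[Pair], bool]

# Stage order follows the text. Each check is a placeholder for a
# specialist's decision, read here from a hypothetical boolean field.
STAGES: List[Tuple[str, Check]] = [
    ("dialect_register", lambda p: bool(p.get("dialect_confirmed"))),
    ("cultural_reference", lambda p: bool(p.get("cultural_reference_valid"))),
    ("localization_intent", lambda p: bool(p.get("localization_signed_off"))),
]


def review(pair: Pair) -> Tuple[str, str]:
    """Return ("accepted", "") or ("re-annotate", name_of_failed_stage)."""
    for name, check in STAGES:
        if not check(pair):
            # A rejection stops the pipeline; the pair is re-annotated,
            # not patched for fluency and waved through.
            return ("re-annotate", name)
    return ("accepted", "")


accepted = review({
    "dialect_confirmed": True,
    "cultural_reference_valid": True,
    "localization_signed_off": True,
})
rejected = review({"dialect_confirmed": True})
```

Returning the name of the failed stage keeps the re-annotation targeted: the annotator knows which axis to revisit.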
The only Arabic-Chinese dataset built at the meaning layer.
Generic cross-lingual corpora optimize for coverage. We optimize for correctness — the kind that only surfaces when a cultural reference lands, a dialect assumption is caught, and a classifier does not fail in production.
