Samsung Electronics has introduced “TRUEBench,” a dedicated benchmark for evaluating the work productivity of AI models.
The benchmark, developed by Samsung Research, stands for Trustworthy Real-world Usage Evaluation Benchmark. It was created to accurately measure the work productivity of AI models — something existing benchmarks struggle to do.
Most AI benchmarks available on the market are English-based and evaluate single-turn or limited-turn exchanges rather than continuous conversations. Samsung’s new tool is designed with a comprehensive set of metrics to measure how large language models (LLMs) perform in real-world workplace productivity applications.
The company claims that TRUEBench stands out from existing benchmarks by focusing on work productivity; its evaluation set consists of 10 categories, 46 tasks, and 2,485 detailed items.
Samsung built TRUEBench around varied dialogue scenarios and multilingual conditions, enabling more realistic AI evaluation. Users can also explore the company’s in-house AI test platform on the TRUEBench Hugging Face page.
Paul (Kyungwhoon) Cheun, CTO of the DX Division at Samsung Electronics and Head of Samsung Research, stated, “We expect TRUEBench to establish evaluation standards for productivity and solidify Samsung’s technological leadership.”
In total, TRUEBench comprises 2,485 test sets across 10 categories and 12 languages, and it supports cross-linguistic scenarios that examine what AI models can actually solve in practice.