Emdoor & IDEA Research Institute Unveil UniTTS: A Breakthrough End-to-End AI Voice Model to Revolutionize On-Device Human-Computer Interaction


Originally by: Emdoor Research Institute | July 03, 2025

In the modern digital landscape, the interface between humans and machines is increasingly defined by voice. From smartphone assistants to smart home controls, voice interaction technology is rapidly reshaping our daily lives. However, a persistent challenge remains: achieving truly natural, fluid, and emotionally resonant communication with our devices. The robotic, monotone nature of many existing systems highlights a critical gap.

Traditional voice interaction systems often struggle to fully capture and utilize the rich, non-verbal information embedded in human speech. These "paralinguistic features"—such as timbre, prosody, and emotion—are essential for natural communication but are frequently lost in translation by machines. This results in synthesized speech that lacks the authenticity and expressiveness we expect. As artificial intelligence advances, user expectations have evolved; we no longer want a machine that simply understands commands, but one that can communicate with personality and emotional nuance.

To shatter these limitations and usher in a new era of intelligent on-device voice interaction, the Emdoor Research Institute, in a landmark collaboration with the Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (IDEA) through their joint laboratory (COTLab), has developed UniTTS, a series of powerful end-to-end speech large models.


The Core Challenge: Beyond Words to Holistic Audio Understanding


One of the dominant approaches in modern Text-to-Speech (TTS) modeling relies on Large Language Models (LLMs) processing discrete audio codes. The effectiveness of this method hinges entirely on the quality of the audio's discrete encoding scheme. Many researchers attempt to separate acoustic features from semantic (content) features. However, this decoupling is fundamentally flawed. Not all speech information can be neatly categorized. For example, powerful emotional expressions like laughter, crying, or sarcasm are holistic audio events where acoustics and semantics are intrinsically linked. Furthermore, high-quality "universal audio" data, which includes rich background sounds or sound effects, defies simple separation.

While some have adopted multi-codebook solutions like GRFVQ-based methods to improve performance, this dramatically increases the bitrate of the discretized audio sequence. The resulting lengthy sequences significantly amplify the difficulty for LLMs to model the relationships within the audio, making low bitrate a critical metric for on-device performance.
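To make the bitrate concern concrete, here is a rough back-of-envelope sketch of how bitrate and sequence length scale with the number of codebooks. The frame rate and codebook sizes below are illustrative assumptions, not the paper's exact configuration; the point is that a multi-codebook scheme multiplies the number of tokens per second the LLM must model, while a single large codebook keeps the sequence short.

```python
import math

def codec_bitrate(frame_rate_hz: float, num_codebooks: int, codebook_size: int):
    """Return (tokens per second, bits per second) for a discrete audio codec."""
    tokens_per_sec = frame_rate_hz * num_codebooks      # one token per codebook per frame
    bits_per_token = math.log2(codebook_size)           # bits needed to index one codebook entry
    return tokens_per_sec, tokens_per_sec * bits_per_token

# Illustrative numbers only (not the paper's exact configuration):
multi = codec_bitrate(frame_rate_hz=75, num_codebooks=8, codebook_size=1024)
single = codec_bitrate(frame_rate_hz=75, num_codebooks=1, codebook_size=32768)

print(f"multi-codebook : {multi[0]:.0f} tok/s, {multi[1] / 1000:.1f} kbps")
print(f"single-codebook: {single[0]:.0f} tok/s, {single[1] / 1000:.2f} kbps")
```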

To address this, our work introduces DistilCodec and UniTTS. DistilCodec is a novel single-codebook encoder trained to achieve nearly 100% uniform codebook utilization. Using the discrete audio representations from DistilCodec, we trained the UniTTS model on the powerful Qwen2.5-7B backbone.

Our key contributions are:

  1. A Novel Distillation Method for Audio Encoding: We successfully employ a multi-codebook teacher model (GRFVQ) to distill its knowledge into a single-codebook student model (DistilCodec). This achieves near-perfect codebook utilization and provides a simple, efficient audio compression representation that does not require decoupling acoustic and semantic information.

  2. A True End-to-End Architecture (UniTTS): Built upon DistilCodec's ability to model complete audio features, UniTTS possesses full end-to-end capabilities for both input and output. This allows the audio generated by UniTTS to exhibit far more natural and authentic emotional expressiveness.

  3. A New Training Paradigm for Audio Language Models: We introduce a structured methodology:

    • Audio Perception Modeling: The training of DistilCodec, which focuses solely on feature discretization using universal audio data to enhance its robustness.

    • Audio Cognitive Modeling: The training of UniTTS, which is divided into three distinct phases: Pre-training, Supervised Fine-Tuning (SFT), and Alignment. This process leverages DistilCodec's complete audio feature modeling by incorporating a universal audio autoregressive task during pre-training. It also systematically validates the impact of different text-audio interleaved prompts during SFT and uses Direct Preference Optimization to further refine speech generation quality.


UniTTS & DistilCodec: The Technical Architecture


UniTTS System Architecture


The UniTTS architecture is composed of two primary components: the ALM (Audio Language Model) Tokenizer and the Transformer-based Backbone.

  • ALM Tokenizer: This includes a standard Text Tokenizer for processing text and our innovative Audio Encoder (DistilCodec) for discretizing and reconstructing audio.

  • Backbone: This leverages a decoder-only Transformer architecture (Qwen2.5-7B) to perform alternating autoregression across the two modalities of tokens (text and audio).

The model's vocabulary was expanded from its original size to 180,000 tokens to accommodate an additional 32,000 dedicated audio tokens generated by DistilCodec.
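As a rough illustration of the vocabulary expansion and the alternating text-audio autoregression, the sketch below maps DistilCodec indices into an expanded LLM vocabulary and builds an interleaved sequence. The vocabulary sizes, offset scheme, and special-token ids are assumptions for illustration, not the exact UniTTS implementation.

```python
# Hypothetical sizes: ~148k original text tokens + 32k audio tokens ≈ 180k total (see above).
TEXT_VOCAB_SIZE = 148_000
NUM_AUDIO_TOKENS = 32_000
AUDIO_OFFSET = TEXT_VOCAB_SIZE     # audio ids are placed after the text ids

def audio_code_to_token_id(code: int) -> int:
    """Map a DistilCodec codebook index to an id in the expanded vocabulary."""
    assert 0 <= code < NUM_AUDIO_TOKENS
    return AUDIO_OFFSET + code

def build_interleaved_sequence(text_ids, audio_codes, boa_id, eoa_id):
    """Text prompt, then <begin-of-audio>, the audio tokens, and <end-of-audio>."""
    return list(text_ids) + [boa_id] + [audio_code_to_token_id(c) for c in audio_codes] + [eoa_id]

# Toy usage with made-up ids for the text prompt and the two special markers.
seq = build_interleaved_sequence([101, 2043, 17], [5, 9123, 31999], boa_id=147_998, eoa_id=147_999)
print(seq)   # [101, 2043, 17, 147998, 148005, 157123, 179999, 147999]
```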


The DistilCodec Structure: Efficiency Through Distillation

The DistilCodec Structure

DistilCodec's network, as shown above, first converts raw audio into a spectrogram via a Fourier transform. This spectrogram is then passed through a stack of residual convolutional layers for feature compression. A quantizer, using a linear layer, projects these compressed features into the vicinity of a codebook vector. The index of the nearest vector becomes the discrete representation for that audio segment. For reconstruction, a GAN-based network reverses this process to generate the corresponding audio waveform.
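The quantization step described above (project the compressed features, then pick the nearest codebook vector and keep its index) can be sketched in a few lines. This is a generic nearest-neighbour vector quantizer for illustration; the array shapes and codebook size are assumptions, not DistilCodec's actual configuration.

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray):
    """
    Nearest-neighbour quantization as described in the text.
    features: (T, D) projected frame features; codebook: (K, D) codebook vectors.
    Returns (indices, quantized) where the indices are the discrete audio tokens.
    """
    # Squared Euclidean distance between every frame and every codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (T, K)
    indices = d2.argmin(axis=1)                                          # (T,)
    return indices, codebook[indices]

# Toy example: 5 frames, 16-dim features; a ~32k-entry codebook would be used in practice.
rng = np.random.default_rng(0)
codes, _ = quantize(rng.normal(size=(5, 16)), rng.normal(size=(256, 16)))
print(codes)
```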

The training process for DistilCodec.

The training process for DistilCodec is unique. We first train a "Teacher Codec" that combines GVQ, RVQ, and FVQ with 32 distinct codebooks. We then initialize a "Student Codec" (our DistilCodec) with the parameters of the Teacher's encoder and decoder. This Student Codec has both its residual depth and group count set to 1, making it a single-codebook model, but its codebook size equals the sum of the teacher's codebooks, allowing it to capture immense acoustic diversity in a highly efficient structure.
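A rough way to picture the size relationship: if each of the teacher's 32 codebooks held, say, 1,024 entries (an assumed figure, not stated here), the student's single codebook would hold their sum, 32,768 entries, which lines up with the roughly 32,000 dedicated audio tokens added to the LLM vocabulary earlier. The concatenation below only illustrates the size arithmetic; it is not necessarily how the paper initializes the student codebook.

```python
import numpy as np

# Assumed per-codebook size; the article does not state the teacher's individual codebook sizes.
ENTRIES_PER_CODEBOOK, DIM = 1024, 128
teacher_codebooks = [np.random.default_rng(i).normal(size=(ENTRIES_PER_CODEBOOK, DIM))
                     for i in range(32)]

# Single student codebook whose size is the sum of the teacher's 32 codebooks.
student_codebook = np.concatenate(teacher_codebooks, axis=0)
print(student_codebook.shape)   # (32768, 128)
```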


The Three-Stage Training Paradigm of UniTTS


Modeling audio presents a much larger representation space than text alone. Therefore, access to large-scale, high-quality text-audio paired data is a prerequisite for achieving general-purpose audio autoregression.


Stage 1: Pre-training


UniTTS employs a multi-stage pre-training strategy.

  • Phase One: We start with a pre-trained text-based LLM and introduce text data, universal audio data, and a limited amount of text-audio paired data. This phase teaches the model the fundamentals of audio modeling. A key challenge here is "modality competition," where introducing audio data can cause the model's original text generation capabilities to degrade.

  • Phase Two: To counteract this, we combine text-based instruction datasets with our existing universal audio and text-audio datasets. This reinforces and enhances the model's text-generation abilities while solidifying its audio skills.

  • Context Expansion: To accommodate the long-sequence nature of audio data, we expanded the model's context window from 8,192 to 16,384 tokens.
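To put the context expansion in perspective, here is a back-of-envelope estimate of how much audio fits in the window. The token rate is inferred from the roughly 1 kbps bitrate and the ~32k-entry single codebook reported later in this article, so treat the numbers as approximate.

```python
import math

bits_per_second = 1000               # "around 1 kbps" (see the results section)
bits_per_token = math.log2(32768)    # single codebook with ~32k entries -> 15 bits per token
tokens_per_second = bits_per_second / bits_per_token   # ~67 audio tokens per second

for context in (8192, 16384):
    minutes = context / tokens_per_second / 60
    print(f"{context} tokens ≈ {minutes:.1f} minutes of audio")
```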


Pre-training loss curve

Stage 2: Supervised Fine-Tuning (SFT)


The quality of data during SFT significantly impacts the final model's capabilities. Existing open-source text-audio datasets have notable flaws, including noisy ASR-generated labels and long, unnatural silences from sources like audiobooks. To overcome this, we designed a practical composite quality scoring method to filter and rank training samples:

Composite quality score combining dnsmos(i) and cer(i)

Here, dnsmos(i) effectively filters for acoustic quality, while cer(i) (Character Error Rate from re-annotation) filters out samples with inaccurate labels. By re-ranking and applying a threshold based on this quality score, we drastically improved the quality of our training data.
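Since the scoring formula itself appears only as a figure in the original article, the snippet below is an illustrative stand-in rather than the paper's exact formula: it simply rewards high DNSMOS (acoustic quality), penalizes high CER (label mismatch after re-annotation), and filters by a threshold, which is the behaviour the text describes. The weights and threshold are assumptions.

```python
def composite_quality(dnsmos: float, cer: float, w_acoustic: float = 0.5) -> float:
    """
    Illustrative composite score only -- the exact formula is not reproduced in this article.
    Higher DNSMOS raises the score; higher CER lowers it. DNSMOS is normalized by its 5-point scale.
    """
    return w_acoustic * (dnsmos / 5.0) + (1.0 - w_acoustic) * (1.0 - min(cer, 1.0))

def filter_samples(samples, threshold: float = 0.7):
    """Keep only samples whose composite quality clears the (assumed) threshold."""
    return [s for s in samples if composite_quality(s["dnsmos"], s["cer"]) >= threshold]

# Example: a clean, well-labeled sample passes; a noisy or mislabeled one is dropped.
data = [{"dnsmos": 4.2, "cer": 0.02}, {"dnsmos": 2.1, "cer": 0.35}]
print(len(filter_samples(data)))   # 1
```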


Stage 3: Preference Alignment


While SFT helps the model learn specific speech patterns, it can sometimes lead to issues like unnatural prosodic prolongation or repetition—an auditory equivalent of the "parroting" seen in text-only LLMs. To refine this, we adopted preference optimization. However, standard Direct Preference Optimization (DPO) can be unstable for long-sequence audio modeling and may lead to mode collapse.


Preference Alignment

Therefore, UniTTS introduces Linear Preference Optimization (LPO) as a more stable alternative. In the LPO loss function, where x1 and x2 represent positive and negative samples, the model refines its policy gradient by gently promoting the positive sample's policy while suppressing the pass-through estimation for both samples. This stabilizes the preference optimization process for long audio sequences, leading to more robust and natural outputs.
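For orientation, here is the standard DPO objective that LPO is positioned against; the LPO loss itself is shown only as a figure in the original article and is not reproduced here. The sketch uses sequence-level log-probabilities of the positive sample x1 and the negative sample x2 under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    """
    Standard DPO loss, for reference: log-probabilities of the chosen (positive) and
    rejected (negative) audio sequences under the policy and a frozen reference model.
    UniTTS replaces this objective with LPO, whose exact form is not reproduced here.
    """
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()

# Toy tensors standing in for sequence log-likelihoods of x1 (positive) and x2 (negative).
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-150.0]),
                torch.tensor([-125.0]), torch.tensor([-148.0]))
print(loss.item())
```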


Experimental Results: A New State-of-the-Art


We evaluated DistilCodec's perplexity (PPL) and codebook utilization (Usage) on the LibriSpeech-Clean dataset and our self-built Universal Audio dataset. The results confirm that DistilCodec achieves nearly 100% codebook utilization on both speech and general audio datasets.

Comparison of codebook size, utilization (Usage), and perplexity (PPL)
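For reference, codebook utilization and codebook perplexity are commonly computed from the empirical distribution of emitted codes, along the lines of the sketch below; whether the paper uses exactly these definitions is not stated in this article.

```python
import numpy as np

def codebook_stats(indices: np.ndarray, codebook_size: int):
    """
    Utilization: fraction of codebook entries used at least once.
    Perplexity: exp of the entropy of the empirical code distribution
    (it equals codebook_size when every entry is used uniformly).
    """
    counts = np.bincount(indices, minlength=codebook_size).astype(np.float64)
    usage = (counts > 0).mean()
    probs = counts / counts.sum()
    entropy = -(probs[probs > 0] * np.log(probs[probs > 0])).sum()
    return float(usage), float(np.exp(entropy))

# Sanity check: uniform usage of a 32768-entry codebook gives usage = 1.0 and PPL = 32768.
codes = np.arange(32768).repeat(2)
print(codebook_stats(codes, 32768))
```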

Furthermore, a comprehensive analysis on the LibriSpeech-Clean-Test benchmark demonstrates DistilCodec's superior speech reconstruction capabilities. At a highly efficient bitrate of around 1 kbps, DistilCodec achieves state-of-the-art (SOTA) performance on the STOI metric, indicating excellent speech intelligibility.

Comprehensive comparison of different Codec models

To conduct a rigorous evaluation of the complete system, we compared UniTTS against a suite of existing leading methods, including CosyVoice2, Spark-TTS, LLaSA, F5-TTS, and Fish-Speech. The results unequivocally show that UniTTS-LPO, the final aligned model, achieves comprehensive improvements in emotional expressiveness, fidelity, and naturalness when compared to the SFT-only version and all other competing models. This validates the effectiveness of our distillation-driven codec, holistic feature modeling, and advanced LPO training methodology.


The Emdoor Advantage: From Research Lab to Rugged Reality


This research isn't just an academic exercise. For a company like Emdoor, a leader in rugged computing solutions, the development of UniTTS is a strategic move to redefine on-device human-computer interaction in the world's most demanding environments.

The efficiency of DistilCodec and the power of UniTTS are perfectly suited for the edge computing scenarios where Emdoor devices excel. Consider the real-world applications:

  • Field Service & Manufacturing: A technician in a noisy factory can issue complex, natural language commands to their rugged tablet, receiving clear, calm, and contextually appropriate synthesized audio feedback, even over the sound of heavy machinery.

  • First Responders & Public Safety: Paramedics can interact with their devices hands-free, receiving critical patient data read aloud with a tone that conveys urgency without causing panic. Police officers can operate in-vehicle systems with fluid voice commands, keeping their hands and eyes on the situation.

  • Logistics & Warehousing: Workers operating forklifts or managing inventory can communicate with the warehouse management system via voice, improving efficiency and safety without needing to stop and use a keypad.

The on-device nature of UniTTS means these interactions can happen instantly, without reliance on a stable cloud connection—a critical requirement for mobile and field operations. By integrating this technology into their rugged laptops, tablets, and handhelds, Emdoor is poised to deliver a user experience that is not only more efficient but also fundamentally more human.


Conclusion: The Future of Voice is Here


Through its highly efficient discrete encoding technology, DistilCodec has achieved near-perfect utilization of a single codebook, laying a robust foundation for versatile and adaptive audio LLMs. Building on this, the UniTTS model, with its stable three-stage cross-modal training strategy, represents a significant leap forward.

In the context of human-computer interaction, UniTTS does more than just improve the naturalness and fluency of voice exchange. It brings a new dimension of emotion and personality to the user experience, transforming devices from simple tools into intuitive, responsive partners. This collaboration between Emdoor Research Institute and IDEA Research Institute is not merely an innovation in AI; it is the blueprint for the future of on-device interaction.