BEIJING, July 10 (Xinhua) — Chinese researchers have built a new tri-modal pre-training model to realize mutual generation between speech and image.
The model, OPT (Omni-Perception Pre-Trainer), can jointly learn multi-modal content across text, speech, image and video.
Existing pre-training models generally cover the image, video and text modalities while ignoring the speech information in the environment. To address these limitations, the new model can perform cross-modal generation tasks such as image generation from text, text generation from image, and image generation from speech.
The development of the new model will promote the advancement of artificial intelligence (AI) and significantly improve performance on basic tasks involving text, speech, image and video, according to the Institute of Automation, Chinese Academy of Sciences, the developer of the model.
It has great potential value in speech recognition and synthesis, as well as in commercial applications such as human-computer interaction and autonomous driving.