In a paper detailing DeepSinger, Microsoft says it trains the AI on data crawled from music-focused websites. Using a uniquely designed component, the research team trained the AI to isolate the timbre of a singer's voice even in noisy data. According to Microsoft Research, training an AI to understand singing voices is more complicated than handling regular spoken language because the rhythms and patterns differ; the AI has to model the duration and pitch of each word. Furthermore, the company says a lack of large singing training datasets has also impeded progress. Still, DeepSinger overcame these challenges with a data pipeline that crawls popular songs by leading artists, across multiple languages, from music websites.
How It Works
From the crawled songs, DeepSinger first isolates the singing voice from the rest of the audio (instruments, backing music) using Spleeter, an open-source source-separation tool, and then segments the vocal track into sentence-level clips. Next, the AI extracts the duration of each phoneme, the individual sounds that make up words, to align the audio with the lyrics. Finally, the model assigns each sample a confidence score based on alignment accuracy and filters out low-scoring data. Microsoft showcased the model in action with some samples.

To get this far, DeepSinger needed to crawl thousands of songs across music websites in English, Chinese, and Cantonese. Each song was filtered by length, all were normalized to a standard volume range, and any songs with poor-quality audio were removed from the dataset. By the end, the AI had compiled a training dataset, called Singing-Wild, of 92 hours of songs by 89 singers. The research team says that in quantitative tests, the synthesized voices match the pitch of the original songs with over 85% accuracy.
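The paper credits Spleeter for the vocal-separation step. As a minimal sketch of that step alone, the snippet below uses Spleeter's published Python API with its pretrained two-stem model; the file paths are placeholders chosen for illustration.

```python
# pip install spleeter
from spleeter.separator import Separator

# Pretrained 2-stem model: splits audio into vocals and accompaniment.
separator = Separator("spleeter:2stems")

# Placeholder paths; with the defaults, Spleeter writes
# output/song/vocals.wav and output/song/accompaniment.wav.
separator.separate_to_file("song.mp3", "output/")
```

The isolated vocals.wav track is what a pipeline like this would then segment and align against the lyrics.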
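The dataset cleanup pass itself is not spelled out in the article, so the sketch below is only a guess at its shape: a clip-length filter, peak normalization to bring audio into a standard volume range, and a confidence threshold to drop poorly aligned samples. Every name and threshold here is hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical thresholds; the paper's actual values aren't given here.
MIN_SECONDS, MAX_SECONDS = 1.0, 12.0
CONFIDENCE_THRESHOLD = 0.8
TARGET_PEAK = 0.9

def normalize_peak(samples: np.ndarray) -> np.ndarray:
    """Scale a waveform so its loudest sample sits at TARGET_PEAK."""
    peak = np.max(np.abs(samples))
    return samples if peak == 0 else samples * (TARGET_PEAK / peak)

def keep_clip(duration_s: float, confidence: float) -> bool:
    """Drop clips that are too short, too long, or poorly aligned."""
    return (MIN_SECONDS <= duration_s <= MAX_SECONDS
            and confidence >= CONFIDENCE_THRESHOLD)
```

Peak normalization is just one way to bring clips into a common volume range; RMS or loudness-based normalization would serve the same purpose.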