OpenAI introduces neural net that generates music including vocals

The artificial intelligence research institute OpenAI has developed Jukebox, a model that can fully generate music tracks in various genres, including the vocals.

OpenAI has published 7131 samples of songs created with Jukebox. The songs range from pop, rock and jazz to reggae and hip-hop. In addition to fully self-generated material in certain styles, there are also reinterpretations of songs and samples of well-known tracks such as Hotel California by the Eagles, which Jukebox takes in a completely different direction 12 seconds after the original start.

For its model, the research institute does not use a symbolic representation of music, such as musical notes or MIDI data, but raw audio. OpenAI still trained its previous AI music generator, MuseNet, on thousands of MIDI files. The disadvantage of training on such representations, according to the makers, is that the human voice is not included and subtle properties of dynamics and expressiveness are lost as well.

Training on raw audio is a lot more difficult, because the models have to cope with a greater degree of diversity and much longer structures. For training, OpenAI used a dataset of 1.2 million tracks. Of those, 600,000 are in English, but the developers plan to use more international music to train Jukebox in the future. The music is combined with accompanying lyrics from LyricWiki and metadata about the genre, artist and other keywords.

Encoding existing music and generating, upsampling and decoding new audio by Jukebox

OpenAI uses 32-bit, 44.1 kHz audio as its base and compresses tracks at three levels for training: by factors of 8, 32 and 128. These levels are for encoding the input. During this downsampling a lot of audio information is lost, but the essential details in terms of pitch, timbre and volume are preserved. On the basis of the input, new codes are then generated, likewise at three levels. The upper level models the long structure of tracks, with vocals and melodies, and information such as genre can be added to this layer. The audio quality of this output is low. The middle and bottom levels take care of the rest of the musical structures and improve the audio quality.
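
To get a feel for what these compression factors mean, the short Python sketch below works out how many codes each level would have to produce per second and per minute of 44.1 kHz audio. It is only back-of-the-envelope arithmetic, not OpenAI's code. The function names are made up for illustration, and the mapping of the factors to the bottom, middle and top level is an assumption based on the description above, in which the upper level carries the long structure at the lowest audio quality.

```python
# Back-of-the-envelope sketch, not OpenAI's implementation.
# Sample rate and compression factors are the figures quoted in the article.
SAMPLE_RATE_HZ = 44_100

# Assumption: the most compressed level (128x) is the upper level.
COMPRESSION_FACTORS = {"bottom": 8, "middle": 32, "top": 128}


def codes_per_second(sample_rate_hz: int, factor: int) -> float:
    """Average number of codes needed for one second of audio."""
    return sample_rate_hz / factor


def codes_for_duration(seconds: int, sample_rate_hz: int, factor: int) -> int:
    """Whole codes needed to cover `seconds` of audio (partial code dropped)."""
    return (seconds * sample_rate_hz) // factor


rates = {
    level: codes_per_second(SAMPLE_RATE_HZ, factor)
    for level, factor in COMPRESSION_FACTORS.items()
}
per_minute = {
    level: codes_for_duration(60, SAMPLE_RATE_HZ, factor)
    for level, factor in COMPRESSION_FACTORS.items()
}

for level in COMPRESSION_FACTORS:
    print(f"{level:>6}: {rates[level]:8.1f} codes/s, {per_minute[level]:7d} codes/min")
```

The numbers illustrate the trade-off: at the upper level a full minute of music fits in roughly twenty thousand codes, which makes long-range structure tractable, while the bottom level needs sixteen times as many codes to restore the detail.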

According to the researchers, Jukebox still has some serious drawbacks and is not yet able to generate larger musical structures such as recurring choruses. The downsampling and upsampling also introduce noise, and the model is slow: it takes about nine hours to generate a minute of audio. They nevertheless call the model a step forward in creating coherent music with a neural net, with output that can be steered by artist, genre and song lyrics.