Dual Diffusion is a generative diffusion model for video game music. The model is still a work in progress. More information about the project, along with the full source code, is available on GitHub.
The most recent model (U3) has 500M parameters in the music diffusion model, 25M in the autoencoder, and 15M in the diffusion decoder. It was trained for 460k steps on a single RTX 5090 GPU over the course of 4 weeks. While I originally intended to release this model's weights, I'm not entirely satisfied with the model's performance and will be continuing development until I feel I have something worth releasing publicly.
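For orientation, here is a toy sketch of how these three components could fit together as a latent diffusion pipeline: the large model iteratively denoises latents under a conditioning embedding, and a decoder maps the finished latents back to audio. Everything here (names, shapes, and the division of labor between the autoencoder and the diffusion decoder) is an illustrative assumption; the actual architecture is in the GitHub repo.

```python
# Toy sketch only -- not the project's code. Names and shapes are assumptions.
import torch

def sample_latents(denoise_step, cond, shape=(1, 8, 4096), steps=50):
    """Iteratively denoise random latents. `denoise_step` stands in for one
    reverse-diffusion update of the 500M-parameter music diffusion model."""
    latents = torch.randn(shape)
    for t in reversed(range(steps)):
        latents = denoise_step(latents, t, cond)
    return latents

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    denoise_step = lambda x, t, c: 0.98 * x                  # placeholder denoiser
    decode = lambda z: torch.tanh(z[:, :2].repeat(1, 1, 4))  # placeholder latent->audio decoder
    cond = torch.randn(1, 512)                               # conditioning embedding (e.g. CLAP)
    audio = decode(sample_latents(denoise_step, cond))
    print(audio.shape)  # torch.Size([1, 2, 16384]) -- toy "stereo" output
```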
The model is conditioned on CLAP audio embeddings and is designed to be used with example audio (see the embedding sketch below). The training data consists of video game music from the mid-90s to the present day. While this dataset (570k tracks) is substantially larger than the SNES dataset (20k tracks), it includes many tracks with poor audio quality or lossy codecs (Sega Saturn, PS1, etc.). As a result, the audio quality of generated samples can vary from phono-realistic to 96kbps WMA. A small minority of the tracks in the dataset contain vocals, but the model was not conditioned on lyrics or transcriptions. Although you might occasionally hear intelligible words or phrases, most tracks with vocals tend to sound like Simlish.
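Since conditioning works through CLAP audio embeddings, generating something "in the style of" a track starts by embedding that track. The sketch below shows a generic way to compute a CLAP audio embedding using the Hugging Face transformers implementation; the checkpoint name and file path are assumptions and may differ from what this project actually uses.

```python
# Generic sketch: compute a CLAP audio embedding from an example track.
# The specific CLAP checkpoint and preprocessing the project uses may differ.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

checkpoint = "laion/clap-htsat-unfused"  # assumed checkpoint, not necessarily the project's
model = ClapModel.from_pretrained(checkpoint)
processor = ClapProcessor.from_pretrained(checkpoint)

# CLAP's audio encoder expects 48kHz mono input.
audio, sr = librosa.load("example_track.ogg", sr=48000, mono=True)
inputs = processor(audios=audio, sampling_rate=48000, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_audio_features(**inputs)  # shape: (1, 512)
print(embedding.shape)
```

An embedding like this would then serve as the conditioning signal during sampling.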
For samples from the older model that was trained exclusively on SNES/SFC music, click here.

Below are samples generated by the U3 model throughout the course of training. The tracks marked U2 were generated by the previous model, which was only trained to 130k steps.
Below are samples generated by an older model trained exclusively on SNES/SFC music. Captions indicate which game(s) were used as conditioning for each sample; "miscellaneous" or "et al" samples use too many (low-weighted) games to list specifically. Samples are in 32kHz stereo to match the capabilities of the SPC700.