
Comparing GAN and LSTM Performance in Music Generation

By Vivek Tiwari

Posted on March 24, 2023

This article examines the use of neural networks in music generation, a complex and relatively unexplored field compared with tasks such as image or text classification. One key challenge is generating realistic and aesthetically pleasing music, particularly because it often involves multiple instruments and tracks. The project discussed here aims to generate human-like music using a GAN (Generative Adversarial Network) architecture, trained on a subset of the MAESTRO dataset featuring piano compositions.

The methodology includes three main steps: preprocessing MIDI files to extract musical notes, developing a baseline model using Long Short-Term Memory (LSTM) networks, and creating a GAN model for music generation. The GAN involves a generator that creates music from random noise and a discriminator that distinguishes between real and generated music, improving the generator’s output over time.

The project explores hyperparameter tuning, evaluating activation functions, learning rates, optimizers, and batch sizes to enhance the models’ performance. The GAN outperformed the baseline LSTM model, producing less repetitive and more varied music. The results were confirmed both through quantitative metrics and human evaluations, showing that the GAN model created higher-quality music.

1) Introduction

Neural networks are widely used today for tasks related to text, image, and video classification. However, music remains a relatively untapped field, presenting unique learning opportunities and complex challenges. One of the main difficulties in music generation is creating realistic and aesthetically pleasing compositions. This task demands a thorough exploration of musical patterns and an understanding of music’s hierarchical structure before generating new compositions. Additionally, music typically involves multiple instruments and tracks, which complicates the process.

Despite these challenges, exciting potential applications exist for consumers, including commercial media companies and music composers. The use of neural networks to generate data that mimics human creativity is an emerging field. The goal of this project is to produce music comparable to human compositions using an advanced GAN architecture. A subset of the MAESTRO dataset, consisting of approximately 200 hours of piano music, was utilized. The dataset’s metadata provides details such as composer, year, MIDI filename, and duration. Some challenges in music generation include:

Each piece of music has its own unique characteristics, so the model must effectively capture these nuances.
Assessing the quality of the music generated by the model.
Musical data requires a deep understanding of musical technicalities for proper processing.

2) Methodology

Here is a high-level overview of the approach:

Pre-process the MIDI files to extract sequences of notes.
Create and evaluate a baseline LSTM model.
Create and evaluate a GAN model.

It consists of the following steps:

A) Pre-processing and Analysis:

The MIDI music files were preprocessed to extract all notes and chords using the music21 library in Python. Each unique note and chord in the training set was identified and assigned a distinct integer value. The musical notes were then converted into sequences based on this mapping, preparing them for input into neural network models. The process is illustrated in Figure 1.

Figure 1: Pre-processing and analysis.
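The mapping and sequencing steps described above can be sketched in plain Python. This is a minimal illustration, assuming the notes and chords have already been extracted (for example with music21) into a list of strings. The function names `build_vocab` and `make_sequences`, the sample notes, and the window length of 4 are hypothetical and are not taken from the original project.

```python
def build_vocab(notes):
    """Assign a distinct integer to each unique note or chord name."""
    return {name: idx for idx, name in enumerate(sorted(set(notes)))}


def make_sequences(notes, vocab, seq_len):
    """Slide a window over the piece: each input is seq_len encoded
    notes, and the target is the note that follows the window."""
    encoded = [vocab[n] for n in notes]
    inputs, targets = [], []
    for start in range(len(encoded) - seq_len):
        inputs.append(encoded[start:start + seq_len])
        targets.append(encoded[start + seq_len])
    return inputs, targets


# Hypothetical extracted notes; the chord is written as dot-joined pitch classes.
notes = ["C4", "E4", "G4", "0.4.7", "C4", "E4", "G4", "C5"]
vocab = build_vocab(notes)
inputs, targets = make_sequences(notes, vocab, seq_len=4)
```

Each input window is paired with the note that follows it, which is the form a next-note prediction model such as the baseline LSTM would consume.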

B) Creation and Evaluation of baseline model:

A baseline model was developed using LSTM, as described in this article. LSTM (Long Short-Term Memory) is a type of recurrent neural network designed to process and predict sequential data over time, rather than spatial data. It handles time series data well, coping with noise, distributed representations, and continuous values. LSTMs are particularly effective at remembering and using past inputs over extended periods, which suits note prediction. The network’s architecture began with an LSTM layer, which processed the input sequence and learned to predict the next note in a sequence. This was followed by a dropout layer, which randomly deactivates a fraction of units during training to reduce overfitting. Additional LSTM and dropout layers were incorporated, with the final dropout layer’s output being fed into a fully connected layer. The network used the ReLU activation function in its intermediate layers, with a softmax activation in the final layer (see Section 4). Figure 2 shows the architecture of the LSTM model.

Figure 2: Architecture of the baseline LSTM model.

C) Creation and Evaluation of GAN:

Following the development of the baseline LSTM model, an adversarial network was trained to potentially enhance performance on the same task. The Generative Adversarial Network (GAN) comprised two main components: a generator and a discriminator. The generator’s role was to create music from random noise, while the discriminator’s task was to differentiate between authentic music files from the dataset and the fake music produced by the generator. The discriminator received both real and generated files simultaneously, aiming to correctly identify each. This adversarial setup involved a competition between the two models, with the generator improving its ability to deceive the discriminator over time. Figure 3 illustrates the general architecture of the GAN. Training the GAN can be viewed as a two-player minimax game between the generator and the discriminator.

Figure 3: General architecture of the GAN.
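The two-player minimax game mentioned above is usually written with the value function from the standard GAN formulation. Here G is the generator, D is the discriminator, x is a real music sequence, and z is the random noise input. The notation is the conventional one and is not taken from the project itself.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximize both terms by scoring real sequences near 1 and generated ones near 0, while the generator tries to minimize the second term by pushing D(G(z)) toward 1.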

Generator network:

The generator network was a straightforward multi-layered perceptron that took random inputs corresponding to the latent dimension size. As previously mentioned, its role was to produce a music sequence from the given random noise. Figure 4 illustrates the architecture of the generator model.

Figure 4: Architecture of the generator network.
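As a rough illustration of how a multi-layered perceptron can turn random noise into a note sequence, the sketch below runs an untrained forward pass in plain Python. The layer sizes, the tanh output, the LeakyReLU slope, and the scaling from (-1, 1) to note indices are all assumptions made for this example; the source does not specify them.

```python
import math
import random


def leaky_relu(x, alpha=0.2):
    return x if x > 0 else alpha * x


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per row of weights."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def generate_sequence(latent_dim, hidden_dim, seq_len, vocab_size, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # random noise input
    w1, b1 = random_layer(latent_dim, hidden_dim, rng)
    w2, b2 = random_layer(hidden_dim, seq_len, rng)
    hidden = dense(z, w1, b1, leaky_relu)
    scaled = dense(hidden, w2, b2, math.tanh)  # each value lies in [-1, 1]
    # Map [-1, 1] onto integer note indices 0 .. vocab_size - 1.
    return [min(vocab_size - 1, int((v + 1) / 2 * vocab_size)) for v in scaled]


rng = random.Random(0)
sequence = generate_sequence(latent_dim=16, hidden_dim=32, seq_len=8,
                             vocab_size=50, rng=rng)
```

In a trained GAN the weights would be learned through the adversarial updates rather than drawn at random as they are here.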

Discriminator network:

The discriminator model began with two LSTM layers, including one bidirectional LSTM, allowing the model to consider context from both directions for decision-making. Figure 5 illustrates the architecture of the discriminator model.

Figure 5: Architecture of the discriminator network.

3) Model Training and Hyperparameter Selection

A) Model Training

The LSTM model was trained first and used to generate music sequences from random inputs. The generated sequences were then assessed on the two primary evaluation metrics: the percentage of excessively repeated notes and the range of notes produced (detailed in Section 4). Figure 6 shows the loss per epoch for the baseline LSTM model.

Figure 6: Loss per epoch for the baseline LSTM model.

After obtaining results from the baseline model, the GAN model was trained. The GAN architecture was trained as follows: a randomly generated sequence was input to the generator to create a fake music sequence. This generated fake music sequence, along with a real pre-processed music sequence from the dataset, was then fed into the discriminator. The parameters of both the generator and discriminator were updated, with the generator working to enhance its ability to deceive the discriminator, and the discriminator aiming to improve its ability to differentiate between real and fake music sequences.
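The opposing objectives in this update step can be sketched with a binary cross-entropy loss. The source does not state which loss was used, so binary cross-entropy and the generator objective shown here (the commonly used non-saturating form) are assumptions, and the discriminator scores are made-up values.

```python
import math


def binary_cross_entropy(predictions, labels, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(predictions)


def discriminator_loss(real_scores, fake_scores):
    """Discriminator wants real sequences scored 1 and generated ones 0."""
    real_loss = binary_cross_entropy(real_scores, [1] * len(real_scores))
    fake_loss = binary_cross_entropy(fake_scores, [0] * len(fake_scores))
    return real_loss + fake_loss


def generator_loss(fake_scores):
    """Generator wants its sequences scored as real (label 1)."""
    return binary_cross_entropy(fake_scores, [1] * len(fake_scores))


# Hypothetical discriminator outputs for one batch.
real_scores = [0.9, 0.8]
fake_scores = [0.2, 0.1]
d_loss = discriminator_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
```

With these scores the discriminator is doing well, so its loss is small while the generator's loss is large. As the generator improves, the scores on its outputs rise and the balance shifts.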

B) Hyperparameter Tuning

A model’s performance is influenced not only by its architecture but also by the fine-tuning of its hyperparameters. To achieve optimal performance, the hyperparameters of the GAN model (for both the generator and discriminator) were meticulously adjusted by testing various values and selecting the most effective ones. This process aimed to maximize the model’s performance, particularly in terms of minimizing excessively repeated notes and expanding the range of generated notes (discussed further in section 4). The following are the key hyperparameters selected for fine-tuning:

Activation Function: [relu, leakyRelu]

Learning Rate: [0.01, 0.001, 0.1]

Optimizer: [Adam, RMSProp, Adadelta]

Batch Size: [64, 128, 256]

a) Activation Function: The learning rate was initially set to 0.01, the batch size to 64, and the optimizer to RMSProp, consistent with the hyperparameters used in the baseline model. The model was then trained using various activation functions. Table I compares the percentage of excessively repeated notes and the range of notes generated for each activation function. The best results were achieved with LeakyReLU as the activation function.

TABLE I

Percentage of excessively repeated notes and range for each activation function

Activation Function	Notes excessively repeated (%)	Range
Relu	58	56
LeakyRelu	56	56
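For reference, the two activation functions compared in Table I differ only in how they treat negative inputs. The slope `alpha=0.01` below is an assumed value for illustration; the source does not state the slope that was used.

```python
def relu(x):
    """Pass positive values through and zero out negative ones."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha."""
    return x if x > 0 else alpha * x


negative_relu = relu(-2.0)
negative_leaky = leaky_relu(-2.0)
```

Because LeakyReLU keeps a small gradient for negative inputs, units are less likely to become permanently inactive during training.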

b) Learning Rate: The learning rate was fine-tuned while keeping the activation function as LeakyReLU and the optimizer as Adadelta. Table II displays the percentage of excessively repeated notes and the range of generated music notes for various learning rates. Optimal performance was obtained with a learning rate of 0.01.

TABLE II

Percentage of excessively repeated notes and range for different learning rates

Learning Rate	Notes excessively repeated (%)	Range
0.1	57	51
0.01	56	57
0.001	60	53

c) Optimizer: The optimizer was fine-tuned while maintaining the learning rate at 0.01 and the activation function as LeakyReLU. Table III presents the percentage of excessively repeated notes and the range of generated music notes for various optimizers. The best results were obtained with Adam as the optimizer.

TABLE III

Percentage of excessively repeated notes and range for different optimizers

Optimizer	Notes excessively repeated (%)	Range
Adam	58	55
Adadelta	59	51
RMSProp	61	49

d) Batch Size: The batch size was fine-tuned using the optimal hyperparameters identified so far: a learning rate of 0.01, LeakyReLU as the activation function, and Adam as the optimizer. Table IV shows the percentage of excessively repeated notes and the range of generated music notes for various batch sizes. The best performance was achieved with a batch size of 128.

TABLE IV

Percentage of excessively repeated notes and range for different batch sizes

Batch Size	Notes excessively repeated (%)	Range
64	57	59
128	56	57
256	60	62
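The tuning procedure in steps a) through d) changes one hyperparameter at a time and carries the best value forward. A minimal sketch of that pattern follows. The candidate values and starting defaults come from the lists above, but `tune_one_at_a_time` and `toy_evaluate` are hypothetical names, and the scoring function is a made-up stand-in for training the GAN and measuring repeated notes. It also scores on a single number, whereas the project weighed both repeated notes and range.

```python
def tune_one_at_a_time(search_space, defaults, evaluate):
    """Tune hyperparameters in order, fixing each winner before moving on.

    evaluate(config) must return a score where lower is better.
    """
    best = dict(defaults)
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            trial = dict(best, **{name: value})
            scores[value] = evaluate(trial)
        best[name] = min(scores, key=scores.get)
    return best


search_space = {
    "activation": ["relu", "leaky_relu"],
    "learning_rate": [0.01, 0.001, 0.1],
    "optimizer": ["adam", "rmsprop", "adadelta"],
    "batch_size": [64, 128, 256],
}
defaults = {"activation": "relu", "learning_rate": 0.01,
            "optimizer": "rmsprop", "batch_size": 64}


def toy_evaluate(config):
    """Stand-in for 'train the GAN and measure repeated notes (%)'.
    The penalties below are invented so the example runs instantly."""
    score = 60.0
    score -= 2.0 if config["activation"] == "leaky_relu" else 0.0
    score += abs(config["learning_rate"] - 0.01) * 10
    score -= 1.0 if config["optimizer"] == "adam" else 0.0
    score += abs(config["batch_size"] - 128) / 64
    return score


best_config = tune_one_at_a_time(search_space, defaults, toy_evaluate)
```

This approach needs far fewer training runs than a full grid search (10 trials here instead of 54), at the cost of missing interactions between hyperparameters.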

4) Evaluation

Evaluating generated music poses a challenge due to the unsupervised nature of the task. The most effective evaluation method would involve human assessment, where individuals review the model’s output. Although this is one of the best techniques, it is impractical to have all samples evaluated by humans. Therefore, several quantitative measures, grounded in music theory principles, were employed to assess the aesthetic quality of the music generated by the model. Below are the key measures used for the model’s quantitative analysis:

1. Excessively Repeated Notes: This measures the fraction of notes that are overly redundant in a musical piece. It serves as a good indicator of the variability in the generated music. A high level of repetition suggests that the model keeps generating the same sequences, which indicates that it is not learning effectively.

2. Range: This metric evaluates the span between the highest and lowest notes in a tune, measured in tone steps. According to general music aesthetics, a tone span of around 36 is considered to produce good music. Table V below provides a summary of the results obtained for these metrics for both the baseline LSTM model and the GAN model.
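The two metrics above can be sketched in a few lines of Python. The source does not give exact definitions, so this example makes two assumptions: a note counts as excessively repeated when it extends a run of the same pitch beyond `max_run` consecutive occurrences (a hypothetical threshold), and range is the difference between the highest and lowest MIDI pitch numbers, in semitones.

```python
def excessively_repeated_fraction(pitches, max_run=2):
    """Fraction of notes that extend a run of the same pitch beyond max_run."""
    if not pitches:
        return 0.0
    excess = 0
    run_length = 1
    for prev, cur in zip(pitches, pitches[1:]):
        run_length = run_length + 1 if cur == prev else 1
        if run_length > max_run:
            excess += 1
    return excess / len(pitches)


def pitch_range(pitches):
    """Span between the highest and lowest MIDI pitch, in semitones."""
    return max(pitches) - min(pitches)


# Hypothetical melody as MIDI pitch numbers (60 = middle C).
melody = [60, 60, 60, 60, 64, 67, 67, 72]
repeated = excessively_repeated_fraction(melody)  # 2 of 8 notes -> 0.25
span = pitch_range(melody)                        # 72 - 60 -> 12
```

Averaging these values over many generated files gives per-model figures of the kind reported in the results.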

TABLE V

Comparison of baseline LSTM model and GAN using evaluation metrics

Model	Notes Excessively Repeated (%)	Range
Baseline LSTM	82	32
GAN	55	56

Table V demonstrates that the baseline model had a higher percentage of repeated notes compared to the GAN model. Multiple music files were generated from both models, and the average percentage of excessively repeated notes was calculated. On average, 82 percent of the notes were repeated in the baseline model, while only 55 percent were repeated in the GAN model. The range metric revealed slightly different results: the baseline LSTM model had a range of 32, whereas the GAN model achieved a range of 56.

This can be attributed to the baseline model’s use of the softmax activation function in its final layer, which tends to generate the most probable note for a given sequence. Additionally, the LSTM model produced more repetitive music, limiting the difference between the highest and lowest notes, whereas the GAN model generated more varied music with a broader range of notes.
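The effect of always taking the most probable note from a softmax output can be illustrated with a small example. This sketch assumes the baseline picks the highest-probability note at each step, which the source implies but does not state outright. The logits are made-up values, and sampling is shown only for contrast.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(probs):
    """Always choose the most probable note index."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_sampled(probs, rng):
    """Draw a note index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate notes
probs = softmax(logits)
rng = random.Random(0)
greedy_notes = [pick_greedy(probs) for _ in range(20)]
sampled_notes = [pick_sampled(probs, rng) for _ in range(20)]
```

For the same input, greedy selection returns the same note every time, whereas sampling will typically produce a mix of notes. This is one way a softmax output layer can lead to repetitive sequences.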

Alongside these quantitative metrics, a survey was conducted to assess the quality of the music generated by the proposed GAN model in comparison to the baseline model. Several individuals with basic knowledge of piano music participated in the survey. According to the results, 99 percent of participants agreed that the music generated by the GAN model was superior to that of the baseline LSTM model.

Combining these statistical measures with human evaluation ensured that the music generated by the model aligned with the original training data. The results clearly indicate that the proposed GAN model outperformed the baseline LSTM model, producing unique tones and capturing the nuances of the music more effectively.

References

[1] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel and Douglas Eck “Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset” ICLR, Jan 2019.

[2] Natasha Jaques, Shixiang Gu, Richard E. Turner, Douglas Eck “Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning”.

[3] Allen Huang, Raymond Wu “Deep Learning for Music” Stanford U
