This work explores the use of neural networks for music generation, a complex and relatively unexplored field compared to tasks like image or text classification. One key challenge is generating realistic and aesthetically pleasing music, particularly because music often involves multiple instruments and tracks. The project discussed here aims to generate human-like music using a GAN (Generative Adversarial Network) architecture, trained on a subset of the MAESTRO dataset featuring piano compositions.
The methodology includes three main steps: preprocessing MIDI files to extract musical notes, developing a baseline model using Long Short-Term Memory (LSTM) networks, and creating a GAN model for music generation. The GAN involves a generator that creates music from random noise and a discriminator that distinguishes between real and generated music, improving the generator’s output over time.
The project explores hyperparameter tuning, evaluating activation functions, learning rates, optimizers, and batch sizes to enhance the models’ performance. The GAN outperformed the baseline LSTM model, producing less repetitive and more varied music. The results were confirmed both through quantitative metrics and human evaluations, showing that the GAN model created higher-quality music.
1) Introduction
Neural networks are widely used today for tasks related to text, image, and video classification. However, music remains a relatively untapped field, presenting unique learning opportunities and complex challenges. One of the main difficulties in music generation is creating realistic and aesthetically pleasing compositions. This task demands a thorough exploration of musical patterns and an understanding of music’s hierarchical structure before generating new compositions. Additionally, music typically involves multiple instruments and tracks, which complicates the process.
Despite these challenges, exciting potential applications exist for commercial media companies, music composers, and everyday listeners. The use of neural networks to generate data that mimics human creativity is an emerging field. The goal of this project is to produce music comparable to human compositions using a GAN architecture. A subset of the MAESTRO dataset, consisting of approximately 200 hours of piano music, was utilized. The dataset's metadata provides details such as composer, year, MIDI filename, and duration. Some challenges in music generation include:
- Each piece of music has its own unique characteristics, so the model must effectively capture these patterns.
- Assessing the quality of the music generated by the model.
- Musical data requires a deep understanding of musical technicalities for proper processing.
2) Methodology
Here is a high-level overview of the approach:
- Pre-process the MIDI files to extract sequences of notes.
- Create and evaluate a baseline LSTM model.
- Create and evaluate a GAN model.
Each of these steps is described below.
A) Pre-processing and Analysis:
The MIDI music files were preprocessed to extract all notes and chords using the music21 library in Python. Each unique note and chord in the training set was identified and assigned a distinct integer value. The musical notes were then converted into sequences based on this mapping, preparing them for input into neural network models. The process is illustrated in Figure 1.
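The extraction step can be sketched as follows using music21. The directory path and window length here are illustrative assumptions, since the paper does not specify them.

```python
import glob

from music21 import converter, note, chord

notes = []
for path in glob.glob("midi_files/*.mid"):  # hypothetical dataset location
    score = converter.parse(path)
    for element in score.flatten().notes:
        if isinstance(element, note.Note):
            notes.append(str(element.pitch))  # e.g. "C4"
        elif isinstance(element, chord.Chord):
            # Encode a chord as the dot-joined pitch classes of its notes
            notes.append(".".join(str(p) for p in element.normalOrder))

# Assign each unique note/chord symbol a distinct integer
vocab = sorted(set(notes))
note_to_int = {symbol: i for i, symbol in enumerate(vocab)}

# Slide a fixed-length window over the note stream to build input sequences
seq_len = 100  # assumed window size
sequences = [
    [note_to_int[s] for s in notes[i:i + seq_len]]
    for i in range(len(notes) - seq_len)
]
```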
B) Creation and Evaluation of baseline model:
A baseline model was developed using an LSTM network, following an approach described in prior work. LSTM (Long Short-Term Memory) is a type of recurrent neural network designed to process and predict sequential data over time, rather than spatial data. It excels at handling time-series data, coping with noise, distributed representations, and continuous values. LSTMs are particularly effective at remembering and utilizing past inputs over extended periods, which is beneficial for this task. The network's architecture began with an LSTM layer, which processed the input sequence and was crucial for training the network to predict the next note in a sequence. This was followed by a dropout layer, which randomly zeroes a fraction of activations during training to prevent overfitting. Additional LSTM and dropout layers were incorporated, with the final dropout layer's output fed into a fully connected layer. The network used the ReLU activation function throughout its hidden layers. Figure 2 shows the architecture of the LSTM model.
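A minimal Keras sketch of this architecture is shown below; the layer widths and dropout rates are assumptions, as the paper only gives the layer order. The softmax output layer matches the description in Section 4.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_baseline(seq_len, vocab_size):
    model = Sequential([
        # First LSTM layer reads the note sequence; return_sequences=True
        # passes the full sequence on to the next recurrent layer
        LSTM(256, input_shape=(seq_len, 1), return_sequences=True),
        Dropout(0.3),   # randomly zeroes activations to limit overfitting
        LSTM(256),      # second LSTM layer, emits only its final state
        Dropout(0.3),
        Dense(256, activation="relu"),            # fully connected layer
        Dense(vocab_size, activation="softmax"),  # next-note distribution
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
    return model
```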
C) Creation and Evaluation of GAN:
Following the development of the baseline LSTM model, an adversarial network was trained to potentially enhance performance on the same task. The Generative Adversarial Network (GAN) comprised two main components: a generator and a discriminator. The generator’s role was to create music from random noise, while the discriminator’s task was to differentiate between authentic music files from the dataset and the fake music produced by the generator. The discriminator received both real and generated files simultaneously, aiming to correctly identify each. This adversarial setup involved a competition between the two models, with the generator improving its ability to deceive the discriminator over time. Figure 3 illustrates the general architecture of the GAN. Training the GAN can be viewed as a two-player minimax game between the generator and the discriminator.
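Formally, this is the standard GAN objective of Goodfellow et al., in which the generator $G$ and discriminator $D$ play:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

where $x$ is a real music sequence from the dataset and $z$ is the random noise vector fed to the generator.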
Generator network:
The generator network was a straightforward multi-layered perceptron that took random inputs corresponding to the latent dimension size. As previously mentioned, its role was to produce a music sequence from the given random noise. Figure 4 illustrates the architecture of the generator model.
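A sketch of such a generator is given below; the layer widths are assumptions, and LeakyReLU follows the hyperparameter search in Section 3.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

def build_generator(latent_dim, seq_len):
    return Sequential([
        Dense(256, input_dim=latent_dim),  # random noise in; widths assumed
        LeakyReLU(0.2),
        Dense(512),
        LeakyReLU(0.2),
        # tanh keeps the generated (normalized) note values in [-1, 1]
        Dense(seq_len, activation="tanh"),
    ])
```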
Discriminator network:
The discriminator model began with two LSTM layers, including one bidirectional LSTM, allowing the model to consider context from both directions for decision-making. Figure 5 illustrates the architecture of the discriminator model.
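A corresponding sketch with one plain LSTM layer and one bidirectional LSTM layer, matching the layer order described above (widths assumed):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, LeakyReLU

def build_discriminator(seq_len):
    return Sequential([
        # First LSTM layer returns the full sequence for the next layer
        LSTM(512, input_shape=(seq_len, 1), return_sequences=True),
        Bidirectional(LSTM(512)),  # reads context from both directions
        Dense(512),
        LeakyReLU(0.2),
        Dense(1, activation="sigmoid"),  # probability the input is real
    ])
```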
3) Model Training and Hyperparameter Selection
A) Model Training
The LSTM model was initially trained and used to generate music sequences from random inputs. Subsequently, the percentage of excessively repeated notes and the range of music notes produced were assessed, which are the primary evaluation metrics (detailed in Section 4). Figure 6 shows the loss per epoch for the baseline LSTM model.
After obtaining results from the baseline model, the GAN model was trained. The GAN architecture was trained as follows: a randomly generated sequence was input to the generator to create a fake music sequence. This generated fake music sequence, along with a real pre-processed music sequence from the dataset, was then fed into the discriminator. The parameters of both the generator and discriminator were updated, with the generator working to enhance its ability to deceive the discriminator, and the discriminator aiming to improve its ability to differentiate between real and fake music sequences.
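A condensed sketch of one such training step is shown below. It assumes the builder functions from the earlier sketches and real sequences shaped (batch_size, seq_len, 1); the dimensions and optimizers are assumptions.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Reshape

latent_dim, seq_len, batch_size = 100, 100, 64  # assumed values

generator = build_generator(latent_dim, seq_len)
discriminator = build_discriminator(seq_len)
discriminator.compile(loss="binary_crossentropy", optimizer="rmsprop")

# Stacked model for the generator update; the discriminator is frozen here
# so only the generator's weights change when the stack is trained
discriminator.trainable = False
gan = Sequential([generator, Reshape((seq_len, 1)), discriminator])
gan.compile(loss="binary_crossentropy", optimizer="rmsprop")

def train_step(real_batch):
    """One adversarial update; real_batch has shape (batch_size, seq_len, 1)."""
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_batch = generator.predict(noise, verbose=0).reshape(-1, seq_len, 1)
    # Discriminator step: real sequences labeled 1, generated ones labeled 0
    d_loss = discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    d_loss += discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))
    # Generator step: reward noise that the discriminator classifies as real
    g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
    return d_loss, g_loss
```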
B) Hyperparameter Tuning
A model's performance is influenced not only by its architecture but also by the fine-tuning of its hyperparameters. To achieve optimal performance, the hyperparameters of the GAN model (for both the generator and discriminator) were adjusted by testing various values and selecting the most effective ones. This process aimed to maximize the model's performance, particularly in minimizing excessively repeated notes and expanding the range of generated notes (discussed further in Section 4). The following key hyperparameters and candidate values were selected for fine-tuning; a sketch of the one-at-a-time sweep appears after the list.
- Activation Function: [ReLU, LeakyReLU]
- Learning Rate: [0.01, 0.001, 0.1]
- Optimizer: [Adam, RMSProp, Adadelta]
- Batch Size: [64, 128, 256]
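A minimal sketch of this one-factor-at-a-time sweep; `train_and_score` is a hypothetical helper that trains the GAN under a given configuration and returns the two evaluation metrics from Section 4.

```python
# Candidate values from the list above; the starting point mirrors the
# baseline settings described in subsection (a)
search_space = {
    "activation": ["relu", "leaky_relu"],
    "learning_rate": [0.01, 0.001, 0.1],
    "optimizer": ["adam", "rmsprop", "adadelta"],
    "batch_size": [64, 128, 256],
}
best = {"activation": "relu", "learning_rate": 0.01,
        "optimizer": "rmsprop", "batch_size": 64}

for name, candidates in search_space.items():
    scores = {}
    for value in candidates:
        config = {**best, name: value}
        repeated_pct, note_range = train_and_score(config)  # hypothetical helper
        scores[value] = repeated_pct
    # Keep the value that minimizes excessively repeated notes
    best[name] = min(scores, key=scores.get)
```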
a) Activation Function: The learning rate was initially set to 0.01, the batch size to 64, and the optimizer to RMSProp, consistent with the hyperparameters used in the baseline model. The model was then trained using various activation functions. Table I compares the percentage of excessively repeated notes and the range of notes generated for each activation function. The best results were achieved with LeakyReLU as the activation function.
TABLE I: Percentage of excessively repeated notes and note range for each activation function

| Activation Function | Notes excessively repeated (%) | Range |
|---|---|---|
| ReLU | 58 | 56 |
| LeakyReLU | 56 | 56 |
b) Learning Rate: The learning rate was fine-tuned while keeping the activation function as LeakyReLU and the optimizer as Adadelta. Table II displays the percentage of excessively repeated notes and the range of generated music notes for various learning rates. Optimal performance was obtained with a learning rate of 0.01.
TABLE II: Percentage of excessively repeated notes and note range for different learning rates

| Learning Rate | Notes excessively repeated (%) | Range |
|---|---|---|
| 0.1 | 57 | 51 |
| 0.01 | 56 | 57 |
| 0.001 | 60 | 53 |
c) Optimizer: The optimizer was fine-tuned while maintaining the learning rate at 0.01 and the activation function as LeakyReLU. Table III presents the percentage of excessively repeated notes and the range of generated music notes for various optimizers. The best results were obtained with Adam as the optimizer.
TABLE III: Percentage of excessively repeated notes and note range for different optimizers

| Optimizer | Notes excessively repeated (%) | Range |
|---|---|---|
| Adam | 58 | 55 |
| Adadelta | 59 | 51 |
| RMSProp | 61 | 49 |
d) Batch Size: The batch size was fine-tuned using the optimal hyperparameters identified so far: a learning rate of 0.01, LeakyReLU as the activation function, and Adam as the optimizer. Table IV shows the percentage of excessively repeated notes and the range of generated music notes for various batch sizes. The best performance was achieved with a batch size of 128.
TABLE IV: Percentage of excessively repeated notes and note range for different batch sizes

| Batch Size | Notes excessively repeated (%) | Range |
|---|---|---|
| 64 | 57 | 59 |
| 128 | 56 | 57 |
| 256 | 60 | 62 |
4) Evaluation
Evaluating generated music poses a challenge due to the unsupervised nature of the task. The most effective evaluation method would involve human assessment, where individuals review the model’s output. Although this is one of the best techniques, it is impractical to have all samples evaluated by humans. Therefore, several quantitative measures, grounded in music theory principles, were employed to assess the aesthetic quality of the music generated by the model. Below are the key measures used for the model’s quantitative analysis:
1. Excessively Repeated Notes: This measures the fraction of notes that are overly redundant in a musical piece. It serves as a good indicator of the variability in the generated music. A high level of repetition suggests the model is generating the same sequences and lacks variability, indicating that it is not learning effectively and is producing repetitive outputs.
2. Range: This metric evaluates the span between the highest and lowest notes in a tune, measured in tone steps. According to general music aesthetics, a tone span of around 36 is considered to produce good music. A sketch of how these metrics can be computed follows; Table V then summarizes the results obtained for both the baseline LSTM model and the GAN model.
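This is a minimal sketch of the two metrics, assuming a generated piece is represented as a list of MIDI pitch numbers; the repetition threshold is an assumption, since the paper does not define "excessive" numerically.

```python
def excessively_repeated_pct(pitches, threshold=4):
    """Percent of notes that extend a run of identical pitches past `threshold`."""
    repeated, run = 0, 1
    for prev, cur in zip(pitches, pitches[1:]):
        run = run + 1 if cur == prev else 1
        if run > threshold:
            repeated += 1
    return 100.0 * repeated / len(pitches)

def note_range(pitches):
    """Span between the highest and lowest note of the piece, in tone steps."""
    return max(pitches) - min(pitches)

# Example: a highly repetitive piece scores high on the first metric
piece = [60] * 50 + [62, 64, 65]
print(excessively_repeated_pct(piece))  # ~86.8
print(note_range(piece))                # 5
```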
TABLE V: Comparison of the baseline LSTM model and the GAN using the evaluation metrics

| Model | Notes Excessively Repeated (%) | Range |
|---|---|---|
| Baseline LSTM | 82 | 32 |
| GAN | 55 | 56 |
Table V shows that the baseline model produced a much higher percentage of repeated notes than the GAN model. Multiple music files were generated from both models, and the average percentage of excessively repeated notes was calculated: on average, 82 percent of the notes were repeated in the baseline model's output, compared with only 55 percent for the GAN model. The range metric told a similar story: the baseline LSTM model achieved a range of 32, whereas the GAN model achieved a range of 56.
This can be attributed to the baseline model’s use of the softmax activation function in its final layer, which tends to generate the most probable note for a given sequence. Additionally, the LSTM model produced more repetitive music, limiting the difference between the highest and lowest notes, whereas the GAN model generated more varied music with a broader range of notes.
Alongside these quantitative metrics, a survey was conducted to assess the quality of the music generated by the proposed GAN model in comparison to the baseline model. Several individuals with basic knowledge of piano music participated in the survey. According to the results, 99 percent of participants agreed that the music generated by the GAN model was superior to that of the baseline LSTM model.
Combining these statistical measures with human evaluation ensured that the music generated by the model aligned with the original training data. The results clearly indicate that the proposed GAN model outperformed the baseline LSTM model, producing unique tones and capturing the nuances of the music more effectively.
References
[1] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset," ICLR, 2019.
[2] Natasha Jaques, Shixiang Gu, Richard E. Turner, and Douglas Eck, "Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning."
[3] Allen Huang and Raymond Wu, "Deep Learning for Music," Stanford University.