
Received 24 December 2022, accepted 3 February 2023, date of publication 13 February 2023, date of current version 23 February 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3244620

Music Deep Learning: Deep Learning Methods for Music Signal Processing—A Review of the State-of-the-Art
LAZAROS MOYSIS1,2, LAZAROS ALEXIOS ILIADIS3, (Graduate Student Member, IEEE),
SOTIRIOS P. SOTIROUDIS3, ACHILLES D. BOURSIANIS3, (Member, IEEE),
MARIA S. PAPADOPOULOU3,6, (Member, IEEE), KONSTANTINOS-IRAKLIS D. KOKKINIDIS4,
CHRISTOS VOLOS1, PANAGIOTIS SARIGIANNIDIS5, (Member, IEEE),
SPIRIDON NIKOLAIDIS3, (Senior Member, IEEE),
AND SOTIRIOS K. GOUDOS3, (Senior Member, IEEE)
1 Laboratory of Nonlinear Systems-Circuits and Complexity, School of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
2 Department of Mechanical Engineering, University of Western Macedonia, 50100 Kozani, Greece
3 ELEDIA@AUTH, School of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
4 Department of Applied Informatics, University of Macedonia, 54636 Thessaloniki, Greece
5 Department of Electrical and Computer Engineering, University of Western Macedonia, 50131 Kozani, Greece
6 Department of Information and Electronic Engineering, International Hellenic University, 57400 Sindos, Greece

Corresponding authors: Lazaros Moysis ([email protected]) and Sotirios K. Goudos ([email protected])


This research was carried out as part of the project ≪Recognition and direct characterization of cultural items for the education and
promotion of Byzantine Music using artificial intelligence≫ (Project code: KMP6-0078938) under the framework of the Action
‘‘Investment Plans of Innovation’’ of the Operational Program ‘‘Central Macedonia 2014–2020’’, which is co-funded by the European
Regional Development Fund and Greece.
The associate editor coordinating the review of this manuscript and approving it for publication was Pasquale De Meo.

ABSTRACT The discipline of Deep Learning has been recognized for its strong computational tools, which
have been extensively used in data and signal processing, with innumerable promising results. Among
the many commercial applications of Deep Learning, Music Signal Processing has received an increasing
amount of attention over the last decade. This work reviews the most recent developments of Deep Learning
in Music signal processing. Two main applications that are discussed are Music Information Retrieval, which
spans a plethora of applications, and Music Generation, which can fit a range of musical styles. After a review
of both topics, several emerging directions are identified for future research.

INDEX TERMS Deep learning, music signal processing, music information retrieval, music generation,
neural networks, machine learning.

I. INTRODUCTION

A. DEEP LEARNING IN MUSIC SIGNAL PROCESSING
Deep Learning (DL) [1], a sub-field of Machine Learning (ML), has been established as a strong computational toolbox, with applications in numerous tasks, like feature extraction, classification, and pattern recognition. Such functionalities enable the extraction of meaningful information from raw data, and thus find applications in a wide range of disciplines, including computer vision (CV) [2], natural language processing (NLP) [3], bioinformatics [4], medical diagnosis [5], speech recognition [6], image processing (IP) [7], system identification [8], recommendation systems [9], and more [10].
A research field where DL has emerged as a valuable tool over the last decade is that of audio signal processing (ASP) [11] and music signal processing (MSP) [12]. Music is a well-known art form that plays a central part in recreational and educational human activities. As a result, the music industry encompasses a wide range of organizations and consumers. The application of DL tools in MSP has led to a collection of successful commercial applications, the


most famous of which is Music Recommendation Systems (MRS) [13]. As shown in Fig. 1, the number of publications
indexed in Scopus under the keywords ‘‘deep,’’ ‘‘learning,’’ and ‘‘music’’ demonstrates the applicability of DL in music
processing. Between 2014 and 2021, 638 such publications appeared, with a sharp increase each year, indicating
growing scientific interest in this field. The diversity of the
field is also made apparent when looking at the subject
area categorization of these works, with 567 being listed in
Computer Science, 296 in Engineering, 136 in Mathematics,
74 in Physics and Astronomy, 63 in Decision Sciences, 51 in
Arts and Humanities, and the rest covering disciplines such
as Materials Science, Medicine, Social Sciences, Energy and
more.
The broad field of DL in music-related applications could
be termed Music Deep Learning (MDL) and can be divided
into two categories, Music Information Retrieval (MIR) [11]
and Music Generation (MG) [14]. MIR refers to the extraction of characterizing information from music data. Such information can then be exploited for a wide range of applications, such as genre classification [15], [16], music recommendation [17], [18], music source separation [19], singing voice detection [20], instrument recognition [21], music emotion recognition [22], and transcription [23]. All of the above applications aid in the digital preservation of music, by constructing and managing song databases, as well as the study of different music genres.

FIGURE 1. Number of publications indexed under the common keywords ‘deep’, ‘learning’, and ‘music’ in Scopus.

MG, under the framework of DL, broadly refers to the automatic generation of music content. This task is performed by first extracting valuable information from music databases using MIR techniques, and then building DL architectures to generate original music content. This has several commercial applications, like movie and game score generation. The automatic generation of music content has spurred discussions on whether this new way to create art will eventually replace musicians. However, the more realistic projection for the future is that MG can serve as a valuable tool for musicians and educators alike, to explore new approaches to composition and teaching [24].

B. RELATED SURVEYS
There have been several reviews of results in MDL. In [11], a review of the (at the time) current DL techniques for ASP is provided. Three types of audio are considered, namely speech, music, and environmental sounds, with applications like audio recognition, synthesis, and transformation. Several reviews have also considered specific applications of MDL. In [21], a tutorial on MIR is provided that is especially useful to newcomers in the field, while [13], [25] review MRS. Music Genre Classification (MGC) is reviewed in [26]. Drum transcription is reviewed in [27], focusing on non-negative matrix factorization and recurrent neural network architectures. In [28], a review of the audio signal representations for use with CNNs is given. A review of DL for speech recognition is available in [6], though the focus is not on music signals. For singing information processing, [29] reviews several aspects, like singing skill evaluation, singing voice synthesis, singing voice separation, lyric synchronization, and transcription. Specifically for singing voice detection, the review in [20] investigates the traditional and deep learning techniques available. DL for music emotion recognition is reviewed in [22].
For MG, the extensive survey in [14] offers an in-depth analysis, covering five key aspects of MG: the objective, representation, architecture, challenge, and strategy. The work [30] provides a systematic review of AI techniques in MG, with valuable information regarding publications, citations, geographical distribution, and more. A review of the composition tasks for various music generation levels is provided in [31]. Finally, [32] discusses the challenges and limitations of MG. These include, for example, the designer's creative limitations, the lack of structure, the extent of control the designer has over the generated music features, and the lack of direct user interaction. Moreover, it argues on how to address these issues.

C. MOTIVATION
From the above, it is clear that different aspects of DL in MSP have been surveyed, with many reviews being dedicated to focused topics, thus providing highly detailed insights into them. In this work, DL for both MIR and MG is discussed, which, to the authors' knowledge, are reviewed together for the first time. The purpose of this work is to provide a broader overview of current research in this field, which could serve as a guide for identifying new research trends. To that end, after a review of recent results on both MIR and MG, a section is dedicated to identifying future directions for MDL. Specifically, four research directions are identified, all of which can yield fruitful results in MDL. An earlier version of this study was presented in [33]. The current work


extends [33] by expanding upon the literature review and the discussion on future topics of interest.
The main contributions of this work are summarized as follows:
1) To complement previous surveys, emphasis is given to works published in 2020 or later. In this way, the evolution of MDL into a mature field is presented.
2) To the best of the authors' knowledge, this is the first time that MIR techniques and MG processes are reviewed together, highlighting the interconnection between the two research directions.
3) Attention is given to four areas, which are identified as emerging research topics. These areas are hybrid architectures, DL in traditional music genres, MDL in medical applications, and DL for music generated from dynamic systems.
The rest of the work is outlined as follows: In Section II, the DL methods for Music Information Retrieval are presented. In Section III, the field of DL-based Music Generation is discussed. Section IV identifies future research directions. Finally, Section V concludes the work. For a list of abbreviations, see Appendix A.

II. DL METHODS FOR MIR
In this section, the application of DL to different MIR applications is reviewed. The section is divided into subsections based on the DL architecture used, and the different applications are discussed within each subsection. Table 1 summarizes all the reviewed works in MIR, organized by architecture.
First, a short description is provided of the various applications of MIR:
1) Music Recommendation Systems (MRS): MRS is the most fundamental application of MDL. Its goal is to successfully recommend new music tracks to users based on their previous listening history. For new users with no prior information, the problem is termed ‘‘cold-start MR.’’
2) Music classification: The goal is to identify the musical genre of a song, which is of fundamental importance in MRS. A more general goal is to distinguish music from other audio tracks, like speech, natural sounds, etc.
3) Emotion classification and prediction: The goal is to identify the underlying emotions that can be triggered by a song. This is again useful in MRS and music therapy.
4) Instrument/voice identification: The goal is to identify and separate the different instruments used to compose a music track. This also applies to detecting singing voices.
Several objective measures can be used to evaluate MIR architectures. These include accuracy, precision, recall, f1-score, mean absolute and square error, Area Under the Receiver Operating Characteristic Curve (ROC-AUC), and more. In the following, a note is made for each work on the accuracy achieved, or the ROC-AUC score, when provided. The reader should refer to each work for an extensive presentation of the evaluation analysis. Information on the dataset used for training and validation is also provided, for works that used public datasets.

FIGURE 2. Fully connected DNN.

A. FULLY CONNECTED DEEP NEURAL NETWORKS (FCDNN)
FCDNNs refer to the most basic type of deep neural network, where multiple hidden layers are applied and all nodes between consecutive layers are connected, as shown in Fig. 2.
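As an illustration of the structure described above, the following sketch builds a small fully connected network for a binary emotion classification task. It is a generic example written in PyTorch, not the configuration of any of the surveyed works; the layer widths, input size, and two-class output are assumptions.

```python
import torch
import torch.nn as nn

# Minimal FCDNN sketch: every node of one layer feeds every node of the next.
# Input size, hidden widths, and the binary (e.g., happy/sad) output are
# illustrative assumptions, not the setup of any surveyed model.
class FCDNN(nn.Module):
    def __init__(self, n_features=128, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = FCDNN()
features = torch.randn(8, 128)   # a batch of 8 feature vectors
logits = model(features)         # shape: (8, 2)
print(logits.shape)
```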
For MR, in [18], an architecture termed HitMusicNet, using an FCDNN, was presented for predicting the popularity of a music recording, using inputs that incorporate text, audio, and meta-data. The authors also construct a database termed the SpotGenTrack Popularity Dataset (SPD), which unifies information from the Spotify and Genius music and lyric databases. The meta-data information that was considered included the number of an artist's followers, an artist's popularity, as well as market availability. The resulting system can reach an 83% precision score. In [34], an FCDNN was used for MR, combining content-based and collaborative filtering in its input. The dataset used was the Spotify Recsys Challenge 2018 million playlist dataset [35], reaching an 88% precision score.
For emotion classification, in [36], classification was performed on the Music4All dataset [37], using valence, danceability, and energy as features. The classification is binary, with happy/sad classes. The model has a mean accuracy of 98.3%.

B. RECURRENT NEURAL NETWORKS (RNN)
RNNs are a class of neural networks used for processing sequential data [10], and are thus suitable for time series input signals. In contrast to the FCDNN architectures, RNNs are composed of loops or cycles. RNNs also possess an internal memory state that is utilized to process long sequences. There are many variants of such architectures, including Long Short Term Memory (LSTM), Gated RNN (GRU), bidirectional RNNs, Hopfield networks, etc. [10]. A simple RNN structure is shown in Fig. 3.
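As a minimal illustration of the recurrence and internal memory state described above (a generic sketch under arbitrary, assumed dimensions, not code from any surveyed work), a vanilla RNN step can be written in a few lines of NumPy:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the hidden state h acts as the memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Illustrative sizes: 12-dimensional input frames, 16 hidden units.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(16, 12))
W_hh = rng.normal(scale=0.1, size=(16, 16))
b_h = np.zeros(16)

h = np.zeros(16)
for x_t in rng.normal(size=(30, 12)):   # a sequence of 30 input frames
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                           # (16,)
```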


FIGURE 3. Recurrent neural network.

In [38], a tagging system is developed using an RNN. A scattering transform is used to extract features from the data. The MagnaTagATune dataset [39] is used. The resulting architecture achieves an average AUC-ROC score of 0.909. In [40], a web application was developed that can take as input any YouTube video song and classify its music genre, using four different architectures. The classification is performed for individual 10-second segments of the input track. The results are visualized in a graph. The music genre samples from the Audioset database [41] are used for training. The supporting website, being highly visual, can offer great help to music composers and students, and also has the potential to be used for user feedback.
For emotion classification tasks, in [42] an RNN is proposed that uses a two-note melody trend as a music feature. Five emotion classes were considered: aggressive, bittersweet, happy, humorous, and passionate. Data files from YouTube were used, and the accuracy is up to 75.4%. In [43], emotion recognition is performed on classes of instruments. Four instrument classes are considered: string, percussion, woodwind, and brass, and four emotion classes are considered: happy, sad, neutral, and fear. The study shows that the system recognizes specific instrument-emotion pairings.
RNNs have also been employed for music recommendation. In [44], an RNN architecture was used, and the study showed that song order does not significantly affect the quality of playlist recommendations. The AotM-2011 [45] and 8tracks [46] playlist datasets were used.
For singing voice separation, in [47], a curriculum learning approach was considered, where the learning begins with easy examples and the difficulty is steadily increased. Three different databases were tested: MIR-1K [48], ccMixter [49], and MUSDB18 [50], with the model yielding improved performance with respect to the global normalized source-to-distortion ratio measure.
A piano harmony automatic arrangement architecture is proposed in [51]. The model performs three tasks: note detection, multi-basic frequency estimation, and training. Apart from objective evaluation, the resulting tracks were evaluated by human listeners and were positively received.
For music classification, the Attention Mechanism (AM) has proven to be a strong technique for improving performance and is adopted in many architectures. An RNN with an attention mechanism is used with MIDI-formatted input by [52]. Five classes are considered: classical, country, dance music, folk, and metal. The accuracy achieved is 90.1%.

C. LONG SHORT-TERM MEMORY (LSTM)
Long Short-Term Memory networks (LSTM) [53] constitute a special case of RNNs, which have found applications in MIR. An LSTM unit is shown in Fig. 4.

FIGURE 4. LSTM unit cell.

An LSTM network can be mathematically represented as follows. For a given input vector u_k at time step k and N_h hidden units, the activation vector of the forget gate is f_k ∈ (0, 1)^{N_h}:
f_k = σ(W_f u_k^T + U_f q_{k-1}^T + b_f)   (1)
where W_f and U_f are weight matrices, q_k ∈ (0, 1)^{N_h} is the vector representing the hidden state, and b_f is the bias vector. In addition, the activation vectors for the input/update gate I_k ∈ (0, 1)^{N_h} and the output gate O_k ∈ (0, 1)^{N_h} are represented similarly:
I_k = σ(W_I u_k^T + U_I q_{k-1}^T + b_I)   (2)
and
O_k = σ(W_O u_k^T + U_O q_{k-1}^T + b_O)   (3)
where the subscripts I and O denote the input and output gates, respectively, whereas the rest of the symbols have the same meaning as previously.
An LSTM unit also contains a cell input activation vector denoted by C_k ∈ (−1, 1)^{N_h}, expressed as
C_k = tanh(W_C u_k^T + U_C q_{k-1}^T + b_C)   (4)
Using the following rules, the cell state vector and the hidden state vector are updated by combining the preceding equations:
S_k = f_k ◦ S_{k-1} + I_k ◦ C_k   (5)


where ◦ is the Hadamard product, and S_0 = 0 and q_0 = 0. Finally,
q_k = O_k ◦ tanh(S_k)   (6)
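To make the update rules (1)–(6) concrete, the following NumPy sketch steps a single LSTM cell exactly as written above. It is a generic illustration with arbitrary, assumed dimensions, treating u_k and q_k as plain vectors; it is not code from any of the surveyed works.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(u_k, q_prev, S_prev, P):
    """One LSTM step following Eqs. (1)-(6); P holds the weight matrices and biases."""
    f_k = sigmoid(P["Wf"] @ u_k + P["Uf"] @ q_prev + P["bf"])   # forget gate, Eq. (1)
    I_k = sigmoid(P["WI"] @ u_k + P["UI"] @ q_prev + P["bI"])   # input/update gate, Eq. (2)
    O_k = sigmoid(P["WO"] @ u_k + P["UO"] @ q_prev + P["bO"])   # output gate, Eq. (3)
    C_k = np.tanh(P["WC"] @ u_k + P["UC"] @ q_prev + P["bC"])   # cell input, Eq. (4)
    S_k = f_k * S_prev + I_k * C_k                              # cell state update, Eq. (5)
    q_k = O_k * np.tanh(S_k)                                    # hidden state update, Eq. (6)
    return q_k, S_k

# Illustrative sizes: 8-dimensional inputs, N_h = 4 hidden units.
rng = np.random.default_rng(1)
P = {}
for g in ("f", "I", "O", "C"):
    P["W" + g] = rng.normal(scale=0.1, size=(4, 8))
    P["U" + g] = rng.normal(scale=0.1, size=(4, 4))
    P["b" + g] = np.zeros(4)

q, S = np.zeros(4), np.zeros(4)          # q_0 = 0, S_0 = 0
for u_k in rng.normal(size=(20, 8)):     # a sequence of 20 input vectors
    q, S = lstm_step(u_k, q, S, P)
print(q)
```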
For music classification, a model is proposed in [54], where the segment features are the statistics of frame features in each segment. The ISMIR database [55] is used, which includes a collection of songs from different genres. The model achieves an accuracy of 89.71%. In [56], a complex architecture is used, combining a Bidirectional Long Short-Term Memory (BLSTM) model with an attention mechanism, paired with a Graphical Convolutional Network. Three datasets are tested: GTZAN [57], ISMIR [55], and MagnaTagATune [58]. An accuracy of 93.51% is achieved.
For emotion prediction, in [59] the valence-arousal (V-A) emotion model was used to represent the dynamic emotion, using a BLSTM network. The dataset used was taken from the Emotion in Music task in MediaEval 2015 [60].
The problem of music source separation was studied using a BLSTM network for instrument detection and identification in [61]. Data augmentation was used during the training to avoid overfitting. To improve performance, the BLSTM network is combined with a feed-forward neural network, and the combination outperforms both individual networks. The SiSEC DSD100 dataset is used [62].
For MR, an architecture was developed in [63] that analyzes the connection between dance moves and music to recommend tracks. The database used is [64], which includes samples of synchronized dance and music. The dataset contains four classes of dance: waltz, tango, cha-cha, and rumba. The accuracy can reach up to 91.3%.
For singing voice detection, in [65] a Long-Term Recurrent Convolutional Network (LRCN) was considered for electronic music. The architecture consists of a voice separation step and a feature extraction step. The CNN layer extracts the audio features, and the LSTM layer uses the CNN output to differentiate between the singing and non-singing parts. The Arcadium [66] and NCS [67] were used as sources to create ‘‘Electrobyte,’’ a new copyright-free electronic music dataset. The model was also tested on a pop dataset, Jamendo [68], yielding an accuracy score of 0.833 (Electrobyte) and 0.939 (Jamendo). In [69], an LRCN architecture was developed for vocal separation and temporal smoothing. The CNN layer is again used for feature extraction, and the LSTM learns the time-sequence relationship. The model was tested on five datasets, the RWC pop music dataset [70], Jamendo [68], MedleyDB [71], MIR-1K [48], and iKala [72], yielding accuracy as high as 0.992.

D. CONVOLUTIONAL NEURAL NETWORKS (CNN)
CNNs are models that can operate on data with a grid-like structure [10], which is why they have been successful in problems involving IP, CV, NLP, and other fields [73]. In MIR, CNNs are often used to obtain information from music signals, which are mostly represented as two-dimensional time-frequency data.
A deep CNN model utilizes the convolution operation instead of the general matrix multiplication in at least one of its layers. In addition, the architecture consists of fully connected layers and pooling layers. The purpose of the latter is to reduce, in a computationally efficient manner, the size of the incoming data. Compared to a fully connected layer, a convolutional layer is characterized by a neuron's receptive field. This receptive field indicates that every single unit receives input from only a restricted area of the previous layer. As an activation function, most CNNs in the current literature use either the rectified linear unit (ReLU) function or some variant of it. ReLU is defined in [10] and can be expressed as
g(x) = max(0, x).   (7)
A general CNN architecture is depicted in Fig. 5.

FIGURE 5. General CNN architecture.
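The following sketch illustrates the typical pipeline described above for MIR: a two-dimensional time-frequency input (e.g., a mel spectrogram) passed through convolutional, ReLU, and pooling layers, followed by a fully connected classifier. It is a generic PyTorch illustration, not the network of any specific surveyed work; the layer sizes and the ten-class output (as in, e.g., genre classification) are assumptions.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Generic CNN over a (1 x mel_bins x frames) spectrogram; sizes are illustrative."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),   # pooling shrinks the time-frequency grid
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        z = self.features(x)
        return self.classifier(z.flatten(start_dim=1))

model = SpectrogramCNN()
mel = torch.randn(8, 1, 128, 216)   # batch of 8 clips, 128 mel bins, 216 frames
print(model(mel).shape)             # torch.Size([8, 10])
```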
In audio classification, an architecture was developed for spatial audio localization and classification between speech and music in [74]. Two different microphone arrangements were considered. The classification can achieve an accuracy of up to 97.9%. Although audio localization is not unique to music signals, it can be especially useful in MIR, such as in live audio processing. In [75], different CNN architectures are used for the classification of audio from videos, using a wide class of labels and a large dataset from YouTube, which is termed YouTube-100M. The ROC-AUC can reach up to 0.926. The Audioset [41] is also considered. In [76], a CNN is used for sound representation learning, using sound from an unlabeled video dataset gathered from the Flickr website. To improve its performance, the network is trained by transferring knowledge from networks that recognize images to networks that recognize sounds.
For music classification, in [15] a CNN is tested on the ISMIR dataset [55], a Latin Music Database (LMD) [77], and an African ethnic database provided by the Royal Museum of Central-Africa (RMCA) in Belgium [78]. In all cases, the CNN performed either equally well or better than other architectures. In [16], the CNN input consists of eight music features chosen in three music dimensions: dynamics, timbre, and tonality. This outperforms the use of a spectrogram. The GTZAN dataset [57] is used for the experiments, and an accuracy of 91% is reached. In [79], sample-level CNNs were used for auto-tagging using raw waveform data. The term ‘‘sample-level’’ refers to learning representations from


very small waveforms, like 2-3 samples. The MagnaTagATune [39] and Million Song Dataset [80] were considered, and an AUC of over 0.905 can be achieved. In [81], a 3D convolutional denoising autoencoder architecture is built for music classification, using the MIDI input format. The model gives out latent representations of the data, which are then used to classify the data with a multi-layer perceptron network. The Lakh MIDI dataset [82], [83] was used for testing, with accuracy surpassing 88% and a ROC-AUC of over 0.86.
CNNs are used for note onset detection in audio recordings in the early work [84] for sound event recognition. The use of a spectrogram as an input to the network instead of the enhanced auto-correlation yields better detection performance. The dataset used is combined from several different sources. In [85], a simple CNN was proposed for event recognition under noise, with only three layers: convolutional, pooling, and softmax. The databases used are the Real World Computing Partnership (RWCP) Sound Scene Database in Real Acoustic Environments [86], and the NOISEX-92 database [87]. The accuracy can reach up to 99%.
For singing voice separation, in [88], a CNN architecture was successfully developed that utilized pixel-wise classification on the spectrogram image. The model is trained using the Ideal Binary Mask as the target label and cross-entropy as the objective function. The iKala database [72] was used, as well as the DSD100 dataset [62], [89].
For singing voice evaluation, in [90], a one-dimensional CNN is used that applies fractional processing node theory for training, which reduces the training time. For the experiment, 100 music major students were selected to provide input. Accuracy can be as high as 86.3%.
For musical instrument identification, a CNN with a simple architecture is used for classification into 11 different classes in [91]. The MedleyDB database is used [71], and the accuracy surpasses 82%. In [92], three different weight-sharing strategies for CNNs are considered: temporal kernels, time-frequency kernels, and a linear combination of time-frequency kernels which are one octave apart. MedleyDB [71] is used for training and testing, with hybrid models having the best overall performance. In [93], a Temporal Convolutional Network was trained on a weakly labeled dataset. The OpenMIC-2018 [94] dataset was used for training and testing, and the MUSDB18 [50] for testing. The model slightly outperforms an LSTM model with respect to the ROC-AUC score, which makes it a strong candidate for such problems. Attention-augmented CNNs are used for instrument identification in [95]. When 25% of the filters are assigned to attention, the resulting CNN outperforms the attention-free ones. The datasets used were the London Philharmonic Orchestra Dataset [96] and the University of Iowa Musical Instrument Samples [97]. Judging from the consistently positive outcomes, it is reasonable to expect that AM-enhanced NNs will be extensively used for MIR in the future. In [98], identification is performed for four instruments: bass, drums, piano, and guitar. The model architecture consists of four identical, independent sub-models, each catering to one instrument. The Slakh dataset is used [99], and the AUC-ROC measure reached an average of 0.96, with the drums being easier to identify, and the guitar and piano being the more difficult ones.
In [100], a CNN is developed for emotion classification with 18 emotion tags, using time and frequency domain information. The experiments make use of the CAL500 [288] and CAL500exp [101] datasets. In [102], classification is performed specifically for film music, with 9 emotional classes. Each class is also associated with specific colors. The Epidemic Sound Online database [103] was used. The classification is performed using 30-second excerpts of tracks.
In [104], a feature combination CNN architecture for automatic playlist continuation is proposed, with collaborative filtering integrating information from curated playlists as well as song feature vectors. The databases used are Art of the Mix [105] and 8tracks [46]. In [106], distance measuring is used for the classification system, which is then used for the recommendation system. The GTZAN database [57] is used for training, and the Emotify music dataset [107] and Music Audio Benchmark Dataset (MABD) [108] for testing. The designed system can reach a good level of accuracy on the 10-best list. In [109], a CNN architecture is tested using the MIREX database [110], along with the Baidu Music service. The model has a ROC-AUC that can exceed 0.90.
For music transcription, a toolbox termed nnAudio was developed for audio-to-spectrogram conversion using one-dimensional CNNs in [111]. The MusicNet dataset [112] is used for testing. The toolbox can significantly reduce execution time compared to the existing librosa Python library [113].

E. GENERATIVE ADVERSARIAL NETWORKS (GAN)

FIGURE 6. GAN architecture.

Although RNNs and CNNs are the most popular MIR architectures, there have been studies that look at alternative networks for MIR. GANs (Fig. 6) were first proposed in the original version of [114]. A GAN consists of two competitive agents: a generator and a discriminator. Starting with a training set of real data, the generator is trained to generate new samples that follow the distribution of the real data, while the discriminator must identify the real from the artificial samples.
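The adversarial game between the two agents can be summarized by the following training-loop sketch, written under the standard non-saturating GAN loss. It is a deliberately simplified, generic illustration: the tiny networks, feature dimensions, and dummy batch are placeholders, not the setup of any surveyed work.

```python
import torch
import torch.nn as nn

# G maps a 64-dimensional noise vector to a fake sample; D scores real vs. fake.
# Both networks are placeholder toys; real batches are assumed to be 128-dim vectors.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
D = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    # 1) Discriminator: label real data 1, generated data 0.
    z = torch.randn(real.size(0), 64)
    fake = G(z).detach()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: try to make D score the generated data as real.
    z = torch.randn(real.size(0), 64)
    loss_g = bce(D(G(z)), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(16, 128)))  # one step on a dummy batch
```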

For emotion classification, a GAN is proposed in [115] that utilizes a double-channel fusion strategy to extract local and global features of an input voice or image. Five emotion classes are considered: sad, happy, quiet, lonely, and miss. The information used in the experiments comes from a number of websites, such as Kuwo Music Box, Baidu Heartlisten, and others. The recognition rates achieved are between 87.6% and 91.2% for all emotions.
In [116], an architecture combining computer vision and note recognition is proposed for music notation recognition. The experiments make use of several datasets, including the JSB Chorales [117], Maestro [118], Video Game [119], Lakh MIDI [82], [83], and another MIDI dataset. The recognition accuracy ranges from 0.88 to 0.92 for all the datasets. The proposed model's intended application is music education.
For singing voice separation, in [120], a GAN with a time-frequency masking function is used. The databases MIR-1K [48], iKala [72], and DSD100 [62], [89] are used in the experiments, and the model outperforms a conventional DNN.

F. CONVOLUTIONAL RNNs (CRNN)
Complementary to standard models, more complex ones have been developed that utilize couplings between different architectures, often in a series interconnection, to combine their characteristics and improve performance. Convolutional RNNs (CRNNs) are one such example.
For music classification, a CRNN was considered in [121], which is a CNN with the last layers replaced by an RNN. The CNN part is used for feature extraction and the RNN part as a temporal summarizer. The Million Song Dataset [80] is used for training, to predict genre, mood, instrument, and era. The model outperforms other architectures with respect to AUC-ROC.
For MR, a CRNN is used in [122] for classifying and recommending music, in the categories of classical, electronic, folk, hip-hop, instrumental, jazz, and rock music. The database used is the Free Music Archive [123]. The system was tested on a group of 30 users, and the best architecture was the one that implemented a cosine similarity, along with information on music genre.

G. CNN-LSTM
Similarly to CRNNs, some works combine the architectures of CNNs with LSTMs. For emotion classification, a model in [124], consisting of a 2D input passed through a CNN-LSTM and a 1D input passed through a DNN, combines two types of features and improves audio and lyrics classification performance. Four classes are considered: angry, happy, relaxed, and sad. The dataset used is the Last.fm tag subset of the Million Song Dataset [80], with an average accuracy of 78%. In [125], a novel database of Turkish songs is constructed for experimentation. The model uses a CNN as the feature extractor and an LSTM with a DNN as the classifier. An accuracy of over 99% is obtained. In [126], the model extracts features from the lyrics, combining a word vector and a CNN-LSTM architecture, with a word frequency weight vector along with a DNN. The outputs of the two architectures are combined through a matching attention mechanism to derive the text emotion classification. Four classes are considered: happy, sad, healing, and calm. The classification accuracy for all emotions ranges between 0.809 and 0.903.
For music score recognition, the architecture proposed in [127] takes as input an image of a music score and outputs the duration, pitch, and coordinates for each note. Data from MuseScore [128] were used for the experiments, and the model outperforms other architectures with respect to all accuracy measures.
For sound event recognition, [129] considers polyphonic sounds, for a wide family of 61 classes, including music, taken out of a dataset of ten different daily contexts, like a sports game, a bus, a restaurant, and more [130]. The model achieves an average f1-score of around 65%.

H. ARCHITECTURE OVERVIEW
From the above review, it is clear that the ‘‘classical’’ DL models perform well in a variety of MIR tasks. However, the models under consideration need to be appropriately designed, so that they can achieve good results for their set problem. Thus (and in accordance with the no free lunch theorem), there is no architecture that can be considered holistically better than the rest. On the contrary, complex architectures that incorporate layers of different types are the most promising, since they combine the best characteristics of each DL module, as discussed in Section IV.
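As a concrete picture of such a hybrid design (a generic sketch under assumed layer sizes, not the model of [121] or [122]), a CRNN can use a small convolutional front end as the feature extractor and a GRU as the temporal summarizer:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front end over a spectrogram, GRU as temporal summarizer; sizes are illustrative."""
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        self.rnn = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, 1, n_mels, frames)
        z = self.cnn(x)                       # (batch, 32, n_mels/4, frames/4)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, frames/4, 32 * n_mels/4)
        _, h = self.rnn(z)                    # h: (1, batch, 64), summary of the sequence
        return self.out(h.squeeze(0))

print(CRNN()(torch.randn(4, 1, 128, 200)).shape)  # torch.Size([4, 10])
```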
III. DL METHODS FOR MUSIC GENERATION
In this section, the application of DL in MG is reviewed. Automatic MG utilizes the MIR techniques mentioned in the previous section to generate novel music scores of desired characteristics, like genre, rhythm, tonality, and underlying emotion. The resulting output can either be a music track in the form of audio, so that it can be directly listened to, or it can be in a symbolic notation form. Along with the generation of novel tracks, some tasks can be considered adjacent to MG. One such application is Genre Transfer (GT). This refers to preserving key content characteristics of a music score and applying style characteristics that are typical of a different genre. An example would be transforming a pop song into its heavy metal cover. Another application is Music Inpainting (MI), which refers to filling a missing part of a music track, using information from the rest of its content. Again, the section is divided into subsections based on the DL architecture used. The public databases used in each work are also mentioned. Table 2 summarizes the reviewed works for MG, categorized by their architecture.
The MG architectures can be evaluated both objectively and subjectively. Objective evaluation refers to using mathematical and statistical tools to measure the similarity of the generated music tracks to the training dataset, as well as other characteristics that can measure their similarity to real music.


For objective evaluation, there are several measures, including the loss and accuracy of the training process, the empty bar rate, polyphonicity, notes in scale, qualified note rate, tonal distance, and the note length histogram, among others. Most studies consider a subset of these measures or similar ones, so the reader can refer to each work for details.
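As an illustration of how such objective measures are typically computed (a generic sketch following common definitions, not the exact implementation used by any surveyed work), two of them can be evaluated directly on a binary piano-roll, where rows are pitches and columns are time steps; the bar length of 16 steps is an assumption.

```python
import numpy as np

def empty_bar_rate(piano_roll, steps_per_bar=16):
    """Fraction of bars in which no note is active (piano_roll: pitches x time, 0/1)."""
    n_bars = piano_roll.shape[1] // steps_per_bar
    bars = piano_roll[:, :n_bars * steps_per_bar].reshape(piano_roll.shape[0], n_bars, steps_per_bar)
    return float(np.mean(bars.sum(axis=(0, 2)) == 0))

def polyphonicity(piano_roll):
    """Fraction of time steps in which more than one pitch sounds simultaneously."""
    notes_per_step = piano_roll.sum(axis=0)
    return float(np.mean(notes_per_step > 1))

roll = (np.random.default_rng(2).random((128, 64)) > 0.97).astype(int)  # dummy 4-bar roll
print(empty_bar_rate(roll), polyphonicity(roll))
```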
For subjective evaluation, a test audience is usually given a collection of DL-generated tracks from different architectures, along with human compositions, and is asked to rate them with respect to different aspects, usually on a five-point Likert scale. Variations of this include comparing pairs of tracks and choosing which one they prefer the most, or being asked to decide if a track is computer- or human-made. In the following sections, we point out which works have conducted subjective evaluations, as the positive audience perception of AI music tracks is essential for the future applicability of MDL. The reader can again refer to each work for the extensive presentation of the evaluation results.
As a closing note, it is worth mentioning an issue that emerges from the field of AI-based MG, that of copyright [131], [132]. As AI methods use different software and sample databases, legal problems may arise when claiming authorship of the final musical product. It is thus important that legislators update the existing policies, to avoid raising such issues in the future.

TABLE 1. Deep learning methods for music information retrieval.

A. RNNs
As with MIR, RNNs have proved popular for MG tasks.
For works on classical music, the model termed SampleRNN [133] generates one audio sample at a time, with the resulting signals receiving positive evaluation from human listeners. Three different datasets were considered: one containing a female English voice actor, one containing human sounds like breathing, grunts, coughs, etc., and one containing Beethoven's piano sonatas, taken from the Internet Archive [134]. The models were evaluated by a human group, with the samples of the 3-tier model gaining the highest preference. In [117], an RNN model termed DeepBach is designed, for generating hymn-like scores mimicking the style of Bach. The dataset is taken from the music21 library [135]. The model offers some control to the user, allowing the placement of constraints like notes, rhythms, or cadences in the score. The model was evaluated by human listeners of varying expertise, who were given several samples and had to guess between Bach and computer generated. Around 50% of the time, the computer tracks were passed as real samples, which is a very satisfying result for such complex music. The work was expanded in [136], with an architecture termed Anticipation-RNN, which again offered control to the user to place defined positional constraints. The music21 library [135] was used once again.
In [137], a Graphical User Interface (GUI) system termed BachDuet was developed for promoting classical music improvisation training through user and computer interaction. The JSB chorales data from the music21 dataset [135] is used for training. The GUI was warmly received by test users, who found the improvisation interaction easy to use, enjoyable, and helpful for improving their counterpoint improvisation skills. Additionally, a second group of participants was asked to listen to music clips, rate them, and also decide whether they resulted from a human-machine improvisation using BachDuet, or human-human interaction. Both types of tracks received similar scores, and the listeners were also unable to differentiate between the duets, as they wrongly classified them around 50% of the time.
In [138], the model produces drum rhythms for a seven-piece drum kit. Natural language translation was used to express the hit sequences. An online interface was designed and evaluated by users, who gave an overall average to positive score.
In [139], the effects of different conditioning inputs on the performance of a recurrent monophonic melody generation model are studied. The model was trained on the FolkDB dataset [140] and a novel Bebop Jazz dataset. The validation Negative Log Likelihood loss (NLL) can be as low as 0.190 for the pitch and 0.045 for the duration.
In [141], the problem of inpainting was considered, with a model that combines a VAE that takes as input past and future context sequences, with an RNN that takes as input the latent vectors from the VAE, and outputs a latent vector sequence that is passed through a decoder, to create the inpainting sequence. A folk dataset from The Session [142] is used for testing. The model outperforms others with respect to the NLL measure. The architecture was also tested by users, who were given pairs of segmented sequences and had to choose among

excerpts that fit. The model performance was on the same level as other architectures.

B. LSTMs
LSTMs have been considered for several scenarios. In [143], data preprocessing has been applied to improve the quality of the generated music, and also reduce training time.
In [144], BLSTM networks are used for chord generation. The database used was Wikifonia, which is now inactive, and which included sheets for several music genres [145]. The user evaluation showed a preference for the BLSTM model over others, although the original music still received the highest score.
In [146], BLSTM is used for chord generation. The model consists of three parts: a chord generator, which uses some starting chords as input, a chord-to-note generator, which generates the melody line from the generated chords, and a music styler, which combines the chords and melody into a final music piece. Multiple music genres were used as a training database, including Nottingham [147], a collection of British and American folk tunes, Wikifonia [145], and the McGill-Billboard Chord Annotations [148]. The model was evaluated by listeners, who gave a score ranging from neutral to positive, taking into consideration harmony, rhythm, and structure.
In [149], a combination of two LSTM models, termed CLSTMS, is used to build chords that can match a given melody. One sub-model is used for the analysis of measure note information, and the other is used for chord transfer information. Wikifonia is used, with data taken from [144] and [145].
In [150], a variation of Biaxial LSTM was used, and a model termed DeepJ was developed for MG. The model was tested on three types of music, baroque, classical, and romantic, with test participants being able to successfully categorize the generated samples most of the time. The Piano-MIDI dataset [151] was used. The model is also capable of mixing musical styles by tuning the values of a single input vector.
In [152], a two-stage architecture is proposed that utilizes BLSTM, where the harmony and rhythm templates are first produced, and the melody is then generated and conditioned on these templates. The Wikifonia dataset is used [145]. In the subjective evaluation, participants were given a collection of tracks and were asked to rate them according to how much they found them pleasing and coherent, and whether they believed they were human or AI-generated. The highest scores were achieved by the model where the melody generator is conditioned on an existing chord and rhythm scheme from a real song. This melody is also perceived as human-made by many participants. The authors also noted that there are high standard deviations in all answers, and slightly more so in the models rated positively, indicating that there is a much wider perception of what is considered good-sounding music than of what is considered bad-sounding.
In [153], an architecture combining an LSTM with a Recurrent Temporal Restricted Boltzmann Machine is designed. Experiments were conducted on MuseData [154], a classical music dataset, and the JSB chorales [155] dataset. The model outperforms other architectures with respect to the Log-likelihood (LL) and frame-level accuracy (ACC%) measures.
In [156], variations of the LSTM are discussed, termed Tied Parallel LSTM with a neural autoregressive distribution estimator (NADE), and Biaxial LSTM. The model was tested on the datasets of JSB Chorales [155], MuseData [154], Nottingham [147], and Piano-MIDI [151], a classical piano dataset. The architectures perform well concerning the Log-likelihood measure. The architectures also have translation invariance.
In [157], an RNN-LSTM architecture is proposed, using the Mel cepstrum coefficients as features. The dataset consists of folk tunes collected by the author. The model achieves an accuracy of 99% and a loss rate of 0.03.
In [158], a model termed Chord conditioned Melody Transformer (CMT) is proposed, which generates rhythm and pitch conditioned on a chord progression. The training has two phases: first, a rhythm decoder is trained, and second, a pitch decoder is trained based on the rhythm decoder. The model was trained on a novel K-Pop dataset. In addition to various measures, like rhythm accuracy, the model was also evaluated by listeners, with respect to rhythm, harmony, creativity, and naturalness. The model outperforms the Explicitly-constrained conditional variational auto-encoder (EC2-VAE) [159] with respect to rhythm, harmony, and naturalness. The model also has a higher score for creativity than the real dataset tracks, meaning that it can indeed generate novel melodies.
In [160], an LSTM specifically for Jazz music was designed, using a novel Jazz music dataset in MIDI format, and the Piano-MIDI [151]. The model can also generate music using only a chosen instrument. The model can achieve a very low final loss value.
In [161], a BLSTM network with attention is considered for Jazz MG. The architecture consists of a BLSTM network, an attention layer, and another LSTM layer. The Jazz ML ready MIDI dataset [162] is considered. The model outperforms simpler architectures, like the LSTM without attention and the attention LSTM without the BLSTM layer.
In [163], a piano composer is designed that uses information from given composers to generate music. The datasets used were Classical Music MIDI [164] and MIDI_classic_music [165], from which tracks of Beethoven, Mozart, Bach, and Chopin were considered. The model was evaluated through a human survey, where participants had to choose the real sample among the computer-generated and composer ones. Around half the time, people mistook the model-generated music for the human-composed track, meaning that the model can generate music that is relatively indistinguishable from real samples. The generated tracks can also be perceived as fairly interesting, pleasing, and realistic.
In [166], an architecture comprising an LSTM paired with a feed-forward layer can generate drum sequences resembling a learned style, and can also match up to set
constraints. The LSTM part learns drum sequences, while the feed-forward part processes information on guitar, bass, metrical structure, tempo, and grouping. The dataset was collected from 911tabs [167], and broken into three parts, for 80s disco, 70s blues and rock, and progressive rock/metal, with the model being effective in all styles.
Finally, in [168], the MI problem was considered by combining half-toning and steganography, and various methods were compared using a dataset of various instruments, with satisfying results for the considered models.

C. CNNs
For CNN architectures, in [169], the architecture comprises an LSTM as a generator, a CNN as a discriminator, and a control network that introduces restriction rules for a particular style of music generation. The matching subset of the Lakh MIDI dataset (LMD) [82] and the Piano-MIDI dataset [151] were used. The model was evaluated by music experts, with respect to melody, rhythm, chord harmony, musical texture, and emotion. The model is rated higher than other ones in all of the above aspects.
In [170], a CNN with a Bidirectional Gated Recurrent Unit (BiGRU) and attention mechanism is used for folk music generation. The ESAC dataset [171] is used for testing. The results were evaluated by listeners, who gave overall positive ratings, although lower than the real ones. There were also some exceptions of low scores, meaning that the model generation may have some inconsistencies in its performance.
In [172], a Convolution-LSTM for piano track generation is considered. The CNN layer is used for feature extraction, and the output is fed into the LSTM for music generation. Piano tracks from Midiworld [173] were used for training. The model was evaluated by listeners, who were given 10 music segments and had to decide whether they were human-made or computer generated. In most cases, the segments were correctly identified, but the Convolution-LSTM model performed better than the simple LSTM.

D. GANs
Symbolic music is stored using a notation-based format, which makes it an easier-to-use input for training NNs. For symbolic music generation, a GAN model is proposed in [174] for piano roll generation, equipped with LSTM layers in the generator and discriminator. The generated files were evaluated by participants with respect to melody and rhythm, and the proposed model received a higher score than files generated from other architectures.
In [175], an inception model conditional GAN termed INCO-GAN is proposed that can generate variable-length music. This complex architecture consists of two phases, training and generation, and each phase is broken into three processes: preprocessing, CVG training, and conditional GAN training for the training stage, and CVG executing, phrase generation, and postprocessing for the generation phase. The Lakh MIDI dataset is used for the experiments [82]. The model achieves high cosine similarity with the human-composed music for the frequency vector.
In [176], the problem of symbolic music GT was studied using CycleGAN, a model consisting of two GANs that exchange data and are trained simultaneously. The model was evaluated using genre classifiers, verifying the successful style transfer.
In [177], DrumGAN is proposed, an architecture for generating drum sounds (kick, snare, and cymbal). The model offers user control over the resulting score, by tuning the timbre features.
In [178], the authors generated log-magnitude spectrograms and phases directly with a GAN, producing more coherent waveforms than directly generating waveforms with strided convolutions. The resulting scores are generated at a much higher speed. The NSynth dataset [179] is used, which contains single notes from many instruments, at different pitches, timbres, and volumes. The human audience rated the audio quality of the tracks, and the model was received as slightly inferior to the real tracks.
In [180], a GAN equipped with a self-attention mechanism is used to generate multi-instrument music. The self-attention mechanism is used to allow the extraction of spatial and temporal features from data. The Lakh MIDI [82] and Million Song [80] datasets were used here.
In [181], a GAN was designed for symbolic MG, along with a conditional mechanism to use available prior information, so that the model can generate melodies either starting from zero, by following a chord sequence, or by conditioning on the melody of previous bars. Pop music tabs from TheoryTab [182] were used. The resulting system, termed MidiNet, is compared to Google's MelodyRNN and performs equally well, with the test audience characterizing the results as being more interesting.
In [183], multi-track MG was considered using three different GAN models, termed the Jamming, Composer, and Hybrid models. The Jamming model consists of multiple independent generators. The Composer consists of a single generator and discriminator, and a shared random input vector. In the Hybrid model, the independent generators have both an independent and a shared random input vector. The models were trained on a rock music database and used to generate piano rolls for bass, drums, guitar, piano, and strings. The database is termed the Lakh Pianoroll Dataset, as it is created from the Lakh MIDI [82], by converting the MIDI files to multi-track piano rolls. A subset is also used with matched entries from the Million Song dataset [80]. In addition to using the training database, the model can also use as an input a given music track from the user and generate four additional tracks from it. The model was evaluated by professional and casual users and received overall neutral to positive scores.
In [184], the Sequence Generative Adversarial Net (SeqGAN) is proposed, which applies a policy gradient update. The Nottingham folk dataset [147] is used in the experiments. The model outperforms a maximum likelihood estimation (MLE)
trained LSTM with respect to the mean squared error and other measures.
In [185], sequence generative GANs were considered for polyphonic music generation. The method condenses the duration, octaves, and keys of melodies and chords into a one-word vector representation. The Nottingham dataset [147] was used. The results were well received by a test audience, with respect to pleasantness, realism, and interest.
In [186], a conditional GAN is proposed for long inpainting, up to a few seconds. The model was trained on datasets of increasing complexity, like the Lakh MIDI [82] and Million Song [80], the Maestro dataset [118], recordings of grand pianos, and the Free Music Archive dataset [123], and extensive audience experiments were performed to evaluate the model. The inpaintings were generally detectable, especially in tracks with higher complexity, but were considered only slightly disturbing or non-disturbing.

E. TRANSFORMERS
Transformers constitute a relatively recent architecture [187], which has found popularity in NLP. A key aspect of transformers is self-attention, which refers to the process of weighting the relevance between different positions of a single sequence. Transformers process sequential input data, but not necessarily in order.
The transformer's architecture is basically an encoder-decoder scheme. The encoder maps the sequence of inputs (x_1, . . . , x_N) to a sequence of vector representations (z_1, . . . , z_N). The decoder then takes this vector representation and generates a sequence of outputs (y_1, . . . , y_M), one at a time.
Let W_q, W_k, W_v be the three parameter matrices that are trained. These matrices are used to define the following quantities:
• Query: q = W_q x_i
• Key: k = W_k x_i
• Value: v = W_v x_i
The self-attention score is calculated as follows. For every input, the goal is to calculate how it attends to all the tokens in the sequence. To achieve this, the query vector is used, and since every token becomes the query once, we calculate
e_ij = q_i k_j, with i, j ∈ {1, . . . , N}.   (8)
To have more stable gradients, normalization is performed as
s_ij = e_ij / d_k^{1/2}.   (9)
The final step is to calculate the self-attention score as
z_i = \sum_{j=1}^{N} \frac{\exp(s_{ij})}{\sum_{l=1}^{N} \exp(s_{il})} v_j.   (10)
In practice, the aforementioned procedure is performed in matrix form and is depicted in Fig. 7.

FIGURE 7. Self-attention mechanism.
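The matrix form of Eqs. (8)–(10) amounts to a softmax over scaled dot products. The following NumPy sketch (a generic illustration with arbitrary, assumed dimensions, not code from any surveyed work) computes the self-attention outputs for a sequence of N token embeddings:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention in matrix form, following Eqs. (8)-(10).
    X: (N, d_model) token embeddings; returns Z: (N, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
    E = Q @ K.T                                      # e_ij = q_i . k_j          (8)
    S = E / np.sqrt(K.shape[1])                      # s_ij = e_ij / sqrt(d_k)   (9)
    S = S - S.max(axis=1, keepdims=True)             # numerical stabilization
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # softmax over j
    return A @ V                                     # z_i = sum_j A_ij v_j      (10)

rng = np.random.default_rng(3)
N, d_model, d_k = 6, 16, 8                           # illustrative sizes
X = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)                                       # (6, 8)
```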
Modifications of the simple transformer are proposed in various works. In [188], a relative attention mechanism is used to generate minute-long compositions, with reduced intermediate memory requirements, from quadratic to linear. The JSB chorales dataset [155] and the Piano-e-Competition dataset [189] were used. The model was evaluated by listeners, who were asked to rate pairs of musical excerpts. The model outperformed other architectures and was seconded only by the real music tracks.
In [190], an adversarial transformer is proposed to generate single-track or multitrack music. The results were positively received by a test audience, who rated tracks with respect to being human-like, harmonious, rhythmic, structured, fluent, and with respect to overall quality. The model scores better compared to another architecture, and much closer to the real track scores.
In [191], sparse factorization was applied to the attention matrix, which reduced the memory and time requirements from quadratic to sub-quadratic. Five-second-long samples were generated. A piano recording dataset from [192] was used for training.
In [193], a model termed Pop Music Transformer is proposed to generate pop piano music. The model uses a beat-based music representation. The generated tracks were evaluated by experts and casual listeners and were preferred by both groups over other architectures.
In [194], a model termed Jukebox can generate music along with vocals in various musical styles. The model uses multiscale Vector Quantization - Variational Autoencoders (VQ-VAE) to compress the raw audio input to discrete codes. Then the output is generated using an auto-regressive transformer. The architecture provides lyric conditioning, to control the singing part. The Maestro dataset [118] was used for training, and LyricWiki (now closed) was used to gather metadata, among others. The model can generate music in any chosen style by supplying conditioning signals during training.
In [195], a model for symbolic MG for Mandarin pop is proposed, where the transformer training considers the conditioning sequence as thematic material. The POP909 dataset is used [196]. The model was evaluated by participants, on the aspects of theme controllability, repetition, timing, variation,


and overall structure and quality. The proposed model outperforms others in all metrics.

In [197], conditional drum generation is considered, inspired by [166]. A BLSTM encoder receives the conditioning parameter information, and a transformer-based decoder with relative global attention generates the drum sequence. A subset of rock and metal songs from the Lakh MIDI dataset is used [82]. For subjective evaluation, participants were given a set of three tracks, two being the accompanying or condition tracks, and the third being the drum track to be evaluated. They were asked to rate the drum tracks with respect to rhythm, pitch, naturalness, groove, and coherence. The tracks generated from the proposed model outperform another baseline model and are even rated higher than real compositions with respect to naturalness, groove, and coherence. The users were also asked their opinion on whether the given drum tracks each time were real compositions or computer generated. The drum tracks from the model were perceived as computer generated only 39% of the time, indicating the natural feel of the tracks.

In [198], the problem of melody harmonization was considered. The model maps lower-level melody notes into semantic higher-level chords. Three architectures are proposed, using a standard transformer, a variational transformer, and a regularized variational transformer. The Chord Melody [199] and Hooktheory Lead Sheet [200] datasets are used. In the human evaluation conducted, participants, comprising casual music listeners and professionals, were asked to rate samples with respect to harmonicity, unexpectedness, complexity, and preference. The standard model achieved the highest scores in harmony and preference, whereas the variational model achieved the highest in unexpectedness and complexity.

F. ARCHITECTURE OVERVIEW
As with the case of MIR, it is clear that there is no single architecture that can outperform the rest in MG tasks. Multi-layered architectures, though, can be a path for building better models, especially when additional objectives are set, like conditioning the generated music to desired features.

TABLE 2. DL methods for MG.

IV. FUTURE STUDIES IN MDL
In this section, future research directions in MDL are identified and discussed.

A. MIXED ARCHITECTURES
So far there have been multiple approaches and different architectures to address key problems in MDL. However, despite most works reporting positive results, due to the complexity of the applications under study and their peculiarities, there is no dominant method that should be followed for a given task. Thus, there is no overall superior architecture that is guaranteed to outperform all others for any given MDL problem.

On the other hand, results indicate that the best approach to constructing holistically better models, which can consistently yield improved results, is to consider combined architectures, like CRNNs [121], [122] or LRCNs [65], [69]. Such approaches can harness the individual characteristics of each model to surpass their counterparts. Attention-mechanism-enhanced architectures are one such example [56], [95], [126], [161], [180], with more being developed [201], [202], [203], [204]. Such approaches will surely lead the advances in the MDL field.
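As an illustration of the kind of combined architecture discussed above, the sketch below stacks a small convolutional front end with a bidirectional GRU, in the spirit of a CRNN for clip-level music tagging. The input shape (a log-mel spectrogram), the layer sizes, and the number of classes are illustrative assumptions and are not taken from [121], [122], [65], or [69].

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal convolutional-recurrent network for clip-level music tagging.

    Input:  log-mel spectrogram of shape (batch, 1, n_mels, n_frames).
    Output: class logits of shape (batch, n_classes).
    """
    def __init__(self, n_mels=96, n_classes=10, rnn_units=64):
        super().__init__()
        # CNN front end: learns local time-frequency patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # RNN back end: models the temporal evolution of the CNN features.
        self.rnn = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=rnn_units,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * rnn_units, n_classes)

    def forward(self, x):
        h = self.cnn(x)                           # (B, 64, n_mels/4, T/4)
        h = h.permute(0, 3, 1, 2).flatten(2)      # (B, T/4, 64 * n_mels/4)
        _, h_n = self.rnn(h)                      # h_n: (2, B, rnn_units)
        h_n = torch.cat([h_n[0], h_n[1]], dim=1)  # concatenate both directions
        return self.head(h_n)

model = CRNN()
logits = model(torch.randn(8, 1, 96, 256))        # e.g. 8 clips, 96 mel bands, 256 frames
print(logits.shape)                               # torch.Size([8, 10])
```

The design intent is the one described above: the convolutional stage summarizes local spectro-temporal structure, while the recurrent stage captures longer-range dependencies across the clip.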
Apart from hybrid architectures, MDL will also benefit significantly from the fusion of diverse input modalities. This would increase performance, as the conjunction of different modalities can help build connections between different features. For example, in [76] sound signals were extracted from unlabelled video sources. In [205], singing signals were combined with laryngoscope images for voice parts division. In [206], a system combining heart rate measurements and facial expressions was composed to detect drowsiness in drivers, which is accompanied by a music recommendation system used as a countermeasure to avoid accidents. In [63] and [64], a synchronized music and dance dataset was used for recommendation. In [207], music emotion classification is performed for four emotional classes, combining features from lyrics and acoustics. These are indicative examples of an emerging trend of bridging the gap between different modalities.
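As a minimal sketch of how such a fusion could be wired, the snippet below concatenates an audio embedding and a lyrics embedding before a shared classifier, loosely echoing the lyrics-plus-acoustics setting of [207]. The embedding dimensions, projection layers, and four-class output are illustrative assumptions rather than the setup of any cited work.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Illustrative late-fusion model joining an audio embedding with a
    lyrics embedding before classification. Encoders and sizes are placeholders."""
    def __init__(self, d_audio=128, d_lyrics=256, d_hidden=64, n_classes=4):
        super().__init__()
        self.audio_proj = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
        self.lyrics_proj = nn.Sequential(nn.Linear(d_lyrics, d_hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, audio_emb, lyrics_emb):
        # Each modality is projected to a common space and concatenated,
        # so the classifier can exploit cross-modal correlations.
        z = torch.cat([self.audio_proj(audio_emb), self.lyrics_proj(lyrics_emb)], dim=-1)
        return self.classifier(z)

model = LateFusionClassifier()
logits = model(torch.randn(8, 128), torch.randn(8, 256))   # batch of 8 tracks
print(logits.shape)                                         # torch.Size([8, 4])
```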
For the above techniques, an ever-present problem is the computational cost of training [208]. The increase in hardware requirements creates practical issues with energy


consumption and environmental footprint, which under the
scope of the global energy and environmental crisis, are
mandatory to address. Addressing the above will require the
performance improvement of current architectures, or the
consideration of different ones [209]. Understandably, any
improvements in the computational cost will, by extension,
also boost the commercialization of MDL applications.

B. TRADITIONAL MUSIC
Most of the existing works use widely available training
databases, which mainly include western music genres, like
classical music, pop, rock, metal, jazz, blues, etc. Using
widely established music genres makes sense, due to their
popularity, but it is highly important to enrich and diversify
the training databases by including more genres. So, while it
is essential to consider new and emerging genres, especially
ones that are computer-based, like electronic, synth-wave,
and vaporwave [65], [210], [211], [212], another trend that
is gaining popularity is the application of MDL and MG for
traditional and regional music. Traditional music refers to
music originating from a specific country or region and is
closely tied to its culture [213]. Examples include the recita-
tion of religious excerpts like the Holy Qur'an [214], and
traditional music from different regions, like Byzantine [215],
Greek [216], [217], [218], Persian [219], Chinese [220],
[221], Indian [222], [223], and many more.
In the development of MDL for regional and traditional
music, several challenges may appear, as a result of the distinct nature of the topic. One issue is dataset availability, which, in contrast to western popular music, is in many cases hard to gather, especially in the large amounts required for optimal training. In most cases, the research groups take it upon themselves to build their own dataset, due to the lack of existing ones, so hopefully, in the future, more authorities will help towards building free databases [77], [78], [142], [196], [221], [224], [225], [226]. For this task, recording difficulties may arise, especially for recordings made outside a music studio, with varying acoustics, for example in religious singing. Coming along with the problem of dataset collection is that of appropriate feature tagging of the tracks. This is strenuous work that requires time, and often the collaboration of music experts, for tasks like the annotation of music features, and testing audiences, for more ambiguous characterizations, like the emotion that a track evokes.

Moreover, many musical instruments, like the guitar and piano, are present in almost all music genres, so it is easier to adapt MG architectures for a specific instrument to many different styles. This may not be the case for regional instruments, which are only used for playing a region's traditional music. So, for preserving and learning musical styles through DL, it is essential to build datasets for specific instruments [221]. Finally, many traditional music styles have a distinct musical notation, like Mensural notation, Chinese Gongche, and Organ tablature, meaning that MDL architectures for transcription, pattern recognition, and symbolic MG would have to be adjusted to fit the characteristics of each genre. This again requires the existence of appropriate databases for different musical notations.

Overall, it seems that there are still several practical challenges to fully developing DL for traditional music. These are steadily addressed by the efforts of several research groups over the world. Table 3 lists the recent works that study Traditional Music Deep Learning (TMDL), categorized by music type. These works offer great service to the preservation of history, culture, and art, as the digitization, study, and generation of traditional music will help open it up to new generations of listeners and also promote thematic (music, religious) tourism. Thus, it is expected that more research groups will contribute to regional MDL in the future, and hopefully, such research endeavors will also receive governmental support and recognition.

TABLE 3. List of DL studies focused on traditional music.

C. MEDICAL APPLICATIONS
The field of Music Therapy (MT) lies at the intersection of Medicine and Music. MT is an evidence-based approach for treating a plethora of pathological conditions, including, among others, anxiety, depression, substance abuse, Alzheimer's, eating disorders, sleep disorders, and more [261], [262], [263]. Naturally, DL can prove a valuable tool to therapists and patients, as a complement to existing treatments. Table 4 summarizes the recent applications of DL in music therapy, categorized by architecture. The conditions that have been addressed include music remixing to


improve cochlear implant performance, effective MRS and MG for mood transformation, including anxiety and depression, MG for stimulating the musical memory in patients with Alzheimer's, MG for relieving Tinnitus, and voice parts classification for vocal art medicine. Existing architectures of DL for tasks like music recommendation and emotion classification can be adapted to fit many of the above conditions. For example, music recommendation systems can be updated to make suggestions based on emotion and mood, using a collection of patient inputs, like facial expressions, and other physiological signals, like heart rate, temperature, respiratory rate, EEG signals, and more. By designing appropriate user interfaces [40], [117], [137], MDL architectures could also be used as an entertainment and educational tool, especially for interventions with children. Finally, it would also be interesting to see if knowledge transfer could be applied to models developed for treating conditions with overlapping symptoms, for example, anxiety and depression.

TABLE 4. DL methods for music medical applications.

MT is a field that is constantly developing, with medical researchers turning to it as a method for effectively treating, or reducing the symptoms of, many conditions. By developing proper training databases and MIR and MG architectures, DL will help in establishing open-access tools that can be used by anyone, without the need for increased medical expenses. Moreover, tools like MRS for mood transformation can be directly available to patients, providing daily help coverage. Overall, there are many promising future directions to be considered by researchers.

D. MUSIC GENERATED FROM DYNAMICAL SYSTEMS
Another field that would also be interesting to consider is that of chaos-based music generation [278], [279], [280], [281], [282], [283]. In this interdisciplinary field, which bridges MG with the rich area of chaos theory, the time series solution of a chaotic system is used as a high-entropy source for music generation, in tuning parameters like the extraction of musical pitches, the duration of a musical note, the amplitude, and the velocity. Chaotic systems are characterized by non-periodicity and sensitivity to parameter changes, meaning that two solutions of the same system, starting from almost identical initial configurations, will quickly diverge from each other, yielding two different, non-periodic time series. This feature can thus be exploited in MG, as it can aid in the generation of non-repeating musical patterns. So exploring DL methods in this area could give rise to applications in numerous fields, including medical treatment [284], [285], and possibly secure communications [286] and system identification [287].
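As a rough sketch of the idea, the snippet below iterates the logistic map, a textbook chaotic system, and quantizes its trajectory onto MIDI pitches and note durations. The choice of map, scale, and mapping rules is an illustrative assumption and does not reproduce any scheme from [278], [279], [280], [281], [282], [283].

```python
import numpy as np

def logistic_series(n, r=3.99, x0=0.123):
    """Iterate the logistic map x_{k+1} = r * x_k * (1 - x_k) (chaotic for r close to 4)."""
    x = np.empty(n)
    x[0] = x0
    for k in range(1, n):
        x[k] = r * x[k - 1] * (1.0 - x[k - 1])
    return x

def series_to_notes(series, scale=(60, 62, 64, 65, 67, 69, 71, 72)):
    """Quantize each chaotic sample in (0, 1) onto a C-major octave of MIDI pitches."""
    idx = np.minimum((series * len(scale)).astype(int), len(scale) - 1)
    return [scale[i] for i in idx]

pitches = series_to_notes(logistic_series(16))
durations = 0.25 + 0.5 * logistic_series(16, x0=0.456)   # note lengths in seconds
print(list(zip(pitches, np.round(durations, 2))))
```

Because nearby initial conditions diverge quickly, two runs with slightly different x0 values produce entirely different, non-repeating note sequences, which is the property highlighted above.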
V. CONCLUSION
MDL has evolved into a very active field, with an increasing number of contributions each year, addressing its vast applications. This work provided a review of the recent developments in Music Deep Learning. The review was divided into two main categories, Music Information Retrieval and Music Generation. After reviewing each field, future research trends were identified.

The future of MDL lies in developing hybrid architectures to improve performance, while a plethora of commercial, conservational, medical, and experimental applications are being developed. Of these, applying DL for studying and preserving the cultural heritage of each country is of high importance. So is the exploitation of MDL for medical applications. The integration of MDL and chaos seems much more experimental, but its multidisciplinarity will surely lead to new developments in both


fields. For all of the aforementioned applications, bringing together research groups consisting of heterogeneous and complementing researchers, like computer scientists, physicists, mathematicians, musicians, audio engineers, and medical practitioners, is the key to success. The authors hope that the present work can be of service to these researchers, by providing a clear overview of recent and emerging developments in the field.

APPENDIX A
LIST OF ABBREVIATIONS
Table 5 lists the abbreviations used throughout the text.

TABLE 5. List of abbreviations.

REFERENCES
[19] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, ‘‘An overview of lead and accompaniment separation in music,’’ IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 8, pp. 1307–1335, Aug. 2018.
[20] R. Monir, D. Kostrzewa, and D. Mrozek, ‘‘Singing voice detection: A survey,’’ Entropy, vol. 24, no. 1, p. 114, Jan. 2022.
[21] K. Choi, G. Fazekas, K. Cho, and M. Sandler, ‘‘A tutorial on deep learning for music information retrieval,’’ 2017, arXiv:1709.04396.
[22] D. Han, Y. Kong, J. Han, and G. Wang, ‘‘A survey of music emotion recognition,’’ Frontiers Comput. Sci., vol. 16, no. 6, pp. 1–11, Dec. 2022.
[23] B. L. Sturm, J. Felipe Santos, O. Ben-Tal, and I. Korshunova, ‘‘Music transcription modelling and composition using deep learning,’’ 2016, arXiv:1604.08723.
[24] L. Casini, G. Marfia, and M. Roccetti, ‘‘Some reflections on the potential and limitations of deep learning for automated music generation,’’ in Proc. IEEE 29th Annu. Int. Symp. Pers., Indoor Mobile Radio Commun. (PIMRC), Sep. 2018, pp. 27–31.
[25] M. Kleć and A. Wieczorkowska, ‘‘Music recommendation systems: A survey,’’ in Recommender Systems for Medicine and Music. Cham, Switzerland: Springer, 2021, pp. 107–118.
[1] A. Shrestha and A. Mahmood, ‘‘Review of deep learning algorithms and architectures,’’ IEEE Access, vol. 7, pp. 53040–53065, 2019.
[2] J. Chai, H. Zeng, A. Li, and E. W. T. Ngai, ‘‘Deep learning in computer [26] N. Ndou, R. Ajoodha, and A. Jadhav, ‘‘Music genre classification: A
vision: A critical review of emerging techniques and application scenar- review of deep-learning and traditional machine-learning approaches,’’
ios,’’ Mach. Learn. Appl., vol. 6, Dec. 2021, Art. no. 100134. in Proc. IEEE Int. IoT, Electron. Mechatronics Conf. (IEMTRONICS),
[3] N. Fatima, A. S. Imran, Z. Kastrati, S. M. Daudpota, and A. Apr. 2021, pp. 1–6.
Soomro, ‘‘A systematic literature review on text generation using [27] C.-W. Wu, C. Dittmar, C. Southall, R. Vogl, G. Widmer, J. Hockman,
deep neural network models,’’ IEEE Access, vol. 10, pp. 53490–53503, M. Müller, and A. Lerch, ‘‘A review of automatic drum transcription,’’
2022. IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 9,
[4] M. R. Karim, O. Beyan, A. Zappa, I. G. Costa, D. Rebholz-Schuhmann, pp. 1457–1483, Sep. 2018.
M. Cochez, and S. Decker, ‘‘Deep learning-based clustering approaches [28] L. Wyse, ‘‘Audio spectrogram representations for processing with convo-
for bioinformatics,’’ Briefings Bioinf., vol. 22, no. 1, pp. 393–415, lutional neural networks,’’ 2017, arXiv:1706.09559.
Jan. 2021. [29] C. Gupta, H. Li, and M. Goto, ‘‘Deep learning approaches in topics
[5] M. M. Islam, F. Karray, R. Alhajj, and J. Zeng, ‘‘A review on deep learning of singing information processing,’’ IEEE/ACM Trans. Audio, Speech,
techniques for the diagnosis of novel coronavirus (COVID-19),’’ IEEE Language Process., vol. 30, pp. 2422–2451, 2022.
Access, vol. 9, pp. 30551–30572, 2021. [30] M. Civit, J. Civit-Masot, F. Cuadrado, and M. J. Escalona, ‘‘A sys-
[6] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, ‘‘Speech tematic review of artificial intelligence-based music generation: Scope,
recognition using deep neural networks: A systematic review,’’ IEEE applications, and future trends,’’ Exp. Syst. Appl., vol. 209, Dec. 2022,
Access, vol. 7, pp. 19143–19165, 2019. Art. no. 118190.
[7] S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and [31] S. Ji, J. Luo, and X. Yang, ‘‘A comprehensive survey on deep music gen-
D. Terzopoulos, ‘‘Image segmentation using deep learning: A survey,’’ eration: Multi-level representations, algorithms, evaluations, and future
IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523–3542, directions,’’ 2020, arXiv:2011.06801.
Jul. 2021. [32] J.-P. Briot and F. Pachet, ‘‘Deep learning for music generation: Challenges
[8] L. Ljung, C. Andersson, K. Tiels, and T. B. Schön, ‘‘Deep learn- and directions,’’ Neural Comput. Appl., vol. 32, no. 4, pp. 981–993,
ing and system identification,’’ IFAC-PapersOnLine, vol. 53, no. 2, Feb. 2020.
pp. 1175–1181, 2020. [33] L. A. Iliadis, S. P. Sotiroudis, K. Kokkinidis, P. Sarigiannidis,
[9] G. Gupta and R. Katarya, ‘‘Research on understanding the effect of S. Nikolaidis, and S. K. Goudos, ‘‘Music deep learning: A survey on deep
deep learning on user preferences,’’ Arabian J. Sci. Eng., vol. 46, no. 4, learning methods for music processing,’’ in Proc. 11th Int. Conf. Modern
pp. 3247–3286, Apr. 2021. Circuits Syst. Technol. (MOCAST), Jun. 2022, pp. 1–4.
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, [34] F. Fessahaye, L. Perez, T. Zhan, R. Zhang, C. Fossier, R. Markarian,
MA, USA: MIT Press, 2016. C. Chiu, J. Zhan, L. Gewali, and P. Oh, ‘‘T-RECSYS: A novel music
[11] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath, recommendation system using deep learning,’’ in Proc. IEEE Int. Conf.
‘‘Deep learning for audio signal processing,’’ IEEE J. Sel. Topics Signal Consum. Electron. (ICCE), Jan. 2019, pp. 1–6.
Process., vol. 13, no. 2, pp. 206–219, Apr. 2019. [35] (2018). Spotify RecSys Challenge. Accessed: Sep. 30, 2022. [Online].
[12] J. P. Puig, ‘‘Deep neural networks for music and audio tagging,’’ Available: http://www.recsyschallenge.com/2018/
Ph.D. thesis, Inf. Commun. Technol., Universitat Pompeu Fabra, [36] V. Revathy and A. S. Pillai, ‘‘Binary emotion classification of music using
Barcelona, Spain, 2019. deep neural networks,’’ in Proc. Int. Conf. Soft Comput. Pattern Recognit.
[13] M. Schedl, ‘‘Deep learning in music recommendation systems,’’ Frontiers Cham, Switzerland: Springer, 2021, pp. 484–492.
Appl. Math. Statist., vol. 5, p. 44, Aug. 2019. [37] I. A. P. Santana, F. Pinhelli, J. Donini, L. Catharin, R. B. Mangolin, V.
[14] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, ‘‘Deep learning techniques for Delisandra Feltrim, and M. A. Domingues, ‘‘Music4All: A new music
music generation—A survey,’’ 2017, arXiv:1709.01620. database and its applications,’’ in Proc. Int. Conf. Syst., Signals Image
[15] Y. M. G. Costa, L. S. Oliveira, and C. N. Silla Jr., ‘‘An evaluation of con- Process. (IWSSIP), Jul. 2020, pp. 399–404.
volutional neural networks for music classification using spectrograms,’’ [38] G. Song, Z. Wang, F. Han, S. Ding, and M. A. Iqbal, ‘‘Music auto-tagging
Appl. Soft Comput., vol. 52, pp. 28–38, Mar. 2017. using deep recurrent neural networks,’’ Neurocomputing, vol. 292,
[16] C. Senac, T. Pellegrini, F. Mouret, and J. Pinquier, ‘‘Music feature maps pp. 104–110, May 2018.
with convolutional neural networks for music genre classification,’’ in [39] E. Law and L. von Ahn, ‘‘Input-agreement: A new mechanism for col-
Proc. 15th Int. Workshop Content-Based Multimedia Indexing, Jun. 2017, lecting data using human computation games,’’ in Proc. SIGCHI Conf.
pp. 1–5. Hum. Factors Comput. Syst., Apr. 2009, pp. 1197–1206.
[17] M. Lu, D. Pengcheng, and S. Yanfeng, ‘‘Digital music recommendation [40] J. R. Castillo and M. J. Flores, ‘‘Web-based music genre classifica-
technology for music teaching based on deep learning,’’ Wireless Com- tion for timeline song visualization and analysis,’’ IEEE Access, vol. 9,
mun. Mobile Comput., vol. 2022, pp. 1–8, May 2022. pp. 18801–18816, 2021.
[18] D. Martín-Gutiérrez, G. H. Penaloza, A. Belmonte-Hernandez, and [41] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence,
F. A. Garcia, ‘‘A multimodal end-to-end deep learning architecture for R. C. Moore, M. Plakal, and M. Ritter, ‘‘Audio set: An ontology and
music popularity prediction,’’ IEEE Access, vol. 8, pp. 39361–39374, human-labeled dataset for audio events,’’ in Proc. IEEE Int. Conf. Acoust.,
2020. Speech Signal Process. (ICASSP), Mar. 2017, pp. 776–780.


[42] W. Zhao, Y. Zhou, Y. Tie, and Y. Zhao, ‘‘Recurrent neural network for [68] M. Ramona, G. Richard, and B. David, ‘‘Vocal detection in music with
MIDI music emotion classification,’’ in Proc. IEEE 3rd Adv. Inf. Technol., support vector machines,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal
Electron. Autom. Control Conf. (IAEAC), Oct. 2018, pp. 2596–2600. Process., Mar. 2008, pp. 1885–1888.
[43] S. Rajesh and N. J. Nalini, ‘‘Musical instrument emotion recognition [69] X. Zhang, Y. Yu, Y. Gao, X. Chen, and W. Li, ‘‘Research on singing
using deep recurrent neural network,’’ Proc. Comput. Sci., vol. 167, voice detection based on a long-term recurrent convolutional network
pp. 16–25, Jan. 2020. with vocal separation and temporal smoothing,’’ Electronics, vol. 9, no. 9,
[44] A. Vall, M. Quadrana, M. Schedl, and G. Widmer, ‘‘The importance of p. 1458, Sep. 2020.
song context and song order in automated music playlist generation,’’ [70] (2012). Rwc Pop Music Dataset. Accessed: Sep. 30, 2022. [Online].
2018, arXiv:1807.04690. Available: https://staff.aist.go.jp/m.goto/RWC-MDB/
[45] B. McFee and G. R. Lanckriet, ‘‘Hypergraph models of playlist dialects,’’ [71] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and
in Proc. ISMIR. Pennsylvania, PA, USA: Citeseer, vol. 12, 2012, J. P. Bello, ‘‘MedleyDB: A multitrack dataset for annotation-intensive
pp. 343–348. MIR research,’’ in Proc. ISMIR, vol. 14, 2014, pp. 155–160.
[46] 8TRACKS. Accessed: Sep. 30, 2022. [Online]. Available: https://8tracks. [72] iKala Dataset. Accessed: Sep. 30, 2022. [Online]. Available:
com/ https://paperswithcode.com/dataset/ikala
[47] S. Kang, J.-S. Park, and G.-J. Jang, ‘‘Improving singing voice separation [73] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, ‘‘A survey of convolutional
using curriculum learning on recurrent neural networks,’’ Appl. Sci., neural networks: Analysis, applications, and prospects,’’ IEEE Trans.
vol. 10, no. 7, p. 2465, Apr. 2020. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 6999–7019, Dec. 2021.
[48] Mir-1K Dataset. Accessed: Sep. 30, 2022. [Online]. Available: [74] T. Hirvonen, ‘‘Classification of spatial audio location and content using
https://sites.google.com/site/unvoicedsoundseparation/mir-1k convolutional neural networks,’’ in Proc. 138th Audio Eng. Soc. Conv.,
[49] A. Liutkus, D. Fitzgerald, and Z. Rafii, ‘‘Scalable audio separation with 2015, pp. 1–10.
light kernel additive modelling,’’ in Proc. IEEE Int. Conf. Acoust., Speech
[75] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen,
Signal Process. (ICASSP), Apr. 2015, pp. 76–80.
R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney,
[50] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, ‘‘The R. J. Weiss, and K. Wilson, ‘‘CNN architectures for large-scale audio
MUSDB18 corpus for music separation,’’ Dec. 2017, doi: 10.5281/zen- classification,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.
odo.1117372. (ICASSP), Mar. 2017, pp. 131–135.
[51] J. Li, ‘‘Automatic piano harmony arrangement system based on deep
[76] Y. Aytar, C. Vondrick, and A. Torralba, ‘‘SoundNet: Learning sound
learning,’’ J. Sensors, vol. 2022, pp. 1–13, Jul. 2022.
representations from unlabeled video,’’ in Proc. Adv. Neural Inf. Process.
[52] F. Zhang, ‘‘Research on music classification technology based on deep Syst., vol. 29, 2016, pp. 1–9.
learning,’’ Secur. Commun. Netw., vol. 2021, pp. 1–8, Dec. 2021.
[77] C. N. Silla Jr., A. L. Koerich, and C. A. Kaestner, ‘‘The Latin music
[53] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural database,’’ in Proc. ISMIR, 2008, pp. 451–456.
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[78] Royal Museum of Central-Africa (RMCA). Accessed: Sep. 30, 2022.
[54] J. Dai, S. Liang, W. Xue, C. Ni, and W. Liu, ‘‘Long short-term memory
[Online]. Available: https://www.africamuseum.be/en
recurrent neural network based segment features for music genre classifi-
cation,’’ in Proc. 10th Int. Symp. Chin. Spoken Lang. Process. (ISCSLP), [79] J. Lee, J. Park, K. Luke Kim, and J. Nam, ‘‘Sample-level deep convo-
Oct. 2016, pp. 1–5. lutional neural networks for music auto-tagging using raw waveforms,’’
2017, arXiv:1703.01789.
[55] (2004). Ismir Genre Dataset. Accessed: Sep. 30, 2022. [Online]. Avail-
able: https://ismir2004.ismir.net [80] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, ‘‘The mil-
lion song dataset,’’ in Proc. 12th Int. Soc. Music Inf. Retr. Conf., 2011,
[56] S. K. Prabhakar and S.-W. Lee, ‘‘Holistic approaches to music genre
pp. 591–596.
classification using efficient transfer and deep learning techniques,’’ Exp.
Syst. Appl., vol. 211, Jan. 2023, Art. no. 118636. [81] L. Qiu, S. Li, and Y. Sung, ‘‘3D-DCDAE: Unsupervised music latent rep-
[57] G. Tzanetakis and P. Cook, ‘‘Musical genre classification of audio sig- resentations learning method based on a deep 3D convolutional denoising
nals,’’ IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, autoencoder for music genre classification,’’ Mathematics, vol. 9, no. 18,
Jul. 2002. p. 2274, Sep. 2021.
[58] (2013). The MagnaTagATune Dataset. Accessed: Sep. 30, 2022. [Online]. [82] (2004). The Lakh Midi Dataset. Accessed: Sep. 30, 2022. [Online].
Available: https://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset Available: https://colinraffel.com/projects/lmd
[59] X. Li, H. Xianyu, J. Tian, W. Chen, F. Meng, M. Xu, and L. Cai, ‘‘A deep [83] C. Raffel, ‘‘Learning-based methods for comparing sequences, with
bidirectional long short-term memory based multi-scale approach for applications to audio-to-midi alignment and matching,’’ Ph.D. thesis,
music dynamic emotion prediction,’’ in Proc. IEEE Int. Conf. Acoust., Columbia Univ., New York, NY, USA, 2016, doi: 10.7916/D8N58MHV.
Speech Signal Process. (ICASSP), Mar. 2016, pp. 544–548. [84] B. Stasiak and J. Mońko, ‘‘Analysis of time-frequency representations for
[60] A. Aljanaki, Y.-H. Yang, and M. Soleymani, ‘‘Emotion in music task at musical onset detection with convolutional neural network,’’ in Proc. Ann.
mediaeval 2015,’’ in Proc. MediaEval Workshop, 2015, pp. 1–3. Comput. Sci. Inf. Syst., Oct. 2016, pp. 147–152.
[61] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and [85] H. Phan, L. Hertel, M. Maass, and A. Mertins, ‘‘Robust audio event
Y. Mitsufuji, ‘‘Improving music source separation based on deep neural recognition with 1-max pooling convolutional neural networks,’’ 2016,
networks through data augmentation and network blending,’’ in Proc. arXiv:1604.06338.
IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, [86] S. Nakamura, K. Hiyane, F. Asano, T. Yamada, and T. Endo, ‘‘Data col-
pp. 261–265. lection in real acoustical environments for sound scene understanding and
[62] SiSEC DSD100. Accessed: Sep. 30, 2022. [Online]. Available: hands-free speech recognition,’’ in Proc. 6th Eur. Conf. Speech Commun.
https://sisec.inria.fr/sisec-2016/2016-professionally-produced-music- Technol. (EUROSPEECH), 1999, pp. 1–4.
recordings/ [87] A. Varga and H. J. M. Steeneken, ‘‘Assessment for automatic speech
[63] W. Gong and Q. Yu, ‘‘A deep music recommendation method based on recognition: II. NOISEX-92: A database and an experiment to study the
human motion analysis,’’ IEEE Access, vol. 9, pp. 26290–26300, 2021. effect of additive noise on speech recognition systems,’’ Speech Com-
[64] T. Tang, J. Jia, and H. Mao, ‘‘Dance with melody: An LSTM-autoencoder mun., vol. 12, no. 3, pp. 247–251, Jul. 1993.
approach to music-oriented dance synthesis,’’ in Proc. 26th ACM Int. [88] K. W. E. Lin, B. T. Balamurali, E. Koh, S. Lui, and D. Herremans,
Conf. Multimedia, Oct. 2018, pp. 1598–1606. ‘‘Singing voice separation using a deep convolutional neural network
[65] R. Romero-Arenas, A. Gómez-Espinosa, and B. Valdés-Aguirre, trained by ideal binary mask and cross entropy,’’ Neural Comput. Appl.,
‘‘Singing voice detection in electronic music with a long-term recurrent vol. 32, no. 4, pp. 1037–1050, Feb. 2020.
convolutional network,’’ Appl. Sci., vol. 12, no. 15, p. 7405, Jul. 2022. [89] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono,
[66] TheFatRat. (2016). The Arcadium. Accessed: Sep. 30, 2022. [Online]. and J. Fontecave, ‘‘The 2016 signal separation evaluation campaign,’’ in
Available: https://www.youtube.com/c/TheArcadium Proc. Int. Conf. Latent Variable Anal. Signal Separat. Cham, Switzerland:
[67] B. Woodford. (2011). NCS (No Copytight Sounds)—Free Music for Con- Springer, 2017, pp. 323–332.
tent Creators. Accessed: Sep. 30, 2022. [Online]. Available: https://www. [90] Y. Liusong and D. Hui, ‘‘Voice quality evaluation of singing art based on
ncs.io/ 1DCNN model,’’ Math. Problems Eng., vol. 2022, pp. 1–9, Jul. 2022.


[91] P. Li, J. Qian, and T. Wang, ‘‘Automatic instrument recognition [116] N. Li, ‘‘Generative adversarial network for musical notation recognition
in polyphonic music using convolutional neural networks,’’ 2015, during music teaching,’’ Comput. Intell. Neurosci., vol. 2022, pp. 1–9,
arXiv:1511.05520. Jun. 2022.
[92] V. Lostanlen and C.-E. Cella, ‘‘Deep convolutional networks on the pitch [117] G. Hadjeres, F. Pachet, and F. Nielsen, ‘‘DeepBach: A steerable model
spiral for musical instrument recognition,’’ 2016, arXiv:1605.06644. for Bach chorales generation,’’ in Proc. Int. Conf. Mach. Learn., 2017,
[93] D. Mukhedkar, ‘‘Polyphonic music instrument detection on weakly pp. 1362–1371.
labelled data using sequence learning models,’’ School Elect. Eng. Com- [118] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang,
put. Sci., KTH Roy. Inst. Technol., Stockholm, Sweden, 2020. S. Dieleman, E. Elsen, J. Engel, and D. Eck, ‘‘Enabling factorized piano
[94] E. Humphrey, S. Durand, and B. McFee, ‘‘OpenMIC-2018: An open music modeling and generation with the MAESTRO dataset,’’ 2018,
data-set for multiple instrument recognition,’’ in Proc. ISMIR, 2018, arXiv:1810.12247.
pp. 438–444. [119] C.-Z. Anna Huang, C. Hawthorne, A. Roberts, M. Dinculescu, J. Wexler,
[95] A. Wise, A. S. Maida, and A. Kumar, ‘‘Attention augmented CNNs for L. Hong, and J. Howcroft, ‘‘The bach doodle: Approachable music com-
musical instrument identification,’’ in Proc. 29th Eur. Signal Process. position with machine learning at scale,’’ 2019, arXiv:1907.06637.
Conf. (EUSIPCO), Aug. 2021, pp. 376–380. [120] Z.-C. Fan, Y.-L. Lai, and J.-S.-R. Jang, ‘‘SVSGAN: Singing voice separa-
[96] London Philharmonic Orchestra Dataset. Accessed: Sep. 30, 2022. tion via generative adversarial network,’’ in Proc. IEEE Int. Conf. Acoust.,
[Online]. Available: https://philharmonia.co.uk/resources/sound- Speech Signal Process. (ICASSP), Apr. 2018, pp. 726–730.
samples/ [121] K. Choi, G. Fazekas, M. Sandler, and K. Cho, ‘‘Convolutional recur-
[97] University of Iowa Musical Instrument Samples. Accessed: Sep. 30, 2022. rent neural networks for music classification,’’ in Proc. IEEE Int. Conf.
[Online]. Available: https://theremin.music.uiowa.edu/MIS.html Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 2392–2396.
[98] M. Blaszke and B. Kostek, ‘‘Musical instrument identification using deep [122] A. A. S. Gunawan and D. Suhartono, ‘‘Music recommender system based
learning approach,’’ Sensors, vol. 22, no. 8, p. 3033, Apr. 2022. on genre using convolutional recurrent neural networks,’’ Proc. Comput.
[99] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, ‘‘Cutting music Sci., vol. 157, pp. 99–109, Jan. 2019.
source separation some Slakh: A dataset to study the impact of training [123] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, ‘‘FMA: A
data quality and quantity,’’ in Proc. IEEE Workshop Appl. Signal Process. dataset for music analysis,’’ 2016, arXiv:1612.01840.
to Audio Acoust. (WASPAA), Oct. 2019, pp. 1–7. [124] C. Chen and Q. Li, ‘‘A multimodal music emotion classification method
[100] X. Liu, Q. Chen, X. Wu, Y. Liu, and Y. Liu, ‘‘CNN based music emotion based on multifeature combined network classifier,’’ Math. Problems
classification,’’ 2017, arXiv:1704.05665. Eng., vol. 2020, pp. 1–11, Aug. 2020.
[101] S.-Y. Wang, J.-C. Wang, Y.-H. Yang, and H.-M. Wang, ‘‘Towards time- [125] S. Hizlisoy, S. Yildirim, and Z. Tufekci, ‘‘Music emotion recognition
varying music auto-tagging based on CAL500 expansion,’’ in Proc. IEEE using convolutional long short term memory deep neural networks,’’ Eng.
Int. Conf. Multimedia Expo. (ICME), Jul. 2014, pp. 1–6. Sci. Technol., Int. J., vol. 24, no. 3, pp. 760–767, Jun. 2021.
[102] T. Ciborowski, S. Reginis, D. Weber, A. Kurowski, and B. Kostek, ‘‘Clas- [126] X. Jia, ‘‘Music emotion classification method based on deep learning and
sifying emotions in film music—A deep learning approach,’’ Electronics, improved attention mechanism,’’ Comput. Intell. Neurosci., vol. 2022,
vol. 10, no. 23, p. 2955, Nov. 2021. pp. 1–8, Jun. 2022.
[103] Epidemic Sound. Accessed: Sep. 30, 2022. [Online]. Available: [127] M. Liang, ‘‘Music score recognition and composition application based
https://www.epidemicsound.com/ on deep learning,’’ Math. Problems Eng., vol. 2022, pp. 1–9, Jun. 2022.
[104] A. Vall, M. Dorfer, H. Eghbal-zadeh, M. Schedl, K. Burjorjee, and [128] (2012). Musescore. Accessed: Sep. 30, 2022. [Online]. Available:
G. Widmer, ‘‘Feature-combination hybrid recommender systems for https://musescore.org/en
automated music playlist continuation,’’ User Model. User-Adapted
[129] G. Parascandolo, H. Huttunen, and T. Virtanen, ‘‘Recurrent neural net-
Interact., vol. 29, no. 2, pp. 527–572, Apr. 2019.
works for polyphonic sound event detection in real life recordings,’’
[105] Art of the Mix. Accessed: Sep. 30, 2022. [Online]. Available:
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP),
http://www.artofthemix.org/
Mar. 2016, pp. 6440–6444.
[106] M. Sheikh Fathollahi and F. Razzazi, ‘‘Music similarity measurement
[130] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, ‘‘Audio context
and recommendation system using convolutional neural networks,’’ Int.
recognition using audio event histograms,’’ in Proc. 18th Eur. Signal
J. Multimedia Inf. Retr., vol. 10, no. 1, pp. 43–53, Mar. 2021.
Process. Conf., 2010, pp. 1272–1276.
[107] M. Zentner, D. Grandjean, and K. R. Scherer, ‘‘Emotions evoked by
[131] O. Bulayenko, J. Quintais, D. J. Gervais, and J. Poort. (2022). AI Music
the sound of music: Characterization, classification, and measurement,’’
Outputs: Challenges to the Copyright Legal Framework. [Online]. Avail-
Emotion, vol. 8, no. 4, pp. 494–521, 2008.
able: https://ssrn.com/abstract=4072806
[108] H. Homburg, I. Mierswa, B. Möller, K. Morik, and M. Wurst, ‘‘A bench-
[132] R. B. Abbott and E. Rothman, ‘‘Disrupting creativity: Copyright law in
mark dataset for audio classification and clustering,’’ in Proc. ISMIR,
the age of generative artificial intelligence,’’ Aug. 2022. [Online]. Avail-
2005, pp. 31–528.
able: https://ssrn.com/abstract=4185327, doi: 10.2139/ssrn.4185327.
[109] H. Gao, ‘‘Automatic recommendation of online music tracks based on
deep learning,’’ Math. Problems Eng., vol. 2022, pp. 1–8, Jun. 2022. [133] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,
A. Courville, and Y. Bengio, ‘‘SampleRNN: An unconditional end-to-end
[110] J. S. Downie, K. West, A. Ehmann, and E. Vincent, ‘‘The 2005 music
neural audio generation model,’’ 2016, arXiv:1612.07837.
information retrieval evaluation exchange (MIREX 2005): Preliminary
overview,’’ in Proc. 6th Int. Conf. Music Inf. Retr. (ISMIR), 2005, [134] The Internet Archive. Accessed: Sep. 30, 2022. [Online]. Available:
pp. 320–323. https://archive.org/
[111] K. W. Cheuk, H. Anderson, K. Agres, and D. Herremans, ‘‘NnAudio: An [135] M. S. Cuthbert and C. Ariza, ‘‘music21: A toolkit for computer-aided
on-the-fly GPU audio to spectrogram conversion toolbox using 1D con- musicology and symbolic music data,’’ in Proc. 11th Int. Soc. Music Inf.
volutional neural networks,’’ IEEE Access, vol. 8, pp. 161981–162003, Retr. Conf. (ISMIR), 2010, pp. 637–642.
2020. [136] G. Hadjeres and F. Nielsen, ‘‘Interactive music generation with positional
[112] J. Thickstun, Z. Harchaoui, and S. Kakade, ‘‘Learning features of music constraints using anticipation-RNNs,’’ 2017, arXiv:1709.06404.
from scratch,’’ 2016, arXiv:1611.09827. [137] C. Benetatos, J. VanderStel, and Z. Duan, ‘‘BachDuet: A deep learning
[113] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, system for human-machine counterpoint improvisation,’’ in Proc. Int.
and O. Nieto, ‘‘Librosa: Audio and music signal analysis in Python,’’ Conf. New Interfaces Musical Expression, 2020, pp. 1–6.
in Proc. 14th Python Sci. Conf. Pennsylvania, PA, USA: Citeseer, 2015, [138] P. Hutchings, ‘‘Talking drums: Generating drum grooves with neural
pp. 18–25. networks,’’ 2017, arXiv:1706.09558.
[114] G. Ian, J. Pouget-Abadie, M. Mirza, B. Xu, and D. Warde-Farley, ‘‘Gen- [139] B. Genchel, A. Pati, and A. Lerch, ‘‘Explicitly conditioned melody
erative adversarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014, generation: A case study with interdependent RNNs,’’ 2019,
pp. 1–9. arXiv:1907.05208.
[115] I.-S. Huang, Y.-H. Lu, M. Shafiq, A. Ali Laghari, and R. Yadav, ‘‘A gen- [140] Folkdb. Accessed: Sep. 30, 2022. [Online]. Available: https://github.com/
erative adversarial network model based on intelligent data analytics IraKorshunova/folk-rnn/tree/master/data
for music emotion recognition under IoT,’’ Mobile Inf. Syst., vol. 2021, [141] A. Pati, A. Lerch, and G. Hadjeres, ‘‘Learning to traverse latent spaces
pp. 1–8, Nov. 2021. for musical score inpainting,’’ 2019, arXiv:1907.01164.


[142] The Session. Accessed: Sep. 30, 2022. [Online]. Available: [170] Y. Su, R. Han, X. Wu, Y. Zhang, and Y. Li, ‘‘Folk melody generation based
https://thesession.org/ on CNN-BiGRU and self-attention,’’ in Proc. 4th Int. Conf. Commun., Inf.
[143] S. Agarwal, V. Saxena, V. Singal, and S. Aggarwal, ‘‘LSTM based music Syst. Comput. Eng. (CISCE), May 2022, pp. 363–368.
generation with dataset preprocessing and reconstruction techniques,’’ in [171] ESAC. Accessed: Sep. 30, 2022. [Online]. Available: http://www.esac-
Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Nov. 2018, pp. 455–462. data.org/
[144] H. Lim, S. Rhyu, and K. Lee, ‘‘Chord generation from symbolic melody [172] Y. Huang, X. Huang, and Q. Cai, ‘‘Music generation based on
using BLSTM networks,’’ 2017, arXiv:1712.01011. convolution-LSTM,’’ Comput. Inf. Sci., vol. 11, no. 3, pp. 50–56, 2018.
[145] Wikifonia Subset Dataset. Accessed: Sep. 30, 2022. [Online]. Available: [173] Midiworld. Accessed: Sep. 30, 2022. [Online]. Available: https://www.
http://marg.snu.ac.kr/chord_generation/ midiworld.com/
[146] H. H. Tan, ‘‘ChordAL: A chord-based approach for music generation [174] S. M. Tony and S. Sasikumar, ‘‘Generative adversarial network for music
using Bi-LSTMs,’’ in Proc. ICCC, 2019, pp. 364–365. generation,’’ in High Performance Computing and Networking. Cham,
[147] Nottingham Dataset. Accessed: Sep. 30, 2022. [Online]. Available: Switzerland: Springer, pp. 109–119, 2022.
https://ifdo.ca/~seymour/nottingham/nottingham.html [175] S. Li and Y. Sung, ‘‘INCO-GAN: Variable-length music generation
method based on inception model-based conditional GAN,’’ Mathemat-
[148] Mcgill-Billboard Chord Annotations. Accessed: Sep. 30, 2022. [Online].
ics, vol. 9, no. 4, p. 387, Feb. 2021.
Available: https://ddmal.music.mcgill.ca/research/SALAMI/
[176] G. Brunner, Y. Wang, R. Wattenhofer, and S. Zhao, ‘‘Symbolic music
[149] W. Yang, P. Sun, Y. Zhang, and Y. Zhang, ‘‘CLSTMS: A combination
genre transfer with CycleGAN,’’ in Proc. IEEE 30th Int. Conf. Tools Artif.
of two LSTM models to generate chords accompaniment for symbolic
Intell. (ICTAI), Nov. 2018, pp. 786–793.
melody,’’ in Proc. Int. Conf. High Perform. Big Data Intell. Syst. (HPB-
[177] J. Nistal, S. Lattner, and G. Richard, ‘‘DrumGAN: Synthesis of drum
DIS), May 2019, pp. 176–180.
sounds with timbral feature conditioning using generative adversarial
[150] H. H. Mao, T. Shin, and G. Cottrell, ‘‘DeepJ: Style-specific music genera- networks,’’ 2020, arXiv:2008.12073.
tion,’’ in Proc. IEEE 12th Int. Conf. Semantic Comput. (ICSC), Jan. 2018, [178] J. Engel, K. Krishna Agrawal, S. Chen, I. Gulrajani, C. Donahue, and
pp. 377–382. A. Roberts, ‘‘GANSynth: Adversarial neural audio synthesis,’’ 2019,
[151] Classical Piano-Midi Dataset. Accessed: Sep. 30, 2022. [Online]. Avail- arXiv:1902.08710.
able: http://piano-midi.de/ [179] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and
[152] C. D. Boom, S. V. Laere, T. Verbelen, and B. Dhoedt, ‘‘Rhythm, chord and K. Simonyan, ‘‘Neural audio synthesis of musical notes with wavenet
melody generation for lead sheets using recurrent neural networks,’’ in autoencoders,’’ in Proc. Int. Conf. Mach. Learn., 2017, pp. 1068–1077.
Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Cham, [180] F. Guan, C. Yu, and S. Yang, ‘‘A GAN model with self-attention mech-
Switzerland: Springer, 2019, pp. 454–461. anism to generate multi-instruments symbolic music,’’ in Proc. Int. Joint
[153] Q. Lyu, Z. Wu, and J. Zhu, ‘‘Polyphonic music modelling with LSTM- Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 1–6.
RTRBM,’’ in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015, [181] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, ‘‘MidiNet: A convolutional
pp. 991–994. generative adversarial network for symbolic-domain music generation,’’
[154] Musedata. Accessed: Sep. 30, 2022. [Online]. Available: 2017, arXiv:1703.10847.
https://musedata.org/ [182] Theorytab. Accessed: Sep. 30, 2022. [Online]. Available: https://www.
[155] Johann Sebastian Bach Chorales Dataset. Accessed: Sep. 30, 2022. hooktheory.com/theorytab
[Online]. Available: https://github.com/czhuang/JSB-Chorales-dataset [183] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, ‘‘MuseGAN:
[156] D. D. Johnson, ‘‘Generating polyphonic music using tied parallel net- Multi-track sequential generative adversarial networks for symbolic
works,’’ in Proc. Int. Conf. Evol. Biologically Inspired Music Art. Cham, music generation and accompaniment,’’ in Proc. AAAI Conf. Artif. Intell.,
Switzerland: Springer, pp. 128–143, 2017. vol. 32, 2018, pp. 1–8.
[157] M. Liang, ‘‘An improved music composing technique based on neural [184] L. Yu, W. Zhang, J. Wang, and Y. Yu, ‘‘SeqGan: Sequence generative
network model,’’ Mobile Inf. Syst., vol. 2022, pp. 1–10, Jul. 2022. adversarial nets with policy gradient,’’ in Proc. AAAI Conf. Artif. Intell.,
[158] K. Choi, J. Park, W. Heo, S. Jeon, and J. Park, ‘‘Chord conditioned vol. 31, 2017, pp. 1–7.
melody generation with transformer based decoders,’’ IEEE Access, [185] S.-G. Lee, U. Hwang, S. Min, and S. Yoon, ‘‘Polyphonic music
vol. 9, pp. 42071–42080, 2021. generation with sequence generative adversarial networks,’’ 2017,
[159] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia, arXiv:1710.11418.
‘‘Deep music analogy via latent representation disentanglement,’’ 2019, [186] A. Marafioti, P. Majdak, N. Holighaus, and N. Perraudin, ‘‘GACELA:
arXiv:1906.03626. A generative adversarial context encoder for long audio inpainting of
music,’’ IEEE J. Sel. Topics Signal Process., vol. 15, no. 1, pp. 120–131,
[160] P. S. Yadav, S. Khan, Y. V. Singh, P. Garg, and R. S. Singh, ‘‘A lightweight
Jan. 2020.
deep learning-based approach for jazz music generation in MIDI format,’’
[187] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Comput. Intell. Neurosci., vol. 2022, pp. 1–7, Aug. 2022.
Ł. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv.
[161] G. Keerti, A. Vaishnavi, P. Mukherjee, A. S. Vidya, G. S. Sreenithya,
Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
and D. Nayab, ‘‘Attentional networks for music generation,’’ Multimedia
[188] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon,
Tools Appl., vol. 81, no. 4, pp. 5179–5189, 2022.
C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck,
[162] Jazz ML Ready Midi. Accessed: Sep. 30, 2022. [Online]. Available: ‘‘Music transformer,’’ 2018, arXiv:1809.04281.
https://www.kaggle.com/datasets/saikayala/jazz-ml-ready-midi [189] Piano-E-Competition Dataset. Accessed: Sep. 30, 2022. [Online]. Avail-
[163] O. Yadav, D. Fernandes, V. Dube, and M. D’Souza, ‘‘Apollo: A classical able: https://www.piano-e-competition.com/
piano composer using long short-term memory,’’ IETE J. Educ., vol. 62, [190] N. Zhang, ‘‘Learning adversarial transformer for symbolic music gener-
no. 2, pp. 60–70, Jul. 2021. ation,’’ IEEE Trans. Neural Netw. Learn. Syst., early access, Jul. 2, 2020,
[164] Classical Music Midi—Kaggle. Accessed: Sep. 30, 2022. doi: 10.1109/TNNLS.2020.2990746.
[Online]. Available: https://www.kaggle.com/datasets/soumikrakshit/ [191] R. Child, S. Gray, A. Radford, and I. Sutskever, ‘‘Generating long
classical-music-midi sequences with sparse transformers,’’ 2019, arXiv:1904.10509.
[165] Midi Classic Music—Kaggle. Accessed: Sep. 30, 2022. [Online]. Avail- [192] S. Dieleman, A. V. D. Oord, and K. Simonyan, ‘‘The challenge of realistic
able: https://www.kaggle.com/datasets/blanderbuss/midi-classic-music music generation: Modelling raw audio at scale,’’ in Proc. Adv. Neural Inf.
[166] D. Makris, M. Kaliakatsos-Papakostas, I. Karydis, and K. L. Kermanidis, Process. Syst., vol. 31, 2018, pp. 1–11.
‘‘Conditional neural sequence learners for generating drums’ rhythms,’’ [193] Y.-S. Huang and Y.-H. Yang, ‘‘Pop music transformer: Beat-based model-
Neural Comput. Appl., vol. 31, no. 6, pp. 1793–1804, Jun. 2019. ing and generation of expressive pop piano compositions,’’ in Proc. 28th
[167] 911TABS. [Online]. Accessed: Sep. 30, 2022. Available: ACM Int. Conf. Multimedia, Oct. 2020, pp. 1180–1188.
https://www.911tabs.com/ [194] P. Dhariwal, H. Jun, C. Payne, J. Wook Kim, A. Radford, and I. Sutskever,
[168] Z. Cheddad and A. Cheddad, ‘‘ARMAS: Active reconstruction of missing ‘‘Jukebox: A generative model for music,’’ 2020, arXiv:2005.00341.
audio segments,’’ 2021, arXiv:2111.10891. [195] Y.-J. Shih, S.-L. Wu, F. Zalkow, M. Müller, and Y.-H. Yang, ‘‘Theme
[169] C. Jin, Y. Tie, Y. Bai, X. Lv, and S. Liu, ‘‘A style-specific music composi- transformer: Symbolic music generation with theme-conditioned trans-
tion neural network,’’ Neural Process. Lett., vol. 52, no. 3, pp. 1893–1912, former,’’ IEEE Trans. Multimedia, early access, Mar. 23, 2022, doi:
Dec. 2020. 10.1109/TMM.2022.3161851.


[196] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, X. Gu, and G. Xia, [221] X. Liang, Z. Li, J. Liu, W. Li, J. Zhu, and B. Han, ‘‘Constructing a
‘‘POP909: A pop-song dataset for music arrangement generation,’’ 2020, multimedia Chinese musical instrument database,’’ in Proc. 6th Conf.
arXiv:2008.07142. Sound Music Technol. (CSMT). Singapore: Springer, 2019, pp. 53–60.
[197] D. Makris, G. Zixun, M. Kaliakatsos-Papakostas, and D. Herremans, [222] A. K. Sharma, G. Aggarwal, S. Bhardwaj, P. Chakrabarti, T. Chakrabarti,
‘‘Conditional drums generation using compound word representations,’’ J. H. Abawajy, S. Bhattacharyya, R. Mishra, A. Das, and H. Mahdin,
in Proc. Int. Conf. Comput. Intell. Music, Sound, Art Design (EvoStar). ‘‘Classification of Indian classical music with time-series matching deep
Cham, Switzerland: Springer, 2022, pp. 179–194. learning approach,’’ IEEE Access, vol. 9, pp. 102041–102052, 2021.
[198] S. Rhyu, H. Choi, S. Kim, and K. Lee, ‘‘Translating melody to chord: [223] B. S. Gowrishankar and N. U. Bhajantri, ‘‘Deep learning long short-term
Structured and flexible harmonization of melody with transformer,’’ IEEE memory based automatic music transcription system for carnatic music,’’
Access, vol. 10, pp. 28261–28273, 2022. in Proc. IEEE Int. Conf. Distrib. Comput. Electr. Circuits Electron. (ICD-
[199] Chord Melody Dataset. Accessed: Sep. 30, 2022. [Online]. Available: CECE), Apr. 2022, pp. 1–6.
https://github.com/shiehn/chord-melody-dataset [224] D. Makris, I. Karydis, and S. Sioutas, ‘‘The Greek music dataset,’’ in Proc.
[200] Hooktheory Lead Sheet Dataset. Accessed: Sep. 30, 2022. [Online]. 16th Int. Conf. Eng. Appl. Neural Netw. (INNS), Sep. 2015, pp. 1–7.
Available: https://www.hooktheory.com/ [225] Thrace and Macedonia. Accessed: Sep. 30, 2022. [Online]. Available:
[201] M. Ashraf, G. Geng, X. Wang, F. Ahmad, and F. Abid, ‘‘A globally reg- http://epth.sfm.gr/
ularized joint neural architecture for music classification,’’ IEEE Access, [226] M. K. Karaosmanoğlu, ‘‘A Turkish makam music symbolic database for
vol. 8, pp. 220980–220989, 2020. music information retrieval: SymbTr,’’ in Proc. 13th Int. Soc. Music Inf.
[202] Y. V. Koteswararao and C. B. Rama Rao, ‘‘An efficient optimal recon- Retr. Conf. Porto, Portugal: International Society for Music Information
LAZAROS MOYSIS received the B.Sc., M.Sc., and Ph.D. degrees from the Department of Mathematics, Aristotle University of Thessaloniki, Greece, in 2011, 2013, and 2017, respectively. He is currently a Researcher with the Physics Department, Aristotle University of Thessaloniki, and the Laboratory of Nonlinear Systems, Circuits and Complexity. His research interests include the theory of control systems, descriptor systems, chaotic systems, and their applications (notable examples include observer design, synchronization, chaotification, chaos encryption, and chaotic path planning).

LAZAROS ALEXIOS ILIADIS (Graduate Student Member, IEEE) received the B.Sc. degree in physics and the M.Sc. degree in electronic physics (radioelectrology) from the Aristotle University of Thessaloniki, in 2017 and 2021, respectively, where he is currently pursuing the Ph.D. degree. His research interests include the development of sixth-generation (6G) communication systems, antenna design and electromagnetics, artificial intelligence techniques (evolutionary algorithms, machine learning, and deep learning methods), and computer vision.

SOTIRIOS P. SOTIROUDIS received the B.Sc. degree in physics and the M.Sc. degree in electronics from the Aristotle University of Thessaloniki, in 1999 and 2002, respectively, the B.Sc. degree in informatics from the Hellenic Open University, in 2011, and the Ph.D. degree in physics from the Aristotle University of Thessaloniki, in 2018. From 2004 to 2010, he worked with the Telecommunications Center, Aristotle University of Thessaloniki. From 2010 to 2022, he worked as a Teacher of physics and informatics with the Greek Ministry of Education. He joined the Department of Physics, Aristotle University of Thessaloniki, in 2022, where he has been involved in several research projects. His research interests include wireless communications, radio propagation, optimization algorithms, computer vision, and machine learning.

ACHILLES D. BOURSIANIS (Member, IEEE) received the B.Sc. degree in physics, the M.Sc. degree in electronic physics (radioelectrology) in the area of electronics and telecommunications technology, and the Ph.D. degree in telecommunications from the School of Physics, Aristotle University of Thessaloniki, in 2001, 2005, and 2017, respectively. Since 2019, he has been a Postdoctoral Researcher and an Academic Fellow with the School of Physics, Aristotle University of Thessaloniki. He is currently a member of the ELEDIA@AUTH Research Group. He is the author or coauthor of more than 70 articles in international peer-reviewed journals and conferences. His research interests include wireless sensor networks, the Internet of Things (IoT), antenna design and optimization, 5G and beyond communication networks, radio frequency energy harvesting, and artificial intelligence. Dr. Boursianis is a member of the Hellenic Physical Society and the Scientific Committee of the National Association of Federation des Ingenieurs des Telecommunications de la Communaute Europeenne (FITCE). He is a member of the Editorial Board of the Telecom journal. He serves as a reviewer for several international journals and conferences and as a member of the technical program committees for various international conferences that are technically sponsored by IEEE.

MARIA S. PAPADOPOULOU (Member, IEEE) received the B.Sc. degree in physics, the M.Sc. degree in electronics, and the Ph.D. degree in nonlinear circuits from the School of Physics, Aristotle University of Thessaloniki (AUTh). She is currently an Assistant Professor with the Department of Information and Electronic Engineering, International Hellenic University. She is also a member of the ELEDIA@AUTH Research Group, ELEDIA Research Center Network. She has authored or coauthored several articles in peer-reviewed journals and conferences. Her research interests include RF energy harvesting, wireless sensor networks, the Internet of Things, nonlinear dynamics, and electronic design and optimization. She is a member of the Hellenic Physical Society. She serves as a reviewer for several international journals and conferences and as a member of the technical program committees for various international conferences.

KONSTANTINOS-IRAKLIS D. KOKKINIDIS received the B.A. degree from the Hellenic Open University and the M.B.A. and Ph.D. degrees from the University of Macedonia, Thessaloniki, Greece. He is currently a member of the Special Teaching/Technical Personnel with the Department of Applied Informatics, University of Macedonia. He has published numerous articles in academic conferences and journals, such as the International Conference on Modern Circuits and Systems Technologies (MOCAST) on Electronics and Communications, the International Conference on Movement and Computing (MOCO), and the International Journal of Mechanical and Mechatronics Engineering (IJMME-IJENS). His research interests include human-centered computing, with special interests in human–computer interaction, machine learning, the Internet of Things (IoT), gesture and audio signal processing and identification, and sensorimotor learning, with a focus on sound and image processing.
CHRISTOS VOLOS received the Diploma degree in physics, the M.Sc. degree in electronics, and the Ph.D. degree in chaotic electronics from the Physics Department, Aristotle University of Thessaloniki, Greece, in 1999, 2002, and 2008, respectively. He is currently an Associate Professor with the Physics Department, Aristotle University of Thessaloniki. He is also a member of the Laboratory of Nonlinear Systems, Circuits and Complexity, Physics Department, Aristotle University of Thessaloniki. His current research interests include the design and study of analog and mixed-signal electronic circuits, chaotic electronics and their applications (secure communications, cryptography, and robotics), experimental chaotic synchronization, chaotic UWB communications, and measurement and instrumentation systems.

PANAGIOTIS SARIGIANNIDIS (Member, IEEE) received the B.Sc. and Ph.D. degrees in computer science from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2001 and 2007, respectively. He has been an Associate Professor with the Department of Electrical and Computer Engineering, University of Western Macedonia, Kozani, Greece, since 2016. He has been involved in several national, European, and international projects. He is currently the Project Coordinator of three H2020 projects, namely: a) H2020-DS-SC7-2017 (DS07-2017), ‘‘SPEAR: Secure and PrivatE smArt gRid’’; b) H2020-LC-SC3-EE2020-1 (LC-SC3-EC-4-2020), ‘‘EVIDENT: bEhaVioral Insights anD Effective eNergy policy acTions’’; and c) H2020-ICT-2020-1 (ICT-56-2020), ‘‘TERMINET: nexT gEneRation sMart INterconnectEd ioT,’’ while he coordinates the Operational Program ‘‘MARS: sMart fArming With dRoneS’’ (Competitiveness, Entrepreneurship, and Innovation). He serves as a Principal Investigator for the H2020-SU-DS-2018 (SU-DS04-2018-2020) project ‘‘SDN-microSENSE: SDN-microgrid reSilient Electrical eNergy SystEm’’ and the Erasmus+ KA2 project ‘‘ARRANGE-ICT: pArtneRship foR AddressiNG mEgatrends in ICT’’ (Cooperation for Innovation and the Exchange of Good Practices). He has published over 180 papers in international journals, conferences, and book chapters, including the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, the IEEE INTERNET OF THINGS JOURNAL, the IEEE TRANSACTIONS ON BROADCASTING, the IEEE SYSTEMS JOURNAL, the IEEE Wireless Communications Magazine, the IEEE/OSA JOURNAL OF LIGHTWAVE TECHNOLOGY, IEEE ACCESS, and Computer Networks. His research interests include telecommunication networks, the Internet of Things, and network security. He participates in the editorial boards of various journals, including the International Journal of Communication Systems and the EURASIP Journal on Wireless Communications and Networking.

SPIRIDON NIKOLAIDIS (Senior Member, IEEE) received the Diploma and Ph.D. degrees in electrical engineering from Patras University, Greece, in 1988 and 1994, respectively. Since September 1996, he has been with the Department of Physics, Aristotle University of Thessaloniki, Greece, where he is currently a Full Professor. From 2003 to 2017, he was also a member of the contract teaching staff of the Hellenic Open University. He has worked in the areas of digital circuit and system design. He is the author or coauthor of more than 200 scientific articles in international journals and conference proceedings, and his work has received more than 2300 citations (Google Scholar, H-index = 23). Two articles presented at international conferences received honorary awards. His current research interests include the design of high-speed and low-power digital circuits and embedded systems, modeling the operation of basic CMOS structures, modeling the power consumption of embedded processors, and the development of algorithms for leak detection and localization in pipelines. He was a member of the organization committees of three international conferences. He is the founder and organizer of the Annual International Conference on Modern Circuits and Systems Technologies (MOCAST), since 2012. He also organized the 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), in 2017. He contributes or has contributed to a number of research projects funded by the European Union and the Greek Government, for many of which he has scientific responsibility.

SOTIRIOS K. GOUDOS (Senior Member, IEEE) received the B.Sc. degree in physics, the M.Sc. degree in electronics, the Ph.D. degree in physics, and the Diploma degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1991, 1994, 2001, and 2011, respectively, and the M.Sc. degree in information systems from the University of Macedonia, Greece, in 2005. He is currently an Associate Professor with the Department of Physics, Aristotle University of Thessaloniki. He is also the Director of ELEDIA@AUTH and a Laboratory Member of the ELEDIA Research Center Network. He has participated in more than 16 national and European-funded projects and has been a Principal Investigator of five nationally funded research projects. He is the author of the book Emerging Evolutionary Algorithms for Antennas and Wireless Communications (The Institution of Engineering and Technology, 2021). His research interests include antenna and microwave structure design, evolutionary algorithms, wireless communications, machine learning, and semantic web technologies. Prof. Goudos is a member of the IEICE, the Greek Physics Society, the Technical Chamber of Greece, and the Greek Computer Society. He is also a member of the editorial boards of the International Journal of Antennas and Propagation (IJAP), the EURASIP Journal on Wireless Communications and Networking, and the International Journal on Advances in Intelligent Systems. He is also a member of the Topic Board of the open access journal Electronics. He has served as a member of the technical program committees for several IEEE and non-IEEE conferences. He is the founding Editor-in-Chief of the open access journal Telecom (MDPI). He is serving as an Associate Editor for the IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, IEEE ACCESS, and the IEEE OPEN JOURNAL OF THE COMMUNICATIONS SOCIETY. He was honored as an IEEE ACCESS Outstanding Associate Editor, in 2019, 2020, and 2021. He has participated as a guest editor or lead guest editor in more than 20 special issues of international journals. He has co-organized four special sessions at international conferences. He is also serving as the Chapter/AG Coordinator for the IEEE Greece Section. He was elected as the IEEE Greece Section Secretary, in 2022.