IMM 6321
Georgios Papanikas
s090845
Dedicated to Evi and Thanasis
Preface
This thesis was prepared at Informatics and Mathematical Modelling, the Technical University of Denmark, in partial fulfillment of the requirements for acquiring the degree of Master of Science in Engineering.
The thesis deals with the automatic transcription of drum music. A literature survey, mainly focused on algorithms of limited computational complexity, has been conducted. Based on the results of simulation trials, an FPGA system is designed and implemented which recognizes in real time the instruments of a limited drum kit through the use of a single microphone.
Lyngby, April 2012
Georgios Papanikas
Acknowledgements
I would like to thank my supervisor, Alberto Nannarelli, for his support and attitude throughout the project and, most of all, for the freedom he gave me in the selection of the topic and the implementation of the system.
Special thanks should be addressed to Mikkel Nørgaard Schmidt for the discussions on the available approaches to NNMF-based methodologies; also to Per Friis, for his quick responses to the licensing problems of some tools, and to Sahar Abbaspour, for lending me the development board whenever I needed it.
The contribution of Nikos Parastatidis and Giannis Chionidis to the recordings and various technical issues was invaluable, as was Mara Vasileiou's feedback on the text.
Contents

Preface ............................................................. iii
Acknowledgements .................................................... v
1. Introduction ..................................................... 1
   1.1. Automatic music transcription ............................... 1
   1.2. Drums' transcription and sound characteristics .............. 2
   1.3. Thesis' objectives and structure ............................ 5
2. Background and related work ...................................... 7
   2.1. Approaches to drums transcription ........................... 7
        2.1.1. Pattern recognition approaches ....................... 7
        2.1.2. Separation-based approaches .......................... 9
               2.1.2.1. Sources and components ...................... 9
               2.1.2.2. Input data representation ................... 11
               2.1.2.3. The approximated factorisation .............. 12
               2.1.2.4. Klapuri's audio onset detection algorithm ... 14
   2.2. Non-Negative Matrix Factorisation based approaches .......... 14
        2.2.1. Update rules and cost functions ...................... 15
        2.2.2. Using prior knowledge about the sources .............. 16
        2.2.3. Relevant work on NNMF-based transcription systems .... 18
   2.3. The Fourier transform ....................................... 20
        2.3.1. Window functions ..................................... 21
        2.3.2. The Short-time Fourier Transform (STFT) .............. 24
3. Implemented transcription algorithm and simulation ............... 27
   3.1. Introduction ................................................ 27
   3.2. Recordings .................................................. 27
        3.2.1. Recording equipment .................................. 28
        3.2.2. Tempo, note's value and time signature in the common
               musical notation scheme .............................. 29
        3.2.3. Test and training samples ............................ 31
   3.3. Algorithm's pseudocode ...................................... 32
   3.4. Simulation results .......................................... 34
        3.4.1. Determining frame's overlapping level and length ..... 34
        3.4.2. Determining the window function ...................... 37
        3.4.3. Determining the frequency bands ...................... 37
        3.4.4. Determining the convergence threshold ................ 39
        3.4.5.
        3.4.6.
.................................................................... 65
Appendix A .......................................................... 67
Appendix B .......................................................... 69
Appendix C .......................................................... 71
Appendix D .......................................................... 75
I
Introduction
1.1 Automatic music transcription

Automatic music transcription is defined as the "analysis of an acoustic music signal so as to write down the pitch, onset time, duration and source of each sound that occurs in it" [1]. Usually note symbols are used to indicate these parameters but, depending on the type of music and the instruments taking part in it, written music may take various forms. Applications of automatic music transcription are numerous; a complete transcription, though, is often beyond reach for polyphonic music. In many cases the goal is redefined as being able to transcribe only some well-defined part of the music signal, such as the dominant melody or the most prominent
drum sounds (partial transcription). What is achievable could be intuitively estimated by
considering what an average listener perceives while listening to music. Although
recognizing musical instruments, tapping along to the rhythm, or humming the main melody
are relatively easy tasks, this is not the case with more complex aspects like recognizing
different pitch intervals and timing relationships.
The motivation for real-time automatic transcription arose from the flourishing of such systems in industry in recent years. More and more interactive applications (such as electronic tutor systems or games) which require real-time processing of, and feedback on, musically meaningful sound events are being developed¹. In most cases they are based on digital interfaces, such as MIDI, in order to recognise these events, but this limits the instruments that can be used to some of the electronic ones. In other cases the information about the sound events is extracted indirectly. For example, through the use of piezoelectric sensors attached to the skins of the drums, one could learn when a specific instrument was hit, without dealing with the sound itself. In the general case, though, a drummer (still) uses an acoustic drum kit, and the most straightforward recognition approach is based on the input of a single microphone, rather than on setups of several microphones or on sensors which indirectly provide some aspects of the transcription.
The parameters to be transcribed for each sound event are:
• the pitch,
• the onset time,
• the duration,
• the source instrument.
However, in the drums-only case duration and pitch are not critical, or even meaningful, for the typical percussion instruments of a drum kit.
¹ A few examples are Guitar Hero, Rocksmith and JamGuru.
Figure 1.1: 1. Bass drum, 2. floor tom, 3. snare drum, 4. hanging toms, 5. hi-hat cymbal, 6. crash cymbal, 7. ride cymbal, 8. splash cymbal, 9. china type cymbal
² There are pitched percussion instruments, not found in drum kits, though. Some characteristic ones are the balafon, the marimba, the xylophone and the tabla.
contain such complex harmonic overtones and such a wide range of prominent frequencies that no pitch is discernible. The sounds of the different types of strokes on a specific instrument (meaning different intensities of the hit, hitting the drum/cymbal with a different type of stick, or hitting it on a different spot) contain frequencies in, more or less, the same wide ranges. Membranophones tend to have most of their spectral energy in the lower range of the frequency spectrum, typically below 1000Hz, while idiophones' spectral energy is more evenly spread out, resulting in more high-frequency content [2]. This is clearly shown in figure 1.2. Below the lower membrane of the snare drum (the one not being hit by the stick) a set of metal wires (the snares) is attached, whose vibrations cause the additional high-frequency energy in the snare drum's sounds.
Figure 1.2: Bass and ride stroke signals in the time domain (top) and in the frequency domain (bottom); both time axes span 200ms
Figure 1.3⁴: Attack, decay, sustain and release time of a sound event
⁴ The figure is taken from http://en.wikipedia.org/wiki/Synthesizer#ADSR_envelope
The onset time of a stroke refers to the beginning of the attack phase or to a moment during it. Audio onset detection is an active research area⁵. There is a distinction between the physical and the perceptual onset time. The physical onset refers to the starting point of the attack phase, while the perceptual onset is an attempt to define it in line with human perception, as the subjective time at which a listener first notices a sound event. In [4] the perceptual onset time is considered to occur when the signal reaches a level approximately 6-15 dB below its maximum value. For some instruments, for instance the flute, the attack time is very long and the distinction between the physical and the perceptual onset time makes sense. However, in the case of percussion instruments the time between zero and maximum energy of a stroke is almost instantaneous [2]. In [5] it is stated that the difference between the physical and the perceptual onset times in percussion sounds is on the order of a few milliseconds, while it can be as much as 50-100ms for a note bowed slowly on a violin.
⁵ See http://www.music-ir.org/mirex/wiki/MIREX_HOME
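As a rough illustration of this rule, the perceptual onset can be estimated from a recorded stroke's envelope; in the sketch below (Python/NumPy) the 9dB drop and the 64-sample envelope frame are arbitrary illustrative choices of ours, not values prescribed by [4].

import numpy as np

def perceptual_onset_time(x, fs, drop_db=9.0, frame=64):
    # Short-time RMS envelope of the signal, in dB.
    n = len(x) // frame
    env = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1))
    env_db = 20 * np.log10(env + 1e-12)
    # First frame whose level is within drop_db dB of the maximum:
    # a crude stand-in for the 6-15 dB-below-maximum criterion of [4].
    onset_frame = int(np.argmax(env_db >= env_db.max() - drop_db))
    return onset_frame * frame / fs  # onset time in seconds

For a percussive stroke this lands within a few milliseconds of the physical onset, while for a slowly bowed violin note it can lag far behind it, in line with [5].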
II
Background and related work
2.1 Approaches to drums transcription
According to [2]: "approaches to percussion transcription can be roughly divided
into two categories: pattern recognition applied to sound events and separation-based
systems". Pattern recognition approaches try to firstly locate a stroke and then classify it,
while separation-based techniques combine detection and classification into a single
technique.
Depending on the application the transcription results are used for, some systems utilize preprocessing of training data in order to extract various temporal or spectral features of the specific drum kit to be used. This prior knowledge makes the transcription easier. It is more practical in interactive systems, such as electronic tutors or games, than in the transcription of arbitrary audio recordings, where different drum kits were used.
Segmentation is done either by locating the potential strokes and segmenting the signal using this information, or by a simpler segmentation based on a regular temporal grid. Both ways come with their own problems. Locating the potential strokes must be reliable enough so as not to misinterpret noise as a stroke. Moreover, it must be possible to detect a potential stroke even when it is masked by another stroke which occurred simultaneously or a few milliseconds before/after. On the other hand, using a temporal grid requires the calculation of an appropriate fixed grid spacing, which has to be related to the fastest "rhythmic pulse" present in the signal. Therefore it is applicable only in cases where the tempo of a music track is known, or can easily be found.
After the signal's segmentation, feature extraction follows for each of the segments. Based on the feature values, the classification algorithm classifies the segments into the proper category; that is, it recognizes the source instrument if a stroke was indeed detected. Mel-frequency cepstral coefficients (MFCCs) are widely used both in speech and in music recognition. The computation of the MFCCs is based on frequency bands which are equally spaced on the mel scale, rather than linearly spaced; the mel scale is a scale of pitches judged by listeners to be equally distant from one another [11]. Other common features are the bandwise energy descriptors, where the frequency range is divided into a few frequency bands and the energy content of each is computed and used as a feature, as well as the spectral centroid, spread, skewness and kurtosis [2]. Less often, time-domain features are used, such as the temporal centroid and the zero-crossing rate.
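As an illustration of a few of these features, the following NumPy sketch computes band-wise energies (over the 5-band partitioning of [12] used later in 3.4.3), the spectral centroid and the zero-crossing rate for one analysis frame; the function name and defaults are ours.

import numpy as np

def frame_features(frame, fs, band_edges=(0, 180, 400, 1000, 10000, 22050)):
    spectrum = np.abs(np.fft.rfft(frame))            # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)  # bin centre frequencies
    # Band-wise energy descriptors over a fixed band partitioning.
    energies = [float(np.sum(spectrum[(freqs >= lo) & (freqs < hi)] ** 2))
                for lo, hi in zip(band_edges[:-1], band_edges[1:])]
    # Spectral centroid: magnitude-weighted mean frequency.
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    # Zero-crossing rate: a simple time-domain feature.
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return energies, centroid, zcr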
The feature set may comprise many features and is generally selected through a trial-and-error procedure. The properties of the input music signal help determine which features perform better. The feature set could also be selected automatically, by testing different combinations; this procedure, though, requires an appropriate function which evaluates their quality, which is not that simple in practice. In [2] it is remarked: "it was noticed that in most cases, using a feature set that has been chosen via some feature selection method yielded better results than using all the available features."
After the segment's features are extracted, they feed the classification algorithm. The detection of a stroke in the segment may be necessary before classifying it into the recognised (combination of) instrument(s). Classification approaches are divided into two categories: those which try to detect the presence of each given drum/cymbal separately, so that simultaneous multiple strokes are recognized as the sum of individually recognised strokes on one instrument each, and those which try to recognize directly each different combination of simultaneous strokes. A few examples are decision trees (sequences of questions whose answers determine the excluded possible results and the next questions), or methods which compare the analysed data to stored data that are the outcome of a training
procedure (for instance k-nearest neighbours or support vector machines, SVMs). None of these methods seems to perform clearly better than the others, so some advanced techniques and higher-level processing have been incorporated to increase the performance, such as language modelling with explicit N-grams, hidden Markov models, or choosing the best feature subset dynamically [12].
The use of multiple microphones itself separates the music signal with regard to the different instruments-sources.
Figure 2.2: A single channel's signal is usually the input of separation-based approaches, while the output components do not necessarily correspond to one and only one of the sources, unless we force it
⁶ An exception is independent component analysis (ICA), which requires at least as many channels as there are sources [13]
Source separation may not refer only to cases where each instrument is a single source. Depending on the application, a transcription could require, for instance, twenty violins playing simultaneously in an orchestra to be assigned to a single source. Another one may require each individual violin to be considered a single source, while for the needs of a third one each set of equally-pitched notes could be assigned its own source, regardless of the specific violin the sound originated from.
The algorithms referred to above could be designed to output musically meaningful streams, but in order to achieve that it is necessary to utilise prior knowledge regarding the instruments' sound properties. Otherwise the separation is "blind", in the sense that it is unknown in advance how the separated components relate to the musically meaningful sources. Of course, there may be applications whose properties, or a careful "tuning" of their parameters, combined with the properties of the signal to be transcribed, lead to the "blind" separation being musically meaningful. But this is usually the case only when the total number of components is small and the various instruments' sound spectra do not extensively overlap. If NNMF, ISA, or NNSC were applied without forcing any predetermined separation scheme, the resulting components would be determined by the specific algorithm's properties, the properties of the input signal and the total number of components. The latter is usually our only influence on the resulting separation.
Hypothetically, suppose the transcription of music played by three instruments is the objective: one bass drum, one guitar and one piano played along. Let the number of components, C, of each source be equal to one, resulting in a total number of components equal to three. Then, after applying a separation-based method, the separation result could be:
• one component which "captured" the bass drum and a few of the piano's low-frequency notes,
• another component which "captured" most of the guitar's notes,
• and a third component "capturing" all the remaining sound events.
Depending on the value of C and the signal's and algorithm's properties any separation result is possible, even the preferred one. If the number of components of each source is increased, then the possibility of a musically meaningful "blind" separation is dramatically decreased.
Therefore further processing is necessary in order to find out which source the components refer to. An unsupervised classification algorithm could be used to interpret them. Alternatively, prior knowledge about the sources could be utilised in order to classify the components. However, forcing a predetermined separation scheme is easily achieved by training the algorithm on our specific instruments' sounds, a procedure described in 2.2.2 for the NNMF case.
E{|X(k)|^2} = |Y_1(k)|^2 + |Y_2(k)|^2
where E{·} denotes the expectation. This means that, in the expectation sense, we can approximate time-domain summation by summation in the power spectral domain, a result which holds for more than two sources as well. Even though the magnitude spectrogram representation has been widely used and often produces good results, it does not have a similar theoretical justification [13].
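This can be checked numerically with two uncorrelated white-noise "sources" (a sketch of ours, not an experiment from [13]): the average error of the power-spectral superposition tends to zero, while the magnitude-spectral one stays clearly biased.

import numpy as np

rng = np.random.default_rng(0)
N, trials = 1024, 2000
pow_err = mag_err = 0.0
for _ in range(trials):
    y1, y2 = rng.normal(size=N), rng.normal(size=N)  # uncorrelated sources
    X = np.abs(np.fft.rfft(y1 + y2))
    Y1, Y2 = np.abs(np.fft.rfft(y1)), np.abs(np.fft.rfft(y2))
    pow_err += np.mean(X**2 - (Y1**2 + Y2**2))  # averages towards zero
    mag_err += np.mean(X - (Y1 + Y2))           # stays clearly negative
print(pow_err / trials, mag_err / trials)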
X ≈ Σ_{j=1}^{k} Y_j = Σ_{j=1}^{k} b_j g_j^T
If each basis vector, b_j, captures one source's features, then the corresponding time-varying gain vector, g_j, can be interpreted as the level of contribution of this source to each frame's content. Figure 2.4 illustrates the vectors b_j and g_j for the same rhythm's input signal, which comprises only the three instruments of interest (snare, bass and hi-hat). The frequency range is [0, 22.05kHz] and is partitioned into 25 bands (the first 25 critical bands, see 3.4.3). Note that the frequency bands, fb, are not linearly spaced (although they are depicted as if they were). Instead, the first ones are narrow (fb_1=[0,100Hz), fb_2=[100Hz,200Hz), ..., fb_6=[630Hz,770Hz), ..., fb_10=[1.08kHz,1.27kHz)), while the last ones are very wide (fb_24=[12kHz,15.5kHz) and fb_25=[15.5kHz,22.05kHz)). It is clearly shown in the figure that the hi-hat's basis vector has large values only in the high frequency bands (fb_21-fb_25), the bass' basis vector only in the low frequency bands (fb_1-fb_2), while the snare's has them both in the low (fb_3-fb_8) and the high frequency bands (fb_22-fb_23).
Ideally, every local maximum of g_j would mean that a stroke on the specific source occurred at the maximum's frame. But this is not the case, since strokes on specific sources (or, more often, combinations of simultaneous strokes on two or more sources) may result in local maxima in another source's gain vector, without any stroke occurring on the latter source. Depending on the algorithm, these could be misinterpreted as recognized strokes (false onset detection). In the example of figure 2.4 this is the case for false snare onsets at every stroke on the bass drum and, vice versa, for false bass onsets at every stroke on the snare drum. However, the amplitude of the false onsets' local maxima is considerably lower, allowing the use of a threshold in order to distinguish the false from the correct onsets. This is why further processing, usually inspired by Klapuri's onset detection algorithm, is typically needed in order to detect the real strokes from the time-varying gain vectors.
X ≈ Y = Y_1 + Y_2 + ... + Y_k

Figure 2.3: The input signal's spectrogram (m frequency bands × n frames) is assumed to be equal to the superposition of k spectrograms
Figure 2.4: The basis vector, b, and the time-varying gain vector, g, for each component
Figure 2.5 (taken from [3]): Klapuri's system overview (top) and processing at each frequency band (bottom)
restrictions alone are sufficient for the separation of the sources, without the explicit
assumption of statistical independence".
NNMF has been used for feature extraction and identification in text and spectral data mining, as well as in data compression applications. Its principles were originally developed by Pentti Paatero in the '90s (positive matrix factorization), but it became widely known after Daniel Lee and Sebastian Seung's work in the '00s. In [14] two computational methods for the factorisation are proposed by the latter. The NNMF problem can be stated as follows:
[NNMF problem] Given a non-negative matrix X ∈ R^{m×n} and a positive integer k < min{m, n}, find non-negative matrices B ∈ R^{m×k} and G ∈ R^{k×n} so that X ≈ BG.
B ← B .* [(X ./ BG) G^T ./ 1G^T]
G ← G .* [B^T (X ./ BG) ./ B^T 1]

D(X||BG) = Σ_{μ=1}^{m} Σ_{ν=1}^{n} ( X_{μν} log10( X_{μν} / (BG)_{μν} ) - X_{μν} + (BG)_{μν} )
The values of B and G are initialized with random positive values. After each iteration the new values of B and G are found by multiplying the current values by some factor that depends on the quality of the approximation X ≈ BG. The approximation's quality is quantified by a cost function, which is calculated after each iteration and determines when convergence is considered to have been achieved. In our case, D(X||BG) is lower-bounded by zero (which it reaches if and only if X = BG) and reduces to the Kullback-Leibler divergence when Σ_{ij} X_{ij} = Σ_{ij} (BG)_{ij} = 1 [14]. In [14] it is also proved that "the quality of the approximation improves monotonically with the application of these update rules. In practice, this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorisation".
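A minimal NumPy implementation of these update rules, with the divergence above as the cost function, could look as follows; the function name, the random seed and the stopping rule on the change of D are our choices.

import numpy as np

def nnmf(X, k, threshold=1e-4, max_iter=10000, rng=None):
    rng = rng or np.random.default_rng(0)
    m, n = X.shape
    B = rng.random((m, k)) + 1e-3   # random positive initialisation
    G = rng.random((k, n)) + 1e-3
    ones, eps, prev = np.ones((m, n)), 1e-12, np.inf
    for _ in range(max_iter):
        B *= ((X / (B @ G + eps)) @ G.T) / (ones @ G.T + eps)
        G *= (B.T @ (X / (B @ G + eps))) / (B.T @ ones + eps)
        BG = B @ G + eps
        # Divergence D(X||BG); log10 follows the text's convention and
        # only rescales the Kullback-Leibler term.
        cost = float(np.sum(X * np.log10((X + eps) / BG) - X + BG))
        if abs(prev - cost) < threshold:
            break
        prev = cost
    return B, G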
course, strokes on only a single instrument. The time-varying gain G_i indicates the contribution of the i-th source, since the polyphonic signal is approximated by the sum of the basis vectors B_i of all the sources, weighted by the values of G.
Figure 2.6 depicts this procedure in the general case, where the number of sources is equal to s, the number of frequency bands is m, the total number of frames/time windows processed is n and the number of components each source is represented by is c. The dimensions of B_i are m×c, those of X are m×n and those of G_i are c×n.
It is worth noting that a different c_i for each source could be chosen; the dimensions of the stacked G are (Σ_{i=1}^{s} c_i)×n in this case.
Figure 2.6: The use of NNMF with prior knowledge about the sources
After the training NNMFs, which produce the s fixed basis matrices, another NNMF is applied, using this fixed matrix to approximate our polyphonic signal's magnitude spectrogram X. What follows is determined by the value of c and the signal's properties. If c is equal to one for every source, the gain matrix G itself can be considered the transcription result we are looking for. In this case, the sources' onset times are taken from the frames' positions; that is, for the i-th frame, equal to i·Δt, where Δt denotes the temporal resolution (10ms in our case). The only requirement for a processed frame to be considered to contain a recognized stroke is that the corresponding source's value of G is greater than a threshold value (see figure 2.6). There are many ways for these thresholds to be found. The simplest one is to determine them during the training stage, by storing the maximum value, max(G_i), of every row of G. Then, a safe threshold can be found by testing max(G_i)/n for different values of n>1. In the next subsection, 2.2.3, an example of another method is presented, which also takes into account only the training samples, but in a more supervised way. Alternatively, another algorithm, which adapts to better values by also taking into account the polyphonic signal's values of G, could be used.
If taking the sources' onset times directly from G is not possible, then what follows in order to find them is:
• the onset detection: the need for it was discussed in 2.1.2.3, and in 2.1.2.4 an algorithm common in practice for polyphonic audio signals was presented. Klapuri's algorithm is usually a common starting point for onset detection, even in implementations where the input is not an audio signal itself, like the input G in our case.
• if C>1 and the sources' onset times can be reliably taken neither from G itself, nor from an onset detection algorithm, this means that there is contradiction among some of the components referring to the same source. Therefore, some components need to be rejected, and/or the information of multiple components needs to be combined in order to find a single source's onset times (see 3.4.6 and appendix D for an example).
The 2-stage procedure described above is not the only way prior knowledge about the sources can be used in order to make the transcription easier. An even more supervised one is to manually define the basis matrix. It is applicable in cases where it is known in advance that, regardless of the specific instrument's particularities, its sounds have well-known frequency content. Consider, for instance, a violin music transcription case where each component of NNMF is assigned to all equally-pitched notes. Then one could approach the problem of predetermining the separation scheme by forcing the use of fixed basis vectors, manually constructed based on the information about which fundamental frequencies and harmonic overtones each note is supposed to produce. The manually defined basis matrix could either remain constant, or alternatively adapt to the observed data of the polyphonic signal to be transcribed.
What was presented above concerns systems which focus only on low-level recognition. However, knowledge at a higher abstraction level could also be utilized; either prior knowledge, or knowledge gained after processing the polyphonic signal. Consider the automatic speech recognition problem. The recognition approaches described so far resemble the case where each phoneme is recognized independently, regardless of its context. But this is not the case in practice, since by utilizing linguistic knowledge the performance of such systems improves considerably [2]. By incorporating similar approaches, the performance of automatic transcription systems could also be improved. Such approaches could refer to simplistic scenarios where the music player predefines some characteristics of the music he/she is going to play, like the tempo. In more advanced approaches the statistical dependencies of sound event sequences of the transcribed music could be modelled and used to correct the transcription results.
The training samples used were not produced by only a single instrument of each type. Instead, the authors used monophonic recordings of various snare drums (and similarly bass drums and hi-hats) in order to find the basis matrix of each. Then they averaged them, creating a generic snare drum model, which was used in the same way in the next stage of their algorithm. That is, it was kept fixed in order to predefine the separation scheme, and so only a multiplicative update rule for the time-varying matrix G was used for the minimisation of the divergence function.
The signal was represented by the magnitude spectrogram, while the length of each frame was 24ms with 75% overlap between consecutive frames, leading to an actual temporal resolution of 6ms. The frequency resolution was rather coarse, since only five bands were used (20-180Hz, 180-400Hz, 400-1000Hz, 1-10kHz, 10-20kHz).
The onset detection of each instrument is done on the corresponding row of the time-varying gain matrix G. The authors implemented an algorithm motivated by Klapuri's onset detection one. Its block diagram is illustrated in figure 2.7.
drum tracks on commercial recordings. Two other transcription systems were evaluated on the same recordings' data, so that a comparison among them makes sense. One of them belonged to the pattern recognition approaches and is based on an SVM classifier, while the other was also a separation-based method (PSA). The NNMF-based algorithm performed better than both of them, achieving a successful recognition of 96% and 94% of the strokes in the unprocessed and the "production-grade" processed signals, respectively. The SVM system achieved 87% and 92%, while the PSA achieved 67% and 63%, respectively.
X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N},   k ∈ [0, N-1]

⁷ Various research results of psychoacoustics (the science of the human perception of sound) have been applied successfully to audio applications.
w_R(n) = 1 for -(M-1)/2 ≤ n ≤ (M-1)/2, and w_R(n) = 0 otherwise
Below we present the rectangular window's properties, in order to use them as a starting point for the evaluation of a few representative "real" window functions. The time-domain waveform and its DFT are illustrated in figure 2.9. The main-lobe width is equal to 4π/M radians per sample, so the bigger M becomes, the narrower the main lobe, giving better frequency resolution. The first side lobe is only 13dB lower than the main lobe's peak, and the value of M has no impact on it.
By multiplying the rectangular window by one period of a cosine, the generalized Hamming window family is obtained:

w_H(n) = w_R(n) [α + 2β cos(2πn/M)]

For α=0.5 and β=0.25 the Hann window is obtained, while for α=0.54 and β=0.23 the Hamming window (illustrated in figures 2.10 and 2.11). The main lobe's width is doubled in both cases compared to the rectangular window case, being equal to 8π/M radians per sample, giving a coarser frequency resolution. The first side lobe is drastically decreased, though, being approximately 31.5dB and 41.5dB lower than the main lobe's peak in the Hann and Hamming windows, respectively. In [8] it is stated that since the Hamming window side-lobe level is more than 40 dB down, it is often a good choice for 1% accurate systems, such as 8-bit audio signal processing systems.
Figure 2.8: An integer number of periods results in no "spectral leakage" at all (a), while a non-integer one results in high "spectral leakage" (b), which is reduced by applying a window function (c)
Audio signals of higher quality may require higher quality window functions. The Blackman-Harris family is a generalization of the Hamming family and is given by:

w_BH(n) = w_R(n) Σ_{l=0}^{L-1} α_l cos(2πln/M)

For L=4 and α_0=0.35875, α_1=0.48829, α_2=0.14128 and α_3=0.01168 the 4-term Blackman-Harris window is obtained (figure 2.12), which, at the expense of a quadruple main-lobe width compared to the rectangular window case, has a side-lobe level approximately 92dB lower than the main lobe's peak.
There are many other window functions, such as the Bartlett, the DPSS, the Kaiser, the Dolph-Chebyshev, etc. (see [8] and [9]). Depending on the signal's properties, one should mainly determine the needed attenuation level of the side lobes and the main-lobe width that can be sacrificed (but also other, less important properties of the windows, not referred to above, such as the roll-off of the side lobes) in order to make the appropriate selection.
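The quoted side-lobe levels are easy to verify numerically; in the NumPy sketch below (the window length and the zero-padding factor are arbitrary choices) the printed values come out at roughly -13, -31.5 and -42 dB for the rectangular, Hann and Hamming windows.

import numpy as np

def peak_sidelobe_db(w, pad=64):
    # Heavily zero-padded DFT of the window, normalised to the main lobe.
    W = np.abs(np.fft.rfft(w, len(w) * pad))
    W /= W.max()
    first_rise = int(np.argmax(np.diff(W) > 0))  # end of main-lobe fall-off
    return 20 * np.log10(W[first_rise:].max())   # highest side-lobe peak

M = 1024
for name, w in [("rectangular", np.ones(M)),
                ("hann", np.hanning(M)),
                ("hamming", np.hamming(M))]:
    print(f"{name:12s}: {peak_sidelobe_db(w):6.1f} dB")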
F_res = f_s / N

Actually, this is true only in the idealised case, where the signal is sampled for infinite time; any time-limited sampled signal cannot be perfectly bandlimited. Therefore, in practice only a very good approximation is obtained, instead of a perfect reconstruction.

T_res = N / f_s
For instance, if f_s=44.1kHz and N=32 then F_res=1378.125Hz and T_res=0.726ms, while for N=8192, F_res=5.38Hz and T_res=185.8ms. In figure 2.13 the spectrograms of a signal x(t), composed of one out of four frequencies (10Hz, 25Hz, 50Hz and 100Hz) at a time, are illustrated for various non-overlapping frame lengths (10, 50, 150 and 400 samples/frame). It is clearly shown that by increasing N the frequency resolution gets better, while the time resolution gets worse.
When successive frames overlap, the actual temporal resolution is determined by the hop size R instead:

T_actualRes = R / f_s
For instance, if fs=44.1kHz and R=441 then Tres=10ms, while for R=44100 it becomes
equal to 1s.
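The two resolutions can be tabulated directly from these formulas; a few-line check using the sampling rate of this thesis:

fs = 44100
for N in (32, 8192):            # the frame length sets both resolutions
    print(f"N={N}: F_res = {fs / N:.3f} Hz, T_res = {1e3 * N / fs:.3f} ms")
for R in (441, 44100):          # the hop size sets the actual T_res
    print(f"R={R}: T_actualRes = {1e3 * R / fs:.1f} ms")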
Figure 2.13: Spectrograms of x(t) for N=10 (25ms) at top left, to N=400 (1000ms) at bottom right
III
Implemented transcription algorithm and simulation
3.1 Introduction

The implemented algorithm utilises NNMF with prior knowledge, the methodology described in 2.2.2. The reason NNMF is preferred is its low computational complexity, while its performance is comparable to, or even better than, that of more complex methods. The simulation's aim is not only to confirm that this methodology works, at least for a limited drum kit. It is also necessary in order to determine the parameters that give the best transcription results, so as to design the hardware implementation based on them, namely:
• the segmentation of the signal, that is the length of each FFT frame which, together with the level of the successive frames' overlap and the sampling rate, gives the actual temporal resolution,
• the window function applied to the frame,
• the frequency bands' partitioning,
• the divergence threshold of the cost function, under which we consider that convergence has been achieved, and
• the number of components each source corresponds to.
3.2 Recordings

Recordings of training samples and test rhythms took place in a common, poorly soundproofed room. The drum kit was a rather old one, in bad condition, although another, decent drum kit was also recorded in the same room and tested without any difference in the transcription's performance.
The setup is based on a single microphone's input. Although more microphones could be used, mixed down to one signal, having only one microphone is more realistic and practical. It also suits the separation-based approaches better, since this is usually what they are used for: "unmixing" a single channel's signal. Moreover, using many microphones, each dedicated to only one or a limited number of instruments, makes sense in professional recordings of drums, where each channel needs separate processing. If the multiple microphones are carefully¹¹ set up and mixed down after proper preprocessing, so that there is minimum interference among them, a higher quality, more "clear" input signal is obtained, which makes the transcription less challenging.
¹¹ In practice the proper placement, spacing and orientation of multiple microphones to achieve a high-quality recording of a drum kit is a complex task. Let a microphone A, attached to a snare drum which is 1 metre away from a high-tom drum (with another microphone B attached to it), be subject to leakage from the high-tom strokes. Since sound travels roughly 1 metre in 3ms, A will output the leakage from the high tom with 3ms of delay. As was mentioned, strokes on percussion instruments in general have a very short attack time (on the order of a few milliseconds). The delay introduced by A would result in the blurring of the high tom's stroke, if it were not taken into account and properly corrected before the mixing down of the signal.
¹² This way, if the microphone's output is amplified by a differential amplifier, the noise voltage that was added to both signals (at the same level, since the impedances at the source and at the load are identical) will be cancelled out, making the use of long cables possible in environments with high electromagnetic interference.
At a tempo of 60bpm the duration of each quarter-note beat is equal to one second. This means that successive eighth notes' onset times are 500ms apart, and sixteenth notes' 250ms apart. The maximum tempo that we performed and recorded is 150bpm, meaning that these distances become 200ms and 100ms, respectively. Automatic transcription systems may not allow a stroke to be recognised if it occurs before a minimum time interval has passed since the last recognised stroke. In [3] the authors use such an interval of 50ms¹⁴. That makes the speed of our rhythms pretty challenging. Such an interval of 50ms was also used in our case, but applied only to each instrument itself, meaning that a stroke on the i-th instrument would not be recognised if less than 50ms had passed since the recognition of another stroke on the same instrument i.
Figure 3.2: Tempo, note values, time signature, bars and the time difference between successive notes.
It should now be clear why the information about the onset times of the strokes, together with the information about which instruments were hit, may not be enough to uniquely determine how the notes should be written: it also depends on the music notation scheme to be used. For example, if we were to fill just a simple time grid with the recognized strokes, then we would not need any more information. But in order to write the notes on a common tablature, like the one in figure 3.2, it is
¹⁴ To get an idea of how restrictive this is, it is worth noting that only a few drummers in the world can play sixteenth notes on double bass (that is, with two bass drums, one at each foot) at a tempo greater than 250bpm. This means that a right foot's bass stroke is followed by a left foot's bass stroke (and so on) with only 60ms separating successive strokes. This speed is "insane" (more than 16 hits in one second), but still lower than what a system with a 50ms limitation can handle.
necessary to know the time signature, the tempo and the note value that the tempo's "beat" refers to. However, these parameters' values are invariant in the vast majority of music tracks, or change only a few times during them. More precisely, the "beat" is almost always assigned to the quarter note value, and the time signature rarely changes during the same track. The tempo could change, but it is usually kept invariant for many bars. If a drummer defined the tempo, the "beat" and the time signature in advance, it would be possible for an automatic transcription system to output what he played in the classic music notation scheme, something that is beyond the scope of this project.
It must be clarified that "hi-hat stroke" means in our case the "closed hi-hat" type of stroke. The hi-hat is a unique type of cymbal, since it consists of two metal plates which are mounted on a stand, one on top of the other. The relative position of the two cymbals is determined by a foot pedal. When the foot pedal is pressed, the two cymbals are held together, with no space between them; that is the "closed hi-hat" position. The less the foot pedal is pressed, the greater the plates' distance becomes, reaching its maximum value when the pedal is free ("open hi-hat"). The closed hi-hat stroke is one of the most important, since, unlike the other cymbals, it produces a sound of short length, which is usually used by the drummer to help him "count" the beats and properly adjust the timing of the strokes.
The 7-instrument rhythm does not contain all possible combinations of simultaneous strokes, since this number is large, it would make the recording complex, and transcribing more than three instruments is out of this project's scope. In order to figure out how many different combinations could exist among these seven sources, we need to take into account what is realistic in practice. For instance, simultaneous strokes on three cymbals are impossible, since all cymbals are hit by the hands' sticks. Actually, only the bass drum's stroke is driven by the drummer's foot. This limits the maximum number of simultaneous strokes to three (both hands hit a drum or cymbal and the foot also hits the bass drum; the other foot always keeps the hi-hat closed). The total number of combinations becomes equal to the sum of:
• 7 single strokes,
• 7!/(2!(7-2)!) = 21 combinations of simultaneous strokes on two sources,
• 6!/(2!(6-2)!) = 15 combinations of simultaneous strokes on three sources, with the bass drum necessarily being one of them.
for i in {snare, bass, hihat, ...}
    Y_i ← get STFT of the windowed training sample x_i
    X_i ← get band-wise sums of Y_i's magnitude spectrogram
    initialize matrices B_i (M×C) and G_i (C×N_i) to ones
    while {cost function > convergence threshold}
        B_i ← B_i .* [(X_i ./ B_iG_i) G_i^T ./ 1G_i^T]
        G_i ← G_i .* [B_i^T (X_i ./ B_iG_i) ./ B_i^T 1]
        costFunction ← Σ_{m=1}^{M} Σ_{n=1}^{N_i} ( X_i[m,n] log10( X_i[m,n] / (B_iG_i)[m,n] ) - X_i[m,n] + (B_iG_i)[m,n] )
    end while
    save B_i
end for

B_fixed ← [B_snare B_bass B_hihat ...]

for n = 1 to N_test
    initialize matrix G^n (SC×1) to ones
    Y_test^n ← get STFT of the windowed n-th frame
    X_test^n ← get band-wise sums of Y_test^n's magnitude spectrogram
    while {cost function > convergence threshold}
        G^n ← G^n .* [B_fixed^T (X_test^n ./ B_fixedG^n) ./ B_fixed^T 1]
        costFunction ← Σ_{m=1}^{M} ( X_test^n[m] log10( X_test^n[m] / (B_fixedG^n)[m] ) - X_test^n[m] + (B_fixedG^n)[m] )
    end while
end for
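For reference, a NumPy sketch of the same two stages is given below. It assumes the Hamming window and the 25-band partitioning of 3.4.3 (24 Bark bands plus an appended top band), and it expects B_fixed to be the horizontal stacking of the per-instrument basis matrices found by the training NNMFs (e.g. with the nnmf() sketch of 2.2.1); all names and defaults are ours, not part of the thesis implementation.

import numpy as np

BARK_EDGES = (0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500, 22050)  # 25 bands in total

def bandwise_spectrogram(x, fs=44100, N=2048, R=441, edges=BARK_EDGES):
    # Hamming-windowed STFT magnitudes summed into bands: the X above.
    w, freqs = np.hamming(N), np.fft.rfftfreq(N, 1.0 / fs)
    masks = [(freqs >= lo) & (freqs < hi)
             for lo, hi in zip(edges[:-1], edges[1:])]
    starts = range(0, len(x) - N + 1, R)
    X = np.zeros((len(masks), len(starts)))
    for n, s in enumerate(starts):
        mag = np.abs(np.fft.rfft(w * x[s:s + N]))
        X[:, n] = [mag[mask].sum() for mask in masks]
    return X

def fixed_basis_gains(B_fixed, X, threshold=1e-4, max_iter=1000):
    # Second stage: B is kept fixed and only the gains are updated,
    # one while-loop per frame, exactly as in the pseudocode above.
    eps = 1e-12
    G = np.ones((B_fixed.shape[1], X.shape[1]))
    for n in range(X.shape[1]):
        x = X[:, n:n + 1]
        g, prev = np.ones((B_fixed.shape[1], 1)), np.inf
        for _ in range(max_iter):
            g *= (B_fixed.T @ (x / (B_fixed @ g + eps))) / \
                 (B_fixed.T @ np.ones_like(x) + eps)
            Bg = B_fixed @ g + eps
            cost = float(np.sum(x * np.log10((x + eps) / Bg) - x + Bg))
            if abs(prev - cost) < threshold:
                break
            prev = cost
        G[:, n:n + 1] = g
    return G  # compare each row against a per-source threshold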
N (frame length)   R (hop size)   overlap   T_actualRes = R/f_s
512 samples        265 samples    52%       6ms
1024 samples       441 samples    57%       10ms
2048 samples       441 samples    78%       10ms
4096 samples       441 samples    89%       10ms
Table 3.1: The overlapping level and the actual temporal resolution for various frame lengths
Figure 3.5: Transcription of the 150bpm rhythm for various frame lengths
The transcription results for the 60bpm rhythm¹⁵ and N={512, 1024, 2048, 4096} are illustrated in figure 3.6. The rest of the parameters are the same as above. In this case the number of false onsets is only one. It is worth noting that the 8 hi-hat strokes have, more or less, the same magnitude, while this was not the case for the 150bpm rhythm of figure 3.5. This happens simply because the recorded strokes themselves have equal intensity in the 60bpm rhythm, while every second stroke of the 150bpm one is of much lower intensity. That is the usual way of playing hi-hat at high tempos, and it was recorded like that in order to check whether different stroke dynamics are tolerated by the algorithm. At least in the closed hi-hat case, strokes of low intensity result in low values in G; so low that if a threshold covering them had to be found, inevitably the two last false onsets of hi-hat would have been exposed.
¹⁵ The 90bpm and 120bpm rhythms have exactly the same behaviour as the 60bpm and the 150bpm ones, respectively, and that is why they are not presented. The rest of the tests concern only the 150bpm rhythm.
Beyond the strokes of lower intensity, more intense strokes may also cause false onset recognition. This can be explained by their different frequency content, caused by two factors: firstly, the physical properties of the instruments, which result in different frequency content for different stroke dynamics; secondly, the fact that all the instruments are mounted on the same rack, and hence vibrate even when another instrument is hit, especially for intense strokes. Appendix A contains the transcription of successive strokes of increasing intensity on each single instrument. The intensity covers a wide range of dynamics, from barely audible strokes to unrealistically intense ones. The hi-hat does not produce much noise on snare and bass, but snare and bass strokes produce considerable noise on hi-hat and snare, respectively, whose magnitude increases for intense strokes.
Figure 3.6: Transcription of the 60bpm rhythm for various frame lengths
Figure 3.7 shows the impact of the actual temporal resolution on the results; for N=4096 samples, the values R={221, 441, 661, 882, 1764, 2646} are tested. The rest of the parameters are the same as above. For R=2646 the requirement N ≥ 2R is violated, but the transcription is still relatively close to that of the lower values. For 5-20ms the results are almost identical. The chosen value of T_actualRes is 10ms. The six values of R correspond to the actual temporal resolutions and overlapping levels of table 3.2.
R (hop size)   overlap   T_actualRes = R/f_s
221 samples    95%       5ms
441 samples    89%       10ms
661 samples    84%       15ms
882 samples    79%       20ms
1764 samples   57%       40ms
2646 samples   35%       60ms
Table 3.2: The actual temporal resolution for various hop sizes and a constant frame length of 4096 samples
Figure 3.7: Transcription of the 150bpm rhythm for various actual temporal resolutions
Figure 3.8: Transcription of the 150bpm rhythm for various window functions
12000, 15500] [16]. A 25th band was appended, ranging from the 25th Bark edge, 15500Hz, to half of our sampling rate, 22050Hz.
Beyond the 25 critical bands, the other band partitioning schemes tested were the 5-band scheme of [12], defined by the band edges [0, 180, 400, 1000, 10000, 22050], the 9-band scheme defined by [0, 100, 200, 300, 400, 500, 1000, 5000, 10000, 22050], and two schemes of 128 and 512 linearly spaced bands, corresponding to equal bands of approximately 172Hz and 43Hz, respectively. The frame length is equal to 2048 samples, the hop size is 441 samples, the Hamming window is used, the divergence threshold is 10^-4, the number of components of each source is 1 and the input file is the 150bpm rhythm.
Figure 3.9 shows that the frequency resolution is not an issue for the given signal's properties. The clear differences in frequency content among the strokes on snare, bass and hi-hat make it possible for a 5-band scheme, with band widths on the order of 180-12050Hz, to perform as well as a linearly spaced one with a band width equal to 43Hz.
This is also clearly depicted in the values of the fixed matrices B that came out of the training NNMFs in the 5-band case, shown in table 3.3. The bass has its energy mainly in the first band, the snare in the fourth and the hi-hat in the fifth one. We might be able to reduce the number of bands to three without considerably affecting the results, although that would give very poor resolution in case more instruments needed to be transcribed. Instead, the 25 critical bands were used, allowing more instruments, which demand finer resolution, to be introduced.
band (Hz)     snare   bass     hi-hat
0-180         5.27    125.84   1.39
180-400       36.54   57.25    2.45
400-1000      41.38   60.67    7.03
1000-10000    95.80   55.54    61.90
10000-22050   33.72   10.09    94.66
Table 3.3: The values of the fixed matrix B for the 5-band scheme
Figure 3.9: Transcription of the 150bpm rhythm for various frequency bands schemes
Figure 3.10: Transcription of the 150bpm rhythm for various divergence thresholds
basis matrix, since this is the only NNMF that needs to run in real time in the next chapter's hardware implementation. The training NNMFs have no real-time requirements and as such could use a lower threshold value. That is not necessary, though, since, like the fixed-basis NNMF, they do not need many iterations in order to find good values for the fixed matrices. Therefore, the same divergence threshold was used for both the training NNMFs and the fixed-basis one. Table B.1 in appendix B shows the snare's basis matrices for three thresholds. The values are very close even in the extreme cases.
divergence threshold   average number of iterations   maximum number of iterations
10^-11                 654                            20940
10^-7                  304                            3051
10^-4                  110                            604
10^-2                  39                             159
10^0                   30                             100
Table 3.4: The average and maximum number of iterations until convergence for various divergence thresholds
considerably the transcription results. If only one component is taken into account (the highlighted ones, black coloured for the snare and green coloured for the bass and hi-hat), then a threshold can be found such that only one false onset is recognized, rather than four.
Introducing more components does not succeed in eliminating that false onset. As was mentioned before, the usage of multiple components per source requires an algorithm that finds which one is the correct one. In case the number of components is large (more than 4 in our case), inevitably the information of more than one component must be combined in order to find all the correct onsets, adding more complexity to this algorithm. For simplicity, our implementation concerns the one-component-per-source case.
Figure 3.11: Transcription of the 150bpm rhythm for two components per source
IV
Hardware's design and implementation
4.1 System's overview

The hardware part was implemented on Terasic's DE2-70 development board. The DE2-70 hosts one of the biggest FPGAs of Altera's Cyclone II series. It also provides an audio CODEC chipset and a microphone input. The system's overview is illustrated in figure 4.1. It comprises six main blocks, which handle the initialization of the CODEC, the ADC communication, the Hamming window function, the Fourier transform, the magnitude spectrogram computation and the NNMF. Each of the modules is described in detail in the next sections.
The synthesis tool used was Altera's Quartus II 11.0, and simulation of most of the modules was done in ModelSim SE 6.6d. The synthesis resulted in the usage of:
• 18% of the available logic elements (12,309/68,416),
• 31.6% of the available memory bits (364,042/1,152,000),
• 12% of the embedded 9-bit multipliers (36/300), and
• 50% of the available PLLs (2/4).
The implemented system does not include the training stage of the algorithm. However, the training stage is the same as the real-time core that is implemented, extended with the calculations needed in order to find the fixed basis matrices B. These calculations are complementary to the real-time core's calculations needed in order to find the gain matrix G, and as such would be implemented in the same way. The implementation of the training stage would require more memory (58% in total, or 667,869/1,152,000, for training samples of length equal to 1.5 seconds). The values of the
The DE2-70 uses such a resistor, R_mic=330Ω, resulting in G_1=4.84. The second stage has a gain of 0dB that can be programmed to provide an additional fixed 20dB.
In order to decide whether the second gain stage is needed, the dynamic range of the ADC's outputs was examined. With the help of LEDs, which were flashing whenever the 16-bit signed output of the ADC was greater than ±4096, ±2048, or ±1024, it was determined that the vast majority of strokes produced values in the range [±1024, ±2048], while more intense strokes were surpassing ±2048, but never ±4096. Hence, the dynamic range of the sampled data is 13 bits (sign bit included). If the fixed 20dB gain were used, which corresponds to a gain equal to 10, 16 bits might not be enough (resulting in unwanted clipping), and therefore it was not used.
The WM8731 can either generate the clock it needs and function as a master device, by connecting an external crystal between the XTI/MCLK input and XTO output pins, or receive its clock from an external component and function as a slave. In the latter case, which is the one used, the external clock is applied directly through the XTI/MCLK input, without any software configuration needed.
In figure 4.3 the interface between the WM8731, functioning in slave mode, and the FPGA is outlined. While in slave mode, the WM8731 sends the sampled data, ADCDATA, in response to the externally applied clocks, BCLK and ADCLRC. In the next subsection, 4.4.1, the initializer module is described. It configures, over I²C, the registers of the WM8731 to sample at 44.1kHz, outputting 16-bit words. In 4.4.2 a closer look is taken at how the sampled data are fetched by the ADC controller.
Figure 4.3: The interface between the FPGA and WM8731 in slave mode
Figure 4.4 (taken from [17]): The two-wire serial interface for the software configuration of WM8731
Register                         Register's value   Value stored in ROM (hex)
Analogue Audio Path Control      0 0000 0100        340804
Digital Audio Path Control       0 0000 0000        340A00
Power Down Control               0 0111 1001        340C79
Digital Audio Interface Format   0 0000 0001        340E01
Sampling Control                 0 0010 0010        341022
Active Control                   0 0000 0001        341201
Table 4.1: The six configured registers of the WM8731; each 24-bit ROM word comprises the device write address (0x34), the 7-bit register address and the 9-bit register value
The initializer's block diagram is shown in figure 4.5, and its finite-state machine is illustrated in figure 4.6. An 18-byte (6×24-bit) ROM is used to store the register values shown in table 4.1. The dataControl signal controls a tri-state buffer, allowing the WM8731 to pull the I2C_DATA line low, acknowledging that it received 8 bits of data.
A 50kHz clock is generated by a counter, whose input is the main 50MHz clock of our system. I2C_CLOCK is generated by an OR gate, whose inputs are the counter's 50kHz clock and the FSM's signal clockControl. Any frequency in the range 0<I2C_CLOCK<526kHz could be used. When all six registers have been configured, the FSM's signal clockControl is kept high, deactivating the software control interface.
always change on the falling edge of BCLK. The only requirement regarding the frequency of BCLK is that it provide sufficient cycles for each ADCLRC transition to clock the chosen data word length (it could even be non-continuous).
Figure 4.7 (taken from [17]): ADC's output in left justified mode
The chosen sampling rate, f_s, is 44.1kHz and the WM8731 is configured to be clocked by MCLK=384f_s. A PLL, whose input is the 50MHz clock, is utilized in order to generate MCLK. Table 4.2 shows the closest values the PLL can generate, given the 50MHz input. Our sampling rate is thus slightly higher than 44.1kHz. For simplicity, the frequency chosen for BCLK is equal to 32f_s, the lowest possible value for a data word length of 16 bits. BCLK is generated by a counter whose input is MCLK, while ADCLRC is generated by another counter whose input is BCLK. The block diagram of the ADC controller is illustrated in figure 4.8.
Clock          Frequency (expected)   Frequency (in practice)
MCLK = 384fs   16.9344MHz             16.935484MHz
BCLK = 32fs    1.4112MHz              1.41129033MHz
ADCLRC = fs    44.1kHz                44.102822916kHz
Table 4.2: The approximated frequency values for the three clocks that drive the WM8731
The Hamming controller's finite-state machine and block diagram are illustrated in figures 4.10 and 4.11, respectively. It takes 25 cycles for the five multiplications with Hamming coefficients to be computed and stored in hammRAM after each high-to-low transition of ADCLRC. However, the next step (Fourier controller) is initiated at the next low-to-high transition of ADCLRC. Therefore, the latency of the Hamming controller is roughly equal to (1/2)·(1/44.1kHz) ≈ 0.011ms.
Transform length          2048 points
Data precision            16 bits
Twiddle precision         16 bits
FFT engine architecture   Quad output
Parallel FFT engines      1
I/O data flow             Burst
Table 4.3: The configuration of the FFT MegaCore function

Logic elements                 3710
Memory bits                    114688
Embedded 9-bit multipliers     24
Transform calculation cycles   2668
Block throughput cycles        6765
Table 4.4: Resource usage and cycle counts of the FFT MegaCore function
The burst I/O data flow's interface is illustrated in figure 4.12. It is implemented by the finite-state machines of figure 4.14, part of the Fourier controller, whose block diagram is shown in figure 4.15. The signal sink_ready indicates that the FFT can accept a new block of data. When both sink_ready and sink_valid are asserted, the data transfer to the FFT occurs. The assertion for one cycle of the signal sink_sop indicates the start of the input block. On the next clock cycle sink_sop is deasserted, and the next 2047 input data samples must be loaded. On the last sample sink_eop must be asserted. The 16-bit wide sink_real contains our input data, while sink_imag is always equal to zero.
Once the transform has been completed, the FFT asserts source_valid and, provided source_ready is asserted, outputs the complex data on the 16-bit source_real and source_imag signals. The exponent of each block is taken from source_exp. The signals source_sop and source_eop indicate the first and last output, respectively. The output data are stored in fourierRAM. We only need to store the first 1024 of them. Since the exponent is in the range [-16,0], each output needs 64 bits, 32 for the real part and 32 for the imaginary. Hence, fourierRAM's size is equal to 64×1024 bits = 8kB.
It takes 6765 cycles (see table 4.4) for the Fourier controller to read the 2048 input data and output the result. But since only half of the outputs are used, the next step is initiated after 6765-1024 = 5741 cycles, or roughly 0.115ms for our 50MHz clock.
Figure 4.14: Fourier controller's input (left) and output (right) FSMs
|x| = sqrt(R^2 + I^2)

The square root's calculation is considered a computationally demanding task, therefore an approximation is used instead. A method which approximates the magnitude of a complex number, called "max plus beta min", is presented in [6]. If MAX=max{|R|, |I|} and MIN=min{|R|, |I|}, then the magnitude approximation is:

|x| ≈ MAX + β·MIN,   0 < β ≤ 1

For simplicity β=2^-k, where k is a natural number, so that the multiplication by β reduces to a right shift. The average relative errors for β=1, β=0.5 and β=0.25 are shown in table 4.5, based on our test rhythm's data.
Approximation   Average relative error
β=1             27.25%
β=0.5           8.66%
β=0.25          3.19%
Table 4.5: Average relative errors of the max plus beta min magnitude approximation
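The same comparison is easy to reproduce in software; the sketch below uses Gaussian test data instead of the test rhythm, so its error figures only roughly follow table 4.5.

import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=200000)  # real parts of test samples
I = rng.normal(size=200000)  # imaginary parts

exact = np.hypot(R, I)
MAX = np.maximum(np.abs(R), np.abs(I))
MIN = np.minimum(np.abs(R), np.abs(I))
for beta in (1.0, 0.5, 0.25):  # beta = 2^-k: the multiply is a right shift
    err = np.mean(np.abs(MAX + beta * MIN - exact) / exact)
    print(f"beta={beta}: mean relative error = {100 * err:.2f}%")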
The Bands controller approximates the magnitude of the 1024 FFT outputs using max plus beta min with b = 0.25. Then, it sums the magnitudes following the 25 critical bands scheme. The block diagram of the Bands controller is illustrated in figure 4.15, and its finite-state machine in figure 4.16. Taking the worst-case scenario into account, each magnitude would have to be 33 bits wide and every band's sum 42 bits. However, in order to check whether this increase in the number of bits could be ignored, the sums were sent through the UART so that their range could be determined. As expected, the widest of the sums, the 25th one, reached its maximum value for hi-hat strokes. For strokes of normal intensity this was approximately 0x800000, while for intense strokes it reached 0x3000000. Hence, they can be represented by 24-bit and 26-bit word lengths, respectively. The increase is therefore ignored and the sums' width is set to 32 bits.
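A sketch of the band summation (Python/numpy, illustrative), assuming the 25 critical-band edges listed in table B.1 and the 1024 useful bins of a 2048-point FFT at fs = 44.1 kHz:

    import numpy as np

    FS, NFFT = 44100, 2048
    # Band edges in Hz; lower edges of the 25 bands plus the 22050 Hz top.
    EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
             1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
             9500, 12000, 15500, 22050]

    def bandwise_sums(magnitudes):
        """Sum the 1024 bin magnitudes into the 25 critical bands."""
        assert len(magnitudes) == NFFT // 2
        freqs = np.arange(NFFT // 2) * FS / NFFT      # bin centre frequencies
        band = np.searchsorted(EDGES, freqs, side="right") - 1
        sums = np.zeros(25)
        np.add.at(sums, band, magnitudes)             # accumulate per band
        return sums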
Since the FFT's length is 2048 = 2^11 points, a scale-down by multiplying with 2^(−11) could be applied to the Fourier transform's outputs. Instead of scaling down the 32-bit outputs before the magnitude and sum computations, the scale-down occurs after the sums are found. As mentioned above, the ADC output's dynamic range is 13 bits, while it is represented by 16 bits. In order to account for these 3 bits, a scale-down by multiplying with 2^(−8), instead of 2^(−11), is implemented. This means that the band sums' final widths are 24 bits, resulting in a bandsRAM size of 25 × 24 bits = 75 bytes.
It takes 6 cycles for each FFT output to be read from fourierRAM and its magnitude to be approximated and added to the corresponding band's sum. The first output needs 8 cycles, and the storage of the 25 sums to bandsRAM takes 27 cycles. This means that, following the assertion of the enableBands signal, 6173 cycles in total are needed in order to get each frame's bandwise spectrum into bandsRAM. For a 50 MHz clock, the duration of 6173 cycles is roughly equal to 0.123 ms.
costFunction = Σ (m = 1 … 25) [ X[m] · log10( X[m] / (Bfixed·Gnew)[m] ) − X[m] + (Bfixed·Gnew)[m] ]
The calculations are performed in the following five serially executed steps:

step 1: st1 = X ./ (Bfixed·G)
step 2: st2 = Bfixed^T · st1 (a 3×1 matrix)
step 3: Gnew = G .* st2
step 4: recomputation of X ./ (Bfixed·Gnew)
step 5: update of the cost function's value

The first three steps update the matrix G; step 4 re-calculates the matrix X ./ (Bfixed·G), since its elements' logarithms are needed in step 5, which updates the cost function's value. The algorithm iteratively runs the five steps until the difference between the last two cost function values is less than a constant threshold.
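A floating-point sketch of the loop (Python/numpy, illustrative): X is the 25-element bandwise spectrum and Bfixed the 25×3 fixed basis. The plain transpose-multiply in step 2 corresponds to the Lee-Seung multiplicative update [14] under the assumption that Bfixed's columns are normalised to sum to one; it is a sketch of the scheme, not a reproduction of the exact hardware arithmetic.

    import numpy as np

    def nnmf_gains(X, B_fixed, threshold=200.0, max_iters=100):
        """Update G with B held fixed, iterating the five steps until the
        cost function's change drops below the divergence threshold."""
        G = np.ones(B_fixed.shape[1])        # G initialised to ones
        prev_cost = np.inf
        for _ in range(max_iters):
            st1 = X / (B_fixed @ G)          # step 1: X ./ (Bfixed*G)
            st2 = B_fixed.T @ st1            # step 2: 3x1 result
            G = G * st2                      # step 3: element-wise update
            bg = B_fixed @ G
            st4 = X / bg                     # step 4: recompute with new G
            # step 5: cost; the epsilon guards log10(0) for silent bands
            cost = np.sum(X * np.log10(st4 + 1e-12) - X + bg)
            if abs(prev_cost - cost) < threshold:
                break
            prev_cost = cost
        return G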
Fixed-point numbers are used. The elements of B and G are unsigned numbers in 12.4 format. The integer part is 12 bits, since the product of B and G approximates the 24-bit unsigned integer spectrum stored in bandsRAM. G's values are initialized to ones before each spectrum's approximation. A Matlab simulation with fixed-point numbers was conducted; according to it, 4 fractional bits give a transcription identical to that obtained with higher precision. The only impact of the lower precision is the higher divergence threshold that needs to be used.
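For reference, a 12.4 unsigned value is just an integer scaled by 2^4, so the quantisation used in the fixed-point simulation can be sketched as follows (Python; the helper names are illustrative):

    FRAC_BITS = 4                       # 12.4: 12 integer + 4 fractional bits

    def to_fix(x):
        """Quantise a non-negative value to 12.4, clamped to 16 bits."""
        return min(round(x * (1 << FRAC_BITS)), (1 << 16) - 1)

    def from_fix(q):
        return q / (1 << FRAC_BITS)

    # The finest representable step is 1/16 = 0.0625:
    print(from_fix(to_fix(7.8376)))     # -> 7.8125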
The first execution of step 1 takes 106 cycles to complete. Since G's initialization occurs only once at every time frame's spectrum approximation, the total number of step-1 cycles is equal to 106 + 101(x − 1), where x is the total number of NNMF iterations needed to converge.
Step 3 updates matrix G's value by multiplying element-wise the matrix G with the result of step 2 (a 3×1 matrix). The timing diagram is illustrated in figure 4.19. Two cycles after step 3 is initiated, matrixG's RAM and RAM_2 are read. Five cycles later the three multiplications have been computed and step 4 is initiated. Hence, step 3 needs 7x cycles, where x is the total number of NNMF iterations.
ADC controller → Hamming controller: the Hamming controller fetches the new sample at ADCLRC's high-to-low transitions. In order to fetch the value earlier, a higher frequency for BCLK would have to be used.

Hamming controller → Fourier controller: the latency in our case is equal to half of ADCLRC's period (approximately 0.011 ms), although it could be as low as 25 cycles (0.5 µs for our 50 MHz clock).

NNMF → LEDs: in total the five steps' latency is equal to 491x + 10 cycles, where x is the number of iterations NNMF needs to converge.
Had 1000 iterations been needed for NNMF to converge, the NNMF stage would need roughly 10 ms to complete. However, the simulations showed that such a high number of iterations is unnecessary. In fact, as shown in 3.4.4, any average number of iterations greater than 10 gives identical transcription results. The Matlab fixed-point simulation showed that, in order to get an average number of iterations equal to 100, a divergence threshold equal to 200 must be used. Moreover, ignoring the cost function's value and always forcing ten iterations had no impact on the performance either. The latter means that 4920 cycles can be enough for NNMF, resulting in a total latency of 0.011 + 0.115 + 0.123 + 0.0984 = 0.3474 ms.
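The budget can be tallied directly from the cycle counts derived above (Python, illustrative; 50 MHz clock, NNMF forced to ten iterations):

    CLOCK = 50e6                          # Hz

    hamming = 1 / (2 * 44100)             # wait for the ADCLRC edge, ~0.011 ms
    fourier = (6765 - 1024) / CLOCK       # burst FFT,                ~0.115 ms
    bands   = 6173 / CLOCK                # magnitudes and bands,     ~0.123 ms
    nnmf    = (491 * 10 + 10) / CLOCK     # ten forced iterations,    ~0.098 ms

    total = (hamming + fourier + bands + nnmf) * 1e3
    print(f"total computational latency: {total:.3f} ms")   # ~0.35 ms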
The latency just described concerns only the computational part. In order to get an idea of what the latency of onset detection could be, other factors must also be taken into account. In an ideal case, let a single sample turn a time frame with no recognized strokes into one with at least one recognized stroke. This sample may arrive just after a new Fourier computation has begun; then there is an additional latency of roughly 10 ms until the next transform starts. Beyond that, the use of window functions must also be taken into account: the last 441 samples of the 2048-point FFT are always multiplied by the lowest values of the window function (less than 0.4), so an onset will possibly be cancelled for (at least) one transform. Summing the above factors, an estimate that an onset is signalled by the LEDs approximately 10-40 ms after it occurs seems reasonable.
5. Conclusion
Regarding the hardware implementation, the performance of the system meets expectations, since the vast majority of strokes are correctly recognized. Unrecognized strokes are more frequent in the case of simultaneous strokes on at least two instruments. The tests were performed on the same drum kit, using the fixed basis matrix obtained in simulation from recordings of the same kit.

Regarding the specific NNMF-based approach, it has been shown, in both the real-time and the simulation results, that it behaves very reliably. When more than one component per source is used, the transcription is reliable even with more instruments present. Together with an algorithm that combines the various components' information in order to extract the correct onsets, it could form a very computationally efficient automatic transcription system. However, even the minimal core that was implemented could be used, without further additions, to implement a tempo-tracker system; that is, a system which focuses only on the snare and bass drums in order to extract the variations in tempo during a drumming performance.
Evaluating the work on the thesis project in general, very useful experience was gained in VHDL programming, a main motivation for the project. In addition, approaching the automatic transcription problem from scratch, the literature survey, and building an FPGA-friendly system have been an invaluable, instructive experience. A very important lesson concerned the debugging procedure of the VHDL code. Surprisingly, in some premature tests the system worked for the first time, although bugs were present in various stages of the data flow. The performance was very poor (barely half of the strokes were recognized), but the fact that the system appeared to work was itself misleading: the poor performance pointed to possible bugs in some fixed-point calculations, where errors in tracking the radix point, or in manipulating the data word widths, can easily lead to behaviour that is erroneous but not critically so. Indeed there were such bugs, but their correction had the opposite effect on the system's performance; until a bug was found at the very first stage of the algorithm which, as always, concerned the simplest thing: a wrong configuration bit for the CODEC's registers, which resulted in a wrong frequency for the clocks of the ADC controller.
References
[1] Klapuri, Anssi, "Introduction to Music Transcription," in Signal Processing Methods for Music Transcription, 2006.
[2] FitzGerald, Derry and Paulus, Jouni, "Unpitched Percussion Transcription," in Signal Processing Methods for Music Transcription, 2006.
[3] Klapuri, Anssi, "Sound onset detection by applying psychoacoustic knowledge," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, Arizona, USA, 1999.
[4] Vos, J. and Rasch, R., "The perceptual onset of musical tones," Perception and Psychophysics, vol. 29, no. 4, pp. 323-335, 1981.
[5] Dixon, Simon, "Onset Detection Revisited," in Proc. DAFx-06, pp. 133-137, 2006.
[6] http://www.dsprelated.com/showmessage/38751/1.php
[7] Soderquist, Peter and Leeser, Miriam, "Division and Square Root: Choosing the Right Implementation," IEEE Micro, vol. 17, no. 4, pp. 56-66, July/August 1997.
[8] Smith, J.O., "Spectral Audio Signal Processing," http://ccrma.stanford.edu/~jos/sasp/, online book.
[14] Lee, Daniel and Seung, Sebastian, "Algorithms for Non-negative Matrix Factorization," 2001.
[15] Berry, Michael and Browne, Murray, "Algorithms and Applications for Approximate Nonnegative Matrix Factorization," 2006.
[16] Smith, Julius O. III, "Bark and ERB Bilinear Transforms," https://ccrma.stanford.edu/~jos/bbt/bbt.html
[17] Wolfson's WM8731 datasheet, http://www.wolfsonmicro.com/documents/uploads/data_sheets/en/WM8731.pdf
[18] FFT MegaCore Function User Guide, http://www.altera.com/literature/ug/ug_fft.pdf
[19] Advanced Computer Architecture (DTU 02211), http://www2.imm.dtu.dk/courses/02211/, http://en.wikiversity.org/wiki/Computer_Architecture_Lab
Appendix A
Figure A.4: Transcription of the bass's dynamics (considerable noise on the snare, for very intense strokes)
Figure A.6: Transcription of the hi-hat's dynamics (negligible noise on both snare and bass)
Appendix B
Table B.1 shows the Bsnare basis matrix values, calculated for three different values of the NNMF divergence threshold (10^-11, 10^-2, 10^0).
frequency band (Hz)    Bsnare for divergence    Bsnare for divergence    Bsnare for divergence
                       threshold = 10^-11       threshold = 10^-2        threshold = 10^0
0-100                   1.3726                   1.1095                   1.2348
100-200                 7.8376                   6.3353                   7.0508
200-300                28.1783                  22.7770                  25.3495
300-400                11.6785                   9.4399                  10.5061
400-510                11.0821                   8.9579                   9.9696
510-630                12.2025                   9.8635                  10.9775
630-770                10.1586                   8.2114                   9.1388
770-920                11.1325                   8.9985                  10.0148
920-1080                6.3861                   5.1620                   5.7450
1080-1270               5.6378                   4.5571                   5.0718
1270-1480               6.0282                   4.8727                   5.4230
1480-1720               4.9896                   4.0332                   4.4887
1720-2000               4.6511                   3.7595                   4.1841
2000-2320               6.0283                   4.8728                   5.4231
2320-2700               5.2454                   4.2400                   4.7188
2700-3150               6.9593                   5.6253                   6.2606
3150-3700               6.9419                   5.6113                   6.2450
3700-4400               7.6547                   6.1874                   6.8862
4400-5300               9.5331                   7.7057                   8.5760
5300-6400              10.3215                   8.3430                   9.2853
6400-7700              11.4783                   9.2781                  10.3260
7700-9500              19.6498                  15.8832                  17.6771
9500-12000             19.3563                  15.6460                  17.4130
12000-15500            13.9664                  11.2893                  12.5643
15500-22050            11.1513                   9.0137                  10.0318

Table B.1: Snare's fixed basis matrices for three divergence thresholds
Appendix C
Appendix D
The transcription of the seven-instrument 60 bpm rhythm follows. Each source is represented by seven components; the useful ones are highlighted. In all cases except the hi-hat's, adding the highlighted components of each source results in recognition with a 100% success rate. The hi-hat is recognized correctly 3 out of 4 times, but 2 false onsets are also recognized.