Sound Localizer

Abstract

As our final project, we built a sound source localizer - a system that determines the position of human speech and displays it on a two-dimensional map on a screen, using only cheap microphones and on-the-fly calculation on a Nexys4 FPGA board. The sounds picked up by the microphones are compared to determine how much longer the sound spent traveling through the air to reach one microphone than the other, which corresponds to how much farther the source is from one microphone than from the other. This report details how we approached the design of this system, some of the technical choices and limits involved, and some of the mathematical background that brought us to the working set-up in Figure 1.

Figure 1: Schematic representation of our finalized set-up. 1. Sound source 2. Microphone A 3. Microphone B 4. VGA Display 5. Microphone preamplifiers 6. Nexys4 FPGA board
wire xadc_convst_in = (xadc_eoc & (~xadc_eos)) | clk_sample_tick;

To read the digital values from the XADC after each conversion, we wrote a module that sequentially reads its status registers at their designated memory addresses. These values are fed into the cross-correlation modules.
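For reference, the read sequence amounts to a small state machine driving the XADC's dynamic reconfiguration port. The sketch below shows one way this can look; the module name, the two auxiliary-channel addresses (7'h14 and 7'h1B) and the port names follow the usual XADC wizard interface but are illustrative assumptions, not our exact module.

// Illustrative sketch: sequentially read two XADC status registers over the DRP.
// Channel addresses and port names are assumptions, not our exact configuration.
module xadc_reader(
    input  wire        clk,
    input  wire        xadc_eos,      // end-of-sequence from the XADC
    input  wire        xadc_drdy,     // DRP data ready
    input  wire [15:0] xadc_do,       // DRP read data
    output reg  [6:0]  xadc_daddr = 7'h14,
    output reg         xadc_den   = 1'b0,
    output reg  [15:0] sample_a   = 16'h0,
    output reg  [15:0] sample_b   = 16'h0,
    output reg         samples_valid = 1'b0
);
    localparam IDLE = 2'd0, READ_A = 2'd1, READ_B = 2'd2;
    reg [1:0] state = IDLE;

    always @(posedge clk) begin
        xadc_den      <= 1'b0;
        samples_valid <= 1'b0;
        case (state)
            IDLE: if (xadc_eos) begin        // a full conversion sequence finished
                xadc_daddr <= 7'h14;         // first microphone channel (assumed)
                xadc_den   <= 1'b1;
                state      <= READ_A;
            end
            READ_A: if (xadc_drdy) begin     // first result returned
                sample_a   <= xadc_do;
                xadc_daddr <= 7'h1B;         // second microphone channel (assumed)
                xadc_den   <= 1'b1;
                state      <= READ_B;
            end
            READ_B: if (xadc_drdy) begin     // second result returned
                sample_b      <= xadc_do;
                samples_valid <= 1'b1;
                state         <= IDLE;
            end
        endcase
    end
endmodule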
Cross-correlation

File: cross_correlation.v - author: Changping

The purpose of the correlation modules is to compare two input signals and determine how much (in terms of number of samples, which corresponds to time) one lags behind the other.

Theory

This module calculates the cross correlation score (dot product) of two input signals when one is time shifted by a lag m over a window of W samples:

C(n, m) = ∑_{k=0}^{W−1} A[n + k] B[n + k + m]

where A and B are the sample streams from the two microphones and n is the index of the current sample.

Notice that C(n, m) depends on the future value B[n + W − 1 + m] and the current value A[n], so we need to store at least m + W samples, or S = m + W. The requirement on m is discussed in the next section.

Another implication of the equation above is that our score will be delayed by S sampling cycles, but this delay is negligible in human perception.

Example

In the example below, we choose W = 5 and S = 10. Suppose there exists a signal

0 0 0 0 0 1 2 3 4 5 4 3 2 1 0 0 0 0

This signal arrives at mic 1 after a delay of 2 sampling cycles, and at mic 2 after a delay of 5 sampling cycles. At one point, the sample buffers for the two mics will contain:

Mic 1: 2 3 4 5 4 3 2 1 0 0
Mic 2: 0 0 1 2 3 4 5 4 3 2

Based on this dataset, the relative delay of the signal received at mic 2 compared to mic 1 is 3 cycles. If we calculate the cross correlation scores for m ∈ [0, 5], we expect the score for m = 3 to be the highest. Here:

C(n, 3) = (2)(2) + (3)(3) + (4)(4) + (5)(5) + (4)(4) = 70

Observe that with each streaming input sample, the difference between the new and the old correlation score for the same lag m is

C(n, m) − C(n − 1, m) = ∑_{k=0}^{W−1} A[n + k] B[n + k + m] − ∑_{k=0}^{W−1} A[n − 1 + k] B[n − 1 + k + m] = A[n + W − 1] B[n + W − 1 + m] − A[n − 1] B[n − 1 + m]

Therefore, in each cycle, we only need two multiplications, one subtraction and one addition to compute the next value.
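To illustrate this update rule (this is a minimal sketch, not cross_correlation.v itself, which shares a single multiplier across all lags as described in the next section), the running score for one fixed lag could be maintained as follows, assuming the four samples named in the derivation are presented by the surrounding buffer logic:

// Illustrative sketch: running cross-correlation score for one fixed lag m,
// updated with two multiplications, one subtraction and one addition per sample.
module corr_score_single_lag(
    input  wire               clk,
    input  wire               sample_tick,  // one pulse per audio sample
    input  wire signed [7:0]  a_new,        // A[n + W - 1]
    input  wire signed [7:0]  b_new,        // B[n + W - 1 + m]
    input  wire signed [7:0]  a_old,        // A[n - 1]
    input  wire signed [7:0]  b_old,        // B[n - 1 + m]
    output reg  signed [31:0] score = 32'sd0  // C(n, m)
);
    always @(posedge clk) begin
        if (sample_tick)
            // C(n,m) = C(n-1,m) + A[n+W-1]*B[n+W-1+m] - A[n-1]*B[n-1+m]
            score <= score + a_new * b_new - a_old * b_old;
    end
endmodule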
Implementation

To implement the cross correlation module, we initially wanted to write a parallel module that could calculate multiple correlation scores simultaneously for many different lags. However, that requires parallel reads from the audio buffers, which in turn requires implementing them with the FPGA's limited registers.

In order to increase the accuracy, we want a large enough window size, but we also want minimal latency. So we wrote our cross correlation module in a sequential fashion, where in each cycle it issues one read from each audio buffer and computes one multiplication. This allows us to implement the audio buffers using Block RAMs, which have higher capacity and low latency.

One caveat we encountered is that a BRAM read requires a minimum of two cycles of delay (one cycle to program the read address, and another cycle for the data to return). To deal with this, we run our BRAMs at the system clock rate of 100 MHz and only perform a calculation once every four cycles, effectively reducing the calculation rate from 100 MHz to 25 MHz. This ensures the memory delay is hidden, and requested data appears to be immediately available on the next calculation cycle.

The cross correlation module is a state machine whose state is (op, m). op specifies the operation it is currently processing. The two main operations are OP_SUB and OP_ADD: the former subtracts from the correlation score the product of the oldest samples, and the latter adds to the correlation score the product of the newest samples. Each operation is repeated for 200 distinct values of m, so the total number of cycles needed to update the scores for a new sample is roughly 400.

Each correlation module stores two streams of audio samples, and a total of two correlation modules is needed. We need to store S samples per audio stream, which is double the correlation window W (see the theory section). We found an empirical limit of W = 1024, above which Vivado fails to compile the design.

The input samples have a width of 8 bits (signed), and the output scores are stored as 32-bit signed numbers.
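The latency-hiding trick can be sketched as follows, assuming a block RAM with a registered output (two cycles from address to data) and a divide-by-four phase counter that generates the calculation strobe; the module and signal names are illustrative, not a drop-in piece of our design:

// Illustrative sketch: hiding the two-cycle BRAM read latency by consuming
// data only once every four system clock cycles (100 MHz / 4 = 25 MHz).
module audio_buffer_paced(
    input  wire              clk,         // 100 MHz system clock
    input  wire              we,          // write a new audio sample
    input  wire       [9:0]  write_addr,
    input  wire signed [7:0] write_data,
    input  wire       [9:0]  read_addr,   // address chosen by the correlation state machine
    output wire              calc_tick,   // high when bram_q is safe to consume
    output reg signed [7:0]  bram_q = 0   // registered BRAM output (2-cycle read latency)
);
    reg [1:0]        phase  = 2'd0;       // divide-by-four phase counter
    reg signed [7:0] mem [0:1023];        // inferred block RAM (audio sample buffer)
    reg signed [7:0] mem_rd = 0;          // BRAM's internal output register

    assign calc_tick = (phase == 2'd3);   // one calculation slot every four cycles

    always @(posedge clk) begin
        phase <= phase + 2'd1;
        if (we)
            mem[write_addr] <= write_data;
        mem_rd <= mem[read_addr];         // cycle 1: address -> internal register
        bram_q <= mem_rd;                 // cycle 2: internal register -> output
    end
    // The correlation state machine advances read_addr only on calc_tick, so an
    // address always has several cycles to propagate through both registers
    // before bram_q is used in the next multiply step.
endmodule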
Correlation Output Filter

File: cor_filter.v - author: Changping

Theoretically, the peak of the correlation score distribution corresponds to the true delay of a signal. However, we found that due to noise in the environment, the peak is not necessarily stable. This translates into a lot of jitter in the triangulation result.

Hence, we coded a very simple filtering module whose parameters are configurable using the switches. The filter requires the input delay to stay within a specified range for at least a specified number of cycles. Any delay that fails this test is rejected and does not reach the triangulation stage. This introduces a visible level of stability.

In retrospect, a better filter would be an algorithm that tracks multiple peaks and selects the one that corresponds to the previously selected peak, even if there exists a peak with a higher score. Such a filter would be more tolerant of an erroneous global peak.
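A minimal sketch of this kind of stability filter is shown below. The acceptance rule (the new delay must stay within ±RANGE of the current candidate for at least MIN_STABLE updates) and the default parameter values are assumptions for illustration rather than the exact behaviour of cor_filter.v:

// Illustrative sketch of a delay stability filter: a raw delay estimate is only
// forwarded to triangulation after it has stayed close to the running candidate
// for a minimum number of updates. Parameter values are assumptions.
module delay_filter #(
    parameter signed [15:0] RANGE      = 16'sd3,   // allowed jitter, in samples (assumed)
    parameter        [15:0] MIN_STABLE = 16'd8     // required consecutive agreements (assumed)
)(
    input  wire               clk,
    input  wire               delay_valid,          // a new raw delay from the correlation module
    input  wire signed [15:0] delay_in,
    output reg                delay_out_valid = 1'b0,
    output reg  signed [15:0] delay_out = 16'sd0    // filtered delay, passed to triangulation
);
    reg signed [15:0] candidate    = 16'sd0;
    reg        [15:0] stable_count = 16'd0;

    wire signed [15:0] diff     = delay_in - candidate;
    wire               in_range = (diff <= RANGE) && (diff >= -RANGE);

    always @(posedge clk) begin
        delay_out_valid <= 1'b0;
        if (delay_valid) begin
            if (in_range) begin
                if (stable_count + 16'd1 >= MIN_STABLE) begin
                    delay_out       <= delay_in;    // stable long enough: accept
                    delay_out_valid <= 1'b1;
                end else begin
                    stable_count <= stable_count + 16'd1;
                end
            end else begin
                candidate    <= delay_in;           // out of range: restart the window
                stable_count <= 16'd0;
            end
        end
    end
endmodule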
Audio Output

File: pwm.v - author: Joren

Since we were aiming for 8-bit audio output, we used a pulse window width of 256 clock cycles of the 100 MHz system clock, implying a PWM rate of around 390 kHz, well above the audible range. The duty cycle - the amount of time within this window that the PWM output is held high - is then varied between 0 and 256. Figure 5 gives an example of what a PWM waveform looks like.
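A minimal PWM generator along these lines - a free-running 8-bit counter clocked at 100 MHz, with the output held high while the counter is below the requested level - could look like the following sketch; it illustrates the scheme rather than reproducing pwm.v:

// Illustrative sketch: 8-bit audio PWM. The counter wraps every 256 cycles of
// the 100 MHz clock (~390 kHz), and the output is high for `level` of them.
module audio_pwm(
    input  wire       clk,      // 100 MHz system clock
    input  wire [7:0] level,    // requested duty cycle, 0..255
    output reg        pwm_out = 1'b0
);
    reg [7:0] counter = 8'd0;

    always @(posedge clk) begin
        counter <= counter + 8'd1;       // wraps every 256 cycles (~390 kHz)
        pwm_out <= (counter < level);    // high for `level` cycles out of 256
    end
endmodule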
Music Player

File: music_player.v, sd_controller.v - author: Changping

The music player module plays music from an SD card. We used the sd_controller.v file to read sectors from the SD card and connected it to a Block RAM-backed 1 KB FIFO, so that we can aggressively read from the SD card in advance and always maintain at least one sector (512 bytes) of buffered data in memory. On each sampling clock cycle, one byte is popped from the FIFO and written to the audio PWM module for playback.

On a MacBook, we wrote a simple Python script to dump the length and raw bytes of an 8-bit @ 44 kHz .wav file onto an SD card used as a raw block device.
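The per-sample playback step then reduces to something like the sketch below, assuming a first-word-fall-through FIFO with the usual rd_en/dout/empty interface and the PWM level input described in the previous section; the names are illustrative:

// Illustrative sketch: pop one byte from the FIFO per sample tick and use it
// as the PWM duty level. A first-word-fall-through FIFO is assumed, so dout
// already holds the next byte whenever empty is low.
module playback_feeder(
    input  wire       clk,              // 100 MHz system clock
    input  wire       clk_sample_tick,  // one pulse per audio sample period
    input  wire       fifo_empty,
    input  wire [7:0] fifo_dout,        // next audio byte from the FIFO
    output reg        fifo_rd_en = 1'b0,
    output reg  [7:0] pwm_level  = 8'd128  // mid-scale silence for unsigned 8-bit audio
);
    always @(posedge clk) begin
        fifo_rd_en <= 1'b0;
        if (clk_sample_tick && !fifo_empty) begin
            pwm_level  <= fifo_dout;    // consume the current byte
            fifo_rd_en <= 1'b1;         // and advance the FIFO to the next one
        end
    end
endmodule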
y² + x² = a² ⇔ y = √(a² − x²)

Note that g² was computed in the previous iteration of the algorithm, or is 0 in the first iteration. All other operations involved are simple additions and bit shifts.

Since the square root of an n-bit number has at most n/2 significant bits, the algorithm starts with i = n/2, decreases i by one each iteration, and consistently computes the root of an n-bit unsigned integer in n/2 cycles.
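A sketch of this style of bit-serial square root is shown below for a 32-bit radicand, deciding one result bit per clock cycle from bit N/2 − 1 down to bit 0. It follows the description above but is an illustration, not the triangulation module itself:

// Illustrative sketch: bit-serial integer square root. One root bit is decided
// per cycle using only the previously accumulated g^2, shifts and additions.
module int_sqrt #(
    parameter N = 32                       // width of the radicand (assumed 32 here)
)(
    input  wire           clk,
    input  wire           start,           // load a new radicand
    input  wire [N-1:0]   a,
    output reg  [N/2-1:0] g    = 0,        // root, valid when done is high
    output reg            done = 1'b0
);
    reg [N-1:0]             a_reg = 0;
    reg [N-1:0]             g_sq  = 0;     // g^2 carried over from the previous iteration
    reg [$clog2(N/2+1)-1:0] i     = 0;     // index of the root bit being decided
    reg                     busy  = 1'b0;

    // Trying to set bit i of the root: (g + 2^i)^2 = g^2 + g*2^(i+1) + 2^(2*i),
    // which needs only the stored g^2, a shift of g and a shifted constant.
    wire [N-1:0] g_ext    = {{(N/2){1'b0}}, g};
    wire [N-1:0] trial_sq = g_sq + (g_ext << (i + 1)) + ({{(N-1){1'b0}}, 1'b1} << (2 * i));

    always @(posedge clk) begin
        done <= 1'b0;
        if (start) begin
            a_reg <= a;
            g     <= 0;
            g_sq  <= 0;
            i     <= N/2 - 1;
            busy  <= 1'b1;
        end else if (busy) begin
            if (trial_sq <= a_reg) begin   // keeping bit i still satisfies g^2 <= a
                g[i] <= 1'b1;
                g_sq <= trial_sq;
            end
            if (i == 0) begin
                busy <= 1'b0;
                done <= 1'b1;              // root ready after N/2 iterations
            end else begin
                i <= i - 1'b1;
            end
        end
    end
endmodule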
In order to understand the results of our real-time cross-correlation and triangulation calculations, we included a representation of the set-up, with its microphones and the location of the sound source. We also added some real-time insight into the results of the calculations beyond just the final x, y coordinate.

http://www-mtl.mit.edu/Courses/6.111/labkit/vga.shtml

Theoretical accuracy
Lessons Learned

Sampling from several channels

We initially expected much more trouble sampling from three microphones simultaneously. This turned out to be very simple, thanks to the wizards built into the Vivado design suite and the multiplexer built into the XADC.

Third microphone vs. sound output from FPGA

The localization is currently much more reliable when using a third microphone and your voice as a sound source than when using the sound output from the FPGA. This is surprising, since the latter gives very accurate information about exactly what sound is being produced, and at what time.

One possible explanation would be that the process of converting a waveform through PWM, reproducing it through a speaker, and then picking it up with a microphone acts as a significant filter on the signal. Figure 7 compares the output wave and the signals recorded on the two microphones. The sound output shown is the start of a "string pluck" sound generated by a Karplus-Strong string synthesizer, which resembles white noise and as such has a significant random high-frequency component. While the different delays between the signals are clearly visible, and the two microphone waveforms look very similar, they are significantly different from the raw output wave.

Understanding the filtering effects at work could make localization of a non-microphone sound source work just as well, but we did not have the time to look into this.

Challenges due to periodicity of sound

From the correlation score distributions, we learned that the score develops peaks at lag values a fixed distance apart, where that distance depends on the dominant frequencies of the sound being picked up. When a sound is very periodic, the correlation module gets confused, since the waveforms "align" at several places, each shifted by one period.
One solution could be to try to filter out the periodic component of the sound, and focus
on correlating the irregularities. Another
interesting solution could be to return to our
initial idea of using three stationary
microphones, since the difference in arrival
time between two microphones that are
near each other could often be less than
one period of the sound. This requires
changes to our correlation module (faster
sampling, shorter window, …) that we did
not get to experiment with.
Division of work
Joren: preamplifiers, triangulation, and VGA
display modules
Changping: cross-correlation, music player,
and ADC sampling modules.