
2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2019, New Paltz, NY

UNIVERSAL SOUND SEPARATION

Ilya Kavalerov1,2∗, Scott Wisdom1, Hakan Erdogan1, Brian Patton1, Kevin Wilson1, Jonathan Le Roux3, John R. Hershey1

1 Google Research, Cambridge, MA
2 Department of Electrical and Computer Engineering, UMD
3 Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA

∗ Work done during an internship at Google.
arXiv:1905.03330v2 [cs.SD] 2 Aug 2019

ABSTRACT

Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.

Index Terms— Source separation, deep learning, non-speech audio

1. INTRODUCTION

A fundamental challenge in machine hearing is that of selectively listening to different sounds in an acoustic mixture. Extracting estimates of each source is especially difficult in monaural recordings where there are no directional cues. Recent advances have been made in solving monaural speech enhancement and speech separation problems in increasingly difficult scenarios, thanks to deep learning methods [1-11]. However, separation of arbitrary sounds from each other may still be considered a "holy grail" of the field. In particular, it is an open question whether current methods are best suited to learning the specifics of a single class of sounds, such as speech, or can learn more general cues for separation that can apply to mixtures of arbitrary sounds. In this paper, we propose a new universal sound separation task, consisting of mixtures of hundreds of types of sound. We show that the best methods are surprisingly successful, producing an average improvement of almost 10 dB in scale-invariant signal-to-distortion ratio (SI-SDR) [12].

Previous experiments have focused mainly on scenarios where at least one of the target signals to be separated is speech. In speech enhancement, the task is to separate the relatively structured sound of a single speaker from a much less constrained set of non-speech sounds. For separation of multiple speakers, the state of the art has progressed from speaker-dependent separation [13], where models are trained on individual speakers or speaker combinations, to speaker-independent speech separation [6-8], where the system has to be flexible enough to separate unknown speakers. In particular, ConvTasNet is a recently proposed model [14] that uses a combination of learned time-domain analysis and synthesis transforms with a time-dilated convolutional network (TDCN), showing significant improvements on the task of speech separation relative to previously state-of-the-art models based on short-time Fourier transform (STFT) analysis/synthesis transforms and long short-term memory (LSTM) recurrent networks. Despite this progress, it is still unknown how current methods perform on separation of arbitrary types of sounds. The fact that human hearing is so adept at selective listening suggests that more general principles of separation exist and can be learned from large databases of arbitrary sounds.

This paper provides four contributions. First, we investigate the universal sound separation problem in depth for the first time, by constructing a dataset of mixtures containing a wide variety of different sounds. Second, we evaluate ConvTasNet on both speech/non-speech separation and universal sound separation tasks for the first time. Third, we provide a systematic comparison of different combinations of masking network architectures and analysis-synthesis transforms, optimizing each over the effect of window size. Finally, we propose novel variations in architecture, including alternative feature normalization, improved initialization, longer-range skip-residual connections, and iterative processing, that further improve separation performance on all tasks.

Figure 1: Architecture for mask-based separation experiments. We vary the mask network and analysis/synthesis transforms.

2. PRIOR WORK

A variety of networks have been successfully applied to two-source separation problems, including LSTMs and bidirectional LSTMs (BLSTMs) [2, 4], U-Nets [15], Wasserstein GANs [16], and fully convolutional network (FCN) encoder-decoders followed by a BLSTM [17]. For multi-source separation, a variety of architectures have been used that directly generate a mask for each source, including BLSTMs [6, 9], CNNs [18], DenseNets followed by an LSTM [19], separate encoder-decoder networks for each source [20], joint one-to-many encoder-decoder networks with one decoder per source [21], and TDCNs with a learnable analysis-synthesis basis [14]. Our models are most similar to [9] and [14]. Networks that perform source separation in an embedding space instead of in the time-frequency domain, such as deep clustering [6, 11], have also been effective at separation tasks, but we leave exploration of those methods for future work.

Previous source separation work has focused on speech enhancement and speech separation [6, 16, 22, 23]. Small datasets used for the non-speech multi-source separation setting have included distress sounds from DCASE 2017 [18], and speech and music in SiSEC-2015 [17, 20]. Singing voice separation has focused on vocal and music instrument tracks [15, 24].

To our knowledge, the work introduced here is the first to investigate separation of arbitrary real-world sounds sourced from a large number of sound classes.

3. MODELS

We use mask-based separation systems driven by deep neural networks, and we experiment with combinations of two different network architectures and two different analysis-synthesis bases. All masking networks use a sigmoid activation to predict a real number in [0, 1] to modulate each basis coefficient.

3.1. Masking network architectures

The first masking network we use consists of 14 dilated 2D convolutional layers, a bidirectional LSTM, and two dense layers, which we will refer to as a convolutional-LSTM-dense neural network (CLDNN). The CLDNN is based on a network which achieves state-of-the-art performance on CHiME2 WSJ0 speech enhancement [25] and strong performance on a large internal dataset [26].

Our second masking network is a TDCN inspired by ConvTasNet [14]. We employ the same parameters as the best noncausal model reported by [14]. We also consider an improved version of ConvTasNet's TDCN masking network, which we refer to as "improved TDCN" (TDCN++). This new architecture includes three improvements to the original ConvTasNet network. First, global layer normalization within the TDCN, which normalizes over all features and frames, is replaced with a feature-wise layer normalization over frames. This is inspired by cepstral mean and variance normalization used in automatic speech recognition systems. Second, we add longer-range skip-residual connections from earlier repeat inputs to later repeat inputs after passing them through dense layers. This presumably helps with gradient flow from layer to layer during training. Third, we add a learnable scaling parameter after each dense layer. The scaling parameter for the second dense layer in each convolutional block, which is applied right before the residual connection, is initialized to an exponentially decaying scalar equal to 0.9^L, where L is the layer or block index. This initial scaling contributes to better training convergence by first learning the contributions of the bottom layers, similar to layer-wise training, and then easily adjusting the scale of each block's contribution through the learnable scaling parameter. This initialization is partly inspired by "Fixup" initialization in residual networks [27].

A third network variant we consider is an iterative improved TDCN network (iTDCN++), in which the signal estimates from an initial mask-based separation network serve as input, along with the original mixture, to a second separation network. This architecture is inspired by [7], in which a similar iterative scheme with LSTM-based networks led to significant performance improvements. In our version, both the first and second stage networks are identical copies of the TDCN++ network architecture, except for the inputs and parameters. In the second stage, the noisy mixture and initial signal estimates are transformed by the same basis (STFT or learned) prior to concatenation of their coefficients. Because, with two iterations, the network is twice as deep as a single-stage TDCN++, we also include a twice deeper TDCN++ model (2xTDCN++) for comparison.

3.2. Analysis-synthesis bases

Whereas earlier mask-based separation work had used STFTs as the analysis-synthesis basis due to the sparsity of many signals in this domain, ConvTasNet [14] uses a learnable analysis-synthesis basis. The analysis transform is a framewise basis analogous to the STFT, and can also be described as a 1D convolution layer where the kernel size is the window size, the stride is the hop length, and the number of filters is the number of basis vectors. A ReLU activation is applied to the analysis coefficients before processing by the mask network. The learnable synthesis transform can be expressed as a transposed 1D convolution and operates similarly to an inverse STFT, where a linear synthesis basis operates on coefficients to produce frames which are overlap-added to form a time-domain signal. Unlike an STFT, this learnable basis and its resulting coefficients are real-valued.

The original work [14] found that ConvTasNet performed best with very short (2.5 ms) learnable basis functions. However, this window size is an important parameter that needs to be optimized for each architecture, input transform, and data type. We therefore compare a learnable basis with the STFT as a function of window size, in combination with CLDNN and TDCN masking networks. All models apply mixture consistency projections to their outputs [26], which ensure the estimated sources add up to the input mixture. Note that the TDCN with STFT basis is a novel combination that, as we show below, performs best on the universal separation task.
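To make the pipeline of Figure 1 and the basis description above concrete, the following is a minimal NumPy sketch of framewise analysis with a real-valued basis and ReLU, masking, overlap-add synthesis, and the simplest (uniform) form of a mixture consistency projection. It is an illustrative sketch, not the TensorFlow implementation used in the paper: the random basis matrices and masks stand in for the learned (or STFT) basis and for the output of a CLDNN/TDCN masking network, and the uniform redistribution of the residual is only one variant of the projection in [26].

```python
import numpy as np

def frame(x, win, hop):
    """Slice signal x into overlapping frames of length `win` with stride `hop`."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop:i * hop + win] for i in range(n)])  # (frames, win)

def overlap_add(frames, hop):
    """Reconstruct a time-domain signal by overlap-adding synthesis frames."""
    n_frames, win = frames.shape
    out = np.zeros((n_frames - 1) * hop + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

def separate(mixture, analysis_basis, synthesis_basis, masks, hop):
    """Mask-based separation with a framewise real-valued basis (cf. Section 3.2).

    analysis_basis:  (win, num_coeffs) matrix, applied framewise (like a 1D conv).
    synthesis_basis: (num_coeffs, win) matrix, applied framewise (like a transposed
                     1D conv), followed by overlap-add.
    masks:           (num_sources, frames, num_coeffs) values in [0, 1]; here given,
                     in the paper produced by a CLDNN or TDCN masking network.
    """
    frames = frame(mixture, analysis_basis.shape[0], hop)
    coeffs = np.maximum(frames @ analysis_basis, 0.0)        # ReLU analysis coefficients
    sources = np.stack([
        overlap_add((m * coeffs) @ synthesis_basis, hop)      # mask, synthesize, overlap-add
        for m in masks
    ])
    # Mixture consistency, simplest uniform variant: redistribute the residual so the
    # estimated sources sum exactly to the input mixture.
    residual = mixture[:sources.shape[1]] - sources.sum(axis=0)
    return sources + residual / len(sources)

# Toy usage: 2.5 ms window at 16 kHz -> 40-sample frames with a 20-sample hop.
rng = np.random.default_rng(0)
win, hop, n_coeffs, n_src = 40, 20, 256, 2
mix = rng.standard_normal(16000)
A = rng.standard_normal((win, n_coeffs)) * 0.1
S = rng.standard_normal((n_coeffs, win)) * 0.1
n_frames = 1 + (len(mix) - win) // hop
masks = rng.uniform(size=(n_src, n_frames, n_coeffs))
est = separate(mix, A, S, masks, hop)
print(est.shape)  # (2, 16000)
```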

Figure 2: Mean SI-SDR improvement in dB on the test set as a function of basis window size in ms, using different combinations of network
architectures and bases, on a) speech/non-speech separation, b) two-sound universal separation, and c) three-sound universal separation.
Systems * and ** come from [14] and [25, 26], respectively. “Oracle BM” corresponds to an oracle binary STFT mask, a theoretical upper
bound on our systems’ performance. Note the CLDNN STFT failed to converge for 2.5 ms windows on two-sound separation and is omitted.

4. EXPERIMENTS

In this section, we describe the construction of a dataset for universal sound separation and apply the described combinations of masking networks and analysis-synthesis bases to this task.

4.1. Dataset construction

We define a universal sound separation task designed to have tremendous variability. To build a dataset for this task, we used the Pro Sound Effects Library database [28], which contains an encyclopedic sampling of movie production recordings, including crawling insects, animal calls, creaking doors, construction noises, musical instruments, speech, composed music, and artificial sounds (e.g., arcade game sounds). Ambience/environment tracks are excluded since they tend to include multiple overlapping sounds.

Three-second clips were extracted from the Pro Sound database and used to create single-channel mixtures. Each sound file was analyzed to identify the start of individual sound events, by detecting when the local root-mean-squared power changed from below average to above average. For each of these detected event times within a file, a three-second segment was extracted, where the center of the segment is equal to the detected event time plus a random uniform offset of up to half a second. Files that were shorter than 3 seconds were looped with a random delay of up to a second to create a three-second segment.

To create each three-second mixture clip, K source clips were chosen from different sound files randomly and added together. The data were partitioned by source file, with 70% of the files used in the training set, 20% in the validation set, and 10% in the test set. Overall, the source material for the training set consists of 11,797 audio files, along with 3,370 for the validation set, and 1,686 for the test set. In total, the two-source and three-source datasets each contain 14,049 training mixtures (11.7 hours), 3,898 validation mixtures (3.2 hours), and 2,074 test mixtures (1.7 hours). A recipe to recreate the dataset is publicly available [29].

4.2. Training and evaluation setup

All experiments are performed using TensorFlow [30], trained with the Adam [31] optimizer with batch size 2 on a single NVIDIA Tesla V100 GPU. Separation performance is measured using scale-invariant signal-to-distortion ratio improvement (SI-SDRi) [7, 12], which evaluates the fidelity of a signal estimate ŝ, represented as a vector, relative to the ground truth signal s while accommodating a possible scale mismatch. SI-SDR is computed as

    \[ \text{SI-SDR}(s, \hat{s}) = 10 \log_{10} \frac{\|\alpha s\|^{2}}{\|\alpha s - \hat{s}\|^{2}}, \tag{1} \]

where $\alpha = \operatorname{argmin}_a \|a s - \hat{s}\|^{2} = \langle s, \hat{s}\rangle / \|s\|^{2}$, and $\langle a, b\rangle$ denotes the inner product. SI-SDRi is the difference between the SI-SDR of the estimated signal and that of the input mixture signal. The sample rate for the mixtures was 16 kHz, and all STFTs use a square-root Hann window, where windowed frames are zero-padded to the next power of 2 above the window size.

We use a permutation-invariant loss to align network outputs with the reference sources during training, where the loss used for a gradient step on a batch is the minimum error across the set $S_K$ of all permutations of the K estimated sources, compared to the fixed K reference sources [6-8]. Although the cardinality of $S_K$ is K!, in our experiments K ≤ 3 and this minimization did not lengthen training time significantly. Even for larger K, the time-consuming loss function computation can be first done in parallel for all pairs (i, j), 1 ≤ i, j ≤ K, and the exhaustive search over permutations for the best combination is performed on the scores.

All networks use negative signal-to-noise ratio (SNR) as their training loss f between time-domain reference source y and separated source ŷ, defined as

    \[ f(y, \hat{y}) = -10 \log_{10} \frac{\sum_t y_t^{2}}{\sum_t (y_t - \hat{y}_t)^{2}}. \tag{2} \]

Compared to the negative SI-SDR used to train ConvTasNet [14], this negative SNR objective has the advantage that the scale of separated sources is preserved and consistent with the mixture, which is further enforced by our use of mixture consistency layers [26]. Since we measure loss in the time domain, gradients are backpropagated through the synthesis transform and its overlap-add layer, so STFT consistency [26, 32] is implicitly enforced when using the STFT.
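As a concrete companion to Eqs. (1) and (2) and the permutation-invariant training described above, here is a small NumPy sketch of the SI-SDR metric and a permutation-invariant negative-SNR loss. This is an illustrative re-implementation rather than the paper's TensorFlow training code; the brute-force search over source orderings mirrors the description above and is cheap for K ≤ 3.

```python
import itertools
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB, Eq. (1), with alpha = <s, s_hat> / ||s||^2."""
    alpha = np.dot(reference, estimate) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    return 10.0 * np.log10(np.sum(target**2) / (np.sum((target - estimate)**2) + eps))

def neg_snr(reference, estimate, eps=1e-8):
    """Negative SNR loss of Eq. (2); unlike SI-SDR it is not invariant to scale."""
    return -10.0 * np.log10(
        np.sum(reference**2) / (np.sum((reference - estimate)**2) + eps))

def pit_neg_snr(references, estimates):
    """Permutation-invariant loss: minimum summed negative SNR over all K! orderings.

    references, estimates: arrays of shape (K, num_samples).
    Returns the best loss and the matching permutation of the estimates.
    """
    K = references.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(K)):
        loss = sum(neg_snr(references[k], estimates[perm[k]]) for k in range(K))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy check: SI-SDR improvement of slightly noisy estimates over the mixture itself.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
mix = s1 + s2
est = np.stack([s1 + 0.1 * rng.standard_normal(16000),
                s2 + 0.1 * rng.standard_normal(16000)])
loss, perm = pit_neg_snr(np.stack([s1, s2]), est)
si_sdri = si_sdr(s1, est[perm[0]]) - si_sdr(s1, mix)   # SI-SDRi for the first source
print(round(loss, 1), perm, round(si_sdri, 1))
```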
2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2019, New Paltz, NY

4.3. Results

Results on the universal data are shown in Figure 2 and Table 1, and audio demos may be found online [29]. Figure 2 shows results for different window sizes, where for each size, the hop is half the window size. For comparison, speech/non-speech separation performance on data described in [26] is shown¹ alongside results for two-source and three-source universal sound separation. We also tried training CLDNN networks with learned bases, but these networks failed to converge and are not shown. For all tasks, we show the performance of an oracle binary mask using an STFT for varying window sizes. These oracle scores provide a theoretical upper bound on the possible performance of our methods.

¹ Note that we consider here the more general "speech/non-speech separation" task, in contrast to the "speech enhancement" task, which typically refers to separating only the speech signal.

Table 1: Mean scale-invariant SDR improvement (dB) for speech/non-speech separation and two-source or three-source sound separation. For each task, the best window size ("win.") is listed along with validation ("val.") and test SI-SDRi. Note that the bottom four TDCN networks (below the dashed rule) are twice as deep as the top four TDCN networks.

Masking network, basis | Speech/non-speech: win. / val. / test | Two-source: win. / val. / test | Three-source: win. / val. / test
CLDNN, STFT [25, 26]   | 50 ms / 11.9 / 11.8                   | 5.0 ms / 7.8 / 7.4             | 5.0 ms / 6.7 / 6.4
TDCN, learned [14]     | 2.5 ms / 12.6 / 12.5                  | 2.5 ms / 8.5 / 7.9             | 2.5 ms / 6.8 / 6.4
TDCN, STFT             | 25 ms / 11.5 / 11.3                   | 2.5 ms / 9.4 / 8.6             | 2.5 ms / 7.6 / 7.0
TDCN++, learned        | 2.5 ms / 12.7 / 12.7                  | 2.5 ms / 9.1 / 8.5             | 2.5 ms / 8.4 / 7.7
TDCN++, STFT           | 25 ms / 11.1 / 11.0                   | 2.5 ms / 9.9 / 9.1             | 5.0 ms / 8.8 / 8.2
-----------------------+---------------------------------------+--------------------------------+---------------------------------
2xTDCN++, learned      | 2.5 ms / 13.3 / 13.2                  | 2.5 ms / 8.1 / 7.6             | 2.5 ms / 8.0 / 7.3
2xTDCN++, STFT         | 25 ms / 11.2 / 11.1                   | 5.0 ms / 9.3 / 8.3             | 5.0 ms / 9.0 / 8.0
iTDCN++, learned       | 2.5 ms / 13.5 / 13.4                  | 2.5 ms / 9.3 / 8.7             | 2.5 ms / 8.1 / 7.4
iTDCN++, STFT          | 25 ms / 11.6 / 11.5                   | 2.5 ms / 10.6 / 9.8            | 2.5 ms / 9.6 / 8.7
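For intuition about the window sizes compared in Table 1 and Figure 2, the short script below converts each window size into frame parameters under the settings stated in Section 4.2 (16 kHz sample rate, hop equal to half the window, STFT frames zero-padded to the next power of 2). This is simple bookkeeping arithmetic, not an additional experimental result.

```python
# Frame parameters implied by each window size at the 16 kHz sample rate used here.
SAMPLE_RATE = 16000

for window_ms in (2.5, 5.0, 25.0, 50.0):
    window = int(round(window_ms * SAMPLE_RATE / 1000))  # window length in samples
    hop = window // 2                                     # hop is half the window size
    # Smallest power of 2 at least the window length (identical to the "next power
    # of 2 above the window size" rule for these window sizes).
    fft_size = 1 << (window - 1).bit_length()
    frames_per_sec = SAMPLE_RATE / hop
    print(f"{window_ms:4.1f} ms: window={window:3d} samples, hop={hop:3d}, "
          f"FFT size={fft_size:4d}, {frames_per_sec:6.1f} frames/s")
```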

The differences between tasks in terms of basis type are striking. Notice that for speech/non-speech separation, longer STFT windows are preferred for all masking networks, while shorter windows are best when using a learnable basis. For universal sound separation, the optimal window sizes are shorter in general compared to speech/non-speech separation, regardless of the basis.

Window size is an important variable since it controls the frame rate and temporal resolution of the network, as well as the basis size in the case of STFT analysis and synthesis transforms. The frame rate also determines the temporal context seen by the network. On the speech/non-speech separation task, for all masking networks, 25-50 ms is the best window size. Speech may work better with such relatively long windows for a variety of reasons: speech is largely voiced and has sustained harmonic tones, with both the pitch and vocal tract parameters varying relatively slowly. Thus, speech is well described by sparse patterns in an STFT with longer windows as preferred by the models, and may thus be easier to separate in this domain. Speech is also highly structured and may carry more predictable longer-term contextual information than arbitrary sounds; with longer windows, the LSTM in a CLDNN has to remember information across fewer frames for a given temporal context.

For universal sound separation, the TDCNs prefer short (2.5 ms or 5 ms) frames, and the optimal window size for the CLDNN is 5 ms or less, which in both cases is much shorter than the optimal window size for speech/non-speech separation. This holds both with learned bases and with the STFT basis. Surprisingly, the STFT outperforms learned bases for sound separation overall, whereas the opposite is true for speech/non-speech separation. In contrast to speech/non-speech separation, where a learned basis can exploit the structure of speech signals, it is perhaps more difficult to learn general-purpose basis functions for the wide variety of acoustic patterns present in arbitrary sounds. In contrast to speech, arbitrary sounds may contain more percussive components, and hence be better represented using an STFT with finer time resolution. To fairly compare different models, we report results using the optimal window size for each architecture, determined via cross-validation.

Table 1 shows summary comparisons using the best window size for each masking network and basis. The optimal performance for speech/non-speech separation is achieved by models using learnable bases, while for universal sound separation, STFTs provide a better representation. For both two-source and three-source separation, the iTDCN++ with 2.5 ms STFT basis provides the best average SI-SDR improvement of 9.8 dB and 8.7 dB, respectively, on the test set, whereas the 2xTDCN++ is not competitive on the universal separation task. For speech/non-speech separation, the iTDCN++ with a 2.5 ms learnable basis achieves the best performance of 13.4 dB SI-SDRi. The iterative networks were trained with the loss function applied to the output of each iteration. These results point to iterative separation as a promising direction for future exploration.

Figure 3 shows scatter plots of input SI-SDR versus improvement in SI-SDR for each example in the test set. Panel a) displays results for the best model from Table 1, and panel b) displays results for oracle binary masking computed using an STFT with 10 ms windows and 5 ms hop. Oracle binary masking achieves 16.3 dB mean SI-SDRi, and indicates the potential separation that can be achieved on this dataset.

Figure 3: Scatter plots of input SI-SDR versus SI-SDR improvement on the two-source universal sound mixture test set for a) our best model (iTDCN++, STFT) and b) oracle binary masking using an STFT with 10 ms window and 5 ms hop. The darkness of points is proportional to the number of overlapping points.
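The oracle binary mask used as an upper bound in Figures 2 and 3 can be sketched as follows: each mixture STFT bin is assigned entirely to whichever reference source has the largest magnitude in that bin, and the masked mixture is inverted back to the time domain. This is a hedged illustration of the standard ideal-binary-mask construction using SciPy's default Hann-window STFT; the paper's exact square-root Hann windowing, zero-padding, and consistency details are not reproduced here.

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_binary_mask_separation(mixture, references, sr=16000, win_ms=10.0):
    """Oracle binary masking: assign each mixture STFT bin to the dominant reference.

    mixture:    (num_samples,) time-domain mixture.
    references: (K, num_samples) ground-truth sources (only available in the oracle setting).
    Returns (K, num_samples) separated estimates.
    """
    nperseg = int(sr * win_ms / 1000)        # 10 ms window -> 160 samples at 16 kHz
    noverlap = nperseg // 2                  # 5 ms hop
    _, _, mix_spec = stft(mixture, fs=sr, nperseg=nperseg, noverlap=noverlap)
    ref_mags = np.stack([
        np.abs(stft(r, fs=sr, nperseg=nperseg, noverlap=noverlap)[2]) for r in references
    ])                                       # (K, freq, frames)
    dominant = np.argmax(ref_mags, axis=0)   # index of the loudest source per TF bin
    estimates = []
    for k in range(len(references)):
        mask = (dominant == k).astype(mix_spec.dtype)   # binary mask for source k
        _, est = istft(mask * mix_spec, fs=sr, nperseg=nperseg, noverlap=noverlap)
        estimates.append(est[:len(mixture)])
    return np.stack(estimates)
```

Scoring these estimates with the SI-SDR sketch given earlier yields an oracle upper bound of the kind plotted in panel b) of Figure 3, with the exact value depending on the STFT settings used.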

5. CONCLUSION

We introduced the universal sound separation problem and constructed a dataset of mixtures containing a wide variety of different sounds. Our experiments compared different combinations of network architectures and analysis-synthesis transforms, optimizing each over the effect of window size. We also proposed novel variations in architecture, including longer-range skip-residual connections and iterative processing, that improve separation performance on all tasks. Interestingly, the optimal basis and window size are different when separating speech versus separating arbitrary sounds, with learned bases working better for speech/non-speech separation, and STFTs working better for sound separation. The best models, using iterative TDCN++, produce an average SI-SDR improvement of almost 10 dB on sound separation, and over 13 dB on speech/non-speech separation. Overall, these are extremely promising results which show that perhaps the holy grail of universal sound separation may soon be within reach.

6. REFERENCES

[1] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Interspeech, Mar. 2013.
[2] F. J. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc. GlobalSIP, Dec. 2014.
[3] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, 2014.
[4] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. ICASSP, Apr. 2015.
[5] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. LVA/ICA, Aug. 2015.
[6] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP, Mar. 2016.
[7] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Proc. Interspeech, Sep. 2016.
[8] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. ICASSP, Mar. 2017.
[9] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, 2017.
[10] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," arXiv preprint arXiv:1708.07524, 2017.
[11] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Alternative objective functions for deep clustering," in Proc. ICASSP, Apr. 2018.
[12] J. Le Roux, S. Wisdom, H. Erdogan, and J. Hershey, "SDR – half-baked or well done?" in Proc. ICASSP, May 2019.
[13] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in Proc. ICASSP, May 2014.
[14] Y. Luo and N. Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," arXiv preprint arXiv:1809.07454, 2018.
[15] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proc. ISMIR, Oct. 2017.
[16] Y. C. Subakan and P. Smaragdis, "Generative adversarial source separation," in Proc. ICASSP, Apr. 2018.
[17] E. M. Grais and M. D. Plumbley, "Combining fully convolutional and recurrent neural networks for single channel audio source separation," in Proc. AES, May 2018.
[18] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "A joint separation-classification model for sound event detection of weakly labelled data," in Proc. ICASSP, Apr. 2018.
[19] N. Takahashi, N. Goswami, and Y. Mitsufuji, "MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation," arXiv preprint arXiv:1805.02410, 2018.
[20] E. M. Grais and M. D. Plumbley, "Single channel audio source separation using convolutional denoising autoencoders," in Proc. GlobalSIP, Nov. 2017.
[21] P. Chandna, M. Miron, J. Janer, and E. Gómez, "Monoaural audio source separation using deep convolutional neural networks," in Proc. LVA/ICA, Feb. 2017.
[22] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," in Proc. ICASSP, May 2013.
[23] Y. Luo, Z. Chen, and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, 2018.
[24] Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, "Deep clustering and conventional networks for music separation: Stronger together," in Proc. ICASSP, Mar. 2017.
[25] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, R. A. Saurous, J. Skoglund, and R. F. Lyon, "Exploring tradeoffs in models for low-latency speech enhancement," in Proc. IWAENC, Sep. 2018.
[26] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, "Differentiable consistency constraints for improved deep speech enhancement," in Proc. ICASSP, May 2019.
[27] H. Zhang, Y. N. Dauphin, and T. Ma, "Fixup initialization: Residual learning without normalization," arXiv preprint arXiv:1901.09321, 2019.
[28] Pro Sound Effects Library. Available from http://www.prosoundeffects.com, accessed: 2018-06-01.
[29] Universal Sound Separation project webpage. Available from https://universal-sound-separation.github.io.
[30] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[32] J. Le Roux, N. Ono, and S. Sagayama, "Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction," in Proc. SAPA, Sep. 2008.
