
2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2019, New Paltz, NY

UNIVERSAL SOUND SEPARATION

Ilya Kavalerov1,2∗, Scott Wisdom1, Hakan Erdogan1, Brian Patton1, Kevin Wilson1, Jonathan Le Roux3, John R. Hershey1

1 Google Research, Cambridge, MA
2 Department of Electrical and Computer Engineering, UMD
3 Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA

∗ Work done during an internship at Google.
arXiv:1905.03330v2 [cs.SD] 2 Aug 2019

ABSTRACT

Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.

Index Terms— Source separation, deep learning, non-speech audio

1. INTRODUCTION

A fundamental challenge in machine hearing is that of selectively listening to different sounds in an acoustic mixture. Extracting estimates of each source is especially difficult in monaural recordings where there are no directional cues. Recent advances have been made in solving monaural speech enhancement and speech separation problems in increasingly difficult scenarios, thanks to deep learning methods [1-11]. However, separation of arbitrary sounds from each other may still be considered a "holy grail" of the field. In particular, it is an open question whether current methods are best suited to learning the specifics of a single class of sounds, such as speech, or can learn more general cues for separation that can apply to mixtures of arbitrary sounds. In this paper, we propose a new universal sound separation task, consisting of mixtures of hundreds of types of sound. We show that the best methods are surprisingly successful, producing an average improvement of almost 10 dB in scale-invariant signal-to-distortion ratio (SI-SDR) [12].

Previous experiments have focused mainly on scenarios where at least one of the target signals to be separated is speech. In speech enhancement, the task is to separate the relatively structured sound of a single speaker from a much less constrained set of non-speech sounds. For separation of multiple speakers, the state of the art has progressed from speaker-dependent separation [13], where models are trained on individual speakers or speaker combinations, to speaker-independent speech separation [6-8], where the system has to be flexible enough to separate unknown speakers. In particular, ConvTasNet is a recently proposed model [14] that uses a combination of learned time-domain analysis and synthesis transforms with a time-dilated convolutional network (TDCN), showing significant improvements on the task of speech separation relative to previously state-of-the-art models based on short-time Fourier transform (STFT) analysis/synthesis transforms and long short-term memory (LSTM) recurrent networks. Despite this progress, it is still unknown how current methods perform on separation of arbitrary types of sounds. The fact that human hearing is so adept at selective listening suggests that more general principles of separation exist and can be learned from large databases of arbitrary sounds.

This paper provides four contributions. First, we investigate the universal sound separation problem in depth for the first time, by constructing a dataset of mixtures containing a wide variety of different sounds. Second, we evaluate ConvTasNet on both speech/non-speech separation and universal sound separation tasks for the first time. Third, we provide a systematic comparison of different combinations of masking network architectures and analysis-synthesis transforms, optimizing each over the effect of window size. Finally, we propose novel variations in architecture, including alternative feature normalization, improved initialization, longer-range skip-residual connections, and iterative processing, that further improve separation performance on all tasks.

Figure 1: Architecture for mask-based separation experiments. We vary the mask network and analysis/synthesis transforms.

2. PRIOR WORK

A variety of networks have been successfully applied to two-source separation problems, including LSTMs and bidirectional LSTMs (BLSTMs) [2, 4], U-Nets [15], Wasserstein GANs [16], and fully convolutional network (FCN) encoder-decoders followed by a BLSTM [17]. For multi-source separation, a variety of architectures have been used that directly generate a mask for each source, including BLSTMs [6, 9], CNNs [18], DenseNets followed by an LSTM [19], separate encoder-decoder networks for each source [20], joint one-to-many encoder-decoder networks with one decoder per source [21], and TDCNs with a learnable analysis-synthesis basis [14]. Our models are most similar to [9] and [14]. Networks that perform source separation in an embedding space instead of in the time-frequency domain, such as deep clustering [6, 11], have also been effective at separation tasks, but we leave exploration of those methods for future work.

Previous source separation work has focused on speech enhancement and speech separation [6, 16, 22, 23]. Small datasets used for the non-speech multi-source separation setting have included distress sounds from DCASE 2017 [18], and speech and music in SiSEC-2015 [17, 20]. Singing voice separation has focused on vocal and music instrument tracks [15, 24].

To our knowledge, the work introduced here is the first to investigate separation of arbitrary real-world sounds sourced from a large number of sound classes.

3. MODELS

We use mask-based separation systems driven by deep neural networks, and we experiment with combinations of two different network architectures and two different analysis-synthesis bases. All masking networks use a sigmoid activation to predict a real number in [0, 1] to modulate each basis coefficient.

3.1. Masking network architectures

The first masking network we use consists of 14 dilated 2D convolutional layers, a bidirectional LSTM, and two dense layers, which we will refer to as a convolutional-LSTM-dense neural network (CLDNN). The CLDNN is based on a network which achieves state-of-the-art performance on CHiME2 WSJ0 speech enhancement [25] and strong performance on a large internal dataset [26].

Our second masking network is a TDCN inspired by ConvTasNet [14]. We employ the same parameters as the best noncausal model reported by [14]. We also consider an improved version of ConvTasNet's TDCN masking network, which we refer to as "improved TDCN" (TDCN++). This new architecture includes three improvements to the original ConvTasNet network. First, global layer normalization within the TDCN, which normalizes over all features and frames, is replaced with a feature-wise layer normalization over frames. This is inspired by cepstral mean and variance normalization used in automatic speech recognition systems. Second, we add longer-range skip-residual connections from earlier repeat inputs to later repeat inputs after passing them through dense layers. This presumably helps with gradient flow from layer to layer during training. Third, we add a learnable scaling parameter after each dense layer. The scaling parameter for the second dense layer in each convolutional block, which is applied right before the residual connection, is initialized to an exponentially decaying scalar equal to 0.9^L, where L is the layer or block index. This initial scaling contributes to better training convergence by first learning the contributions of the bottom layers, similar to layer-wise training, and then easily adjusting the scale of each block's contribution through the learnable scaling parameter. This initialization is partly inspired by "Fixup" initialization in residual networks [27].

A third network variant we consider is an iterative improved TDCN network (iTDCN++), in which the signal estimates from an initial mask-based separation network serve as input, along with the original mixture, to a second separation network. This architecture is inspired by [7], in which a similar iterative scheme with LSTM-based networks led to significant performance improvements. In our version, both the first and second stage networks are identical copies of the TDCN++ network architecture, except for the inputs and parameters. In the second stage, the noisy mixture and initial signal estimates are transformed by the same basis (STFT or learned) prior to concatenation of their coefficients. Because, with two iterations, the network is twice as deep as a single-stage TDCN++, we also include a twice deeper TDCN++ model (2xTDCN++) for comparison.

3.2. Analysis-synthesis bases

Whereas earlier mask-based separation work had used STFTs as the analysis-synthesis basis due to the sparsity of many signals in this domain, ConvTasNet [14] uses a learnable analysis-synthesis basis. The analysis transform is a framewise basis analogous to the STFT, and can also be described as a 1D convolution layer where the kernel size is the window size, the stride is the hop length, and the number of filters is the number of basis vectors. A ReLU activation is applied to the analysis coefficients before processing by the mask network. The learnable synthesis transform can be expressed as a transposed 1D convolution and operates similarly to an inverse STFT, where a linear synthesis basis operates on coefficients to produce frames which are overlap-added to form a time-domain signal. Unlike an STFT, this learnable basis and its resulting coefficients are real-valued.

The original work [14] found that ConvTasNet performed best with very short (2.5 ms) learnable basis functions. However, this window size is an important parameter that needs to be optimized for each architecture, input transform, and data type. We therefore compare a learnable basis with the STFT as a function of window size, in combination with CLDNN and TDCN masking networks. All models apply mixture consistency projections to their outputs [26], which ensure the estimated sources add up to the input mixture. Note that the TDCN with STFT basis is a novel combination that, as we show below, performs best on the universal separation task.
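To make the pipeline of Figure 1 and the basis description above concrete, the following is a minimal NumPy sketch of framewise analysis with a real-valued basis and ReLU, masking, overlap-add synthesis, and the simplest (uniform) form of a mixture consistency projection. It is an illustrative sketch, not the TensorFlow implementation used in the paper: the random basis matrices and masks stand in for the learned (or STFT) basis and for the output of a CLDNN/TDCN masking network, and the uniform redistribution of the residual is only one variant of the projection in [26].

```python
import numpy as np

def frame(x, win, hop):
    """Slice signal x into overlapping frames of length `win` with stride `hop`."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop:i * hop + win] for i in range(n)])  # (frames, win)

def overlap_add(frames, hop):
    """Reconstruct a time-domain signal by overlap-adding synthesis frames."""
    n_frames, win = frames.shape
    out = np.zeros((n_frames - 1) * hop + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

def separate(mixture, analysis_basis, synthesis_basis, masks, hop):
    """Mask-based separation with a framewise real-valued basis (cf. Section 3.2).

    analysis_basis:  (win, num_coeffs) matrix, applied framewise (like a 1D conv).
    synthesis_basis: (num_coeffs, win) matrix, applied framewise (like a transposed
                     1D conv), followed by overlap-add.
    masks:           (num_sources, frames, num_coeffs) values in [0, 1]; here given,
                     in the paper produced by a CLDNN or TDCN masking network.
    """
    frames = frame(mixture, analysis_basis.shape[0], hop)
    coeffs = np.maximum(frames @ analysis_basis, 0.0)        # ReLU analysis coefficients
    sources = np.stack([
        overlap_add((m * coeffs) @ synthesis_basis, hop)      # mask, synthesize, overlap-add
        for m in masks
    ])
    # Mixture consistency, simplest uniform variant: redistribute the residual so the
    # estimated sources sum exactly to the input mixture.
    residual = mixture[:sources.shape[1]] - sources.sum(axis=0)
    return sources + residual / len(sources)

# Toy usage: 2.5 ms window at 16 kHz -> 40-sample frames with a 20-sample hop.
rng = np.random.default_rng(0)
win, hop, n_coeffs, n_src = 40, 20, 256, 2
mix = rng.standard_normal(16000)
A = rng.standard_normal((win, n_coeffs)) * 0.1
S = rng.standard_normal((n_coeffs, win)) * 0.1
n_frames = 1 + (len(mix) - win) // hop
masks = rng.uniform(size=(n_src, n_frames, n_coeffs))
est = separate(mix, A, S, masks, hop)
print(est.shape)  # (2, 16000)
```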

Figure 2: Mean SI-SDR improvement in dB on the test set as a function of basis window size in ms, using different combinations of network
architectures and bases, on a) speech/non-speech separation, b) two-sound universal separation, and c) three-sound universal separation.
Systems * and ** come from [14] and [25, 26], respectively. “Oracle BM” corresponds to an oracle binary STFT mask, a theoretical upper
bound on our systems’ performance. Note the CLDNN STFT failed to converge for 2.5 ms windows on two-sound separation and is omitted.

4. EXPERIMENTS

In this section, we describe the construction of a dataset for universal sound separation and apply the described combinations of masking networks and analysis-synthesis bases to this task.

4.1. Dataset construction

We define a universal sound separation task designed to have tremendous variability. To build a dataset for this task, we used the Pro Sound Effects Library database [28], which contains an encyclopedic sampling of movie production recordings, including crawling insects, animal calls, creaking doors, construction noises, musical instruments, speech, composed music, and artificial sounds (e.g., arcade game sounds). Ambience/environment tracks are excluded since they tend to include multiple overlapping sounds.

Three-second clips were extracted from the Pro Sound database and used to create single-channel mixtures. Each sound file was analyzed to identify the start of individual sound events, by detecting when the local root-mean-squared power changed from below average to above average. For each of these detected event times within a file, a three-second segment was extracted, where the center of the segment is equal to the detected event time plus a random uniform offset of up to half a second. Files that were shorter than 3 seconds were looped with a random delay of up to a second to create a three-second segment.

To create each three-second mixture clip, K source clips were chosen from different sound files randomly and added together. The data were partitioned by source file, with 70% of the files used in the training set, 20% in the validation set, and 10% in the test set. Overall, the source material for the training set consists of 11,797 audio files, along with 3,370 for the validation set, and 1,686 for the test set. In total, the two-source and three-source datasets each contain 14,049 training mixtures (11.7 hours), 3,898 validation mixtures (3.2 hours), and 2,074 test mixtures (1.7 hours). A recipe to recreate the dataset is publicly available [29].

4.2. Training and evaluation setup

All experiments are performed using TensorFlow [30], trained with the Adam [31] optimizer with batch size 2 on a single NVIDIA Tesla V100 GPU. Separation performance is measured using scale-invariant signal-to-distortion ratio improvement (SI-SDRi) [7, 12], which evaluates the fidelity of a signal estimate ŝ, represented as a vector, relative to the ground truth signal s while accommodating a possible scale mismatch. SI-SDR is computed as

    \[ \text{SI-SDR}(s, \hat{s}) = 10 \log_{10} \frac{\|\alpha s\|^{2}}{\|\alpha s - \hat{s}\|^{2}}, \tag{1} \]

where $\alpha = \operatorname{argmin}_a \|a s - \hat{s}\|^{2} = \langle s, \hat{s}\rangle / \|s\|^{2}$, and $\langle a, b\rangle$ denotes the inner product. SI-SDRi is the difference between the SI-SDR of the estimated signal and that of the input mixture signal. The sample rate for the mixtures was 16 kHz, and all STFTs use a square-root Hann window, where windowed frames are zero-padded to the next power of 2 above the window size.

We use a permutation-invariant loss to align network outputs with the reference sources during training, where the loss used for a gradient step on a batch is the minimum error across the set $S_K$ of all permutations of the K estimated sources, compared to the fixed K reference sources [6-8]. Although the cardinality of $S_K$ is K!, in our experiments K ≤ 3 and this minimization did not lengthen training time significantly. Even for larger K, the time-consuming loss function computation can be first done in parallel for all pairs (i, j), 1 ≤ i, j ≤ K, and the exhaustive search over permutations for the best combination is performed on the scores.

All networks use negative signal-to-noise ratio (SNR) as their training loss f between time-domain reference source y and separated source ŷ, defined as

    \[ f(y, \hat{y}) = -10 \log_{10} \frac{\sum_t y_t^{2}}{\sum_t (y_t - \hat{y}_t)^{2}}. \tag{2} \]

Compared to the negative SI-SDR used to train ConvTasNet [14], this negative SNR objective has the advantage that the scale of separated sources is preserved and consistent with the mixture, which is further enforced by our use of mixture consistency layers [26]. Since we measure loss in the time domain, gradients are backpropagated through the synthesis transform and its overlap-add layer, so STFT consistency [26, 32] is implicitly enforced when using the STFT.
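As a concrete companion to Eqs. (1) and (2) and the permutation-invariant training described above, here is a small NumPy sketch of the SI-SDR metric and a permutation-invariant negative-SNR loss. This is an illustrative re-implementation rather than the paper's TensorFlow training code; the brute-force search over source orderings mirrors the description above and is cheap for K ≤ 3.

```python
import itertools
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB, Eq. (1), with alpha = <s, s_hat> / ||s||^2."""
    alpha = np.dot(reference, estimate) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    return 10.0 * np.log10(np.sum(target**2) / (np.sum((target - estimate)**2) + eps))

def neg_snr(reference, estimate, eps=1e-8):
    """Negative SNR loss of Eq. (2); unlike SI-SDR it is not invariant to scale."""
    return -10.0 * np.log10(
        np.sum(reference**2) / (np.sum((reference - estimate)**2) + eps))

def pit_neg_snr(references, estimates):
    """Permutation-invariant loss: minimum summed negative SNR over all K! orderings.

    references, estimates: arrays of shape (K, num_samples).
    Returns the best loss and the matching permutation of the estimates.
    """
    K = references.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(K)):
        loss = sum(neg_snr(references[k], estimates[perm[k]]) for k in range(K))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy check: SI-SDR improvement of slightly noisy estimates over the mixture itself.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
mix = s1 + s2
est = np.stack([s1 + 0.1 * rng.standard_normal(16000),
                s2 + 0.1 * rng.standard_normal(16000)])
loss, perm = pit_neg_snr(np.stack([s1, s2]), est)
si_sdri = si_sdr(s1, est[perm[0]]) - si_sdr(s1, mix)   # SI-SDRi for the first source
print(round(loss, 1), perm, round(si_sdri, 1))
```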
2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2019, New Paltz, NY

4.3. Results

Results on the universal data are shown in Figure 2 and Table 1, and audio demos may be found online [29]. Figure 2 shows results for different window sizes, where for each size, the hop is half the window size. For comparison, speech/non-speech separation performance on data described in [26] is shown¹ alongside results for two-source and three-source universal sound separation. We also tried training CLDNN networks with learned bases, but these networks failed to converge and are not shown. For all tasks, we show the performance of an oracle binary mask using an STFT for varying window sizes. These oracle scores provide a theoretical upper bound on the possible performance of our methods.

¹ Note that we consider here the more general "speech/non-speech separation" task, in contrast to the "speech enhancement" task, which typically refers to separating only the speech signal.

Table 1: Mean scale-invariant SDR improvement (dB) for speech/non-speech separation and two-source or three-source sound separation. For each task, the best window size ("win.") is listed along with validation ("val.") and test SI-SDRi. Note that the bottom four TDCN networks (below the dashed rule) are twice as deep as the top four TDCN networks.

Masking network, basis | Speech/non-speech: win. / val. / test | Two-source: win. / val. / test | Three-source: win. / val. / test
CLDNN, STFT [25, 26]   | 50 ms / 11.9 / 11.8                   | 5.0 ms / 7.8 / 7.4             | 5.0 ms / 6.7 / 6.4
TDCN, learned [14]     | 2.5 ms / 12.6 / 12.5                  | 2.5 ms / 8.5 / 7.9             | 2.5 ms / 6.8 / 6.4
TDCN, STFT             | 25 ms / 11.5 / 11.3                   | 2.5 ms / 9.4 / 8.6             | 2.5 ms / 7.6 / 7.0
TDCN++, learned        | 2.5 ms / 12.7 / 12.7                  | 2.5 ms / 9.1 / 8.5             | 2.5 ms / 8.4 / 7.7
TDCN++, STFT           | 25 ms / 11.1 / 11.0                   | 2.5 ms / 9.9 / 9.1             | 5.0 ms / 8.8 / 8.2
-----------------------+---------------------------------------+--------------------------------+---------------------------------
2xTDCN++, learned      | 2.5 ms / 13.3 / 13.2                  | 2.5 ms / 8.1 / 7.6             | 2.5 ms / 8.0 / 7.3
2xTDCN++, STFT         | 25 ms / 11.2 / 11.1                   | 5.0 ms / 9.3 / 8.3             | 5.0 ms / 9.0 / 8.0
iTDCN++, learned       | 2.5 ms / 13.5 / 13.4                  | 2.5 ms / 9.3 / 8.7             | 2.5 ms / 8.1 / 7.4
iTDCN++, STFT          | 25 ms / 11.6 / 11.5                   | 2.5 ms / 10.6 / 9.8            | 2.5 ms / 9.6 / 8.7
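For intuition about the window sizes compared in Table 1 and Figure 2, the short script below converts each window size into frame parameters under the settings stated in Section 4.2 (16 kHz sample rate, hop equal to half the window, STFT frames zero-padded to the next power of 2). This is simple bookkeeping arithmetic, not an additional experimental result.

```python
# Frame parameters implied by each window size at the 16 kHz sample rate used here.
SAMPLE_RATE = 16000

for window_ms in (2.5, 5.0, 25.0, 50.0):
    window = int(round(window_ms * SAMPLE_RATE / 1000))  # window length in samples
    hop = window // 2                                     # hop is half the window size
    # Smallest power of 2 at least the window length (identical to the "next power
    # of 2 above the window size" rule for these window sizes).
    fft_size = 1 << (window - 1).bit_length()
    frames_per_sec = SAMPLE_RATE / hop
    print(f"{window_ms:4.1f} ms: window={window:3d} samples, hop={hop:3d}, "
          f"FFT size={fft_size:4d}, {frames_per_sec:6.1f} frames/s")
```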

The differences between tasks in terms of basis type are striking. Notice that for speech/non-speech separation, longer STFT windows are preferred for all masking networks, while shorter windows are best when using a learnable basis. For universal sound separation, the optimal window sizes are shorter in general compared to speech/non-speech separation, regardless of the basis.

Window size is an important variable since it controls the frame rate and temporal resolution of the network, as well as the basis size in the case of STFT analysis and synthesis transforms. The frame rate also determines the temporal context seen by the network. On the speech/non-speech separation task, for all masking networks, 25-50 ms is the best window size. Speech may work better with such relatively long windows for a variety of reasons: speech is largely voiced and has sustained harmonic tones, with both the pitch and vocal tract parameters varying relatively slowly. Thus, speech is well described by sparse patterns in an STFT with longer windows as preferred by the models, and may thus be easier to separate in this domain. Speech is also highly structured and may carry more predictable longer-term contextual information than arbitrary sounds; with longer windows, the LSTM in a CLDNN has to remember information across fewer frames for a given temporal context.

For universal sound separation, the TDCNs prefer short (2.5 ms or 5 ms) frames, and the optimal window size for the CLDNN is 5 ms or less, which in both cases is much shorter than the optimal window size for speech/non-speech separation. This holds both with learned bases and with the STFT basis. Surprisingly, the STFT outperforms learned bases for sound separation overall, whereas the opposite is true for speech/non-speech separation. In contrast to speech/non-speech separation, where a learned basis can exploit the structure of speech signals, it is perhaps more difficult to learn general-purpose basis functions for the wide variety of acoustic patterns present in arbitrary sounds. In contrast to speech, arbitrary sounds may contain more percussive components, and hence be better represented using an STFT with finer time resolution. To fairly compare different models, we report results using the optimal window size for each architecture, determined via cross-validation.

Table 1 shows summary comparisons using the best window size for each masking network and basis. The optimal performance for speech/non-speech separation is achieved by models using learnable bases, while for universal sound separation, STFTs provide a better representation. For both two-source and three-source separation, the iTDCN++ with 2.5 ms STFT basis provides the best average SI-SDR improvement of 9.8 dB and 8.7 dB, respectively, on the test set, whereas the 2xTDCN++ is not competitive on the universal separation task. For speech/non-speech separation, the iTDCN++ with a 2.5 ms learnable basis achieves the best performance of 13.4 dB SI-SDRi. The iterative networks were trained with the loss function applied to the output of each iteration. These results point to iterative separation as a promising direction for future exploration.

Figure 3 shows scatter plots of input SI-SDR versus improvement in SI-SDR for each example in the test set. Panel a) displays results for the best model from Table 1, and panel b) displays results for oracle binary masking computed using an STFT with 10 ms windows and 5 ms hop. Oracle binary masking achieves 16.3 dB mean SI-SDRi, and indicates the potential separation that can be achieved on this dataset.

Figure 3: Scatter plots of input SI-SDR versus SI-SDR improvement on the two-source universal sound mixture test set for a) our best model (iTDCN++, STFT) and b) oracle binary masking using an STFT with 10 ms window and 5 ms hop. The darkness of points is proportional to the number of overlapping points.
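The oracle binary mask used as an upper bound in Figures 2 and 3 can be sketched as follows: each mixture STFT bin is assigned entirely to whichever reference source has the largest magnitude in that bin, and the masked mixture is inverted back to the time domain. This is a hedged illustration of the standard ideal-binary-mask construction using SciPy's default Hann-window STFT; the paper's exact square-root Hann windowing, zero-padding, and consistency details are not reproduced here.

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_binary_mask_separation(mixture, references, sr=16000, win_ms=10.0):
    """Oracle binary masking: assign each mixture STFT bin to the dominant reference.

    mixture:    (num_samples,) time-domain mixture.
    references: (K, num_samples) ground-truth sources (only available in the oracle setting).
    Returns (K, num_samples) separated estimates.
    """
    nperseg = int(sr * win_ms / 1000)        # 10 ms window -> 160 samples at 16 kHz
    noverlap = nperseg // 2                  # 5 ms hop
    _, _, mix_spec = stft(mixture, fs=sr, nperseg=nperseg, noverlap=noverlap)
    ref_mags = np.stack([
        np.abs(stft(r, fs=sr, nperseg=nperseg, noverlap=noverlap)[2]) for r in references
    ])                                       # (K, freq, frames)
    dominant = np.argmax(ref_mags, axis=0)   # index of the loudest source per TF bin
    estimates = []
    for k in range(len(references)):
        mask = (dominant == k).astype(mix_spec.dtype)   # binary mask for source k
        _, est = istft(mask * mix_spec, fs=sr, nperseg=nperseg, noverlap=noverlap)
        estimates.append(est[:len(mixture)])
    return np.stack(estimates)
```

Scoring these estimates with the SI-SDR sketch given earlier yields an oracle upper bound of the kind plotted in panel b) of Figure 3, with the exact value depending on the STFT settings used.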

5. CONCLUSION

We introduced the universal sound separation problem and constructed a dataset of mixtures containing a wide variety of different sounds. Our experiments compared different combinations of network architectures and analysis-synthesis transforms, optimizing each over the effect of window size. We also proposed novel variations in architecture, including longer-range skip-residual connections and iterative processing, that improve separation performance on all tasks. Interestingly, the optimal basis and window size are different when separating speech versus separating arbitrary sounds, with learned bases working better for speech/non-speech separation, and STFTs working better for sound separation. The best models, using iterative TDCN++, produce an average SI-SDR improvement of almost 10 dB on sound separation, and over 13 dB on speech/non-speech separation. Overall, these are extremely promising results which show that perhaps the holy grail of universal sound separation may soon be within reach.

6. REFERENCES

[1] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Interspeech, Mar. 2013.
[2] F. J. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc. GlobalSIP, Dec. 2014.
[3] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, 2014.
[4] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. ICASSP, Apr. 2015.
[5] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. LVA/ICA, Aug. 2015.
[6] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP, Mar. 2016.
[7] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Proc. Interspeech, Sep. 2016.
[8] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. ICASSP, Mar. 2017.
[9] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, 2017.
[10] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," arXiv preprint arXiv:1708.07524, 2017.
[11] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Alternative objective functions for deep clustering," in Proc. ICASSP, Apr. 2018.
[12] J. Le Roux, S. Wisdom, H. Erdogan, and J. Hershey, "SDR – half-baked or well done?" in Proc. ICASSP, May 2019.
[13] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in Proc. ICASSP, May 2014.
[14] Y. Luo and N. Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," arXiv preprint arXiv:1809.07454, 2018.
[15] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proc. ISMIR, Oct. 2017.
[16] Y. C. Subakan and P. Smaragdis, "Generative adversarial source separation," in Proc. ICASSP, Apr. 2018.
[17] E. M. Grais and M. D. Plumbley, "Combining fully convolutional and recurrent neural networks for single channel audio source separation," in Proc. AES, May 2018.
[18] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "A joint separation-classification model for sound event detection of weakly labelled data," in Proc. ICASSP, Apr. 2018.
[19] N. Takahashi, N. Goswami, and Y. Mitsufuji, "MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation," arXiv preprint arXiv:1805.02410, 2018.
[20] E. M. Grais and M. D. Plumbley, "Single channel audio source separation using convolutional denoising autoencoders," in Proc. GlobalSIP, Nov. 2017.
[21] P. Chandna, M. Miron, J. Janer, and E. Gómez, "Monoaural audio source separation using deep convolutional neural networks," in Proc. LVA/ICA, Feb. 2017.
[22] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," in Proc. ICASSP, May 2013.
[23] Y. Luo, Z. Chen, and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, 2018.
[24] Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, "Deep clustering and conventional networks for music separation: Stronger together," in Proc. ICASSP, Mar. 2017.
[25] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, R. A. Saurous, J. Skoglund, and R. F. Lyon, "Exploring tradeoffs in models for low-latency speech enhancement," in Proc. IWAENC, Sep. 2018.
[26] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, "Differentiable consistency constraints for improved deep speech enhancement," in Proc. ICASSP, May 2019.
[27] H. Zhang, Y. N. Dauphin, and T. Ma, "Fixup initialization: Residual learning without normalization," arXiv preprint arXiv:1901.09321, 2019.
[28] Pro Sound Effects Library. Available from http://www.prosoundeffects.com, accessed: 2018-06-01.
[29] Universal Sound Separation project webpage. Available from https://universal-sound-separation.github.io.
[30] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[32] J. Le Roux, N. Ono, and S. Sagayama, "Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction," in Proc. SAPA, Sep. 2008.
