
Conversation

@kauterry
Contributor

Almost done with implementation, saving it as a draft since I'm yet to test it.

@facebook-github-bot added the CLA Signed label on Dec 20, 2023
@yilinyang7

yilinyang7 commented Dec 20, 2023

I'm not super sure of the intention of this PR; can I ask a few questions:

  • Why have both inference/ and predict/ subdirectories? They seem to serve the same purpose, which could be confusing.
  • This new ExpressiveTranslator only supports single-audio translation; isn't that use case too narrow? Single-audio translation is already well served by our two public demos (internal & HF) plus a gradio app.py script. I'm not convinced we need this.

@kauterry
Contributor Author

I'm not super sure of the intention of this PR; can I ask a few questions:

  • Why have both inference/ and predict/ subdirectories? They seem to serve the same purpose, which could be confusing.
  • This new ExpressiveTranslator only supports single-audio translation; isn't that use case too narrow? Single-audio translation is already well served by our two public demos (internal & HF) plus a gradio app.py script. I'm not convinced we need this.

There is absolutely no harm in having this extra way of doing expressivity inference. Developers aren't going to use the demo; they need the code. For M4T we have m4t_predict, which is a thin layer around the Translator class (sketched below), and we want to stick to the same format for expressivity_predict. Btw, the more entry points we have, the easier we make our models to adopt; that's the whole point of OSS. In this case, writing an ExpressiveTranslator strips all the unnecessary boilerplate away from the user, so they can call expressive inference easily, in the same format as M4T inference.

I don't think it's confusing: the CLI directory holds what the user can invoke directly, while inference acts as a bridge between the CLI and the actual models. If we have a Translator, we definitely need an ExpressiveTranslator, so I strongly disagree with you on this. In fact, this discrepancy between m4t_predict's design and expressivity_predict's design was pointed out by @elbayadm. I feel we should have paid more attention to the design of these interfaces that expose our models to users.
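For reference, the M4T pattern being invoked looks roughly like this. This is a hedged sketch based on the public seamless_communication README; the exact card names ("seamlessM4T_v2_large", "vocoder_v2") and import path are assumptions and may differ from the code in this PR.

```python
# Rough sketch of the M4T pattern: m4t_predict is a thin CLI wrapper
# around the Translator class. Card names and import path are assumptions
# based on the public README, not taken from this PR.
import torch

from seamless_communication.inference import Translator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == "cuda" else torch.float32

translator = Translator(
    "seamlessM4T_v2_large",  # model card name (assumed)
    "vocoder_v2",            # vocoder card name (assumed)
    device,
    dtype=dtype,
)

# One call handles audio loading, feature extraction, and generation.
text_output, speech_output = translator.predict(
    input="input.wav",
    task_str="S2ST",
    tgt_lang="spa",
)
```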

@kauterry
Contributor Author

predict falls under the CLI directory; inference is a directory of its own. Feel free to suggest a better name to make that clearer.

@yilinyang7

yilinyang7 commented Dec 21, 2023

Thanks for the reply.
I agree there's no harm in implementing this extra single-audio inference script.
My point is that when I looked at the ExpressiveTranslator class, it only supports single-audio inference. For such a general class name, isn't that too limited, and couldn't it confuse future users? I believe it should support batching, as Translator and PretsselGenerator do.
As for the directory, I don't understand why you want both cli/expressivity/predict and cli/expressivity/inference. Could we instead put the script files under one directory?

@yilinyang7

For expressivity inference, batched inference plus the automatic evaluation metrics that run after it should be the default point of entry.
I don't object to an additional single-audio inference script; let's just align on what the main point of entry is.
If this ExpressiveTranslator class doesn't support batched inference, it will be confusing to users.

@kauterry
Contributor Author

This is going to support batched inference, and I will refactor expressivity_evaluate to use it, exactly like Translator is used in m4t_evaluate and m4t_predict. How does that sound?
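A hypothetical sketch of what that batched usage could look like once the refactor lands; the class name comes from this thread, but the import path, constructor arguments, and batched predict() signature shown here are assumptions, not the merged API.

```python
# Hypothetical usage once ExpressiveTranslator supports batching, mirroring
# how Translator serves both m4t_predict and m4t_evaluate. Everything below
# the class name is an assumption made for illustration.
import torch

from seamless_communication.inference import ExpressiveTranslator  # assumed path

translator = ExpressiveTranslator(
    "seamless_expressivity",  # model card name (assumed)
    device=torch.device("cuda"),
    dtype=torch.float16,
)

audio_paths = ["utt_001.wav", "utt_002.wav", "utt_003.wav"]

# expressivity_evaluate could iterate over a dataset in fixed-size batches
# and call the same predict() that expressivity_predict uses for one file.
for i in range(0, len(audio_paths), 2):
    batch = audio_paths[i : i + 2]
    speech_outputs, text_outputs = translator.predict(batch, tgt_lang="fra")
```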

@yilinyang7

This is going to support batched inference, and I will refactor expressivity_evaluate to use it, exactly like Translator is used in m4t_evaluate and m4t_predict. How does that sound?

yep, that sounds good, thanks!

@elbayadm
Contributor

Thank you @kauterry for drafting this PR.
@yilinyang7, the main thing I thought was missing in SC is this:
speech_output, text_output = expressive_translator.predict(input), where a user wouldn't need to compute fbank and src_gcmvn for prosody outside the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers. Not gradio, not a demo, and not a command line.
Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.
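To make that concrete, here is a minimal illustrative skeleton of such an entry point, with feature extraction hidden inside predict(). Everything here is a sketch: torchaudio's kaldi fbank stands in for the project's actual feature pipeline, and ExpressiveTranslatorSketch, gcmvn_mean, gcmvn_std, and _generate are hypothetical names, not the real seamless_communication API.

```python
# Illustrative skeleton only: it shows fbank and gcmvn normalization moving
# inside predict(), as suggested above. The model forward pass is omitted.
import torch
import torchaudio


class ExpressiveTranslatorSketch:
    def __init__(self, gcmvn_mean: torch.Tensor, gcmvn_std: torch.Tensor):
        # Global CMVN statistics shipped with the prosody model (placeholder).
        self.gcmvn_mean = gcmvn_mean
        self.gcmvn_std = gcmvn_std

    def predict(self, audio_path: str):
        # The user passes a path; feature extraction is no longer their job.
        waveform, sample_rate = torchaudio.load(audio_path)
        fbank = torchaudio.compliance.kaldi.fbank(
            waveform, num_mel_bins=80, sample_frequency=sample_rate
        )
        # Global cepstral mean/variance normalization for the prosody encoder.
        src_gcmvn = (fbank - self.gcmvn_mean) / self.gcmvn_std
        # ... feed fbank / src_gcmvn to the expressive model and vocoder ...
        speech_output, text_output = self._generate(fbank, src_gcmvn)
        return speech_output, text_output

    def _generate(self, fbank, src_gcmvn):
        raise NotImplementedError("model forward pass omitted in this sketch")
```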

| Model Name   | #params | checkpoint |
| ------------ | ------- | ---------- |
| W2v-BERT 2.0 | 600M    | [🤗 Model card](https://huggingface.co/facebook/conformer-shaw) - [checkpoint](https://huggingface.co/facebook/conformer-shaw/resolve/main/conformer_shaw.pt) |
| W2v-BERT 2.0 | 600M    | [🤗 Model card](https://huggingface.co/facebook/w2v-bert-2.0) - [checkpoint](https://huggingface.co/facebook/w2v-bert-2.0/resolve/main/conformer_shaw.pt) |
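As an aside, a checkpoint row like the ones above can typically be fetched with huggingface_hub. The following is a minimal sketch assuming the repo and filename in the first row are correct; the second row's filename is exactly what the review comment below flags.

```python
# Minimal sketch: download a checkpoint listed above via huggingface_hub
# and load its state dict on CPU. Repo id and filename are taken from the
# table and assumed correct.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="facebook/conformer-shaw", filename="conformer_shaw.pt"
)
state = torch.load(ckpt_path, map_location="cpu")
```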
Contributor

can we fix this in another PR?

Contributor Author

Sure

@yilinyang7

predict falls under the CLI directory; inference is a directory of its own. Feel free to suggest a better name to make that clearer.

Sorry, I missed this; I just realized I misread the directory layout.

@yilinyang7

Thank you @kauterry for drafting this PR. @yilinyang7, the main thing I thought was missing in SC is this: speech_output, text_output = expressive_translator.predict(input), where a user wouldn't need to compute fbank and src_gcmvn for prosody outside the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers. Not gradio, not a demo, and not a command line. Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.

Thanks for the clarification, that makes sense!
I think the current discrepancy is that expressivity_evaluate does batched inference, while the current ExpressiveTranslator class only supports single-audio inference.
If we want a cleaner API call, we might need to refactor every script to use this new ExpressiveTranslator class, which should support batched inference for efficiency.
If I misunderstood anything, let me know!

@elbayadm
Contributor

Thanks for the clarification, that makes sense! I think the current discrepancy is that expressivity_evaluate does batched inference, while the current ExpressiveTranslator class only supports single-audio inference. If we want a cleaner API call, we might need to refactor every script to use this new ExpressiveTranslator class, which should support batched inference for efficiency. If I misunderstood anything, let me know!

That's reasonable; if batched inference is already implemented there, then the new ExpressiveTranslator should use it as well.

@yilinyang7 left a comment

After the bug fix, batched ExpressiveTranslator inference is done and replicates the previous results across all metrics.

@facebook-github-bot

Hi @kauterry!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

