Multilingual Real-Time Voice Translator
Ms. Shikha Rai1, Dr. Veeresh2, Mithun S3, Monisha Madappa4, Mudunuri Aditya Varma5, Kamal Deep U6
1 Electronics and Communication Engineering, New Horizon College of Engineering, Bengaluru, India, [email protected]
2 Mechanical Engineering, New Horizon College of Engineering, Bengaluru, India, [email protected]
3 Electronics and Communication Engineering, New Horizon College of Engineering, Bengaluru, India, [email protected]
4 Electronics and Communication Engineering, New Horizon College of Engineering, Bengaluru, India, [email protected]
5 Electronics and Communication Engineering, New Horizon College of Engineering, Bengaluru, India, [email protected]
6 Mechanical Engineering, New Horizon College of Engineering, Bengaluru, India, [email protected]
Abstract. This research paper presents the development of a Multilingual Real-Time Voice Translator using Python and several supporting libraries. The system is designed to facilitate seamless, real-time translation across multiple languages, enabling smooth communication between speakers of different languages. Operating on diverse platforms, including Windows, macOS, and Linux, the solution utilizes essential libraries such as googletrans, SpeechRecognition, gTTS, and playsound. Through integration with the Google Translate API and Google Speech Recognition, the translator captures spoken input, processes it to recognize the language, translates it into the desired target language, and delivers the translated speech output almost instantaneously. This ensures an intuitive and effortless conversational experience by maintaining the natural flow of dialogue.
Through continuous research and development, the project emphasizes flexibility and user-friendliness, with compatibility across various Python-compatible IDEs such as PyCharm, VSCode, and Jupyter Notebook. The requirement for an active internet connection helps ensure that translations remain accurate and up-to-date. Potential applications include assisting travelers, improving personal communication, supporting international business, and enabling broader access to services for non-native speakers. By promoting real-time multilingual communication, this system aims to enhance global connectivity and inclusiveness, ensuring effective interactions across language barriers.
1 INTRODUCTION
The core functionality of the translator is to facilitate instant spoken language translation, allowing users to engage in natural conversation without requiring a shared language. The system processes voice input, translates it into the target language, and provides translated voice output almost instantaneously, ensuring a smooth and fluent dialogue. This approach maintains the natural flow of conversation, making interactions intuitive and effortless.
The project has a wide range of practical applications, from enhancing personal interactions and assisting travelers to supporting international businesses and improving access to services for non-native speakers. By bridging language gaps, the Multilingual Real-Time Voice Translator aims to foster understanding, collaboration, and inclusiveness in our global community.
Through dedicated research and development, the team is committed to ensuring the system's accuracy, reliability, and ease of use. The project leverages Python-based tools such as googletrans, SpeechRecognition, gTTS (or pyttsx3 for offline synthesis), and playsound, along with integration of the Google Translate and Google Speech Recognition APIs, to provide efficient, real-time multilingual communication. This project is not just a technological achievement; it represents a step towards a more connected and inclusive world.
2 RELATED WORKS
This section explores various language translation services, focusing on systems that,
while not always translating directly from speech to speech, employ encoder-decoder
networks similar to the one implemented in this project.
2.1 Moses
Moses [6] is an open-source statistical machine translation (SMT) system built around an encoder-decoder pipeline. It can train a model to translate between any two languages from a collection of parallel training pairs. The system aligns words and phrases guided by heuristics that eliminate misalignments, and the decoder's output undergoes a tuning process in which statistical models are weighted to determine the best translation. However, the system primarily focuses on word- or phrase-level translation and often overlooks grammatical accuracy. As of September 2020, Moses does not offer an end-to-end architecture for speech-to-speech translation.
2.2 Translatotron
Translatotron, a translation system developed by Google Research, served as the inspiration for this project. The model, currently in its beta phase, was initially developed for Spanish-to-English translation. As of September 2020, the raw code has not been publicly released. The system employs an attention-based sequence-to-sequence neural network that maps speech spectrograms from the source to the target language using pairs of speech utterances. Notably, it can mimic the original speaker's voice in the translated output.
Translatotron's training utilized two datasets: the Fisher Spanish-to-English Callhome corpus [8] and a synthesized corpus created using the Google Translate API [9]. Our project aims to develop a simplified version of this model to explore its feasibility. While the original research included a complex speech synthesis component (post-decoder) based on Tacotron 2 [10], this project omits the voice conversion and auxiliary decoder elements to maintain simplicity. Voice transfer, akin to Google's Parrotron [11], was also employed in the original model, but we have excluded it in this version.
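To make the architecture concrete, the sketch below shows a heavily simplified, non-autoregressive stand-in for this kind of spectrogram-to-spectrogram model. It is not Translatotron's code (which is unreleased); the layer sizes, the use of self-attention in place of the autoregressive encoder-decoder attention, and all tensor shapes are illustrative assumptions only.

    # Toy sketch of an attention-based sequence-to-sequence network that maps
    # source-language mel spectrograms to target-language mel spectrograms.
    # The real Translatotron decodes autoregressively; this sketch does not.
    import torch
    import torch.nn as nn

    class SpecTranslator(nn.Module):
        def __init__(self, n_mels=80, hidden=256):
            super().__init__()
            self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                                   bidirectional=True)
            self.attn = nn.MultiheadAttention(hidden * 2, num_heads=4,
                                              batch_first=True)
            self.decoder = nn.LSTM(hidden * 2, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, n_mels)  # predict target mel frames

        def forward(self, src_spec):
            enc, _ = self.encoder(src_spec)    # (B, T, 2*hidden) encoder states
            ctx, _ = self.attn(enc, enc, enc)  # attention over encoder states
            dec, _ = self.decoder(ctx)         # (B, T, hidden)
            return self.proj(dec)              # (B, T, n_mels) target frames

    spec = torch.randn(1, 200, 80)             # 200 frames of 80-band mel input
    out = SpecTranslator()(spec)               # same-length target spectrogram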
3 PROPOSED METHODOLOGY
The proposed Multilingual Real-Time Voice system seeks to overcome the limitations
of existing voice translation technologies by leveraging advanced artificial intelli-
gence, machine learning, and 48 deep learning techniques. This comprehensive sys-
tem is designed to provide accurate, contextually 9 aware, and seamless translations
in real-time, enhancing communication across different languages and cultural con-
texts.
Several recent studies highlight the potential and limitations of AI-powered real-time
speech translation for various applications. Thanuja Babu, Uma R., and collaborators
(2024) presented a machine learning-based approach to real-time speech translation
aimed at enhancing virtual meetings by enabling seamless multilingual communica-
tion. The model provides immediate benefits for global business negotiations, virtual
6
The diagram illustrates the workflow of the Multilingual Real-Time Voice Translator,
showcasing how voice input is processed to produce a translated voice output using
Python libraries and packages. The system operates through a series of integrated
components, each responsible for specific tasks to achieve seamless translation from
one language to another in real time.
The process initiates with the user speaking in their preferred language. This spoken input is captured by the system through a microphone, facilitated by the pyaudio library, which enables voice data acquisition for further processing.
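A minimal sketch of this acquisition step with pyaudio is shown below; the sample rate, chunk size, and fixed five-second window are illustrative choices rather than values fixed by the paper (in practice, the SpeechRecognition module used in the next step can manage the microphone itself).

    # Sketch: capturing ~5 seconds of 16 kHz mono audio from the default
    # microphone with pyaudio. All parameter values here are illustrative.
    import pyaudio

    CHUNK, RATE, SECONDS = 1024, 16000, 5
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
    stream.stop_stream()
    stream.close()
    pa.terminate()
    voice_data = b"".join(frames)  # raw PCM bytes for downstream processing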
The captured voice input is then passed to the speech recognition module. This stage
uses the SpeechRecognition library, leveraging the Google Speech-to-Text API to
transcribe the spoken language into text form.
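A sketch of this stage using the SpeechRecognition library follows; the source-language code "en-IN" is a hypothetical choice, and error handling is included because the Google API call can fail when audio is unintelligible or the network is unavailable.

    # Sketch: transcribing microphone input via the Google Speech-to-Text API
    # through the SpeechRecognition library. The language code is illustrative.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:              # PyAudio-backed input device
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    try:
        text = recognizer.recognize_google(audio, language="en-IN")
        print("You said:", text)
    except sr.UnknownValueError:
        print("Speech was unintelligible")
    except sr.RequestError as err:
        print("API request failed:", err)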
Once transcribed, the text serves as an intermediary form of the original voice input,
setting the stage for translation into the desired target language.
The text is then processed through the translation component, which utilizes the googletrans library. This library interacts with the Google Translate API to translate the text from the source to the target language.
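The following sketch assumes the synchronous googletrans API (as in version 3.x); the example sentence and the Hindi target code "hi" are illustrative.

    # Sketch: translating recognized text with googletrans, which calls the
    # Google Translate API under the hood. Language codes are illustrative.
    from googletrans import Translator

    translator = Translator()
    result = translator.translate("Where is the railway station?",
                                  src="en", dest="hi")
    print(result.text)           # translated text
    print(result.pronunciation)  # romanized form, when the API provides one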
After translation, the text is converted into speech using the gTTS (Google Text-to-Speech) library, enabling users to hear the translated content in the target language. This conversion completes the translation loop, making the system a real-time voice translator.
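A sketch of this synthesis step with gTTS follows; the sample text, language code, and output filename are illustrative assumptions.

    # Sketch: converting translated text to speech with gTTS and saving it as
    # an MP3 file for playback. Text, language code, and filename are
    # illustrative.
    from gtts import gTTS

    translated_text = "नमस्ते, आप कैसे हैं?"       # sample output of the translation step
    tts = gTTS(text=translated_text, lang="hi")  # lang must match the target
    tts.save("translated.mp3")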
The final step involves delivering the translated speech as audio output in the target
language. The playsound library is utilized to play the synthesized audio, completing
the cycle and enabling effective communication between users of different languages.
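Putting the five stages together, a minimal end-to-end sketch of the described loop might look as follows. It assumes an active internet connection and the synchronous googletrans API; the language codes, temporary filename, and function name are illustrative choices, not fixed by the paper.

    # Sketch: one pass through the full capture -> recognize -> translate ->
    # synthesize -> play pipeline. All names and codes are illustrative.
    import os
    import speech_recognition as sr
    from googletrans import Translator
    from gtts import gTTS
    from playsound import playsound

    def translate_speech(src_lang="en-IN", dest_lang="hi"):
        recognizer = sr.Recognizer()
        with sr.Microphone() as mic:                        # step 1: capture voice
            recognizer.adjust_for_ambient_noise(mic)
            audio = recognizer.listen(mic)

        text = recognizer.recognize_google(audio, language=src_lang)    # step 2
        translated = Translator().translate(text, dest=dest_lang).text  # step 3
        gTTS(text=translated, lang=dest_lang).save("out.mp3")           # step 4
        playsound("out.mp3")                                            # step 5
        os.remove("out.mp3")

    if __name__ == "__main__":
        translate_speech()

Each call to translate_speech handles one utterance; running it in a loop with the two language codes swapped alternately would support a two-way conversation.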
4.1 Overview:
System Requirements
Hardware Requirements
4.2 Outcome:
The system supports translations across multiple languages and dialects, which means that it can be easily expanded to accommodate new languages and dialectal variations as required, without needing a major overhaul of the existing framework.
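Because the translation layer delegates language support to the Google Translate API, extending coverage is a matter of passing a different language code rather than changing the framework; a short sketch (again assuming the synchronous googletrans API) is shown below, where the Kannada example code "kn" is illustrative.

    # Sketch: listing the language codes googletrans exposes and selecting a
    # new target language at runtime, with no changes to the pipeline itself.
    import googletrans

    for code, name in sorted(googletrans.LANGUAGES.items()):
        print(f"{code}: {name}")   # e.g. 'hi: hindi', 'kn: kannada'

    translator = googletrans.Translator()
    print(translator.translate("Good morning", dest="kn").text)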
The future development of this project will focus on enhancing usability, accessibility, and overall functionality, with planned upgrades that include a web-based interface, offline operation, and support for additional languages.
5 CONCLUSION
To sum up, this project delivers a functional solution for real-time voice translation across multiple languages by leveraging a blend of established technologies and contemporary software tools. The existing setup efficiently translates spoken input from one language into audio output in another, using integrated APIs like Google Translate and Google Speech Recognition to ensure swift and accurate translations across diverse linguistic groups.
Our future plans focus on enhancing user experience, accessibility, and overall functionality. By transitioning to a web-based platform, the goal is to make the translation process more streamlined and user-friendly, allowing users to select their preferred languages seamlessly. Offline functionality is also a priority, enabling key features to operate without an internet connection, thereby overcoming one of the key limitations of existing systems. Additionally, expanding support to include a broader range of lesser-known languages will increase the system's inclusiveness, providing a tool that is more valuable and versatile for users around the world.
In essence, this project aims to break down language barriers by creating an adaptable and comprehensive platform for real-time communication. Whether for informal interactions, business exchanges, or educational engagements, the vision is to foster smoother multilingual communication across various contexts. With continuous improvements in both software and hardware integration, this solution aspires to be a vital tool for enabling seamless cross-cultural dialogue.
REFERENCES
[1] H. Krupakar, K. Rajvel, B. Bharathi, A. Deborah, and V. Krishnamurthy, "A survey of voice translation methodologies - acoustic dialect decoder," in Proc. International Conference on Information Communication and Embedded Systems (ICICES), 2016.
[2] V. Geetha, C. K. Gomathy, M. S. V. Kottamasu, and N. P. Kumar, "The voice enabled personal assistant for PC using Python," International Journal of Engineering and Advanced Technology, vol. 10, no. 4, 2021.
[3] W. Yang and X. Zhao, "Research on realization of Python professional English translator," Journal of Physics: Conference Series, vol. 1871, no. 1, p. 012126, 2021.
[4] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421, Association for Computational Linguistics, Sept. 2015.
[5] F. J. Och, C. Tillmann, and H. Ney, "Improved alignment models for statistical machine translation," in 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[6] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, pp. 177–180, Association for Computational Linguistics, June 2007.
[7] J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T.-Y. Liu, "Non-autoregressive neural machine translation with enhanced decoder input," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3723–3730, 2019.
[8] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch, and S. Khudanpur, "Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus," Human Language Technology Center of Excellence and Center for Language and Speech Processing, Johns Hopkins University, 2013.
[9] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C.-C. Chiu, N. Ari, S. Laurenzo, and Y. Wu, "Leveraging weakly supervised data to improve end-to-end speech-to-text translation," 2018.
[10] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," 2017.
[11] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia, "Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation," 2019.
[12] T. Duarte, R. Prikladnicki, F. Calefato, and F. Lanubile, "Speech recognition for voice-based machine translation," IEEE Software, vol. 31, no. 1, pp. 26–31, Jan./Feb. 2014, doi: 10.1109/MS.2014.14.
[13] K. An, Q. Chen, C. Deng, Z. Du, and C. Gao, "High-quality multilingual understanding and generation foundation models for natural interaction between humans and LLM," 2024.
[14] T. Babu, R. Uma, S. Karunya, and M. Jalakandeshwaran, "AI-powered real-time speech-to-speech translation for virtual meetings," 2024, doi: 10.1109/ICCEBS58601.2023.10448600.