Project Report
on
NLP - DRIVEN VIRTUAL EDUCATOR FOR SMART TEACHING
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
CERTIFICATE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
This is to certify that the project report entitled “NLP - DRIVEN VIRTUAL EDUCATOR
FOR SMART TEACHING” is being submitted by Sunkara Sathish (21A31A05J1), Repaka
M V S D K Anjali (21A31A05F3), Muppana Anand Kumar (21A31A05H9), Talasila
Kowshik Ram (21A31A05J2), and Bandaru Lakshmi Venkata Sandeep (21A31A05G6) in
partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology
in Computer Science and Engineering of Pragati Engineering College, during the year
2024-25, and is a record of bonafide work carried out by them.
External Examiner
ACKNOWLEDGEMENT
We express our thanks to our project guide, Mrs. D. Kanaka Mahalakshmi Devi, Assistant
Professor of Computer Science and Engineering, who deserves a special note of thanks
and gratitude for having extended her fullest co-operation and guidance, without which
this project would never have materialized.
We express our deep sense of gratitude to Dr. D. V. Manjula, Associate Professor and
Head of the Department of Computer Science and Engineering, for having shown keen
interest at every stage of development of our project and for guiding us in every aspect.
We wish to express our special thanks to our beloved Dr. G. NARESH, Professor &
Principal for giving guidelines and encouragement.
We wish to express sincere gratitude to our beloved and respected Dr. P. KRISHNA RAO,
Chairman and Sri. M. V. HARANATHA BABU, Director (Management) and Sri. M.
SATISH, Vice-President for their encouragement and blessings.
We are thankful to all our faculty members of the Department for their valuable
suggestions. Our sincere thanks are also extended to all the teaching and non-teaching staff
of Pragati Engineering College.
We also thank our parents whose continuous support has helped us in the successful
completion of the project.
ABSTRACT
This research explores the use of natural language processing (NLP) techniques to
answer complex questions in the context of computer science education. Applying
connectivism as the theoretical framework, the study demonstrates the effectiveness of
web scraping to extract large datasets from publicly available sources and applies these
insights to inform educational practices. Additionally, the research highlights how NLP
can be used to extract relevant information from textual data, supporting qualitative
analysis. A practical example is provided, showcasing current trends in the job market
for computer science students. The findings emphasize the need to enhance
programming and testing skills in the curriculum. To facilitate this, the paper introduces
a chatbot framework using LangChain and Streamlit that integrates multiple document
types such as PDFs, DOCX, and TXT files. Powered by FAISS for vector-based
document retrieval and Replicate’s Llama 2 for conversational AI, the system enables
interactive question answering and document analysis, providing a tool for educators
and researchers to efficiently gather and analyze knowledge.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
REFERENCES
APPENDIX
LIST OF FIGURES
CHAPTER-1
INTRODUCTION
With the rapid advancements in artificial intelligence (AI) and natural language processing
(NLP), the education sector is experiencing a significant transformation. Traditional
methods of teaching are evolving, incorporating digital tools that enhance the learning
experience for students and provide better support for educators. Among these
innovations, NLP-based teaching assistants stand out as a revolutionary development,
offering personalized, interactive, and intelligent educational support.
Evolution of AI in Education
Artificial intelligence has been shaping various industries, and education is no exception.
From early computer-assisted learning programs to modern AI-driven platforms, the role
of technology in education has grown exponentially. Initially, e-learning platforms
provided basic automation in content delivery. However, with advancements in machine
learning and NLP, AI-powered teaching assistants now provide real-time, interactive
support to students. These intelligent systems have significantly enhanced learning by
making it more adaptive, efficient, and student-centric.
Understanding NLP in Education
The incorporation of NLP in education also promotes inclusivity by supporting multiple
languages and accommodating diverse linguistic backgrounds. Language barriers often
hinder effective learning, particularly in multicultural and international educational
settings. NLP-powered assistants can offer multilingual support, translating content and
assisting non-native speakers in understanding complex concepts. This capability
enables students from different regions to access quality education without being
restricted by language constraints. Additionally, the application of NLP in education
extends beyond student engagement and assessment, helping educators stay updated
with the latest advancements and enhance their teaching strategies.
Automating Routine Tasks for Educators
Educators often face time constraints due to administrative tasks such as grading
assignments, evaluating essays, and providing feedback. NLP-based teaching assistants
can automate many of these tasks, allowing teachers to focus more on interactive
learning and mentorship. In particular, NLP-based teaching assistants play a crucial role
in automating various aspects of the educational process, such as grading, feedback, and
content summarization.
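As an illustration of the kind of grading automation described above, the following sketch scores a short answer by its overlap with a set of expected keywords. The function name and keyword list are invented for illustration and are not part of the project's code; real NLP graders use far richer semantic matching.

```python
def keyword_overlap_score(student_answer, model_keywords):
    """Score a short answer by the fraction of expected keywords it mentions."""
    words = set(student_answer.lower().split())
    hits = [kw for kw in model_keywords if kw.lower() in words]
    return len(hits) / len(model_keywords)

# A fully on-topic answer mentions every expected keyword.
score = keyword_overlap_score(
    "Tokenization splits text into words or subwords.",
    ["tokenization", "text", "words"],
)
print(score)  # 1.0
```

A teacher could map such scores onto grade bands, freeing time for the interactive mentorship the text emphasizes.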
CHAPTER-2
LITERATURE REVIEW
Author(s): S. Fincher and M. Petre.
Title: "Computer Science Education Research"
This book provides a comprehensive analysis of research in computer science education. It
explores methodologies, teaching practices, and challenges faced in imparting computer science
knowledge. The authors emphasize the importance of empirical studies in understanding how
students learn programming and other core computer science concepts. By discussing different
instructional methods and assessment techniques, the book serves as a foundational guide for
educators and researchers looking to improve computer science curricula.
Author(s): A. M. Christie
Title: "Software Process Automation: The Technology and Its Adoption"
This book delves into the concept of software process automation, explaining how
automation tools and technologies can improve the software development lifecycle. It
explores different frameworks, best practices, and adoption strategies that help
organizations optimize development processes. By automating repetitive tasks, teams
can enhance productivity, reduce human errors, and streamline workflows in software
engineering.
CHAPTER-3
SYSTEM ANALYSIS
Existing System
Current AI tutors, such as Duolingo, use NLP and rule-based systems for adaptive
responses.
In the existing system, research has been conducted on virtual assistants used in
education, where rule-based chatbots respond to predefined student queries. These
systems use keyword matching or pattern-based techniques to generate responses.
Studies show that while rule-based systems are easy to implement, they lack the ability
to understand context or handle dynamic queries. A few systems have integrated basic
NLP methods like tokenization and part-of-speech tagging; however, they still fall short
in providing meaningful interactions or adaptive learning. Most of these systems operate
without deep learning models and rely heavily on static responses. The significance of
this approach was to offer basic tutoring help, but with limited accuracy and
engagement. Datasets used in such systems are often small and domain-specific,
restricting their performance. The average response accuracy is estimated at around
55%. These systems aimed to reduce student support load, but were not scalable or
personalized. Models used include simple decision trees and bag-of-words models with
no real contextual understanding.
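The keyword-matching behaviour described above can be sketched in a few lines (the rules and canned responses here are invented for illustration, not taken from any existing system). The sketch also shows why such systems fail on paraphrased queries:

```python
# Invented keyword -> canned-response rules, mimicking a rule-based tutor.
RULES = {
    "tokenization": "Tokenization splits text into words or subwords.",
    "parsing": "Parsing analyses the grammatical structure of a sentence.",
}

def rule_based_reply(query):
    """Return the first canned response whose keyword appears in the query."""
    q = query.lower()
    for keyword, response in RULES.items():
        if keyword in q:
            return response
    return "Sorry, I don't understand the question."  # no semantic fallback

print(rule_based_reply("What is tokenization?"))              # keyword matches
print(rule_based_reply("How do you split text into words?"))  # paraphrase fails
```

The second query asks the same thing as the first, but because it never uses the literal keyword, the rule-based system falls through to its default reply, exactly the limitation the text identifies.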
Moreover, these earlier systems lacked adaptability and could not update themselves
based on user interaction or feedback. As a result, students often encountered repetitive
answers or irrelevant responses, leading to poor user experience. Since most rule-based
systems do not leverage semantic analysis, they fail to handle paraphrased or
grammatically incorrect inputs, which are common among learners. These limitations
significantly reduce their effectiveness in real-world classroom or e-learning
environments. Although some platforms attempted to incorporate speech recognition
and basic sentiment detection, the absence of deep contextual NLP models limited their
functionality. However, they often rely on predefined content limiting their adaptability
to individual learner needs.
The existing systems suffer from several further limitations:
Existing models achieve only around 50% accuracy on tasks such as sentiment analysis,
and many text-related problems, such as named-entity identification, remain unsolved.
Training data is often highly imbalanced, with the classes of interest making up only a
small portion of the total, so many machine learning algorithms struggle to classify
them correctly.
Artificial Neural Networks (ANN) and cluster analysis require high processing power.
Data normalization and preprocessing steps increase the complexity of the system.
Regulations restrict the sharing of datasets, limiting the system's learning ability.
Proposed System
In the proposed system, we are implementing advanced NLP models for building a
virtual educator capable of understanding and responding to student queries
intelligently. The architecture integrates transformer-based models like BERT for
question understanding and GPT for generating accurate and context-aware responses.
These models process input text using tokenization, attention mechanisms, and
contextual embeddings to derive semantic meaning. Unlike traditional rule-based
systems, the NLP-driven approach can handle complex sentence structures, ambiguous
queries, and natural variations in language. A preprocessed dataset of student-teacher
dialogues is used to train the system. The model learns to map questions to relevant
answers, enabling personalized and meaningful interaction with users. The proposed
system is designed to adapt and improve through continuous learning and feedback
loops.
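The contextual-embedding idea behind the proposed system can be sketched with toy vectors: each stored answer and each incoming query is mapped to a vector, and the answer with the highest cosine similarity is returned. The three-dimensional vectors below are made-up values purely for illustration; the real system derives embeddings from transformer models such as BERT.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for stored answers (invented values).
answers = {
    "Recursion is a function calling itself.": [0.9, 0.1, 0.0],
    "A stack is a LIFO data structure.":       [0.1, 0.9, 0.1],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of the student's question

best = max(answers, key=lambda text: cosine(query_vec, answers[text]))
print(best)
```

Because similarity is computed in the embedding space rather than on surface keywords, paraphrased or grammatically imperfect questions still land near the right answer, which is the key advantage over the rule-based systems discussed earlier.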
5. Evaluating system performance using metrics like accuracy, precision, and response
relevance.
1. Hardware Requirements:
d) GPU (Optional): required for running large-scale deep learning models.
2. Software Requirements:
CHAPTER-4
SYSTEM DESIGN
SYSTEM ARCHITECTURE:
Fig 4.2 Flow Chart Diagram for NLP Based Teaching Assistant
UML Diagrams
Objectives of UML
Extensibility: To give users the ability to adapt and extend the language to meet
the specific needs of different projects.
o Class Diagram
o Component Diagram
o Object Diagram
o Deployment Diagram
o Package Diagram
o Activity Diagram
o Sequence Diagram
o Communication Diagram
o State Diagram
CLASS DIAGRAM
The Class Diagram is a core UML diagram that describes the static structure of a
system. It shows the system's classes, their attributes (data fields), methods (functions
or procedures), and the relationships between the classes. They act as a blueprint for
defining the system's architecture and help in efficiently structuring and organizing the
system’s elements.
SEQUENCE DIAGRAM
The main purpose of a sequence diagram is to model the dynamic behavior of a system,
providing a clear view of how objects collaborate over time to accomplish a specific
task or goal. By visualizing the interactions and message flow, the diagram helps
developers understand the system’s execution order and the relationships between
components involved in the process.
CHAPTER-5
IMPLEMENTATION
Modules
1.Option Selection Module: The Option Selection module is the initial interaction point
for learners within the Adaptive AI Tutor system. It allows learners to choose from a set of
predefined options, such as selecting a quiz topic, difficulty level, or specific learning
activity. The selected options are then processed by the system to tailor the learning content
accordingly. Key features of this module include:
o Interactive Interface: Utilizes Streamlit to provide a user-friendly dropdown or
button-based selection mechanism for choosing quiz topics or difficulty levels.
o Dynamic Content Loading: Based on the learner’s selection, the system retrieves
relevant content (e.g., questions on binary search trees) using FAISS for semantic
search.
o Personalization Trigger: The selected option informs the Personalization Engine
to adapt the content to the learner’s proficiency and preferences, ensuring a tailored
experience.
3.Final Score & Restart Module: The Final Score & Restart module concludes the
learning activity by presenting the learner with their overall performance summary, such as
the final score, number of correct answers, and areas for improvement. It also provides an
option to restart the activity or begin a new one, allowing learners to retry and reinforce
their learning.
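The score summary produced by this module can be sketched as below. The function name and the answer lists are illustrative only, not taken from the project's code; the printed format mirrors the "Your Score: 70% (7/10)" display described in the testing chapter.

```python
def final_score(answers, correct_answers):
    """Return (correct_count, percentage) for a completed quiz."""
    correct = sum(1 for given, expected in zip(answers, correct_answers)
                  if given == expected)
    percentage = round(100 * correct / len(correct_answers))
    return correct, percentage

# A 10-question quiz where the learner gets the first 7 answers right.
correct, pct = final_score(
    ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"],
    ["A", "B", "C", "D", "A", "B", "C", "A", "B", "C"],
)
print(f"Your Score: {pct}% ({correct}/10)")  # Your Score: 70% (7/10)
```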
Software Environment
Ease of Use: Python's clear and readable syntax makes it easy to learn, reducing the
learning curve for developers.
Streamlit Framework:
Streamlit is an open-source Python framework designed to simplify the creation of interactive,
web-based applications, particularly for data science, machine learning, and educational projects.
It allows developers to build user-friendly interfaces with minimal effort, enabling rapid
prototyping and deployment of applications without requiring extensive knowledge of traditional
web development technologies like HTML, CSS, or JavaScript.
Key Features of Streamlit:
1.Rapid Development with Python: Streamlit allows developers to create web applications
entirely in Python, eliminating the need for separate front-end development skills. This means the
team can focus on integrating NLP models (e.g., Mistral-7B-Instruct) and backend logic while
quickly building an interactive interface using Python scripts.
2.Interactive Widgets for User Input: Streamlit provides a variety of built-in widgets, such as
buttons, dropdowns, sliders, text inputs, and radio buttons, to capture user inputs seamlessly. In
the tutor system, these widgets are used in the Option Selection module to let learners choose quiz
topics or difficulty levels, enhancing user engagement through interactive elements.
3.Real-Time Updates: Streamlit apps automatically update in real-time as users interact with the
interface. For example, changing a slider value or selecting an option from a dropdown instantly
refreshes the app to reflect the new input, providing a seamless and responsive user experience.
4.Easy Deployment and Sharing: Streamlit offers straightforward deployment options through
platforms like Streamlit Community Cloud (formerly Streamlit Sharing), Heroku, or AWS.
CHAPTER-6
RESULTS
A random variable is a way to map the outcomes of a random process to numbers, allowing the
quantification of uncertain events such as flipping a coin or rolling dice by assigning numerical
values to possible outcomes. For example, if we flip a coin, we can define a random variable "X"
as 1 if it lands heads up and 0 if it lands tails up. Similarly, if we roll a die, a random variable "Y"
can represent the sum of the upward faces after rolling seven dice. Unlike traditional variables,
random variables can take different values with varying probabilities, making it more common to
discuss the probability of a random variable equaling a certain value or falling within a range
rather than assigning a fixed value. The chatbot in the image provides this explanation in response
to a user’s query about random variables, ensuring a clear and interactive learning experience.
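The die example the chatbot explains can be checked numerically. The short sketch below computes the expected value of the random variable Y for a single fair die, and uses linearity of expectation for the seven-dice sum mentioned above:

```python
# Expected value E[Y] of a single fair die: the average of the six equally
# likely face values, (1 + 2 + ... + 6) / 6.
faces = range(1, 7)
expected_value = sum(faces) / 6
print(expected_value)  # 3.5

# For the seven-dice sum, expectation is linear: E[sum] = 7 * E[Y].
print(7 * expected_value)  # 24.5
```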
CHAPTER-7
SYSTEM TESTING
Purpose of Testing
The main goal of software testing is to detect issues within the software by methodically
checking its components. Testing ensures that the software operates as intended and
complies with its predefined specifications, confirming the system’s overall functionality.
Key Testing Methodologies:
Unit Testing: Verifies that individual functions and components work correctly in
isolation.
Example:
Verifying that the evaluate_answer() function correctly identifies "All of the above" as
the answer to "What is the purpose of decentralization in India?".
Test Objective: To confirm that the NLP response parsing function accurately
extracts answers from text inputs for the tutor's evaluation module.
Functional Testing: Validates that the system’s features perform as intended based on
specified requirements.
Example:
Submitting a quiz answer (e.g., "All of the above" for "What is the purpose of
decentralization in India?") and checking if the tutor provides correct feedback and
updates the score.
Test Objective: To ensure the quiz evaluation and feedback system correctly
assesses answers and adapts content per the learner’s performance.
Black Box Testing: Examines the system’s functionality without knowledge of its
internal code, focusing on inputs and outputs.
Example:
Entering an ambiguous prompt (e.g., "Tell me about government") and verifying the tutor
responds with a clear, context-aware explanation (e.g., federalism basics).
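The evaluate_answer() behaviour exercised in the examples above might look like the following sketch. This is a hypothetical implementation, since the report does not list the function's body; it is paired with the kind of assertions a unit test for it would use.

```python
def evaluate_answer(selected_option, correct_option):
    """Compare a submitted quiz option against the stored answer.

    Comparison ignores case and surrounding whitespace so that minor
    formatting differences in the submitted option are not penalised.
    """
    if selected_option.strip().lower() == correct_option.strip().lower():
        return "Correct!"
    return f"Incorrect. The right answer is: {correct_option}"

# Unit-test style checks mirroring the examples in this chapter.
assert evaluate_answer("All of the above", "All of the above") == "Correct!"
assert evaluate_answer("option b", "All of the above").startswith("Incorrect")
```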
UNIT TESTING
Test Case ID: UT-01
Test Description: Test the load_quiz_data() function to load JSON quiz data correctly.
Expected Output: Quiz data with 10 questions and options is loaded into memory without errors.
Actual Output: Quiz data with 10 questions successfully loaded into memory.
Status: Pass
FUNCTIONAL TESTING
Test Case ID: FT-01
Test Description: Verify that the quiz page displays a question and its options correctly.
Expected Output: Displays "Which of the following is NOT a feature of federalism?" with 4 options from JSON data.
Actual Output: Question and 4 options displayed correctly on the quiz page.
Status: Pass

Test Case ID: FT-03
Test Description: Test adaptive difficulty adjustment after completing a quiz with 80% accuracy.
Expected Output: Next quiz includes a mix of medium and hard questions based on prior performance.
Actual Output: Next quiz generated with medium and hard questions as expected.
Status: Pass

Test Case ID: FT-04
Test Description: Verify that the final score is calculated and displayed after quiz completion.
Expected Output: Completing a 10-question quiz with 7 correct answers shows "Your Score: 70% (7/10)".
Actual Output: "Your Score: 70% (7/10)" displayed correctly after quiz completion.
Status: Pass

Test Case ID: FT-05
Test Description: Test the restart functionality to ensure a new quiz is generated with prior data.
Expected Output: Restarting after a 60% score generates a quiz focusing on weaker topics (e.g., local government).
Actual Output: New quiz generated with focus on local government topics.
Status: Pass

BLACK BOX TESTING

Test Case ID: BB-01
Test Description: Input a correct answer on the quiz page and submit without knowing internal logic.
Expected Output: Submitting "All of the above" for "What is the constitutional status of local government?" shows "Correct!".
Actual Output: "Correct!" displayed after submitting "All of the above".
Status: Pass

Test Case ID: BB-03
Test Description: Start a quiz, complete it, and restart to verify new content generation.
Expected Output: After scoring 80%, restarting shows a new quiz with different questions tailored to performance.
Actual Output: New quiz with tailored questions displayed after restart.
Status: Pass

Test Case ID: BB-05
Test Description: Test the quiz page load time and question display without technical details.
Expected Output: Quiz page loads within 2 seconds, showing "What is the significance of the Porto Alegre experiment?".
Actual Output: Page loads in 1.8 seconds with the question displayed correctly.
Status: Pass
CHAPTER-8
CONCLUSION AND FUTURE WORK
CONCLUSION
Random variables play a crucial role in probability and statistical analysis by assigning
numerical values to outcomes of random events. They serve as fundamental components
in understanding randomness and variability in real-world scenarios. The discussion
highlights two types of random variables: discrete (having countable values, like the
result of a coin toss or rolling dice) and continuous (having an infinite range of possible
values, like height, temperature, or time measurements).
By providing a structured way to model uncertainty, random variables allow us to make
data-driven decisions, perform risk assessments, and predict future outcomes in various
domains. Their application extends beyond theoretical probability into practical uses,
such as quality control in manufacturing, customer behavior analysis in business, and
medical diagnosis predictions in healthcare. Statistical techniques like expected value,
variance, and probability distributions (e.g., binomial, normal, and Poisson distributions)
further enhance the understanding and utility of random variables in decision-making.
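For instance, the binomial distribution mentioned above has mean np and variance np(1 - p), which a few lines can verify directly against its probability mass function. The parameter values below (10 questions, 0.3 success probability) are chosen only for illustration:

```python
import math

n, p = 10, 0.3  # e.g. 10 quiz questions with a 0.3 chance of a correct guess

# Binomial probability mass function: P(X = k) = C(n, k) * p^k * (1-p)^(n-k).
pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

# These agree with the closed forms n*p = 3.0 and n*p*(1-p) = 2.1.
print(round(mean, 6), round(variance, 6))
```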
FUTURE SCOPE
2.Machine Learning & AI: Random variables form the foundation for probabilistic
models in AI, including Bayesian networks and deep learning techniques that rely on
uncertainty estimation.
3.Finance & Risk Analysis: Financial markets use random variables to model stock
price fluctuations, risk assessment, and investment strategies based on probabilistic
predictions.
4.Engineering & Scientific Research: Random variables are used in reliability testing,
quality control, and simulations in physics, engineering, and medical sciences.
5.Big Data & Analytics: With the rise of data-driven decision-making, the application of
random variables in big data analytics helps in predictive modeling, anomaly detection,
and optimization problems.
REFERENCES
S. Fincher and M. Petre, Computer science education research. CRC Press, 2004.
J. J. Randolph, G. Julnes, E. Sutinen, and S. Lehman, “A methodological review of
computer science education research,” Journal of Information Technology Education:
Research, vol. 7, no. 1, pp. 135–162, 2008.
A. M. Christie, Software process automation: the technology and its adoption.
Springer Science & Business Media, 2012.
G. Siemens, “Connectivism: Learning as network-creation,” ASTD Learning News,
vol. 10, no. 1, pp. 1–28, 2005.
G. Siemens, “Connectivism: Learning theory or pastime of the self-amused,” 2006.
G. Siemens, “Connectivism,” Foundations of Learning and Instructional Design
Technology, 2017.
S. d. S. Sirisuriya, “A comparative study on web scraping,” 8th International
Research Conference, KDU, pp. 135–140, November 2015.
D. Jurafsky and J. H. Martin, “Speech and language processing: An introduction to
natural language processing,” Computational Linguistics and Speech Recognition.
Prentice Hall, New Jersey, 2000.
K. M. Alhawiti, “Natural language processing and its use in education,” Computer
Science Department, Faculty of Computers and Information technology, Tabuk
University, Tabuk, Saudi Arabia, 2014.
R. B. Mbah, M. Rege, and B. Misra, “Discovering job market trends with text
analytics,” in 2017 International Conference on Information Technology (ICIT). IEEE,
2017, pp. 137–142.
M. A. Mardis, J. Ma, F. R. Jones, C. R. Ambavarapu, H. M. Kelleher, L. I. Spears,
and C. R. McClure, “Assessing alignment between information technology educational
opportunities, professional requirements, and industry demands,” Education and
Information Technologies, vol. 23, no. 4, pp. 1547–1584, 2018.
R. Florea and V. Stray, “Software tester, we want to hire you! an analysis of the
demand for soft skills,” in International Conference on Agile Software Development.
Springer, 2018, pp. 54–67.
S. Downes, “Learning networks and connective knowledge. Instructional technology
forum: Paper 92,” 2006.
APPENDIX
SOURCE CODE
Launcher Sample Python Code:
import os
import tempfile

import streamlit as st
from dotenv import load_dotenv

def main():
    load_dotenv()
    st.title("MultiDoc Chatbot")
    initialize_session_state()
    st.sidebar.title("Document Processing")
    uploaded_files = st.sidebar.file_uploader("Upload files", accept_multiple_files=True)
    if uploaded_files:
        # 1. Extract and chunk text
        text_chunks = process_uploaded_files(uploaded_files)
        # 2. Build the conversational retrieval chain over the chunks
        chain = create_conversational_chain(text_chunks)
        # 3. Display chat UI
        display_chat_history(chain)

if __name__ == "__main__":
    main()
Handles document parsing, embedding and building the conversational chain:
def process_uploaded_files(uploaded_files):
    documents = []
    for file in uploaded_files:
        extension = os.path.splitext(file.name)[1]
        # Write the upload to a temporary file so the loaders can read it from disk.
        with tempfile.NamedTemporaryFile(delete=False) as temp:
            temp.write(file.read())
            path = temp.name
        loader = {
            ".pdf": PyPDFLoader,
            ".docx": Docx2txtLoader,
            ".doc": Docx2txtLoader,
            ".txt": TextLoader
        }.get(extension)
        if loader:
            documents.extend(loader(path).load())
        os.remove(path)
    # Split the loaded documents into overlapping chunks for embedding.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_documents(documents)
def create_conversational_chain(text_chunks):
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'}
    )
    vector_store = FAISS.from_documents(text_chunks, embedding=embeddings)
    llm = Replicate(
        streaming=True,
        model="replicate/llama-2-70b-chat:58d078176e02c219e11eb4da5a02a7830a283b14cf8f94537af893ccff5ee781",
        callbacks=[StreamingStdOutCallbackHandler()],
        input={"temperature": 0.01, "max_length": 500, "top_p": 1}
    )
    memory = ConversationBufferMemory(memory_key="chat_history",
                                      return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
        memory=memory
    )
import streamlit as st
from streamlit_chat import message
from chat import conversation_chat

def initialize_session_state():
    if 'history' not in st.session_state:
        st.session_state['history'] = []
    if 'generated' not in st.session_state:
        st.session_state['generated'] = ["Hello! Ask me anything about your lecture 🤗"]
    if 'past' not in st.session_state:
        # User-side messages shown in the chat history (initial greeting).
        st.session_state['past'] = ["Hi!"]

def display_chat_history(chain):
    with st.form(key='my_form', clear_on_submit=True):
        user_input = st.text_input("Question:", placeholder="Ask about your Documents",
                                   key='input')
        submit_button = st.form_submit_button(label='Send')
    if submit_button and user_input:
        # Run the query through the conversational chain and store both sides
        # of the exchange so they are re-rendered on the next run.
        output = conversation_chat(user_input, chain, st.session_state['history'])
        st.session_state['past'].append(user_input)
        st.session_state['generated'].append(output)
    for i in range(len(st.session_state['generated'])):
        message(st.session_state["past"][i], is_user=True, key=str(i) + '_user',
                avatar_style="thumbs")
        message(st.session_state["generated"][i], key=str(i), avatar_style="fun-emoji")
PAPER PUBLICATION