Supriya Synopsis Final
AN INTERNSHIP REPORT
Submitted for
3rd Year
CSE-DS
By
SUPRIYA SINGH
(Roll No.- 2201331540196)
Submitted to
Ms. Manisha Pundir
STUDENT’S DECLARATION
This is to certify that the “Internship report” submitted by SUPRIYA SINGH (Roll no.
2201331540196) is her own work and was submitted during the 2024-2025 academic year, in
partial fulfillment of the requirements for the award of the degree of BACHELOR OF
TECHNOLOGY in COMPUTER SCIENCE ENGINEERING IN DATA SCIENCE.
ABSTRACT
This internship project involves applying data science techniques to solve real-world problems across
three diverse domains: Titanic survival prediction, movie rating prediction, and credit card fraud
detection. Each project presents its own unique challenge and learning opportunity, allowing the use of
data science to make predictions based on historical data and patterns.
In the Titanic survival prediction project, the task was to build a model that could predict whether passengers
survived or not based on factors such as age, gender, class, and fare. The key challenge was to deal
with missing data and categorical variables, requiring methods like data imputation and encoding. The
outcome was a classification model that helps us understand how different features influence survival
probability.
For movie rating prediction, the goal was to predict how a user would rate a movie based on their
previous ratings. This is an example of collaborative filtering, where patterns from user behavior are
used to recommend new movies. The challenge was to build an accurate recommendation system that
can handle a large number of users and items, using techniques like matrix factorization and k-nearest
neighbors.
Finally, the credit card fraud detection project addressed the critical need for financial institutions to
identify fraudulent transactions in real time. Due to the highly imbalanced nature of the dataset
(fraudulent transactions being much rarer than legitimate ones), specialized techniques like SMOTE
and ensemble learning were used to enhance the model’s ability to detect fraud while minimizing false
positives.
Together, these projects demonstrate the versatility of data science in solving various problems across
domains. From predicting survival on a historic ship voyage to recommending movies and identifying
fraud, the integration of statistical analysis, machine learning models, and evaluation metrics provided
valuable insights and practical solutions. These experiences not only enhanced my technical skills but
also taught me the importance of understanding the problem context and applying the right
methodology.
ACKNOWLEDGEMENT
We are highly grateful to Dr. Manali Gupta, HOD of the Department of Data Science, Noida
Institute of Engineering and Technology, Greater Noida, for providing this opportunity.
The constant guidance and encouragement received from Dr. Manali Gupta, HOD (Data Science
dept.), NIET, Greater Noida, has been of great help in carrying out the project work and is
acknowledged with reverential thanks.
We would like to express a deep sense of gratitude and profuse thanks to the project guide,
Dr. Manali Gupta. Without her wise counsel and able guidance, it would have been impossible to
complete the report in this manner.
We express gratitude to the other faculty members of the Data Science department of NIET for
their intellectual support throughout the course of this work.
Finally, the authors are indebted to all who have contributed to this report.
INDEX
1. Introduction
2. Software Requirement Specifications
3. System Design
4. System Implementation
5. Conclusion and Future Scope
6. References
INTRODUCTION
• Data science is a powerful and versatile field that uses data-driven techniques to uncover
patterns, make predictions, and solve complex problems across various domains. The internship
project leverages data science methodologies to tackle three real-world challenges: Titanic
survival prediction, movie recommendation systems, and credit card fraud detection. These
projects provide insights into different aspects of machine learning, including data
preprocessing, model building, evaluation, and deployment.
• The following sections break down the introduction into specific subtopics, providing a
structured overview of the project:
➢ Objective
• Build a classification model to predict Titanic passenger survival based on demographic and
ticket-related attributes.
• Create a recommendation system to predict user preferences and recommend movies using
collaborative filtering techniques.
• Develop a fraud detection system to identify fraudulent transactions from imbalanced datasets.
• Enhance Technical Skills:
• Gain expertise in handling real-world datasets, performing data preprocessing, and feature
engineering.
• Train and evaluate machine learning models using industry-standard tools.
• Deploy models into production using frameworks like Flask or Django.
• Movie Recommendation System:
Problem: Build a system to recommend movies by predicting how users would rate movies
based on their previous interactions.
Key Challenges: Handling large datasets with numerous users and items while ensuring
efficiency and scalability.
• Credit Card Fraud Detection:
Problem: Identify fraudulent transactions in a highly imbalanced dataset where fraudulent cases
are rare.
Key Challenges: Managing class imbalance, reducing false positives and false negatives, and
achieving high detection accuracy.
➢ Scope
• Optimizing model parameters using techniques like cross-validation and grid search.
• Deployment:
• Deploying the models as web services using Flask or Django, accessible via REST APIs.
• Exploring scalability for real-time prediction systems, especially for fraud detection.
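The parameter optimization mentioned above can be sketched with Scikit-learn's GridSearchCV, which runs cross-validation for every combination in a parameter grid. This is a minimal sketch and not the report's actual tuning code: the synthetic data, the parameter grid, and the choice of F1 as the scoring metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the report's datasets are not reproduced here.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical search space, chosen only for illustration.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,           # 5-fold cross-validation for every parameter combination
    scoring="f1",   # assumed metric; accuracy or recall would work the same way
)
search.fit(X, y)

print(search.best_params_)            # best combination found on this toy data
print(round(search.best_score_, 3))   # its mean cross-validated F1 score
```

With two values per parameter the search evaluates four combinations, each on five folds, so the cost grows quickly as the grid widens.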
Definitions:
• Machine Learning: A subset of artificial intelligence that focuses on building algorithms that
can learn and make predictions from data.
• Data Preprocessing: The process of cleaning, transforming, and organizing raw data to prepare
it for analysis.
Acronyms and Abbreviations:
• SMOTE: Synthetic Minority Oversampling Technique – A method used to handle imbalanced
datasets.
• API: Application Programming Interface – A tool for enabling communication between
software applications.
• ROC: Receiver Operating Characteristic – A curve that evaluates the performance of a binary
classification model.
• SQL/NoSQL: Structured Query Language/Non-relational database systems used for data
storage.
➢ Technologies to Be Used
The project utilizes several tools and frameworks to implement the machine learning solutions
effectively:
• Programming Languages:
Python: The primary programming language used for data analysis, model building, and
deployment.
• Libraries and Frameworks:
Pandas and NumPy: For data manipulation and numerical computations.
Scikit-learn: For machine learning algorithms and evaluation metrics.
Matplotlib and Seaborn: For data visualization.
imbalanced-learn (SMOTE): For handling imbalanced datasets.
• Deployment Tools:
Flask/Django: For creating APIs and deploying machine learning models as web services.
• Data Storage:
SQL and NoSQL databases for storing and querying structured and unstructured data.
• Other Tools:
Jupyter Notebook for developing and testing machine learning models.
Cloud Platforms (e.g., AWS, Google Cloud, or Heroku) for deploying scalable solutions.
Software Requirement Specifications
➢ Introduction
Purpose
The purpose of this system is to build predictive models for the following real-world problems:
• Titanic Survival Prediction: Predict whether passengers survived the Titanic disaster based on
historical attributes such as age, gender, class, and fare.
• Movie Recommendation System: Provide personalized movie recommendations by predicting
user ratings based on their past behavior and preferences.
• Credit Card Fraud Detection: Identify fraudulent credit card transactions from a dataset
containing a large imbalance between legitimate and fraudulent transactions.
• The software aims to apply machine learning algorithms to make predictions and generate
actionable insights for users. It will offer user-friendly interactions through a web interface,
allow model evaluations, and support real-time predictions.
Project Scope
• Data Preprocessing: The project involves cleaning and transforming raw datasets, dealing with
missing values, categorical data, and feature scaling.
• Model Training: The system will train models using classification algorithms for Titanic
survival prediction, collaborative filtering for movie recommendations, and anomaly detection
for fraud detection.
• Model Evaluation: It will include metrics such as accuracy, precision, recall, F1-score, and
confusion matrices.
• Deployment: The system will be deployed as a RESTful API via Flask or Django, allowing
users to interact with the models and receive predictions in real-time.
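The preprocessing step in the scope above (missing values, categorical data, feature scaling) could be expressed as a single Scikit-learn transformer. The sketch below is one possible arrangement and not the report's implementation: the small data frame is made up, and the column names only mimic the Titanic attributes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with Titanic-style columns (values are illustrative).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Sex": ["male", "female", "female", "male"],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill missing values with the column mean
    ("scale", StandardScaler()),                  # rescale to zero mean and unit variance
])
categorical = OneHotEncoder(handle_unknown="ignore")  # one column per category

preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", categorical, ["Sex"]),
])

features = preprocess.fit_transform(df)
print(features.shape)  # rows x (2 numeric columns + 2 one-hot columns)
```

Bundling the steps this way means the same imputation means and scaling factors learned on the training data are reused at prediction time.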
➢ Overall Description
1) Product/Project Perspective
• This system is a predictive analytics solution that incorporates various machine learning
techniques. It is intended for use by data science professionals, machine learning enthusiasts, or
any organization that wants to leverage predictive models for decision-making. The project
aims to demonstrate the versatility of machine learning in solving different types of problems
across various industries.
2) Product/Project Function
• Titanic Survival Prediction: The system will take passenger attributes (e.g., age, class, fare) as
input and predict the likelihood of survival based on historical data.
• Movie Recommendation System: Based on users' historical ratings, the system will predict
which movies a user might enjoy and provide recommendations.
• Credit Card Fraud Detection: The system will process transaction data, flagging potentially
fraudulent transactions by using machine learning models trained on historical data.
3) User Classes and Characteristics
• End Users: These are individuals or businesses interacting with the system to obtain
predictions. End users may include casual users seeking movie recommendations or financial
institutions needing fraud detection alerts.
• Administrators: Data scientists or system administrators responsible for maintaining and
updating models, monitoring system performance, and managing user access.
• Data Scientists: Individuals responsible for developing and tuning the models, evaluating model
performance, and fine-tuning the system based on real-time feedback.
4) Operating Environment
• The system will operate in a web-based environment, where users can interact with the
predictive models through a user-friendly interface. It will run on a server using the following
configuration:
5) Architecture Design
• The system will be based on a client-server architecture. The design consists of:
• Frontend: A web interface built with HTML, CSS, and JavaScript (or ReactJS) for user
interaction.
• Backend: A Flask or Django framework, which will handle the requests from the frontend,
invoke machine learning models, and return predictions to the user.
• Database: A SQL or NoSQL database for storing the transaction records (for fraud detection) or
movie ratings (for the recommendation system).
• Diagram:
• Client-Server Interaction:
• Client (User Interface) → Server (Flask/Django) → Model (Machine Learning Algorithms) →
Prediction Output
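The request flow in the diagram above can be sketched as a plain function that validates a JSON body, calls the model, and returns a JSON response. This is an illustration only: the field names, the handle_predict helper, and the toy predict_survival rule are hypothetical, and in the actual system this logic would sit inside a Flask or Django view.

```python
import json

REQUIRED_FIELDS = ("age", "pclass", "fare")  # hypothetical request schema


def predict_survival(features):
    """Placeholder for a trained model; a real service would call model.predict."""
    # Toy rule used only so the sketch runs without a trained model.
    return 1 if features["pclass"] == 1 else 0


def handle_predict(body):
    """Validate a JSON request body and return (status_code, JSON response)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        return 400, json.dumps({"error": f"missing fields: {missing}"})
    return 200, json.dumps({"survived": predict_survival(payload)})


status, response = handle_predict('{"age": 29, "pclass": 1, "fare": 80.0}')
print(status, response)  # 200 {"survived": 1}
```

Rejecting malformed input before it reaches the model keeps error handling in one place and gives the client a clear status code.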
6) Constraints
• Limited Data Availability: For Titanic prediction, the dataset is relatively small, which may
affect model performance.
• Class Imbalance: In fraud detection, fraudulent transactions are much less frequent than
legitimate transactions, requiring special handling like SMOTE (Synthetic Minority Over-
sampling Technique).
• Real-Time Processing: Fraud detection requires quick processing of transaction data, which
may need real-time prediction capabilities.
• Computational Complexity: Training models with large datasets, especially for movie
recommendations, may require significant computational resources.
7) Use Case Model Description
User inputs passenger features (age, gender, class, fare) → system predicts survival.
User inputs past movie ratings → system returns a list of recommended movies.
• Assumptions:
• Dependencies:
• Python libraries such as Pandas, Scikit-learn, Flask/Django for model training and deployment.
• External APIs for real-time transaction data (for fraud detection).
➢ System Features
• Data Preprocessing:
Handle missing values, encode categorical features, and scale numerical data for all models.
• Model Training:
Build and train models using algorithms such as Random Forest, XGBoost, k-Nearest
Neighbors (for recommendations), and anomaly detection methods.
• Prediction Generation:
The system will return predictions (survival, ratings, fraud detection) based on user inputs.
• Model Evaluation:
➢ External Interface Requirements
User Interfaces
Web-based Interface: A responsive user interface (UI) that allows users to input data, view
predictions, and access the system’s features.
• The interface should be easy to navigate, allowing users to interact with the Titanic survival
prediction, movie recommendation, and fraud detection functionalities.
Hardware Interfaces
• Server Requirements: A server capable of running machine learning models and handling API
requests. The system can be deployed on virtual servers or cloud platforms.
Software Interfaces
• Database Interface: The system must interface with SQL/NoSQL databases to store transaction
data and movie ratings.
• Machine Learning Libraries: The system will use Scikit-learn, TensorFlow, and other libraries
for training and prediction.
Communications Interfaces
• RESTful API: Communication between the frontend and backend is done through RESTful
APIs, with data exchanged in JSON format.
Performance Requirements
• The system must be capable of handling multiple concurrent requests, particularly for fraud
detection systems.
• Model inference (prediction) must be completed in less than 1 second for real-time applications.
Safety Requirements
• Data Integrity: The system must handle user data securely, ensuring that no data is lost during
transactions.
• Model Robustness: Models should be resilient against noisy or incomplete input data.
Security Requirements
• Authentication: Secure authentication for accessing the system (e.g., using OAuth or API
tokens).
• Data Encryption: Sensitive data, especially financial transactions, should be encrypted during
transfer and storage.
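As an illustration of the token-based authentication requirement, the sketch below checks a bearer token from a request header against stored digests. The token value, the VALID_TOKEN_HASHES store, and the is_authorized helper are hypothetical; a production system would more likely rely on an established OAuth or API-key library.

```python
import hashlib
import hmac

# Hypothetical token store: only SHA-256 digests are kept, never raw tokens.
VALID_TOKEN_HASHES = {hashlib.sha256(b"demo-token-123").hexdigest()}


def is_authorized(headers):
    """Check an 'Authorization: Bearer <token>' header against the stored digests."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme != "Bearer" or not token:
        return False
    digest = hashlib.sha256(token.encode()).hexdigest()
    # compare_digest takes the same time whether or not the digests match,
    # which avoids leaking information through response timing.
    return any(hmac.compare_digest(digest, known) for known in VALID_TOKEN_HASHES)


print(is_authorized({"Authorization": "Bearer demo-token-123"}))  # True
print(is_authorized({"Authorization": "Bearer wrong"}))           # False
```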
System Design
System Implementation
This section describes how the system was developed and tested. It covers the coding process, the
testing phase, and the outputs produced by each component of the system.
➢ Coding
The code for each of the system modules (Titanic Survival Prediction, Movie Recommendation
System, and Credit Card Fraud Detection) was implemented using Python. Below are the code
snippets along with the expected outputs for each system.
Titanic Survival Prediction
• Data Preprocessing:
o The Titanic dataset was cleaned by handling missing values, encoding categorical
features, and scaling numerical data.
Code:
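import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load data and select the features and target
data = pd.read_csv('titanic.csv')
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']
# One-hot encode 'Sex' and fill missing ages with the mean age
X = pd.get_dummies(X, drop_first=True)
X['Age'] = X['Age'].fillna(X['Age'].mean())
# Train-test split (test_size and random_state are illustrative; the report does not state them)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))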
OUTPUT:
Explanation of Output:
Precision: The proportion of true positives (correct positive predictions) among all predicted positives.
Recall: The proportion of true positives (correct positive predictions) among all actual positives.
F1-Score: The harmonic mean of precision and recall, providing a balance between them.
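The three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below uses made-up counts rather than results from the report, and the helper name precision_recall_f1 is an illustrative choice.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts for illustration, not results from the report.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.800 0.667 0.727
```

Here 40 of the 50 predicted positives are correct (precision 0.8) and 40 of the 60 actual positives are found (recall 0.667), so the F1-score of 0.727 lies between the two, closer to the lower value.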
Movie Recommendation System
• Data Preprocessing:
o The MovieLens dataset was used to create a user-item matrix. Collaborative filtering
was applied using K-Nearest Neighbors (KNN).
Code:
import pandas as pd
# Load the MovieLens movie metadata and user ratings
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
OUTPUT:
Explanation of Output:
The model finds the 5 nearest neighbors of the input user (based on cosine similarity) and recommends
movies according to the ratings of those neighbors.
The array of movie indices (e.g., 37, 11, 88, etc.) corresponds to the most recommended movies for the
user.
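The neighbor-based approach described here can be sketched with NumPy on a small made-up user-item matrix. This is not the report's code: the ratings, the helper names nearest_users and recommend, and the rule of ranking unrated movies by the neighbors' mean rating are assumptions chosen to keep the example short.

```python
import numpy as np

# Made-up user-item matrix: rows are users, columns are movies, 0 means unrated.
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 2, 0, 0],
    [1, 0, 5, 4, 5],
    [0, 1, 4, 5, 4],
    [5, 5, 0, 0, 1],
], dtype=float)


def nearest_users(matrix, user, k):
    """Return the indices of the k users most similar to `user` by cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ matrix[user] / (norms * norms[user])
    sims[user] = -np.inf                     # exclude the user themself
    return np.argsort(sims)[::-1][:k]


def recommend(matrix, user, k=2, n=2):
    """Rank movies the user has not rated by the neighbours' mean rating."""
    neighbours = nearest_users(matrix, user, k)
    scores = matrix[neighbours].mean(axis=0)
    scores[matrix[user] > 0] = -np.inf       # skip movies already rated
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked[:n] if np.isfinite(scores[i])]


print(nearest_users(ratings, user=0, k=2))   # users 4 and 1 rate most like user 0
print(recommend(ratings, user=0))            # movie indices 2 and 4
```

The sketch assumes every user has at least one rating. A user with an all-zero row would need special handling to avoid division by zero.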
Credit Card Fraud Detection
• Data Preprocessing:
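Code:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']
# Oversample the minority (fraud) class with SMOTE
smote = SMOTE(sampling_strategy='minority')
X_res, y_res = smote.fit_resample(X, y)
# Train-test split, assumed here to be applied to the resampled data, so the test set
# also contains synthetic rows (test_size and random_state are illustrative; the report
# does not state them)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)
# Model training
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))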
OUTPUT:
Explanation of Output:
Confusion Matrix: Shows the number of true positives (2796 fraudulent transactions detected), true
negatives (28448 legitimate transactions correctly identified), false positives, and false negatives.
Precision: The proportion of positive predictions that were actually correct.
Recall: The proportion of actual positive cases that were correctly identified.
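To show what SMOTE-style oversampling does, the sketch below creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbors. It is a simplified illustration and not the imbalanced-learn implementation: the smote_like helper and the data are made up. Here only the training portion is resampled, a common practice that keeps synthetic rows out of the held-out test set.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating from a sampled row
    towards a randomly chosen one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Made-up imbalanced training split: 95 legitimate rows and 5 fraudulent rows.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y_train = np.array([0] * 95 + [1] * 5)

# Oversample the training data only.
synthetic = smote_like(X_train[y_train == 1], n_new=90)
X_bal = np.vstack([X_train, synthetic])
y_bal = np.concatenate([y_train, np.ones(90, dtype=int)])

print(np.bincount(y_bal))  # [95 95]
```

Because each synthetic row lies on a segment between two real minority rows, the new points stay inside the region already occupied by the minority class.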
Testing
The testing phase ensured that the system works as intended, free of bugs, and meets performance
expectations. Below are the main types of testing performed:
1. Unit Testing:
o Ensured individual components like data preprocessing, model training, and evaluation
worked independently.
2. Integration Testing:
o Checked if the different modules worked together seamlessly (e.g., Titanic prediction
API with the user interface).
3. System Testing:
o Tested the complete system end-to-end to ensure all functionalities were integrated and
working as expected.
4. Performance Testing:
o Tested the fraud detection system under heavy transaction load to evaluate the real-
time prediction speed.
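A performance test of this kind can be sketched by timing repeated calls to the prediction function and checking a latency percentile against the one-second target. The predict function below is a stand-in rule rather than the trained fraud model, and the request count and the choice of the 95th percentile are assumptions.

```python
import statistics
import time


def predict(transaction):
    """Stand-in for model inference; a deployed system would call the trained model."""
    return int(transaction["amount"] > 1000)   # toy rule, not the report's model


def measure_latency(n_requests=1000):
    """Time n_requests sequential predictions and return per-call latencies in seconds."""
    latencies = []
    for i in range(n_requests):
        start = time.perf_counter()
        predict({"amount": i})
        latencies.append(time.perf_counter() - start)
    return latencies


latencies = measure_latency()
p95 = statistics.quantiles(latencies, n=20)[-1]   # last of 19 cut points = 95th percentile
print(f"median={statistics.median(latencies):.6f}s p95={p95:.6f}s")
```

Reporting a high percentile rather than the mean highlights the slow requests that matter most for a real-time requirement.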
Snapshots
Below are snapshots of the system interfaces showcasing the user experience:
Conclusion and Future Scope
Conclusion
The internship focused on applying data science techniques to real-world problems, leveraging
machine learning algorithms and tools to create practical solutions. The key outcomes of this
internship are summarized below:
Hands-on Experience in Data Science:
The internship provided valuable, hands-on experience in solving complex problems using data
science methodologies. By working with datasets, I gained practical skills in data cleaning,
preprocessing, feature engineering, and model selection.
I implemented machine learning models, including classification algorithms like Random Forest,
Logistic Regression, and XGBoost, and built recommendation systems using Collaborative Filtering
and Matrix Factorization.
By predicting Titanic passenger survival, the project demonstrated the use of historical data to derive
actionable insights. The Random Forest and Logistic Regression models achieved satisfactory results,
helping identify the factors that influenced survival. This project showcased the importance of data
preprocessing and handling missing data in a real-world scenario.
Key lessons learned: the importance of addressing missing data, understanding the significance of
different features, and the role of feature engineering in improving model accuracy.
The movie recommendation system developed during the internship showcased the power of
Collaborative Filtering in providing personalized content to users. By predicting user ratings based on
their past behavior, the system was able to recommend movies efficiently.
The project also demonstrated the challenges of working with large datasets, ensuring scalability, and
optimizing the model for accuracy.
Key lessons learned: the effectiveness of collaborative filtering, scalability considerations for
recommendation systems, and how to handle user-item matrices.
The fraud detection project, focusing on detecting fraudulent transactions, demonstrated the
importance of addressing class imbalance using SMOTE (Synthetic Minority Over-sampling
Technique) and the use of ensemble learning techniques. The system achieved high recall and
precision, which kept false negatives low in fraud detection.
Key lessons learned: the challenges of working with imbalanced datasets, the significance of precision
and recall in fraud detection, and the need to minimize false positives and false negatives in real-time
applications.
The internship also involved deploying machine learning models as web services using Flask. This
allowed the models to be accessed in real-time, providing immediate predictions for users. By doing
so, I gained practical experience in deploying machine learning models into production environments,
making the models accessible to end users.
Key lessons learned: the process of model deployment, how to expose machine learning models as
APIs, and the importance of scalability and performance in a production setting.
Overall, the internship provided a comprehensive experience in building, deploying, and evaluating
machine learning models, with a focus on practical applications in various domains. The project not
only enhanced my technical proficiency but also strengthened my problem-solving skills, preparing
me for a career in data science.
Future Scope
Although the internship projects were successful in meeting their objectives, there are several areas
that can be explored further to improve and extend the functionality of the systems. Below are the
possible avenues for future work:
• Titanic Survival Prediction:
More Complex Models: The current models used in Titanic survival prediction, such as Random
Forest and Logistic Regression, could be complemented by more complex algorithms, such as
Deep Neural Networks (DNNs) or XGBoost tuned with cross-validation. These may capture more
intricate patterns in the data.
Model Generalization: The model can be extended to predict survival for other historical events, such
as the sinking of other ships or even non-maritime disasters.
Feature Engineering: Advanced feature engineering techniques, such as time-series analysis or
interaction features, can be employed to further improve prediction accuracy.
• Movie Recommendation System:
Hybrid Recommendation Systems: Currently, the movie recommendation system uses collaborative
filtering. However, combining content-based filtering with collaborative methods (a hybrid approach)
could improve recommendation accuracy, as it would take into account both user preferences and
movie features.
Deep Learning: Moving beyond traditional collaborative filtering, deep learning models such as
Neural Collaborative Filtering (NCF) or Recurrent Neural Networks (RNNs) could be explored to
make more accurate recommendations by capturing complex relationships between users and movies.
Real-Time Recommendations: Implementing real-time recommendations, based on user behavior as
they interact with the system, could further personalize user experiences. This would involve
constantly updating the recommendation list based on new interactions.
• Credit Card Fraud Detection:
Real-Time Fraud Detection: The current model works offline using a batch processing approach. A
real-time fraud detection system could be developed where transactions are processed instantaneously
as they occur. This requires optimizing the model for fast inference and integrating it with financial
transaction systems.
Advanced Sampling Techniques: While SMOTE was used to handle class imbalance, other advanced
techniques such as Adaptive Synthetic Sampling (ADASYN) or Borderline-SMOTE could further
improve model performance in handling highly imbalanced data.
• Deployment and Scalability:
Scalable Infrastructure: As the models grow in size and complexity, it will be important to consider
deploying the system on scalable infrastructure like Kubernetes or cloud-based services (e.g., AWS,
Google Cloud, or Azure). This will allow the system to handle large numbers of requests efficiently
and ensure that predictions can be made at scale.
Containerization and Microservices: Using Docker for containerizing the models and deploying them
as microservices will allow for easier scaling and management of the system in a production
environment.
Improved User Interface: The user interfaces for the Titanic survival prediction, movie
recommendation, and fraud detection systems could be further developed to be more interactive and
intuitive, incorporating features such as data visualization, user feedback loops, and real-time updates.
Multi-language Support: Expanding the system’s usability to multiple languages and regions can
increase accessibility and broaden the user base, especially in global applications like movie
recommendations or fraud detection in financial transactions.
References
Books and Articles:
Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts,
Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, 2019.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Science & Business Media, 2009.
Brownlee, Jason. Machine Learning Mastery with Python: Understand Your Data, Create Accurate
Models, and Work Projects End-to-End. Machine Learning Mastery, 2016.
Online Resources:
Research Papers:
Chawla, N.V., et al. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 2002.
Koren, Y., Bell, R., & Volinsky, C. Matrix Factorization Techniques for Recommender Systems.
Computer, vol. 42, no. 8, IEEE Computer Society, 2009.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?" Explaining the
Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2016.
Tools and Libraries: