Complete Step-by-Step Guide: Building Your First
Machine Learning Project in VS Code with PARAM
Utkarsh Supercomputer
This comprehensive guide will walk you through building a simple machine learning project from scratch using VS Code
and running it on the PARAM Utkarsh supercomputer.
Prerequisites
Before starting, ensure you have:
Access to PARAM Utkarsh supercomputer
VS Code installed on your local machine
Basic knowledge of Python programming
SSH client (MobaXterm or PuTTY)
Step 1: Setting Up Your Local Development Environment
Install Required Software
1. Install VS Code: Download from https://code.visualstudio.com
2. Install Python Extension: In VS Code, install the Python extension by Microsoft
3. Install SSH Client: Download MobaXterm (recommended) or PuTTY
Create Project Directory
mkdir my_first_ml_project
cd my_first_ml_project
Step 2: Connecting to PARAM Utkarsh Supercomputer
SSH Connection
ssh -X username@<login-node-address>
Replace username with your actual username and <login-node-address> with the PARAM Utkarsh login address provided by CDAC
The -X flag enables X11 forwarding for graphical applications
Enter the captcha (case sensitive) and your password when prompted
Check Available Resources
Once logged in, check the system information displayed in the terminal. You'll see:
Total compute nodes: 156
CPU nodes: 107
High Memory nodes: 39
GPU accelerated nodes: 10
Step 3: Setting Up Python Environment on PARAM Utkarsh
Load Required Modules
# Check available modules
module avail
# Load Python with TensorFlow (recommended for ML projects)
module load anaconda3/tensorflow
# Alternative options:
# module load anaconda3/anaconda3
# module load anaconda3/pytorch
Verify Python Installation
python3 -V
pip list
Create Virtual Environment
# Create virtual environment in your project directory
python -m venv ml_project_env
# Activate virtual environment
source ml_project_env/bin/activate
# Verify virtual environment is active
which python
Step 4: Install Required Python Packages
Create requirements.txt file
cat > requirements.txt << EOF
numpy>=1.19.0
pandas>=1.1.0
matplotlib>=3.3.0
scikit-learn>=0.23.0
seaborn>=0.11.0
EOF
Install Packages
pip install -r requirements.txt
Step 5: Create Your First ML Project - Iris Classification
Create the main Python script
# iris_classification.py - Iris Flower Classification Project
# Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings('ignore')

print("Starting Iris Classification Project...")
print("=" * 50)
# Step 2: Load and Explore the Dataset
def load_and_explore_data():
    """Load the Iris dataset and perform initial exploration"""
    print("Loading Iris dataset...")
    # Load the dataset
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
    print(f"Dataset shape: {df.shape}")
    print(f"Features: {list(iris.feature_names)}")
    print(f"Target classes: {list(iris.target_names)}")
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nDataset Info:")
    df.info()
    print("\nStatistical Summary:")
    print(df.describe())
    print("\nClass Distribution:")
    print(df['species'].value_counts())
    return df, iris
# Step 3: Data Visualization
def visualize_data(df):
    """Create visualizations to understand the data better"""
    print("\nCreating visualizations...")
    # Set up the plotting style
    plt.style.use('default')
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    # Box plots for each feature
    features = df.columns[:-2]  # Exclude target and species columns
    for i, feature in enumerate(features):
        row, col = i // 2, i % 2
        sns.boxplot(data=df, x='species', y=feature, ax=axes[row, col])
        axes[row, col].set_title(f'{feature} by Species')
    plt.tight_layout()
    plt.savefig('iris_boxplots.png', dpi=150, bbox_inches='tight')
    print("Box plots saved as 'iris_boxplots.png'")
    # Correlation heatmap
    plt.figure(figsize=(10, 8))
    correlation_matrix = df.iloc[:, :-2].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
    plt.title('Feature Correlation Heatmap')
    plt.tight_layout()
    plt.savefig('correlation_heatmap.png', dpi=150, bbox_inches='tight')
    print("Correlation heatmap saved as 'correlation_heatmap.png'")
    # Pair plot (pairplot creates its own figure, so no plt.figure() call is needed)
    pair_grid = sns.pairplot(df, hue='species', diag_kind='hist')
    pair_grid.savefig('iris_pairplot.png', dpi=150, bbox_inches='tight')
    print("Pair plot saved as 'iris_pairplot.png'")
# Step 4: Prepare Data for Machine Learning
def prepare_data(df):
    """Prepare features and target variables"""
    print("\nPreparing data for machine learning...")
    # Separate features and target
    X = df.iloc[:, :-2]  # All columns except target and species
    y = df['target']
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Scale the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print(f"Training set size: {X_train.shape[0]}")
    print(f"Test set size: {X_test.shape[0]}")
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler
# Step 5: Train Multiple Models
def train_models(X_train, y_train):
    """Train multiple machine learning models"""
    print("\nTraining multiple machine learning models...")
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(random_state=42),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(random_state=42),
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Naive Bayes': GaussianNB()
    }
    # Train and evaluate models using cross-validation
    results = {}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for name, model in models.items():
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
        results[name] = {
            'model': model,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'cv_scores': cv_scores
        }
        print(f"{name}: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    return results
# Step 6: Select Best Model and Make Predictions
def evaluate_best_model(results, X_train, X_test, y_train, y_test):
    """Select the best model and evaluate on the test set"""
    print("\nSelecting best model and making predictions...")
    # Find the best model by cross-validation accuracy
    best_model_name = max(results, key=lambda name: results[name]['cv_mean'])
    best_model = results[best_model_name]['model']
    print(f"Best model: {best_model_name}")
    print(f"Cross-validation accuracy: {results[best_model_name]['cv_mean']:.4f}")
    # Train the best model on the full training set
    best_model.fit(X_train, y_train)
    # Make predictions
    y_pred = best_model.predict(X_test)
    # Evaluate on the test set
    test_accuracy = accuracy_score(y_test, y_pred)
    print(f"Test set accuracy: {test_accuracy:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {best_model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig('confusion_matrix.png', dpi=150, bbox_inches='tight')
    print("Confusion matrix saved as 'confusion_matrix.png'")
    return best_model, best_model_name, test_accuracy
# Step 7: Save Results
def save_results(results, best_model_name, test_accuracy):
    """Save model results to a file"""
    print("\nSaving results...")
    with open('model_results.txt', 'w') as f:
        f.write("Iris Classification Project Results\n")
        f.write("=" * 40 + "\n\n")
        f.write("Cross-validation Results:\n")
        for name, result in results.items():
            f.write(f"{name}: {result['cv_mean']:.4f} (+/- {result['cv_std'] * 2:.4f})\n")
        f.write(f"\nBest Model: {best_model_name}\n")
        f.write(f"Test Accuracy: {test_accuracy:.4f}\n")
    print("Results saved to 'model_results.txt'")
# Main execution function
def main():
    """Main function to run the entire ML pipeline"""
    print("Starting Machine Learning Pipeline...")
    # Step 1: Load and explore data
    df, iris = load_and_explore_data()
    # Step 2: Create visualizations
    visualize_data(df)
    # Step 3: Prepare data
    X_train, X_test, y_train, y_test, scaler = prepare_data(df)
    # Step 4: Train models
    results = train_models(X_train, y_train)
    # Step 5: Evaluate best model
    best_model, best_model_name, test_accuracy = evaluate_best_model(
        results, X_train, X_test, y_train, y_test
    )
    # Step 6: Save results
    save_results(results, best_model_name, test_accuracy)
    print("\n" + "=" * 50)
    print("Machine Learning Project Completed Successfully!")
    print("=" * 50)

# Run the project
if __name__ == "__main__":
    main()
Step 6: Create SLURM Job Script
Create slurm_cpu.sh file for CPU execution
cat > slurm_cpu.sh << 'EOF'
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
# Walltime below is a placeholder; adjust to your needs
#SBATCH --time=01:00:00
#SBATCH --job-name=iris_ml_cpu
#SBATCH --error=job.%j.err
#SBATCH --output=job.%j.out
#SBATCH --partition=standard
# Load required modules
module load anaconda3/tensorflow
# Change to project directory
cd $HOME/my_first_ml_project
# Activate virtual environment
source ml_project_env/bin/activate
# Run the Python program
python iris_classification.py
# Deactivate virtual environment
deactivate
EOF
Create slurm_gpu.sh file for GPU execution (if needed)
cat > slurm_gpu.sh << 'EOF'
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
# Walltime below is a placeholder; adjust to your needs
#SBATCH --time=01:00:00
#SBATCH --job-name=iris_ml_gpu
#SBATCH --error=job.%j.err
#SBATCH --output=job.%j.out
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
# Load CUDA and required modules
module load cuda/11.0
module load anaconda3/tensorflow
export CUDA_VISIBLE_DEVICES=0
# Change to project directory
cd $HOME/my_first_ml_project
# Activate virtual environment
source ml_project_env/bin/activate
# Run the Python program
python iris_classification.py
# Deactivate virtual environment
deactivate
EOF
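Before queueing a long GPU run, it is worth confirming that TensorFlow actually sees the GPU from inside a job. The snippet below is a minimal sketch (check_gpu.py is a hypothetical helper name; it assumes the anaconda3/tensorflow module provides TensorFlow 2.1 or later):
# check_gpu.py - verify GPU visibility from inside a job (hypothetical helper)
import tensorflow as tf

# An empty list means the job is running CPU-only
gpus = tf.config.list_physical_devices('GPU')
print(f"TensorFlow version: {tf.__version__}")
print(f"GPUs visible: {gpus if gpus else 'none'}")
Adding python check_gpu.py to the job script just before the main program catches misconfiguration early.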
Step 7: Submit and Monitor Your Job
Submit the job
# For CPU execution
sbatch slurm_cpu.sh
# For GPU execution (if applicable)
sbatch slurm_gpu.sh
Monitor job status
# Check your running jobs
squeue --me
# Check job details
scontrol show job <job_id>
# View output files
tail -f job.<job_id>.out
tail -f job.<job_id>.err
Step 8: Retrieve and View Results
Check output files
# List generated files
ls -la
# View results
cat model_results.txt
# Transfer files to local machine using scp
scp username@<login-node-address>:~/my_first_ml_project/*.png ./local_directory/
scp username@<login-node-address>:~/my_first_ml_project/model_results.txt ./local_directory/
Step 9: Advanced Project Enhancements
Create a more complex project structure
# Create organized directory structure
mkdir -p data src notebooks results models
# Move files to appropriate directories
mv iris_classification.py src/
mv *.png results/
mv model_results.txt results/
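The models directory created above is a natural home for persisted estimators, so they can be reused without retraining. Here is a minimal sketch using joblib (installed alongside scikit-learn); the model and file name are illustrative, not part of the main script:
# Example: persisting a trained model with joblib (illustrative)
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Save the fitted estimator into the models/ directory
joblib.dump(model, 'models/iris_model.joblib')

# Reload it later and predict without retraining
restored = joblib.load('models/iris_model.joblib')
print(restored.predict(X[:5]))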
Add configuration file
# config.py
class Config:
    # Data settings
    TEST_SIZE = 0.2
    RANDOM_STATE = 42
    # Model settings
    CV_FOLDS = 5
    # File paths
    RESULTS_DIR = 'results'
    MODELS_DIR = 'models'
    # Plotting settings
    FIGURE_SIZE = (12, 10)
    DPI = 150
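To put these settings to use, the main script can import Config instead of hard-coding values. A short sketch, assuming config.py sits next to the script on the Python path:
# Example usage in the main script (sketch)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from config import Config

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=Config.TEST_SIZE, random_state=Config.RANDOM_STATE, stratify=y
)
print(f"Train: {X_train.shape[0]} samples, Test: {X_test.shape[0]} samples")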
Step 10: Best Practices and Tips
Version Control
# Initialize git repository
git init
git add .
git commit -m "Initial ML project setup"
Error Handling and Logging
# Add to your iris_classification.py
import logging

# Set up logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('ml_project.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
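With this in place, logger calls can replace bare print statements, writing to both the console and ml_project.log. Continuing the snippet above (load_and_explore_data is the function from Step 5):
logger.info("Starting data loading...")
try:
    df, iris = load_and_explore_data()
    logger.info("Loaded dataset with shape %s", df.shape)
except Exception:
    # logger.exception records the full traceback in the log file
    logger.exception("Data loading failed")
    raise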
Performance Monitoring
# Add timing to your functions
import time
from functools import wraps
def timing_decorator(func):
    """Print how long the decorated function takes to run"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"{func.__name__} took {end_time - start_time:.2f} seconds")
        return result
    return wrapper

# Apply to your functions
@timing_decorator
def train_models(X_train, y_train):
    ...  # existing code
Troubleshooting Common Issues
Connection Issues
Ensure you have the correct username and password
Check if the cluster is under maintenance
Verify network connectivity
Module Loading Issues
# Clear all modules and reload
module purge
module load anaconda3/tensorflow
Virtual Environment Issues
# If virtual environment creation fails
rm -rf ml_project_env
python -m venv ml_project_env --system-site-packages
Job Submission Issues
Check partition availability: sinfo
Verify resource requests don't exceed limits
Ensure scripts have execute permissions: chmod +x slurm_cpu.sh
Memory Issues
Monitor memory usage in your script (see the sketch after this list)
Use batch processing for large datasets
Request more memory in your SLURM script: #SBATCH --mem=8G
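For the first point, a lightweight option on Linux is the standard-library resource module, which reports the peak resident memory of the current process. A minimal sketch:
# Example: report peak memory usage (Linux; resource is in the standard library)
import resource

def peak_memory_mb():
    # On Linux, ru_maxrss is reported in kilobytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"Peak memory so far: {peak_memory_mb():.1f} MB")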
Next Steps
1. Expand the project: Try different datasets, algorithms, or preprocessing techniques
2. Hyperparameter tuning: Use GridSearchCV or RandomizedSearchCV (a short sketch follows this list)
3. Feature engineering: Create new features or select the most important ones
4. Deploy the model: Create a simple web service or API
5. Experiment with deep learning: Use TensorFlow or PyTorch for neural networks
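For point 2, here is a short sketch of GridSearchCV applied to the Random Forest from this project; the parameter grid is illustrative, not tuned:
# Example: hyperparameter tuning with GridSearchCV (illustrative grid)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,  # use all cores allocated to the job
)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")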
Conclusion
You have successfully created and run your first machine learning project on the PARAM Utkarsh supercomputer! This
project covers the essential steps of a typical ML workflow:
Data loading and exploration
Data visualization
Model training and comparison
Model evaluation and selection
Result analysis and saving
The combination of VS Code for development and PARAM Utkarsh for execution provides a powerful environment for
machine learning experimentation and research.