DA Lab - Manual
COURSE OBJECTIVES
The course will enable students to obtain practical experience with data analytics
algorithms and become familiar with the development of web services and
applications in the cloud framework.
COURSE CONTENT:
List of Experiments
1. Find the procedure to run virtual machines of different configurations. Check how
many virtual machines can be utilized at a particular time.
There are several ways to run a virtual machine with different configurations, depending
on the virtualization software being used. In VirtualBox, for example, the general steps are:
1. Open VirtualBox and click "New" to create a virtual machine.
2. Enter a name and select the operating system type and version.
3. Allocate the amount of memory (RAM) for the virtual machine.
4. Create a new virtual hard disk or attach an existing one, and finish the wizard.
5. Open the machine's "Settings" to change the configuration, such as the number of
processors (under "System"), video memory (under "Display"), or network adapters
(under "Network").
6. Click "Start" to run the virtual machine. Repeat the steps with different settings to
obtain virtual machines of different configurations.
The number of virtual machines that can be utilized at a particular time depends on the
resources available on the host machine, such as CPU, RAM, and disk space. It also
depends on the specific virtualization software being used and its capabilities: some
virtualization software allows you to run many virtual machines simultaneously on the
same host, while others limit the number of running virtual machines.
You can check how many virtual machines can run at a particular time by monitoring the
host machine's resource usage and the virtualization software's settings. For example, in
VirtualBox, all running virtual machines are listed in the VirtualBox Manager window,
and the host machine's resource usage can be monitored with operating-system tools such
as Task Manager on Windows or the top command on Linux.
2. Find the procedure to attach a virtual block (disk) to a virtual machine and check
whether it holds the data even after the release of the virtual machine.
The procedure to attach a virtual block to a virtual machine can vary depending on the
virtualization software being used. Here are some general steps for attaching a virtual
block to a virtual machine in VirtualBox:
1. Open VirtualBox and select the virtual machine to which you want to attach the virtual
disk.
2. Click on the "Settings" button to open the virtual machine's settings.
3. Click on the "Storage" tab.
4. Select an existing storage controller (for example, the SATA controller) and click the
"Adds hard disk" icon next to it (menu names vary slightly across VirtualBox versions).
5. Choose "Create new disk" to make a new virtual hard disk, or "Choose existing disk" to
attach a disk file (such as a VDI, VHD, or VMDK) that already exists.
6. If creating a new disk, follow the wizard to select the file type, the size, and whether
the disk is dynamically allocated or fixed.
7. Select the disk in the list and confirm to attach it to the controller.
8. Click "OK" to close the settings window.
9. Start the virtual machine and check whether it recognizes the new virtual disk; it
should appear as a new block device, which must be partitioned, formatted, and mounted
inside the guest before data can be stored on it.
Once the virtual block is attached to the virtual machine, the virtual machine will treat it as
a regular physical disk and can use it to store data. The data stored on the virtual block
will persist even after the virtual machine is released, as long as the virtual block is not
deleted.
To check whether the data holds after the release of the virtual machine, you can take a
snapshot of the virtual machine before shutting it down; afterwards, you can start the
virtual machine from that snapshot and check the data.
Alternatively, you can also use the export feature of the virtualization software to create a
copy of the virtual machine, including the virtual block, and then use the copy to start the
virtual machine and check the data.
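The same attach-and-verify flow can also be scripted. Below is a minimal Python sketch
that drives VBoxManage through subprocess; the VM name "MyVM" and the controller name
"SATA" are assumptions and should be adjusted to your setup.
import subprocess

def vboxmanage(*args):
    # invoke VBoxManage (assumes it is on the PATH)
    subprocess.run(["VBoxManage", *args], check=True)

# create a 1 GB virtual disk file on the host
vboxmanage("createmedium", "disk", "--filename", "datablock.vdi", "--size", "1024")

# attach it to the "SATA" controller of a VM named "MyVM" (both names are placeholders)
vboxmanage("storageattach", "MyVM", "--storagectl", "SATA",
           "--port", "1", "--device", "0", "--type", "hdd",
           "--medium", "datablock.vdi")
After releasing (powering off or even deleting) the virtual machine, the datablock.vdi file
still exists on the host, which is why the data it holds persists.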
3. Find the procedure to install a C compiler in the virtual machine and execute a simple
program.
1. Start the virtual machine and log in as an administrator or a user with root/sudo access.
2. Open a terminal window.
3. Update the package manager by running the command "sudo apt-get update" (for
Ubuntu/Debian) or "sudo yum update" (for Fedora/CentOS)
4. Install the C compiler by running the command "sudo apt-get install gcc" (for
Ubuntu/Debian) or "sudo yum install gcc" (for Fedora/CentOS)
5. Verify the installation by running the command "gcc --version"
6. Create a source file, for example "hello.c", containing a simple C program.
7. Compile it with the command "gcc hello.c -o hello" and run it with "./hello"
You should now see the program's output displayed in the terminal window.
The above steps are for a Linux-based virtual machine and may vary depending on the
specific operating system and version of the virtual machine.
A Windows-based virtual machine can use compilers such as MinGW, Visual Studio, or GCC
for Windows. To install one, download the installer and run it; you can then use the
command prompt or an integrated development environment (IDE) to compile and run C
programs.
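If you want to script the verification, the small Python sketch below compiles and runs a
program and captures its output; it assumes gcc is installed and a hello.c file exists in
the current directory.
import subprocess

# compile hello.c into an executable named "hello" (assumes gcc is installed)
subprocess.run(["gcc", "hello.c", "-o", "hello"], check=True)

# run the executable and print whatever it writes to standard output
result = subprocess.run(["./hello"], capture_output=True, text=True, check=True)
print(result.stdout)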
4. Show virtual machine migration from one node to another based on a certain condition.
Virtual machine migration is the process of moving a running virtual machine from one
physical host to another without any interruption to the running services. The process of
migrating a virtual machine from one node to another can vary depending on the
virtualization software being used. Here are some general steps for migrating a virtual
machine using the live migration feature in VMware vSphere:
1. Log in to the vSphere web client and navigate to the host or cluster where the virtual
machine is currently running.
2. Right-click on the virtual machine and select "Migrate."
3. Select "Change host" and select the destination host or cluster where you want to migrate
the virtual machine.
4. Select the migration type. To migrate a running virtual machine without any
interruption (a live migration, known as vMotion), keep the virtual machine powered
on and select "Change compute resource only".
5. Select the storage where the virtual machine's files should be located on the
destination host.
6. Click "Next" and review the migration settings, then click "Finish" to start the
migration process.
The virtual machine migration can also be triggered based on certain conditions such
as resource utilization, power consumption, or a specific time schedule. This is known
as automatic or scheduled migration.
To set up condition-based migration, use the vSphere Distributed Resource Scheduler
(DRS), which automatically balances virtual machine workloads across the hosts in a
cluster. DRS uses the current resource usage, resource reservations, and constraints to
determine which host is best for a virtual machine.
You can also use vSphere HA (High Availability), which automatically restarts virtual
machines on other hosts in the event of a host failure.
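For a programmatic flavour of condition-based migration, the sketch below uses VMware's
pyVmomi SDK for Python to trigger a migration when the source host's CPU usage crosses a
threshold. Treat it as an illustrative outline: the vCenter address, credentials, VM and
host names, and the 80% threshold are all assumptions, and error handling is omitted.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# connect to vCenter (address and credentials are placeholders)
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

def find_obj(vimtype, name):
    # look up a managed object (VM or host) by its inventory name
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

vm = find_obj(vim.VirtualMachine, "my-vm")              # VM to migrate (placeholder)
source = find_obj(vim.HostSystem, "esxi-01.example")    # current host (placeholder)
target = find_obj(vim.HostSystem, "esxi-02.example")    # destination host (placeholder)

# condition: migrate only if the source host's CPU usage exceeds 80%
capacity = source.summary.hardware.cpuMhz * source.summary.hardware.numCpuCores
usage_pct = 100.0 * source.summary.quickStats.overallCpuUsage / capacity
if usage_pct > 80:
    vm.Migrate(pool=None, host=target,
               priority=vim.VirtualMachine.MovePriority.defaultPriority)

Disconnect(si)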
5. Find the procedure to install a storage controller and interact with it.
1. Choose a cloud provider and create an account if you do not already have one.
2. Select the appropriate storage service for your needs, such as an object storage service or
a block storage service.
3. Create a storage container or volume, depending on the service you have chosen.
4. Configure any necessary settings, such as access controls or performance tiers.
5. Obtain the credentials needed to interact with the storage service, such as access keys or
connection strings.
6. Use a programming language or command-line tool to interact with the storage service,
such as the AWS SDK for your language of choice or the AWS Command Line Interface
(CLI).
7. Once the storage controller is installed, you can use it to create, read, update, and
delete data stored in the storage container or volume.
8. Configure backup and replication policies and monitor the storage usage.
It is important to note that the specific steps and tools you will use will depend on the cloud
provider and storage service you have chosen.
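As an example of step 6, the Python sketch below uses the AWS SDK for Python (boto3) to
write an object into a bucket and read it back; the bucket name is a placeholder, and
valid AWS credentials are assumed to be configured on the machine.
import boto3

# create an S3 client (assumes credentials are configured, e.g. in ~/.aws/credentials)
s3 = boto3.client("s3")
bucket = "my-demo-bucket"  # placeholder -- bucket names must be globally unique

# upload a small object, then read it back to confirm the data is stored
s3.put_object(Bucket=bucket, Key="hello.txt", Body=b"Hello, cloud storage!")
response = s3.get_object(Bucket=bucket, Key="hello.txt")
print(response["Body"].read().decode())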
6. Find the procedure to set up a one-node Hadoop cluster.
The general steps for a single-node (pseudo-distributed) Hadoop cluster on Linux are:
1. Install Java (JDK 8 or later) and set the JAVA_HOME environment variable.
2. Download a Hadoop release from hadoop.apache.org and extract it, for example to
/usr/local/hadoop.
3. Set up passwordless SSH to localhost (generate a key with "ssh-keygen" and append it
to authorized_keys).
4. In core-site.xml, set fs.defaultFS to hdfs://localhost:9000; in hdfs-site.xml, set
dfs.replication to 1.
5. Format the filesystem with "hdfs namenode -format".
6. Start HDFS and YARN with "start-dfs.sh" and "start-yarn.sh".
7. Verify with "jps" that the NameNode, DataNode, ResourceManager, and NodeManager
daemons are running.
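Once the daemons are running, a quick Python check can confirm the cluster is reachable;
this assumes the NameNode web interface is on its default port 9870 (Hadoop 3.x).
import requests

# query the NameNode's JMX endpoint for its status bean
url = "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
print(requests.get(url).json())  # the reply includes the NameNode state, e.g. "active"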
Linear Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# load the iris dataset (assumes a CSV file with the named columns)
iris = pd.read_csv("iris.csv")

# define X and y (predict petal_length from the other measurements)
X = iris[['sepal_length', 'sepal_width', 'petal_width']]
y = iris['petal_length']

# split the data, fit the model, and make predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)
Multiple Regression
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# load the Boston housing data (load_boston requires scikit-learn < 1.2)
boston = load_boston()
X, y = boston.data, boston.target
model = LinearRegression().fit(X, y)

# Make predictions and evaluate
y_pred = model.predict(X)
print("MSE:", mean_squared_error(y, y_pred), "R^2:", r2_score(y, y_pred))
Logistic Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the iris dataset (assumes a CSV file with the named columns)
iris = pd.read_csv("iris.csv")

# define X and y
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = iris['species']

# split the data, fit the model, and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
# load the dataset
data = pd.read_csv("data.csv")

# define X and y
X = data[['feature1', 'feature2', 'feature3', 'feature4']]
y = data['target']
K-Means Clustering
from sklearn.cluster import KMeans

# fit k-means with two clusters on small sample data
kmeans = KMeans(n_clusters=2, n_init=10).fit([[1, 1, 1], [2, 2, 2], [8, 8, 8], [9, 9, 9]])

# predict cluster of new data
new_data = [[1, 2, 3], [4, 5, 6]]
predictions = kmeans.predict(new_data)
print(predictions)
Hierarchical Clustering
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn import datasets

# load the iris data
X = datasets.load_iris().data

# Perform linkage
Z = linkage(X, method='ward')

# Plot dendrogram
dendrogram(Z)
plt.show()
Association Rule Mining
from apyori import apriori

# Data set
transactions = [
    ['milk', 'bread', 'butter'],
    ['milk', 'bread', 'butter', 'cheese'],
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread', 'eggs', 'cheese'],
    ['milk', 'bread', 'butter', 'cheese', 'eggs'],
]

# Mine rules with minimum support 0.5, confidence 0.7, and lift 1
rules = list(apriori(transactions, min_support=0.5, min_confidence=0.7, min_lift=1))

# Print results
for item in rules:
    pair = item[0]
    items = [x for x in pair]
    if len(items) < 2:  # skip single-item records
        continue
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
This program uses the apriori function from the apyori library to perform association
rule mining on the given dataset of transactions. The minimum support, confidence, and
lift values are set to 0.5, 0.7, and 1 respectively. The resulting rules, support, confidence,
and lift values are then printed.
KNN Classification
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN classification
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate accuracy on the test set
print("Accuracy:", knn.score(X_test, y_test))

# Make predictions
predictions = knn.predict(X_test)
print("Predictions:", predictions)
This program uses the load_iris function from the sklearn.datasets module to load the iris
dataset. It then splits the dataset into training and test sets using the train_test_split
function. The KNeighborsClassifier class from the sklearn.neighbors module is then used
to perform KNN classification on the training data with 5 nearest neighbors. The model's
accuracy is then evaluated on the test data, and predictions are made on the test data.
SVM Classification
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm

# Load iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM classification with a linear kernel
clf = svm.SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)

# Evaluate accuracy on the test set
print("Accuracy:", clf.score(X_test, y_test))

# Make predictions
predictions = clf.predict(X_test)
print("Predictions:", predictions)
This program uses the load_iris function from the sklearn.datasets module to load the iris dataset. It
then splits the dataset into training and test sets using the train_test_split function. The SVC class
from the sklearn.svm module is then used to perform SVM classification on the training data with a
linear kernel and a regularization parameter C of 1. The model's accuracy is then evaluated on the
test data, and predictions are made on the test data.
COURSE OUTCOMES:
REFERENCES:
1. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data
Science and its Applications”, Wiley Publication, 1st Edition, 2014.
2. Subhashini Chellappan, Seema Acharya, “Big Data and Analytics”, Wiley
Publication, 2nd Edition, 2019.
3. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, “Mining of Massive
Datasets”, Cambridge University Press, 2nd Edition, 2014.
4. Rajkumar Buyya, Christian Vecchiola, S. Thamarai Selvi, “Mastering Cloud
Computing”, McGraw Hill Education, 1st Edition, 2017.
5. Rajiv Misra, Yashwant Singh Patel, “Cloud and Distributed Computing:
Algorithms and Systems”, Wiley, 1st Edition, 2020.