Lab Record DEV
1. Install the following data mining and data analysis tools: Weka, KNIME, Tableau
Public.
2. Perform exploratory data analysis (EDA) on datasets such as an email data set. Export all
your emails as a dataset, import them into a pandas data frame, visualize them and derive
insights from the data.
3. Perform Time Series Analysis with datasets like Open Power System Data.
4. Build a time-series model on a given dataset and evaluate its accuracy.
5. Perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.
6. Build cartographic visualization for multiple datasets involving various countries of the
world, states and districts in India, etc.
7. Perform text mining on a set of documents and visualize the most important words in a
visualization such as word cloud.
8. Use a case study on a data set and apply the various visualization techniques and present
an analysis report.
EX.NO.1 Install the following data mining and data analysis tools: Weka, KNIME, Tableau
Public.
Step 3: Locate the downloaded executable file in the Downloads folder of your system and run it.
Step 4: The system will ask for confirmation to make changes. Click on Yes.
Step 7: The next screen lists the components to install. All components are already
selected, so leave them unchanged and click on the Install button.
Step 8: The next screen asks for the install location. Choose a drive with
sufficient free disk space; the installation needs about 301 MB.
Step 9: The next screen asks for the Start menu folder. Leave the default
and click on the Install button.
Step 10: The installation process now starts and takes about a minute to
complete.
Step 11: Click on the Next button after the installation process is complete.
Step 12: Click on Finish to finish the installation process.
Step 13: Weka is successfully installed on the system and an icon is created on the
desktop.
Step 14: Run the software and see the interface.
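EX.NO.2 Perform exploratory data analysis (EDA) on datasets such as an email data set.
# data handling
import pandas as pd
# for visualization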
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objs as go
from wordcloud import WordCloud
# Preprocessing (sklearn)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Modeling
from sklearn.ensemble import RandomForestClassifier
from lightgbm.sklearn import LGBMClassifier
import xgboost as xgb
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Neural Network
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, GlobalMaxPooling1D, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
# scoring
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, RocCurveDisplay
# styling
plt.style.use('ggplot')
Read the dataset:
df = pd.read_csv('../input/spam-email/spam.csv')
msno.matrix(df).set_title('Distribution of missing values',fontsize=20)
# Count messages per category (assumed definition; it is missing from the listing)
category_ct = df['Category'].value_counts()
fig = px.pie(values=category_ct.values,
             names=category_ct.index,
             color_discrete_sequence=px.colors.sequential.OrRd,
             title='Pie Graph: spam or not')
fig.update_traces(hoverinfo='label+percent', textinfo='label+value+percent',
                  textfont_size=15,
                  marker=dict(line=dict(color='#000000', width=2)))
fig.show()
Length distribution of spam and ham messages
categories = pd.get_dummies(df["Category"])
spam_or_not = pd.concat([df, categories], axis=1)
spam_or_not.drop('Category',axis=1,inplace=True)
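df["length"] = df["Message"].apply(len)
# Split by category (assumed reconstruction; these lines are missing from the listing)
ham = df[df["Category"] == "ham"].reset_index()
spam = df[df["Category"] == "spam"].reset_index()
ham.drop('index',axis=1,inplace=True)
spam.drop('index',axis=1,inplace=True)
hist_data = [ham['length'],spam['length']]
group_labels = ['ham','spam']
# Build the distribution plot (assumed: plotly figure_factory distplot)
fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
# Add title
fig.update_layout(title_text='Length distribution of ham and spam messages',
                  template='simple_white')
fig.show()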
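The exercise statement also asks for exporting your own emails and loading them into a data frame, which the listing above skips by reading a prepared CSV file. A minimal sketch of that step follows, using only the Python standard library. The two raw messages, the column names and the helper message_to_row are hypothetical and only stand in for a real mailbox export.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Two made-up raw messages standing in for a real mailbox export (hypothetical data).
raw_messages = [
    "From: alice@example.com\n"
    "Subject: Lab schedule\n"
    "Date: Mon, 02 Jan 2023 09:30:00 +0000\n"
    "\n"
    "The lab starts at ten.",
    "From: bob@example.com\n"
    "Subject: Free prize\n"
    "Date: Tue, 03 Jan 2023 18:05:00 +0000\n"
    "\n"
    "Claim your free prize now!",
]


def message_to_row(raw):
    """Parse one raw message into a flat dict (one future data frame row)."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a plain string for non-multipart messages
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "date": parsedate_to_datetime(msg["Date"]),
        "length": len(body),
    }


rows = [message_to_row(raw) for raw in raw_messages]
for row in rows:
    print(row["date"].date(), row["sender"], row["subject"], row["length"])
```

The list of dicts can then be passed to pd.DataFrame(rows) for the plots used in this exercise. For a real export in mbox format, the standard library mailbox module could supply the messages instead of the hard-coded strings.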
RESULT: Thus exploratory data analysis (EDA) on an email data set is
successfully executed.
EX.NO.3: Perform Time Series Analysis with datasets like Open Power System Data.
Code:
print(data.head())
print('\n')
print(data.columns)
print('\n')
print(data.info())
print('\n')
print(data.describe())
for x in axes:
Result:
Thus the time series analysis is performed and visualized.
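The listing for this exercise is incomplete, so the following is a minimal, self-contained sketch of typical time series steps (resampling, rolling mean, grouping by weekday) with pandas. It does not use the Open Power System Data file: the synthetic series, the name "load" and the weekly pattern are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a 7-day cycle (hypothetical stand-in for a real column).
idx = pd.date_range("2020-01-01", periods=56, freq="D")
values = 100 + 10 * np.sin(2 * np.pi * np.arange(56) / 7)
series = pd.Series(values, index=idx, name="load")

# Downsample to weekly means (weeks end on Sunday by default).
weekly_mean = series.resample("W").mean()

# A centered 7-day rolling mean smooths out the weekly cycle.
rolling_7d = series.rolling(window=7, center=True).mean()

# Average value per weekday (0 = Monday) exposes the weekly seasonality.
by_weekday = series.groupby(series.index.dayofweek).mean()

print(weekly_mean.head())
print(rolling_7d.dropna().head())
print(by_weekday)
```

With a real dataset, the same three calls would be applied to a column of a data frame that has a DatetimeIndex, for example one loaded with pd.read_csv(..., index_col=0, parse_dates=True).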
EX.NO.4 Build a time-series model on a given dataset and evaluate its accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
data = pd.read_csv('Healthcare-Diabetes.csv')
data.head()
X = data.drop("Outcome", axis=1)
y = data["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()
# Fit the model on the scaled training data
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)
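import matplotlib.pyplot as plt
import seaborn as sns
# Assumed reconstruction: per-class accuracy (recall) from the confusion matrix,
# taking Outcome 0 as "No Diabetes" and Outcome 1 as "Diabetes"
labels = ['No Diabetes', 'Diabetes']
accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
plt.bar(labels, accuracies)
plt.ylabel('Accuracy')
plt.title('Accuracy for Diabetes and No Diabetes')
plt.ylim(0, 1) # Set y-axis range to 0-1
plt.show()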
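Since the exercise title refers to a time-series model, the following is a minimal sketch (not the method used in the listing above) of how a forecast could be evaluated with a chronological train/test split. The synthetic series, the AR(1) model fitted by least squares and all variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic autoregressive series (hypothetical data, fixed seed for repeatability).
rng = np.random.default_rng(0)
n = 200
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + rng.normal(scale=1.0)

# Chronological split: the last 20% is held out, with no shuffling.
split = int(n * 0.8)
train, test = series[:split], series[split:]

# Fit y_t = a * y_{t-1} + b on the training part by least squares.
a, b = np.polyfit(train[:-1], train[1:], deg=1)

# One-step-ahead forecasts for the test part, each using the previous true value.
prev = series[split - 1:-1]
pred = a * prev + b

# Mean absolute error of the model and of a naive "repeat last value" baseline.
mae_model = np.mean(np.abs(test - pred))
mae_naive = np.mean(np.abs(test - prev))
print("AR(1) coefficient:", a)
print("MAE model:", mae_model)
print("MAE naive:", mae_naive)
```

Comparing against the naive baseline is a design choice: an error figure alone says little unless it is set against the simplest possible forecast.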
RESULT: Thus the time series model is built and its accuracy is visualized.
EX.NO.5 Perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.
import folium
# Create a map
m_1 = folium.Map(location=[42.32,-71.0589], tiles='openstreetmap', zoom_start=10)
# Create a map
m_2 = folium.Map(location=[42.32,-71.0589], tiles='cartodbpositron', zoom_start=13)
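EX.NO.6 Build cartographic visualization for multiple datasets involving various countries of the
world, states and districts in India, etc.
import pandas as pd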
import altair as alt
from vega_datasets import data
Up to this point, we have worked with JSON and CSV formatted datasets that correspond to
data tables made up of rows (records) and columns (fields). In order to represent geographic
regions (countries, states, etc.) and trajectories (flight paths, subway lines, etc.), we need to
expand our repertoire with additional formats designed to support rich geometries.
Here is a GeoJSON feature object for the boundary of the U.S. state of Colorado:
{
"type": "Feature",
"id": 8,
"properties": {"name": "Colorado"},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[-106.32056285448942,40.998675790862656],[-106.19134826714341,40.99813863734313],
[-105.27607827344248,40.99813863734313],[-104.9422739227986,40.99813863734313],
[-104.05212898774828,41.00136155846029],[-103.57475287338661,41.00189871197981],
[-103.38093099236758,41.00189871197981],[-102.65589358559272,41.00189871197981],
[-102.62000064466328,41.00189871197981],[-102.052892177978,41.00189871197981],
[-102.052892177978,40.74889940428302],[-102.052892177978,40.69733266640851],
[-102.052892177978,40.44003613055551],[-102.052892177978,40.3492571857556],
[-102.052892177978,40.00333031918079],[-102.04930288388505,39.57414465707943],
[-102.04930288388505,39.56823596836465],[-102.0457135897921,39.1331416175485],
[-102.0457135897921,39.0466599009048],[-102.0457135897921,38.69751011321283],
[-102.0457135897921,38.61478847120581],[-102.0457135897921,38.268861604631],
[-102.0457135897921,38.262415762396685],[-102.04212429569915,37.738153927339205],
[-102.04212429569915,37.64415206142214],[-102.04212429569915,37.38900413964724],
[-102.04212429569915,36.99365914927603],[-103.00046581851544,37.00010499151034],
[-103.08660887674611,37.00010499151034],[-104.00905745863294,36.99580776335414],
[-105.15404227428235,36.995270609834606],[-105.2222388620483,36.995270609834606],
[-105.7175614468747,36.99580776335414],[-106.00829426840322,36.995270609834606],
[-106.47490250048605,36.99365914927603],[-107.4224761410235,37.00010499151034],
[-107.48349414060355,37.00010499151034],[-108.38081766383978,36.99903068447129],
[-109.04483707103458,36.99903068447129],[-109.04483707103458,37.484617466122884],
[-109.04124777694163,37.88049961001363],[-109.04124777694163,38.15283644441336],
[-109.05919424740635,38.49983761802722],[-109.05201565922046,39.36680339854235],
[-109.05201565922046,39.49786885730673],[-109.05201565922046,39.66062637372313],
[-109.05201565922046,40.22248895514744],[-109.05201565922046,40.653823231326896],
[-109.05201565922046,41.000287251421234],[-107.91779872584989,41.00189871197981],
[-107.3183866123281,41.00297301901887],[-106.85895696843116,41.00189871197981],
[-106.32056285448942,40.998675790862656]
]
]
}
}
The feature includes a properties object, which can include any number of data fields, plus
a geometry object, which in this case contains a single polygon that consists
of [longitude, latitude] coordinates for the state boundary.
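To make the structure concrete, here is a minimal sketch that parses a GeoJSON feature with the standard json module and computes its bounding box. The rectangle is a simplified, hypothetical stand-in for the full Colorado boundary, and bounding_box is an illustrative helper, not part of any library.

```python
import json

# Simplified, hypothetical rectangle standing in for the full state boundary.
feature_text = """
{
  "type": "Feature",
  "id": 8,
  "properties": {"name": "Colorado"},
  "geometry": {
    "type": "Polygon",
    "coordinates": [[
      [-109.05, 41.0], [-102.05, 41.0], [-102.05, 37.0],
      [-109.05, 37.0], [-109.05, 41.0]
    ]]
  }
}
"""


def bounding_box(feature):
    """Return (min_lon, min_lat, max_lon, max_lat) of a Polygon feature."""
    ring = feature["geometry"]["coordinates"][0]  # first ring is the exterior
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return min(lons), min(lats), max(lons), max(lats)


feature = json.loads(feature_text)
name = feature["properties"]["name"]
bbox = bounding_box(feature)
print(name, bbox)
```

Note that each coordinate pair is ordered longitude first, then latitude, and that the ring repeats its first point at the end to close the polygon.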
Let’s load a TopoJSON file of world countries (at 1:110 million scale):
world = data.world_110m.url
world
'https://cdn.jsdelivr.net/npm/[email protected]/data/world-110m.json'
world_topo = data.world_110m()
world_topo.keys()
dict_keys(['type', 'transform', 'objects', 'arcs'])
world_topo['type']
'Topology'
world_topo['objects'].keys()
dict_keys(['land', 'countries'])
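The 'transform' and 'arcs' keys listed above work together: in a quantized TopoJSON file each arc stores its first point as integers and every later point as a delta from the previous one, and the transform converts the running totals into longitude and latitude. A minimal sketch of that decoding follows. The toy topology and the helper decode_arc are hypothetical and are not taken from the world-110m file.

```python
def decode_arc(arc, transform):
    """Convert one delta-encoded, quantized arc into (lon, lat) tuples."""
    scale_x, scale_y = transform["scale"]
    translate_x, translate_y = transform["translate"]
    x = y = 0
    points = []
    for dx, dy in arc:
        # Accumulate the deltas to recover the absolute quantized position.
        x += dx
        y += dy
        points.append((x * scale_x + translate_x, y * scale_y + translate_y))
    return points


# Toy topology with a single rectangular arc (made-up numbers).
toy_topology = {
    "type": "Topology",
    "transform": {"scale": [0.5, 0.5], "translate": [-10.0, 40.0]},
    "objects": {},
    "arcs": [[[0, 0], [4, 0], [0, 2], [-4, 0], [0, -2]]],
}

ring = decode_arc(toy_topology["arcs"][0], toy_topology["transform"])
print(ring)
```

Because shared borders are stored once as arcs and referenced by several shapes, TopoJSON files are usually smaller than the equivalent GeoJSON.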
alt.topo_feature(world, 'countries')
{
"values": world,
"format": {"type": "topojson", "feature": "countries"}
}
Geoshape Marks
To visualize geographic data, Altair provides the geoshape mark type. To create a basic map, we
can create a geoshape mark and pass it our TopoJSON data, which is then unpacked into
GeoJSON features, one for each country of the world:
alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape()
In the example above, Altair applies a default blue color and a default map projection
(mercator in older releases; recent Vega-Lite versions default to equalEarth). We can customize
the colors and boundary stroke widths using standard mark properties. Using the project method
we can also choose our own map projection:
alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
).project(
type='mercator')
By default Altair automatically adjusts the projection so that all the data fits within the width
and height of the chart. We can also specify projection parameters, such as scale (zoom level)
and translate (panning), to customize the projection settings. Here we adjust
the scale and translate parameters to focus on Europe:
alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
).project(
type='mercator', scale=400, translate=[100, 550]
)
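To see what scale and translate do numerically, here is a simplified sketch of the spherical Mercator formula, assuming no rotation, centering or clipping. The function mercator_pixel and the sample point are hypothetical and only illustrate the arithmetic; they are not part of Altair.

```python
import math


def mercator_pixel(lon_deg, lat_deg, scale, translate):
    """Map longitude/latitude in degrees to pixel coordinates (simplified Mercator)."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    # scale stretches the projected values; translate shifts the origin in pixels.
    x = scale * lam + translate[0]
    # Pixel y grows downward, so northern latitudes get smaller y values.
    y = translate[1] - scale * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


# The point (0, 0) lands exactly on the translate offset.
origin = mercator_pixel(0, 0, scale=400, translate=[100, 550])

# A hypothetical point in western Europe (about 48.85 N, 2.35 E).
point = mercator_pixel(2.35, 48.85, scale=400, translate=[100, 550])
print(origin, point)
```

Under these assumptions, a larger scale zooms in, while translate decides which pixel the geographic origin is drawn at. That is why shifting the origin down to y = 550 brings European latitudes into view.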
EX.NO.7 Perform text mining on a set of documents and visualize the most important words in a
visualization such as word cloud.
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
import string
import collections
import pandas as pd
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline
Reading the File and Understanding the Data
# Load the data file
df = pd.read_csv('emails2.csv')
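The listing stops after loading the file, so the following is a minimal sketch of the text mining step itself: counting word frequencies with the string and collections modules imported above. The sample documents, the small stop word set and the helper word_frequencies are hypothetical; with the real data, the documents would come from a text column of df.

```python
import collections
import string

# Small illustrative stop word set (hypothetical; wordcloud's STOPWORDS is larger).
STOPWORDS_SMALL = {"the", "a", "to", "is", "and", "of", "your", "you", "for", "now", "at"}


def word_frequencies(documents, stopwords=STOPWORDS_SMALL):
    """Count lower-cased, punctuation-free words across all documents."""
    counts = collections.Counter()
    strip_punct = str.maketrans("", "", string.punctuation)
    for doc in documents:
        for word in doc.lower().translate(strip_punct).split():
            if word not in stopwords:
                counts[word] += 1
    return counts


# Made-up sample documents.
sample_docs = [
    "Win a free prize now! Claim your free prize.",
    "Meeting moved to 3pm, see the agenda.",
    "Free entry: win cash now.",
]

freqs = word_frequencies(sample_docs)
print(freqs.most_common(3))
```

The resulting counts could then be turned into a word cloud with WordCloud().generate_from_frequencies(freqs) and displayed with plt.imshow, assuming the wordcloud package is installed.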
RESULT: Thus text mining is performed on a set of documents and the most important words
are visualized as a word cloud.