Feature Selection
Feature Analysis
Furthermore, let's check the covariances between the features and the class:
# manually verify the correlation between the features and the class
iris_cov = iris_norm_.cov()
sns.heatmap(iris_cov, annot=True, cbar=False)
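Here iris_norm_ is assumed to be the normalized feature DataFrame from the previous section with the class labels appended as an extra column, so that .cov() covers the class as well; something like:

# assumption: append the class labels to the normalized features
# so the covariance matrix includes feature-vs-class entries
iris_norm_ = iris_norm.copy()
iris_norm_['class'] = target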
Feature Selection
From the covariance heatmap, we can see that 'sepal width' is the least
relevant to the class. This explains why classes 1 and 2 are tangled
in the pairplot from the previous section.
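The selection step itself is not shown here; a minimal sketch, assuming scikit-learn's SelectKBest with the ANOVA F-test (f_classif), which is what produces per-feature scores and p-values, would be:

from sklearn.feature_selection import SelectKBest, f_classif

# keep the 3 best features, scored against the class with the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=3)
iris_trim = selector.fit_transform(iris_norm, target)
print(selector.scores_)   # per-feature F-scores
print(selector.pvalues_)  # per-feature p-values
print(iris_trim.shape)    # (150, 3)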
As you can see, the second feature has the lowest score and the largest
p-value. The resulting dataset has shape 150 x 3: the second feature
(sepal width) was removed.
We can now draw a 3D chart of the three remaining features for a more
intuitive view.
from mpl_toolkits import mplot3d

fig = plt.figure(figsize=(8, 8))
ax = plt.axes(projection='3d')
# color each point by its class label
ax.scatter3D(iris_trim[:, 0], iris_trim[:, 1], iris_trim[:, 2],
             c=target, cmap='Accent', marker='>')
Validation
Now let's compare the 4-feature and 3-feature cases. Define a
training and validation function first, then prepare both datasets.
def train_and_validate(X_train, X_test, y_train, y_test):
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_calc = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    #print(y_prob)
    # after the transpose, rows are predicted labels, columns true labels
    mat = confusion_matrix(y_test, y_calc)
    sns.heatmap(mat.T, annot=True, cbar=False)

X_train4, X_test4, y_train, y_test = train_test_split(
    iris_norm, target, test_size=0.10, stratify=None, random_state=0)
X_train3 = X_train4.drop(['sepal width (cm)'], axis=1)
X_test3 = X_test4.drop(['sepal width (cm)'], axis=1)
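The calls themselves are not shown in the original listing, but the comparison below implies one run of the helper per dataset:

train_and_validate(X_train4, X_test4, y_train, y_test)  # 4 features
train_and_validate(X_train3, X_test3, y_train, y_test)  # 3 features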
As we can see, the reduced feature set gives the better result. In the
confusion matrices, the 3-feature model yields 100% accuracy, while
the 4-feature model misclassifies one sample.
I changed random_state to generate different splits and repeated the
process; the 3-feature dataset performs at least as well as, and often
better than, the 4-feature dataset.
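That repetition can be automated; here is a small sketch (not part of the original code) that loops over a few seeds and compares plain accuracies with sklearn's accuracy_score:

from sklearn.metrics import accuracy_score

for seed in range(5):
    # fresh 90/10 split for each seed
    X_tr4, X_te4, y_tr, y_te = train_test_split(
        iris_norm, target, test_size=0.10, random_state=seed)
    X_tr3 = X_tr4.drop(['sepal width (cm)'], axis=1)
    X_te3 = X_te4.drop(['sepal width (cm)'], axis=1)
    acc4 = accuracy_score(y_te, GaussianNB().fit(X_tr4, y_tr).predict(X_te4))
    acc3 = accuracy_score(y_te, GaussianNB().fit(X_tr3, y_tr).predict(X_te3))
    print(f'seed={seed}: 4-feature acc={acc4:.3f}, 3-feature acc={acc3:.3f}')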
Conclusion