Using CountVectorizer to Extract Features from Text

CountVectorizer, from scikit-learn, transforms a text into a numeric vector based on the frequency (count) of each word that occurs in it. This is helpful when we have multiple texts and want to convert each one into a vector of word counts for use in further text analysis.

Visualizing CountVectorizer

CountVectorizer creates a matrix in which each unique word is a column and each text sample in the document is a row. The value of each cell is the count of that word in that text sample.

document = [ "One Geek helps Two Geeks", "Two Geeks help Four Geeks", "Each Geek helps many other Geeks at GeeksforGeeks."]

	at	each	four	geek	geeks	geeksforgeeks	help	helps	many	one	other	two
document[0]	0	0	0	1	1	0	0	1	0	1	0	1
document[1]	0	0	1	0	2	0	1	0	0	0	0	1
document[2]	1	1	0	1	1	1	0	1	1	0	1	0

There are 12 unique words in the document, each represented as a column of the table.
There are 3 text samples in the document, each represented as a row of the table.
Every cell contains a number that represents the count of the word in that particular text.
All words have been converted to lowercase.
The words in columns have been arranged alphabetically.

Implementation

Step 1: Import necessary libraries and define corpus

We will import CountVectorizer from the scikit-learn library and define our corpus as a list of strings.

Python

from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "I love machine learning",
    "I love deep learning",
    "machine learning is powerful"
]

Step 2: Define sklearn Vectorizer

We will create a CountVectorizer instance, which accepts the following parameters, among others:

ngram_range: controls whether single words, word pairs, or longer sequences are used as features.
stop_words: removes common, low-information words (e.g., “the”, “is”) from the vocabulary.
max_features: limits the vocabulary size to the most frequent terms to control dimensionality.

Python

vectorizer = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 2),
    max_features=20
)

Step 3: Fit and Transform our documents using sklearn

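We will first fit our vectorizer to the data so that it learns the vocabulary, and then use the fitted vectorizer to transform the documents into a matrix of counts. The fit_transform method performs both steps in a single call.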

Python

X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
print("Features:")
print(feature_names)
print("\nDocument-Term Matrix:")
print(X.toarray())

Output:

[Image: bigram document-term matrix]

Advantages

CountVectorizer converts text into a Bag of Words (BoW) representation, a numeric form that machine learning models can work with.
Each feature directly represents a word or n-gram count, which makes the features easy to interpret.
It often performs well with classical ML models on small to medium text datasets.
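
As an illustration of feeding these counts to a classical model, the sketch below chains CountVectorizer with a multinomial Naive Bayes classifier. The four training sentences and their labels are invented toy data, and Naive Bayes is just one possible choice of classical model, so this shows the wiring rather than a realistic benchmark.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this example.
train_texts = [
    "I love this film",
    "great acting and a great story",
    "I hate this film",
    "boring plot and terrible acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes the text, then passes the counts to the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["great film", "terrible boring film"])
print(predictions)
```

Wrapping both steps in a pipeline keeps the vocabulary learned during training tied to the classifier, so new text is always vectorized the same way.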

Limitations

No semantic understanding: it treats words independently and ignores meaning or context.
Loses sequence information unless n-grams are explicitly used.
Vocabulary can grow large, leading to sparse feature vectors.
