Using CountVectorizer to Extract Features from Text

Last Updated : 19 Jan, 2026

CountVectorizer transforms a text into a vector based on the frequency (count) of each word that occurs in it. This is helpful when we have multiple texts and want to convert each of them into a numeric vector for further text analysis.

Visualizing Count Vectorizer

CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample from the document is a row. The value of each cell is the count of that word in that particular text sample.

document = [ "One Geek helps Two Geeks", "Two Geeks help Four Geeks", "Each Geek helps many other Geeks at GeeksforGeeks."]

             at  each  four  geek  geeks  geeksforgeeks  help  helps  many  one  other  two
document[0]   0     0     0     1      1              0     0      1     0    1      0    1
document[1]   0     0     1     0      2              0     1      0     0    0      0    1
document[2]   1     1     0     1      1              1     0      1     1    0      1    0
  • There are 12 unique words in the document, represented as columns of the table.
  • There are 3 text samples in the document, each represented as rows of the table.
  • Every cell contains a number that represents the count of the word in that particular text.
  • All words have been converted to lowercase.
  • The words in columns have been arranged alphabetically.

Implementation

Step 1: Import necessary libraries and define corpus

We import CountVectorizer from the scikit-learn library and define our corpus as a list of strings.

Python
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "I love machine learning",
    "I love deep learning",
    "machine learning is powerful"
]

Step 2: Define sklearn Vectorizer

Next, we create a CountVectorizer instance. Its constructor accepts, among others, the following parameters:

  • ngram_range: controls whether single words, word pairs, or longer sequences are used as features.
  • stop_words: removes common, low-information words (e.g., “the”, “is”) from the vocabulary.
  • max_features: limits the vocabulary size to the most frequent terms to control dimensionality.
Python
vectorizer = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 2),
    max_features=20
)

Step 3: Fit and Transform our documents using sklearn

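We first fit the vectorizer to the data so that it learns the vocabulary (the set of words and n-grams that become columns), and then use the fitted vectorizer to transform the documents into a document-term matrix. The fit_transform method performs both steps in a single call.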

Python
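# Learn the vocabulary and build the document-term matrix in one call
X = vectorizer.fit_transform(documents)
# Feature names (unigrams and bigrams), in column order
feature_names = vectorizer.get_feature_names_out()
print("Features:")
print(feature_names)
print("\nDocument-Term Matrix:")
# X is a sparse matrix; convert it to a dense array for display
print(X.toarray())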

Output:

[Figure: Bigram document matrix]

Advantages

  1. CountVectorizer converts text into a numeric Bag of Words (BoW) representation that machine learning models can consume.
  2. Each feature directly represents a word or n-gram count.
  3. Performs well with classical ML models on small to medium text datasets.
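As a rough illustration of the third point, the sketch below chains CountVectorizer with a Naive Bayes classifier. The choice of MultinomialNB and the four toy reviews are assumptions made for this example, not a recommendation from the article, and a dataset this small says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for this illustration
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful acting boring plot",
]
train_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes the text, then feeds the counts to the classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# The words "great" and "wonderful" only occur in the positive examples
prediction = model.predict(["great wonderful acting"])[0]
print(prediction)
```

Wrapping both steps in a pipeline means the same vocabulary is applied at prediction time as was learned during training.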

Limitations

  1. No semantic understanding: it treats words independently and ignores meaning or context.
  2. Loses sequence information unless n-grams are explicitly used.
  3. Vocabulary can grow large, leading to sparse feature vectors.
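The loss of sequence information can be seen in a small sketch. The two sentences below are made up for illustration: they contain the same words in a different order, so their unigram counts are identical, while adding bigrams makes them distinguishable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning
sentences = ["dog bites man", "man bites dog"]

# Unigram counts: both sentences contain the same words, so the rows are identical
unigram_rows = CountVectorizer().fit_transform(sentences).toarray()
same_with_unigrams = (unigram_rows[0] == unigram_rows[1]).all()

# Adding bigrams keeps some local word order, so the rows now differ
bigram_rows = CountVectorizer(ngram_range=(1, 2)).fit_transform(sentences).toarray()
same_with_bigrams = (bigram_rows[0] == bigram_rows[1]).all()

print(same_with_unigrams, same_with_bigrams)  # True False
```

The price of the extra bigram columns is a larger vocabulary, which ties into the third limitation above.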