Text Clustering with Sklearn

This lecture covers text clustering with Sklearn.

Text clustering is an unsupervised machine learning technique that groups a set of documents into clusters based on their content. It lets you organize a large collection of documents into meaningful groups, which is useful for document organization, topic modeling, and discovering hidden patterns in textual data.

Key Steps in Text Clustering:

Text Preprocessing: The raw text data must be converted into a form suitable for clustering. This typically includes removing stop words, stemming or lemmatization, and finally vectorizing the text.

Feature Extraction: Convert the text data into a numerical representation. Commonly used techniques include TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings.

Clustering Algorithm: Apply an algorithm such as K-means, DBSCAN, or hierarchical clustering.

Evaluation: Measure cluster quality with metrics such as the silhouette score or inertia, or inspect the clusters visually.
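The evaluation step above can be sketched with a minimal, self-contained example (the four toy documents here are made up for illustration): the silhouette score ranges from -1 to 1, with higher values indicating better-separated clusters, while inertia is the sum of squared distances from each sample to its nearest centroid.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = ["cats purr", "dogs bark", "cats meow", "dogs howl"]
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Silhouette score: higher is better (range -1 to 1)
score = silhouette_score(X, km.labels_)
print(round(score, 3))

# Inertia: lower is better, but always compare across the same data
print(round(km.inertia_, 3))
```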

Example Code Using Sklearn

Here’s a step-by-step example of text clustering using Sklearn with the K-means algorithm and TF-IDF vectorization:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Sample data
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
]

# Step 1: Text Preprocessing & Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Step 2: Clustering
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)  # explicit n_init silences a warning in recent sklearn versions
kmeans.fit(X)

# Step 3: Analyze the clusters
clusters = kmeans.labels_

# Display the results
document_clusters = pd.DataFrame({'Document': documents, 'Cluster': clusters})
print(document_clusters)

# Optional: Visualize the clusters
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(X.toarray())
colors = ["r", "b", "g", "c", "m", "y", "k", "lime", "orange", "purple"]

plt.figure(figsize=(10, 7))
for i in range(num_clusters):
    points = scatter_plot_points[clusters == i]
    plt.scatter(points[:, 0], points[:, 1], s=50, c=colors[i], label=f'Cluster {i}')

plt.legend()
plt.title('K-means Clustering of Documents')
plt.show()

Explanation:

Sample Data: Lists of text documents to be clustered.

Text Preprocessing & Feature Extraction: TfidfVectorizer converts the text documents into TF-IDF features used as input for clustering.

Clustering: The K-means algorithm groups the documents into the specified number of clusters, num_clusters.

Analyzing the Clusters: Printing the cluster labels shows which document belongs to which cluster.

Visualization: PCA (Principal Component Analysis) reduces the TF-IDF features to two dimensions for plotting; the clusters are color-coded.

Output

                                        Document  Cluster
0                    This is the first document.        1
1          This document is the second document.        1
2                     And this is the third one.        0
3                    Is this the first document?        1
4   The quick brown fox jumps over the lazy dog.        0
5          Never jump over the lazy dog quickly.        0

This example shows how to preprocess text, extract features, apply a clustering algorithm, and visualize the result using Python with Sklearn.
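Once the model is fitted, it can also assign a cluster to unseen text. The key point, sketched below on the same toy data, is to call transform (not fit_transform) on the already-fitted vectorizer so the new document is mapped into the same feature space; the sample sentence is made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# transform (not fit_transform) keeps the vocabulary learned during fitting
new_doc = ["The dog jumps over the fox."]
label = kmeans.predict(vectorizer.transform(new_doc))
print(label[0])
```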

 
