Amazon Reviews Analysis

Data Acquisition

I began by working with a SQLite database containing over 560,000 Amazon customer reviews. Using Python’s sqlite3 library and Pandas, I extracted the relevant data fields: review score and review text. Since my focus was on negative experiences, I filtered reviews with a score of 1 and then randomly sampled 20,000 reviews to balance the dataset size with computational feasibility.

Text Preprocessing

To prepare the data for modeling, I implemented a structured NLP preprocessing pipeline:

Lowercasing and cleaning text with regular expressions (re).
Stopword removal using NLTK’s stopwords.
Lemmatization with WordNetLemmatizer to reduce words to their root forms.
Removal of irrelevant frequent terms such as “product”, “amazon”, and “good” to sharpen topic quality.

The result was a corpus of normalized, meaningful tokens better suited for topic modeling.

Topic Modeling with BERTopic

I used BERTopic, a state-of-the-art algorithm that combines:

Sentence embeddings (transforming text into semantic vectors).
Dimensionality reduction (I applied PCA instead of UMAP to avoid MacOS-related segmentation faults).
Clustering to group semantically similar reviews.
Class-based TF-IDF to generate representative keywords for each cluster.

This process produced multiple coherent clusters of reviews, each reflecting a dominant theme in customer complaints.

Reducing Outliers

A major challenge in BERTopic is that many documents are often marked as outliers. I leveraged BERTopic’s reduce_outliers functionality to reassign reviews where possible, reducing noise and increasing the stability of clusters. This ensured that the majority of the 20,000 reviews were represented within meaningful topics instead of being discarded.

Human-Readable Topic Labels with OpenAI

While BERTopic generates keywords as topic representations, these are often too fragmented to communicate meaning clearly. To address this, I integrated OpenAI’s GPT models:

Extracted keywords from each cluster.
Prompted GPT to generate a concise 2–4 word label summarizing the cluster.
Stored the LLM-generated labels alongside BERTopic’s results in a CSV file.

This step transformed raw keywords into intuitive, human-readable topics (e.g., “Late Delivery Issues” or “Damaged Packaging”).

Visualization and Reporting

To make the analysis interpretable, I exported outputs as:

CSV files containing topics, keywords, and AI-generated labels.
Interactive HTML visualizations (bar charts, topic maps, and hierarchies) directly from BERTopic.

These visualizations enable stakeholders to explore customer feedback themes interactively and prioritize areas for improvement.

Technologies Used

Python (core scripting and analysis).
Pandas (data handling).
NLTK (text preprocessing).
BERTopic (topic modeling and clustering).
scikit-learn PCA (dimensionality reduction).
OpenAI API (LLM-based topic labeling).
SQLite3 (data extraction).
HTML/CSS via Plotly (embedded in BERTopic) for visualization.

Key Outcomes

Identified actionable customer complaint categories.
Reduced over 18,000 potential outliers into meaningful clusters.
Produced interpretable and concise topic labels with the help of LLMs.
Created a reusable, modular pipeline for future customer review analyses.

Project Link:

GitHub