Consumers are inundated with more information than ever before. Faced with this volume, our brains have learned to tune much of it out or simply become confused by it (Ozkan & Tolon, 2015). This has further complicated the field of marketing, and businesses must now leverage analytics to better understand their customers and how to attract them.
In order to compete in this state of information overload, marketers have evolved. Enter segmentation.
What is segmentation?
Segmentation, either market or customer segmentation, has become a staple in the modern marketer’s toolbox. Market segmentation is the process of grouping consumers based on meaningful similarities (Miller, 2015). Segments are typically identified by geographic, demographic, psychographic, or behavioral characteristics.
Segmentation is used to inform several parts of a business, including product development, marketing campaigns, direct marketing, customer retention, and process optimization (Siegel, 2013).
Put simply, segmentation allows you to better understand your customers.
Benefits and drawbacks of segmentation
We’ve written about this before: the screenshot below is linked to our extensive (and less technical) article on customer segmentation. If you want to learn about segmentation but numbers and code make you uncomfortable, check out that gentler guide.

With that said, we need to keep the drawbacks of cluster analysis for customer segmentation in mind. Below is a screenshot from the book Data Science for Marketing Analytics discussing the disadvantages of clustering.

Customer Segmentation Methods
There are numerous methods for performing segmentation, varying in rigor, data requirements, and purpose. The following methods are some of the most broadly used, but this is not an exhaustive list. There are papers discussing artificial neural networks, particle swarm optimization, and complex ensemble models, but I haven’t included them here because I have limited exposure to them. In future articles, I may dive into some of these other methods, but for now, these more common methods should suffice.
Each of the following sections will include a basic explanation of the method, as well as a basic coding example of the method applied. If you’re not technical, that’s fine; just skip over the code and you should still get a decent handle on each of the 4 approaches to segmentation we’re covering in this article.
Cluster Analysis
Cluster analysis is a method of grouping, or clustering, consumers based on their similarities.
There are 2 primary types of cluster analysis leveraged in market segmentation: hierarchical cluster analysis and partitioning (Miller, 2015). For now, we’re going to discuss a partitioning cluster method called k-means.
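Hierarchical clustering won’t get a full walkthrough in this article, but for a quick taste of the contrast, here’s a minimal sketch using scikit-learn’s agglomerative (hierarchical) implementation on a tiny made-up dataset. The data and parameter choices are purely illustrative, not part of the example that follows.
from sklearn.cluster import AgglomerativeClustering
import numpy as np
# Toy data: five customers described by two arbitrary behavioral metrics
toy = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 8.8], [4.0, 4.5]])
# Agglomerative clustering keeps merging the closest points/groups until 2 clusters remain
hier = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = hier.fit_predict(toy)
print(labels)  # one cluster label per customer; the exact label values may vary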

Technical Preface
The code below was written and run in a Jupyter notebook using Python 3.x and several Python packages for structuring, processing, analyzing, and visualizing the data.
Most of the code below is from the GitHub repository for the book Hands-On Data Science for Marketing. The book is available on Amazon or O’Reilly if you have a subscription.
The open-source dataset used in the following code came from UC Irvine’s Machine Learning Repository.
Import Packages and Data
To get started, we import the packages needed to execute our analysis and then import the xlsx (Excel spreadsheet) data file. If you want to follow along with the same data, you’ll need to download it from UCI. For this example, I put the xlsx file in the folder (directory) where I launched the Jupyter notebook.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Read in the Excel spreadsheet using pandas
df = pd.read_excel('Online Retail.xlsx', sheet_name='Online Retail')
df.head() # take a look at the first 5 rows in the DataFrame
As you can see, we have 8 columns of data for each row and each row represents an item purchased. This isn’t that helpful yet, so let’s clean and organize this data in a way that allows us to formulate more actionable insights.
Data Cleanup
Below, we are going to remove data that isn’t helpful, is missing, or may cause issues later.
# Drop cancelled orders
df = df.loc[df['Quantity'] > 0]
# Drop records without CustomerID
df = df[pd.notnull(df['CustomerID'])]
# Drop the final, incomplete month of data (December 2011)
df = df.loc[df['InvoiceDate'] < '2011-12-01']
# Calculate total sales from the Quantity and UnitPrice
df['Sales'] = df['Quantity'] * df['UnitPrice']
Now let’s transform the data so that each record represents a single customer’s purchase history.
# use groupby to aggregate sales by CustomerID
customer_df = df.groupby('CustomerID').agg({'Sales': sum,
                                            'InvoiceNo': lambda x: x.nunique()})
# Rename the aggregated columns
customer_df.columns = ['TotalSales', 'OrderCount']
# Create a new column 'AvgOrderValue'
customer_df['AvgOrderValue'] = customer_df['TotalSales'] / customer_df['OrderCount']
customer_df.head()
Nice! We now have a DataFrame with total sales, order count, and average order value for each customer. But we’re not home free yet.
Normalize the data
Clustering algorithms like K-means are sensitive to the scales of the data used, so we’ll want to normalize the data.
Below is a screenshot from part of a StackExchange answer discussing why standardization or normalization is necessary for data used in K-means clustering. The screenshot is linked to the StackExchange question, so you can click on it and read the entirety of the discussion if you’d like more information.

# Rank-transform each metric to dampen the effect of extreme outliers,
# then standardize the ranks to mean 0 and standard deviation 1
rank_df = customer_df.rank(method='first')
normalized_df = (rank_df - rank_df.mean()) / rank_df.std()
normalized_df.head(10)
Our data is now normalized, with each metric scaled to roughly between -2 and 2. Now let’s get to clustering.
Select the optimal number of clusters
Alright, we’re ready to run cluster analysis. But first, we need to figure out how many clusters we want to use. There are several approaches to selecting the number of clusters to use, but I’m going to cover two in this article: (1) silhouette coefficient, and (2) the elbow method.
Silhouette
For a quick rundown on silhouette, here’s the relevant excerpt from the Wikipedia article on Silhouette (clustering); if you want to know more about the topic, follow the Wikipedia link in the reference list.
“Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.” (Wikipedia: Silhouette (clustering))
Now that we know more about the silhouette coefficient, let’s dive into implementing the code so we can find the ideal number of clusters.
# Use silhouette coefficient to determine the best number of clusters
from sklearn.metrics import silhouette_score
for n_cluster in [4, 5, 6, 7, 8]:
    kmeans = KMeans(n_clusters=n_cluster).fit(
        normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']])
    silhouette_avg = silhouette_score(
        normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']],
        kmeans.labels_)
    print('Silhouette Score for %i Clusters: %0.4f' % (n_cluster, silhouette_avg))
The 4-cluster solution had the highest silhouette coefficient, indicating that 4 is likely the best number of clusters. But we’re going to double-check that with the elbow method.
The Elbow Method with the Sum of Squared Errors (SSE)
Instead of dragging you through a ton of math and a clunky explanation, I’m giving you a screenshot linked to a StackOverflow answer which does a good job explaining the elbow method with SSE (here’s a link to an explanation of SSE if you’re not familiar). If you want to see the rest of the discussion, click on the image.

Alright, with an intuitive understanding of the elbow method in hand, let’s use the elbow method to see if it agrees with our previous results suggesting 4 clusters.
Note: The code block below came from the GitHub repository for the book Data Science for Marketing Analytics.
from sklearn import cluster
import numpy as np
sse = []
krange = list(range(2,11))
X = normalized_df[['TotalSales','OrderCount','AvgOrderValue']].values
for n in krange:
    model = cluster.KMeans(n_clusters=n, random_state=3)
    model.fit_predict(X)
    cluster_assignments = model.labels_
    centers = model.cluster_centers_
    sse.append(np.sum((X - centers[cluster_assignments]) ** 2))
plt.plot(krange, sse)
plt.xlabel("$K$")
plt.ylabel("Sum of Squares")
plt.show()
Based on the graph above, it looks like K=4, or 4 clusters is the optimal number of clusters for this analysis. Now let’s interpret the customer segments provided by these clusters.
Interpreting Customer Segments
kmeans = KMeans(n_clusters=4).fit(normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']])
four_cluster_df = normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']].copy(deep=True)
four_cluster_df['Cluster'] = kmeans.labels_
four_cluster_df.head(10)
Now let’s group the cluster metrics and see what we can gather from the normalized data for each cluster.
cluster1_metrics = kmeans.cluster_centers_[0]
cluster2_metrics = kmeans.cluster_centers_[1]
cluster3_metrics = kmeans.cluster_centers_[2]
cluster4_metrics = kmeans.cluster_centers_[3]
data = [cluster1_metrics, cluster2_metrics, cluster3_metrics, cluster4_metrics]
cluster_center_df = pd.DataFrame(data)
cluster_center_df.columns = four_cluster_df.columns[0:3]
cluster_center_df
Visualizing Clusters
For this next piece, we are going to visualize the clusters by putting the different columns on the x and y-axes. Let’s see what we get.
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['TotalSales'],
    c='blue')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['TotalSales'],
    c='red')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['TotalSales'],
    c='orange')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['TotalSales'],
    c='green')
plt.title('TotalSales vs. OrderCount Clusters')
plt.xlabel('Order Count')
plt.ylabel('Total Sales')
plt.grid()
plt.show()
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['AvgOrderValue'],
    c='blue')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['AvgOrderValue'],
    c='red')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['AvgOrderValue'],
    c='orange')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['OrderCount'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['AvgOrderValue'],
    c='green')
plt.title('AvgOrderValue vs. OrderCount Clusters')
plt.xlabel('Order Count')
plt.ylabel('Avg Order Value')
plt.grid()
plt.show()
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['TotalSales'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['AvgOrderValue'],
    c='blue')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['TotalSales'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['AvgOrderValue'],
    c='red')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['TotalSales'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['AvgOrderValue'],
    c='orange')
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['TotalSales'],
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['AvgOrderValue'],
    c='green')
plt.title('AvgOrderValue vs. TotalSales Clusters')
plt.xlabel('Total Sales')
plt.ylabel('Avg Order Value')
plt.grid()
plt.show()
In the first plot (TotalSales vs. OrderCount), the customers in green have low total sales AND low order counts, meaning they are all-around low-value customers. On the other hand, the customers in orange have high total sales AND high order counts, indicating they are the highest-value customers.
In the second plot, we’re looking at the average order value versus the order count. Once again, the customers in green are the lowest-value customers and the customers in orange are the highest-value customers.
You could also come at this from another angle: look at the customers in the red cluster and attempt to find ways to increase their order count with email reminders or SMS push notifications targeted based on some other identifying factors. Maybe you could email them a discount if they return within 30 days. Better yet, you could offer a delayed coupon (to be used within a specific time period) at checkout.
Likewise, with customers in the blue segment, you might want to try some cross-selling and up-selling techniques at the cart. Maybe a quick pop-up with an offer, based on market basket analysis (see the market basket analysis section below).
The third plot shows average order value versus total sales. It further substantiates the previous 2 plots, identifying the orange cluster as the highest-value customers, green as the lowest-value customers, and blue and red as high-opportunity customers.
From a growth perspective, I’d focus my attention on the blue and red clusters. I’d attempt to better understand each cluster and their granular behaviors on-site in order to identify which cluster to focus on first and inform the first few rounds of experiments.
Find the best-selling item by segment
We now have 4 segments and know how much their customers spend per order, how much they spend in total, and how many orders they place. The next thing we can do to better understand the customer segments is to identify the best-selling items within each segment.
# Select the high-value cluster (the orange cluster in the plots above).
# Note: k-means labels are arbitrary, so check which label corresponds to the
# high-value cluster in your own run before hard-coding the 2 below.
high_value_cluster = four_cluster_df.loc[four_cluster_df['Cluster'] == 2]
pd.DataFrame(df.loc[df['CustomerID'].isin(high_value_cluster.index)].groupby(
    'Description').count()['StockCode'].sort_values(ascending=False).head())
Based on this information, we now know that the Jumbo Bag Red Retrospot is the best-selling item for our highest-value cluster. With that information in hand, we can make “Other Items You Might Like” recommendations to customers within this segment. These recommendations can be taken to another level of specificity with Association Rule Mining and Market Basket Analysis, which I’ll cover below.
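If you want the same breakdown for every segment, a small loop over the cluster labels will do it. Below is a minimal sketch (my own addition, assuming the df and four_cluster_df DataFrames built above):
# Top 5 best-selling items (by line-item count) for each cluster
for cluster_id in sorted(four_cluster_df['Cluster'].unique()):
    segment_customers = four_cluster_df.loc[four_cluster_df['Cluster'] == cluster_id].index
    top_items = (df.loc[df['CustomerID'].isin(segment_customers)]
                 .groupby('Description')['StockCode']
                 .count()
                 .sort_values(ascending=False)
                 .head())
    print('Cluster %i top items:' % cluster_id)
    print(top_items)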
Wrapping-up Cluster Analysis
In this section, we ran through a basic application of K-means clustering based on the purchasing behaviors of historical customers. This type of analysis can be run for virtually any company with the requisite data: ecommerce companies, SaaS companies, service-based companies, you name it. With that said, this is a simple example, and without further testing and specific action, this information on its own is of little use.
Other Customer Segmentation Methods
In the coming weeks, I plan on updating this article with more robust explanations and code examples for each of the following methods.
Chi-square Automatic Interaction Detector (CHAID)
CHAID is a decision tree classification method that creates nodes, or groupings of consumers, enabling analysis of smaller groups (McCarty & Hastak, 2007).
If you’d like to use Python to implement CHAID, this GitHub repository looks like a strong starting point.
Code example + pros and cons for CHAID coming.
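In the meantime, here’s a rough, stand-in sketch of the general idea using scikit-learn’s CART decision tree. To be clear, this is not true CHAID (CHAID splits on chi-square tests, which scikit-learn does not implement), and the toy DataFrame and column names below are hypothetical.
# Stand-in sketch (NOT true CHAID): a CART decision tree whose leaf nodes act as customer segments
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
toy = pd.DataFrame({
    'age': [25, 34, 52, 46, 23, 61, 38, 57],
    'income': [40, 62, 85, 70, 35, 90, 58, 77],
    'responded': [0, 1, 1, 1, 0, 1, 0, 1]  # hypothetical campaign-response flag
})
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=2, random_state=0)
tree.fit(toy[['age', 'income']], toy['responded'])
# Each leaf node is treated as a segment; apply() returns the leaf index for each customer
toy['segment'] = tree.apply(toy[['age', 'income']])
print(toy)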
Logistic Regression
Logistic regression is a modeling method used when the dependent variable is dichotomous, or binary (McCarty & Hastak, 2007).
If you want to dive into logistic regression use in segmentation, this article by Analytics Vidhya is a good place to start.
Code example + pros and cons for Logistic Regression coming.
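In the meantime, here’s a minimal sketch of the idea: fit a logistic regression on a binary outcome and bucket customers by their predicted probabilities. The repeat_buyer target and the toy data below are illustrative assumptions, not part of the dataset used earlier.
# Score customers with logistic regression, then bucket them by predicted probability
import pandas as pd
from sklearn.linear_model import LogisticRegression
toy = pd.DataFrame({
    'order_count': [1, 5, 2, 9, 1, 7, 3, 12],
    'avg_order_value': [20, 45, 30, 60, 15, 55, 25, 80],
    'repeat_buyer': [0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical binary target
})
logit = LogisticRegression()
logit.fit(toy[['order_count', 'avg_order_value']], toy['repeat_buyer'])
toy['propensity'] = logit.predict_proba(toy[['order_count', 'avg_order_value']])[:, 1]
# Cut the predicted probabilities into low / medium / high propensity segments
toy['segment'] = pd.cut(toy['propensity'], bins=[0, 0.33, 0.66, 1.0],
                        labels=['low', 'medium', 'high'])
print(toy)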
Association Rule Mining
Association rule mining came to public prominence through the market basket analysis done by Target, which famously tipped off a father that his teenage daughter was pregnant: Target mailed the household targeted advertisements for pregnancy merchandise even though she hadn’t purchased anything directly indicative of her pregnancy (Hill, 2012).
Market basket analysis is especially helpful in purchasing behavior segmentation for retail businesses interested in finding items commonly purchased together, and how that may coincide with more typical demographic, psychographic, geographic, or behavioral data (Griva, Bardaki, Pramatari, & Papakiriakopoulos, 2018).
Practical code example + pros and cons of Association Rule Mining and Market Basket Analysis coming.
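Until then, here’s a minimal sketch of association rule mining using the mlxtend library (my choice for illustration; it isn’t used elsewhere in this article) on a handful of made-up transactions.
# Basic association rule mining with mlxtend (pip install mlxtend) on toy transactions
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
transactions = [
    ['bread', 'milk'],
    ['bread', 'diapers', 'beer'],
    ['milk', 'diapers', 'beer'],
    ['bread', 'milk', 'diapers'],
    ['bread', 'milk', 'beer'],
]
# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
# Find itemsets appearing in at least 40% of transactions, then derive rules ranked by lift
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])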
Conclusion
Customer segmentation can have an incredible impact on a business when done well. I highly suggest checking out my recent article outlining behavioral segmentation with R, as well as every one of the sources I have listed below in the reference list, especially the books.
If you have any questions or suggestions, please comment below. I’ll respond as soon as I can.
References
- Blanchard, T., Bhatnagar, P., & Behera, D. (2019). Data Science for Marketing Analytics: Achieve your marketing goals with the data analytics power of Python. S.l.: Packt Publishing Limited.
- Griva, A., Bardaki, C., Pramatari, K., & Papakiriakopoulos, D. (2018). Retail business analytics: Customer visit segmentation using market basket data. Expert Systems with Applications, 100, 1-16.
- Hill, K. (2012, February 16). How Target figured out a teen girl was pregnant before her father did. Forbes.com. Retrieved from https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
- Hong, T., & Kim, E. (2011). Segmenting customers in online stores based on factors that affect the customer’s intention to purchase. Expert Systems with Applications, 39(2), 2127-2131.
- Hwang, Y. H. (2019). Hands-on Data Science For Marketing: Improve your marketing strategies with machine learning… using python and r. S.l.: Packt Publishing Limited.
- Kuruganti, S., & Basu, H. (2016). A Complete Guide to Build Better Predictive Models using Segmentation. Retrieved from https://www.analyticsvidhya.com/blog/2016/02/guide-build-predictive-models-segmentation/
- Miller, T. W. (2015). Modeling techniques in predictive analytics: Business problems and solutions with R (revised and expanded ed.). Upper Saddle River, NJ: Pearson Education, Inc.
- Miller, T. W. (2015). Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python. Upper Saddle River, NJ: Pearson Education, Inc.
- McCarty, J. A., & Hastak, M. (2007). Segmentation approaches to data-mining: A comparison of RFM, CHAID, and logistic regression. Journal of Business Research, 60(6), 656-662.
- Online Retail Data Set. (2015). Retrieved from http://archive.ics.uci.edu/ml/datasets/online+retail
- Ozkan, E., & Tolon, M. (2015). The effects of information overload on consumer confusion: An examination of user generated content. Bogazici Journal, 29(1), 27-52.
- PacktPublishing. (2019, April 04). PacktPublishing/Hands-On-Data-Science-for-Marketing. Retrieved from https://github.com/PacktPublishing/Hands-On-Data-Science-for-Marketing
- Residual sum of squares. (2019, February 22). Retrieved from https://en.wikipedia.org/wiki/Residual_sum_of_squares
- Rivas, A. (2018). How to Ignite Organic Growth: Customer Segmentation. Retrieved from https://www.mktr.ai/how-to-ignite-growth-with-customer-segmentation/
- Scikit Learn – K-Means – Elbow – criterion. (2013). Retrieved from https://stackoverflow.com/questions/19197715/scikit-learn-k-means-elbow-criterion
- Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken, NJ: John Wiley & Sons, Inc.
- Silhouette (clustering). (2019, March 20). Retrieved from https://en.wikipedia.org/wiki/Silhouette_(clustering)
- TrainingByPackt. (2019, May 27). TrainingByPackt/Data-Science-for-Marketing-Analytics. Retrieved from https://github.com/TrainingByPackt/Data-Science-for-Marketing-Analytics
- Ttnphns (2012). Are mean normalization and feature scaling needed for k-means clustering? Retrieved from https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering