Customer Segmentation: A Technical Guide With Python Examples

Consumers are inundated with more information than ever before. Faced with the enormous amount of information we consume daily, our brains have learned to ignore much of it or simply become confused (Ozkan & Tolon, 2015). This has further complicated the field of marketing: businesses must now leverage analytics to better understand their customers and how to attract them.

Figure: The proposed structural model of the relationship between information overload and consumer confusion, from Ozkan & Tolon’s study (Ozkan & Tolon, 2015).

In order to compete in this state of information overload, marketers have evolved. Enter segmentation.

What is segmentation?

Segmentation, either market or customer segmentation, has become a staple in the modern marketer’s toolbox. Market segmentation is the process of grouping consumers based on meaningful similarities (Miller, 2015). Segments are typically identified by geographic, demographic, psychographic, or behavioral characteristics.

Segmentation is used to inform several parts of a business, including product development, marketing campaigns, direct marketing, customer retention, and process optimization (Siegel, 2013).

Put simply, segmentation allows you to better understand your customers.

If you’re comfortable with customer or market segmentation and want to see a more in-depth case study using R, here’s a write-up for you.

Benefits and drawbacks of segmentation

We’ve written about this before, so I’m going to drop this screenshot here for you and it’s linked to our extensive (and less technical) article on customer segmentation. If you want to learn about segmentation, but numbers and code make you uncomfortable, check out our gentler guide to customer segmentation.

Customer segmentation is more than just a way for businesses to optimize their marketing campaigns. Customer segmentation is a two-way street.

Both consumers and companies benefit.

Consumers benefit because:

1. They feel like companies have their best interest in mind.
2. Content and products address and fulfill their needs.

Companies benefit because they can:

1. Optimize their marketing spend
2. Increase customer lifetime value (CLV)
3. Improve customer service and customer experience
4. Implement optimal marketing channel selection for each segment
5. Improve product features and offerings
6. Identify and cater to most profitable customers
Screenshot from our article How to Ignite Organic Growth: Customer Segmentation (Rivas, 2018).

With that said, we need to keep the drawbacks of cluster analysis for customer segmentation in mind. Below is a screenshot from the book Data Science for Marketing Analytics discussing the disadvantages of clustering.

Here are the disadvantages of clustering: Customer groups created may not be easily interpretable. If data is not based on consumer behavior (such as products or services purchased), it may not be clear how to use the clusters that are found. As you can see, one downside of clustering is that it may find groups that don't seem to make a lot of sense on the surface. Often this can be fixed by using a better suited clustering algorithm. Determining how to evaluate and fine-tune clustering algorithms will be the topic of our next chapter.
Screenshot from chapter 3 of the book Data Science for Marketing Analytics (Blanchard & Behera, 2019).

Customer Segmentation Methods

There are numerous methods to perform segmentation, varying in rigor, data requirements, and purpose. The following methods are some of the most broadly used, but this is not an exhaustive list. There are papers discussing artificial neural networks, particle swarm optimization, and complex ensemble models, but they aren’t included due to limited exposure. In future articles, I may dive into some of these other methods, but for now, these more common methods should suffice.

Each of the following sections of this article will include a basic explanation of the method, as well as a basic coding example of the segmentation method applied. If you’re not technical, that’s fine, just skip over the code and you should still get a decent handle on each of the 4 approaches to segmentation we’re covering in this article.

Cluster Analysis

Cluster analysis is a method of grouping, or clustering, consumers based on their similarities.

There are 2 primary types of cluster analysis leveraged in market segmentation: hierarchical cluster analysis, and partitioning (Miller, 2015). For now, we’re going to discuss a partitioning cluster method called k-means.

What is k-means clustering? The k-means clustering algorithm is a frequently used algorithm for drawing insights into the formations and separations within data. In marketing, it is often used to build customer segments and understand the behaviors of these different segments. Let's dive into building clustering models in Python.
Screenshot of K-means explanation from the book Hands-On Data Science for Marketing.

Technical Preface

The code below was run in a Jupyter notebook using Python 3.x and several Python packages for structuring, processing, analyzing, and visualizing the data.

Most of the code below is from the GitHub repository for the book Hands-On Data Science for Marketing. The book is available on Amazon or O’Reilly if you have a subscription.

The open-source dataset used in the following code came from UC Irvine’s Machine Learning Repository.

Import Packages and Data

To get started, we import the packages needed to execute our analysis and then import the xlsx (excel spreadsheet) data file. If you want to follow along with the same data, you’ll need to download it from UCI. For this example, I put the xlsx file in the folder (directory) where I launched the Jupyter notebook.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# read-in the excel spreadsheet using pandas 
df = pd.read_excel('Online Retail.xlsx', sheet_name='Online Retail')

df.head() # take a look at the first 5 rows in the DataFrame

As you can see, we have 8 columns of data for each row and each row represents an item purchased. This isn’t that helpful yet, so let’s clean and organize this data in a way that allows us to formulate more actionable insights.

Data Cleanup

Below, we are going to remove data that isn’t helpful, is missing, or may cause issues later.

# Drop cancelled orders
df = df.loc[df['Quantity'] > 0]

# Drop records without CustomerID
df = df[pd.notnull(df['CustomerID'])]

# Drop incomplete month
df = df.loc[df['InvoiceDate'] < '2011-12-01']

# Calculate total sales from the Quantity and UnitPrice
df['Sales'] = df['Quantity'] * df['UnitPrice']

Now let’s transform the data so that each record represents a single customer’s purchase history.

# Use groupby to total sales and count distinct invoices per CustomerID
customer_df = df.groupby('CustomerID').agg({'Sales': 'sum',
                                            'InvoiceNo': 'nunique'})

# Rename the aggregated columns
customer_df.columns = ['TotalSales', 'OrderCount']

# Create a new column 'AvgOrderValue'
customer_df['AvgOrderValue'] = customer_df['TotalSales'] / customer_df['OrderCount']


Nice! We now have a DataFrame with total sales, order count, and average order value for each customer. But we’re not home free yet.
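
To make the reshaping step concrete, here is a minimal sketch on a tiny, made-up set of invoice rows. The values and the names toy and toy_customers are invented for illustration; only the column names match the dataset. It uses pandas named aggregation, which is an alternative spelling of the groupby above rather than the book's code.

```python
import pandas as pd

# Hypothetical invoice lines: two customers, three invoices in total.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
    'Sales': [10.0, 5.0, 20.0, 8.0],
})

# One row per customer: total spend and number of distinct invoices.
toy_customers = toy.groupby('CustomerID').agg(
    TotalSales=('Sales', 'sum'),
    OrderCount=('InvoiceNo', 'nunique'),
)
toy_customers['AvgOrderValue'] = (
    toy_customers['TotalSales'] / toy_customers['OrderCount']
)
print(toy_customers)
```

Customer 1 has two invoices totaling 35.0, so the average order value is 17.5; customer 2 has a single invoice of 8.0.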

Normalize the data

Clustering algorithms like K-means are sensitive to the scales of the data used, so we’ll want to normalize the data.
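
A quick way to see the scale problem is to compare distances before and after standardizing. The sketch below uses three made-up customers (the numbers are assumptions, not rows from the dataset) and plain z-scores rather than the rank-based approach used later in this article.

```python
import numpy as np

# Hypothetical customers: [TotalSales in dollars, OrderCount]
a = np.array([1000.0, 2.0])
b = np.array([1010.0, 40.0])   # similar sales, very different order count
c = np.array([1500.0, 2.0])    # different sales, same order count

def euclidean(p, q):
    """Straight-line distance, the measure K-means relies on."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Standardize each column (z-score) across the three customers.
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ab = euclidean(Z[0], Z[1])
scaled_ac = euclidean(Z[0], Z[2])

print(round(raw_ab, 1), round(raw_ac, 1))
print(round(scaled_ab, 2), round(scaled_ac, 2))
```

On the raw values, the gap in total sales dominates and the large difference in order count barely registers. After standardizing, both differences contribute on a comparable footing.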

Below is a screenshot from part of a StackExchange answer discussing why standardization or normalization is necessary for data used in K-means clustering. The screenshot is linked to the StackExchange question, so you can click on it and read the entirety of the discussion if you’d like more information.

If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should standardize variables, of course. Even if variables are of the same units but show quite different variances it is still a good idea to standardize before K-means. You see, K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation leaving variances unequal is equivalent to putting more weight on variables with smaller variance, so clusters will tend to be separated along variables with greater variance.

A different thing also worth to remind is that K-means clustering results are potentially sensitive to the order of objects in the data set [1]. A justified practice would be to run the analysis several times, randomizing objects order; then average the cluster centres of those runs and input the centres as initial ones for one final run of the analysis.

Here is some general reasoning about the issue of standardizing features in cluster or other multivariate analysis.

[1] Specifically, (1) some methods of centers initialization are sensitive to case order; (2) even when the initialization method isn't sensitive, results might depend sometimes on the order the initial centers are introduced to the program by (in particular, when there are tied, equal distances within data); (3) so-called running means version of k-means algorithm is naturally sensitive to case order (in this version - which is not often used apart from maybe online clustering - recalculation of centroids take place after each individual case is reassigned to another cluster).
Screenshot from StackExchange stats question Are mean normalization and feature scaling needed for k-means clustering? Answer provided by the user ttnphns.

# Rank each column, then standardize the ranks to mean 0 and standard deviation 1
rank_df = customer_df.rank(method='first')
normalized_df = (rank_df - rank_df.mean()) / rank_df.std()

Table showing normalized values for total sales, order count, and average order value by customer.

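Each column now has a mean of 0 and a standard deviation of 1, and because we standardized ranks rather than raw values, every value falls between roughly -1.7 and 1.7. Now let’s get to clustering.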
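
To see what rank-based normalization does to an extreme value, here is a small sketch on made-up spending figures (the toy DataFrame is an assumption for illustration). Because only the ordering is kept, the one very large customer ends up no further from the center than its rank allows.

```python
import pandas as pd

# Hypothetical skewed spending data: one very large customer.
toy = pd.DataFrame({'TotalSales': [10.0, 20.0, 30.0, 40.0, 100000.0]})

ranked = toy.rank(method='first')                     # ranks 1 through 5
normalized = (ranked - ranked.mean()) / ranked.std()  # mean 0, std 1

print(normalized['TotalSales'].round(3).tolist())
# → [-1.265, -0.632, 0.0, 0.632, 1.265]
```

The trade-off is that the size of the gap between customers is discarded; only who ranks above whom survives.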

Select the optimal number of clusters

Alright, we’re ready to run cluster analysis. But first, we need to figure out how many clusters we want to use. There are several approaches to selecting the number of clusters to use, but I’m going to cover two in this article: (1) silhouette coefficient, and (2) the elbow method.


For a quick rundown on silhouette, check out the screenshot from Wikipedia below. Once again, the image is linked to the Wikipedia page, so if you want to know more about the topics, click on the image.

Silhouette (clustering)
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.[1]

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
Screenshot from the Wikipedia page for Silhouette (clustering).
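
The cohesion and separation terms in that definition can be computed by hand on a tiny example. The sketch below uses four made-up one-dimensional points with assumed cluster labels, and compares the hand calculation against scikit-learn's silhouette_samples.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical one-dimensional data with two obvious clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_for_point(i):
    dists = np.abs(X[:, 0] - X[i, 0])
    same = (labels == labels[i])
    same[i] = False                        # exclude the point itself
    a = dists[same].mean()                 # cohesion: mean distance within own cluster
    b = min(dists[labels == other].mean()  # separation: nearest other cluster
            for other in set(labels.tolist()) if other != labels[i])
    return (b - a) / max(a, b)

by_hand = np.array([silhouette_for_point(i) for i in range(len(X))])
from_sklearn = silhouette_samples(X, labels)
print(by_hand.round(4))
```

For the first point, the mean distance to its own cluster is 1 and the mean distance to the other cluster is 10.5, giving (10.5 - 1) / 10.5, or about 0.905. The silhouette_score function used in this article is the average of these per-point values.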

Now that we know more about the silhouette coefficient, let’s dive into implementing the code so we can find the ideal number of clusters.

# Use silhouette coefficient to determine the best number of clusters
from sklearn.metrics import silhouette_score

for n_cluster in [4, 5, 6, 7, 8]:
    kmeans = KMeans(n_clusters=n_cluster).fit(
        normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']])
    # Score the fitted labels against the same features used for clustering
    silhouette_avg = silhouette_score(
        normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']],
        kmeans.labels_)
    print('Silhouette Score for %i Clusters: %0.4f' % (n_cluster, silhouette_avg))

Silhouette Scores for 4, 5, 6, 7, and 8 clusters.

The 4-cluster solution had the highest silhouette coefficient, indicating that 4 is the best number of clusters among those tested. But we’re going to double-check that with the elbow method.
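
If you'd rather pick the winner programmatically, one option is to store the scores and take the maximum. This is a sketch on synthetic data from make_blobs (a stand-in for normalized_df, not the retail data), with random_state fixed so repeated runs agree; the n_init and random_state values are arbitrary choices of mine.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for normalized_df: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

# The k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the scores for neighboring values of k are often close, so it's worth looking at the whole dictionary rather than trusting the maximum alone.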

The Elbow Method with the Sum of Squared Errors (SSE)

Instead of dragging you through a ton of math and a clunky explanation, I’m giving you a screenshot linked to a StackOverflow answer which does a good job explaining the elbow method with SSE (here’s a link to an explanation of SSE if you’re not familiar). If you want to see the rest of the discussion, click on the image.

Via StackOverflow: 

Elbow Criterion Method:

The idea behind elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g k=1 to 10), and for each value of k, calculate sum of squared errors (SSE).

After that, plot a line graph of the SSE for each value of k. If the line graph looks like an arm - a red circle in below line graph (like angle), the "elbow" on the arm is the value of optimal k (number of cluster). Here, we want to minimize SSE. SSE tends to decrease toward 0 as we increase k (and SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).

So the goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.
Screenshot from a StackOverflow question Scikit Learn-K-Means-Elbow- criterion? Answer provided by the user Om Prakash.

Alright, with an intuitive understanding of the elbow method in hand, let’s use the elbow method to see if it agrees with our previous results suggesting 4 clusters.

Note: The code block below came from the GitHub repository for the book Data Science for Marketing Analytics.

from sklearn import cluster
import numpy as np

sse = []
krange = list(range(2,11))
X = normalized_df[['TotalSales','OrderCount','AvgOrderValue']].values
for n in krange:
    # Fit the model first; labels_ and cluster_centers_ only exist after fitting
    model = cluster.KMeans(n_clusters=n, random_state=3).fit(X)
    cluster_assignments = model.labels_
    centers = model.cluster_centers_
    sse.append(np.sum((X - centers[cluster_assignments]) ** 2))

plt.plot(krange, sse)
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of Squares")
plt.show()

Elbow Graph exported from my working Jupyter notebook

Based on the graph above, it looks like K=4 is the optimal number of clusters for this analysis. Now let’s interpret the customer segments provided by these clusters.
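
As a side note, scikit-learn's fitted KMeans object exposes the same quantity as its inertia_ attribute, so the manual SSE calculation can be checked against it. A minimal sketch on synthetic data (the blobs and parameter values are assumptions, not the retail data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the normalized customer features.
X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

model = KMeans(n_clusters=4, n_init=10, random_state=3).fit(X)

# Sum of squared distances from each point to its assigned center
manual_sse = np.sum((X - model.cluster_centers_[model.labels_]) ** 2)

print(np.isclose(manual_sse, model.inertia_))
```

Appending model.inertia_ inside the loop would be a shorter way to build the same elbow curve.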

Interpreting Customer Segments

kmeans = KMeans(n_clusters=4).fit(normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']])

four_cluster_df = normalized_df[['TotalSales', 'OrderCount', 'AvgOrderValue']].copy(deep=True)
four_cluster_df['Cluster'] = kmeans.labels_


Now let’s group the cluster metrics and see what we can gather from the normalized data for each cluster.

cluster1_metrics = kmeans.cluster_centers_[0]
cluster2_metrics = kmeans.cluster_centers_[1]
cluster3_metrics = kmeans.cluster_centers_[2]
cluster4_metrics = kmeans.cluster_centers_[3]

data = [cluster1_metrics, cluster2_metrics, cluster3_metrics, cluster4_metrics]
cluster_center_df = pd.DataFrame(data)

cluster_center_df.columns = four_cluster_df.columns[0:3]
Screenshot of a table I made of the metrics for each cluster. In the index (the leftmost column of the table): 0 = cluster 1, 1 = cluster 2, 2 = cluster 3, 3 = cluster 4.

Visualizing Clusters

For this next piece, we are going to visualize the clusters by putting the different columns on the x and y-axes. Let’s see what we get.

# Colors are assumed here to match the discussion below (cluster 2 in orange);
# K-means label numbers can differ between runs.
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['TotalSales'],
    c='blue')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['TotalSales'],
    c='red')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['TotalSales'],
    c='orange')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['TotalSales'],
    c='green')

plt.title('TotalSales vs. OrderCount Clusters')
plt.xlabel('Order Count')
plt.ylabel('Total Sales')
plt.show()


# Colors are assumed to match the discussion below (cluster 2 in orange).
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['AvgOrderValue'],
    c='blue')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['AvgOrderValue'],
    c='red')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['AvgOrderValue'],
    c='orange')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['OrderCount'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['AvgOrderValue'],
    c='green')

plt.title('AvgOrderValue vs. OrderCount Clusters')
plt.xlabel('Order Count')
plt.ylabel('Avg Order Value')
plt.show()


# Colors are assumed to match the discussion below (cluster 2 in orange).
plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['TotalSales'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 0]['AvgOrderValue'],
    c='blue')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['TotalSales'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 1]['AvgOrderValue'],
    c='red')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['TotalSales'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 2]['AvgOrderValue'],
    c='orange')

plt.scatter(
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['TotalSales'], 
    four_cluster_df.loc[four_cluster_df['Cluster'] == 3]['AvgOrderValue'],
    c='green')

plt.title('AvgOrderValue vs. TotalSales Clusters')
plt.xlabel('Total Sales')
plt.ylabel('Avg Order Value')
plt.show()


In the first plot (total sales vs. order count), the customers in green have low total sales AND low order counts, meaning they are all-around low-value customers. On the other hand, the customers in orange have high total sales AND high order counts, indicating they are the highest-value customers.

In the second plot, we’re looking at the average order value vs. the order count. Once again, the customers in green are the lowest-value customers and the customers in orange are the highest-value customers.

You could look at this in another way. You could look at the customers in the red cluster and attempt to find ways to increase their order count with email reminders or SMS push notifications targeted based on some other identifying factors. Maybe you could email them a discount if they return within 30 days. Better yet, you can offer a delayed coupon (to be used in a specific time period) upon checkout.

Likewise, with customers in the blue segment, you might want to try some cross-selling and up-selling techniques at the cart. Maybe a quick pop-up with an offer, based on market basket analysis (see the market basket analysis section below).

In the third plot, we have the average order value versus total sales. This plot further substantiates the previous 2 plots in identifying the orange cluster as the highest-value customers, green as the lowest-value customers, and blue and red as high-opportunity customers.

From a growth perspective, I’d focus my attention on the blue and red clusters. I’d attempt to better understand each cluster and their granular behaviors on-site in order to identify which cluster to focus on first and inform the first few rounds of experiments.
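
One way to start understanding each cluster is to summarize it in the original units (dollars and orders) rather than normalized scores. The sketch below uses made-up customers and labels; the names customers, labels, and profile are mine, and in the walkthrough above the equivalent inputs would presumably be customer_df and four_cluster_df['Cluster'].

```python
import pandas as pd

# Hypothetical per-customer metrics (original units) and cluster labels.
customers = pd.DataFrame({
    'TotalSales': [50.0, 70.0, 900.0, 1100.0],
    'OrderCount': [1, 1, 10, 12],
}, index=pd.Index([101, 102, 103, 104], name='CustomerID'))
labels = pd.Series([0, 0, 1, 1], index=customers.index, name='Cluster')

# Attach the labels, then summarize each cluster.
profile = customers.join(labels).groupby('Cluster').agg(
    Customers=('TotalSales', 'count'),
    MeanTotalSales=('TotalSales', 'mean'),
    MeanOrderCount=('OrderCount', 'mean'),
)
print(profile)
```

A table like this makes it easier to judge how much revenue each segment represents before deciding which one to experiment on first.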

Find the best-selling item by segment

We know that we have 4 segments and know how much they spend per purchase, their total spending, and their number of orders. The next thing we can do that will help us better understand the customer segments is to identify which items are the best-selling within each segment.

high_value_cluster = four_cluster_df.loc[four_cluster_df['Cluster'] == 2]
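
The counting step that follows can be sketched as below. This is my own illustration on made-up order lines, not the book's exact code: the orders table, the BAG and MUG items, and segment_ids are hypothetical, with segment_ids playing the role of high_value_cluster.index (the CustomerIDs in the segment).

```python
import pandas as pd

# Hypothetical order lines; column names mirror the dataset.
orders = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'Description': ['BAG', 'MUG', 'BAG', 'BAG', 'MUG'],
    'StockCode': ['b1', 'm1', 'b1', 'b1', 'm1'],
})

# Hypothetical segment made up of customers 1 and 2.
segment_ids = pd.Index([1, 2])

# Keep the segment's order lines, then count lines per item.
top_items = (
    orders.loc[orders['CustomerID'].isin(segment_ids)]
    .groupby('Description')['StockCode']
    .count()
    .sort_values(ascending=False)
    .head(5)
)
print(top_items)
```

Keep in mind that K-means label numbers are arbitrary, so the label of the high-value cluster (2 in my run) should be confirmed against the cluster centers before filtering.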


Based on this information, we now know that the Jumbo Bag Red Retrospot is the best-selling item for our highest-value cluster. With that information in hand, we can make recommendations of Other Items You Might Like to customers within this segment. These actions can be taken to another level of specificity with Association Rule Mining and Market Basket Analysis which I’ll cover below.

Wrapping-up Cluster Analysis

In this section, we ran through a basic application of K-means clustering based on the purchasing behaviors of historical customers. This type of analysis can be run for virtually any company with the requisite data. Ecommerce companies, SaaS companies, service-based companies, you name it. That said, this is a simple example, and without further testing and specific action, the information is useless.

Other Customer Segmentation Methods

In the coming weeks, I plan on updating this article with more robust explanations and code examples for each of the following methods.

Chi-square Automatic Interaction Detector (CHAID)

CHAID is a decision tree classification method that creates nodes or groupings of consumers enabling smaller group analysis (McCarty & Hastak, 2006).
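
As the name suggests, CHAID chooses its splits with chi-square tests of independence. The sketch below shows only that split-selection idea on made-up data: it tests each candidate predictor against the outcome and picks the one with the smallest p-value. It is not a full CHAID implementation (the complete algorithm also merges similar categories and adjusts p-values), and the column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical customers: did they respond to a campaign?
data = pd.DataFrame({
    'Region':    ['N', 'N', 'N', 'N', 'S', 'S', 'S', 'S'] * 10,
    'Channel':   ['web', 'app'] * 40,
    'Responded': [1, 1, 1, 0, 0, 0, 0, 1] * 10,
})

p_values = {}
for predictor in ['Region', 'Channel']:
    # Contingency table of predictor categories vs. the outcome
    table = pd.crosstab(data[predictor], data['Responded'])
    chi2, p, dof, expected = chi2_contingency(table)
    p_values[predictor] = p

# The predictor most strongly associated with the outcome
best_split = min(p_values, key=p_values.get)
print(best_split)
```

Here Region is chosen because response rates differ sharply between the two regions, while Channel shows no association at all. A tree would then repeat the same test within each resulting group.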

If you’d like to use Python to implement CHAID, this GitHub repository looks like a strong starting point.

Code example + pros and cons for CHAID coming.

Logistic Regression

Logistic regression is a modeling method used on a dichotomous or binary dependent variable (McCarty & Hastak, 2006).
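
As a rough illustration of how a binary outcome model can feed segmentation, the sketch below fits scikit-learn's LogisticRegression on simulated data and splits customers by predicted probability. Everything here is an assumption for demonstration: the data is randomly generated, the single feature is order count, and the 0.5 cut-off is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical feature: order count. Outcome: 1 if the customer bought again.
order_count = rng.integers(1, 20, size=500)
prob = 1 / (1 + np.exp(-(0.4 * order_count - 4)))
bought_again = (rng.random(500) < prob).astype(int)

X = order_count.reshape(-1, 1)
model = LogisticRegression().fit(X, bought_again)

# Predicted probability of the positive class for each customer
scores = model.predict_proba(X)[:, 1]

# Split customers into two segments at an arbitrary 0.5 cut-off.
segment = np.where(scores >= 0.5, 'likely', 'unlikely')
print(dict(zip(*np.unique(segment, return_counts=True))))
```

In practice you would use several predictors and validate the model on held-out data before acting on the segments.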

If you want to dive into the use of logistic regression in segmentation, this article by Analytics Vidhya is a good place to start.

Code example + pros and cons for Logistic Regression coming.

Association Rule Mining

Association rule mining came to public prominence through market basket analysis done by Target, which famously revealed a teenage daughter's pregnancy to her father by mailing her targeted advertisements for pregnancy merchandise, although she hadn’t purchased anything directly indicative of her pregnancy (Hill, 2012).

Market basket analysis is especially helpful in purchasing behavior segmentation for retail businesses interested in finding items commonly purchased together, and how that may coincide with more typical demographic, psychographic, geographic, or behavioral data (Griva, Bardaki, Pramatari, & Papakiriakopoulos, 2018).
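
The three measures usually reported for an association rule are support, confidence, and lift. Here is a minimal pure-Python sketch on five made-up baskets (the items and function names are mine, for illustration only); dedicated libraries handle the search over many item combinations, which this sketch does not attempt.

```python
# Hypothetical baskets, each a set of purchased items.
baskets = [
    {'bread', 'butter'},
    {'bread', 'butter', 'jam'},
    {'bread'},
    {'butter'},
    {'jam'},
]

def support(items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the share that also have the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({'bread', 'butter'}))
print(confidence({'bread'}, {'butter'}))
print(lift({'bread'}, {'butter'}))
```

Bread and butter appear together in 2 of 5 baskets (support 0.4), and 2 of the 3 bread baskets also contain butter (confidence of about 0.67). A lift above 1, as here, suggests the items are bought together more often than chance would predict.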

Practical code example + pros and cons of Association Rule Mining and Market Basket Analysis coming.


Customer segmentation can have an incredible impact on a business when done well. I highly suggest checking out my recent article outlining behavioral segmentation with R, as well as every one of the sources I have listed below in the reference list, especially the books.

If you have any questions or suggestions, please comment below. I’ll respond as soon as I can.

