Customer segmentation is incredibly important, and now, incredibly practical.
In this article, I’m going to show you how to perform customer segmentation, using R, with data from Google Analytics.
Fair warning: this comes from a real contract with a client. The data and code shared here have been scrubbed, and I won't be sharing the raw data, to keep anyone from reverse-engineering who the client is.
Note:
I don’t want to waste your time if you already know what segmentation is and just want to jump to the code, so here is our code-free intro guide to customer segmentation for the casual reader:
And here’s our intro guide to technical customer segmentation with code examples in python:
Ok, now that we know you’re in the right place, let’s dive into what to expect from this write-up.
This write-up is for two groups of people:
- A businessperson with some technical chops who is interested in using their web traffic or in-app data to drive insights.
- Or a data scientist, analyst, or technical marketer looking for a way to perform customer segmentation.
Why this matters to you
This is a real-world example of a contract I completed for a client, which led to actionable insights.
What I’m going to tell you
How I took vague and loose guidance and turned it into potentially valuable and actionable insights for a client. Hopefully, you’ll be able to take this information and do the same for your business or clients.
Questions I’m going to answer
- What value does this add?
- How to perform cluster analysis from start to finish, based on web traffic and in-app user data.
Alright, now that you know what to expect, let’s take a look at the project.
Background
The Client
This was a quick, one-off contract with a client in the financial services and crowdfunding space. They have significant web traffic and wanted to start looking at their data. They didn’t know where to start, and like virtually every data project, they wanted a quick win.
The Situation
I looked at gigs of in-app user data plus web traffic data and I got excited about the challenge.
I typically code in Python, but the client asked for the code to be written in R. This increased the challenge, which only made the project more interesting to me.
Note to new Data Nerds:
Honestly, after you’ve been working on data projects for a while, you start looking for different projects to challenge yourself. Otherwise, you’ll get stuck in a niche and bore yourself to death.
The Client’s Goals
The Client’s goals were simple: “Give me something that leads to more revenue, so I can justify using more analytics and data science to drive business decisions.”
My Goals
Jiu-Jitsu the heck out of this data, and squeeze every penny of value out of it I can for my client. Delivering quickly for clients like this will make you lifelong friends, especially when you give someone firepower to substantiate their initiative.
Note:
Understanding the internal context of the client, or your own business, is critical to succeeding as a professional in data science.
Alright, now let’s dive into the data…
The data
What was the data
The client’s data consisted of all of their web traffic and in-app actions for a 6 month period. Each of the 2.4 million rows in the dataset represents a single website session, with a Session ID. The 66 original columns are used to describe that session.
How did this shape my thinking?
As soon as I looked at the data, I knew I wanted to apply some form of segmentation. Customer segmentation is easily one of the most valuable technical steps a company can take because it surfaces concrete opportunities for growth.
Application – The Code
Get started
Install & load packages
You may need to install more packages, depending on the environment you're working in. If so, just add the package to the install.packages(c(...)) call.
install.packages(c("NbClust", "flexclust", "rattle", "fastDummies", "clusterSim"), repos = "http://cran.us.r-project.org")
library(lattice) # multivariate data visualization
library(vcd) # data visualization for categorical variables
library(cluster) # cluster analysis methods
library(flexclust) # flexible clustering algorithms and utilities
library(grid) # low-level grid graphics
library(gridExtra) # arrange multiple grid-based plots
library(readr) # fast CSV reading
library(fastDummies) # create dummy variables
library(dplyr) # data manipulation
library(readxl) # read Excel files
library(ggplot2) # plotting
Upload data
This data file was huge, ~1.7GB, so loading the data took a while.
df <- read_csv("session_data.csv")
Check out the data
head(df)
Data Selection
Create dummy variables for categorical variables
This data set had a ton of categorical variables, which yield very little value in their raw form because most clustering methods need numeric inputs. With that in mind, I tend to convert categorical variables into dummy variables.
Note: There are easier ways to go about creating dummy variables (one is sketched after the code below), but I was moving quickly through this stage and simply copied and pasted the column names from a list of categorical variables.
cols <- c('IS_BOUNCE','IS_NEW_USER', 'IS_NEW_USER_AND_IDENTIFIER',
'REFERRAL_SOURCE_BUCKET', 'LANDED_ON_LEND', 'LANDED_ON_HOME',
'DEVICE_TYPE', 'HAS_TRANSACTION_EVENT', 'HAS_REGISTRATION',
'TRANSACT_WITHIN_30_DAYS', 'TRANSACT_WITHIN_14_DAYS',
'HAS_CATEGORY_OR_OTHER_LEND_PATH', 'USED_SEARCH_QUERY',
'USED_EXPIRING_SOON_SORT', 'USED_REPAYMENT_TERM_SORT',
'USED_OTHER_NON_POPULARITY_SORT', 'USED_TERM_MAX_FILTER',
'USED_RISK_MAX_FILTER', 'USED_GENDER_FILTER_MALE',
'USED_GENDER_FILTER_FEMALE', 'USED_COUNTRY_FILTER',
'USED_SECTOR_FILTER', 'USED_ATTRIBUTE_FILTER',
'USED_TAG_FILTER')
df <- dummy_cols(df, select_columns=cols, remove_first_dummy = TRUE)
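For reference, here's a minimal sketch of one of those easier routes: if select_columns is omitted, dummy_cols() converts every character and factor column automatically. This assumes you want all of them dummied, which wasn't quite the case here.
# Hedged alternative: dummy every character/factor column in one call.
# Only appropriate if ALL categorical columns should become dummies.
df_alt <- dummy_cols(df, remove_first_dummy = TRUE)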
Replace NaN or null values with 0
For some of the upcoming processes, we need to eliminate all null values, which can be accomplished with the following code:
df[is.na(df)] <- 0
Note: The above code is a basic imputation method, and there are plenty of other ways to accomplish this task.
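For example, here's a hedged sketch of a median-based alternative for the numeric columns (requires dplyr >= 1.0; not what was used in this project):
# Sketch: impute numeric NAs with each column's median instead of 0.
df_median_imputed <- df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))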
Create a new dataframe for cluster analysis
At this point, I need to pull together all of the columns I want to use moving forward, dummy variables included.
new_cols = c('COOKIE_ID', 'SESSION_ID', 'USER_ID',
'SESSION_DAY_OF_MONTH', 'SESSION_DAY_OF_WEEK',
'SESSION_LENGTH_MINUTES', 'COOKIE_SESSION_NUMBER',
'USER_SESSION_NUMBER','NUM_PAGE_VIEWS',
'NUM_EVENTS', 'LOAN_PURCHASE_INDIVIDUAL',
'LOAN_PURCHASE_MANAGED', 'DONATION_INDIVIDUAL',
'DEPOSIT_INDIVIDUAL',
'GIFT_PURCHASE_INDIVIDUAL', 'DAYS_TO_NEXT_TRANSACTION',
'SESSION_DEVICE_START_HOUR', 'SECONDS_SPENT_ON_LEND',
'NUM_LEND_PAGEVIEWS', 'NUM_BASKET_PAGEVIEWS',
'NUM_CLICK_READ_MORE_EVENTS', 'NUM_CLICK_FILTER_BUTTON_EVENTS',
'NUM_SEARCH_EVENTS','NUM_ADD_TO_BASKET_EVENTS',
'NUM_SEARCH_QUERIES','MAX_RESULT_PAGE_NUMBER',
'IS_NEW_USER_TRUE', 'IS_NEW_USER_AND_IDENTIFIER_TRUE',
'REFERRAL_SOURCE_BUCKET_other', 'REFERRAL_SOURCE_BUCKET_direct',
'REFERRAL_SOURCE_BUCKET_search', 'REFERRAL_SOURCE_BUCKET_social',
'REFERRAL_SOURCE_BUCKET_email', 'LANDED_ON_LEND_FALSE',
'LANDED_ON_HOME_TRUE', 'DEVICE_TYPE_Desktop', 'DEVICE_TYPE_Tablet',
'DEVICE_TYPE_Mobile Device', 'DEVICE_TYPE_Ebook Reader',
'DEVICE_TYPE_TV Device', 'DEVICE_TYPE_NA', 'DEVICE_TYPE_Console',
'DEVICE_TYPE_Car Entertainment System',
'HAS_TRANSACTION_EVENT_yes',
'HAS_REGISTRATION_TRUE', 'TRANSACT_WITHIN_30_DAYS_TRUE',
'TRANSACT_WITHIN_14_DAYS_TRUE',
'HAS_CATEGORY_OR_OTHER_LEND_PATH_FALSE',
'USED_SEARCH_QUERY_TRUE', 'USED_EXPIRING_SOON_SORT_TRUE',
'USED_REPAYMENT_TERM_SORT_TRUE',
'USED_OTHER_NON_POPULARITY_SORT_TRUE',
'USED_TERM_MAX_FILTER_TRUE', 'USED_RISK_MAX_FILTER_TRUE',
'USED_GENDER_FILTER_MALE_TRUE', 'USED_GENDER_FILTER_FEMALE_TRUE',
'USED_COUNTRY_FILTER_TRUE', 'USED_SECTOR_FILTER_TRUE',
'USED_ATTRIBUTE_FILTER_TRUE', 'USED_TAG_FILTER_TRUE')
data_for_cluster <- df[new_cols]
head(data_for_cluster)
Create a dataframe of just users with a USER_ID
Here I want to reduce the dataset only to visitors with a USER_ID. This will limit our dataset to people who are active users of my client’s application.
user_df <- subset(data_for_cluster, (data_for_cluster$USER_ID != 0))
length(user_df$SESSION_ID)
head(user_df)
Now let’s scale the user_df in preparation for cluster analysis
If you want to know more about why scaling matters, check out the technical guide to customer segmentation I wrote. I also suggest checking out this StackExchange post:

user_df_scaled <- user_df
user_df_scaled[c(4:60)] <- scale(user_df_scaled[c(4:60)])
head(user_df_scaled)
Notice the range of values.
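As a quick sanity check (columns 4 through 6 picked arbitrarily), each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:
# Scaled columns should center near 0 with unit standard deviation.
round(colMeans(user_df_scaled[4:6]), 3)
round(apply(user_df_scaled[4:6], 2, sd), 3)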
Save the DataFrames as CSV files
This dataset is large and crashed my machine a few times. To avoid repeating work, I exported CSV files regularly throughout this project.
write.csv(user_df_scaled, file = "user_df_scaled.csv")
Cluster Analysis
We're using a partitioning method (k-means) due to the size of this dataset. If the dataset were substantially smaller, I'd consider hierarchical clustering.
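For contrast, here's a minimal sketch of what the hierarchical route would look like on a small random sample (the sample size of 1,000 is an arbitrary illustration; hclust needs a full pairwise distance matrix, which rules it out at 2.4 million rows):
# Sketch only: hierarchical clustering on a small random sample.
set.seed(0)
sample_rows <- sample(nrow(user_df_scaled), 1000)
d <- dist(user_df_scaled[sample_rows, c(4:60)]) # pairwise distances
hc <- hclust(d, method = "ward.D2")             # Ward linkage
plot(hc, labels = FALSE)                        # dendrogram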
How many clusters?
Build a function to produce a chart for the elbow method.
Note:
The silhouette score calculation was too computationally expensive with this dataset. Otherwise, the silhouette score would have been calculated and used to select the number of clusters to use for segmentation.
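If you still want a rough number, a hedged workaround is to compute silhouette scores on a random sample (the sample size of 2,000 and the k range are arbitrary choices for illustration):
# Approximate silhouette scores on a sample; dist() on the full
# dataset is what makes this prohibitively expensive.
set.seed(0)
idx <- sample(nrow(user_df_scaled), 2000)
sample_data <- user_df_scaled[idx, c(4:60)]
d <- dist(sample_data)
for (k in 2:8) {
  km <- kmeans(sample_data, centers = k, nstart = 10)
  cat(k, "clusters -> avg silhouette:",
      mean(silhouette(km$cluster, d)[, "sil_width"]), "\n")
}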
wssplot <- function(data, nc=10, seed=0){
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type='b', xlab='Number of Clusters',
       ylab='Within groups sum of squares')}
Now run the function for the elbow method
Beware: this function takes a significant amount of time to run (10-90 minutes for this dataset).
user_df_scaled[is.na(user_df_scaled)] <- 0 # Remove null values ... again
wssplot(user_df_scaled[c(4:60)]) # feed dataframe through elbow method function
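If the run time is a problem for you, one workaround (an approximation, not what was done for this project) is to run the elbow function on a random sample:
# Approximate the elbow curve on 100k sampled sessions to cut run time.
set.seed(0)
sample_idx <- sample(nrow(user_df_scaled), 100000)
wssplot(user_df_scaled[sample_idx, c(4:60)])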
From the elbow plot, it looks like 7 clusters make the most sense.
KMeans Clustering
fit.km <- kmeans(user_df_scaled[c(4:60)], 7, nstart=25)
Check the size of each cluster.
fit.km$size
Let’s look at the overall cluster metrics.
cluster_mean_metrics <- aggregate(user_df_scaled[-1], by=list(cluster=fit.km$cluster), mean)
cluster_mean_metrics
Look into cluster data for the number of transaction events.
transaction_event <- table(user_df_scaled$HAS_TRANSACTION_EVENT_yes, fit.km$cluster)
transaction_event
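If proportions are easier to read than raw counts, prop.table() can normalize within each cluster (an optional extra step):
# Share of sessions with/without a transaction event within each cluster.
round(prop.table(transaction_event, margin = 2), 3)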
Add cluster number to each session entry in the user_df’s.
Here I kept scaled and unscaled df’s for further explainability and visualizations in the next stages of the project.
user_df_scaled$cluster <- fit.km$cluster
user_df$cluster <- fit.km$cluster
Write both user_df's to CSV files for backup.
Note: Originally, this project was executed using an R distribution on Google Colab for the use of GPUs and the ability to run multiple notebooks at the same time. That’s part of the reason for CSV saving throughout the project.
write.csv(user_df_scaled, file = "user_df_scaled.csv")
write.csv(user_df, file = "user_df.csv")
Group by cluster – clustered by session
cluster_df <- user_df %>% group_by(cluster) %>%
  summarise(Count = n(),
            total_loan_purchase = sum(LOAN_PURCHASE_INDIVIDUAL),
            total_loan_purchase_manage = sum(LOAN_PURCHASE_MANAGED),
            total_donation = sum(DONATION_INDIVIDUAL),
            total_deposit = sum(DEPOSIT_INDIVIDUAL),
            total_gift_card_purchase = sum(GIFT_PURCHASE_INDIVIDUAL),
            average_days_to_next_transaction = sum(DAYS_TO_NEXT_TRANSACTION) /
              length(DAYS_TO_NEXT_TRANSACTION),
            total_transaction_value = (total_loan_purchase +
              total_loan_purchase_manage + total_donation +
              total_deposit + total_gift_card_purchase),
            total_visits = length(unique(SESSION_ID)),
            average_visit_value = (total_transaction_value / total_visits))
cluster_df # take a look at the data grouped by cluster and clustered by session
Let’s look at a couple of charts to get a feel for the data.
g1 <- ggplot(data=cluster_df,
aes(x=total_transaction_value,
y=average_days_to_next_transaction,
color=cluster)) + geom_point(size=2.0)
g2 <- ggplot(data=cluster_df,
aes(x=total_transaction_value,
y=total_visits, color=cluster)) + geom_point(size=2.0)
g1grob <- ggplotGrob(g1)
g2grob <- ggplotGrob(g2)
grid.arrange(g1grob, g2grob)
Now let’s group the data by user (USER_ID)
This should allow us to get a better understanding of how users vary.
grouped_user_df <- user_df %>% group_by(USER_ID) %>%
  summarize(total_visits = n(),
            total_loan_purchase = sum(LOAN_PURCHASE_INDIVIDUAL),
            total_loan_purchase_manage = sum(LOAN_PURCHASE_MANAGED),
            total_donation = sum(DONATION_INDIVIDUAL),
            total_deposit = sum(DEPOSIT_INDIVIDUAL),
            total_gift_card_purchase = sum(GIFT_PURCHASE_INDIVIDUAL),
            average_days_to_next_transaction = (sum(DAYS_TO_NEXT_TRANSACTION) /
              length(DAYS_TO_NEXT_TRANSACTION)),
            total_transaction_value = (total_loan_purchase +
              total_loan_purchase_manage +
              total_donation + total_deposit +
              total_gift_card_purchase),
            average_visit_value = (total_transaction_value / total_visits),
            cluster = sum(cluster)/total_visits)
head(grouped_user_df) # Take a look at the columns and values
Let’s look at a couple of charts.
g1 <- ggplot(data=grouped_user_df,
aes(x=total_transaction_value,
y=average_days_to_next_transaction,
color=cluster)) + geom_point(size=0.5)
g2 <- ggplot(data=grouped_user_df,
aes(x=average_days_to_next_transaction,
y=total_transaction_value,
color=cluster)) + geom_point(size = 0.5)
g1grob <- ggplotGrob(g1)
g2grob <- ggplotGrob(g2)
grid.arrange(g1grob, g2grob)
It looks like the highest-value customers, based on total_transaction_value, have smaller gaps between transactions than the less valuable customers.
ggplot(data=grouped_user_df,
aes(x=total_transaction_value,
y=total_visits, color=cluster)) + geom_point(size=1)
There’s one extreme outlier in terms of total_visits. Let’s remove that user and look at this graph again.
max(grouped_user_df$total_visits, na.rm=TRUE) # find max value of outlier
## [1] 102715
Take a look at the top 5 in total visits to grab its USER_ID.
head(arrange(grouped_user_df, desc(total_visits)))
Oddly enough, the user with 102,715 sessions made 0 transactions. I wonder if this is a company or test account, or maybe a crawler/scraper.
grouped_user_df[grouped_user_df$USER_ID == '2ea26f2b39c2ec5e592856a6429fdb96e0f6b21f09d5ebbe233e0396387b924b', ]
Tighten the range of total_visits and total_transaction_value to improve the granularity of the plot.
tight_user_df <- grouped_user_df[(grouped_user_df$total_visits <= 300) & (grouped_user_df$total_transaction_value > 100), ]
ggplot(data=tight_user_df,
aes(x=total_transaction_value,
y=total_visits, color=cluster)) + geom_point(size=1)
Compare the number of users in tight_user_df and grouped_user_df.
length(tight_user_df$cluster)
## [1] 70964
length(grouped_user_df$cluster)
## [1] 315936
Let's take a look at the overall distribution of total_transaction_value.
hist(grouped_user_df$total_transaction_value,
     xlab='Total Transaction Value',
     main='Frequency of Total Transaction Value',
     col='lightgreen', xlim=c(0,1500), breaks=25000) # 25000 breaks to improve granularity of histogram
It looks like most users have a total_transaction_value < $500. Let's see exactly how many users fall in that range.
low_trans_val <- grouped_user_df[grouped_user_df$total_transaction_value <=500, ]
length(low_trans_val$cluster)
## [1] 303394
As expected, the vast majority of users' total_transaction_value is less than $500: 303,394 of the 315,936 registered users in this dataset.
Now, let's look at the quantiles of total_transaction_value.
quantile(grouped_user_df$total_transaction_value,
         probs = c(0.315, 0.5, 0.65, 0.75, 0.95, 0.99,
                   0.999, 0.9999, 0.99999, 0.999999))
Note: In future work, it would make sense to remove the outliers with the highest transaction values. For this project, considering it’s more of an exploratory process, we’ll leave them in the dataset.
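As a sketch of what that future step might look like (the 99.9th-percentile cutoff is an arbitrary choice):
# Sketch: drop users above the 99.9th percentile of total_transaction_value.
cutoff <- quantile(grouped_user_df$total_transaction_value, 0.999, na.rm = TRUE)
trimmed_user_df <- grouped_user_df[grouped_user_df$total_transaction_value <= cutoff, ]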
Cluster by user
Group by USER_ID and create variables to tell us more about the users.
Notice we’re using the user_df instead of grouped_user_df for this grouping.
user_cluster_df <- user_df %>% group_by(USER_ID) %>%
summarize(
# Counting variables
total_visits = n(),
total_loan_purchase = sum(LOAN_PURCHASE_INDIVIDUAL),
num_of_loan_purchases = length(LOAN_PURCHASE_INDIVIDUAL[
LOAN_PURCHASE_INDIVIDUAL > 0]),
total_loan_purchase_manage = sum(LOAN_PURCHASE_MANAGED),
num_managed_loan_purchases = length(LOAN_PURCHASE_MANAGED[
LOAN_PURCHASE_MANAGED > 0]),
total_donation = sum(DONATION_INDIVIDUAL),
num_donations = length(DONATION_INDIVIDUAL[
DONATION_INDIVIDUAL>0]),
total_deposit = sum(DEPOSIT_INDIVIDUAL),
num_deposits = length(DEPOSIT_INDIVIDUAL[DEPOSIT_INDIVIDUAL > 0]),
total_gift_card_purchase = sum(GIFT_PURCHASE_INDIVIDUAL),
num_gift_card_purchases = length(GIFT_PURCHASE_INDIVIDUAL[
GIFT_PURCHASE_INDIVIDUAL > 0]),
average_days_to_next_transaction = (sum(DAYS_TO_NEXT_TRANSACTION)/
length(DAYS_TO_NEXT_TRANSACTION)),
total_transaction_value = (total_loan_purchase +
total_loan_purchase_manage +
total_donation + total_deposit +
total_gift_card_purchase),
num_total_transactions = (num_of_loan_purchases +
num_managed_loan_purchases +
num_donations + num_deposits +
num_gift_card_purchases),
# Per visit variables
value_per_visit = (total_transaction_value / total_visits),
session_length_per_visit = sum(SESSION_LENGTH_MINUTES) /
total_visits,
page_views_per_visit = sum(NUM_PAGE_VIEWS) / total_visits,
events_per_visit = sum(NUM_EVENTS) / total_visits,
sec_spent_on_lend_per_visit = sum(SECONDS_SPENT_ON_LEND) /
total_visits,
lend_pageviews_per_visit = sum(NUM_LEND_PAGEVIEWS)/total_visits,
basket_pageviews_per_visit = sum(NUM_BASKET_PAGEVIEWS) /
total_visits,
session_cluster_visit = sum(cluster) / total_visits,
# Per transaction variables
value_per_transaction = total_transaction_value /
num_total_transactions,
session_length_per_transaction = sum(SESSION_LENGTH_MINUTES) /
num_total_transactions,
pageviews_per_transaction = sum(NUM_PAGE_VIEWS) /
num_total_transactions,
events_per_transaction = sum(NUM_EVENTS) / num_total_transactions,
sec_spent_on_lend_per_transaction = sum(SECONDS_SPENT_ON_LEND) /
num_total_transactions,
lend_pageviews_per_transaction = sum(NUM_LEND_PAGEVIEWS) /
num_total_transactions,
basket_pageviews_per_transaction = sum(NUM_BASKET_PAGEVIEWS) /
num_total_transactions,
visits_per_transaction = total_visits / num_total_transactions
)
head(user_cluster_df)
Now, select only users who have completed a transaction on the website.
transacting_user_df <- user_cluster_df[user_cluster_df$num_total_transactions > 0,]
Scale the data in order to cluster again, by user.
# create df ahead of scaling in order to pass USER_ID
user_cluster_df_scaled <- transacting_user_df
# set center=FALSE to keep scaled values positive.
user_cluster_df_scaled[2:31] <- scale(user_cluster_df_scaled[2:31], center = FALSE,
    scale = apply(user_cluster_df_scaled[2:31], 2, sd, na.rm = FALSE))
# make sure all values are positive and not null.
head(user_cluster_df_scaled, 15)
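A quick programmatic check (instead of eyeballing head()) might look like this:
# Verify the scaled columns contain no missing values and stay non-negative.
any(is.na(user_cluster_df_scaled[2:31]))     # expect FALSE
min(as.matrix(user_cluster_df_scaled[2:31])) # expect >= 0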
Figure out how many clusters to use with the elbow method. This is the same function from earlier; these portions of the script were built in separate environments, so I've kept the duplicate definition in case you're also copying and pasting code across multiple environments or scripts.
wssplot <- function(data, nc=15, seed=0){
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type='b', xlab='Number of Clusters',
       ylab='Within groups sum of squares')}
Run elbow method function with user_cluster_df_scaled.
wssplot(user_cluster_df_scaled[c(2:31)]) # pass df through elbow method graph function
From the elbow graph, it looks like 6 clusters make the most sense… although I’m not sure there’s actually a cluster pattern here at all.
Use KMeans to cluster the data
user_cluster <- kmeans(user_cluster_df_scaled[2:31], 6)
Review the clustering results.
user_cluster$size # size of each cluster
## [1] 87880 101555 752 2158 14976 22
user_cluster_mean_metrics <- aggregate(user_cluster_df_scaled[-1], by=list(cluster=user_cluster$cluster), mean)
user_cluster_mean_metrics
Add a column for user_cluster to the current DataFrames.
user_cluster_df_scaled$user_cluster <- user_cluster$cluster # the scaled df
transacting_user_df$user_cluster <- user_cluster$cluster # the unscaled df
head(transacting_user_df,15) # take another look at the data
Save both DataFrames to CSV files for backup.
write.csv(user_cluster_df_scaled, file = "user_cluster_df_scaled.csv")
write.csv(transacting_user_df, file = "transacting_user_df.csv")
Plot the users to draw insights about their clusters
g1 <- ggplot(data=transacting_user_df,
aes(x=value_per_transaction,
y=visits_per_transaction,
color=user_cluster)) + geom_point(size=.5)
g2 <- ggplot(data=transacting_user_df,
aes(x=value_per_transaction,
y=session_length_per_transaction,
color=user_cluster)) + geom_point(size=.5)
g3 <- ggplot(data=transacting_user_df,
aes(x=value_per_transaction,
y=value_per_visit,
color=user_cluster)) + geom_point(size=.5)
g4 <- ggplot(data=transacting_user_df,
aes(x=value_per_transaction,
y=(total_donation/num_donations),
color=user_cluster)) + geom_point(size=.5)
g1grob <- ggplotGrob(g1)
g2grob <- ggplotGrob(g2)
g3grob <- ggplotGrob(g3)
g4grob <- ggplotGrob(g4)
grid.arrange(g1grob, g2grob, g3grob, g4grob)
g5 <- ggplot(data=transacting_user_df,
aes(x=total_visits,
y=value_per_transaction,
color=user_cluster)) + geom_point(size=.5)
g6 <- ggplot(data=transacting_user_df,
aes(x=lend_pageviews_per_transaction,
y=value_per_transaction,
color=user_cluster)) + geom_point(size=.5)
g7 <- ggplot(data=transacting_user_df,
aes(x=pageviews_per_transaction,
y=value_per_transaction,
color=user_cluster)) + geom_point(size=.5)
g8 <- ggplot(data=transacting_user_df,
aes(x=events_per_transaction,
y=value_per_transaction,
color=user_cluster)) + geom_point(size=.5)
g5grob <- ggplotGrob(g5)
g6grob <- ggplotGrob(g6)
g7grob <- ggplotGrob(g7)
g8grob <- ggplotGrob(g8)
grid.arrange(g5grob, g6grob, g7grob, g8grob)
g9 <- ggplot(data=transacting_user_df,
aes(x=total_visits,
y=total_transaction_value,
color=user_cluster)) + geom_point(size=.5)
g10 <- ggplot(data=transacting_user_df,
aes(x=lend_pageviews_per_transaction,
y=total_transaction_value,
color=user_cluster)) + geom_point(size=.5)
g11 <- ggplot(data=transacting_user_df,
aes(x=pageviews_per_transaction,
y=total_transaction_value,
color=user_cluster)) + geom_point(size=.5)
g12 <- ggplot(data=transacting_user_df,
aes(x=events_per_transaction,
y=total_transaction_value,
color=user_cluster)) + geom_point(size=.5)
g9grob <- ggplotGrob(g9)
g10grob <- ggplotGrob(g10)
g11grob <- ggplotGrob(g11)
g12grob <- ggplotGrob(g12)
grid.arrange(g9grob, g10grob, g11grob, g12grob)
Quick notes:
- Users in cluster 6 spend far more money than the rest of the clusters.
- Users in clusters 3 and 6 engage more with the website than users in other clusters.
Group by user_cluster
Let's look at the transacting_user_df data, grouped by cluster.
transacting_cluster_df <- transacting_user_df %>%
group_by(user_cluster) %>%
summarize(count=n(),
total_visits_per_user = sum(total_visits)/count,
total_loan_purchase_per_user =
sum(total_loan_purchase)/count,
num_of_loan_purchases_per_user =
sum(num_of_loan_purchases)/count,
total_loan_purchase_manage_per_user =
sum(total_loan_purchase_manage)/count,
num_managed_loan_purchases_per_user =
sum(num_managed_loan_purchases)/count,
total_donation_per_user =
sum(total_donation)/count,
num_donations_per_user = sum(num_donations)/count,
total_deposit_per_user = sum(total_deposit)/count,
num_deposits_per_user = sum(num_deposits)/count,
total_gift_card_purchase_per_user =
sum(total_gift_card_purchase)/count,
num_gift_card_purchases_per_user =
sum(num_gift_card_purchases)/count,
average_days_to_next_transaction_per_user =
sum(average_days_to_next_transaction) /
count,
total_transaction_value_per_user =
sum(total_transaction_value)/count,
num_total_transactions_per_user =
sum(num_total_transactions)/count,
value_per_visit_per_user =
sum(value_per_visit)/count,
session_length_per_visit_per_user =
sum(session_length_per_visit)/count,
events_per_visit_per_user =
sum(events_per_visit)/count,
sec_spent_on_lend_per_visit_per_user =
sum(sec_spent_on_lend_per_visit)/count,
lend_pageviews_per_visit_per_user =
sum(lend_pageviews_per_visit)/count,
basket_pageviews_per_visit_per_user =
sum(basket_pageviews_per_visit)/count,
session_cluster_visit_per_user =
sum(session_cluster_visit)/count,
value_per_transaction_per_user =
sum(value_per_transaction)/count,
session_length_per_transaction_per_users =
sum(session_length_per_transaction)/count,
pageviews_per_transaction_per_user =
sum(pageviews_per_transaction)/count,
events_per_transaction_per_user =
sum(events_per_transaction)/count,
sec_spent_on_lend_per_transaction_per_user =
sum(sec_spent_on_lend_per_transaction) /
count,
lend_pageviews_per_transaction_per_user =
sum(lend_pageviews_per_transaction) /
count,
basket_pageviews_per_transaction_per_user =
sum(basket_pageviews_per_transaction) /
count,
visits_per_transaction_per_user =
sum(visits_per_transaction) / count)
# Look for information about user_clusters
knitr::kable(transacting_cluster_df, floating.environment="sidewaystable")
Analysis
Note 1:
Clusters 3 & 6 are much smaller (752 and 22, respectively) than the other clusters.
Note 2:
Based on total_transaction_value_per_user, the highest value users belong to:
- cluster 6 (value/user = 262192.09)
- cluster 3 (value/user = 6217.93)
- cluster 5 (value/user = 582.11)
BUT, based on value_per_transaction_per_user, the highest value users belong to:
- cluster 6 (value/transaction = 1086.23)
- cluster 5 (value/transaction = 121.52)
- cluster 3 (value/transaction = 112.89)
AND, based on visits_per_transaction_per_user, the users who complete the most transactions per visit (i.e., the lowest visits/transaction) belong to:
- cluster 2 (visits/transaction = 0.43)
- cluster 5 (visits/transaction = 0.59)
- cluster 4 (visits/transaction = 0.95)
AND finally, based on total_visits_per_user, the users which visit the most frequently belong to:
- cluster 6 (visits/user = 609.82)
- cluster 3 (visits/user = 185.02)
- cluster 1 (visits/user = 5.73)
- cluster 4 (visits/user = 5.36)
- cluster 5 (visits/user = 2.16)
- cluster 2 (visits/user = 1.49)
Takeaway:
Users in cluster 5 present the greatest opportunity for overall transaction growth for the following reasons:
- Cluster 5 users produce $582 per user in transactions, which is the 3rd highest of all 6 clusters, trailing the 2 smallest clusters (both filled with relative outliers).
- Cluster 5 users’ average transaction value is second highest at $121.52.
- Cluster 5 users, on average, complete 1.69 transactions per website visit, the 2nd most of any cluster.
- Cluster 5 users, on average, visited the website only 2.16 times during the given time period, the 2nd fewest of any cluster.
- Cluster 5 has 14976 users, which is the 3rd largest cluster.
What’s the potential?
If the company could increase the average (mean) number of visits for users in cluster 5 from 2.16 to 4 or 5, then, assuming the 1.69 transactions/visit and $121.52 value_per_transaction don't completely plummet, the total value for cluster 5 should grow from $8.72M to more than $12M, and close to double at 5 visits.
Example:
(4 website visits/user) x (1.69 transactions/visit) x ($121.52/transaction) x (14976 users) = $12.3M
(5 website visits/user) x (1.69 transactions/visit) x ($121.52/transaction) x (14976 users) = $15.4M
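And if you want to sanity-check that arithmetic in R:
# Reproduce the cluster 5 projection arithmetic.
582.11 * 14976              # current total: ~$8.72M
4 * 1.69 * 121.52 * 14976   # at 4 visits/user: ~$12.3M
5 * 1.69 * 121.52 * 14976   # at 5 visits/user: ~$15.4M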
Project conclusion
Ok, so with this project, we took heaps of web traffic and user behavioral data and found the subset, or segment, of users with the highest potential for revenue growth for our client.
The next steps in the project would be to:
- Create heuristics describing this segment
- Meet with the client’s product & marketing team to identify specific methods for interacting with these users
- Possibly sit in on interviews with users within this segment to gather further qualitative data and build a more realistic view of this target user.
Closing Notes
Those next steps are nice and make sense in a perfect world, from a consultant’s perspective. But, if you’re working in a junior role, or in a smaller business, these steps may not be as practical. In that case, I’d suggest finding your clearest path to providing actionable business value with your analysis.
You could add business value by creating a quick test with an email marketing campaign, maybe offering a specific discount. Or maybe you could run Facebook retargeting ads reminding people within this segment to use your product or service. Either way, make sure to track and measure your expenditure and ROI.
If you’re interested in this topic and looking for more help, don’t hesitate to reach out!
Good luck & happy hacking!