Behavioral Segmentation With R Using Google Analytics Data 📈

Customer segmentation is incredibly important, and now, incredibly practical.

In this article, I’m going to show you how to perform customer segmentation, using R, with data from Google Analytics.

Fair warning, this is from a real contract with a client. The data and code shared here have been scrubbed, and I won’t be sharing the raw dataset, to avoid anyone reverse-engineering who the client is.

Note:
I don’t want to waste your time if you already know what segmentation is and just want to jump to the code, so here is our code-free intro guide to customer segmentation for the casual reader:

And here’s our intro guide to technical customer segmentation with code examples in Python:

Ok, now that we know you’re in the right place, let’s dive into what to expect from this write-up.


This write-up is for two groups of people:

  1. A businessperson with some technical chops who is interested in using their web traffic or in-app data to drive insights.
  2. Or a data scientist, analyst, or technical marketer looking for a way to perform customer segmentation.

Why this matters to you

This is a real-world example of a contract I completed for a client which led to actionable insights.

What I’m going to tell you

How I took vague and loose guidance and turned it into potentially valuable and actionable insights for a client. Hopefully, you’ll be able to take this information and do the same for your business or clients.

Questions I’m going to answer

  • What value does this add?
  • How to perform cluster analysis from start to finish, based on web traffic and in-app user data.

Alright, now that you know what to expect, let’s take a look at the project.

Background

The Client

This was a quick, one-off contract with a client in the financial services and crowdfunding space. They have significant web traffic and wanted to start looking at their data. They didn’t know where to start, and like virtually every data project, they wanted a quick win.

The Situation

I looked at gigabytes of in-app user data plus web traffic data, and I got excited about the challenge.

I typically code in Python, but the client asked for the code to be written in R. This increased the challenge, which only made the project more interesting to me.

Note to new Data Nerds:
Honestly, after you’ve been working on data projects for a while, you start looking for different projects to challenge yourself. Otherwise, you’ll get stuck in a niche and bore yourself to death.

The Client’s Goals

The Client’s goals were simple: “Give me something that leads to more revenue, so I can justify using more analytics and data science to drive business decisions.”

My Goals

Jiu-Jitsu the heck out of this data, and squeeze every penny of value I can out of it for my client. Delivering quickly for clients like this will make you lifelong friends, especially when you give someone firepower to substantiate their initiative.

Note:
Understanding the internal context of the client, or your own business, is critical to succeeding as a professional in data science.

Alright, now let’s dive into the data…

The data

What was the data

The client’s data consisted of all of their web traffic and in-app actions for a 6 month period. Each of the 2.4 million rows in the dataset represents a single website session, with a Session ID. The 66 original columns are used to describe that session.

How did this shape my thinking?

As soon as I looked at the data, I knew I wanted to apply some form of segmentation. Customer segmentation is easily one of the most valuable technical steps a company can take because it surfaces opportunities.

Application – The Code

Get started

Install & load packages

You may need to install additional packages, depending on the environment you’re working in. If so, just add the package to the install.packages(c(…)) call.

install.packages(c("NbClust", "flexclust", "rattle", "fastDummies", "clusterSim"), repos = "http://cran.us.r-project.org")

library(lattice)      # multivariate data visualization
library(vcd)          # data visualization for categorical variables
library(cluster)      # cluster analysis methods
library(flexclust)    # flexible clustering utilities
library(grid)         # low-level grid graphics
library(gridExtra)    # arranging multiple grid plots
library(readr)        # fast CSV reading
library(fastDummies)  # dummy-variable creation
library(dplyr)        # data manipulation
library(readxl)       # reading Excel files
library(ggplot2)      # plotting

Upload data

This data file was huge, ~1.7GB, so loading the data took a while.

df <- read_csv("session_data.csv")

Check out the data

head(df)

Data Selection

Create dummy variables for categorical variables

This data set had a ton of categorical variables, which distance-based methods like k-means can’t use directly. With that in mind, I tend to convert categorical variables into dummy variables.

Note: There are easier ways to go about creating dummy variables, but I was moving quickly through this stage and simply copied and pasted the column names from a list of categorical variables.
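For example, one of those easier ways is to detect the categorical (character or logical) columns programmatically rather than hand-copying names. A minimal sketch on a toy data frame (not the client data; the column names here are made up):

```r
# Toy data frame standing in for the session data
toy <- data.frame(
  SESSION_ID  = 1:4,
  DEVICE_TYPE = c("Desktop", "Mobile", "Desktop", "Tablet"),
  IS_BOUNCE   = c(TRUE, FALSE, TRUE, TRUE),
  NUM_EVENTS  = c(3, 7, 1, 5),
  stringsAsFactors = FALSE
)

# Pick up the categorical columns automatically instead of pasting names by hand
cat_cols <- names(toy)[sapply(toy, function(x) is.character(x) || is.logical(x))]
cat_cols

# The resulting vector can then be fed straight into dummy_cols():
# df <- dummy_cols(df, select_columns = cat_cols, remove_first_dummy = TRUE)
```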

cols <- c('IS_BOUNCE','IS_NEW_USER', 'IS_NEW_USER_AND_IDENTIFIER',
          'REFERRAL_SOURCE_BUCKET', 'LANDED_ON_LEND', 'LANDED_ON_HOME',
          'DEVICE_TYPE', 'HAS_TRANSACTION_EVENT', 'HAS_REGISTRATION',
          'TRANSACT_WITHIN_30_DAYS', 'TRANSACT_WITHIN_14_DAYS',
          'HAS_CATEGORY_OR_OTHER_LEND_PATH', 'USED_SEARCH_QUERY',
          'USED_EXPIRING_SOON_SORT', 'USED_REPAYMENT_TERM_SORT',
          'USED_OTHER_NON_POPULARITY_SORT', 'USED_TERM_MAX_FILTER',
          'USED_RISK_MAX_FILTER', 'USED_GENDER_FILTER_MALE',
          'USED_GENDER_FILTER_FEMALE', 'USED_COUNTRY_FILTER',
          'USED_SECTOR_FILTER', 'USED_ATTRIBUTE_FILTER',
          'USED_TAG_FILTER')

df <- dummy_cols(df, select_columns=cols, remove_first_dummy = TRUE)

Replace NA (null) values with 0

For some of the upcoming processes, we needed to eliminate all null values, which can be accomplished with the following code:

df[is.na(df)] <- 0

Note: The above code is a basic imputation method, and there are plenty of other ways to accomplish this task.
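One of those other ways is per-column median imputation, which is less distorting than zeros for skewed numeric columns. A quick sketch on a toy frame:

```r
# Toy frame with missing values
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))

# Median imputation per numeric column -- one alternative to filling with 0
for (col in names(toy)) {
  if (is.numeric(toy[[col]])) {
    toy[[col]][is.na(toy[[col]])] <- median(toy[[col]], na.rm = TRUE)
  }
}
toy
```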

Create a new dataframe for cluster analysis

At this point, we need to pull together all of the columns we want to use moving forward, dummy variables included.

new_cols = c('COOKIE_ID', 'SESSION_ID', 'USER_ID',
             'SESSION_DAY_OF_MONTH', 'SESSION_DAY_OF_WEEK', 
             'SESSION_LENGTH_MINUTES', 'COOKIE_SESSION_NUMBER', 
             'USER_SESSION_NUMBER','NUM_PAGE_VIEWS',
             'NUM_EVENTS', 'LOAN_PURCHASE_INDIVIDUAL',
             'LOAN_PURCHASE_MANAGED', 'DONATION_INDIVIDUAL', 
             'DEPOSIT_INDIVIDUAL',
             'GIFT_PURCHASE_INDIVIDUAL', 'DAYS_TO_NEXT_TRANSACTION',
             'SESSION_DEVICE_START_HOUR', 'SECONDS_SPENT_ON_LEND', 
             'NUM_LEND_PAGEVIEWS', 'NUM_BASKET_PAGEVIEWS',
             'NUM_CLICK_READ_MORE_EVENTS', 'NUM_CLICK_FILTER_BUTTON_EVENTS',
             'NUM_SEARCH_EVENTS','NUM_ADD_TO_BASKET_EVENTS',
             'NUM_SEARCH_QUERIES','MAX_RESULT_PAGE_NUMBER',
             'IS_NEW_USER_TRUE', 'IS_NEW_USER_AND_IDENTIFIER_TRUE',
             'REFERRAL_SOURCE_BUCKET_other', 'REFERRAL_SOURCE_BUCKET_direct', 
             'REFERRAL_SOURCE_BUCKET_search', 'REFERRAL_SOURCE_BUCKET_social',
             'REFERRAL_SOURCE_BUCKET_email', 'LANDED_ON_LEND_FALSE', 
             'LANDED_ON_HOME_TRUE', 'DEVICE_TYPE_Desktop', 'DEVICE_TYPE_Tablet',
             'DEVICE_TYPE_Mobile Device', 'DEVICE_TYPE_Ebook Reader',
             'DEVICE_TYPE_TV Device', 'DEVICE_TYPE_NA', 'DEVICE_TYPE_Console',
             'DEVICE_TYPE_Car Entertainment System', 
             'HAS_TRANSACTION_EVENT_yes',
             'HAS_REGISTRATION_TRUE', 'TRANSACT_WITHIN_30_DAYS_TRUE', 
             'TRANSACT_WITHIN_14_DAYS_TRUE', 
             'HAS_CATEGORY_OR_OTHER_LEND_PATH_FALSE',
             'USED_SEARCH_QUERY_TRUE', 'USED_EXPIRING_SOON_SORT_TRUE',
             'USED_REPAYMENT_TERM_SORT_TRUE', 
             'USED_OTHER_NON_POPULARITY_SORT_TRUE',
             'USED_TERM_MAX_FILTER_TRUE', 'USED_RISK_MAX_FILTER_TRUE',
             'USED_GENDER_FILTER_MALE_TRUE', 'USED_GENDER_FILTER_FEMALE_TRUE',
             'USED_COUNTRY_FILTER_TRUE', 'USED_SECTOR_FILTER_TRUE',
             'USED_ATTRIBUTE_FILTER_TRUE', 'USED_TAG_FILTER_TRUE')


data_for_cluster <- df[new_cols]
head(data_for_cluster)

Create a dataframe of just users with a USER_ID

Here I want to reduce the dataset only to visitors with a USER_ID. This will limit our dataset to people who are active users of my client’s application.

user_df <- subset(data_for_cluster, data_for_cluster$USER_ID != 0)
length(user_df$SESSION_ID)
head(user_df)

Now let’s scale the user_df in preparation for cluster analysis

If you want to know more about why scaling matters, check out the technical guide to customer segmentation I wrote. I also suggest checking out this StackExchange post:

The issue is what represents a good measure of distance between cases. If you have two features, one where the differences between cases is large and the other small, are you prepared to have the former as almost the only driver of distance? So for example, if you clustered people on their weights in kilograms and heights in metres, is a 1kg difference as significant as a 1m difference in height? Does it matter that you would get different clusterings on weights in kilograms and heights in centimetres? If your answers are "no" and "yes" respectively, then you should probably scale. On the other hand, if you were clustering Canadian cities based on distances east/west and distances north/south then, although there will typically be much bigger differences east/west, you may be happy just to use unscaled distances in either kilometres or miles (though you might want to adjust degrees of longitude and latitude for the curvature of the earth).

user_df_scaled <- user_df
user_df_scaled[c(4:60)] <- scale(user_df_scaled[c(4:60)])
head(user_df_scaled)

Notice the range of values.

Save the DataFrames as CSV files

This dataset is large and crashed my machine a few times. To avoid repeating work, I exported CSV files regularly throughout this project.

write.csv(user_df_scaled, file = "user_df_scaled.csv") 

Cluster Analysis

We’re using partitioning due to the size of this dataset. If the dataset were substantially smaller, I’d consider hierarchical clustering.

How many clusters?

Build a function to produce a chart for the elbow method.

Note:
The silhouette score calculation was too computationally expensive with this dataset. Otherwise, the silhouette score would have been calculated and used to select the number of clusters to use for segmentation.
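That said, the silhouette score is still affordable on a random sample. A rough sketch on synthetic data (not the client data), using the cluster package loaded earlier:

```r
library(cluster)  # for silhouette()

set.seed(0)
# Synthetic stand-in: two well-separated blobs of points
toy <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 5), ncol = 2))

# Draw a manageable sample, cluster it, then score it
samp <- toy[sample(nrow(toy), 100), ]
fit  <- kmeans(samp, centers = 2, nstart = 10)

sil <- silhouette(fit$cluster, dist(samp))
mean(sil[, "sil_width"])  # average silhouette width; closer to 1 is better
```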

wssplot <- function(data, nc = 10, seed = 0) {
  # Total within-groups sum of squares for k = 1
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters',
       ylab = 'Within-groups sum of squares')
}

Now run the function for the elbow method

Beware: this function takes a significant amount of time to run (10-90 minutes for this dataset)

user_df_scaled[is.na(user_df_scaled)] <- 0 # Remove null values ... again

wssplot(user_df_scaled[c(4:60)]) # feed dataframe through elbow method function

From the elbow plot, it looks like 7 clusters make the most sense.

KMeans Clustering

fit.km <- kmeans(user_df_scaled[c(4:60)], 7, nstart=25)

Check the size of each cluster.

fit.km$size

Let’s look at the overall cluster metrics.

cluster_mean_metrics <- aggregate(user_df_scaled[-1], by=list(cluster=fit.km$cluster), mean)

cluster_mean_metrics

Look into cluster data for the number of transaction events.

transaction_event <- table(user_df_scaled$HAS_TRANSACTION_EVENT_yes, fit.km$cluster)
transaction_event

Add cluster number to each session entry in the user_df’s.

Here I kept scaled and unscaled df’s for further explainability and visualizations in the next stages of the project.

user_df_scaled$cluster <- fit.km$cluster
user_df$cluster <- fit.km$cluster 

Write both user_df’s to CSV file for backup.

Note: Originally, this project was executed using an R distribution on Google Colab for the use of GPUs and the ability to run multiple notebooks at the same time. That’s part of the reason for CSV saving throughout the project.

write.csv(user_df_scaled, file = "user_df_scaled.csv")
write.csv(user_df, file = "user_df.csv")

Group by cluster – clustered by session

cluster_df <- user_df %>% group_by(cluster) %>%
              summarise(Count=n(),
                 total_loan_purchase = sum(LOAN_PURCHASE_INDIVIDUAL),
                 total_loan_purchase_manage = sum(LOAN_PURCHASE_MANAGED),
                 total_donation = sum(DONATION_INDIVIDUAL),
                 total_deposit = sum(DEPOSIT_INDIVIDUAL),
                 total_gift_card_purchase = sum(GIFT_PURCHASE_INDIVIDUAL),
                 average_days_to_next_transaction=sum(DAYS_TO_NEXT_TRANSACTION)/
                                 length(DAYS_TO_NEXT_TRANSACTION),
                 total_transaction_value = (total_loan_purchase + 
                                 total_loan_purchase_manage + total_donation + 
                                 total_deposit + total_gift_card_purchase),
                 total_visits = length(unique(SESSION_ID)),
                 average_visit_value = (total_transaction_value / total_visits))

cluster_df # take a look at the data grouped by cluster and clustered by session

Let’s look at a couple of charts to get a feel for the data.

g1 <- ggplot(data=cluster_df,
             aes(x=total_transaction_value,
                 y=average_days_to_next_transaction,
                 color=cluster)) + geom_point(size=2.0)

g2 <- ggplot(data=cluster_df,
            aes(x=total_transaction_value,
                y=total_visits, color=cluster)) + geom_point(size=2.0)

g1grob <- ggplotGrob(g1)
g2grob <- ggplotGrob(g2)

grid.arrange(g1grob, g2grob)

Now let’s group the data by user (USER_ID)

This should allow us to get a better understanding of how users vary.

grouped_user_df <- user_df %>% group_by(USER_ID) %>%
  summarize(total_visits = n(),
            total_loan_purchase = sum(LOAN_PURCHASE_INDIVIDUAL),
            total_loan_purchase_manage = sum(LOAN_PURCHASE_MANAGED),
            total_donation = sum(DONATION_INDIVIDUAL),
            total_deposit = sum(DEPOSIT_INDIVIDUAL),
            total_gift_card_purchase = sum(GIFT_PURCHASE_INDIVIDUAL),
            average_days_to_next_transaction = (sum(DAYS_TO_NEXT_TRANSACTION) / 
                                              length(DAYS_TO_NEXT_TRANSACTION)),
            total_transaction_value = (total_loan_purchase + 
                                       total_loan_purchase_manage + 
                                       total_donation + total_deposit + 
                                       total_gift_card_purchase),
            average_visit_value = (total_transaction_value / total_visits),
            cluster = sum(cluster)/total_visits)

head(grouped_user_df) # Take a look at the columns and values

Let’s look at a couple of charts.

g1 <- ggplot(data=grouped_user_df, 
             aes(x=total_transaction_value, 
                 y=average_days_to_next_transaction, 
                 color=cluster)) + geom_point(size=0.5)
      
      
g2 <- ggplot(data=grouped_user_df, 
             aes(x=average_days_to_next_transaction,
                 y=total_transaction_value, 
                 color=cluster)) + geom_point(size = 0.5)

g1grob <- ggplotGrob(g1)
g2grob <- ggplotGrob(g2)

grid.arrange(g1grob, g2grob)

It looks like the highest-value customers, based on total_transaction_value, have smaller gaps between transactions than the less valuable customers.

ggplot(data=grouped_user_df, 
       aes(x=total_transaction_value, 
           y=total_visits, color=cluster)) + geom_point(size=1)

There’s one extreme outlier in terms of total_visits. Let’s remove that user and look at this graph again.

max(grouped_user_df$total_visits, na.rm=TRUE) # find max value of outlier
## [1] 102715

Take a look at the top 5 in total visits to grab its USER_ID.

head(arrange(grouped_user_df, desc(total_visits)))

Oddly enough, the user with 102,715 sessions made 0 transactions. I wonder if this is a company or test account, or maybe a crawler scraping the website.

grouped_user_df[grouped_user_df$USER_ID == '2ea26f2b39c2ec5e592856a6429fdb96e0f6b21f09d5ebbe233e0396387b924b', ]

Tighten the range of total_visits and total_transaction_value to improve the granularity of the plot.

tight_user_df <- grouped_user_df[(grouped_user_df$total_visits <= 300) & (grouped_user_df$total_transaction_value > 100), ]

ggplot(data=tight_user_df, 
       aes(x=total_transaction_value, 
           y=total_visits, color=cluster)) + geom_point(size=1)

Compare the number of users in the tight_user_df and user_df.

length(tight_user_df$cluster)
## [1] 70964
length(grouped_user_df$cluster)
## [1] 315936

Let’s take a look at the overall distribution of total_transaction_values.

hist(grouped_user_df$total_transaction_value, 
     xlab='Total Transaction Value', 
     main='Frequency of Total Transaction Value', 
     col='lightgreen', xlim=c(0,1500), breaks=25000) # 25000 breaks to improve granularity of histogram

It looks like most users have a total_transaction_value below $500. Let’s confirm how many users that covers.

low_trans_val <- grouped_user_df[grouped_user_df$total_transaction_value <=500, ]
length(low_trans_val$cluster)
## [1] 303394

As expected, the vast majority of users have a total_transaction_value under $500 – 303,394 of the 315,936 registered users in this dataset.

Now, let’s look at the distribution of total_transaction_value.

quantile(grouped_user_df$total_transaction_value, probs = c(0.315, 0.5, 0.65, 0.75, 
                                                            0.95, 0.99, 0.999, 0.9999, 
                                                            0.99999, 0.999999))

Note: In future work, it would make sense to remove the outliers with the highest transaction values. For this project, considering it’s more of an exploratory process, we’ll leave them in the dataset.
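If you do want to remove them, a simple quantile cutoff works. A sketch on a toy vector (the threshold of 0.999 is my arbitrary choice for illustration):

```r
set.seed(0)
# Toy vector with one extreme outlier tacked on
vals <- c(rnorm(1000, mean = 200, sd = 50), 250000)

# Drop anything above the 99.9th percentile
cutoff  <- quantile(vals, probs = 0.999)
trimmed <- vals[vals <= cutoff]

c(before = max(vals), after = max(trimmed))
```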

Cluster by user

Group by USER_ID and create variables that tell us more about the users.

Notice we’re using the user_df instead of grouped_user_df for this grouping.

Now, select only users who have completed a transaction on the website.

Scale the data in order to cluster again, by user.
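The original code for these two steps lived in a separate notebook and isn’t shown here. A minimal sketch on a toy frame, using the transacting_user_df and user_cluster_df_scaled names the rest of the write-up refers to:

```r
# Toy per-user frame standing in for grouped_user_df
grouped_toy <- data.frame(
  USER_ID                 = c("a", "b", "c", "d"),
  total_visits            = c(2, 10, 1, 6),
  total_transaction_value = c(0, 250, 0, 75)
)

# Keep only users who actually transacted on the site
transacting_user_df <- subset(grouped_toy, total_transaction_value > 0)

# Scale the numeric columns before clustering by user
user_cluster_df_scaled <- transacting_user_df
user_cluster_df_scaled[c("total_visits", "total_transaction_value")] <-
  scale(user_cluster_df_scaled[c("total_visits", "total_transaction_value")])

user_cluster_df_scaled
```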

Figure out how many clusters to use with the elbow method. This is the same function from earlier; these portions of the script were created in separate environments, so I’ve kept the duplicate definition in case you’re also copying and pasting code across multiple environments or scripts.

Run elbow method function with user_cluster_df_scaled.

From the elbow graph, it looks like 6 clusters make the most sense… although I’m not sure there’s actually a cluster pattern here at all.

Use KMeans to cluster the data

Review the clustering results.

Add a column for user_cluster to the current DataFrames.
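The original kmeans call for the user-level clustering isn’t shown either; here’s a sketch on synthetic data of fitting 6 clusters (the count suggested by the elbow plot) and attaching the assignments:

```r
set.seed(0)
# Toy scaled matrix standing in for user_cluster_df_scaled's numeric columns
scaled_toy <- matrix(rnorm(600), ncol = 3)  # 200 "users", 3 features

# 6 clusters, as suggested by the elbow plot for the user-level data
fit.km.user <- kmeans(scaled_toy, centers = 6, nstart = 25)

# Attach the assignment back onto the (toy) user frame
user_toy <- data.frame(USER_ID = seq_len(nrow(scaled_toy)))
user_toy$user_cluster <- fit.km.user$cluster

table(user_toy$user_cluster)  # size of each user cluster
```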

Save the DataFrames to CSV files for backup.

Plot the users to draw insights about their clusters

Quick notes:

  • users in cluster 6 spend far more money than the rest of the clusters.
  • users in clusters 3 and 6 engage more with the website than users in other clusters.

Group by user_cluster

Let’s look at transacting_user_df data, grouping by cluster

Analysis

Note 1:

Clusters 3 & 6 are much smaller (752 and 22, respectively) than the other clusters.

Note 2:

Based on total_transaction_value_per_user, the highest value users belong to:

  1. cluster 6 (total value = 262192.09)
  2. cluster 3 (total value = 6217.93)
  3. cluster 5 (total value = 582.11)

BUT, based on value_per_transaction_per_user, the highest value users belong to:

  1. cluster 6 (value/transaction = 1086.23)
  2. cluster 5 (value/transaction = 121.52)
  3. cluster 3 (value/transaction = 112.89)

AND, based on visits_per_transaction_per_user, the users who need the fewest visits per transaction (i.e., transact most often per visit) belong to:

  1. cluster 2 (visits/transaction = 0.43)
  2. cluster 5 (visits/transaction = 0.59)
  3. cluster 4 (visits/transaction = 0.95)

AND finally, based on total_visits_per_user, the users which visit the most frequently belong to:

  1. cluster 6 (visits/user = 609.82)
  2. cluster 3 (visits/user = 185.02)
  3. cluster 1 (visits/user = 5.73)
  4. cluster 4 (visits/user = 5.36)
  5. cluster 5 (visits/user = 2.16)
  6. cluster 2 (visits/user = 1.49)

Takeaway:

Users in cluster 5 present the greatest opportunity for overall transaction growth for the following reasons:

  1. Cluster 5 users produce $582 per user in transactions, which is the 3rd highest of all 6 clusters, trailing the 2 smallest clusters (both filled with relative outliers).
  2. Cluster 5 users’ average transaction value is second highest at $121.52.
  3. Cluster 5 users complete, on average, 1.69 transactions per website visit.
  4. Cluster 5 users, on average, visited the website only 2.16 times during the given time period which is the 2nd least.
  5. Cluster 5 has 14,976 users, which is the 3rd largest cluster.

What’s the potential?

If the company could increase the average (mean) number of visits for users in cluster 5 from 2.16 to 4 or 5, then, assuming the 1.69 transactions/visit and $121.52 value_per_transaction don’t completely plummet, the total value for cluster 5 should climb from $8.72M to well over $12M.

Example:
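A back-of-the-envelope version of that math. The inputs come from the cluster summary above; the proportional-scaling assumption (no decay in per-visit behavior) is mine and gives an optimistic upper bound:

```r
users           <- 14976   # cluster 5 size (from the analysis above)
value_per_user  <- 582.11  # total_transaction_value_per_user
visits_per_user <- 2.16    # current average visits in the period

baseline <- users * value_per_user          # roughly $8.72M today

# If visits rise to 4 and per-visit behavior holds exactly (optimistic),
# value scales proportionally with visits:
upside <- baseline * (4 / visits_per_user)  # roughly $16.1M upper bound

c(baseline = baseline, proportional_upside = upside)
```

Even allowing for substantial decay in transactions/visit and value/transaction, there’s plenty of headroom between $8.72M and that upper bound, which is why $12M+ is a reasonable target.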

Project conclusion

Ok, so with this project, we took heaps of web traffic and user behavioral data and found the subset, or segment, of users with the highest potential for revenue growth for our client.

The next steps in the project would be to:

  • Create heuristics describing this segment
  • Meet with the client’s product & marketing team to identify specific methods for interacting with these users
  • Possibly sit in on interviews with users in this segment to gather further qualitative data and build a more realistic view of this target user.

Closing Notes

Those next steps are nice and make sense in a perfect world, from a consultant’s perspective. But, if you’re working in a junior role, or in a smaller business, these steps may not be as practical. In that case, I’d suggest finding your clearest path to providing actionable business value with your analysis.

You could add business value by creating a quick test with an email marketing campaign, maybe offering a specific discount. Or maybe you could run Facebook retargeting ads reminding people within this segment to use your product or service. Either way, make sure to track and measure your expenditure and ROI.


If you’re interested in this topic and looking for more help, don’t hesitate to reach out! I provide technical marketing services like this segmentation for clients, and also don’t mind offering free consultations for people looking for tips.

Good luck & happy hacking!
