Whether you’re selling groceries, financial services, or gym memberships, successful recruitment of new customers is only truly successful if they return to buy from you again. The metric which reflects this is called retention, and the approach we use is customer retention analysis. It’s one of the key metrics that influences revenue. When your customers’ retention is low, you’ll spend all of the income from your business on marketing.
At the same time, retention is easy to improve if you can calculate it the right way using SQL and your database. In this post, we’ll guide you step by step on how to make basic customer retention analysis, how to build customer retention over time, new vs. existing customers retention curves, and how to calculate retention analysis in cohorts.
Basic customer retention curves
Customer retention curves are essential to any business looking to understand its clients, and will go a long way towards explaining other things like sales figures or the impact of marketing initiatives. They are an easy way to visualize a key interaction between customers and the business, which is to say, whether or not customers return — and at what rate — after the first visit.
The first step to building a customer retention curve is to identify those who visited your business during the reference period, what I will call p1. It is important that the length of the period chosen is a reasonable one, and reflects expected frequency of visits.
Different types of businesses are going to expect their customers to return at different rates:
- A coffee shop may choose to use an expected frequency of visits of once a week.
- A supermarket may choose a longer period, perhaps 2 weeks or a month.
In the following example, I will use a month, and assume that we are looking at customer retention of customers who visited in January 2016 over the following year.
As previously stated, the first step is to identify the original pool of customers:
January_pool AS ( SELECT DISTINCT cust_id FROM dataset WHERE month(transaction_date)=1 AND year(transaction_date)=2016)
Then, we look at how those customers behaved over time: for example, how many of them returned per month over the rest of the year?
SELECT Year(transaction_date), Month(transaction_date), count (distinct cust_id) AS number FROM dataset WHERE year(transaction_date)=2016 AND cust_id IN january_pool GROUP BY 1, 2
As you can see, the original SELECT function is included in this second step.
If we had 1000 unique customers in January, we can expect our results to look something like this:
The resulting graph would then look like this:
Evolution of customer retention over time
What is described above is obviously only the first step, as we would also like to see whether there are any trends in customer retention, i.e. are we getting any better at it?
So, one idea we might have is to say: of those who came in January, how many returned in February? Of those who came in February, how many returned in March? And other one-month intervals.
So, then we need to set up an iterative model, which can be built in a few simple steps. First, we need to create a table where each user’s visits are logged by month, allowing for the possibility that these will have occurred over multiple years since whenever our business started operations. I have assumed here that the start date is the year 2000, but you can adjust this as necessary.
Visit_log AS SELECT cust_id, datediff(month, ‘2000–01–01’, transaction_date) AS visit_month FROM dataset GROUP BY 1, 2 ORDER BY 1, 2
This will give us a view that looks like this:
We then need to reorganize this information in order to identify the time lapse between each visit. So, for each person and for each month, see when the next visit is.
Time_lapse AS SELECT cust_id, visit_month lead(visit_month, 1) over (partition BY cust_id ORDER BY cust_id, visit_month) FROM visit_log
We then need to calculate the time gaps between visits:
Time_diff_calculated AS SELECT cust_id, visit_month, lead, lead — visit_month AS time_diff FROM time_lapse
Now, a small reminder of what customer retention analysis measures: it is the proportion of customers who return after x lag of time. So, what we want to do is compare the number of customers visiting in a given month to how many of those return the next month. We also want to define those who return after a certain absence, and those who don’t return at all. In order to do that, we need to categorize the customers depending on their visit pattern.
Custs_categorized AS SELECT cust_id, visit_month, CASE WHEN time_diff=1 THEN ‘retained’, WHEN time_diff>1 THEN ‘lagger’, WHEN time_diff IS NULL THEN ‘lost’ END AS cust_type FROM time_diff_calculated
This will allow us, in a final step, to establish a count of the number of customers who visited in a given month, and how many of those return the next month.
SSELECT visit_month, count(cust_id where cust_type=’retained’)/count(cust_id) AS retention FROM custs_categorized GROUP BY 1
This gives us, month by month, the proportion of customers who returned.
Other techniques for customer retention
There are, of course, other ways to think about customer retention analysis. In particular, we might want to look at how to deal with returning customers: if in p1 there are 100 customers and in p2 80 of them return, in p3 do we want to keep the original 100 or look only at the 80?
In the above example, I talked only about period-to-period retention, but I also classified those who were “laggers,” or who took more than one month to return. A business may want, through targeted marketing efforts, to reward customers who keep a monthly visit pattern, or perhaps, to encourage those who visit less than once a month to come back more often.
So, another way to look at it would be to look at what proportion of our visitors in any given month are retained, how many are returning, and how many are new. In this case, we would want to change our perspective slightly, and look not at the next visit, but at the previous one.
Time_lapse_2 AS SELECT cust_id, Visit_month, lag(visit_month, 1) over (partition BY cust_id ORDER BY cust_id, visit_month) FROM visit_log Time_diff_calculated_2 AS SELECT cust_id, visit_month, lag, visit_month — lag AS time_diff FROM time_lapse_2 Custs_categorized AS SELECT cust_id, visit_month, CASE WHEN time_diff=1 THEN ‘retained’, WHEN time_diff>1 THEN ‘returning’, WHEN time_diff IS NULL THEN ‘new’ END AS cust_type FROM time_diff_calculated_2
So then, it’s simple to count the number of each type of customer in any given month.
SELECT visit_month, cust_type, Count(cust_id) FROM custs_categorized GROUP BY 1, 2
With a simple export and graph, you will get something that looks like this:
A popular way to analyze customer retention is through the use of cohorts, i.e. defining each user by their first visit, and then seeing how they return over time.
Ideally, our end product will be a table like this:
Well, actually, ideally, our table will show that the number of new users is increasing every month, and that retention is improving over time. But you get the idea!
In order to build this, we need first to establish the number of new users for each month, and the amount of time that they stick around. Then, getting the percentages will be easy.
Our first subquery will classify each customer according to their initial visit, and the length of time that they are retained.
first_visit AS SELECT cust_id, min(visit_month) AS first_month FROM visit_log GROUP BY 1
By month, we can then calculate the number of new users.
New_users AS SELECT first_month, count(DISTINCT cust_id) AS new_users FROM first_visit GROUP BY 1
We also want to get a picture of whether a new user in, say, month 1, returned in both month 2 and 3 or only in month 3. So, we’ll create a subquery that allows us to track the visit pattern of each customer.
SELECT * FROM visit_log LEFT JOIN visit_log AS visit_tracker ON visit_log.cust_id = visit_tracker.cust_id AND visit_log.visit_month < visit_tracker.visit_month
Then, we’ll combine these to get a lovely final product. For each month a new user is recruited, we’ll have the size of their group, as well as, for each subsequent month, the percentage of those users who returned. Just like the table above!
In order to do that, we run the following script:
SELECT first_month, new_users, retention_month, retained, retention_percent FROM (SELECT first_month, new_users, ( visit_tracker.visit_month - visit_log.visit_month ) AS retention_month, Count(DISTINCT visit_tracker.cust_id) AS retained, Count(DISTINCT visit_tracker.cust_id)/new_users AS retention_percent FROM first_visit LEFT JOIN visit_log AS visit_tracker ON visit_log.cust_id = visit_tracker.cust_id AND visit_log.visit_month < visit_tracker.visit_month LEFT JOIN new_users ON new_users.first_month = first_visit.first_month GROUP BY 1, 2)
This would give us a table which looks like this:
We can then do a pivot table in Excel (or similar) in order to get the table above, or use a business intelligence tool such as Statsbot to get data right from our database without coding SQL.
Producing such a table is, of course, only the first step, and a pretty useless one at that if we don’t know how to interpret the numbers.
Taking our original table again, I’ll run through the basics of cohort analysis.
First of all, we’re going to look at the number of new users per month. Are we managing to recruit a healthy number of clients for our business? Here we can see that we are recruiting between 500 and 600 new clients per month. Is this what we had aimed to do? This is a discussion that should be had with the marketing team.
Then, is the number of new clients per month rising? If not, why? Are there peaks and troughs? So, for example, we can see that there were only 504 new customers in period 6, but 583 in period 5 and 598 in period 7. These are huge swings in the numbers. If there is something to explain it, like we’re a business that caters to school groups and there was a week of school vacation during that period, fine. But we need to have an explanation.
After that, the actual retention percentages. One business objective is going to be to improve our retention over time, i.e. to have an upward trend as we go down the lines. So, if we retain a higher percentage of new users who joined in period 2 than those who joined in period 1, we are on track.
As we see in the table above, this is what happened: after 1 month, 63% of new users from period 1 returned, but 68% of new users from period 2 returned. So we’re doing well! Basically, our customer experience should be getting better over time, so that people are more likely to come back.
After that, another metric we will look at is the rate at which we lose people. It is expected that, over time, retention will fall. This is the reason that we always need to keep recruiting! But, if we can slow customer attrition, so that the fall between each period gets smaller, then we can spend more resources on serving client needs, and not just on trying to find new clients. This, in turn, will help retention, as clients will like that we are paying more attention to them.
Frequency of visits
One element that I have neglected until now is frequency of visits, which is the other side of customer retention analysis. It is generally accepted that a customer who remains loyal to your business is worth much more than a new customer, or even one whose engagement is patchy (i.e. returning customers).
Adding a count of visits per customer, per month, or average lag time will complete the picture of how customers are interacting with your business. You can also build cohorts with the amount of customers, revenue from returning customers, etc.
Customer retention analysis will add depth to any business analysis, and allow decision makers to track not only the success of their recruitment strategy, but also how well they are performing in terms of customer experience. If your clients don’t come back, then there’s something to be worked on, either in terms of the quality of your product, or in terms of the relationship that you have with your customers. Retention analysis allows for easy flagging of this type of issue.