Listening to customers, at scale
March 08, 2019   ·  9min

How a behavioral tracking system evolves in a rapidly growing startup

When a startup grows, the tracking system needs to change. Growing up is hard. Every startup that survives its first years knows that.

Founding a startup is fundamentally different from growing one, and this is particularly true for web companies. In the beginning they have almost no upfront costs: you might hire a consultant to build the MVP, some bugs in the product are expected, and since no one knows you, traffic is low. The next phase, when customers expect you to be up 24/7 and to deliver tons of new features each day while you pay engineers a top-class salary, is way harder.

Once you have tested your idea, delivering it to a larger audience seems to be the most challenging part: a lot of promising startups have failed during this phase. The biggest issue is the paradigm shift between the two phases.

When you start, the most important feature of your platform is flexibility: you don't know what your customers want and you need to offer as many options as possible. A lot of books have been written about MVPs and pivoting: fail fast and be ready to change everything.

When you have a consolidated business, on the other hand, your platform's best feature is continuous improvement, i.e. the ability to deliver a better experience with each update, in order to offer your customers something that's both easy to use and cool, something that "feels" great and innovative.

And here is where the complexity comes in. When you start, it is enough to have a rough idea about the features you need to improve and the ones you need to drop. Usually, the analysis is simple: either people use them or they don't. Plenty of tools are available for this purpose and if you prefer to implement it yourself, in a sort of self-made Cthulhu-analytics, you are free to do so: the cost of taming complexity is still low.

However, things change when you start scaling. You're no longer interested in which features are the most used, because you know for certain that people will use what you offer. Instead, you need to know how they use them: whether they follow the designed UX, whether they misinterpret some of the content, whether they like it in the end. The level of detail required here is much higher: you need to be able to understand whether the color of a button influences the subscription rate, even by a small percentage.

Here is where behavioral tracking comes in.

Tracking the behavior of your customers means observing, at the finest possible granularity, what they do in your business. Pageviews and orders on your main platform are recorded not only as atomic events, but also as parts of a pattern that enables you to classify your users.

Most companies start this journey by merging Google Analytics metrics with a combination of custom tracked signals and production data. This works very well, particularly in the beginning when flexibility is still valuable, but it doesn't scale well.

Google Analytics has a curious pricing model that goes from zero to $150,000 USD per year when you hit 1M events, unless you're OK with sampled results. Moreover, custom tracked signals are great, but they mean you have a bunch of "extra components" to manage inside every feature you implement in your codebase, which obviously increases complexity and is definitely more error-prone.

When all these problems start to be relevant, it's time to move to a standalone analytics platform.

What were we looking for?

Here at ProntoPro, we have faced a lot of challenges in recent years: our user base went from zero to millions in less than 3 years and the development team followed, growing from 3 to 20 people during the same period. Our BI team is having a hard time taming the complexity of the current analytics sources, so we decided it's time to introduce a whole new stack.

At the moment, behavioral data comes mostly from Google Analytics. The main problem with it is that it's designed to analyze web traffic and lacks adaptability when you try to model events that are not strictly related to the Web. There is no standard way to track something that happens outside of a website. Many third-party plugins are available to track business events from backends or from external services and the dashboard is extremely powerful, but it's not a one-size-fits-all service. Many events require a complex structure to be tracked and relying on GA APIs is not the preferred option.

Most of these tracking events are dispatched using custom code that is deeply coupled with the main platform. This approach gives us the ability to track exactly what we need using the most convenient data structure and relying on the speed of our main DB. Unfortunately, reconciling data later is usually a pain.

A/B testing is a key example: if you want to run one or more tests at the same time using this approach and then analyze their results, you need to fetch data from multiple sources (Google Analytics, the main databases, other services involved), reconcile them into a uniform dataset (aligning timeframes, IDs and so on) and, only at the end of this process, look at the events you want to observe.
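To give a feeling of what that reconciliation involves, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (`clientId`, `user_id`, `created_at`), the sample rows and the helper functions are hypothetical and are not taken from Google Analytics or from our codebase. It only shows the shape of the work: normalize IDs and timestamps from two sources, then count conversions per test variant.

```python
from datetime import datetime, timezone

# Hypothetical export of test exposures from an analytics tool (ISO 8601 timestamps).
ga_rows = [
    {"clientId": "u1", "variant": "A", "ts": "2019-03-01T10:00:00+00:00"},
    {"clientId": "u2", "variant": "B", "ts": "2019-03-01T10:05:00+00:00"},
    {"clientId": "u3", "variant": "B", "ts": "2019-03-01T10:07:00+00:00"},
]

# Hypothetical orders from the main database (Unix timestamps, different ID field).
db_rows = [
    {"user_id": "u2", "created_at": 1551435000},  # 10:10 UTC, after u2's exposure
    {"user_id": "u1", "created_at": 1551430000},  # 08:46 UTC, before u1's exposure
]


def parse_exposure(row):
    """Normalize an analytics row to a common shape with an aware datetime."""
    return {"user": row["clientId"], "variant": row["variant"],
            "at": datetime.fromisoformat(row["ts"])}


def parse_order(row):
    """Normalize a database row to the same shape."""
    return {"user": row["user_id"],
            "at": datetime.fromtimestamp(row["created_at"], tz=timezone.utc)}


def conversions_by_variant(exposures, orders):
    """Count exposed and converted users per variant.

    An order counts as a conversion only if it happened at or after the
    user's first exposure to the test; each user is counted at most once.
    """
    first_exposure = {}
    for exposure in sorted(exposures, key=lambda e: e["at"]):
        first_exposure.setdefault(exposure["user"], exposure)

    result = {}
    for exposure in first_exposure.values():
        stats = result.setdefault(exposure["variant"], {"exposed": 0, "converted": 0})
        stats["exposed"] += 1

    converted_users = set()
    for order in orders:
        exposure = first_exposure.get(order["user"])
        if exposure is None or order["at"] < exposure["at"]:
            continue  # unknown user, or order placed before the test
        if order["user"] not in converted_users:
            converted_users.add(order["user"])
            result[exposure["variant"]]["converted"] += 1
    return result


summary = conversions_by_variant(
    [parse_exposure(r) for r in ga_rows],
    [parse_order(r) for r in db_rows],
)
print(summary)
# {'A': {'exposed': 1, 'converted': 0}, 'B': {'exposed': 2, 'converted': 1}}
```

Even in this toy version, two parsers and an ordering rule are needed before a single number can be read, and every additional source adds another parser of its own.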

This is not rocket science, but it has a cost, and that cost scales at least linearly with the size of your application and the size of your user base. Improving the quality of our platform requires a lot of tests, and the effort required for the analysis puts an upper bound on the speed at which we can evolve our service. This is not good, at all.

It's clear that the current approach doesn't scale. We want to introduce something different, something able to overcome these limits. So, what are the requirements?

First of all, the system should be general-purpose. We need a tracking system that is able to track generic events together with all the information required to understand where each event happened, hence the context of our system. It should not matter whether the event happened on the frontend, on the backend, in an external service, 5 days ago or whatever. It should collect any event data without imposing hard constraints on the event structure.
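One way to picture such a generic event is as a small envelope wrapped around a free-form payload. The following Python sketch is only an illustration of the requirement: the `Event` class, its field names and the `serialize` helper are hypothetical and do not reflect the format of any specific tracking product.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """Hypothetical envelope for a generic tracked event."""
    name: str         # what happened, e.g. "quote_sent"
    source: str       # where it happened: "frontend", "backend", "crm", ...
    occurred_at: str  # when it happened (ISO 8601); may be days in the past
    payload: dict = field(default_factory=dict)  # free-form, no fixed structure
    context: dict = field(default_factory=dict)  # surrounding state (user, page, app version)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def serialize(event):
    """Turn the envelope into a JSON string ready to be shipped to a collector."""
    return json.dumps(asdict(event), sort_keys=True)


# A backend event reported late: it occurred days before it is collected.
late_event = Event(
    name="quote_sent",
    source="backend",
    occurred_at="2019-03-03T09:30:00+00:00",
    payload={"quote_id": 42, "amount_eur": 120.0},
    context={"user_id": "u1", "service": "quotes-api"},
)
print(serialize(late_event))
```

Keeping `occurred_at` separate from `collected_at` is what allows an event from "5 days ago" to be recorded faithfully, while the untyped `payload` avoids imposing a structure at collection time.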

From an analytical point of view, one key aspect is the decoupling between the collection and the modeling of events. We want to collect raw events and postpone the modeling phase, in which we give a meaning to each event, to a later time.
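As a rough illustration of this decoupling, the Python sketch below keeps collection as a plain append to a raw log and expresses each model as an independent function applied afterwards. The names (`raw_log`, `collect`, `model_signups`, `model_activity_per_user`) and the sample events are hypothetical, and a real pipeline would use durable storage rather than an in-memory list.

```python
from collections import Counter

raw_log = []  # append-only store of raw events; in-memory only for this sketch


def collect(raw_event):
    """Collection step: store a copy of the event as-is, with no interpretation."""
    raw_log.append(dict(raw_event))


def model_signups(events):
    """A first model, written when we only cared about signups."""
    return [
        {"user": e["payload"]["user"], "plan": e["payload"].get("plan", "free")}
        for e in events
        if e.get("name") == "signup"
    ]


def model_activity_per_user(events):
    """A second model added later: it reuses the same raw events, untouched."""
    return Counter(
        e["payload"]["user"] for e in events if "user" in e.get("payload", {})
    )


collect({"name": "signup", "payload": {"user": "u1", "plan": "pro"}})
collect({"name": "pageview", "payload": {"user": "u1", "path": "/pricing"}})
collect({"name": "signup", "payload": {"user": "u2"}})

signups = model_signups(raw_log)
activity = model_activity_per_user(raw_log)
print(signups)   # [{'user': 'u1', 'plan': 'pro'}, {'user': 'u2', 'plan': 'free'}]
print(activity)  # Counter({'u1': 2, 'u2': 1})
```

Because nothing is interpreted at collection time, a question that nobody thought of when the events were tracked can still be answered later by writing a new model over the same raw data.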

From an engineering point of view, another important aspect is modularity. Many platforms offer a complete set of tools that enables you to do almost anything with the data you collect. This is cool but usually most of these tools are suited for edge cases you are not interested in. The software we are looking for doesn't force you to deploy something you are not interested in using.

Obviously, scalability is another fundamental requirement. The ideal project runs well on any cloud provider and scales to billions of events per day and more with a predictable cost.

Last, but very important for us, it needs to be open source. We are in love with open source so our preferred choices in tech go for enterprise-grade OSS projects.

What are the alternatives?

So, which candidates satisfy all these requirements? Actually, not many. In the field of open source event tracking platforms, there are only a few notable players.

Matomo Analytics (formerly known as Piwik Analytics) has been one of the most popular solutions in recent years. It was born on the LAMP stack and still preserves this legacy. The project has evolved quickly, changed its name, and now offers one of the most interesting alternatives to Google Analytics on the market. However, the MySQL bottleneck is still a problem and most of its focus is on web tracking, not on being a general-purpose collector.

Open Web Analytics is one of the oldest alternatives on the market. The project is about 10 years old and offers easy integrations for stuff like WordPress and MediaWiki. It was probably a good inspiration for most of the other projects, but it doesn't fit our modern requirements well.

Divolte Collector is designed to be a large scale clickstream system. It is really easy to use and relies on well-tested components, Hadoop and Kafka, that proved to be able to scale easily to billions of events. Unfortunately, it doesn't offer anything to model and analyze events.

Fathom Analytics seems like a really promising project. It's built on Golang and Preact and it's available for testing within minutes using the official Docker image. The first commit was less than 2 years ago and the feeling is that it's still too young for an enterprise environment.

Snowplow Analytics is, to us, the best alternative on the market. It offers a general-purpose solution to track any kind of event from any source. The project is organized into different components that you can combine in order to build a pipeline that collects and transforms your data exactly as you want. Moreover, by leveraging the AWS and Hadoop ecosystems, it scales easily to billions of data points.

Reasons behind our choice

For these reasons, we decided to pick Snowplow Analytics. We really enjoy the flexibility offered by the pipeline, which gives us a lot of choices for integrating and customizing the tools. Our biggest concern, on the other hand, is the absence of a serving layer. Most of the competitors offer some kind of dashboard, even if only a simple one. The Snowplow team instead relies on third-party tools, which are powerful but are still another piece to integrate. A few years ago, Viadeo started an open source project for a Snowplow Dashboard but the project is currently abandoned.

However, the pros still outweigh the cons. We started building our Snowplow pipeline a couple of months ago and we are planning to integrate it almost everywhere in the technological stack of ProntoPro.

Obviously, the integration process will be progressive. First, we'll have Snowplow track the same events that Google Analytics is tracking, side by side. Then we will integrate the Snowplow connector into any new custom tracking added to the platform and, finally, we will connect it to the external systems. The legacy tracking code will be migrated only when required or when it is updated anyway. We don't want to waste engineering time on something that does not add real value to the project.

Our mission is to build an amazing service. We really hope the insights we can get by tracking our users' behavior at this level of detail will help us to continuously improve our features and make our customers happier. That is all we want.