Six Minute AB Testing
// date: 2021-12-15
// filed: programming, golang
// perma
So, you want to roll features out slowly or maybe you just want to dip your toes in the water to try out that shiny new button. Great! AB testing has multitudes of methodologies, articles, and optometrists working on the problem. Let's roll our own and then explore the edge cases, problems, and considerations involved with AB testing.
AB Testing
There's all kinds of ways to do AB testing, some using DB functions,
others using hashing methods, and some are as simple as just
mod
ing an id. Each have their benefits and draw backs but
we'll aim for speed and even distribution even with small sample sizes.
Even distribution with a small sample size is doable but complex with
just a plain old mod
so we'll use hashing for the initial
setup.
Initial considerations:
- Determinism. A user should always be shown the same bucket they initially saw
- Even bucket distribution
- Bucketing should never use or require more than one query
- Tests can contain more than two buckets and each bucket can be weighted
- Tests can be updated and buckets may disappear, handle gracefully
Point one is fairly self explanatory. It'd be a pretty poor customer experience if every time you visited your favorite search engine (probably alta vista) the search bar was moved around.
To discuss mod
a little further, the basic idea is that
you take the remainder of the id divided by how many buckets the test
contains, with consideration their weights, and then choose the bucket
corresponding with the remainder. You can do funny stuff to make this
distribute more evenly distributed without hashing but this isn't a math
lesson, we're talking about AB testing and we just wanna hash, man.
If the bucketing only matters over the lifetime of the app then no queries should ever be necessary. If you have a distributed system or many things bucketing entities then you may end up storing bucket details in a database of some sort and memoizing locally.
Let's define some types:
// This is a bare AB Test
type Test struct {
[]string
Buckets []int
Weights uint32
totalWeight }
// This is our directory of AB Tests
type ABTests map[string][]Test
First thing to notice, type Test
contains no stateful
information. It doesn't keep track of ids bucketed or what type of
overall AB test it belongs to.
We're off to a blazing start and have satisfied no criterion we've designated for success.
Let's add the bucketing receiver so we can add Test
s to
ABTests
:
func (a ABTest) AddTest(testName string, t Test) bool {
if _, ok := a[testName]; !ok {
[testName] = []Test{}
a}
for idx := range t.Weights {
.totalWeight += uint32(t.Weights[idx])
t}
[testName] = append(a[testName], t)
areturn true
}
One thing to note here is that AddTest
is calculating
the totalWeight
for the test here rather than during
bucketing so we don't have to recalculate each time we want to know
something about the test.
Cool, now we can do something like this in main.go
:
func main() {
:= ABTest{}
tests .AddTest("test1", Test{
tests: []string{"control", "a", "b"},
Buckets: []int{2, 1, 1}, // 50% 25% 25%
Weights})
//...
}
Aaaand, we still have none of the success criteria being met. Let's at least do some bucketing so we can make progress towards our goals.
func (t Test) getBucket(initialWeight int64) (string, bool) {
for idx, ui := range t.Weights {
-= int64(ui)
initialWeight if initialWeight < 0 {
return t.Buckets[idx], idx == 0
}
}
return t.Buckets[0], true
}
func (a ABTest) GetBucket(testName, id string) (string, error) {
var t Test
if ts, ok := a[testName]; !ok || len(ts) == 0 {
return "", errors.New(fmt.Sprintf("cannot find test name %s", testName))
} else {
= ts[len(ts)-1]
t }
:= id + ":" + testName
key , _ := t.getBucket(int64(hash(key) % t.totalWeight))
bucketreturn bucket, nil
}
The keen observer will notice that this could be just one function. It could be but having it as two methods allows us to do other little tricks later on and makes it a little clearer whose responsibility it is for what part of the action.
Our GetBucket
function takes a test name and entity id,
looks for a test by that name, verifies it has more than zero tests to
its name, hashes the id with the test name, retrieves the bucket from
the test and returns this.
You might also notice that we are still using mod
(or
%
in go). Hashing is just one method of randomizing the
order of the buckets distribution. Don't believe it? Agreed, sounds a
little too mathy to let that sleeping dog lie.
Let's test it, in main.go
add:
const ACCEPTABLE_THRESHOLD = .01 // 1%
const POOL_SIZE = 100000
and in main.go:main()
add:
func (t Test) SampleSize(bucket string, pool uint64) (float64, error) {
for idx, bs := range t.Buckets {
if bs == bucket {
return float64(pool) * (float64(t.Weights[idx]) / float64(t.totalWeight)), nil
}
}
return 0, errors.New(fmt.Sprintf("bucket %s not found", bucket))
}
func main() {
// ...
// test bucketing distribution
:= map[string]uint32{}
buckets for i := 0; i < POOL_SIZE; i++ {
, err := tests.GetBucket("test1", fmt.Sprintf("%d", i))
bif err != nil {
.Printf("err=%v\n", err)
fmt}
[b]++
buckets}
for _, bkt := range tests["test1"][0].Buckets {
, err := tests["test1"][0].SampleSize(bkt, POOL_SIZE)
ssif err != nil {
.Printf("err=%v\n")
fmtreturn
}
:= math.Floor(ss * (1 - ACCEPTABLE_THRESHOLD))
min := math.Ceil(ss * (1 + ACCEPTABLE_THRESHOLD))
max if a, ok := buckets[bkt]; !ok || float64(a) > max || float64(a) < min {
.Printf("bucket %s is not within the right range, expected=%0.3f<=x<=%0.3f,got=%0.3f\n", bkt, min, max, float64(a))
fmt} else {
.Printf("bucket %s is within the right range, expected=%0.3f<=x<=%0.3f,got=%0.3f\n", bkt, min, max, float64(a))
fmt}
}
Now, upon running this hog, we should get some cool messages from golang letting us know our distribution is within the city limits.
bucket control is within the right range, expected=49500<=x<=50500,got=50000
bucket a is within the right range, expected=24750<=x<=25250,got=25001
bucket b is within the right range, expected=24750<=x<=25250,got=24999
Woohoo, point two is done and dusted. Great distribution. If you play around and make the pool size only 10, we should still be OK.
bucket control is within the right range, expected=4<=x<=6,got=5
bucket a is within the right range, expected=2<=x<=3,got=3
bucket b is within the right range, expected=2<=x<=3,got=2
What if it's really biased bucketing? Oh I don't know, 3 tests?
bucket control is within the right range, expected=1<=x<=2,got=2
bucket a is within the right range, expected=0<=x<=1,got=1
bucket b is within the right range, expected=0<=x<=1,got=0
Bingo! Okay, for the rest of this article we're going to use the OG
pool size of 100000
and we'll consider our even
distribution problem solved.
Now to test point one, remember point one? Determinism. Let's get after it:
, _ := tests.GetBucket("test1", "hello, world!")
testBucket:= 0
misses for i := 0; i < POOL_SIZE; i++ {
if tb, _ := tests.GetBucket("test1", "hello, world!"); tb != testBucket {
++
misses}
}
if misses > 0 {
.Printf("Determinism misses: %d\n", misses)
fmt}
No messages about misses, great. That half solves point one. If you
later add another Test
to that test name with a different
distribution then you're really asking for trouble. Users that saw you
were giving away $1m to anyone who saw the message and were on the fence
about it may no longer see that message and then you'll have some very
unhappy users (but only some of them).
Let's add on some memoization, more on this after the next heading. For this article we're going to pretend that the determinism is only required for the duration of this process. You might consider a database for a more highly distributed system and this is the only acceptable query our process should perform to maintain integrity. Here we're only going to use a local variable:
In your ABTest package add/alter:
var memoization = map[string]map[string]string{} //<=====
func (a ABTest) GetBucket(testName, id string) (string, error) {
var t Test
if ts, ok := a[testName]; !ok || len(ts) == 0 {
return "", errors.New(fmt.Sprintf("cannot find test name %s", testName))
} else {
= ts[len(ts)-1]
t if memoize[testName] == nil { //<=====
[testName] = map[string]string{} //<=====
memoize} //<=====
}
:= id + ":" + testName
key if bkt, ok := memoize[testName][key]; ok { //<=====
return bkt, nil //<=====
} //<=====
, _ := t.getBucket(int64(hash(key) % t.totalWeight))
bucket[testName][key] = bucket
memoizereturn bucket, nil
}
Testing our assumptions again we'd expect nothing to have changed and nothing does. All clear.
Updating Tests
If you want to be able to update tests then storing bucketed user's
bucket is probably fine enough. If you're pure of heart and declare
tests with different buckets or distributions to be similar but not the
same then the ABTests
object could be simplified above. For
this article we make no such proclamations and no guarantees. We want to
show a user the same bucket they saw before so long as that bucket still
exists.
Next step is to add another Test
to our
"test1"
bucket and make sure we get original distributions
with already bucketed ids and new distributions with new ids:
func main() {
//...
// test regressions
.AddTest("test1", Test{
tests: []string{"control", "a", "b"},
Buckets: []int{1, 1, 98}, // 1% 1% 98%
Weights})
= map[string]uint32{}
buckets for i := 0; i < POOL_SIZE; i++ {
, err := tests.GetBucket("test1", fmt.Sprintf("%d", i))
bif err != nil {
.Printf("err=%v\n", err)
fmt}
[b]++
buckets}
// all of these ids were wentered into a 50/25/25 test
for _, bkt := range tests["test1"][0].Buckets {
, err := tests["test1"][0].SampleSize(bkt, POOL_SIZE)
ssif err != nil {
.Printf("err=%v\n")
fmtreturn
}
:= math.Floor(ss * (1 - ACCEPTABLE_THRESHOLD))
min := math.Ceil(ss * (1 + ACCEPTABLE_THRESHOLD))
max if a, ok := buckets[bkt]; !ok || float64(a) > max || float64(a) < min {
.Printf("bucket %s is not within the right range, expected=%d<=x<=%d,got=%d\n", bkt, uint(min), uint(max), a)
fmt} else {
.Printf("bucket %s is within the right range, expected=%d<=x<=%d,got=%d\n", bkt, uint(min), uint(max), a)
fmt}
}
/*
Three more positive affirmations for our day:
bucket control is within the right range, expected=49500<=x<=50500,got=50000
bucket a is within the right range, expected=24750<=x<=25250,got=25001
bucket b is within the right range, expected=24750<=x<=25250,got=24999
Now let's make sure that new users are bucketed in our zany
control:1 / a:1 / b:98
bucketing scheme.
func main() {
//...
= map[string]uint32{}
buckets for i := POOL_SIZE * 2; i < POOL_SIZE*3; i++ {
, err := tests.GetBucket("test1", fmt.Sprintf("%d", i))
bif err != nil {
.Printf("err=%v\n", err)
fmt}
[b]++
buckets}
// none of these ids were wentered into a 50/25/25 test
// and should conform to the updated test buckets 1/1/98
for _, bkt := range tests["test1"][1].Buckets {
, err := tests["test1"][1].SampleSize(bkt, POOL_SIZE)
ssif err != nil {
.Printf("err=%v\n")
fmtcontinue
}
:= math.Floor(ss * (1 - ACCEPTABLE_THRESHOLD))
min := math.Ceil(ss * (1 + ACCEPTABLE_THRESHOLD))
max if a, ok := buckets[bkt]; !ok || float64(a) > max || float64(a) < min {
.Printf("bucket %s is not within the right range, expected=%d<=x<=%d,got=%d\n", bkt, uint(min), uint(max), a)
fmt} else {
.Printf("bucket %s is within the right range, expected=%d<=x<=%d,got=%d\n", bkt, uint(min), uint(max), a)
fmt}
}
}
UH OH:
bucket control is within the right range, expected=990<=x<=1010,got=1009
bucket a is not within the right range, expected=990<=x<=1010,got=1042
bucket b is within the right range, expected=97020<=x<=98980,got=97949
What's our major malfunction? Looks like we didn't test our
distribution enough. Let's try some stuff. We tried fnv
,
let's give adler32
a shot. Modifying our
func hash(string)
:
func hash(s string) uint32 {
:= adler32.New()
h .Write([]byte(s))
hreturn h.Sum32()
}
Aaaand, short test again:
bucket control is within the right range, expected=4950<=x<=5050,got=5034
bucket a is within the right range, expected=4950<=x<=5050,got=4975
bucket b is within the right range, expected=89100<=x<=90900,got=89991
WHEW! Back in business.
Okay, let's recap what our goals were and what we've done:
- ✓ Determinism. A user should always be shown the same bucket they initially saw
- ✓ Even bucket distribution
- ✓ Bucketing should never use or require more than one query
- ✓ Tests can contain more than two buckets and each bucket can be weighted
- 𐄂 Tests can be updated and buckets may disappear, handle gracefully
How to Handle Failing Buckets
Your users don't want green text on the yellow background,
apparently. Let's see what happens when we update that test to remove
bucket a
.
func main() {
//...
.AddTest("test1", Test{
tests: []string{"control", "b"}, //<== Notice that bucket 'a' is missing
Buckets: []int{5, 95}, // 5% 90%
Weights})
= map[string]uint32{}
buckets for i := 0; i < POOL_SIZE; i++ {
, err := tests.GetBucket("test1", fmt.Sprintf("%d", i))
bif err != nil {
.Printf("err=%v\n", err)
fmt}
[b]++
buckets}
}
Another undesirable outcome. We're really disappointing our parents
today. buckets
contains the original distribution of
control:50% / a:25% / b:25%
and we've ripped feature
a
out of the product entirely. Here we have some decisions
to make, we could make feature a
users now see
control
, they could be rebucketed into the largest bucket,
or they could just be reseeded using the latest Test
. Grab
your toolbox and let's reseed the users:
func (a ABTest) GetBucket(testName, id string) (string, error) {
var t Test
if ts, ok := a[testName]; !ok || len(ts) == 0 {
return "", errors.New(fmt.Sprintf("cannot find test name %s", testName))
} else {
= ts[len(ts)-1]
t if memoize[testName] == nil {
[testName] = map[string]string{}
memoize}
}
:= id + ":" + testName
key if bkt, ok := memoize[testName][key]; ok {
for _, tx := range t.Buckets { //<== make sure the bucket exists
if tx == bkt {
return bkt, nil
}
}
, _ := t.getBucket(int64(hash(key) % t.totalWeight))
bucket[testName][key] = bucket
memoizereturn bucket, nil
}
Now when we rerun our tests we end up with a nicer distribution of
closer to 50/50 (it's somewhere closer to 51.25/48.75) which is what
we'd expect given that we re-bucketed 25% of the users to b
95% of the time and the control
50% can only grow given our
constraint that users should have the same experience each time. The
math for bucket b
is
0.25 (original b) + (.25 (original a) * .95 (b's current distribution))
This gives us goal five, great work everyone!
Bonus: Testing
Note that about half way through this exercise we had to modify our hashing function. FNV wasn't performing poorly but it wasn't meeting our distribution constraint. One of the more fun aspects of being a programmer is that you can theorize about things then you can benchmark them, and then you get to argue about your findings with people who want nothing more than to gain your power or prove you wrong. Let's write a better test for something real world.
In the real world, our users aren't sequential numbers and sometimes they're not numbers at all. Users don't also sequentially visit features on your site. Let's find the best distributed algorithm for UUIDs among different sample sizes:
func adlerHash(s string) uint64 {
:= adler32.New()
h .Write([]byte(s))
hreturn uint64(h.Sum32())
}
func crc32HashIEEE(s string) uint64 {
:= crc32.New(crc32.MakeTable(crc32.IEEE))
h .Write([]byte(s))
hreturn uint64(h.Sum32())
}
func crc32HashCastagnoli(s string) uint64 {
:= crc32.New(crc32.MakeTable(crc32.Castagnoli))
h .Write([]byte(s))
hreturn uint64(h.Sum32())
}
func crc32HashKoopman(s string) uint64 {
:= crc32.New(crc32.MakeTable(crc32.Koopman))
h .Write([]byte(s))
hreturn uint64(h.Sum32())
}
func crc64HashECMA(s string) uint64 {
:= crc64.New(crc64.MakeTable(crc64.ECMA))
h .Write([]byte(s))
hreturn h.Sum64()
}
func crc64HashISO(s string) uint64 {
:= crc64.New(crc64.MakeTable(crc64.ISO))
h .Write([]byte(s))
hreturn h.Sum64()
}
func fnvHash32(s string) uint64 {
:= fnv.New32()
h .Write([]byte(s))
hreturn uint64(h.Sum32())
}
func fnvHash64(s string) uint64 {
:= fnv.New64()
h .Write([]byte(s))
hreturn h.Sum64()
}
func main() {
:= []int{30, 100, 100000}
sampleSizes := []Test{
buckets {
: []string{"a", "b"},
Buckets: []int{1, 1},
Weights: 2,
totalWeight},
{
: []string{"a", "b", "c", "d"},
Buckets: []int{1, 1, 1, 1},
Weights: 4,
totalWeight},
{
: []string{"a", "b", "c", "d", "e", "f", "g", "h"},
Buckets: []int{1, 1, 1, 1, 1, 1, 1, 1},
Weights: 8,
totalWeight},
{
: []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p"},
Buckets: []int{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1},
Weights: 16,
totalWeight},
}
:= []string{}
ids := map[string](func(string) uint64){
hashes "adler": adlerHash,
"crc32 IEEE": crc32HashIEEE,
"crc32 Castagnoli": crc32HashCastagnoli,
"crc32 Koopman": crc32HashKoopman,
"crc64 ECMA": crc64HashECMA,
"crc64 ISO": crc64HashISO,
"fnv32": fnvHash32,
"fnv64": fnvHash64,
}
:= []string{"adler", "crc32 IEEE", "crc32 Castagnoli", "crc32 Koopman", "crc64 ECMA", "crc64 ISO", "fnv32", "fnv64"}
order
for i := 0; i < 100000; i++ {
= append(ids, uuid.New().String())
ids }
/* Run our tests, this could take a while depending on your parameters */
for _, sample := range sampleSizes {
.Printf("sample size %d:\n", sample)
fmt
for _, t := range buckets {
.Printf(" #%d buckets (%0.2f%% dist)\n", t.totalWeight, 100/float64(t.totalWeight))
fmt
:= map[string]map[string]uint64{}
dist for _, n := range order {
:= hashes[n]
fn [n] = map[string]uint64{}
distfor idx := 0; idx < sample; idx++ { // only use the first X of sample
, _ := t.getBucket(int64(fn(ids[idx]) % uint64(t.totalWeight)))
b[n][b]++
dist}
}
for _, h := range order {
.Printf(" %-16s: ", h)
fmtfor _, tx := range dist[h] {
.Printf("%02.2f%% ", float64(tx)*100.0/float64(sample))
fmt}
.Printf("\n")
fmt}
}
}
}
This will give you a nice little table showing different distribution characteristics given different sample and bucket sizes for each hashing method listed. A fun thing to do before you head out to return your library books is to implement your own bucketing method, try to get a better distribution with low sample sizes!