A Year of SEO Testing with SearchPilot

If you asked a UX professional whether adding upsell items to a checkout process results in more sales or more abandoned carts, they’d tell you to test it — second guessing users is seen as foolish.

Yet for many companies, SEO has no testing at all, just endless reams of best practice and hand waving. Our senior consultant, Dominic Woodman, had the opportunity to change role for a year and got the chance to treat SEO differently, running over 50 SEO A/B tests across a range of different websites. At SearchLove London and San Diego, he shared his story with our audience and the lessons he learnt along the way.

Video Transcription

We’re going to talk about rabbit holes. I love a good rabbit hole. This presentation is about the mother of all rabbit holes.

The Black Box of an SEO

I’d been working for Distilled for about two years, long enough to find my feet in the industry and long enough to get frustrated by all the problems that we all get frustrated by.

“the biggest problem that SEOs had … was getting their bosses and the people around them to understand the value of SEO”

Moz did a survey of their readership in 2017, and the biggest problem that SEOs had, not the only one, but the biggest, was getting their bosses and the people around them to understand the value of SEO. I had that, I’m sure a lot of you have had that. The reason I had it was this. This big, black box.

Then I would go and make all these changes and I would tweak things and I would move things and I would shuffle things, but it was so hard to tie changes that I made to specific things that would happen on my clients’ sites.

When I joined Distilled, I’d imagined there was this big, amazing secret that just no one had told me, and I’d come into the interview with Will and he’d shake my hand and be like, “And now, Dom, here is the secret.” The wall would roll back and he’d take me into his lair and there would be all the secrets, and turn me into this incredible savant character who knew exactly what happened.

“If you’re a public company, having things not be predictable and difficult to measure and justify is unacceptable”

Of course, like the book, there really isn’t a secret, it’s just a lot of hard work and I ended up stuck back with this black box again. That is the thing, everyone has this problem and a lot of big companies have this problem. If you’re a public company, having things not be predictable and difficult to measure and justify is unacceptable. Nothing looks worse on your annual report, you don’t do really well, you don’t do really badly, you just want to be consistent, and boy, do we not do that.

Introduction to Testing SEO Changes

Pinterest first talked about it in 2015, they put a post up on Demystifying SEO with Experiments. It’s the first time I saw it talked about publicly. They talked about a framework that they built for testing individual SEO changes on websites. We’ve been putting together our own one at Distilled, Distilled ODN [now called SearchPilot, Ed.]. Tom Anthony, who’s been running this project, he comes to me and he’s like, “Dom, how do you fancy putting consulting on pause for a year, and just running all of these tests? Taking care of our customers who are going to come and use this platform.” That, ladies and gentlemen, is the giant, screaming rabbit hole that I ran voluntarily into. I said “yes”, obviously, I love experiments and this would be a fairly atrocious presentation if I didn’t.

CRO Testing is Not for SEO

CRO Walkthrough

How does this stuff work? A really simple example. We take a website where we have pages about animals. We have a cats page, a dogs page, a badgers page and a unicorns page. Let’s say we have a very image and text heavy website, and we want to test a new video template. We’ve gone and spent all this money on videos. If we use a CRO test, the way it would work is that user one would come to our website and, on all four of those pages, they would see the old design, which is the image version. Then user two comes to the website and they would see the new design on every one of those four pages.

“…you can’t split Googlebot. You can’t show Googlebot two different templates on the same page.”

You can’t do that. That’s not how SEO testing works. The most instinctive way to think about it is you can’t split Googlebot. You can’t show Googlebot two different templates on the same page.

SEO Testing but with Google Bot in Mind

Instead, we do this: users one, two and Googlebot all see the new template on two of those pages, the cats and the dogs, and users one, two and Googlebot all see our old template on the other two pages, the unicorns and the badgers. Then we sit there and we measure the change between those two groups and that gives us a result.
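[Ed.: The page split described above can be sketched in a few lines of Python. This is a minimal illustration, not how the platform itself assigns pages: the hashing scheme, the salt string and the example URLs are all assumptions made for the sake of the example.]

```python
import hashlib


def assign_bucket(url: str, salt: str = "video-template-test") -> str:
    """Place a page (not a user) in "control" or "variant".

    The bucket depends only on the URL and the salt, so every visitor,
    Googlebot included, gets the same template on the same page.
    The salt and the 50/50 hash split are illustrative assumptions.
    """
    digest = hashlib.sha256((salt + url).encode("utf-8")).hexdigest()
    return "variant" if int(digest, 16) % 2 == 0 else "control"


# Hypothetical URLs for the four animal pages from the example.
pages = ["/animals/cats", "/animals/dogs", "/animals/badgers", "/animals/unicorns"]
buckets = {url: assign_bucket(url) for url in pages}

for url, bucket in buckets.items():
    template = "new video template" if bucket == "variant" else "old image template"
    print(f"{url}: {bucket} -> {template}")
```

With only four pages a hash will not necessarily give a neat two and two split; on a real section with hundreds or thousands of pages the two groups even out.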

SEO Testing Walkthrough

So, in a nutshell, that’s how it works. We split pages, rather than users and it’s template-based. Tiny sidebar before we go any further, for legal reasons, the two examples that I’m going to use for most of this presentation, are Argos and Cars.com. Argos is an eCommerce website. Cars.com is a listings website. We don’t work with either of them, which is why they’re great for this. Just know that as you’re going through this.

The Methodology – Taught The Hard Way

Methodology Walkthrough

This is the first test that we ever do. The first test I ever run, and it’s really simple. We’ve got a nice eCommerce site and we say, okay, in the H1 they have what the page is about. In this case, it’s about pushchairs. We go, “Okay, wouldn’t it be better if we added a common synonym for that to the title?” So, in this case, it’s pushchairs and we add buggies into that H1.

We rolled it out across half the pages, and lo and behold, these are the first results that I ever get to see.

There are two really cool things about this. Firstly, it’s up and to the right, and as we all know that’s great, it means I’m good at my job. Second thing is what this isn’t. It’s not rank, it’s not search visibility. This is actual traffic. It’s actual sessions. Specifically what this graph is, is total additional sessions. It’s worth spending a bit of time with this because we’re about to see a lot of this graph.

On the y-axis is the total cumulative organic sessions per day, and on the x-axis is time. These fans show the margin for error, a 95% confidence interval; it’s a statistical technique that we’re using to measure this. The crucial bit is when the bottom or the top of the shaded area crosses the zero axis; that’s where it matters.

That’s when we say, yeah, we’re confident that this thing has actually happened. And in this case, it happens here. This is a positive test, so the bottom of the fan crosses it.
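[Ed.: As a rough sketch of how a graph like this can be read, the Python below accumulates the daily difference between actual and forecast sessions and widens a 95% band around it. The traffic numbers are made up, and treating each day’s forecast error as independent with a fixed standard deviation is a simplifying assumption, not the statistical technique used on the platform.]

```python
import math


def cumulative_uplift(actual, forecast, daily_sd, z=1.96):
    """Return (cumulative difference, lower bound, upper bound) per day.

    Assumes independent daily forecast errors with standard deviation
    `daily_sd`, so the band grows with the square root of the day count.
    """
    rows = []
    total = 0.0
    variance = 0.0
    for a, f in zip(actual, forecast):
        total += a - f
        variance += daily_sd ** 2
        half_width = z * math.sqrt(variance)
        rows.append((total, total - half_width, total + half_width))
    return rows


def first_significant_day(rows):
    """First day (1-based) on which the whole band sits on one side of zero."""
    for day, (_, low, high) in enumerate(rows, start=1):
        if low > 0 or high < 0:
            return day
    return None


# Hypothetical daily organic sessions for the variant pages.
actual = [1050, 1040, 1065, 1070, 1055, 1080, 1075]
forecast = [1000] * 7

rows = cumulative_uplift(actual, forecast, daily_sd=30.0)
print("Band clears zero on day:", first_significant_day(rows))
```

A positive test is one where the lower bound rises above zero; a negative test is one where the upper bound falls below it. If neither happens, the function returns None and the test is null.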

Walkthrough of Daily Sessions Graph

So that first test, was it this perfect love at first sight? I get there and immediately everything is wonderful and we have this fantastic relationship? Unfortunately, it was not, because there’s another graph that I have to show you people.

“Control” is all the pages where we didn’t make a change to and “Variant” is all the pages we did make the change.

Again the language, because I’m about to use a lot of this. “Control” is all the pages where we didn’t make a change to and “Variant” is all the pages we did make the change.

The black line here shows daily organic traffic to the variant pages, which are all the pages where we made the change. The red line is the start date. We take all the data to the left of the start date, and we use that to build a model. If you run a business site, you get fewer people on the weekends than you do on weekdays. We use those together to forecast how the variant is going to do.

We take all our data to the left, and we’re going to make that model, which is the blue line. Then we’re going to overlay the two on top of each other. To the right of the start date, we can compare the variant afterwards, and see how it performs. The model is a counterfactual, built when we hadn’t made any changes. Our model is what would happen if we hadn’t done anything at all. When I compare the two, in this case the black line is higher, so it’s a positive test.
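[Ed.: To make the counterfactual idea concrete, here is a minimal Python sketch. It fits a straight line that predicts variant traffic from control traffic using only the days before the start date, then uses that line to forecast what the variant would have done afterwards. Using the control pages as the predictor and a simple least-squares line are both assumptions for illustration; the talk does not spell out the model, and all the numbers are invented.]

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daily organic sessions before the start date.
# The dips are weekend days, which hit both groups alike.
pre_control = [800, 820, 790, 600, 610, 805, 815]
pre_variant = [400, 410, 395, 300, 305, 402, 408]
slope, intercept = fit_line(pre_control, pre_variant)

# After the start date: the model says what the variant "should" have done
# (the blue line); the observed variant traffic is the black line.
post_control = [810, 600, 820]
post_variant = [430, 320, 436]
forecast = [slope * c + intercept for c in post_control]
uplift = [observed - expected for observed, expected in zip(post_variant, forecast)]

for day, (expected, observed, diff) in enumerate(zip(forecast, post_variant, uplift), start=1):
    print(f"day {day}: forecast {expected:.0f}, actual {observed}, difference {diff:+.0f}")
```

Because the forecast leans on traffic that was not changed, seasonal swings such as weekends show up in both lines and cancel out, leaving the difference that the change itself made.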

Here’s the thing, we have perfect information to the left of the graph. We know exactly what that model should look like. The blue line should match the black line exactly. And in this case, it clearly does. This is clearly a trustworthy model.

That first test that I run, it looks like this. Those don’t match. This is not a decent model. So instead, for this first model, we learn the first big, broad lesson of this which is, unfortunately, not everyone gets to do the SEO split-testing, certainly not in this fashion anyway. Our rule of thumb we found was that you needed about 1000 organic sessions a day, in total across all of the pages in your section, to have that model be relatively decent. It’s a rule of thumb. There are exceptions, but by and large, this is a pretty decent one.
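[Ed.: The two checks described here, whether the model matches the real traffic before the start date and whether the section gets enough traffic at all, can be sketched as below. The 1000 sessions a day figure is the rule of thumb from the talk; the 10% error tolerance, the function names and the sample numbers are assumptions chosen only to illustrate the idea.]

```python
def mean_abs_pct_error(actual, modelled):
    """Average absolute gap between model and reality, as a share of actual traffic."""
    # max(a, 1) avoids dividing by zero on a day with no sessions.
    return sum(abs(a - m) / max(a, 1) for a, m in zip(actual, modelled)) / len(actual)


def looks_testable(pre_actual, pre_modelled, min_daily_sessions=1000, max_error=0.10):
    """Rough pre-test sanity check on the period before the start date.

    min_daily_sessions is the rule of thumb from the talk;
    max_error is an illustrative threshold, not a published one.
    """
    average_sessions = sum(pre_actual) / len(pre_actual)
    error = mean_abs_pct_error(pre_actual, pre_modelled)
    return average_sessions >= min_daily_sessions and error <= max_error


# Hypothetical section with plenty of traffic and a model that tracks it closely.
big_actual = [1200, 1180, 1250, 900, 950]
big_model = [1190, 1200, 1230, 920, 940]

# Hypothetical section with around 60 sessions a day and a noisy model.
small_actual = [60, 55, 70, 40, 65]
small_model = [50, 70, 55, 60, 45]

print("big section testable:", looks_testable(big_actual, big_model))
print("small section testable:", looks_testable(small_actual, small_model))
```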

“…not everyone gets to do the SEO split testing…”

That model that you just saw, they had 60 sessions a day, so they were never close to that. That’s why it was such a terrible model. At this point we’re going to go, “great, that’s fine.” We learn that lesson and now we get into it. This is the actual start of my year, I’m not going to count that, we’re not going to count that as test one. We’re going to say, “this is test one”. I want you all to put yourselves in my shoes.

Test 1: Adding USP to Title Tags for eCommerce Site ~ Null

Put yourself where I was. You’ve been given this incredible tool. You even have these customers ready, and you’re ready now to go out and do this. I want you to run along with me. Put yourself in my place. I really want to do well in my role on my first test.

We have an eCommerce site and we have product pages. The titles for the product pages are so-so, it’s what you’d expect, it’s the name of the product followed by the brand.

We go okay, what if we put their USP? What if we put a common USP like Free Shipping that they offer and their competitors don’t, into the title tag? More people would click on it, we would get more traffic.

I see the results are kind of null. Okay, it’s not great. My first one’s a swing and miss, but okay, we’ll move on. We don’t want to slow down here. We want this initial win so we feel good and we get everyone feeling happy about the platform and the potential of this, so we roll to the next one.

Test 2: Making Titles to Exact Match Search ~ Negative

This eCommerce site also has a bunch of guides on it, and they’re buying guides for products. In this case, we’re going “which TV should I buy?” That’s the phrase that they target.

This phrase could be written one other way, and it can be written like this. It can be, “Which TV to buy?” By and large, this phrase, wherever it appears, has a lot more search volume than the other one. I’ve got two bits to this point. If Google cares at all about exact match, this is clearly a good idea, we should change our titles to match this. And even if they don’t, use the language of your customer: we’re constantly beating that drum, talk how your customers talk, how they speak, how they write. This one seemed like a great idea.

This was a terrible idea. I lost about 11% traffic on my second test. That’s not great. We’re still not feeling very good. The second one is a swing and miss, but we move briefly on to three.

Test 3: Adding Alt Tags to Images ~ Null

Clearly titles have not worked out well for me, so at this point, we move on to alt tags.

I’m sure if you’ve read plenty of SEO audits, you will have seen this recommendation, “Put alt tags on your images.” That makes sense, why wouldn’t you put alt tags on your images? It’s helpful for people with disabilities. It gives Google a better idea of what the page is about.

“…why wouldn’t you put alt tags on your images?”

This one was so perfectly null, I thought I had forgotten to launch it. It really did absolutely nothing at all, except help people with disabilities. At that point, that’s three strikes and I’m out.

Test 4 & 5: Remove Synonyms in Title Tags ~ Negative

For tests four and five, someone else gets brought in. Someone with 10 years of experience in the industry. A PhD in AI, someone who has won a questionable, but still technically correct, “best SEO in the world” award. Most importantly, Tom will be unbearably smug if he wins this when I have not been able to win it. So Tom steps in, and we have a listing site at this point, and the listing site has title tags. They look like this.

We’ve got this Ford car and Ford vehicles and we look at both of those and we do some work and we say, “we think Google understands that in this particular case, these are synonyms.” We don’t need both of these words. So we could cut one of them and we can just have a punchy title. If we remove one of those, more people would click it and we would gain clicks and we don’t lose in relevance. Fantastic, so I’m waiting with a little bit of bated breath. At this point I’m like, oh boy, this isn’t going to be bad. This is a great idea. This is a great test. And he rolls it out and … a stunning 27% drop.

To this day, this is the largest ever fail we’ve had on the platform. But I am not a man to be easily dissuaded. We have a couple of these listing sites at this point, and they’re for different geographical regions around the country. So, we roll it out, not just in this one, but in a couple of different regions. We rolled it out on the second one, and oh, it’s still not good. Another 20% drop, which is great from a maths perspective. Identical sites, different locations, almost identical results. It does, however, mean that he can’t write titles and I’m back in the game.

Important Takeaways from First Tests

“…need to be able to pull these apart and you need to be able to move fast, and a framework is going to help you do both of these”

What do we take from the first couple of tests that I run? First, get a testing framework. This was clearly not as obvious as I thought it would be when I started. I thought I’d be rolling in these wins and instead, we were a bit down and frankly it’s harder than it looks. You need to be able to pull these apart and you need to be able to move fast, and a framework is going to help you do both of those.

“When you know the result, you don’t necessarily need the why, which is often useful because, frankly, a lot of clients don’t want to pay for the why.”

Secondly, you don’t necessarily need to know the why, if you’re testing things. I don’t know exactly why Tom writes such terrible title tags, but we know they’re terrible and we can just roll them back straight away. When you know the result, you don’t necessarily need the why, which is often useful because frankly a lot of clients don’t want to pay for the why. This is more of a consultant point of view, but often people don’t want to do that digging and they don’t want to understand once they’ve seen the result.

Test 6: Adding Relevant Terms to Titles ~ Negative

That brings us onto the titles and meta descriptions. I want to stick with this because I’m like, this is the first thing that you come to. When you enter SEO, this is the first thing you look at.

For test six, we’ve got another listing site, and they have these big pages, and the primary intent of these pages is reviews. But they have a secondary intent, which is safety information. When we looked at these pages, they just say “reviews” in the title tags. That seems like, “okay, this is a pretty easy win at this point.” We can just add safety information into the title tag.

It’s still clearly about reviews. We’ve just added the second one in. So, the idea being that we should gain relevancy for those terms it wasn’t ranking for, gaining traffic without losing it for reviews. We rolled it out, again not good. Another 10% drop.

This one I actually got a chance to do a little bit of digging in, to try and figure out what happened here. I think roughly what happened was that, of these two intents, the review intent was far larger than the safety information intent. While we gained traffic in safety information, we lost it in reviews. We lost a tiny bit of click-through because our title wasn’t quite as punchy. What we gained in relevance for the far smaller term just didn’t matter. We ended up having to walk it back.

Test 7: Copying Titles & Making Them Motivational ~ Negative

Okay, fine. Seven, we’re not going to back down. We’re going to find a title meta description test that works and is successful and everyone’s going to love me. For this one, we have this wonderful thing that I’m sure many of you have seen in a vertical, where you see a competitor when you’re googling your main product, in this case television. You google televisions and you look at the SERP, you look at your competitor and you’re like, “They have a good title tag. That’s a great title tag. I wish I’d written that title tag.” I could just take it. I could just take that title tag. No one will stop me. I’m the SEO guy. No one else is looking at this. I can just take their title tag. They can do jack all about it.

That’s what people do, and over time what you’ll see in the SERPs for certain terms is the trend of everyone having exactly the same title tags. At some point someone must’ve had the first idea to put the date in the title tag, and now everyone does it.

At this point, the SERPs look entirely like product name, brand. We go, “that’s everyone, everyone in the top 10”; all we have to do is stand out. Why don’t we write a motivational title tag that’s like, “oh, that grabs me”? I would just click on that. Wouldn’t I click that one over the other ones? I feel really good about this one so I roll it out and watch as their traffic drops by 15%. Clearly I just can’t write titles. We’re going to fast forward at this point, and we’re going to go to, what did I learn, having done a bunch more of these?

“…title tags do have a notable effect.”

Firstly, title tags do have a notable effect. By and large, they changed traffic by between 5 and 15% when they did make a change. Secondly, writing title tags is hard. 56% of the title tags failed. Outright failed and were bad for our clients. Only 6% were actually successful. Now, there’s a little bit of selection bias here because the websites who come to us and the websites we were working with are typically people who care about SEO. So we’re not going from nothing here. We’re going for people who have already thought about the titles, but it’s still clearly very hard when you’re in that position. Which unfortunately means we apologize to Tom. It’s just really hard to write title tags. He’s not particularly bad at it.

You also reach this fun thing, which is that you can’t stop testing titles, because when you find a good one, everyone just copies it, the entire set unifies, and then you’re back to where you were in the beginning.

“We found that roughly we would see the changes reflected in the site’s traffic in about two to four days.”

We saw a lot of different time periods for different tests, sometimes up to a month. For these title and meta description changes and that sort of thing, we found that roughly we would see the changes reflected in the site’s traffic in about two to four days. This is also a very nice one to quickly validate because you can just crawl Google’s cache and you’ll see it if it’s there.

“That’s part of the reason for testing, is to save money and heavy investment is exactly where we’re looking to go.”

Which brings us to heavy investment. Obviously you want to test heavy investment changes. That’s part of the reason for testing, is to save money and heavy investment is exactly where we’re looking to go.

Test 8: Making Content Visible With & Without JavaScript ~ Positive

For test eight we look at JavaScript, and Google has been saying a variety of different things on this for quite some time. To begin with, they were like, “We don’t get JavaScript.” Then they were like, “well we absolutely really do get JavaScript, we definitely get JavaScript, we’re going to deprecate all the ways that we looked at JavaScript.” Then of course recently they went back on themselves a bit and were like, “Actually maybe give us a static version. JavaScript is quite hard.”

You’ll notice this is not Argos, this is iCanvas, this is one of our clients who were kind enough to let us use the actual website. They sell pictures for your office. What they have here is when JavaScript is disabled, none of the products on the actual page render. Now they’re still there in the HTML, but the CSS is basically not letting them display. For some fairly long and involved reasons, they can’t change that easily.

“Our theory being, okay at some level we know Google cares and struggles with JavaScript, and we also know that Google has said previously that they devalue content that’s not visible.”

When JavaScript is enabled that CSS is successfully working, and it’s rendered and the whole page looks fine. What if we fixed their CSS on the page with our testing tool, so that whether or not JavaScript is disabled, all these products are visible? Our theory being, okay at some level we know Google cares and struggles with JavaScript, and we also know that Google has previously said that they devalue content that’s not visible. We go, “okay is this possibly negatively hurting them?” And it is.

Why does visible content matter?

We rolled it out and it increased traffic by 6%, and we rolled it out on a similar thing later on and increased traffic by 5%. There was clearly something to this, the visible content on the initial page load matters. Now I’ve missed out a lot of subtlety here about how JavaScript works, and there are multiple other presentations that you’ve probably all seen or read about. This is clearly not the whole story, but at least for a lot of scenarios, it does map.

Test 9: Adding Category Text (Garbage Advice) ~ Positive

What about test nine? For test nine we move on to this, this text that we have all written and has appeared everywhere.

Every category page in the world has this text below it. And when we have a client come to us and they’re like, “well we’d like more SEO category texts, wouldn’t we?” I’m fed up of writing it, I’m fed up of paying people to write it. Let us finally test it and prove that it does nothing so I never have to write it again. We rolled it out on two websites, and on the first one, it doesn’t really do anything. Maybe even trending down a little bit. And then on the second one, it takes about three weeks, but it increases their traffic by 3%. Sometimes, very frustratingly, it does help.

“…add[ing] SEO content to the bottom of pages, has different effects on different websites.”

It brings us to this very big lesson, which is that we’ve tested this on a bunch of websites since. This same change, this controversial, “let’s add SEO content to the bottom of pages,” has different effects on different websites. We found this consistently again and again. The same changes have different effects on different websites. That whole best practice thing that I’ve seen time and time again, talked about and loved on, is garbage but you still see it so often.

Next up. There’s this final summary here. We had a lot of tests, too many to go through where we wanted to remove content from non-article pages. In this example, you have like a category page, some version of this category page and their products, and you remove some text from that because you actually want to add a larger image in. We had a lot of these and in removing this content we found it was often null. As long as the content that was being moved was topically similar, we found often you could get away with less content. That was another interesting takeaway.

Test 10: Instead of Category Text Using Structured Data ~ Positive

Then we move on to structured data and dangerous assumptions. For number 10 there was something very appealing about structured data.

There’s something still appealing about it. I’m a technical person. People, I’m a nerd. I think a lot of us in this industry are, and this is really an appeal to the idea that, if I know something technical that you don’t, I can get a leg up on you. I can outwit you, I can outsmart you. Just by knowing something that you haven’t bothered to learn. It possibly comes from a good place, but it is a thing. And so at this point, I’d love to test this.

We go back to these category pages where we would add that text to help people understand it, because category pages are a lot of images and not much text. What if instead of writing that as your text, what if instead I just marked it up to describe the page? What if instead I added structured data to say, this is 20 products on a category page?
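[Ed.: The talk does not say exactly which markup was used. As one plausible illustration, the Python below builds schema.org ItemList JSON-LD describing the products on a category page. The choice of ItemList, the function name and the example.com URLs are all assumptions.]

```python
import json


def category_item_list(category_name, product_urls):
    """Describe a category page as a schema.org ItemList of its product URLs."""
    return {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "name": category_name,
        "numberOfItems": len(product_urls),
        "itemListElement": [
            {"@type": "ListItem", "position": position, "url": url}
            for position, url in enumerate(product_urls, start=1)
        ],
    }


# Hypothetical product URLs on a hypothetical category page.
products = [
    "https://www.example.com/pushchairs/model-a",
    "https://www.example.com/pushchairs/model-b",
    "https://www.example.com/pushchairs/model-c",
]
markup = category_item_list("Pushchairs", products)

# The JSON would sit inside a <script type="application/ld+json"> tag on the page.
print(json.dumps(markup, indent=2))
```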

So we roll this one out on a really big website, a really big multi-country eCommerce brand. And it does really well. It increases their traffic by 11%, and I’m like, that’s better than I ever hoped for. Before we get too excited here, there are multiple countries.

Test 11: Rolling Test 10 to Different Regions ~ Positive

We get to roll out on another one. We go to their Australian site and we roll it out there, it does exactly the same thing, and increases traffic by 11% on their category pages right there. Big money pages.

“I found the money button. What’s going to happen with this money button? I want to make you a ton of money and we’re all going to go on holiday.”

The lesson here is I’m incredible at my job. Finally, finally, the result I wanted. You could tie 150,000 additional sessions a month, to one change that I made. I feel fucking great at this point. I am on top of the world. We have a bunch more eCommerce clients who we work with at this point, so they come to me and they’re like, “Oh Dom, we’re lost in a sea of uncertainty. What could we possibly do? We don’t know what we could do.” And I’m like, “Guys, I found the money button. What’s going to happen with this money button? I want to make you a ton of money and we’re all going to go on holiday.”

Test 12, 13, & 14: Rolling out Structured Data to More Clients ~ Negative

We get all this, we build it out and we rolled it out. We roll it across these multiple sites.

We roll it out on the first site and it’s null. The second site is null and by the time we get to the third site, “not again.” I go eat a large helping of humble pie. That previous lesson, I finally learned it: there really isn’t a best practice, people. When you put it out there, you seem like an idiot.

“We did find something more interesting, which was that structured data outside of rich snippets could occasionally have these large big traffic swings, but it was incredibly inconsistent.”

We did find something more interesting, which was that structured data outside of rich snippets could occasionally have these large big traffic swings, but it was incredibly inconsistent. My best theory as to why it worked for that one eCommerce site rather than the other three, and this is absolutely a rationalization, so take this with a grain of salt, was that the eCommerce site where it did work had a lot of category pages that were closely connected together. They were very similar topically and Google needed more help in picking that, and that wasn’t the case in the other ones.

Finally, in all honesty on that one, I was lucky. I was lucky that that happened at all. Because if I’d done those tests in the other order, if I’d run that test three times and it had been null, I would hardly have been beating down the door of the fourth site to be like, “This is going to work for you guys. It’s been null for everyone else.” There are almost certainly things in your career at this point that you tried and they didn’t work, or you did them and they did work. It will never be true for any other site that you ever work on again, but you’ve just taken it as gospel. We are pattern machines, we just pick out patterns.

Go back and revisit those things because you’ve probably made a bunch of things and a bunch of assumptions that are wrong.

Test 15: Adding Review Schema ~ Positive

We’re on test 15, I carry on with the structured data, and we’re working with a listing site at this point. I just want consistency and we go to the most consistent of all of them, which is the review markup. Right, those five stars in the SERPs. For the listing site, we go and collect a bunch of reviews for a lot of their category pages.

We spend quite a lot of time doing it. They spent a lot of time doing it. We roll it out and we thought, “surely this works.” Surely adding review snippets to these pages works, and it does, it increases the traffic by 16%. It was a good experience for users. It helped them, it added the right trust signals, it did exactly what you would expect it to do, which is fantastic.
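[Ed.: For readers who have not seen review markup, the sketch below builds schema.org AggregateRating JSON-LD from a list of collected ratings. The item type of Product, the function name and the sample ratings are assumptions; the talk does not say which item type the listing site used, and search engines only show stars for certain types.]

```python
import json


def aggregate_rating_markup(item_name, ratings, item_type="Product"):
    """Summarise collected 1-5 star ratings as schema.org AggregateRating JSON-LD.

    item_type defaults to "Product" purely for illustration.
    """
    if not ratings:
        raise ValueError("at least one rating is needed to mark up an average")
    average = round(sum(ratings) / len(ratings), 1)
    return {
        "@context": "https://schema.org",
        "@type": item_type,
        "name": item_name,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": average,
            "bestRating": 5,
            "ratingCount": len(ratings),
        },
    }


# Hypothetical ratings collected for one page.
review_markup = aggregate_rating_markup("Example listing", [5, 4, 4, 5, 3])
print(json.dumps(review_markup, indent=2))
```

The markup should only summarise reviews that genuinely exist and are shown on the page, which is why collecting them took the time described above.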

Test 16: Review Schema on Business Profile Website ~ Null

Walkthrough of Test 16

We’re working with a couple of other people at this point, and one of them, for test 16 came in with a business site where they have a bunch of profiles.

They go, “Oh, we love all those.” We’ve been talking about the structured data test that we’ve been doing. They go, “Oh that would be great. We have a lot of profiles, they have loads of impressions, but we don’t get many clicks. Could you add some stars to them? Could we get those stars? Can we get those extra clicks? Stars make people click.” For the first time, I think I have a decent handle on why this isn’t going to work before it happens. I’m like, “No, I don’t think that’s going to work.” They’re pretty insistent, they really know they want these lovely five stars out there, so I’m like, “Sure, okay.” We rolled it out.

“…even the most consistent of changes, even that almost guaranteed win of five-star SERP snippet markup can fail if your intent is utter garbage.”

Again, it was so null, I thought I had forgotten to launch it, but this time I caught it. The reason I think this one was more obvious was when you looked at the intent for those pages. Unfortunately, the reason that they got a lot of impressions and did not get any clicks, was because a lot of the business people who had the profiles on this site had the same names as relatively famous porn stars. Obviously those people get searched for quite a lot. But no one clicks on the result for their business profile, even with extra stars, which is this wonderful reminder that even the most consistent of changes, even that almost guaranteed win of five-star SERP snippet markup can fail if your intent is utter garbage. You can’t just put lipstick on a pig.

Let me move into risky business. You also get a bit cavalier at this point. When you’ve done a lot of these tests and you go, “I have a really powerful thing here.”

Test 17: Changing Date on Articles to be Fresh ~ Positive

For test 17, I’d been a little bit obsessed with freshness. We’ve talked about it a lot; this was a post from Cyrus back in 2016, but it’s a really interesting thing that Google cares about. Certainly if you work in the news industry, you’ve seen this. What’s interesting is when you ask yourself, “how much does this matter outside of that?” It’s really difficult to measure freshness. It’s really difficult to change freshness on your website. It’s one of those big investment changes that we talked about earlier. It’s not easy to make your listings refresh more quickly, your developers complain at that.

I really want to test it, so I’m having to look around. What about this little date up here? This date, that’s when Google thinks this was updated, it must be, because they put it in the search. It’s clearly what they believe. Where does that come from? Well that comes from this, it comes from this little structured snippet markup on the page. That’s kind of fun. So does it just believe this? Or could it be nothing at all. Could I just put this on the page and I could just update this every day with a new date, that said I changed something yesterday?

“The takeaway that we get is that freshness matters and we can put a number on this now, you can take this to your developers, put it into your queue and prioritize it correctly.”

We did that, and it increased traffic by 8%. Perhaps not my proudest moment. The takeaway that we get from this, admittedly, is not to go and fake the last modified dates. The takeaway that we get is that freshness matters and we can put a number on this now, you can take this to your developers, put it into your queue and prioritize it correctly. By having a testing framework and a methodology, we can quickly roll out tests. We were able to take a risk that previously no one was going to take. No one’s going to put that up there, but because we could turn it back so quickly, we were willing to test it and try it.
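For reference, the date markup discussed above can be sketched as follows. This is a minimal illustration, assuming the page uses schema.org Article markup in JSON-LD with a dateModified property; the talk does not say which format the tested site actually used. The function name article_json_ld and the sample values are hypothetical. In line with the takeaway, the sketch fills dateModified from the time the content really changed rather than from today’s date.

```python
import json
from datetime import datetime, timezone


def article_json_ld(headline: str, published: datetime, last_real_edit: datetime) -> str:
    """Build schema.org Article JSON-LD (assumed format, not confirmed by the talk).

    dateModified is taken from the last genuine content change,
    not regenerated every day.
    """
    if last_real_edit < published:
        raise ValueError("last_real_edit cannot be earlier than published")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": last_real_edit.isoformat(),
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    markup = article_json_ld(
        "Example listing page",
        published=datetime(2019, 1, 10, tzinfo=timezone.utc),
        last_real_edit=datetime(2019, 3, 1, tzinfo=timezone.utc),
    )
    print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The output would sit in the page inside a script tag of type application/ld+json, as the usage lines show.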

Test 18: Emojis in Title Tags ~ Null

Finally, for the final section, we’re going to look at how testing changed the relationships with the people that I’ve worked with. For test 18, at this point we’re working with a business client. They are a good old fashioned business client and they bring me into their business meeting, in their business building, in their business suits, and say, “Dom well, sit down. We’ve got some business things. We really love the work that you’ve been doing with our business. We love all our business things. We love our business title tag. It’s fast, effective, and reliable, and we had a phenomenal idea.” Oh yeah, what’s your idea? “What if we put emojis in our title tags?” You see my face fall, I was waiting for the brilliant idea.

Okay guys, I don’t think this is going to work. They’re quite insistent about it, they’re business people, and I say, “Okay, well if we’re going to do this, we can do it properly. We’re not going to add just any emoji. We’re going to pick the business graph, because if anything says business, it’s this graph. That says business right there. So we’re going to put that in your title tags and that’s what’s going to drive this.”

But I’m not really paying much attention to the test, in all honesty. I roll it out and turn the other way, and eventually I do go back. It increased traffic by 12%. That’s ridiculous. It’s completely absurd. How on earth could this work? It doesn’t work, people, because it’s bullshit. It was null, it was obviously null. You’re ridiculous, even thinking that it went up.

Here’s the joyful bit about it: I didn’t have to argue. That test took me 30 minutes to set up and run. It meant my relationship with them was better because I didn’t have to be like, “you’re an idiot.” I could just go, “Yeah, let’s test it.” And then when they saw it, they were like, “Oh well, it didn’t work. We learned an important lesson.”

They didn’t learn the important lesson. About a minute later they turn around and go, “What if we put bullet points in the titles?” I say, “I don’t think it’s going to work, guys, it still does nothing.” But we tested it; again, it’s just 30 minutes of my time. There’s another interesting thing about these sorts of relationships, which is that at this point I have a client where the first three tests have failed and two of them have been null.

“…it just puts people in a different mental headspace, which meant that all of a sudden success wasn’t just up and to the right anymore.”

If this was an ordinary SEO client, I would not be looking too hot at this point. They would be going, “Have we hired the right person? This man is questionable.” They would question my skills. But the wonderful thing about testing was the way it bought us all into it. We all thought these ideas would work. It wasn’t like they were suggesting the tests thinking, “these are dumb tests”, and we went into it with the framework of, “this is a test, it might fail”. What all that does is it just puts people in a different mental headspace, which meant that all of a sudden success wasn’t just up and to the right anymore.

“…I was far happier to be judged on: ‘Are these good test ideas, are we running enough of these?’”

Now, it doesn’t mean that they stopped complaining, but they complained about the things that, frankly, I was far happier to be judged on. “Are these good test ideas, are we running enough of these?” Those are all very reasonable things to judge me and my performance on. It meant that it wasn’t just the results of random tests.

“The negative test is a bullet dodge, it’s not a failure.”

The second thing was that these negative tests were really wins rather than failures. They would have rolled those changes out if we hadn’t tested them. They wouldn’t have gone, “let’s not do these.” They would have rolled them out. They would’ve tanked their own traffic, probably without realizing it, because it’s kind of hard to measure this stuff. The negative test is a bullet dodge, it’s not a failure.

Finally, this thing that we always talk about, this bit of forcing ourselves out of silos and communicating across teams: testing is going to force you to do it. You are now going to have to go and talk to your engineering team. It’s going to give you an excuse to go do it. You’re going to have to go talk to your CRO team because you’re both making changes on the website. You’re both going to fuck with the same things. This gives you a really good reason to go and do all those wonderful cross-silo things that we endlessly talk about at conferences, but that are actually quite hard to do.

“I can now tie specific changes that I’ve made, to moves in traffic.”

Let’s put a bow on it. I would love to tell you at this point that this went away, that I no longer have that black box, but it’s not true. There are still plenty of things that we can’t test. There are plenty of things that I still do where I say, “I don’t really quite know what that did.” But for a large percentage of my day-to-day job, this is a colossal improvement. I can now tie specific changes that I’ve made, to moves in traffic. I can go “that was successful, that wasn’t successful” and I can have confidence in what I did at the end of the day.

Of all the various things that I’ve said here, there were lots of little bits, little spotty bits, but these are the big four different ways to think that I’d like you to take away from this presentation.

Don’t be the same idiot that I was and promise everyone you can make them all that money. Different changes, different effects, different sites.

Get that framework. Most of our tests either fail or are null. If you’re from a CRO background, you already know this; this is how testing works. In SEO, we’re learning that fresh because we are only starting to do it.

Testing will probably improve those relationships that you have. It will give you ways to get around arguments, it will give you ways to be more data-driven, which allows you to put the emphasis on something other than your own performance.

Go and revisit those beliefs that you have. Because you have almost certainly somewhere out there had some beliefs that are wrong, and you have just not realized it because this stuff is really hard to measure.

I’ve said nothing in this presentation about how to do any of this. If you want to, I’ve put together a bunch of resources that you can download. They cover a lot of the maths and how you can make changes at scale across different websites. There are various different ways of doing that. If you’re smaller, you may just be able to do this with tag manager. Cloudflare edge workers are another interesting way of doing it. You may just be able to build it into your CMS if you’re small enough.
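Whichever of those routes you take, the core mechanic is the same: change some pages and leave the rest alone. A minimal sketch of that, assuming pages are split into control and variant groups by hashing the URL path (the talk does not describe how the split was actually done, and a real test would also want the two groups to have comparable traffic). The names bucket, apply_title_change and the salt value are hypothetical.

```python
import hashlib


def bucket(url_path: str, salt: str = "title-test-1") -> str:
    """Deterministically assign a page to 'control' or 'variant'.

    Hashing the path (plus a per-test salt) means the same page always
    lands in the same group, on every request and every server.
    """
    digest = hashlib.sha256(f"{salt}:{url_path}".encode("utf-8")).hexdigest()
    return "variant" if int(digest, 16) % 2 == 0 else "control"


def apply_title_change(url_path: str, title: str, suffix: str) -> str:
    """Return the title to serve: changed on variant pages, untouched on control."""
    if bucket(url_path) == "variant":
        return f"{title}{suffix}"
    return title


if __name__ == "__main__":
    # Made-up paths and titles, for illustration only.
    for path in ["/listing/1", "/listing/2", "/listing/3", "/listing/4"]:
        print(path, bucket(path), "->", apply_title_change(path, "Example Listing", " | Updated"))
```

The same function could run inside a CMS template, or be ported to whatever sits in front of the site; the only requirement is that the assignment stays stable for the length of the test.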

Further Questions About SEO A/B Testing

How long before you call a test “complete”?

For titles and meta descriptions, we saw that often a week is enough. For the other ones, one of the big things that we’ll look at is the speed at which the site gets crawled, if you have access to your logs. Often you’ve just got to be careful, particularly with very large sites and different templates. Depending on the size and the importance of the website, we call it at different speeds.

To get an idea of “has Google found all of my changes? Has it rolled it out?”, we will go and look at how often those pages are crawled. If they were crawled, then great, we have an idea and we’re going, “okay, we probably don’t want to wait too much longer now.” By and large, my rule of thumb was that after about four weeks you will find it quite difficult to call a test, because the margins there get a lot larger.
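That crawl check can be sketched roughly like this. It is a minimal illustration, assuming you have already parsed your server logs into (path, timestamp) pairs and filtered them to genuine Googlebot requests; the function name crawled_share and the input format are hypothetical, not something from the talk.

```python
from datetime import datetime


def crawled_share(variant_paths, googlebot_hits, launched_at):
    """Share of changed pages that Googlebot has requested since launch.

    variant_paths:  paths of the pages that received the change
    googlebot_hits: iterable of (path, timestamp) pairs, assumed to be
                    pre-filtered to verified Googlebot requests
    launched_at:    datetime the test went live
    """
    variant = set(variant_paths)
    if not variant:
        raise ValueError("variant_paths is empty")
    seen = {
        path
        for path, ts in googlebot_hits
        if ts >= launched_at and path in variant
    }
    return len(seen) / len(variant)


if __name__ == "__main__":
    # Made-up log entries, for illustration only.
    launch = datetime(2019, 5, 1)
    hits = [
        ("/a", datetime(2019, 5, 2)),
        ("/b", datetime(2019, 4, 28)),  # before launch, so it does not count
        ("/c", datetime(2019, 5, 6)),
    ]
    print(f"{crawled_share(['/a', '/b', '/c', '/d'], hits, launch):.0%} of variant pages crawled")
```

A high share suggests Google has seen most of the change and the clock on the test is meaningfully running; a low share suggests waiting longer before reading anything into the numbers.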

Because this is a statistical confidence and we’re building a model here, after about four weeks the confidence interval was getting pretty large, and it’s harder and harder to confidently say that this has worked. That does, of course, mean some of these things are harder to pick up on. You’re not going to be able to use this to measure things like the step change that you would see when a quality update hits. You’ve made all these changes to your website and all of a sudden you’re now crushing it on your authority on this and your trustworthiness. You’re not going to be able to pick that up here.

By and large, four weeks is usually pretty good, depending on your model quality. I’m now going to struggle to physically even say if this is higher, but I don’t think we’ve seen a change take longer than five weeks to take effect. Usually, by the time we hit five weeks, we’re like, “Okay, we can call this.” The other thing that’s worth mentioning here is that before you roll out the next test you need to wait. Let’s say that you roll out a test, and after two weeks you see an uplift, as we had in a couple of those ones where we had an uplift of 10% traffic. You wait it out, you wait for it to be consistent. Sometimes the trajectory on this will change, like an increase in sessions.

We had a couple of tests, like that big one where the traffic increased by 11% from the structured data. The trajectory was pretty steep. We got a notably sharper increase than it finally finished up at. It was like a 20% increase day on day, and then by the end it was 11% and it settled back down. You’ll often get this almost like freshness bump where Google goes, “Oh, things have changed. I’m going to reward this a bit more,” and then it’s going to take it back off.

When you’ve launched the test, if you’re seeing a positive or negative change, wait for 10 days to two weeks. Make sure it’s consistent. Finally, if you show that it’s consistent, roll it out to the rest of the site. Because it took you two weeks to see that initial positive change, you’re going to have to wait two weeks again before you can run your next test on the section. Otherwise, you’re going to get the backwash from what’s already been going on.
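Put as dates, the waiting periods described above look something like the sketch below. It assumes the wait before the next test equals the time it took to see the first signal, which is how the two-week example reads; the function name plan_test_timeline and the sample launch date are hypothetical.

```python
from datetime import date, timedelta


def plan_test_timeline(launch: date, first_signal_days: int, consistency_days: int = 14) -> dict:
    """Lay out the waiting periods for one test on one site section.

    first_signal_days: days after launch until a positive or negative change showed up
    consistency_days:  how long to watch that change hold (10 to 14 days in the talk)
    """
    first_signal = launch + timedelta(days=first_signal_days)
    rollout = first_signal + timedelta(days=consistency_days)
    # Assumption: the rolled-out change needs as long to settle as the
    # original change took to show, so the next test waits that long again.
    next_test = rollout + timedelta(days=first_signal_days)
    return {
        "first_signal": first_signal,
        "rollout_to_rest_of_site": rollout,
        "earliest_next_test": next_test,
    }


if __name__ == "__main__":
    # Made-up launch date, for illustration only.
    for step, day in plan_test_timeline(date(2024, 1, 1), first_signal_days=14).items():
        print(f"{step}: {day.isoformat()}")
```

With a two-week signal and a two-week consistency check, one test occupies a site section for roughly six weeks from launch to the next launch.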

You showed Cars.com & Argos.co.uk. They are not your clients, but are they representative of the size of the clients you ran these tests for?

Yes. This works with the requirement of a thousand organic sessions. This typically works well for large websites with large templated pages. Part of being able to build that model is having consistent traffic. It works very well with large eCommerce websites, with large listing sites here, something like Craigslist.

Was the data you presented, all real results for different websites?

Yes. Real results for different websites, often in the same space, but we couldn’t use the atrophy group.

Which was actually more valuable to these brands? Was it having the platform in place so that they could implement changes, so they could run tests, or was it having you there as a consultant?

I think we saw my questionable title tag tests to begin with. This is a question I’m frankly not qualified to answer. Whoever asked this question, you clearly don’t want this answer from me. I’m so biased in this. But I am great and ODN is great. If you are someone who struggles to get things done, suddenly having a platform and a way to put something in front of it and make a lot of changes quite quickly is going to be beneficial. You’re going to see results from that. At the same time, having someone who’s done a bunch of these tests is going to be beneficial and helpful.

If you’re smart and good at your job, you can just add the tool. You don’t need me. You’re just going to have the same ideas and we’re going to go through the same process. I am a little bit more on the technical side so I can do a little bit more of the developing and coding. But if you have a developer, then that’s great. You can just be a really great SEO. You can know your theory even if you’re a developer, mock-up everything else and just have your theory of how to test and the speed of the tests. Just making sure that you don’t fuck up the maths is far more important than everything else.