7. Scraping Google
This is probably the most popular use of ScrapeBox. Back in 2009, it was simply a feature for harvesting more blogs to post comments on.
I personally use the ScrapeBox scraping feature for:
A word about proxies
To start scraping, we need some proxies, otherwise our IP will be blocked by Google after just a few minutes.
Usually the best way to find a Google proxy is to use the built-in ScrapeBox Proxy Harvester. Using it right is quite complicated though, and I will not cover the whole process here. As an easier way for SEOs starting with ScrapeBox, I recommend going to any large SEO forum. There are always a few “public proxy threads”.
As an example, you can go to one of the posts listed below and simply use the proxies listed there daily.
The average lifetime of a Google proxy is 1 to 8 hours. For more advanced scrapes you’ve got to use either a lot of private proxy IPs or simply use more advanced scraping software (feel free to contact me for info).
Google scraping workflow
1. Find and test Google proxies
We’ve already got a few proxy sources (pasted above). Let’s see if we can get any working Google proxies from those lists.
Find at least 2000 – 3000 proxies and paste them into the Proxy tab in ScrapeBox, then click on “Manage” and start the test. We are obviously looking for a Google Proxy.
This is how proxy testing looks:
You will see the number of Google-passed proxies at the bottom of the screen. Google-passed proxies are also shown in green on the list.
After the test is finished, we need to filter the list. To do that, click Filter, and “Keep Google proxy”.
Now we’ve got only Google proxies on the list. We can save them to ScrapeBox and start scraping.
Remember to use proxies straight away, as they will usually not be alive for more than 1-3 hours.
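If you ever need to repeat this check outside ScrapeBox, the idea can be sketched in a few lines of Python. This is only a rough illustration and not how ScrapeBox tests proxies internally: the function names, the test URL, the timeout and the `ip:port` input format are all assumptions.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def passes_google(proxy, timeout=10):
    """Return True if a Google request through `proxy` (ip:port) answers with HTTP 200."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        # Hypothetical test URL; any lightweight Google search page would do.
        with opener.open("https://www.google.com/search?q=test", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Dead, slow, or blocked proxies all count as failures.
        return False


def keep_google_proxies(proxies, check=passes_google, workers=50):
    """Test proxies in parallel and keep only those that pass `check`, preserving order."""
    cleaned = [p.strip() for p in proxies if p.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, cleaned))
    return [p for p, ok in zip(cleaned, results) if ok]
```

The `check` parameter is there so the filtering logic can be tried with a fake test function before pointing it at real proxies.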
2. Set up the desired keywords
Now that we’ve got the proxies, all we need to start scraping are our desired keywords or footprints.
To show the scrape in a “real life” example, I will scrape the footprint used for Expedia’s WordPress Theme. For those of you who didn’t read the Expedia.com case study, it is a WordPress theme with footer links. Pretty easy to footprint.
As you can see on the screenshot above, our footprint to scrape is “Designed by Expedia Rental Cars Team.”
Copy the footprint mentioned above and paste it into ScrapeBox.
To set up your scrape, follow the screenshot above. Paste your desired footprint into the top right field. Then add as many keywords as possible (I only used 3, as this is just an example) to generate more results.
Yahoo, Bing, AOL
I personally don’t like using them. In my opinion, scrapes done with them are not as precise as the ones done with Google. On the other hand, I know that many of my SEO colleagues use those search engines quite successfully. I leave the choice to you: run some benchmarks and decide for yourself.
Why should we add extra keywords?
Each Google search returns 10 – 1000 results (depending on the setup). If we want to scrape, for example, 20,000 results, we need to use extra keywords, so our footprint will look like:
This way we can cover much more “ground” and dig much deeper.
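To make the arithmetic concrete: at a maximum of 1,000 results per search, 20,000 results require at least 20 different queries. Below is a minimal sketch of how a footprint can be combined with extra keywords. The three keywords are made up for illustration, and `build_queries` is a hypothetical helper, not a ScrapeBox function.

```python
def build_queries(footprint, keywords):
    """Combine one quoted footprint with each extra keyword into a search query."""
    quoted = f'"{footprint}"'
    queries = []
    for kw in keywords:
        kw = kw.strip()
        if kw:  # skip empty lines in the keyword list
            queries.append(f"{quoted} {kw}")
    return queries


footprint = "Designed by Expedia Rental Cars Team."
keywords = ["cars", "travel", "hotel"]  # hypothetical keywords, for illustration only

for query in build_queries(footprint, keywords):
    print(query)
```

Each query is a separate search, so every extra keyword gives Google another chance to return pages that the bare footprint search cuts off.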
Before scraping, Google your footprint manually. With that you can have a clear idea of what you want to accomplish, and then benchmark your results.
For our footprint, Google shows ~180 unique results.
Of course, having 180 unique pages scraped is a perfect score for us, but it is not always possible. Let’s see if we will be lucky enough to get close to 180 pages.
All we’ve got to do now is press “Start Harvesting”.
Now we can watch ScrapeBox doing what it does best: scraping Google.
OK, the search is finished, we’ve got 226 results. This is not epic, but pretty good for only 3 keywords.
After clicking OK, ScrapeBox will show us the good (with results) and bad (no results) keywords.
The stats above are really helpful, as with more complex searches you can be much more effective by filtering the keywords.
Unfortunately, we are not finished yet. The results we see come from different, separate searches, so they are almost always heavily duplicated. Fortunately, all we need to do is click “Remove/Filter” and “Remove duplicate URLs”.
Let’s see how close we are to our desired 180 results:
We’ve got 145 unique results. With only 3 keywords used, this is a really great result. There are probably ~35 more pages that we missed out there, but I’m sure that we’ve got all the unique domains on the list.
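The de-duplication step is easy to reproduce on an exported URL list. Here is a minimal sketch, assuming one URL per line and exact string matching; `remove_duplicate_urls` is a hypothetical helper name, and ScrapeBox may compare URLs differently.

```python
def remove_duplicate_urls(urls):
    """Drop exact duplicate URLs while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Note that this treats `http://` and `https://` versions, or URLs with and without a trailing slash, as different entries.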
Now let’s see the results:
With ~15 minutes of work, we’ve got the whole footprint scraped: 145 unique URLs across 21 domains.
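If you export the URL list, the URLs-per-domain breakdown can be checked with a short script. This is a sketch under a few assumptions: `count_domains` is a hypothetical helper, URLs include a scheme (`http://` or `https://`), and subdomains are counted as separate hosts.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many URLs belong to each host, ignoring a leading 'www.'."""
    counts = Counter()
    for url in urls:
        host = urlparse(url.strip()).hostname  # None when the URL has no scheme
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts
```

`len(count_domains(urls))` then gives the number of distinct hosts, and `count_domains(urls).most_common()` lists them from most to fewest scraped pages.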
In my opinion, scraping is a skill that is really important to anyone dealing with link audits. There are some SEO actions that you cannot do without scraping. The Orca Technique is a great example. It is not possible to implement it fully without scraping Google.
Scraping and de-duplicating is not all you can do, though. Imagine that you want to see which of the domains above have already been disavowed. We can do that with some really simple filtering.
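The same kind of filtering can be sketched outside ScrapeBox as well. The example below assumes a disavow file in Google's text format (one entry per line, `domain:example.com` entries, `#` for comments) and only looks at the `domain:` entries. Both function names are hypothetical, and matching subdomains against a `domain:` entry is an assumption of this sketch.

```python
from urllib.parse import urlparse


def load_disavowed_domains(lines):
    """Collect domains from 'domain:' entries, skipping comments and single-URL entries."""
    domains = set()
    for line in lines:
        line = line.strip()
        if line.lower().startswith("domain:"):
            domains.add(line[len("domain:"):].strip().lower())
    return domains


def split_by_disavow(urls, disavowed):
    """Split URLs into (disavowed, not_disavowed) lists based on their host."""
    hit, miss = [], []
    for url in urls:
        host = urlparse(url).hostname or ""
        # Assumption: a 'domain:' entry also covers subdomains such as www.
        matched = any(host == d or host.endswith("." + d) for d in disavowed)
        (hit if matched else miss).append(url)
    return hit, miss
```

Feeding it the lines of your disavow file and the de-duplicated URL list gives two lists: links you have already dealt with, and links that still need a decision.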