Today, one company, Google, controls almost all of the world’s access to information on the Internet. For billions of people, its monopoly on search means that their gateway to knowledge, to products, and to their exploration of the web is in the hands of a single company. Most agree that this lack of competition in search is bad for individuals, communities, and democracy.
Unbeknownst to many, one of the biggest obstacles to competing in search is a lack of crawl neutrality. The only way to build an independent search engine and stand a chance of competing fairly with Big Tech is to first crawl the Internet efficiently and effectively. However, the web is an actively hostile environment for upstart search engine crawlers, with most websites allowing only Google’s crawler and discriminating against other search engine crawlers such as Neeva’s.
This all-important yet often overlooked issue goes a long way toward preventing startup search engines like Neeva from providing users with real alternatives, further reducing competition in search. Just as with net neutrality, we need crawl neutrality today. Without a change in policy and behavior, competitors will keep fighting with one hand tied behind their backs.
Let’s start at the beginning. Building a comprehensive index of the web is a prerequisite for competing in search. In other words, the first step in building the Neeva search engine is “downloading the Internet” via Neeva’s crawler, called Neevabot.
This is where the trouble begins. For the most part, websites allow unrestricted access only to Google’s and Bing’s crawlers, while discriminating against other crawlers like Neeva’s. These sites either disallow everything else in their robots.txt files or (more often) say nothing in robots.txt but return errors instead of content to other crawlers. The intention may be to filter out malicious actors, but the result is that the baby is thrown out with the bathwater. And you can’t serve search results if you can’t crawl the web.
This forces startups to spend inordinate amounts of time and resources coming up with workarounds. For example, Neeva implements a policy of “crawling a site as long as the robots.txt allows Googlebot and does not specifically prohibit Neevabot.” Even with a workaround like this, portions of the web that contain useful search results remain inaccessible to many search engines.
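To make the quoted policy concrete, here is a minimal sketch of how such a rule might be expressed with Python’s standard `urllib.robotparser`. It is an illustration under stated assumptions, not Neeva’s actual code: the function names `names_agent` and `may_crawl` are hypothetical, and “specifically prohibit” is interpreted here as “a User-agent group names Neevabot and that group disallows the URL.”

```python
from urllib.robotparser import RobotFileParser

GOOGLEBOT = "Googlebot"
NEEVABOT = "Neevabot"


def names_agent(robots_txt: str, agent: str) -> bool:
    """Return True if a User-agent line names `agent` specifically (not via '*')."""
    target = agent.lower()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "user-agent":
            continue
        token = value.strip().lower()
        # Mirror robotparser's substring matching, but ignore the wildcard group.
        if token and token != "*" and token in target:
            return True
    return False


def may_crawl(robots_txt: str, url: str) -> bool:
    """Hypothetical policy: Googlebot is allowed and Neevabot is not singled out."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(GOOGLEBOT, url):
        return False
    # Only a group that names Neevabot counts as a specific prohibition;
    # a blanket "User-agent: *" disallow does not.
    if names_agent(robots_txt, NEEVABOT) and not parser.can_fetch(NEEVABOT, url):
        return False
    return True
```

Under this reading, a site that allows Googlebot and disallows `*` would still be crawled, while a site that adds an explicit `User-agent: Neevabot` disallow would be skipped.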
As a second example, many websites allow a non-Google crawler in robots.txt but block it in other ways, either by returning various kinds of errors (503s, 429s, …) or by rate limiting it. To crawl these sites, one has to deploy workarounds such as “obfuscating the crawl by using a set of proxy IPs that rotate periodically.” Legitimate search engines like Neeva are loath to use adversarial workarounds of this kind.
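In contrast to the proxy-rotation workaround, a crawler that chooses to honor these signals might compute a wait time from the response before retrying. The sketch below is only an illustration: the function name `backoff_seconds` and the `base` and `cap` defaults are assumptions, and it handles only the integer-seconds form of the HTTP `Retry-After` header, falling back to exponential backoff otherwise.

```python
from typing import Optional


def backoff_seconds(
    status: int,
    retry_after: Optional[str] = None,
    attempt: int = 0,
    base: float = 1.0,   # assumed starting delay, in seconds
    cap: float = 300.0,  # assumed upper bound on any single wait
) -> float:
    """Return how long to wait before retrying, or 0.0 if no wait is needed."""
    if status not in (429, 503):
        return 0.0
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds.
            return float(min(max(int(retry_after), 0), cap))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # No usable header: double the delay on each attempt, up to the cap.
    return float(min(base * (2 ** attempt), cap))
```

For example, `backoff_seconds(429, "120")` yields a two-minute wait, while a 503 with no header on the fourth attempt (`attempt=3`) yields eight seconds.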
These roadblocks are often aimed at malicious bots, but they have the effect of stifling legitimate search competition. At Neeva, we put a lot of effort into building a well-behaved crawler that respects rate limits and crawls at the minimum rate needed to build a great search engine. Meanwhile, Google has carte blanche. It crawls 50 billion pages of the web a day. It visits every page on the Internet once every three days and taxes the network bandwidth of every website. This is the monopolist’s tax on the Internet.
For the lucky crawlers among us, a number of benefactors, webmasters, and well-meaning publishers can help get your bot whitelisted. Thanks to them, Neeva’s crawl now runs at hundreds of millions of pages a day and is on track to reach billions of pages a day soon. Even so, this still requires identifying the right people at these companies to talk to, emailing and cold calling them, and hoping for goodwill from webmasters on webmaster aliases that are typically ignored. It is a temporary fix that does not scale.
Getting permission to crawl shouldn’t be about who you know. There must be a level playing field for everyone who competes and abides by the rules. Google is a monopoly in search. Websites and webmasters face an impossible choice: either let Google crawl them, or do not show up prominently in Google results. As a result, Google’s search monopoly causes the Internet at large to reinforce that monopoly by giving Googlebot preferential access.
The Internet should not discriminate between search engine crawlers based on who they are. Neeva’s crawler is capable of crawling the web at the speed and depth that Google’s does. There are no technical limitations, only anticompetitive market forces that make it harder to compete fairly. And if it is too much extra work for webmasters to distinguish bad bots that slow down their websites from legitimate search engines, then those with free rein, such as Googlebot, should be required to share their data with responsible actors.
Regulators and policymakers need to step in if they care about competition in search. The market needs crawl neutrality, similar to net neutrality.
Vivek Raghunathan is the co-founder of Neeva, an ad-free private search engine. Asim Shankar is Neeva’s Chief Technology Officer.