Saturday, December 10, 2011
I’d like to take a couple of minutes to take a look at the past and the present. This post won’t teach much about SEO; it’s just a hypothetical look at an alternative universe, where the early days of search went a bit differently. (Is ‘search engine fanfiction’ a genre?) Here goes….
A Story - The Internet Without Google
In the mid-1990s, two Stanford grads – funded by various federal councils – were working on a project to apply their understanding of network topology and the principles of academic citation to Internet. As Internet had become more accessible to the public, the number of people browsing it – as well as the number of people generating content – had increased with ‘hockey stick’-like growth. Finding content was a matter of either knowing a particular domain that should be visited, visiting a trusted directory where links were categorized and curated, or using one of the new ‘Internet search portals’.
The search portals (sometimes referred to as ‘search engines’) were technologically impressive, with millions of pages available to search across, but they were unreliable: the results were usually based on little more than whether a user’s query term appeared prominently on the page. The Stanford project – Backrub – had taken a different approach, using the principle that pages to which someone had created a hyperlink – whether from a bulletin board, a newsgroup or another site on Internet – were likely to be more trusted, and so should rank more highly in search portal results. Other signals, such as the topic of the linking page, were used to calculate the target page’s relevancy to a particular keyword.
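(As an aside from our universe: the idea Backrub rested on is, of course, PageRank. If you’ve never seen it written down, a toy version fits in a few lines of Python. This is only an illustrative sketch – the iteration count and the four-page ‘web’ below are my own assumptions, not anything from the original papers.)

```python
# Toy PageRank sketch: pages that receive links from important pages become
# important themselves. The iteration count and the tiny link graph below are
# illustrative assumptions, not Backrub's actual parameters.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # simplification: dangling pages just lose their share
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] = new_rank.get(target, 0.0) + share
        rank = new_rank
    return rank

# A toy web of four pages: C is linked to by everyone, so it ranks highest.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
print(sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]))
```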
Backrub was first tested with content from almost five thousand separate sites – and it worked well. The results were consistently relevant and were still calculated in a fraction of a second.
With this successful proof of concept complete, the two engineers wrote up their analyses and results, and moved on from the project. They received their Ph.D.s later – but their creativity had attracted attention within tech circles: one of the students accepted a job with a successful software company from the Pacific Northwest; the other left the US to join one of the largest industrial technology companies in his home country.
The Backrub technology was recognized as useful, but Stanford had no resources to scale it further, nor Ph.D. candidates interested in taking it on. After being shelved for almost six months, the technology was passed on to the Department of Information Technology and Telecommunications; many of the underlying processes were successfully patented soon after.
Next Steps
The DoITT was a new agency, and had been assigned various (poorly specified) responsibilities pertinent to Internet; the mandate to ‘improve the user experience of Internet for all citizens’, combined with a substantial budget, allowed them to scale up the technology. In the first large crawl, in 1996, they retrieved over a hundred million pages from almost half a million sites. These were taken from a seed list of sites listed in two of the largest editorial directories, after around eight percent had been removed (DoITT’s guidelines prevented them from linking to adult material or ‘illegal content’).
After this first large crawl, the frontend was released at http://backrub.portal.doitt.gov.
Although it wasn’t heavily promoted, the service was popular and mainly recommended by word-of-mouth. In March 1997, less than six months after the unofficial launch, a New York Times editorial said that “the DoITT had restored faith that the Government is capable of innovative technology, after the fiascos of various bloated IT contracts for the Department of Defense. This portal was a triumph.” The Washington Post called it a “huge success.”
That year, three factors combined: positive feedback from users (who mostly described it as ‘the first portal that actually finds what I was looking for’), further reviews in the press (who were happy to have a story about Internet that could be understood by the general public) and increased promotion by the government (who were proud to promote a service that demonstrated America’s cutting-edge presence on Internet).
Hugely accelerated growth followed: later that year, one ISP estimated that 19% of homes used the service more than once a week.
By early 1998, users had already seen a couple of significant changes. The service had been spun off from the DoITT, and found a new home at http://Search.Gov.
Due to the increasing popularity, the service now also required free registration to access search results: this only needed a verified email address, and the registered account could be used on any computer.
Various developments took place behind the scenes as well: a young internet company tried to diversify from being a pure Internet directory into offering search portal facilities as well, using a similar algorithm to the original Backrub service, whilst other companies tried to tackle the ‘Internet search’ problem from different angles. The threat of prosecution for infringing upon the Government’s various patents proved too much for these startups, and they generally didn’t pursue it any further.
Search.Gov
A team inside Search.Gov was responsible for managing ‘user satisfaction’ – this included analysis of various user behavior patterns (such as the number of searches needed to find a particular result, which they correlated with a worse user experience) and the manual review of ‘search quality reports’ which users could complete via a form on the site.
An overwhelming problem for this team – and for Search.Gov as a whole – was the increasing rate of growth of content published on Internet. Companies were increasingly putting their whole catalog online, news organizations would publish their best articles on a website every day, and new tools made it increasingly easy for the public to publish small sites of their own, usually on webspace provided by their ISP. Many of the quality issues perceived by Search.Gov were due to new sites not appearing in the results – the Search Quality team had to manually review sites that had been discovered or submitted, and was sitting on a backlog of around three months of submissions to process.
(The site tried to solve this by removing the manual review stage, and automatically adding every newly discovered site to the index. The idea was that sites which contained adult or illegal material could always be removed again later. This worked well for around six weeks, until users began to complain of increasingly irrelevant results, poor quality sites full of gibberish, and the increasing likelihood of pornography appearing in search results. Search.Gov quietly reverted the changes, and returned to manually reviewing sites, but with additional resources and a better process. Sites that were permitted generally arrived in the Search Index within 3-4 weeks of having been submitted or recommended by an internal algorithm.)
Creative Funding
The increasing cost of running the service (in both technology and human resources) was of concern to various parties within the Government, as well as Conservatives who believed that this shouldn’t be a federally funded operation. Some web directories and similar services had seen success in selling targeted advertising on their listings pages: this was much more effective than traditional display advertising, as a ‘premium listing’ or ‘sponsored listing’ could place a website’s link prominently in the appropriate category and attract people who were looking for precisely that type of site.
While Search.Gov had looked into including adverts and premium listings on results pages, it was deemed inappropriate for a Government service to be supported by advertising.
“H.R. 1337 – Search Portal Funding Bill” was designed to tax companies listed in Search.Gov based on various criteria. It was argued that the tax would stymie initiative from small companies and would completely remove non-commercial sites from the results – the bill was never passed in the House.
However, a funding opportunity came from a much larger source – overseas. Search.Gov had recognized that the number of registrations from international users was increasing. A suggestion to raise revenue by formally licensing the technology to other countries saw very little objection from within the Government or in the press. The United Kingdom and Australia were the first countries to launch bespoke sites (at Search.Gov.UK and Search.Gov.AU) for a fee (technically a technology export tax) that equated to roughly $13 per citizen per year. Germany and Japan followed around a year later, after the system was extended to cope with non-English sites.
The governments of these countries could adjust certain parameters of the algorithms to better suit their audience: the most significant change was usually to show a preference for websites from that country, and to block particular sites that carried content which was illegal in their country.
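(To make those ‘parameter tweaks’ concrete, I imagine something like the sketch below – a purely hypothetical post-processing step where domestic results get a score boost and blocked domains are dropped. The boost value, the country field and the blocklist are all invented for illustration.)

```python
# Hypothetical sketch of a licensee's post-processing step: results from the
# licensee's own country get a score boost, and blocked domains are dropped.
# The boost value, the 'country' field and the blocklist are invented here
# purely for illustration.

HOME_COUNTRY = "DE"
HOME_BOOST = 1.3                                   # assumed multiplier for domestic sites
BLOCKED_DOMAINS = {"example-blocked-site.de"}      # hypothetical national blocklist

def adjust_for_country(results):
    """results: list of dicts with 'domain', 'country' and 'score' keys."""
    adjusted = []
    for result in results:
        if result["domain"] in BLOCKED_DOMAINS:
            continue  # content illegal in the licensee's country
        score = result["score"]
        if result["country"] == HOME_COUNTRY:
            score *= HOME_BOOST
        adjusted.append({**result, "score": score})
    return sorted(adjusted, key=lambda r: -r["score"])
```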
The steady stream of revenue from these licensing arrangements ensured that Search.Gov was self-sufficient; it was allowed to use a majority of the income to improve and extend the service, whilst a portion was passed up the chain into the Federal budget.
After the technology was passed overseas, it became necessary to restrict the use of Search.Gov to residents of the USA. This was implemented by having users associate their Social Security Number with their Search.Gov account. Since this allowed for the association of an individual’s real identity with their Search.Gov account, it naturally caused some controversy, particularly amongst those concerned about Search.Gov’s increasingly close cooperation with the FBI & National Security Agency.
However, the criticism was mostly silenced in early 2000. A school teacher was arrested for sexual abuse of his students; later analysis of his search history showed that he had regularly looked for lewd and inappropriate content. A bill was quickly passed that year, allowing a review of an individual’s search history when they applied for any government job. (The individual’s permission was required, but granting it was mandatory for most positions.)
In response to the 2001 terrorist attacks, the President created special new powers for the security agencies to increase the amount of search data collected and deep forensic analysis undertaken. Full details were never released, but it’s understood that as well as looking for general search patterns, it was now permitted to review the search history of any individual suspected of a serious crime.
Search Portal Optimization
During these years of growth for Search.Gov, a whole new industry was spawned: Search Portal Optimization, or SPO. SPO was the practice of optimizing sites to rank as highly as possible in Search.Gov for the appropriate phrases. While many sites paid little attention to SPO, others worked hard to push their sites high up the rankings.
SPO Consultants typically referred to four broadly separate stages of the SPO process:
- Ensuring a site was easily browsable by search portal robots (by using HTML standards, rather than Flash, etc)
- Applying for the site to be included in Search.Gov and appealing any rejections
- Identifying the search terms the site should rank for, and creating relevant content
- Building hyperlinks to the site from external sites.
Search.Gov did not release any data about the most searched for terms, but SPOs could get access to some data from ISPs who were able to estimate the Search Volume for some larger terms.
In addition, the Search Quality team rarely gave much feedback to sites that were not accepted into the portal’s index. In the early days, discussion centered around sites that were blocked for carrying information about homemade explosives or promoting drug use. Some sites were removed from the index for carrying illegal content: a retailer that sold radar jammers for automobiles was removed after three months.
An Internet Log (‘netlog’) that was critical of the Government’s approach to farming subsidies was not accepted into the index; the New York Times covered this controversy, and was also removed from the index. Two days later, both sites were available in the index: a Search.Gov spokesperson said “this had been unintentional, and was not an attempt to remove content from Internet. We are committed to fostering free speech online, and to a free and open Internet.”
(A Freedom of Information request showed that the number of sites not accepted into the Search.Gov index increased from around 10,000 in 2000 to 198,000 in 2008.)
Link Building
Once in the index, sites would work hard to improve their SPO, to try and rank above their competitors. The main focus of many SPO experts was link building: getting more links from more sites would generally result in a better ranking. Search.Gov did publish a set of Webmaster Guidelines that explicitly mentioned that links should not be created purely in order to manipulate the Search Algorithm – but this did not put off the so-called ‘blackhat SPOs’, who aimed to get high rankings for their site, at any cost.
This led to something of an arms race between Search.Gov and blackhat SPOs: each time the blackhats found creative new tactics to create sneaky links online, the Search.Gov quality team would devise algorithmic methods to remove the impact of these links from their calculations.
However, the secrecy around Search.Gov’s algorithm, and the amount of authority passed through each link, meant that webmasters/SPOs/link builders were never sure which links were passing value and which were not. Comment spam, low quality directories and other questionable tactics continued unabated. A particular issue for the search quality team was paid links. While some paid links were easy to detect, others were only obvious when reviewed by a human. Rather than discounting links, Search.Gov used ranking penalties to dissuade sites from purchasing links.
A typical penalty saw a page with many inbound paid links have its ranking position dropped by 20 or 50 places – webmasters would have to remove the offending links to see that page start ranking well again. For more serious infractions of the guidelines, a whole site could be removed from the index for a month or two. (Although users could still navigate directly to a site, the lack of non-branded traffic would decimate a commercial site.)
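(Mechanically, I picture the penalty working something like this – a purely hypothetical sketch where a flagged URL is pushed a fixed number of positions down the ranked list. The 20- and 50-place drops match the figures above; where the flags come from is assumed to be a manual review.)

```python
# Hypothetical sketch of applying a manual ranking penalty: a page flagged for
# paid links is pushed a fixed number of positions down the ranked list. The
# penalty sizes follow the story; the flagging is assumed to be a manual review.

def apply_penalties(ranked_urls, penalties):
    """penalties maps a URL to the number of positions it should drop."""
    result = list(ranked_urls)
    for url, drop in penalties.items():
        if url not in result:
            continue
        position = result.index(url)
        result.remove(url)
        result.insert(min(position + drop, len(result)), url)
    return result

# Example: the second result bought links and is dropped 20 places
# (or to the bottom of a shorter list).
demoted = apply_penalties(
    ["a.example", "paid-links.example", "c.example", "d.example"],
    {"paid-links.example": 20},
)
```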
Despite the risk of page and site penalties, many blackhat SPOs were still not put off from creating low quality content, and building links that violated the Webmaster Guidelines, since they would usually dump a penalized site, and work on another one instead. In 2007, after increasing public complaints about the declining quality of Search.Gov’s results, the service was able to have parts of the Webmaster Guidelines adopted as law. The first criminal trials against SPOs were held in February 2008: the two owners of a credit card comparison website were charged with the Class C Misdemeanor of ‘Intent to Mislead a Search Portal Algorithm’ for their paid links and each served 30 days in jail.
Algorithm Updates
The changes to Search.Gov’s ranking algorithm were not always subtle. Various substantial changes were implemented, typically to try and tackle a specific type of search quality problem, such as duplicate content, content farms, etc.
One large update in March 2009 was seen to promote the sites of big brands higher up the rankings; it was likely that this change came in response to a request by the newly elected President, who had received huge campaign contributions from large corporations in the election. Many small businesses said this pointed to the ‘increasing politicization’ of Search.Gov’s algorithms. The impact of lobbyists on these decisions was seen even more clearly in 2010, when tobacco industry lobbyists were successful in getting cigarette brand websites included in the index. (Until that point, they had been excluded on the basis of the broader laws about advertising tobacco products.)
An algorithm update in 2011 appeared as if it was designed to reduce the impact of ‘content farms’. One notable side-effect of the change was that ‘scraper sites’ increasingly appeared to outrank the original versions of content they had republished. Shortly after this, Search.Gov assisted in the creation of laws that brought unauthorized content scraping within the formal definitions of grand larceny. (It had previously been treated as simple copyright infringement.) Republishing another site’s content without their consent was now punishable by up to a year’s jail time. Although there were some prosecutions, this wasn’t particularly effective at reducing the number of scraper sites hosted and owned overseas.
International Trouble
Search.Gov was facing some bigger issues, as far as its international relationships went.
By 2011, Search.Gov had licensed the technology to 44 countries. In May that year, two Turkish citizens arrived in the US, and were immediately arrested on charges of terrorism. Although the US had no direct access to information about the websites they’d visited, the search history of one of them had raised flags at the NSA, which led to further investigation and ultimately their arrest.
Foreign governments and citizens were not enthused by the idea that Search.Gov technology was being used to track their Internet searches. The patents that prevented independent Search Portals launching in the USA were not recognized all over the world, and a number of commercial companies ran similar services – particularly in Europe and Asia. In the weeks after the Turkish controversy, the market share (outside the US) of services built upon Search.Gov dropped from around 95% to 90%. The number declined further over the next twelve months as the largest commercial Search Portals reinvested their profits in improved indexing and ranking technology. By August 2011, three countries (Turkey, Germany and Brazil) announced that they had ceased using Search.Gov’s technology. All three countries allowed competition in the Search Portal market, though the German Government maintained www.Suchen.DE, which was now powered by technology from a German company.
In late 2011, Search.Gov suffered a series of ‘cyber attacks’, which seemed to originate from Russia and China. (Neither country used Search.Gov technology for their own services.) Although the attacks may have been an attempt to hack the site and steal personal data, Search.Gov announced that no data had been accessed. However, the site suffered significant periods of downtime, initially for three hours, then a week later for six hours. Various rumours suggested that the US believed China’s government were behind the attacks – this was never confirmed by the Department of State.
The Future
Despite some controversies, Search.Gov has been a solid technology for users in the USA and abroad for more than a decade and a half. It’s now used by over 90% of US households on a weekly basis, though most people use it much more than that: the average user makes 9.5 searches per day on Search.Gov.
Search Portal Optimization continues to be serious business: various research suggests there could be up to 200,000 people employed in the US to perform SPO – either directly for a company or through an agency.
The Search.Gov technology continues to evolve: new content is now discovered more quickly, and the results continue to improve. The organization currently employs around 10,000 engineers and operations staff within the US, and another ~18,000 full and part time employees responsible for assessing new sites for submission to the index. There are around 600 employees of Search.Gov stationed around the world to work with the teams in countries that license the technology.
Search.Gov is the ‘first page of the internet’ for most people. I don’t know what we’d do without it….
Epilogue
OK, let’s step back to the real world. This thought-experiment focussed very much on what could have happened in my alternate universe.
What’s also interesting are the things that might not have happened:
- without the necessity to create PPC, internet advertising may never have really evolved beyond traditional banner ads
- no AdSense means that people may never have bothered creating/maintaining many smaller content sites
- innovations such as Gmail and Google Analytics wouldn’t have appeared
- with no-one to acquire them, YouTube wouldn’t have been able to grow so fast. (In fact, with their large amounts of copyright material in the early days, would they have even been allowed in the index?)
- newly wealthy employees wouldn’t have left to create sites such as Twitter, Foursquare and FriendFeed.
If you got this far, thanks for reading. I enjoyed the opportunity to try writing some long-form content.
If you’d like to share your thoughts on this potentially very different online landscape: you can leave a comment below.
Thanks,