I've worked in Web development since 1995, but not until last month did I have the long overdue chance to work on a rare South African phenomenon: the high-traffic site. I want to relay some of the important lessons I learned in implementing and launching this system.
As the Web starts to pick up pace here, more and more old and new Web development hacks are going to face the challenges of sites that are actually visited.
And it's all about one thing really: speed.
Speeding up
It's been a long time since I worried about the download time of any of my Web sites. Why? Because the average corporate Web site is visited so infrequently that even a fairly large home page that takes a few minutes to download is acceptable.
Frankly, people aren't as impatient as they're cracked up to be. If I want to see Acme Traders' home page and it's a little slow, I'll bear with it. I might be slightly irritated, but I'll forget that within reason if I find what I want.
Not so on a high-traffic, frequently used site. Take ITWeb as a useful example. Visitors typically return often: news updates appear regularly, and there are links back from e-mailers, links from other sites and so forth. If a user had to wait five times a day for a slow page to appear, things would move from slightly irritating to downright unacceptable.
Speed strategy number one: Caching
Caching is one of the most important and useful back-end features in Web browsers, content management systems, application servers and databases.
What is it? It's basically a way of doing some hard work once, and then reaping the benefit many times over.
Web browser caching has been around since I first downloaded Netscape, and it's simply this: anything that gets shown on more than one Web page, or any page that's visited more than once, is stored on a local hard drive instead of having to be downloaded again from the Web. It saves bandwidth and time.
Browser caching is, however, not particularly useful when dealing with dynamic content sites. If the home page of a news site changes every hour, then it's no good storing a saved version of it on your hard drive. You'll still be reading about Windows Vista long after Linux has taken over the world.
Instead, one of the secrets that high-traffic sites understand is to create a cached or 'baked' version of the site on the server. So, although a new version of the page has to be downloaded each time, it doesn't put extra pressure on the content management system or underlying database in the process. It is as though a static site were being delivered.
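To make the idea concrete, here is a minimal sketch of server-side page caching in Python. It is illustrative only: render_from_cms and the one-hour expiry are assumptions for the example, and a real content management system or application server would use its own caching layer rather than anything this crude.

```python
import time

# Illustrative only: a minimal server-side page cache with a time-to-live.
# render_from_cms() stands in for an expensive page build out of the CMS
# and database; the one-hour TTL is a placeholder, not a recommendation.
CACHE_TTL_SECONDS = 60 * 60   # re-bake each page at most once an hour
_page_cache = {}              # url -> (html, time it was baked)

def render_from_cms(url):
    """Pretend this hits the CMS and database to build the page (slow)."""
    return f"<html><body>Freshly baked page for {url}</body></html>"

def get_page(url):
    """Serve the baked copy if it is still fresh, otherwise rebuild and cache it."""
    cached = _page_cache.get(url)
    if cached is not None:
        html, baked_at = cached
        if time.time() - baked_at < CACHE_TTL_SECONDS:
            return html   # no CMS or database work at all
    html = render_from_cms(url)
    _page_cache[url] = (html, time.time())
    return html
```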
There are many areas where caching can be employed to make the site as fast as possible, and reduce server load as much as possible. The higher your traffic, the more time it's worth investing in caching everything: user profiles, searches, navigation, the works.
Speed strategy number two: Limit personalisation
Directly linked to the point above is the use of personalised pages and content. Simply put, anything you want to show to only one user in one way makes the task of creating a cache infinitely more difficult, and on some level, impossible.
Take, for example, a bank account. Only you, when you log in, can see the information on that page in the way that you see it. The bank therefore cannot reasonably cache that page, and every access of the banking system means a new page must be built on the fly for each user.
This all adds up to a load on the server and network. If one page takes three seconds to generate out of the content management system, imagine how long 150 000 pages will take: at three seconds apiece, that is around 125 hours of raw generation time. While that kind of load isn't expected all at once, serving that number of pages in a day or even, on really high-traffic sites, in an hour or less, could place intense demands on the system.
The rule here is simple: don't personalise what can be 'genericised'. It might sound like a cool idea to let everyone see their own unique view of the home page, but be ready for a huge hardware bill and, probably, a slower site anyway.
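A toy illustration of the arithmetic, in Python. The cache-key scheme is hypothetical, purely to show the effect: a generic page needs one cached copy for everybody, while a fully personalised page needs one cached copy, and one slow render, per user.

```python
# Illustrative only: why personalisation defeats caching.
# A generic home page needs a single cache entry; a personalised one
# needs an entry (and a slow render) per user.
def cache_key(page, user_id=None, personalised=False):
    """Hypothetical key scheme: personalised pages can only be reused by one user."""
    return (page, user_id) if personalised else (page,)

# One entry serves every visitor...
generic_keys = {cache_key("home") for _ in range(150_000)}
# ...versus one entry per visitor.
personal_keys = {cache_key("home", user_id=i, personalised=True) for i in range(150_000)}

print(len(generic_keys))   # 1
print(len(personal_keys))  # 150000
```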
Speed strategy number three: Hardware
Then there are the machines. Hardware is one thing you can depend on to keep getting faster and cheaper. The kind of machine you can build for R20 000 today can handle just about any high-traffic site in the country, all else being equal.
What this means is that the right balance between hardware and software needs to be found. There's no sense spending hundreds of development hours optimising a system when throwing an extra R700 RAM SIMM into the box will achieve the same thing.
Hardware is often the cheapest and quickest way to solve performance problems. Why not make the most of that?
Speed strategy number four: Configuration and tuning
All big Web applications these days depend on underlying infrastructural software for their existence. Whether it's a Java Application Server, a Web server, some .Net service or a database, Web applications are rarely standalone.
What this means is that in deploying a Web application, developers need to consider not only the code they've written, but the optimisation of these underlying applications for high-traffic conditions.
Many developers, and indeed development companies, ISPs and network administrators, know very little about running servers under high traffic conditions. They may have read the documentation, but for many sites this is all still theoretical.
Even the humble Web server, such as Apache or IIS, which is often deployed unthinkingly in its out-of-the-box configuration, has many performance and tuning options that can take a site from snail to Ferrari. It's not difficult; it's just often overlooked.
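By way of example, here is a hypothetical fragment of an Apache 2 (prefork MPM) configuration touching a few of those options. The values are placeholders, not recommendations; the right numbers depend entirely on the hardware, the page weight and the traffic profile.

```
# Illustrative only: a few Apache 2 (prefork MPM) directives that affect
# behaviour under load. Values are placeholders, not tuning advice.

# Close idle connections quickly so workers aren't tied up
KeepAlive           On
KeepAliveTimeout    5

# Process pool sizing, including the ceiling on simultaneous requests
StartServers        8
MinSpareServers     5
MaxSpareServers     20
MaxClients          150

# Recycle worker processes periodically to contain memory leaks
MaxRequestsPerChild 10000
```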
Speed strategy number five: Indexed search
Search is one of the most demanding pieces of functionality deployed on the average Web site. Multiply the number of searchable pages or content items by the number of concurrent searches, and it's very easy to see how a simple keyword search that has to run a query against the database can bring the average site to its knees.
Of course, search engines and databases have known this for years, which is why the concept of an index exists. In essence, a "crawler" pre-processes all the pages or content on the site, and creates a simple list of relevant keywords mapped to each page. Then, when a search is executed, the database is bypassed completely and a much faster, less resource intensive search is run.
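Here is a minimal sketch of the idea in Python, assuming nothing about any particular search product: the index is built ahead of time by the "crawler", and queries then touch only the in-memory index, never the database.

```python
import re
from collections import defaultdict

# Illustrative only: a toy inverted index of the kind a crawler might build,
# so keyword searches never touch the CMS database at request time.
def build_index(pages):
    """pages: dict of url -> page text. Returns word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the urls containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results

pages = {
    "/news/1": "Linux takes over the world",
    "/news/2": "Windows Vista review",
}
index = build_index(pages)           # done ahead of time by the 'crawler'
print(search(index, "linux world"))  # {'/news/1'}
```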
This is no revolutionary idea: database indexes have existed forever, and the likes of Google and Autonomy have made indexes of Web content and documents their core businesses. However, on any kind of Web application it's easy to forget these lessons and simply let the search run against the database or content repository on the fly. It's the easiest way to do it, returns the most up-to-date data and allows for the greatest control.
It's also the biggest mistake it's possible to make on a high-traffic site. Often no-one stops to do the maths until it's too late. Putting in place a good indexed search is a vital part of providing users with a rapid site experience, and preventing server timeouts and downtime.
Running a high-traffic site poses many challenges apart from those related to performance, but as things hot up in SA, and users finally arrive in the droves we've always hoped they would, we're all going to learn the same lessons our US counterparts learned years ago. Hopefully, we can learn from them directly, and not simply in the way they had to: through crisis and recovery.