The only two certainties: death and taxonomy

By Jarred Cinman, Product director at Cambrient

Johannesburg, 30 May 2006

Simply put: content taxonomy is the science, or art, of putting content where you, and your users, can best locate it the next time you need it. It is all about where stuff belongs, and where it doesn't.

What is a taxonomy?

A taxonomy is a system of classification. Most of us are at least mildly familiar with the taxonomy of the natural world: for example, dogs belong to the genus Canis, Canis to the family Canidae and so forth, through mammals, up to animals and beyond.

Classifying everything in the natural world is a hellishly complex task, and probably one that we will be busy with in a million years' time if our species survives that long. One of the main reasons for that complexity is that it's a completely manual task, and one that relies heavily on judgement calls. Also, there is the notion that where something in the natural world belongs is a matter of fact.

Content taxonomies, by contrast, are not so much a matter of fact, but rather of utility. What structure will work best for the purpose at hand? What will make the most sense to content users?

So, put simply, a content taxonomy is a filing or categorisation hierarchy in terms of which all content can be classified.

Simple, right?

Not so simple

There are two activities here: the filing activity and the construction of the taxonomy itself.

If you ask two people to file a piece of content against a set of given categories (let's say, a press release about a new product into the categories "product news", "PR" and "official company documentation") the results will vary dramatically between users. A whole range of psychological factors and judgement calls come into play. One person's "product news" is another person's "official company documentation".

And that may very well be the easy part of the job. Harder still is constructing the taxonomy in the first place: deciding which categories are on offer for users to choose from. Working from a pile of documents or a sprawling corporate shared file directory, and ending up with a neat classification system that can accommodate everything, is no small task.

Both of these jobs are harder still when you realise that a good taxonomy should allow for each member to appear in only one place in the hierarchy. A taxonomy is not very helpful if something may or may not appear in its expected place.
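
The "one place only" rule can be made concrete with a small sketch. What follows is only an illustration, not a description of any particular content management system: the `Taxonomy` class, its method names and the sample categories are all hypothetical. It shows a category tree that refuses to file the same item twice.

```python
class Taxonomy:
    """A category tree in which every item lives under exactly one category.

    Hypothetical illustration; the class and method names are assumptions.
    """

    def __init__(self):
        self.parent = {"root": None}  # category -> its parent category
        self.location = {}            # item -> the single category it is filed under

    def add_category(self, name, parent="root"):
        if name in self.parent:
            raise ValueError(f"category {name!r} already exists")
        if parent not in self.parent:
            raise KeyError(f"unknown parent category {parent!r}")
        self.parent[name] = parent

    def file(self, item, category):
        if category not in self.parent:
            raise KeyError(f"unknown category {category!r}")
        if item in self.location:
            # The single-placement rule: an item may not be filed twice.
            raise ValueError(
                f"{item!r} is already filed under {self.location[item]!r}"
            )
        self.location[item] = category

    def path(self, item):
        """Return the categories from the root down to where the item is filed."""
        node = self.location[item]
        parts = []
        while node is not None:
            parts.append(node)
            node = self.parent[node]
        return list(reversed(parts))


taxonomy = Taxonomy()
taxonomy.add_category("product news")
taxonomy.add_category("official company documentation")
taxonomy.file("new-product-release", "product news")
print(taxonomy.path("new-product-release"))  # ['root', 'product news']
```

A second attempt to file "new-product-release", under any category, raises an error rather than quietly creating a duplicate, so the item can always be found in its one expected place.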

But why is it important?

There are several reasons why content taxonomies are important, if not essential, in content management. Here are two:

Firstly, predictability and order. If someone needs to find something, someone else can tell them where to look for it. Or they can browse logically through a repository of some kind and find it on their own. Of course, this presumes they understand the principles of the taxonomy and can think more or less along with the logic of the taxonomy creators. But this isn't a hopeless cause. A visit to the local library confirms that we can more or less agree on where things belong, if not with complete precision (a visit to the local grocery store often seems to argue the opposite).

Secondly, computer tasks such as searching content, limiting access to content, sharing content between systems and so forth depend heavily on the content being stored in a structured way. Databases are the ultimate example of structured information; however, content resists the same granularity. It's easy enough to put an employee's name into a database, but the documents he's written don't become data quite so easily.

Approaches

There are two broad approaches to developing taxonomies for content. These can be referred to as the "machine brain" and the "human brain" techniques.

Machine brain techniques involve pointing complicated and rather clever mathematical algorithms and fuzzy logic software at content and asking the machine to come up with an all-encompassing taxonomy. Autonomy, the large search software vendor founded in the UK, is perhaps the supreme example of this machine approach. With millions of dollars invested in the maths, Autonomy's indexing and classification tools do a pretty good job of making sense of unstructured content.

The human brain technique is, obviously, far more painstaking. It involves one or more people reading through content and building up a taxonomy to house it all. One of the most impressive examples of this work is the United Nations Standard Products and Services Code. This is a taxonomy for all products and services in the world. "Toothbrush" and "article writers services" each fall neatly somewhere in this giant hierarchy.

In fact, there is a third approach: what could be termed a "hybrid". Humans help to make sense of the content, create it in a structured, meaningful way, and the computer then handles the heavy lifting of searching, indexing, classifying and taxonomy building. This is the dream of the Semantic Web, the project under the stewardship of Tim Berners-Lee, the father of the Web. In this new and rather idealistic paradigm, authors of content would write and structure it in a way that made it inherently sensible. Software would then be able to query it and instantly know what it was about. With this information in hand, it is imagined, a classification system would evolve (and continue to evolve).

Practical issues

All of this may sound academic, but it has a very real impact on content management projects and content authoring in very non-academic contexts. Every day, if there are people creating information and storing it to a file directory or loading it into the content management system, there is an explicit or implied taxonomy at work.

Here are some practical suggestions for creating and maintaining a workable content taxonomy:

* Decide up-front that each piece of content (depending on your organisation, this may be at a "document" level or at a more granular "chunk" level) must belong in one place.

* Draw up a taxonomy by consulting with each content team, and allowing them to classify a sample of their content. Sorting cards is often a good technique for information structuring of this kind.

* Automated taxonomy tools should be employed with caution - what makes sense to even the cleverest algorithm doesn't necessarily make sense to humans. It may also prove tricky to understand the basis of the classification. If you need to add a category or remove one, you may not be able to get inside the mind of the computer.

* Whatever taxonomy you settle on, try to make the rules explicit. For example, you could decide that the type of product is more important than the type of news, so there is a classification hierarchy which can help guide your authors when they have to make content classification decisions.

* Use decision support rather than decision automation. Provide guidelines, not strict rules which may leave many people scratching their heads and filing mountains of things under "miscellaneous".

* And let the taxonomy evolve. One of the best ways to do this is to ask users to report cases where things don't fit. Whoever is in charge of the taxonomy, or its branches (yes, you do need such people), can then make calls about modifying or extending the taxonomy.
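
Two of the suggestions above, explicit precedence rules and decision support rather than decision automation, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `PRECEDENCE` order, the `CATEGORIES` table and the `suggest` function are hypothetical names invented for the example. The code ranks the categories an author is considering and leaves the final choice to the author; anything the taxonomy does not recognise is handed back so it can be reported to the taxonomy owner.

```python
# Hypothetical explicit rule: the type of product outranks the type of news.
PRECEDENCE = ["product", "news"]

# Hypothetical categories, each labelled with the kind of category it is.
CATEGORIES = {
    "widgets": "product",
    "gadgets": "product",
    "press releases": "news",
    "announcements": "news",
}


def suggest(candidates):
    """Rank candidate categories by the explicit precedence rule.

    Returns (ranked, unknown). The ranking is advice for the author, not a
    filing decision; unknown candidates are cases where things do not fit.
    """
    known = [c for c in candidates if c in CATEGORIES]
    unknown = [c for c in candidates if c not in CATEGORIES]
    # sorted() is stable, so candidates of the same kind keep the author's order.
    ranked = sorted(known, key=lambda c: PRECEDENCE.index(CATEGORIES[c]))
    return ranked, unknown


ranked, unknown = suggest(["press releases", "widgets", "legal notices"])
print(ranked)   # ['widgets', 'press releases']
print(unknown)  # ['legal notices']
```

Because the function only orders suggestions, an author who disagrees with the ranking can still choose differently, and the list of unrecognised candidates gives whoever owns the taxonomy a running record of where it needs to grow.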