About The Joy of English

Part 2: Many different worlds

David Crystal

So since then, there are the two sides. There’s me as author, the linguist, and then there’s the me as encyclopedia editor. And they are two very different worlds.

Do they complement each other?

They do. I find myself using quite a lot of linguistic knowledge, as it were, in the editing process – as you’ve encountered yourself. You know that when you’re editing one of these books you have to make sure that the language is accessible, intelligible and clear. Half the task of editing is to make specialist work intelligible. Most of the time - I wish we had had e-mail in those days - it was me in correspondence with, say, my chemistry editor up in Edinburgh. He’d send in his entries, and I’d say “I can’t understand these”, so I’d rewrite them and send them back.

He’d write back stating that it’s very clear, but it’s bad chemistry now. So he’d rewrite it and send it back to me, and on it went. It was backwards and forwards until we were both satisfied that it was a clear and accurate entry. That's what took all the time. So there was a kind of linguistic thing there, from the beginning, but I never expected that it would become as dominant – the link between the two sides – as it has done in the last ten years or so [in the Google era].

I can tell you how that developed. When you do a book like that [pointing to the encyclopedia], you’ve got 25,000 entries in there. It was all being held in an early computational system; an early computer.

That was early 1980s, did you say?

Yes, 1986 to 1989, it was.

That would have been a very basic word processor.

Oh, absolutely. We had a – the first version of the book – was held on an Olivetti something or other, which had a maximum capacity of 25 kilobytes.

[We both laugh loudly]

Imagine it! But gradually things improved. The electronic version was there and therefore simple searches were possible, and we were getting questions once the book came out, like - a sixth former would write in and say: ‘We have bought your book. Can you tell us how many entries there are on the Napoleonic Wars? Can you tell us where we can find everything to do with these wars?’

Well, we simply realised that we needed to classify all our entries in detail. So that’s what we did. Every entry was classified in terms of what it was about. For example, the entry on Churchill would classify Churchill as being a politician, as a journalist, as a painter and so on. There’d be about 10 categories that Churchill would be assigned to. Now, this classification system was there just to answer simple queries from readers. I had no idea at the time that it was going to develop in the way it subsequently did.
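As a rough illustration of the multi-category classification described above, the sketch below assigns each entry a set of categories and inverts that mapping so a reader's query ("everything on the Napoleonic Wars") becomes a simple lookup. Only Churchill's three labels come from the interview; the other entries, the label names, and the function names are assumptions made for the example, not the actual Cambridge database design.

```python
from collections import defaultdict

# Hypothetical sample data. Churchill's labels are the ones named in the
# interview; the other entries and labels are invented for illustration.
ENTRY_CATEGORIES = {
    "Churchill, Winston": {"politician", "journalist", "painter"},
    "Napoleon I": {"politician", "soldier", "Napoleonic Wars"},
    "Waterloo, Battle of": {"battle", "Napoleonic Wars"},
}


def build_category_index(entry_categories):
    """Invert entry -> categories into category -> sorted list of entries."""
    index = defaultdict(list)
    for entry, categories in entry_categories.items():
        for category in categories:
            index[category].append(entry)
    return {category: sorted(entries) for category, entries in index.items()}


INDEX = build_category_index(ENTRY_CATEGORIES)


def entries_about(category):
    """Return every entry classified under the given category."""
    return INDEX.get(category, [])


print(entries_about("Napoleonic Wars"))
print(len(entries_about("Napoleonic Wars")), "entries")
```

Because one entry can sit in many categories, the same record answers both "which painters are in the book?" and "which politicians?" without being stored twice.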

So, we’re developing this whole array of encyclopedias right through until 1995, but in 1995 Cambridge University Press had a change in policy. They decided that they didn’t want to do encyclopedias any more. Being an academic firm, they were always a bit dubious about going down the encyclopedic road. It was trade, not academic. While they were happy to do it, eventually they decided not to do it any more. It was primarily a question of economics. It was a costly enterprise, you know. To keep my operation going was costing them money. The encyclopedias were selling, but they weren’t bestsellers in any sense of the word - they were selling in the thousands. This was not a hugely profitable exercise for them. Anyway, they decided to sell the database that I had accumulated, which at this time was very, very large because there was all the biographical material, as well as all the other things.

So it was still owned by them, though it was written by you?

Oh yes, I had moral rights but I didn’t own the stuff at all. I was just a consultant in it.

So they then sell the database to a Dutch-based IT firm called AND, which was based in Rotterdam and had been in the IT business for 10–12 years, an established firm. And they were interested in having an encyclopedic database – why? [This was 1996 now] Because the guy in charge there had a vision. This is pre-Google, remember, the web had been invented in 1991 and that was beginning to build up and people were starting to see the potential of it. So this guy at AND said that he thought that the future was in search, and taxonomy, and he was looking for a classification system that would work for the web. He thought that this would be a good way forward. So he bought the thing – not for the encyclopedia, but for the classification system that was behind it. I was tasked with the role of developing the classification system for the Internet, at the time. We were still able to carry on publishing encyclopedias. Ironically, Cambridge did want to carry on for a while, so they licensed the data back to themselves from AND, and we carried on with the encyclopedias. You can see the fourth edition there [points], which was the year 2000.

Then CUP stopped being interested completely and Penguin took over. Penguin decided that they wanted an encyclopedia series. So what was the Cambridge encyclopedia became the Penguin encyclopedia, and since 2000 to this year we’ve been doing Penguin encyclopedias of one kind or other. There is, in fact, [goes to his bookshelf] very little difference between this book [Cambridge], and this book [Penguin]. They’ve stopped now, and they don’t want to do any encyclopedias either. But that was a big period for us.

So that was still going on, but the main aim was to develop the classification system. That is what I was told to do. So I did.

Did you have an interest in computer systems at the time?

Only a peripheral interest. We’d used them, you know. I’d never researched them though, and certainly never researched the issues of how you manage data on a computer, before. But, the task seemed simple and straightforward. If you have to find something in a database, it’s exactly the same task as how you find something in an encyclopedia. You just had to make sure that the categories were broad enough. The categories that we needed for the Cambridge Encyclopedia were things like history, geography, science and literature etc. On the Internet it’s things like refrigerators and cars, and sex – and all those sorts of things. As you know, there’s no sex in the Cambridge Encyclopedia. You would hope not! [laughs].

This was pre-sex!

Yes, that’s right! [laughs]. So I had to develop the classification system to include all those things, and that's what I did. Once the taxonomy had become more wide-ranging and covered the breadth of the Internet, then the question was: what sort of product – this was a commercial firm – could we devise to demonstrate the power of this classificatory system? And the first product we developed was a search-engine assistant.

To illustrate that: if you’re in Google now and you are interested in the economic sense of ‘depression’, and you type in ‘depression’, you’ll get millions of hits all to do with the psychiatric sense of depression. How do you solve that? You solve it by anticipating all the senses of depression that exist in the English language, so that, when you type in ‘depression’, basically a menu comes up asking which sense of the word depression you mean? The climate sense? The geographical sense? The psychiatric sense? The economic sense? And so on. And you click on the right one and a filter then operates. Only the sites that are relevant to that particular sense turn up.
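The menu-then-filter idea described above can be sketched in a few lines. This is a minimal illustration, assuming every page has already been tagged with the sense in which it uses the query word. The four senses of "depression" are the ones listed in the interview; the page URLs, the tagging, and the function names are hypothetical.

```python
# Hypothetical sense inventory: the senses of "depression" are those named in
# the interview. A real inventory would cover the whole vocabulary.
SENSES = {
    "depression": ["climate", "geographical", "psychiatric", "economic"],
}

# Invented pages, each pre-tagged with the sense in which it uses the word.
PAGES = [
    {"url": "example.org/slump-1929", "word": "depression", "sense": "economic"},
    {"url": "example.org/mood-disorders", "word": "depression", "sense": "psychiatric"},
    {"url": "example.org/low-pressure", "word": "depression", "sense": "climate"},
    {"url": "example.org/recession-guide", "word": "depression", "sense": "economic"},
]


def sense_menu(query):
    """Return the list of senses to offer the user for this query word."""
    return SENSES.get(query.lower(), [])


def filter_hits(query, chosen_sense):
    """Keep only the pages that use the query word in the chosen sense."""
    word = query.lower()
    return [
        page["url"]
        for page in PAGES
        if page["word"] == word and page["sense"] == chosen_sense
    ]


print(sense_menu("depression"))
print(filter_hits("Depression", "economic"))
```

The code itself is trivial; the hard part, as the interview goes on to say, is building the sense inventory for every word.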

It’s a very simple idea, but to implement it you have to go through the entire English language word by word by word, sense by sense by sense. You have to make sure that you have anticipated all the senses of all the words that could be used in a search enquiry. Well, that’s what we did. It took four years and a team of about 40 lexicographers based mainly in Oxford - that’s where AND were. And then it was done. And once it was done it was a stunningly powerful tool, as you can imagine, an immensely powerful tool.

By the year 2000, this enormous research project was ready to go. And it would have gone, except that in 2000 AND went bust. They went bust because they overextended themselves. They were buying up all sorts of companies all over the place. They ran out of money and went into liquidation - this is the Dutch system, so you can’t actually disappear, they still exist. But they immediately dropped all R&D developments of our type. This was one of many projects, and we were in Hamburg somewhere, Hilary and I, and we got a phone call saying: “Right, no more”. Sorry? My team was working here, four or five people working full time – in this room – who I had to ring up to say there was no more work. It was an awful, awful moment because this was a very depressed area, and still is, economically, and four jobs going just like that. Just horrible.

When we got back to the UK, myself and a colleague then thought: 'What are we going to do about all this?' All that research gone to nothing, all the encyclopedia stuff wasted. So we decided to have our own company, which we called Crystal Reference Systems. That was in 2001. We bought the assets first - they were going quite cheaply. So, we thought that we’d try and make a go of this.

Crystal Reference Systems was set up with two divisions: Crystal Reference and Crystal Semantics. The reference side was to look after the encyclopedias, and the semantics side was to develop the Internet side of things. We got this new contract with Penguin; but the other side needed investment. It was quite difficult. In the end we got some investment from Finance Wales and local start-up business support. This enabled us to develop the IT side of things in the way that AND were hoping to develop, but which never went ahead.

Think of us now, in 2001 and 2002, with techies now on board to develop the software to implement the approach that I’d been developing linguistically years before. This room was getting fuller and fuller, as you can imagine. The one thing that I hadn’t anticipated was how many possible applications there would be of the technology that had been devised a few years previously. There were at least the following: search-engine assistance, as I told you earlier. Second, automatic document classification: if you hold all your documents electronically, and you want them classified, our system could do that because it had already anticipated all the words and senses. Third area, e-commerce: you go to a shop online, you type in something and it says that they haven’t got it. They have got it, it’s just that they are using the wrong words. I remember once, I typed in ‘Mobile Phone’, and this big firm said ‘we haven’t got any mobile phones’. I typed in ‘Cell Phones’ – 'we have no cell phones'. I typed in 'cell' hyphen 'phones'. No. Couldn’t get it. The only thing that worked was ‘Cellular Phones’. Stupid. But our system could solve that problem.
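The mobile-phone anecdote above comes down to mapping every wording of a query onto one underlying concept before looking anything up. The sketch below is one way that could work, not a description of the Crystal system: the four query wordings are the ones quoted in the interview, while the concept identifier, the catalogue, and the function names are assumptions.

```python
import re


def normalise(term):
    """Lower-case, turn hyphens into spaces, and collapse extra whitespace."""
    term = term.lower().replace("-", " ")
    return re.sub(r"\s+", " ", term).strip()


# Hypothetical synonym table: each known wording points at one concept id.
CONCEPT_OF = {
    "mobile phone": "mobile_phone",
    "mobile phones": "mobile_phone",
    "cell phone": "mobile_phone",
    "cell phones": "mobile_phone",
    "cellular phone": "mobile_phone",
    "cellular phones": "mobile_phone",
}

# Invented shop catalogue, keyed by concept rather than by the shop's wording.
CATALOGUE = {
    "mobile_phone": ["Cellular Phones"],
}


def search(term):
    """Resolve the query to a concept, then return the shop's listings for it."""
    concept = CONCEPT_OF.get(normalise(term))
    return CATALOGUE.get(concept, [])


for query in ["Mobile Phone", "Cell Phones", "cell-phones", "Cellular Phones"]:
    print(query, "->", search(query))
```

With the lookup keyed by concept, all four wordings from the anecdote reach the same shelf, whatever label the shop happens to use.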

And that’s just the computer doing what it’s been told.

Absolutely. And being badly linguistically programmed - but you see, the linguistics is what we knew all about. We knew what the problems were. The fourth area was contextual advertising: advertisements on the web, which were becoming very big, but they were often irrelevant.

One of the earliest cases we encountered was a CNN story about a street stabbing in Chicago, and the ads down the side said ‘Buy your knives here. Get your knives on eBay’ and so on. Now, our approach would avoid that because our system would analyse the whole contents of the page and work out that it was not about cutlery but about a homicide. It’s a combination of dictionary and encyclopedia. All the words in the language have been analysed, and their senses, and assigned to encyclopedic categories.
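The knives example can be illustrated with a toy whole-page lexical classifier. This is only a sketch under simplifying assumptions (a tiny hand-made word list, one vote per word occurrence, no handling of ties); the word-to-category table, the category names, and the function names are invented and do not describe the actual system.

```python
from collections import Counter
import re

# Hypothetical lexicon: each word lists the encyclopedic categories its senses
# belong to. "knife" is ambiguous on its own; the rest of the page decides.
WORD_CATEGORIES = {
    "knife": {"cutlery", "crime"},
    "knives": {"cutlery", "crime"},
    "stabbing": {"crime"},
    "police": {"crime"},
    "victim": {"crime"},
    "homicide": {"crime"},
    "fork": {"cutlery"},
    "spoon": {"cutlery"},
    "dinner": {"cutlery"},
}


def page_category(text):
    """Let every known word on the page vote; return the top category or None."""
    votes = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for category in WORD_CATEGORIES.get(word, ()):
            votes[category] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def should_show_ad(ad_category, text):
    """Show an ad only when the page as a whole is about the ad's category."""
    return page_category(text) == ad_category


news = "Police say the victim of a street stabbing in Chicago was attacked with a knife."
shop = "Dinner set: knife, fork and spoon."

print(page_category(news), should_show_ad("cutlery", news))
print(page_category(shop), should_show_ad("cutlery", shop))
```

The word "knife" appears on both pages, but the surrounding vocabulary pulls the news story toward the crime category, so the cutlery ad is withheld there.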

And fifth, another area was Internet security, and in particular child protection. The same system can be used to analyse the speech of anybody, say a paedophile in a chatroom, and work out that this is not a nice guy and warn the people. It operates by analysing such sentences as ‘What are you wearing?’. The entire system is based on words, not on grammar or anything like that – purely on a lexical analysis of the content of what’s being said.

So, there were all these possible applications. We were developing them all simultaneously because we didn’t know which one would sell.

You must have been itching, seeing all of the possibilities.

Absolutely, and we had no idea which one would sell. We were only a small operation. It was just me, the MD and Hilary, as well as our editorial people. But there was no sales director or marketing director, so poor old Ian [the MD] had to trawl around trying out whether people were interested in this sort of thing. Of course they all were.

We went all over the place, like Silicon Valley, and talked to people like Yahoo, and we thought that we’d get contracts very very quickly – but it turned out not to be so. We got a few small-scale contracts, enough to keep us going, but the big ones didn’t come through. Why? Because the firms we talked to said: ‘yes, we can see the potential of your system, but we’ve already invested so much in the system we already have that we are waiting to see how that evolves.’ We would say: ‘look, your system is crap, look at the knives on eBay example’ and their response was, ‘yes, we know, but we’re solving that in our own way’. And they’re still crap. But we didn’t get the kind of “Hey, you’re great. Here’s $100 million” response that we half hoped we’d get.

Which is deserving in a way, given all the dotcom ideas.

Yeah, but it didn’t happen, so we trundled along until about 2005; still developing everything, still using our investment money, contracts are bringing in a little bit, but there came a point when we said we can’t continue to carry on like this. We are going to have to sell. We were quite well known at that time, and once it was known that we were interested in an “exit strategy” – I think I learnt the term at that point – there was quite a lot of interest initially. Eventually we were bought by Ad Pepper Media, which is a European-based advertising company. It could have gone in any of these five directions, but they were the ones that came in.

So since 2005, we have been focusing entirely on the contextual advertising solution (of the five) and that’s what’s happened. The product is now out there. The product is called iSense and it’s a big contextual advertising solution. Ad Pepper are rolling it out all over Europe. The search engine assistant hasn’t gone ahead. But anyway, it’s going very well and all the signs are that it’s going to be OK – Ad Pepper will be happy – and eventually some of the other applications of the thing will go ahead.

So you approached Yahoo. Google as well? Was that around the same time [when Google started]?

We’ve approached all sorts of people over the years and in one or two contexts the response has been good. Google had already got its act together, as they thought – it still isn’t brilliant – but it’s better than it was. It’s just that we got the line: 'we've already invested so much in what we’ve got, and we don’t want to start another one'. And that’s still the general attitude.

All these guys think that the solution is so easy, that all you’ve got to do is think of an algorithm that will take a page, analyse the page and then you get the results. It isn’t like that. These pages are complicated things. Web pages are typically multisemantic. They don’t just talk about one theme, they talk about several themes. You can’t just read the top part of a page, like most systems do, and get the right answer all the time. It takes a lot of thought.

Like that Churchill example you said earlier.

Exactly like that. So what these firms do is employ teams of people, 80 or 100 people in a room – I’ve seen it – where they are calling up web pages, working out what they are about, and classifying them manually. That’s if they understand it. But our system can do it automatically. I thought that it would be so obvious, that it would sell and sell and sell.

But for some reason there’s a kind of reluctance to encounter it. Not entirely, as I say. Some firms have done this and the interest is undoubtedly there. Although all this was developed 10 years ago, this iSense approach was put up for a product innovation award just last month and won. You know, as an innovation! So as far as most of the world is concerned, this is a new thing even though it's 10 years old. We have patents on it since 1998, a pre-Google patent, on the whole thing. And you might say that there are companies out there breaking our patents? There probably are, there might be, but can we take them on? Can we fight a big company? Not a chance. A patent is a curious thing, I’m not sure how useful it is at the end of the day.

Anyway, the result is that this year a lot of the technology and the thinking that has gone on in the last 10 years is finally coming to the boil. And another, that will interest you particularly, is that it’s not just in English. Ad Pepper were extremely interested, because they’re in 13 European countries, in having the stuff translated; having all our list of words and all the senses, translated into the main European languages. So, I had to supervise that project a couple of years ago. We got a team of translators and they translated all the words and senses.

How many words and senses?

A couple of hundred thousand, something like that. They began translating into Finnish, Swedish, French, German and all the rest of it. A huge project. That’s being done now, and exactly what Ad Pepper are going to do with it, I’m not sure. It takes a while to roll out a new product.

That must be over a million words.

It is, you’re absolutely right. It’s enormous. And now iSense has been rolled out in Germany, France and I think Denmark is next. It’s not my role to worry about the selling side of things. But notice: a translation exercise is not easy. It isn’t just a matter of translating different words into different languages, you have to translate the cultural concepts as well. It’s all very well having a set of key words for say, refrigerator, in English. But you can’t translate those into Swedish because the same refrigerators aren’t sold in Sweden. There are different companies involved, so you have to find out what those firms are. It’s a cultural exercise as well as a linguistic exercise.

There’s another aspect to this, and that is that in the advertising world, people are very interested in appropriateness now. In other words, we want our ad on that page not just because it’s relevant to the page but because we feel it’s an appropriate ad for that page. Now, if the page is about, say a porn site, we don’t want our ad to be on that page – we don’t want to be associated with that page. Or, we do want to be associated with it. Whichever way it goes. So the other thing we had to develop is now a product called Sitescreen. Sitescreen actually looks at a page and decides whether it has any sensitive content: whether it’s about sex, or drugs or smoking - all the taboos. There are about 12 of them. Extreme religious views, all of this. Sitescreen actually works out if a site has sensitive material by analysing the words in the way I mentioned before.

So that had to be developed, as another part of the filtering technique. And suddenly, after all this time, these products are starting to be out there. I'm waiting to see what happens with some interest.

To go back to the beginning of the story, this is not my world, advertising. But the challenges for me were the intellectual challenges of how you solve the problems that these guys have come up with. Well, I know how you solve it now. I know how to do it, so it’s less interesting to me. If it makes millions, then of course I’m interested – I’m a human being – but as a linguist I’m not interested anymore. So these days, my R&D work for Ad Pepper is, on the whole, not so hands on. Every now and then something comes along that is interesting, but I’ve become more part time. So now I’m back in my more familiar world of writing books on language.

Continue reading Part Three of the David Crystal interview.