Case study: Notes on Underware Latin Plus
November 2014

Recently we introduced Latin Plus, which is at the heart of our production line. Everything we do is based around this standard. Not only all our fonts, but also some upcoming tools are based on this touchstone. Because you’ll encounter this term more often in the future, here is the background story of Latin Plus and its languages.

For Europeans it might be obvious that you can set texts in English, as well as Polish or Turkish with the same typeface. Europe accommodates dozens of different languages, so most Europeans are aware of the importance of a large diacritic support. But for North Americans it is as essential as for Europeans to use typefaces with a large language support. While many people think of the USA as a monolingual country, the opposite is true. It’s well-known that there is a large Spanish-speaking population, but in reality over 300 languages are spoken in the USA. You think Igbo and Swahili are only spoken far away, deep down in the heart of Africa? Think again. Over the past decade, the number of Igbo and Swahili speakers in the USA doubled. You want to be able to reach them as well, right? More than 20% of the population in the USA speaks another language at home. Another example (on a very small scale) is Amsterdam, capital of the Netherlands. With only 800.000 citizens, it accommodates the highest number of different nationalities in one city worldwide: 180. Monolingual cultures don’t exist anymore in the civilised world.

Swearing in Polish is much easier nowadays.

Broader language support
The need for a larger character set, with more accented characters, wasn’t that obvious when we started making typefaces in the nineties. Due to technical limitations, Eastern European characters always gave trouble when, for example, transferring a document from Mac to PC. Eastern European character sets had to be put in separate fonts, causing a lot of inconvenience. In our current digital environment, we don’t accept these kinds of technical issues anymore. Luckily these problems are also easy to avoid, because today, fonts have the capacity for large character sets. If you include the right Unicode support, all the old technical problems are eliminated. Compared to when we started making fonts, we now have the option to include all Eastern European accented characters (as well as many others). In that sense, there are no technical limitations anymore. So, let’s do it. Oh, wait. Do what?

To realize a broader language support, we needed to know exactly which languages our fonts already support, but also could support in the future. The introduction of Mac OS X 10.3 suddenly offered automatic previewing of the supported languages of any font on a computer. Out of the blue, long lists appeared of sometimes more than 30 supported languages, even for old PostScript fonts you may have already had on your computer for more than a decade. Without knowing exactly how these lists of supported languages were composed (well, that info is derived from here), these ‘supported languages’-lists were copied and used by font foundries and others. These lists were however incomplete, and also gave a distorted view of the naked truth. Alongside that, we have trouble using information without knowing exactly what it is based on, where it comes from, and which decisions were made while compiling it. For a type designer and a font foundry this is something too essential to be taken for granted.

Language research
Some years ago we researched the use of diacritics in Latin languages. At that time we needed to define a reasonable character set for the updated versions of all our fonts. Our aim was to support as many Latin languages as possible. This research is not done in a second, because orthographic documentation is spread over many different sources and places. Various orthographic sources can be found, but not a single one was as extensive as we wanted our research to be. And often the sources didn’t agree with each other. The only option was to do our own research. For some extinct languages we had to rely on an orthography defined by one single professor, that being the only source available. In some unfortunate cases, we couldn’t find any convincing orthographic documentation at all.

Eventually we defined a certain character set which allowed our fonts to support more than 200 Latin-based languages. We call that the Underware Latin Plus character set. This standard guarantees extensive Latin language support and excellent details for typesetting specific languages, for any font which meets this standard. Meanwhile all our fonts have been updated to contain all these additional accented characters. Which means all our fonts now support at least 219 languages. You only need 436 characters per font to do so. For those who enjoy more stats: 2.103.569.421 speakers can be reached with the Latin Plus character set. That’s 30% of the world's population.
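What "supporting a language" means can be sketched in a few lines: a font supports a language when every character that language requires is present in the font. The snippet below is only an illustration under stated assumptions. The function name, the two tiny character sets and the "old" and "new" fonts are made up for the example (lowercase accented letters only), and none of it is taken from the Latin Plus data.

```python
def supported_languages(font_chars, required):
    """Return the languages whose required characters are all in the font."""
    return sorted(
        language
        for language, chars in required.items()
        if chars <= font_chars  # subset test: nothing required is missing
    )


# Illustrative data only: lowercase accented letters, not the Latin Plus database.
REQUIRED = {
    "Polish": set("ąćęłńóśźż"),
    "Turkish": set("çğıöşü"),
}

# A hypothetical old font with only a few Western European accents...
old_font = set("abcdefghijklmnopqrstuvwxyz") | set("çöü")
# ...and a hypothetical updated font with the missing characters added.
new_font = old_font | set("ąćęłńóśźż") | set("ğış")

print(supported_languages(old_font, REQUIRED))
print(supported_languages(new_font, REQUIRED))
```

The old font misses at least one required character for both languages, so it supports neither; the updated one passes the subset test for both.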

The list of supported languages mentions how many (first-language) speakers each language has. The exact number of native speakers is impossible to define; estimates from different sources varied by up to 300%. Whenever sources stated different numbers of native speakers, we used the lowest number. But the number of speakers of a certain language is still imprecise. Even for English it’s unclear whether there are 340 or 380 million native speakers – or maybe even more. A difference of 40 million people, and we’re still not sure. That’s as accurate as it gets. On the other side of the spectrum are languages listed as having just 230, 50, 20, 12 or 5 speakers. Although this sounds as if we all know them by name, this number is also just an estimate. Our language list also mentions 3 native speakers for Wiradjuri, a language spoken in South-Eastern Australia. Well, to be honest: that number is incorrect, as all three died in the early 1980s. The Wiradjuri people – who call themselves Wirraaydhuurray – are still alive, but there are no native speakers left. It wasn't until after the last three native speakers died that an official orthography was defined. A standardised spelling system was approved in 1988. It’s kind of bizarre that only after all the native speakers had passed did dictionaries and textbooks become available, meaning those native speakers didn’t have a chance to write their own language. To honour those last three native speakers, we included 3 – instead of 0 – speakers of Wiradjuri in our list. Currently the Wiradjuri language is one of the five languages being revived by the Australian government. By teaching young Wiradjuri kids this language, the government hopes that they will develop a better understanding of their own Aboriginal culture, as well as keeping negative behaviour at bay. Although there aren’t any (new) native speakers yet, the number of people learning the language keeps growing.
The attitude towards Wiradjuri people and their culture has already been impacted in a positive way. By reviving their language, the Wiradjuri people have developed a stronger sense of self and of their own identity. And the kids? They are no longer ashamed of their own language.

Auxiliary languages?
Nine of the listed languages are auxiliary languages. These are languages which have been constructed by human beings, for various reasons. Some were meant to achieve world peace, others to shape the thought processes of their users, and still others were created for fictional worlds. In the past millennium probably more than 1000 languages have been created by mankind for a specific purpose, and even now new ones are born almost weekly. However, only the well-known or historically relevant constructed languages (like Esperanto) have been included in the overview of supported languages; otherwise the complete list of supported languages would have at least tripled in size.

If you look at the list of supported languages, you’ll see that all the auxiliary languages have – of course – zero native speakers. Except for Esperanto, which has 100 native speakers. Yes, that’s right. Native speakers of an auxiliary language. Think about that. Being a native Esperanto speaker is only possible if you were brought up with this language. This sometimes happened when the parents met at Esperanto gatherings, fell in love, made a baby, but couldn’t speak each other's language. Instead they used Esperanto to communicate with each other, and eventually also with their kids (at least one of both parents). These kids then acquired the language as a native language, and in turn would go on to become one of the few native Esperanto speakers on the planet. There have been just a handful of native Esperanto speakers, of which the famous Hungarian-American business magnate George Soros (1930) is one of the few survivors to date. Another, but no less notable, native Esperanto speaker is Kim Henriksen (1960). Although millions of people learn this utopian language out of personal interest and only thousands of people manage to ultimately have fluent conversations, Kim is “rapid-fire fluent”. He is truly to be considered a real native speaker of Esperanto. He doesn’t only speak like a native, he is a native. Born to a Danish father and a Polish mother, Kim was born in Copenhagen, Denmark. Although he speaks Danish fluently, he considers Esperanto to be his mother tongue. His parents, of course, met through Esperanto. As this language was the basis of their relationship, it was only natural for them to raise their son with Esperanto as the main language. Nowadays, Kim is a rock star in the auxlang scene and has been playing folk-rock music in Esperanto bands (like Amplifiki, Desperado and Hotel Desperado) for decades. It was completely logical for him and his Polish wife that they raise their son with Esperanto too. 
Their son, currently a teenage punk, is a second-generation native Esperanto speaker. Second. Generation. Native.

If you think the idea of a second-generation Esperanto speaker is crazy (it is crazy), you’ll be surprised to meet Nils Martin Klünder. His great-grandfather learned the language, taught it to his kids, who taught it to their kids, who taught it to Nils. Third. Generation. Native. Let’s just hope that this German student ends up in a foreign love affair at an Esperanto meeting, proceeds to make lots of babies, and secures an unprecedented fourth-generation native auxiliary speaker – never seen before in the history of mankind. The world of languages is full of surprises.

Soft data
Languages are not static, and are not even very precisely defined. Although that’s an essential part of languages, it hinders language investigation. In the best case, a large group of people is constantly trying to capture a language on paper and redefines the rules for that language every decade or so. If this is the case, then language researchers are lucky. Unfortunately this is only true for widespread languages like English, Spanish, German, etc. That’s a very small percentage of all languages. As soon as languages have fewer than one or two million speakers, it becomes increasingly difficult to find native speakers or, more importantly, to find an official orthography. Often it’s not even possible to figure out if a language is officially a language or a dialect. While national authorities give certain languages the legal status of ‘language’ (like Venetian), most locals may still consider their language a dialect. We’ve tried our best in defining the exact status of a language, but that's no easy task. At least seven of our supported languages are currently under debate on whether they are a dialect or a language. Quite often the arguments are not linguistic, but rather socio-political. Linguists and official authorities still can’t make up their minds.

If you’ve ever seen the Swedish Chef in the Muppet Show, you must have heard him singing “Børk! Børk! Børk!”. Not sure if Jim Henson wrote this word on purpose like this, or if it was ignorance. For Americans this might sound very Swedish, but Swedes know better: the ø (oslash) isn’t used in Swedish at all. You see, even Jim Henson could have found this overview of essential diacritics somewhat useful.

This list of essential diacritics has been compiled by Underware with the greatest possible care and has been manually triple-checked. As the orthography of many languages is not always set in stone (e.g. Silesian or Ladin) and because local variations become more prominent when a language isn't very widespread (e.g. Piedmontese), it’s impossible to create a perfect overview of essential diacritics for each language. However, we have worked on constantly improving this overview over the last couple of years. Although we might have made some mistakes, the overview should, overall, be somewhat reliable; at least, this is as good as we could make it. Over 260 Latin-based languages have been investigated, of which 43 Latin languages have been considered too exotic to be supported by our fonts, for example because those languages require characters which don’t have a Unicode code point, or because there aren’t any design standards for some required exotic characters.

It’s also good to know we used the better-safe-than-sorry approach while defining the essential diacritics per language. Wherever possible, we consulted four or more different sources, offline as well as online sources (in a few cases we couldn't find more than one orthographic source). In case of contradictions between those different sources, and when it wasn’t possible to investigate the most plausible orthography, we included all characters mentioned in those different orthographies for one language. So if you think a certain character should not actually be included in a certain language, chances are that we also consulted another source. Besides: it’s always better to have one (doubtful) character too many, than risking a shortfall in case a document requires a diacritic which is not included in the font.
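The better-safe-than-sorry rule described above boils down to taking the union of what the different sources list. Here is a minimal sketch of that idea; the function name and the three "sources" are invented for the example and do not come from any real orthography.

```python
def essential_characters(sources):
    """Merge the characters listed by several sources into one set.

    When sources contradict each other, every character mentioned by
    any of them is kept: one doubtful character too many is preferred
    over a missing one.
    """
    merged = set()
    for chars in sources:
        merged |= set(chars)
    return merged


# Hypothetical sources for one unnamed language (made-up data).
source_a = "áéñ"
source_b = "áéü"
source_c = "áé"

merged = essential_characters([source_a, source_b, source_c])
print(sorted(merged))
```

In this made-up case ñ and ü each appear in only one source, yet both end up in the merged set.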

Language cannot be caught in a fixed, static list which lasts for eternity. Future changes, improvements or additions to this overview are very likely. Please check our website for the latest version. Comments are more than welcome, and help us to keep this overview of essential diacritics up-to-date.

Translated sample texts
Reading about a language you have never heard of can be rather abstract, therefore we included a sample text. The sample text consists of four lines of text, presented in over 100 translations. All these different translations show the visual differences between languages. English looks completely different to Estonian. Greenlandic looks very different to Spanish. Reading them out loud and hearing their various sounds and intonations gives you a clue as to what each language sounds like. You might even be able to more easily imagine what their culture is like if you also imagine a village in which this language is spoken, imagine the accompanying temperature, what they eat and what their houses look like. Language is a symbol of local culture, and visualising this language helps to visualise that culture. By reading these translated texts you can teleport yourself from the jungle in Papua New Guinea – where they speak Tok Pisin – to spotting a rhino in Zululand, South Africa – where they speak Zulu. You can imagine having a conversation with Spock in Star Trek while reading the Klingon text. Or attending a concert by Vladimir and his band Noid in Petrozavodsk (Karelia, Northern Russia) – where they speak Vepsian – and soon after suddenly seeing yourself sunbathing at Ipanema beach in Rio de Janeiro, Brazil – where they speak Portuguese. Travel the world from the comfort of your chair, just by reading these translations.

Language is a fascinating aspect of culture because it stresses local characteristics of society. Take a look at Belgium for example, a very small country. Remarkably enough, that small country has three official languages (Dutch, French & German), with all the problematic cultural and political consequences that come with it. But it’s even more remarkable that we had to convince many of our Belgian friends and colleagues that a fourth language exists in their own, small country: Walloon. Not a single Belgian admitted to knowing about this language; most Belgians actually thought it was the same as French. Well, it isn’t. Walloon is a very little-known language outside Belgium, and even within its own country, because it doesn’t have an official status. It’s not used in education, and is hardly ever used in writing. After a serious search we finally found somebody who ‘knew of the existence’ of this language. It took us even longer to find somebody who actually speaks Walloon and could translate our sample text. So even Belgians can learn about their own little country by reading these translated sample texts.

Nobody seems to know exactly how many languages there are on earth; estimates vary from 5000 to 7000. Harsh prediction: the majority will probably become extinct within a century, which is another reason to treasure some of these translated samples. And although thousands of languages exist, it’s funny to see how difficult it is to arrange more than 100 different translations of a sample text. Even the world’s most universal document – the Universal Declaration of Human Rights – has been translated into ‘only’ 370 languages. What about the other 6600 languages? Although the United Nations received a Guinness World Record for “the most universal document”, the 518 translations of the Bible stand supreme. If you do the math, you’ll see that despite its effort during the last millennium, the Catholic Church left 6500 languages without a full translation of the Bible.

Because we’re not an organisation as large as the United Nations or the Catholic Church, we need to manage with what we’ve got. The first 50 translations are peanuts, after that it becomes slightly more difficult; after 80 it becomes really hard. Just 80! When there are thousands more languages. That’s, well, shocking. As a matter of fact 93% of the languages are not easily accessible, just because they are not connected to the internet. They don’t partake in the global world. They don’t even partake in the global world of the United Nations. That’s much more worrying than language extinction.

While translating this sample text we experienced four different levels of difficulty. The first is that translation requests were fulfilled immediately, with a perfectly translated text. No effort required. This happened with the first 40 languages, which have (roughly) at least millions of speakers each. Think of Spanish, German, Polish, Italian, etc. Done in a second. The second saw translation requests met with questions. “Is it okay to replace a ‘flycatcher’ with another local bird?”, because some animals just don’t exist in certain languages. This made the task of translating a simple text more difficult. That happened to the next 30 languages, which roughly have 100.000 to 1 million speakers each. Think of Silesian or Scottish Gaelic for example. After 70 translations it became harder and harder to acquire a translation, reaching the third level of difficulty. A translation request was mostly answered elaborately and thoroughly. The answer consisted of the original English line of text, followed by a line of text translated into a certain language, which again was followed by an English translation of the translated line of text. This was done to illustrate what the translation actually says. From English, to another language, back to English: the two English versions sometimes varied greatly.

Languages which have fewer than 10.000 native speakers (like Vepsian), extinct languages (for example Old Norse), or constructed languages (like Occidental) fall in this category. This category was often the most amusing to read and resulted more often than not in an email correspondence about different plausible orthographies for a certain language. Sometimes that correspondence was with university professors who worked with the last native speakers before a language became extinct. The more energy it takes to get a certain translation, the more satisfying it is when you succeed. The fourth category is the toughest one: there is still a large group of languages for which we haven’t yet managed to get in touch with native speakers. Think of some small Aboriginal languages, or many languages east of the Caspian Sea, or deep down in Africa. Although we’re not the kind to give up easily, we have to admit we haven’t managed to get in touch with all language groups around the globe so far. The good news is we’re still working on it, so expect more translations on our website in the future. And if you happen to speak a language of which we don’t yet have a translation: please get in touch with us!

If we’re being totally honest, there is also a small fifth category: people offering a translation in a language we'd never heard of until then. In some cases, Old Icelandic for example, the language turned out to be already supported by our fonts. As you can imagine, a joyful category!

The haiku-like text – of which the translation formed the basis for all language samples – first appeared when we were interviewed by a Japanese magazine. The Japanese spirit is therefore present throughout this whole section. We liked the poem in English, but maybe like it even more now in other languages. Maybe it’s because we are not native English speakers, and therefore doomed to Pidgin English, that we can be uplifted by all these different sounds.

Don’t be a cuckoo if you’re a nightingale.
Don’t be a nightingale or a flycatcher, if you’re a dog.
But anyone can make sound.
We are Underware.

A new life for our language research
Recently this overview of essential diacritics per language has proved to be relevant once more. When using fonts on the web, the file size sometimes matters. When you bought type in the good old metal days, the weight in kilograms was always relevant because it defined the price. The weight was often determined in type specimens. Nowadays the weight in kilobytes can still be relevant in terms of reducing loading time of webpages. Therefore we implemented our diacritic research in our webfont subsetter. This subsetter allows our customers to customise their fonts in several ways, such as picking a preferred figure style, or putting small caps in the position of lower- as well as uppercase; one can also select specific languages if that’s all that’s needed. This could potentially save time when loading a website. Having language-specific subset options required an online database of diacritics per language. We were lucky in that we had already collected that information. To implement this research in our online subsetter, we had to make this data available online. For this purpose, the diacritic overview moved from our local desktop to our server. Instead of hiding that database far away in just our subsetter, we thought it would be nice to make that data easily accessible, for anybody who is interested. Happy browsing! Welcome to Underware Latin Plus.
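To give an idea of how a language-based subset could be derived from such a database, here is a rough sketch. It is not the actual subsetter: the names `DIACRITICS`, `subset_for` and `as_codepoints` are hypothetical, the two-language table is illustrative (lowercase letters only), and a real subsetter would also deal with uppercase, figures, punctuation and OpenType features.

```python
import string

# Unaccented letters every subset keeps (an assumption for this sketch).
BASIC_LATIN = set(string.ascii_letters)

# Illustrative stand-in for an online database of diacritics per language.
DIACRITICS = {
    "Polish": set("ąćęłńóśźż"),
    "Turkish": set("çğıöşü"),
}


def subset_for(languages, font_chars, diacritics=DIACRITICS):
    """Characters to keep when only the given languages are needed."""
    wanted = set(BASIC_LATIN)
    for language in languages:
        wanted |= diacritics[language]
    # Only characters the font actually contains can be kept.
    return wanted & font_chars


def as_codepoints(chars):
    """Format characters as U+XXXX strings, sorted by code point."""
    return [f"U+{ord(c):04X}" for c in sorted(chars)]


full_font = BASIC_LATIN | DIACRITICS["Polish"] | DIACRITICS["Turkish"]
turkish_only = subset_for(["Turkish"], full_font)
print(len(full_font), "characters in the full font")
print(len(turkish_only), "characters in the Turkish subset")
print(as_codepoints(turkish_only - BASIC_LATIN))
```

Fewer characters in the subset means fewer glyphs to ship, which is where the saving in kilobytes would come from.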