More about UCSUR

First, What Is It?

This is an explanation of UCSUR for a non-technical audience. You can also choose to read this high-level technical explanation on sona.pona.la, or skip to "UCSUR Sitelen Pona vs ASCII Sitelen Pona" if you are already familiar with how it works.

Sometimes when you're browsing on the internet, characters like these crop up:

           

It's a miracle that it doesn't happen more often.

Character Encoding Standards

It's a miracle that we're able to represent letters and words on computers at all.

At the end of the day, all data is binary: a combination of ones and zeros.

I typed these words out on my device and now you are reading them on your device. In order for your device to show you the same letters that I intended to send, they have to agree on which combinations of ones and zeros are used to represent each letter.

Hence computer people around the world developed character encoding standards: agreements on how to represent each character as data.
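To make the "agreement" idea concrete, here is a small sketch of what happens when two devices disagree. It is only an illustration: the letter and the two standards (UTF-8 and Latin-1) are my own picks, not anything specific to Sitelen Pona.

```python
# The same bytes, read under two different agreements.
text = "é"
data = text.encode("utf-8")      # the sender writes the letter as UTF-8 bytes

right = data.decode("utf-8")     # the receiver uses the same standard
wrong = data.decode("latin-1")   # the receiver assumes a different standard

print(list(data))  # the raw numbers that travel between the devices
print(right)       # the letter the sender intended
print(wrong)       # what shows up when the agreement breaks down
```

The numbers being sent never change. Only the receiver's guess about what they mean does.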

You can imagine that in the beginning, they were really trying to limit how many digits they had to send each time. The devices were only so powerful.

Let's say that each character was represented by a group of 5 digits: each of them ones and zeros. You could do A as 10000, B as 01000, maybe C as 00100? With 5 digits, you had a max of 32 combinations to play with: meaning that you could represent 32 letters. The UK worked out their codes. Continental Europe worked out... different codes. The French really wanted É.

5-bit encoding, with its total of 32 combinations, was not enough. Going up to 6-bit encoding allowed for 64 combinations, which also wasn't enough.

At 7 bits, with 128 total combinations, people writing in English finally felt comfortable. The American Standard Code for Information Interchange, also known as ASCII, became quite an influential standard. It allows for uppercase and lowercase letters, a wide variety of punctuation symbols, and control characters like 'Tab.'
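If you'd like to check the arithmetic yourself, here is a short sketch. The helper name `combinations` is made up for this example; the character numbers come from Python's built-in `ord`, which agrees with ASCII for these characters.

```python
def combinations(bits: int) -> int:
    """Number of distinct patterns that `bits` binary digits can form."""
    return 2 ** bits

# Each extra digit doubles the number of available patterns.
for bits in (5, 6, 7):
    print(f"{bits} bits -> {combinations(bits)} combinations")

# ASCII gives every character a number below 128, so 7 digits always suffice.
for ch in ("A", "a", "\t"):
    print(repr(ch), ord(ch), format(ord(ch), "07b"))
```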

That was what was happening in the English-speaking world. Early products exported from the English-speaking world often did not support other scripts.

Massive work was done on an international level to create a universal coded character set, capable of representing scripts of all languages.

Which brings us to Unicode, which has room for over a million characters. UTF-8, its most widely used encoding, builds each character out of one to four 8-bit chunks. On the international level, most devices now use it.
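As a rough illustration of how UTF-8 stretches to fit, the sketch below counts how many 8-bit bytes it spends on a few characters. The sample characters and the helper name `utf8_length` are my own choices for the example.

```python
def utf8_length(ch: str) -> int:
    """How many 8-bit bytes UTF-8 uses to store this one character."""
    return len(ch.encode("utf-8"))

for ch in ("A", "É", "文", "😀"):
    print(ch, f"U+{ord(ch):04X}", utf8_length(ch), "byte(s)")

# Unicode code points run from 0 up to 0x10FFFF inclusive,
# which is where the "over a million" figure comes from.
print(0x10FFFF + 1)
```

Plain English letters still take a single byte, which is part of why UTF-8 caught on: old ASCII text is already valid UTF-8.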

But Unicode does not currently encode Sitelen Pona.

Sitelen Pona Fonts

How do you make a font to represent a character that doesn't exist according to most of the devices out there?

Here are the two workarounds that fontmakers have figured out:

First Option: Ligatures with ASCII characters

Fonts themselves are complicated.

Font-builders have access to a feature called ligatures.

Enabling ligatures gives you the ability to combine letters. In English serif fonts, fontmakers draw out the letters 'f' and 'i' independently, but often when 'fi' occurs as a combination, they'll draw out a new combo-letter that squishes the two letters together, where the bar of the f is located over the i.

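What Sitelen Pona font makers have done is assign a new combo-letter character to every Sitelen Pona word. You type the Latin letters of the word, and the font draws the single Sitelen Pona character in their place.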
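Here is a toy sketch of the idea, not how any real font does it. Real fonts keep their ligature rules in internal font tables; the `LIGATURES` table, the glyph names in angle brackets, and the `shape` function below are all invented for illustration.

```python
# Hypothetical ligature table: a run of letters -> the name of one drawn glyph.
LIGATURES = {
    "fi": "<f_i>",
    "toki": "<toki>",
    "pona": "<pona>",
}

def shape(text: str) -> list[str]:
    """Swap known letter runs for a single glyph, trying the longest runs first."""
    keys = sorted(LIGATURES, key=len, reverse=True)
    glyphs = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                glyphs.append(LIGATURES[key])
                i += len(key)
                break
        else:
            glyphs.append(text[i])  # no ligature applies: draw the plain letter
            i += 1
    return glyphs

print(shape("toki pona"))
```

Notice that the text itself is still the plain letters "toki pona". Only what gets drawn has changed.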

Second Option: Make Your Own Standard

Unicode is able to encode over a million characters, but it currently only uses a fraction of that. There are many combinations which are simply unassigned, and there's actually a whole bunch of them that have been declared as being for private use. Anyone, including a Sitelen Pona enthusiast, can go ahead and just... pick a few to use for their private needs.
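You can ask a computer whether a given combination is one of these private-use ones. The sketch below uses Python's standard `unicodedata` module, where the category code "Co" means "private use". The helper name `is_private_use` and the sample code points are my own choices for the example.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """True if Unicode has set this code point aside for private use."""
    return unicodedata.category(ch) == "Co"

print(is_private_use("A"))           # an ordinary, officially assigned letter
print(is_private_use(chr(0xE000)))   # start of the original Private Use Area
print(is_private_use(chr(0xF1900)))  # inside Supplementary Private Use Area-A
```

Unicode promises never to assign official characters to these slots, which is what makes them safe to borrow.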

There is more than one Sitelen Pona font-making enthusiast. How do we ensure that they pick the same ones to use for Sitelen Pona?

Enter UCSUR

Toki Pona is not the only constructed language that wants its members to agree on which encodings to use. The UCSUR (Under-ConScript Unicode Registry) "ensures nobody steps on each other's toes" as conlang communities decide which of the combinations they'd like to use.

Here's the UCSUR entry outlining the claimed combinations/codepoints of Sitelen Pona.

UCSUR Sitelen Pona vs ASCII Sitelen Pona

UCSUR fonts do not need ligatures enabled, which is good. Ligatures are not intended for the extended use that they have in Sitelen Pona fonts. Relying on them this heavily can be incredibly buggy, and practically every browser has known problems with it.

UCSUR text doesn't even need to come with a font: someone can dump a bunch of UCSUR Sitelen Pona in Discord, and if you have a UCSUR font installed on the device running your Discord client, you will see their Sitelen Pona. If you don't have that UCSUR font, well...

           

(That's Klingon, by the way.)

ASCII allows more choice. There's no quick and easy way of turning UCSUR Sitelen Pona into Latin alphabet Toki Pona, but there is a very easy way to turn ASCII Sitelen Pona back into Latin alphabet Toki Pona. Simply turn off the font, and voila. Not knowing Sitelen Pona or not having a UCSUR font is no longer a barrier to participating in the conversation.

Perhaps, if Sitelen Pona ever enters Unicode, people will make fonts that revert Unicode Sitelen Pona into Latin alphabet-looking text using reverse-ligatures. Hah. That's probably not a thing.

Why is UCSUR not the Default for Beginner Texts?

Three Big Reasons:

It's not accessible.
It's not convenient to my workflow.
Without configuring UCSUR on their devices, people should be able to select Sitelen Pona text, copy it, and paste it elsewhere to see how it's been formatted.

Not Accessible?

In response to the first point, someone once told me, "Well, those people can just rely on the sitelen Lasina instead."

You're making a bold assumption that everyone who relies on a screen-reader or text-to-speech synthesizer is not interested in what Sitelen Pona has to offer.

Once someone makes a stable and lightweight Toki Pona TTS engine capable of reading UCSUR, I might reconsider slightly. But see the other two reasons.