José / Unicode UTF16 Surrogate pairs

José Roca · April 09, 2025, 01:03:55 AM

Hi Charles,

This can be of interest to you, I believe.

https://github.com/freebasic/fbc/issues/451

I have thought of a strategy, and I have working code that implements it, to deal with Unicode surrogate pairs.

Charles Pegge · April 11, 2025, 10:15:32 AM

Thanks José, Very interesting.

O2 has wchr and unic functions, corresponding to char and asc. These are not aware of surrogate pairs yet. All the other core string functions, and the concatenator, can handle wide strings automatically.

But to handle the higher Unicode range, with surrogate pairs, I think it would make sense to work with uniform 32-bit characters, providing overloads for all the core string functions (instr, mid, left, right etc). This would facilitate standard text operations. Do we need to invent QSTRING?

José Roca · April 11, 2025, 04:33:26 PM

If by QSTRING you mean a string that works with UTF-32, this would be useful only to Linux users, because Windows works only with UTF-16. And Linux users don't use it because it wastes memory. Linux users are the ones who have more problems with broken surrogates, but Windows users can also break surrogates when using functions like LEFT, MID and RIGHT.

In my DWSTRING class, all string operations are centralized in a method called AppendBuffer, which moves the memory and resizes the buffer when needed. I have therefore implemented a method called ScanForSurrogates that checks whether the string contains surrogates. If it doesn't, AppendBuffer proceeds as usual; otherwise, it checks each surrogate to see whether it is valid or broken, and replaces broken ones with the replacement character &hFFFD.
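The scan-and-repair idea can be sketched in Go as follows. This is not the DWSTRING code: the names hasSurrogates and repairSurrogates are hypothetical, and the sketch assumes the string is held as a slice of 16-bit code units.

```go
package main

import "fmt"

const replacement = 0xFFFD // U+FFFD REPLACEMENT CHARACTER

// hasSurrogates reports whether any code unit lies in 0xD800..0xDFFF.
// (Hypothetical helper, standing in for a cheap pre-check.)
func hasSurrogates(units []uint16) bool {
	for _, u := range units {
		if u >= 0xD800 && u <= 0xDFFF {
			return true
		}
	}
	return false
}

// repairSurrogates replaces every unpaired surrogate in place with U+FFFD
// and returns how many code units were replaced.
func repairSurrogates(units []uint16) int {
	fixed := 0
	for i := 0; i < len(units); i++ {
		u := units[i]
		switch {
		case u >= 0xD800 && u <= 0xDBFF: // high surrogate
			if i+1 < len(units) && units[i+1] >= 0xDC00 && units[i+1] <= 0xDFFF {
				i++ // valid pair: skip the low half
			} else {
				units[i] = replacement
				fixed++
			}
		case u >= 0xDC00 && u <= 0xDFFF: // low surrogate with no high before it
			units[i] = replacement
			fixed++
		}
	}
	return fixed
}

func main() {
	// 'A', a valid pair for U+1F600, a lone high, 'B', a lone low.
	buf := []uint16{0x41, 0xD83D, 0xDE00, 0xD83D, 0x42, 0xDE00}
	if hasSurrogates(buf) {
		n := repairSurrogates(buf)
		fmt.Printf("%d fixed: %04X\n", n, buf)
	}
}
```

The pre-check keeps the common case (no surrogates at all) to a single pass with no pairing logic.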

It is a palliative to avoid corruption of the string. To avoid producing broken surrogates accidentally, all the intrinsic string procedures of the compiler should do the checking. For example, if LEFT(<string>, 5) would break a surrogate pair, it should detect this and return 4 or 6 characters, depending on the capacity of the destination buffer, instead of 5.
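A surrogate-aware LEFT might look like the Go sketch below. The function name left and the growOK flag are assumptions made for illustration; growOK stands in for "the destination buffer has room for one more code unit".

```go
package main

import "fmt"

// left returns about the first n UTF-16 code units of s without splitting a
// surrogate pair. If the cut would fall between the two halves of a pair,
// it keeps the whole pair when growOK is true (n+1 units) and drops the
// pair otherwise (n-1 units).
func left(s []uint16, n int, growOK bool) []uint16 {
	if n <= 0 {
		return nil
	}
	if n >= len(s) {
		return s
	}
	lastIsHigh := s[n-1] >= 0xD800 && s[n-1] <= 0xDBFF
	nextIsLow := s[n] >= 0xDC00 && s[n] <= 0xDFFF
	if lastIsHigh && nextIsLow {
		if growOK {
			return s[:n+1]
		}
		return s[:n-1]
	}
	return s[:n]
}

func main() {
	// "abcd" + U+1F600 (as a pair) + "e"
	s := []uint16{0x61, 0x62, 0x63, 0x64, 0xD83D, 0xDE00, 0x65}
	fmt.Println(len(left(s, 5, false)), len(left(s, 5, true)))
}
```

With this input, a request for 5 units lands between the two halves of the pair, so the function returns 4 or 6 units, as described above.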

I'm thinking of optimizing my code: ScanForSurrogates will search only for broken surrogates and return the position in the string; AppendBuffer will fix it and call ScanForSurrogates again. This avoids having to check the whole string every time, which would slow processing if the string is big and contains only one broken surrogate.
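A rough Go sketch of that find-then-fix loop is shown here. The names findBroken and fixAll are hypothetical, and resuming the scan just after the repaired position (rather than from the start) is an assumption of this sketch, not something stated in the post.

```go
package main

import "fmt"

// findBroken returns the index of the first unpaired surrogate at or after
// start, or -1 if there is none.
func findBroken(units []uint16, start int) int {
	for i := start; i < len(units); i++ {
		u := units[i]
		if u < 0xD800 || u > 0xDFFF {
			continue // not a surrogate
		}
		isHigh := u <= 0xDBFF
		if isHigh && i+1 < len(units) && units[i+1] >= 0xDC00 && units[i+1] <= 0xDFFF {
			i++ // valid pair: skip the low half
			continue
		}
		return i
	}
	return -1
}

// fixAll repairs broken surrogates one at a time, resuming each scan just
// past the previous repair (assumption: no rescan from the beginning).
func fixAll(units []uint16) int {
	fixed := 0
	for pos := findBroken(units, 0); pos >= 0; pos = findBroken(units, pos+1) {
		units[pos] = 0xFFFD
		fixed++
	}
	return fixed
}

func main() {
	buf := []uint16{0x41, 0xDE00, 0xD83D, 0xDE00, 0xD83D}
	fmt.Println(fixAll(buf), findBroken(buf, 0))
}
```

Resuming at pos+1 is safe because the unit after a broken surrogate can never be the low half of a pair that began at the broken position.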

José Roca · April 11, 2025, 07:48:12 PM

I'm improving the scan for broken surrogates by avoiding repeated casting and using masking instead of direct comparisons. I can't change the FreeBasic string functions to make them surrogate-aware, but you can do it in your compiler. Avoiding the possibility of creating broken surrogates avoids the need to fix them.
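The masking technique can be illustrated in Go as below. The function names are made up for this sketch; the masks rely on the fact that all surrogates share the top five bits 11011, with the sixth bit distinguishing high from low.

```go
package main

import "fmt"

// All surrogates are 0xD800..0xDFFF: the top 5 bits are 11011.
func isSurrogate(u uint16) bool { return u&0xF800 == 0xD800 }

// High (lead) surrogates are 0xD800..0xDBFF: the top 6 bits are 110110.
func isHighSurrogate(u uint16) bool { return u&0xFC00 == 0xD800 }

// Low (trail) surrogates are 0xDC00..0xDFFF: the top 6 bits are 110111.
func isLowSurrogate(u uint16) bool { return u&0xFC00 == 0xDC00 }

func main() {
	for _, u := range []uint16{0x0041, 0xD83D, 0xDE00, 0xFFFD} {
		fmt.Printf("%04X surrogate=%v high=%v low=%v\n",
			u, isSurrogate(u), isHighSurrogate(u), isLowSurrogate(u))
	}
}
```

Each test is one AND and one comparison, instead of the two comparisons needed for a range check.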

Charles Pegge · April 14, 2025, 01:13:19 PM

I think there are only 3 essential functions needed for handling surrogate pairs explicitly.

UCODE to get the CodePoint from a unicode character.

UCHR to create a unicode character from the CodePoint.

UMID to get a complete substring from a unicode string. This also covers ULEFT and URIGHT.
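To make the three proposed functions concrete, here is a minimal Go sketch. It is not OxygenBasic code: the lowercase names, the 1-based code-point positions in umid, and the choice to return U+FFFD for a broken surrogate are all assumptions of this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// ucode returns the code point that starts at unit index i and the number
// of code units it occupies (1 or 2). An unpaired surrogate yields U+FFFD.
func ucode(s []uint16, i int) (rune, int) {
	if utf16.IsSurrogate(rune(s[i])) {
		if i+1 < len(s) {
			if r := utf16.DecodeRune(rune(s[i]), rune(s[i+1])); r != 0xFFFD {
				return r, 2
			}
		}
		return 0xFFFD, 1
	}
	return rune(s[i]), 1
}

// uchr encodes one code point as one or two UTF-16 code units.
func uchr(r rune) []uint16 {
	return utf16.Encode([]rune{r})
}

// umid returns count code points starting at the 1-based code-point
// position start, never splitting a pair.
func umid(s []uint16, start, count int) []uint16 {
	if start < 1 || count <= 0 {
		return nil
	}
	i := 0
	for p := 1; p < start && i < len(s); p++ {
		_, n := ucode(s, i)
		i += n
	}
	j := i
	for c := 0; c < count && j < len(s); c++ {
		_, n := ucode(s, j)
		j += n
	}
	return s[i:j]
}

func main() {
	s := utf16.Encode([]rune("a\U0001F600b"))
	r, n := ucode(s, 1)
	fmt.Printf("%U uses %d units; uchr gives %04X; umid(s, 2, 2) = %04X\n",
		r, n, uchr(r), umid(s, 2, 2))
}
```

ULEFT then corresponds to umid(s, 1, n), and URIGHT can be built by counting code points from the end.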

BTW
OxygenBasic accepts Unicode UTF-16 source code, so you can use any Unicode characters within symbols. Since the compiler itself requires ANSI, Unicode characters with non-zero upper bytes are simply recoded as a series of ANSI characters before compiling.

Theo Gottwald · July 15, 2025, 09:00:59 AM

UTF-16, UTF-32, broken surrogates?
Sounds like a wasps' nest to me.
Did you include all that stuff in your compiler, Charles?

Unicode, Surrogate Pairs & UTF‑32 in Go and Free Pascal

0 → 0x10FFFF in practice

Term	What it really is	Why it matters to the compiler
Unicode scalar value	Any code point except 0xD800–0xDFFF	Compilers can say "one scalar value = one character", ignoring UTF‑16 artefacts
Surrogate pair	Two UTF‑16 code units that encode one scalar value ≥ 0x10000	Only relevant if the data is stored in UTF‑16
UTF‑32 / UCS‑4 char	One 32‑bit integer that equals the scalar value	No surrogate logic needed; every code point fits

Go (current toolchain 1.23‑dev)

Aspect	How Go handles it
Source encoding	Compiler expects the file to be valid UTF‑8
String / rune literals	[tt]\uXXXX[/tt] escapes may NOT name a surrogate; use [tt]\U0001F600[/tt] instead of a pair
Data representation	[tt]string[/tt] = immutable UTF‑8; [tt]rune[/tt] = 32‑bit scalar (built‑in UTF‑32 cell)
Library helpers	[tt]encoding/utf16[/tt] offers [tt]Encode[/tt]/[tt]Decode[/tt]/[tt]IsSurrogate[/tt]
JSON & friends	Parsers merge [tt]\uXXXX\uXXXX[/tt] into a single rune; invalid pairs become U+FFFD
Every‑day impact	You never see surrogates unless you purposely handle UTF‑16
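A small Go program can illustrate several rows of the table at once. The helper name decodeJSONString is hypothetical; the library calls are from the standard [tt]encoding/json[/tt] and [tt]unicode/utf16[/tt] packages.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf16"
)

// decodeJSONString unmarshals a JSON string literal and returns its runes.
// (Hypothetical helper for this example.)
func decodeJSONString(lit string) ([]rune, error) {
	var s string
	if err := json.Unmarshal([]byte(lit), &s); err != nil {
		return nil, err
	}
	return []rune(s), nil
}

func main() {
	// A Go literal must name the scalar value directly; writing the two
	// surrogate halves as \u escapes is rejected by the compiler.
	s := "\U0001F600"

	// 4 UTF-8 bytes, 1 rune, 2 UTF-16 code units.
	fmt.Println(len(s), len([]rune(s)), len(utf16.Encode([]rune(s))))

	// JSON escapes are the place where surrogates do appear in Go code.
	paired, _ := decodeJSONString(`"\ud83d\ude00"`) // valid pair
	lone, _ := decodeJSONString(`"\ud83d"`)         // lone high surrogate
	fmt.Printf("%U %U\n", paired, lone)
}
```

The lone escape is not an error for the JSON decoder; it is silently replaced, which matches the "invalid pairs become U+FFFD" row.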

Free Pascal (FPC 3.2.x / 3.3‑trunk)

Concept	Free Pascal type / facility
UTF‑16 storage	[tt]WideChar[/tt] (16‑bit) & [tt]UnicodeString[/tt]/[tt]WideString[/tt]; non‑BMP stored as two cells
Surrogate helpers	Unit [tt]Character[/tt] (Delphi‑compatible): [tt]IsSurrogate[/tt], [tt]ConvertToUtf32[/tt], [tt]ConvertFromUtf32[/tt]
UTF‑32 storage	[tt]UCS4Char[/tt] (32‑bit) & dynamic [tt]UCS4String[/tt]
Source code	Parser reads UTF‑8 when the file has a BOM or [tt]{$codepage utf8}[/tt] is set; you may embed a pair as [tt]#$D83D#$DE00[/tt]
Conversions	[tt]SysUtils[/tt] + [tt]LazUTF8[/tt] for UTF‑8 ⇄ UTF‑16 / UTF‑32
Every‑day impact	If you stick to [tt]UTF8String[/tt] or [tt]UCS4String[/tt], you rarely care about surrogates

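Code (pascal)

program Smile32;
{$mode objfpc}{$H+}
uses SysUtils, Character;

var
  s16 : UnicodeString;  // UTF-16: one or two 16-bit cells per code point
  cp  : UCS4Char;       // one 32-bit code point
begin
  s16 := #$D83D#$DE00;                      // 😀 (U+1F600) as a surrogate pair
  cp  := TCharacter.ConvertToUtf32(s16, 1); // index of the high surrogate -> $1F600
  Writeln('U+' + IntToHex(cp, 6));
end.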

Side‑by‑side summary

	Go	FreePascal
Default string encoding	UTF‑8	Platform‑dependent (UTF‑8 on *nix, UTF‑16 on Windows)
"One code‑point" primitive	[tt]rune[/tt]	[tt]UCS4Char[/tt]
Literal rejects half‑surrogate?	Yes (compile‑time error)	No (validated at run time if you ask)
Std‑lib helpers	[tt]encoding/utf16[/tt]	[tt]TCharacter[/tt] (unit [tt]Character[/tt])
Need to think about pairs daily?	Rarely	Only when you keep data in UTF‑16

Practical guidelines

Go

Write non‑BMP literals with [tt]\UXXXXXXXX[/tt] or raw UTF‑8 bytes.
Convert to/from UTF‑16 only when calling external APIs that demand it.
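The second guideline can be sketched as a pair of small boundary helpers. The names toUTF16 and fromUTF16 are assumptions for this example; [tt]utf16.Encode[/tt] and [tt]utf16.Decode[/tt] are the standard library calls.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 converts a Go (UTF-8) string to UTF-16 code units, for example
// just before handing text to an API that demands UTF-16.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

// fromUTF16 converts UTF-16 code units received from such an API back to a
// Go string. Unpaired surrogates come out as U+FFFD.
func fromUTF16(u []uint16) string {
	return string(utf16.Decode(u))
}

func main() {
	in := "a\U0001F600b"
	wide := toUTF16(in)
	fmt.Printf("%04X round-trips: %v\n", wide, fromUTF16(wide) == in)
}
```

Keeping the conversion at the edge means the rest of the program only ever sees valid UTF-8 strings and runes.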

Free Pascal

Prefer [tt]UTF8String[/tt] or [tt]UCS4String[/tt] for pure‑Pascal text.
Use [tt]ConvertToUtf32[/tt] / [tt]ConvertFromUtf32[/tt] when crossing a UTF‑16 boundary.
If you must slice a UTF‑16 string, validate pairs first with [tt]TCharacter.IsSurrogate[/tt] helpers.

Both compilers give full Unicode reach; they only differ in when surrogate‑pair logic surfaces—compile‑time rejection in Go, run‑time helpers in Free Pascal.
