José / Unicode UTF16 Surrogate pairs

Started by José Roca, April 09, 2025, 01:03:55 AM

José Roca

Hi Charles,

This can be of interest to you, I believe.

https://github.com/freebasic/fbc/issues/451

I have thought of a strategy, and I have working code to implement it, to deal with Unicode surrogate pairs.

Charles Pegge

#1
Thanks José, Very interesting.

O2 has wchr and unic functions, corresponding to char and asc. These are not yet aware of surrogate pairs. All the other core string functions, and the concatenator, can handle wide strings automatically.

But to handle the higher Unicode planes, with surrogate pairs, I think it would make sense to work with uniform 32-bit characters, providing overloads for all the core string functions (instr, mid, left, right, etc.). This would facilitate standard text operations. Do we need to invent QSTRING? :)

José Roca

#2
If by QSTRING you mean a string that works with UTF-32, this would only be useful to Linux users, because Windows only works with UTF-16. And Linux users don't use it because it wastes memory. Linux users are the ones that have more problems with broken surrogates, but Windows users can also break surrogates when using functions like LEFT, MID and RIGHT.

In my DWSTRING class, all the string operations are centralized in a method called AppendBuffer, which is the one that moves the memory and resizes the buffer when needed. Therefore, I have implemented a method called ScanForSurrogates to check whether there are surrogates. If there aren't, it proceeds as usual; otherwise, it checks the characters of the string to see whether the surrogates are valid or broken; if broken, it replaces them with the &hFFFD replacement character.
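The repair step described above can be sketched in Go on a slice of UTF-16 code units. This is only an illustration under stated assumptions, not the DWSTRING code: the function name repairSurrogates and the helper predicates are hypothetical, and the sketch returns a repaired copy rather than fixing the buffer inside AppendBuffer.

```go
package main

import "fmt"

const replacement = 0xFFFD // U+FFFD REPLACEMENT CHARACTER

func isHigh(u uint16) bool { return u >= 0xD800 && u <= 0xDBFF }
func isLow(u uint16) bool  { return u >= 0xDC00 && u <= 0xDFFF }

// repairSurrogates (hypothetical name) returns a copy of s in which every
// unpaired surrogate is replaced by U+FFFD. Valid pairs are copied unchanged.
func repairSurrogates(s []uint16) []uint16 {
	out := make([]uint16, len(s))
	copy(out, s)
	for i := 0; i < len(out); i++ {
		switch {
		case isHigh(out[i]):
			if i+1 < len(out) && isLow(out[i+1]) {
				i++ // valid pair: skip the low half
			} else {
				out[i] = replacement // high half with no low half after it
			}
		case isLow(out[i]):
			out[i] = replacement // low half with no high half before it
		}
	}
	return out
}

func main() {
	// 'A', lone high surrogate, 'B', valid pair for U+1F600, lone low surrogate
	s := []uint16{0x41, 0xD83D, 0x42, 0xD83D, 0xDE00, 0xDC00}
	fmt.Printf("%04X\n", repairSurrogates(s))
}
```

In this sample the two lone halves become &hFFFD while the valid pair D83D DE00 is left alone.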

It is a palliative to avoid corruption of the string. To avoid the accidental production of broken surrogates, all the intrinsic string procedures of the compiler should do the checking. For example, if LEFT(<string>, 5) breaks a surrogate pair, it should detect this and return 4 or 6 characters, depending on the capacity of the destination buffer, instead of 5.

I'm thinking of optimizing my code: ScanForSurrogates will search for broken surrogates only and return the position in the string; the AppendBuffer method will fix it and call ScanForSurrogates again. This avoids having to check the whole string, which would slow processing if it is a big string that contains only one broken surrogate.

José Roca

#3
I'm improving the scanning for broken surrogates by avoiding repeated casting and using masking instead of direct comparisons. I can't change the FreeBasic string functions to make them surrogate aware, but you can do it in your compiler. Avoiding the possibility of creating broken surrogates avoids the need to fix them.
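As an illustration of the masking idea, here is a minimal Go sketch; it is not the FreeBasic code, and the names isSurrogate, isHigh, isLow and firstBroken are hypothetical. All surrogates share the top five bits 11011, and the sixth bit separates high from low halves, so one AND plus one compare replaces each pair of range comparisons. The scanner returns the position of the first broken surrogate so the caller can fix it and resume from there.

```go
package main

import "fmt"

// Surrogates occupy 0xD800..0xDFFF: the top five bits are always 11011.
// High halves are 0xD800..0xDBFF and low halves are 0xDC00..0xDFFF,
// so the top six bits tell them apart.
func isSurrogate(u uint16) bool { return u&0xF800 == 0xD800 }
func isHigh(u uint16) bool      { return u&0xFC00 == 0xD800 }
func isLow(u uint16) bool       { return u&0xFC00 == 0xDC00 }

// firstBroken (hypothetical name) returns the index of the first unpaired
// surrogate at or after start, or -1 if there is none. start must be a
// character boundary, not the low half of a valid pair.
func firstBroken(s []uint16, start int) int {
	for i := start; i < len(s); i++ {
		if !isSurrogate(s[i]) {
			continue
		}
		if isHigh(s[i]) && i+1 < len(s) && isLow(s[i+1]) {
			i++ // valid pair: skip the low half
			continue
		}
		return i
	}
	return -1
}

func main() {
	s := []uint16{0x41, 0xD83D, 0xDE00, 0xDC00, 0x42, 0xD83D}
	// Fix one broken surrogate at a time and resume right after it.
	for i := firstBroken(s, 0); i >= 0; i = firstBroken(s, i+1) {
		s[i] = 0xFFFD
	}
	fmt.Printf("%04X\n", s)
}
```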

Charles Pegge

#4
I think there are only 3 essential functions needed for handling surrogate pairs explicitly.

UCODE to get the CodePoint from a unicode character.

UCHR to create a unicode character from the CodePoint.

UMID to get a complete substring from a unicode string. This also covers ULEFT and URIGHT.

BTW
OxygenBasic accepts Unicode UTF-16 source code, so you can use any Unicode characters within symbols. Since the compiler itself requires ANSI, Unicode characters with non-zero upper bytes are simply recoded as a series of ANSI characters before compiling.

Theo Gottwald

#5
UTF-16, UTF-32, broken surrogates?
Sounds like a wasps' nest to me.
Did you include all that stuff in your compiler, Charles?

Unicode, Surrogate Pairs & UTF‑32 in Go and Free Pascal



0 → 0x10FFFF in practice
Term | What it really is | Why it matters to the compiler
Unicode scalar value | Any code point except 0xD800–0xDFFF | Compilers can say "one scalar value = one character", ignoring UTF‑16 artefacts
Surrogate pair | Two UTF‑16 code units that encode one scalar value ≥ 0x10000 | Only relevant if the data is stored in UTF‑16
UTF‑32 / UCS‑4 char | One 32‑bit integer that equals the scalar value | No surrogate logic needed; every code point fits



Go (current toolchain 1.23‑dev)
Aspect | How Go handles it
Source encoding | Compiler expects the file to be valid UTF‑8
String / rune literals | [tt]\uXXXX[/tt] escapes may NOT name a surrogate; use [tt]\U0001F600[/tt] instead of a pair
Data representation | [tt]string[/tt] = immutable UTF‑8; [tt]rune[/tt] = 32‑bit scalar (built‑in UTF‑32 cell)
Library helpers | [tt]unicode/utf16[/tt] offers [tt]Encode[/tt]/[tt]Decode[/tt]/[tt]IsSurrogate[/tt]
JSON & friends | Parsers merge [tt]\uXXXX\uXXXX[/tt] into a single rune; invalid pairs become U+FFFD
Every‑day impact | You never see surrogates unless you purposely handle UTF‑16
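The JSON behaviour in the table can be checked with a short Go sketch using the standard encoding/json package. The helper name decode is hypothetical; the sketch only shows that an escaped pair comes back as one rune and that an unpaired escape is replaced rather than rejected.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decode (hypothetical helper) unmarshals a JSON string literal and returns
// its contents as a slice of runes.
func decode(lit string) []rune {
	var s string
	if err := json.Unmarshal([]byte(lit), &s); err != nil {
		panic(err)
	}
	return []rune(s)
}

func main() {
	// Raw string literals keep the backslashes, so these are JSON escapes,
	// not Go escapes. A Go literal such as "\ud83d" would not compile,
	// because the compiler rejects surrogate halves in \u escapes.
	paired := decode(`"\ud83d\ude00"`) // escaped surrogate pair
	lone := decode(`"\ud83d"`)         // unpaired high surrogate

	fmt.Printf("%U\n", paired) // one rune, U+1F600
	fmt.Printf("%U\n", lone)   // one rune, U+FFFD
}
```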



Free Pascal (FPC 3.2.x / 3.3‑trunk)
Concept | Free Pascal type / facility
UTF‑16 storage | [tt]WideChar[/tt] (16‑bit) & [tt]UnicodeString[/tt]/[tt]WideString[/tt]; non‑BMP stored as two cells
Surrogate helpers | Unit [tt]Character[/tt] (Delphi‑compatible): [tt]IsSurrogate[/tt], [tt]ConvertToUtf32[/tt], [tt]ConvertFromUtf32[/tt]
UTF‑32 storage | [tt]UCS4Char[/tt] (32‑bit) & dynamic [tt]UCS4String[/tt]
Source code | Parser expects UTF‑8; you may embed a pair as [tt]#$D83D#$DE00[/tt]
Conversions | [tt]SysUtils[/tt] + [tt]LazUTF8[/tt] for UTF‑8 ⇄ UTF‑16 / UTF‑32
Every‑day impact | If you stick to [tt]UTF8String[/tt] or [tt]UCS4String[/tt], you rarely care about surrogates

Code (pascal)
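program Smile32;
{$mode objfpc}{$H+}
uses SysUtils, Character;

var
  s16 : UnicodeString;
  cp  : UCS4Char;
begin
  s16 := #$D83D#$DE00;                      // 😀 (U+1F600) as a UTF‑16 surrogate pair
  // Called through the TCharacter class of the Character unit;
  // the second argument is the 1-based index of the high surrogate.
  cp  := TCharacter.ConvertToUtf32(s16, 1); // -> $1F600
  Writeln('U+' + IntToHex(cp, 6));
end.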



Side‑by‑side summary
 | Go | Free Pascal
Default string encoding | UTF‑8 | Platform‑dependent (UTF‑8 on *nix, UTF‑16 on Windows)
"One code‑point" primitive | [tt]rune[/tt] | [tt]UCS4Char[/tt]
Literal rejects half‑surrogate? | Yes (compile‑time error) | No (validated at run time if you ask)
Std‑lib helpers | [tt]unicode/utf16[/tt] | [tt]TCharacter.*[/tt]
Need to think about pairs daily? | Rarely | Only when you keep data in UTF‑16



Practical guidelines

Go
  • Write non‑BMP literals with [tt]\UXXXXXXXX[/tt] or raw UTF‑8 bytes.
  • Convert to/from UTF‑16 only when calling external APIs that demand it.
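A minimal Go sketch of that boundary conversion, using the standard unicode/utf16 package. The helper names toUTF16 and fromUTF16 are hypothetical, and the external API itself is left out; the point is only that surrogate pairs appear when encoding and disappear again when decoding.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 (hypothetical helper) converts a Go string to UTF-16 code units,
// the form an external UTF-16 API would expect.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

// fromUTF16 (hypothetical helper) converts UTF-16 code units received from
// such an API back to a Go string. Unpaired surrogates become U+FFFD.
func fromUTF16(u []uint16) string {
	return string(utf16.Decode(u))
}

func main() {
	s := "hi \U0001F600"
	u := toUTF16(s)

	// The same text measured three ways: UTF-8 bytes, runes, UTF-16 units.
	fmt.Println(len(s), len([]rune(s)), len(u))
	fmt.Println(fromUTF16(u) == s)
}
```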

Free Pascal
  • Prefer [tt]UTF8String[/tt] or [tt]UCS4String[/tt] for pure‑Pascal text.
  • Use [tt]ConvertToUtf32[/tt] / [tt]ConvertFromUtf32[/tt] when crossing a UTF‑16 boundary.
  • If you must slice a UTF‑16 string, validate pairs first with [tt]TCharacter.IsSurrogate[/tt] helpers.



Both compilers give full Unicode reach; they only differ in when surrogate‑pair logic surfaces—compile‑time rejection in Go, run‑time helpers in Free Pascal.