This system converts allkeys.txt into a compact binary format (0.56 MB) that can be loaded and queried efficiently.
The system consists of two programs:
char magic[8] "UCADATA1" uint32_t version UCA version (e.g., 170000) uint32_t num_singles Number of single codepoint entries uint32_t num_sequences Number of multi-codepoint sequences uint32_t page_table_offset Offset to page table uint32_t trie_offset Offset to sequence trie
Array of 4,352 uint32_t offsets, one per page:
For each allocated page:
uint16_t count Number of entries in this page (0-256) For each entry (5 bytes each): uint8_t offset Offset within page (0-255) uint32_t data_offset Offset to collation data
Entries are sorted by offset for binary search.
uint8_t num_elements Number of collation elements (1-16) For each element (7 bytes): uint16_t primary uint16_t secondary uint16_t tertiary uint8_t variable 1 if variable weighting, 0 otherwise
Recursive trie structure for multi-codepoint sequences:
uint32_t codepoint Codepoint at this node uint32_t data_offset Offset to collation data (0 if intermediate) uint16_t num_children Number of child nodes For each child: uint32_t child_offset Offset to child node
# Build the builder gcc -O2 -o collation_builder collation_builder.c # Generate binary data file ./collation_builder allkeys.txt collation_data.bin # Optionally also generate C source (for embedding) ./collation_builder allkeys.txt collation_data.bin collation_data.c
The minimal_lookup.c program is a simple, working example that demonstrates the binary format:
# Build gcc -O2 -o minimal_lookup minimal_lookup.c # Run ./minimal_lookup collation_data.bin
Output:
Loaded 587112 bytes
Version: 17.0.0
Singles: 38785
Sequences: 964
Looking up U+0041:
Page: 0, Offset: 65
Page data at offset: 17436
Page has 256 entries
Found at entry 140, data offset: 212907
Collation: 1 elements
[.23EC.0020.0008]
This demonstrates:
The collation_lookup.c program provides a complete interface with tests and interactive mode. Note: This version currently has a runtime issue and needs debugging, but the format specification it implements is correct.
# Build lookup program gcc -O2 -o collation_lookup collation_lookup.c # Run tests ./collation_lookup collation_data.bin # Interactive mode ./collation_lookup collation_data.bin -i
> 0041 U+0041: [.23EC.0020.0008] > 006C 00B7 U+006C U+00B7: [.2528.0020.0002][.0000.011F.0002] > quit
Builder:
Lookup:
For Unicode range U+0000 to U+00FF (page 0):
Page Table[0] → Page Data @ offset 4444
Page Data:
count: 179 (179 codepoints in this range have collation data)
entries[0]: offset=0x00, data_offset=5555
entries[1]: offset=0x01, data_offset=5562
...
entries[178]: offset=0xFF, data_offset=6789
For codepoint U+0041 (‘A’):
1. page_num = 0x0041 >> 8 = 0 2. page_offset = 0x0041 & 0xFF = 0x41 (65) 3. Read Page Table[0] → page_data_offset = 4444 4. Read page count at offset 4444 → 179 entries 5. Binary search for offset 65 in entries 6. Found at entries[33]: data_offset = 5900 7. Read collation data at offset 5900: num_elements = 1 primary = 0x23EC secondary = 0x0020 tertiary = 0x0008 variable = 0
Why page-based?
Why offsets instead of pointers?
Why binary search within pages?
For a fully working system, see the hybrid page table + trie implementation in:
collation_hybrid.ccollation_loader.ccollation_demo.cThis uses an in-memory data structure instead of binary file format, but provides the same O(log 256) + O(k) performance.