ClassDistribution: store distributions as a sorted vector<Vfield> (−32% base memory) by antalvdb · Pull Request #14 · LanguageMachines/timbl

antalvdb · 2026-06-19T13:40:27Z

Summary

Proposes changing ClassDistribution's storage from std::map<size_t, Vfield*> to a sorted std::vector<Vfield>, to cut the instance-base memory footprint.

Today each target class in a distribution costs a red-black-tree node plus a separately heap-allocated Vfield, and there is (potentially) one distribution per instance-base node — so on a large base this is the dominant memory consumer. This stores the distribution as a flat vector of Vfield values kept sorted by value->Index() (the map key was redundant — it always equals value->Index()). Vfield becomes a plain value type stored inline. Lookups are a binary search with an O(1) fast path for the common sorted-append case.
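To make the layout concrete, here is a minimal sketch of a sorted-vector distribution with a binary-search lookup and an append fast path. It is not the code in this PR: `Entry` and `SortedDist` are hypothetical stand-ins for `Vfield` and `ClassDistribution`, and the index is stored directly instead of being read through `value->Index()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Vfield. The real class reads its key via
// value->Index(); here the index is stored directly to stay self-contained.
struct Entry {
  std::size_t index;  // plays the role of value->Index()
  int freq;           // occurrence count for this target class
};

// Hypothetical stand-in for ClassDistribution.
class SortedDist {
public:
  // Increment the count for `index`, inserting a new entry if needed.
  void increment(std::size_t index) {
    // Fast path: a key larger than the current last key is appended
    // without searching (amortised O(1)).
    if (entries_.empty() || entries_.back().index < index) {
      entries_.push_back(Entry{index, 1});
      return;
    }
    auto it = find_slot(index);
    if (it != entries_.end() && it->index == index) {
      ++it->freq;  // existing class: bump in place
    } else {
      // New class before the end: shifts the tail by one slot.
      entries_.insert(it, Entry{index, 1});
    }
  }

  // Return the count for `index`, or 0 if the class is absent.
  int frequency(std::size_t index) const {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), index, less_key);
    return (it != entries_.end() && it->index == index) ? it->freq : 0;
  }

  const std::vector<Entry>& entries() const { return entries_; }

private:
  static bool less_key(const Entry& e, std::size_t key) { return e.index < key; }

  // Binary search for the first entry whose key is >= index.
  std::vector<Entry>::iterator find_slot(std::size_t index) {
    return std::lower_bound(entries_.begin(), entries_.end(), index, less_key);
  }

  std::vector<Entry> entries_;  // invariant: strictly ascending by index
};
```

Each entry lives inline in the vector's buffer, so there is no per-class tree node and no separate heap allocation; the price is the tail shift in the non-append insert.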

Measured impact

Word-prediction base, ~4.9M training instances / ~13.5M nodes (l4r0 next-token data):

Measurement	before	after
peak RSS, loading a TRIBL2 base	2.07 GB	1.40 GB (−32%)
peak RSS, IGTree	1.26 GB	0.87 GB (−31%)
TRIBL2 test-phase instructions	—	+3.4%

The +3.4% is intrinsic to in-place sorted-vector inserts: adding a class anywhere but the end shifts the tail (O(n)), where the tree insert was O(log n), so it only matters for large distributions. Wall-clock time wasn't reliably measurable (the test machine thermally throttles under repeated trainings), but contiguous storage should fault less than the pointer-chasing map. Net: a memory-for-CPU trade that's favourable for memory-bound use.

Correctness

- TRIBL2 test output is byte-identical, both plain and with +v db (distributions printed).
- A freshly trained instance base is byte-identical to one saved by the current code (on-disk format unchanged).
- IGTree accuracy is unchanged to the digit.
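Why the output can stay byte-identical: a `std::map` iterates in ascending key order, and a vector kept sorted by the same key iterates in that same order. The sketch below illustrates this with a made-up text format (not TiMBL's on-disk format); `Pair`, `dump`, `dump_via_map` and `dump_via_sorted_vector` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::size_t, int>;  // (class index, frequency)

// Write any container of (index, frequency) pairs in iteration order.
// The format is invented for this sketch.
template <typename Container>
std::string dump(const Container& c) {
  std::ostringstream os;
  os << "{";
  bool first = true;
  for (const auto& kv : c) {
    if (!first) os << ", ";
    os << kv.first << " " << kv.second;
    first = false;
  }
  os << "}";
  return os.str();
}

// Old layout: accumulate into a map keyed by class index.
std::string dump_via_map(const std::vector<Pair>& input) {
  std::map<std::size_t, int> m;
  for (const auto& p : input) m[p.first] += p.second;
  return dump(m);
}

// New layout: accumulate into a vector kept sorted by class index.
std::string dump_via_sorted_vector(const std::vector<Pair>& input) {
  std::vector<Pair> v;
  for (const auto& p : input) {
    auto it = std::lower_bound(
        v.begin(), v.end(), p.first,
        [](const Pair& e, std::size_t key) { return e.first < key; });
    if (it != v.end() && it->first == p.first) {
      it->second += p.second;  // class already present
    } else {
      v.insert(it, p);         // keep the vector sorted
    }
  }
  return dump(v);
}
```

Feeding both functions the same unsorted input yields the same string, which is the property the byte-identical checks above rely on.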

Scope

ClassDistribution/Vfield (Targets.h/.cxx), the TargetDist iterations in Features.cxx, and 3 lines in MBLClass.cxx. sum_distributions and the I/O code go through the public API and are untouched.

Posting for your consideration — happy to adjust.

🤖 Generated with Claude Code

The per-class distribution was a std::map<size_t, Vfield*>: a red-black-tree node plus a separately heap-allocated Vfield for every target class, for (potentially) every node in the instance base. On a large base this is the dominant memory consumer.

Store the distribution as a flat std::vector<Vfield>, kept sorted by value->Index() (the map key was redundant -- it always equals value->Index()). Vfield becomes a plain copyable value type stored inline. Lookups are a binary search, with an O(1) fast path for the common sorted-append case; the sorted invariant keeps (de)serialisation byte-for-byte identical.

Measured on a word-prediction base (~4.9M instances, ~13.5M nodes; l4r0):

- peak RSS loading a TRIBL2 base: 2.07 GB -> 1.40 GB (-32%)
- peak RSS, IGTree: 1.26 GB -> 0.87 GB (-31%)
- TRIBL2 test-phase instructions: +3.4% (in-place sorted inserts cost a tail shift where the tree was O(log n); intrinsic to the layout, bites only on large distributions)

Verified byte-identical test output (plain and +v db), a freshly trained instance base byte-identical to the previous on-disk format, and unchanged IGTree accuracy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

antalvdb wants to merge 1 commit into master from classdistribution-sorted-vector