ClassDistribution: store distributions as a sorted vector<Vfield> (−32% base memory) #14

Open
antalvdb wants to merge 1 commit into master from classdistribution-sorted-vector

@antalvdb

Member

Summary

Proposes changing ClassDistribution's storage from std::map<size_t, Vfield*> to a sorted std::vector<Vfield>, to cut the instance-base memory footprint.

Today each target class in a distribution costs a red-black-tree node plus a separately heap-allocated Vfield, and there is (potentially) one distribution per instance-base node — so on a large base this is the dominant memory consumer. This stores the distribution as a flat vector of Vfield values kept sorted by value->Index() (the map key was redundant — it always equals value->Index()). Vfield becomes a plain value type stored inline. Lookups are a binary search with an O(1) fast path for the common sorted-append case.
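The layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Entry` and `SortedDist` are hypothetical stand-ins for `Vfield` and `ClassDistribution`, and `index` plays the role of `value->Index()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Vfield: a small value type stored inline.
struct Entry {
  std::size_t index;  // plays the role of value->Index()
  int freq;           // occurrence count for this target class
};

// Hypothetical stand-in for ClassDistribution.
class SortedDist {
public:
  // Count one occurrence of class `index`, keeping entries_ sorted by index.
  void increment(std::size_t index) {
    // Fast path: an index larger than every stored one is a plain append.
    if (entries_.empty() || entries_.back().index < index) {
      entries_.push_back(Entry{index, 1});
      return;
    }
    auto it = std::lower_bound(entries_.begin(), entries_.end(), index, by_index);
    if (it != entries_.end() && it->index == index) {
      ++it->freq;  // class already present
    } else {
      entries_.insert(it, Entry{index, 1});  // shifts the tail to the right
    }
    assert(sorted());  // debug-only check of the invariant
  }

  // Binary search; returns nullptr when the class is absent.
  const Entry* find(std::size_t index) const {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), index, by_index);
    return (it != entries_.end() && it->index == index) ? &*it : nullptr;
  }

  std::size_t size() const { return entries_.size(); }

  // True when entries_ is ordered by ascending index.
  bool sorted() const {
    return std::is_sorted(entries_.begin(), entries_.end(),
                          [](const Entry& a, const Entry& b) { return a.index < b.index; });
  }

private:
  static bool by_index(const Entry& e, std::size_t key) { return e.index < key; }
  std::vector<Entry> entries_;
};
```

In this sketch, only the `insert` branch moves existing elements, so classes that arrive in ascending index order never pay for a shift.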

Measured impact

Word-prediction base, ~4.9M training instances / ~13.5M nodes (l4r0 next-token data):

| | before | after |
|---|---|---|
| peak RSS, loading a TRIBL2 base | 2.07 GB | 1.40 GB (−32%) |
| peak RSS, IGTree | 1.26 GB | 0.87 GB (−31%) |
| TRIBL2 test-phase instructions | | +3.4% |

The +3.4% is intrinsic to in-place sorted-vector inserts: inserting anywhere but the end shifts the tail, which is linear in the distribution size, where the tree insert was O(log n). It therefore only matters for large distributions. Wall-clock time wasn't reliably measurable (the test machine thermally throttles under repeated trainings), but contiguous storage should fault less than the pointer-chasing map. Net: a memory-for-CPU trade that's favourable for memory-bound use.

Correctness

  • TRIBL2 test output byte-identical, both plain and with +v db (distributions printed).
  • A freshly trained instance base is byte-identical to one saved by the current code (on-disk format unchanged).
  • IGTree accuracy unchanged to the digit.
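
One reason the output can stay byte-identical is that a vector kept sorted by index is traversed in the same order as a `std::map` keyed on that index. The sketch below illustrates this with a made-up `index:freq` text format; the format, `Cell`, and both `dump_via_*` helpers are assumptions for illustration only, not Timbl's actual I/O code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, int> Cell;  // (class index, frequency)

// Hypothetical "index:freq " text format, not Timbl's real file format.
template <typename Container>
std::string dump(const Container& c) {
  std::ostringstream os;
  for (const auto& e : c) os << e.first << ':' << e.second << ' ';
  return os.str();
}

// Old layout: std::map iterates in ascending key order.
std::string dump_via_map(const std::vector<std::size_t>& classes) {
  std::map<std::size_t, int> m;
  for (std::size_t c : classes) ++m[c];
  return dump(m);
}

// New layout: a vector kept sorted by index iterates in the same order.
std::string dump_via_sorted_vector(const std::vector<std::size_t>& classes) {
  std::vector<Cell> v;
  for (std::size_t c : classes) {
    auto it = std::lower_bound(v.begin(), v.end(), c,
                               [](const Cell& e, std::size_t key) { return e.first < key; });
    if (it != v.end() && it->first == c) {
      ++it->second;
    } else {
      v.insert(it, Cell(c, 1));
    }
  }
  return dump(v);
}
```

Both helpers produce the same string for the same sequence of observed classes, whatever order the classes arrive in.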

Scope

ClassDistribution/Vfield (Targets.h/.cxx), the TargetDist iterations in Features.cxx, and 3 lines in MBLClass.cxx. sum_distributions and the I/O code go through the public API and are untouched.

Posting for your consideration — happy to adjust.

🤖 Generated with Claude Code

The per-class distribution was a std::map<size_t, Vfield*>: a red-black-tree
node plus a separately heap-allocated Vfield for every target class, for
(potentially) every node in the instance base. On a large base this is the
dominant memory consumer.

Store the distribution as a flat std::vector<Vfield>, kept sorted by
value->Index() (the map key was redundant -- it always equals value->Index()).
Vfield becomes a plain copyable value type stored inline. Lookups are a binary
search, with an O(1) fast path for the common sorted-append case; the sorted
invariant keeps (de)serialisation byte-for-byte identical.

Measured on a word-prediction base (~4.9M instances, ~13.5M nodes; l4r0):
  - peak RSS loading a TRIBL2 base: 2.07 GB -> 1.40 GB (-32%)
  - peak RSS, IGTree:               1.26 GB -> 0.87 GB (-31%)
  - TRIBL2 test-phase instructions: +3.4% (in-place sorted inserts cost a tail
    shift where the tree was O(log n); intrinsic to the layout, bites only on
    large distributions)

Verified byte-identical test output (plain and +v db), a freshly trained
instance base byte-identical to the previous on-disk format, and unchanged
IGTree accuracy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>