Skip to content

Add scoped event-time director models#249

Open
sanjaychari wants to merge 15 commits into
codes-org:masterfrom
sanjaychari:digital-twin-sbir-develop-component-ml
Open

Add scoped event-time director models#249
sanjaychari wants to merge 15 commits into
codes-org:masterfrom
sanjaychari:digital-twin-sbir-develop-component-ml

Conversation

@sanjaychari

Copy link
Copy Markdown
Collaborator

This PR makes the following changes.

  • Add scoped event-time model storage and inference so the
    dragonfly-dally event-time surrogate can use separate ML models for
    switch LPs while retaining the existing director request flow.

  • Fix ZeroMQ director request argument handling so command handlers parse
    the argument-count prefix exactly once. This prevents client IDs such as
    1 from being mistaken for a second argument count and dropped from
    training/inference requests.

  • Restore the ZeroMQ director build path by compiling director-client.C
    when USE_ZMQML is enabled and propagating the USE_ZMQML compile
    definition to downstream targets.

  • Clean up the director-client merge conflict around global ZMQ latency
    statistics and keep the cumulative MPI-reduced DIR_STATS output format.
    Expose a latency-recording hook so event-time inference requests from
    dragonfly-dally are included in the shared ZMQ request statistics.

  • Update the event-time workflow to use START_ITER and END_ITER template
    variables and save/load the scoped event-time model directory rather
    than a single model file.

Add scoped event-time model storage and inference so the
dragonfly-dally event-time surrogate can use separate ML models for
switch LPs while retaining the existing director request flow.

Fix ZeroMQ director request argument handling so command handlers parse
the argument-count prefix exactly once. This prevents client IDs such as
1 from being mistaken for a second argument count and dropped from
training/inference requests.

Restore the ZeroMQ director build path by compiling director-client.C
when USE_ZMQML is enabled and propagating the USE_ZMQML compile
definition to downstream targets.

Clean up the director-client merge conflict around global ZMQ latency
statistics and keep the cumulative MPI-reduced DIR_STATS output format.
Expose a latency-recording hook so event-time inference requests from
dragonfly-dally are included in the shared ZMQ request statistics.

Update the event-time workflow to use START_ITER and END_ITER template
variables and save/load the scoped event-time model directory rather
than a single model file.
This commit formats files with clang-format-20
,which is used by the CI, instead of just clang-format.
@sanjaychari sanjaychari force-pushed the digital-twin-sbir-develop-component-ml branch from 73d52b7 to e937968 Compare June 23, 2026 15:54
@caitlinross

Copy link
Copy Markdown
Member

@sanjaychari let me know when you've fixed conflicts and I'll review this branch. I made some improvements to how zeromq related stuff is built, so I think that's where the conflicts are.

@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/networks/model-net/dragonfly-dally.cxx 60.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Comment on lines +70 to +71
extern "C" void director_record_external_zmq_latency(double processing_sec, double total_sec)
__attribute__((weak));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the __attribute__((weak)) is unnecessary at this point because this should only be built when #if CODES_HAVE_ZEROMQ.

actually these 2 lines should just be completely removed because this is already declared in director-client.h. It should not be declared again in a cxx file.

#include "codes/model-net-lp.h"
#include "codes/surrogate/init.h"
#if CODES_HAVE_TORCH
#include "codes/surrogate/director-client.h"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is probably why you added the weak attribute below. This should either be guarded by #if CODES_HAVE_ZEROMQ or I think it could technically be unguarded.

Comment on lines +57 to +59
bash -euxo pipefail -c '
apt-get update
apt-get install -y python3-zmq python3-numpy python3-sklearn python3-pandas python3-pip gettext-base

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with #255 updating the ci image, this is no longer needed., same for the other places this is done. in the future we should update the docker image with dependencies like this instead of grabbing them in every job. just note that it has to be done in a separate PR because the job that creates the docker image does it on push to master only when the dockerfile changes.

Comment thread CMakeLists.txt
Comment on lines +84 to +85
if(NOT TARGET ROSS::ROSS)
message(WARNING "ROSS package did not define ROSS::ROSS; creating compatibility imported target.")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a ci job (can be build only, I don't think it's necessary to run the test suite) that builds CODES with an old ROSS. That way we can be sure this continues to work.

Also change the warning message to a deprecation message (see elsewhere in this file where deprecation is used), because I don't think we should keep this around forever.

}
}

if (director_record_external_zmq_latency) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't need to be checked. It's guarded in #if CODES_HAVE_ZEROMQ

director_record_zmq_latency_values(zmq_processing_time, local_latency_sec);
}

extern "C" void director_record_external_zmq_latency(double processing_sec, double total_sec) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the extern "C" is not needed here, only in the header, where you already have it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants