jlewi/foyle

Continuously train and relearn

Closed this issue · 7 comments

With #83 we introduced learning from human feedback.
This process currently requires running two commands

foyle logs process
foyle learn

We should automate this so the process runs continuously in the background.
This might require using watermarks or some other mechanism to track which files and log entries have already been processed, so we don't have to go back and redo them.
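For illustration, a minimal sketch of what such a watermark store could look like (the `Watermarks` type and the JSON file it persists to are made up for this example, not existing Foyle code):

```go
package main

import (
	"encoding/json"
	"os"
)

// Watermarks maps a log file path to the byte offset up to which it has
// already been processed. Persisting this lets the background loop resume
// without re-reading entries it has already handled.
type Watermarks map[string]int64

// LoadWatermarks reads the persisted offsets; a missing file just means
// we start from scratch.
func LoadWatermarks(path string) (Watermarks, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return Watermarks{}, nil
	}
	if err != nil {
		return nil, err
	}
	w := Watermarks{}
	if err := json.Unmarshal(data, &w); err != nil {
		return nil, err
	}
	return w, nil
}

// SaveWatermarks writes the offsets back out after each processing pass.
func SaveWatermarks(path string, w Watermarks) error {
	data, err := json.MarshalIndent(w, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}
```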

Some high-level ideas:

  • We can use a queue based design
  • We can treat the log files as queues and each log entry as an event
  • We can process log entries by updating the corresponding trace
  • When we detect a log entry that requires updating a block, we can create a block update event
  • We can use another file to represent the queue for updating the blocklogs.
  • We can use an FSNotifier (e.g. the fsnotify library) to detect when logs are updated and minimize the latency of reprocessing; see the sketch after this list.
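A rough sketch of the fsnotify piece, assuming all log files live in one directory and treating the channel as the in-memory queue (the directory path and queue handling are illustrative only):

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchLogs pushes the path of any log file that was written to onto the
// queue channel so a worker can reprocess just that file.
func watchLogs(logDir string, queue chan<- string) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add(logDir); err != nil {
		return err
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			// A write or create means the file may contain new log entries.
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				queue <- event.Name
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("watcher error: %v", err)
		}
	}
}
```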

A couple of things to consider:

  • We have multiple RunMe servers, each of which will have its own log file
  • So we can't simply process the files sequentially based on their lexicographical ordering
  • It looks like the first log message in the RunMe logs contains the address of the server
{"level":"info","ts":1716428176.536521,"caller":"cmd/server.go:98","function":"github.com/stateful/runme/v3/internal/cmd.serverCmd.func2","msg":"started listening","addr":"127.0.0.1:7890"} 
  • The address is specified on the command line
/Users/jlewi/git_runme/runme server --address localhost:7868 --runner --ai-logs=true --tls /Users/jlewi/.vscode/extensions/stateful.runme-3.5.8-darwin-arm64/tls
  • So we could potentially use the address to determine if the process associated with a runner is still running.

  • That seems very brittle. I think if we start looking at process information we get into OS-specific issues.

  • An alternative approach would be to use the file modification time to determine whether a file has any entries that still need to be processed (sketched below).
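A sketch of that check, assuming we record the time of the last successful processing pass per file (the `lastProcessed` bookkeeping is hypothetical):

```go
package main

import (
	"os"
	"time"
)

// needsProcessing reports whether a log file may contain entries newer than
// the last time we processed it, based purely on the file's mtime.
func needsProcessing(path string, lastProcessed time.Time) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	// If the file hasn't been modified since our last pass, skip it.
	return info.ModTime().After(lastProcessed), nil
}
```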

@jlewi have you considered streaming the logs as events Runme -> Foyle? Runme is already using "subscription-like" services that unidirectionally stream events, e.g. https://github.com/stateful/runme/blob/main/pkg/api/proto/runme/runner/v1/runner.proto#L379

I guess the important question is how important order and loss are. I can't imagine learning doing well with unreliable delivery.

Another, perhaps simpler, approach could be to use a subscription to notify Foyle about the availability of learning logs and where they reside, instead of streaming the logs themselves.

Just some ideas.

I hadn't thought of direct communication. The original design (prior to RunMe) was based on using structured logging as a simple way of separating how an application gets instrumented from how data gets processed and stored. This was described in a bit more detail in this post.

I guess the important question is how important order and loss are. I can't imagine learning doing well with unreliable delivery.

That's right.

I think the semantics we want are

  • At least once delivery
  • Eventually consistent

I think we can model the problem as a streaming problem (à la Dataflow). That is, we have multiple sources of events; we want to combine them into a single stream and then join all the events using the cell ID as the key.
Redoing work is fine because learning is idempotent, so other than wasting CPU and LLM credits it's fine.
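As a very rough illustration of that model (the `Event` type and its `CellID` field are placeholders, not Foyle's actual log schema): fan the per-file event streams into one channel, then group events by cell ID before handing each group to learning.

```go
package main

import "sync"

// Event is a placeholder for a parsed log entry; the real entries would
// carry the trace/block information Foyle already logs.
type Event struct {
	CellID  string
	Payload string
}

// merge fans in multiple event sources into a single stream.
func merge(sources ...<-chan Event) <-chan Event {
	out := make(chan Event)
	var wg sync.WaitGroup
	wg.Add(len(sources))
	for _, src := range sources {
		go func(c <-chan Event) {
			defer wg.Done()
			for e := range c {
				out <- e
			}
		}(src)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// groupByCell joins all events for the same cell ID; because learning is
// idempotent, rebuilding a group from scratch on reprocessing is fine.
func groupByCell(events <-chan Event) map[string][]Event {
	groups := map[string][]Event{}
	for e := range events {
		groups[e.CellID] = append(groups[e.CellID], e)
	}
	return groups
}
```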

What we want to minimize is dropped events; i.e. suppose the user does the following

  • Generates a cell
  • Executes the cell
  • Corrects the cell
  • Executes the cell

If we miss the second execution then the AI would use the first execution and potentially learn the mistake.

Even that isn't so bad because ultimately we need to treat executions as a noisy signal anyway and design systems that are robust to that.

The biggest issue is that, from a debugging/troubleshooting perspective, if our processed logs aren't reliable (e.g. we drop one or more executions), we make debugging much harder because we don't know whether a missing event means it didn't happen or is a result of unreliability in our log processing.

So for this reason, I think we want to persist log events to durable storage and then design a level-based, eventually consistent processing pipeline around that.
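A minimal sketch of that shape, assuming JSON-lines events appended to a durable file and a level-based pass that simply rebuilds the derived state from the whole log on each run (the file layout and `Event` type are illustrative):

```go
package main

import (
	"bufio"
	"encoding/json"
	"os"
)

type Event struct {
	CellID  string `json:"cellId"`
	Payload string `json:"payload"`
}

// appendEvent durably records an event; the log is append-only so replay
// is always possible for debugging.
func appendEvent(path string, e Event) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	data, err := json.Marshal(e)
	if err != nil {
		return err
	}
	_, err = f.Write(append(data, '\n'))
	return err
}

// rebuildState is the level-based step: reread everything and recompute the
// per-cell view. Because learning is idempotent this converges to the same
// result no matter how often it runs.
func rebuildState(path string) (map[string][]Event, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	state := map[string][]Event{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var e Event
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue // tolerate partial/corrupt lines
		}
		state[e.CellID] = append(state[e.CellID], e)
	}
	return state, scanner.Err()
}
```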

So for this reason, I think we want to persist log events to durable storage and then design a level-based, eventually consistent processing pipeline around that.

Makes sense. In my last largish dataflow project, we wrote all events to a GCS bucket (avro/proto-encoded), as a separate subscriber. It was worth its weight in gold for debugging/development purposes to be able to replay it through Beam pipelines at any time.

That being said, perhaps Runme can publish events about user-related notebook/command activity. Foyle could subscribe to those and decide, with increasingly sophisticated logic (e.g. a simple debounce+timeout at first), when to schedule learning. This would continue to separate concerns with low coupling, as one would if Pub/Sub, SQS, Kafka, or any of these producer/consumer architectures came into play server-side.
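A sketch of the debounce+timeout idea on the Foyle side (the event channel and the durations are placeholders): wait for a quiet period after the last activity event, but never wait longer than a hard deadline before triggering learning.

```go
package main

import "time"

// scheduleLearning triggers the learning pipeline once activity has been
// quiet for `quiet`, or after `maxWait` of continuous activity, whichever
// comes first.
func scheduleLearning(events <-chan struct{}, quiet, maxWait time.Duration, trigger func()) {
	var (
		quietC   <-chan time.Time
		deadline <-chan time.Time
	)
	for {
		select {
		case _, ok := <-events:
			if !ok {
				return
			}
			// Restart the quiet-period timer on every event.
			quietC = time.After(quiet)
			// Start the max-wait deadline the first time we see activity.
			if deadline == nil {
				deadline = time.After(maxWait)
			}
		case <-quietC:
			trigger()
			quietC, deadline = nil, nil
		case <-deadline:
			trigger()
			quietC, deadline = nil, nil
		}
	}
}
```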

@sourishkrout

we wrote all events to a GCS bucket (avro/proto-encoded), as a separate subscriber

Is the logging framework already doing that? Within the RunMe process the logger library has a publisher-subscriber type architecture, where the publisher is the "main thread" writing log messages and the subscriber is a background thread writing them to their destination.

The current implementation writes to a local file, but we could also use that to stream to GCS/Kafka/etc.
Here's a diagram.
[logging diagram]
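For example, a sketch of how a zap logger could tee every entry to the local file and to a second, remote-facing sink; the `forwardingSyncer` here is hypothetical, not something RunMe implements today:

```go
package main

import (
	"os"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// forwardingSyncer is a hypothetical WriteSyncer that could push each log
// entry to a remote destination (GCS, Kafka, a gRPC stream, ...).
type forwardingSyncer struct{}

func (forwardingSyncer) Write(p []byte) (int, error) {
	// TODO: publish p to the remote destination.
	return len(p), nil
}

func (forwardingSyncer) Sync() error { return nil }

// newLogger tees structured log entries to the local file and the remote sink.
func newLogger(logFile *os.File) *zap.Logger {
	enc := zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig())
	fileCore := zapcore.NewCore(enc, zapcore.AddSync(logFile), zapcore.InfoLevel)
	remoteCore := zapcore.NewCore(enc, forwardingSyncer{}, zapcore.InfoLevel)
	return zap.New(zapcore.NewTee(fileCore, remoteCore))
}
```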

IIUC, are you proposing that, instead of using the logger framework, we implement our own gRPC streaming protocol to stream from the RunMe process to a subscriber process, which would then persist the logs?

@jlewi

we wrote all events to a GCS bucket (avro/proto-encoded), as a separate subscriber

Is the logging framework already doing that? Within the RunMe process the logger library has a publisher-subscriber type architecture, where the publisher is the "main thread" writing log messages and the subscriber is a background thread writing them to their destination.

No. Runme is not already doing that. However, I believe you're right that the logger (zap) is likely capable of streaming to a variety of destinations (subscribers). When I was talking about the GCS bucket, I was talking about a previous, not directly related, project.

The current implementation writes to a local file, but we could also use that to stream to GCS/Kafka/etc.

IIUC, are you proposing that, instead of using the logger framework, we implement our own gRPC streaming protocol to stream from the RunMe process to a subscriber process, which would then persist the logs?

No, quite the contrary. I am suggesting keeping the file-based logging for the AI/learning logs as implemented. The idea was to notify Foyle (pub/sub) out-of-band about usage-related events, say as a gRPC streaming service, that could be leveraged on the Foyle side to conclude when retraining is necessary. E.g., Runme has a MonitorEnvStore which emits a redacted snapshot of the ENV (prototype) anytime a cell run completes, for the extension to display the snapshot. Runme could publish events about edits, saves, cells generated, etc. Not sure if that'd be helpful to make learning continuous, though. Just an idea.
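Purely to make that concrete, a hypothetical Go-side shape for such a subscription (none of these types exist in Runme or Foyle today): Foyle consumes a stream of coarse notebook events and feeds them into whatever scheduling logic it uses, e.g. the debounce sketch above.

```go
package main

import "context"

// NotebookEventType enumerates coarse, usage-level events Runme could publish.
type NotebookEventType int

const (
	CellGenerated NotebookEventType = iota
	CellEdited
	CellExecuted
	NotebookSaved
)

// NotebookEvent is a hypothetical out-of-band notification; it points at
// where the learning logs live rather than carrying the logs themselves.
type NotebookEvent struct {
	Type   NotebookEventType
	CellID string
	LogDir string
}

// UsageSubscriber is the interface Foyle would consume; a gRPC
// server-streaming RPC could back it.
type UsageSubscriber interface {
	Subscribe(ctx context.Context) (<-chan NotebookEvent, error)
}

// consume forwards events into a scheduler (e.g. a debounce+timeout loop).
func consume(ctx context.Context, sub UsageSubscriber, schedule func(NotebookEvent)) error {
	events, err := sub.Subscribe(ctx)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case e, ok := <-events:
			if !ok {
				return nil
			}
			schedule(e)
		}
	}
}
```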
