scalapb/ScalaPB

Efficiently convert bytes to strings

jCalamari opened this issue · 6 comments

In ScalaPB, what's the best way of converting message A to message B, knowing that message A can contain thousands of elements? Ideally, I would like to just pass the bytes from A to B without copying/looping.

message A {
  repeated bytes a_strings = 1;
}
message B {
  repeated string b_strings = 1;
}

Naive approach would look as follows, however there is just too much copying/looping:

val a: A = ...
val b: B = B(a.aStrings.map(_.toStringUtf8))

More context would help here, since you haven't said where A comes from or how you want to use the Bs. A few thoughts:

  1. A and B have the same binary representation. So if you have A available in binary form, just parse it using B.parseFrom; you gain efficiency by never instantiating the As.
  2. Don't create messages of type B; convert the bytes to strings at the time of access. You could add a base trait to A that has a method like def getString(index: Int): String = aStrings(index).toStringUtf8

Hi @thesamet,

Thanks for the prompt response! Message A comes from an API response, and there is no way to avoid creating it. Sadly, A and B don't have the same binary representation (I formed the example wrong): they have many more fields, and their field IDs don't match. Since I am working within the akka-grpc ecosystem, delaying the creation of B is not an option. Is there a way to create an instance of B just by providing the bytes for the b_strings field?

Do you own the proto for B? If so, you can declare the field as bytes in the proto and, similar to (2) above, add a base trait that converts to strings only what you need. Creating this field is then just a matter of assigning a reference to the same Seq[ByteString].
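A minimal sketch of that suggestion, assuming protobuf-java is on the classpath (the ScalaPB runtime already depends on it). The case classes A and B are hand-written stand-ins for the generated classes, and the trait name StringAccess and the method stringsIterator are hypothetical. With real generated code, the trait would be attached through ScalaPB's message-level extends option rather than written on the class directly.

```scala
import com.google.protobuf.ByteString

// Hypothetical base trait: decodes UTF-8 only for the elements that are read.
trait StringAccess {
  def bStrings: Seq[ByteString]

  // Decode a single element at the time of access.
  def getString(index: Int): String = bStrings(index).toStringUtf8

  // Decode elements one by one as the iterator is consumed.
  def stringsIterator: Iterator[String] = bStrings.iterator.map(_.toStringUtf8)
}

// Stand-ins for the ScalaPB-generated classes (assumption: both fields are
// declared as `repeated bytes`, so both are Seq[ByteString]).
final case class A(aStrings: Seq[ByteString])
final case class B(bStrings: Seq[ByteString]) extends StringAccess

object Demo {
  // Building B reuses the Seq held by A: no loop and no per-element copy.
  def convert(a: A): B = B(bStrings = a.aStrings)
}
```

Each call to getString still allocates a new String for that one element; the saving comes from not decoding elements that are never read.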

This sounds promising. What would happen to already-compiled clients that expect repeated string but receive repeated bytes?

The binary representation is the same. Running clients will not be impacted when this change rolls out to a server.
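To illustrate why the two field types are interchangeable on the wire, here is a rough, standard-library-only sketch that hand-encodes a single length-delimited field. The object WireSketch and its helpers are hypothetical and are not part of ScalaPB or protobuf-java; they only mirror the wire layout (tag, length, payload) that both string and bytes use.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object WireSketch {
  // Encode a non-negative Int as a protobuf base-128 varint.
  def varint(value: Int): Array[Byte] = {
    require(value >= 0, "sketch only handles non-negative values")
    val out = Array.newBuilder[Byte]
    var v = value
    while (v >= 0x80) {
      out += ((v & 0x7f) | 0x80).toByte // low 7 bits, continuation bit set
      v >>>= 7
    }
    out += v.toByte
    out.result()
  }

  // One length-delimited field: tag, payload length, payload.
  // Both `string` and `bytes` use wire type 2.
  def lengthDelimited(fieldNumber: Int, payload: Array[Byte]): Array[Byte] =
    varint((fieldNumber << 3) | 2) ++ varint(payload.length) ++ payload

  // A string field carries its UTF-8 bytes as the payload.
  def encodeStringField(fieldNumber: Int, s: String): Array[Byte] =
    lengthDelimited(fieldNumber, s.getBytes(UTF_8))

  // A bytes field carries the raw bytes as the payload.
  def encodeBytesField(fieldNumber: Int, b: Array[Byte]): Array[Byte] =
    lengthDelimited(fieldNumber, b)
}
```

The encodings match only when the bytes are valid UTF-8. A client that still declares the field as string may reject payloads that are not, so the server should only put UTF-8 data into the field.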

Worked like a charm, thank you very much for your prompt responses!