
Bug 2010698 - add a basic concept interface for an LLM #45

Closed
MatthewTighe wants to merge 1 commit into mozilla-firefox:autoland from MatthewTighe:s2s-concept

Conversation

@MatthewTighe

No description provided.

@github-actions

Warning

The base branch is currently set to main. Please Edit this PR and set the base to autoland.

@github-actions

View this pull request in Lando to land it once approved.

@segunfamisa

@MatthewTighe I think this PR should be targeting autoland

Comment on lines 34 to 70
FeatureStatus.DOWNLOADING -> {
    Llm.Response.InProgress("already downloading")
}
FeatureStatus.DOWNLOADABLE -> {
    val result = model.download().onEach {
        logger("Download update: $it")
        yield()
    }.first { status ->
        status == DownloadStatus.DownloadCompleted || status is DownloadStatus.DownloadFailed
    }

    if (result is DownloadStatus.DownloadFailed) {
        Llm.Response.Failure("download failed")
    } else {
        model.getPromptResponse(prompt)
    }
@MatthewTighe (author):

I wasn't exactly sure what to do here - in our use case we will probably intentionally block summarization from calling prompt twice but we can't guarantee it. Without knowing more about how model.download() functions, I didn't feel confident trying to share the flow between the DOWNLOADING and DOWNLOADABLE cases. For example, if it's called twice, does it avoid interrupting an existing download? Basically, I had been considering something like

Suggested change
Current:
FeatureStatus.DOWNLOADING -> {
    Llm.Response.InProgress("already downloading")
}
FeatureStatus.DOWNLOADABLE -> {
    val result = model.download().onEach {
        logger("Download update: $it")
        yield()
    }.first { status ->
        status == DownloadStatus.DownloadCompleted || status is DownloadStatus.DownloadFailed
    }
    if (result is DownloadStatus.DownloadFailed) {
        Llm.Response.Failure("download failed")
    } else {
        model.getPromptResponse(prompt)
    }

Suggested:
FeatureStatus.DOWNLOADING -> {
    model.resumeDownload(prompt)
}
FeatureStatus.DOWNLOADABLE -> {
    model.resumeDownload(prompt)
}
...
suspend fun GenerativeModel.resumeDownload(prompt: Prompt): Llm.Response {
    // if a download flow is already active, cancel it and return Llm.Response.InProgress
    // otherwise, do the same thing we're doing above
    val result = download().onEach {
        logger("Download update: $it")
        yield()
    }.first { status ->
        status == DownloadStatus.DownloadCompleted || status is DownloadStatus.DownloadFailed
    }
    return if (result is DownloadStatus.DownloadFailed) {
        Llm.Response.Failure("download failed")
    } else {
        getPromptResponse(prompt)
    }
}

but I don't think we have a good mechanism for canceling the in-progress Flow

@segunfamisa:

I believe that cancelling the job or the scope in which we start observing the flow should cancel the request.

We can verify this with our PoC branch. I think we should expect to see that it will cancel.
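
For reference, a minimal standalone sketch (fakeDownload() is a hypothetical stand-in for model.download(), not a real API) showing that cancelling the collecting job also cancels the cold flow's producer:

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Hypothetical stand-in for model.download(): a cold flow that emits progress updates.
fun fakeDownload(): Flow<Int> = flow {
    repeat(100) { pct ->
        emit(pct)
        delay(50)
    }
}

fun main() = runBlocking {
    val scope = CoroutineScope(Dispatchers.Default)
    val job = scope.launch {
        fakeDownload().collect { println("Download update: $it%") }
    }
    delay(200)
    // Cancelling the collecting job (or its scope) cancels the flow's producer too.
    job.cancel()
    job.join()
    println("download collection cancelled")
}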

@MatthewTighe

I went ahead and added an implementation as we discussed on Slack. None of what's here is necessarily expected to survive first contact with how we actually want to use it, but it felt like a good starting place.

@MatthewTighe changed the base branch from main to autoland on January 28, 2026 at 00:44

@segunfamisa left a comment:

Thanks for this PR. We looked at this at today's review-n-brew and there were some questions, some I could answer, and some I couldn't.

In the end, it turned out to be a good idea to try to work out an implementation alongside the interface we thought about.

I posted a mix of those comments with my own review comments. Let me know your thoughts.

/**
 * A prompt request delivered to the LLM for inference.
 */
suspend fun prompt(prompt: Prompt): Response
@segunfamisa:

Q from review-n-brew:

We think that the prompt streams the response, and we should consider making this a Flow since we may get more than one response
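
For illustration, one possible Flow-based shape (Prompt and Response here are simplified stand-ins, not the actual types in this patch):

import kotlinx.coroutines.flow.Flow

// Simplified stand-in types for illustration only.
data class Prompt(val text: String)

sealed interface Response {
    data class Chunk(val text: String) : Response   // partial output as it streams
    data class Failure(val reason: String) : Response
}

interface StreamingLlm {
    // Emits partial responses as the model produces them; the flow completes
    // when inference finishes (or after emitting a Failure).
    fun prompt(prompt: Prompt): Flow<Response>
}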

@segunfamisa:

I looked in the docs, and saw that the Gemini Nano one has both a streaming and non-streaming API.

I think our LLM abstraction should accommodate both, but if we just want to simplify things in the beginning, then we need to explicitly state the behavior of the prompt() function in the doc comment - whether it returns only when the request is complete, or whether it has streaming capability.
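
A rough sketch of an interface that accommodates both, with the behavior spelled out in doc comments (names and stand-in types are illustrative, not from this patch):

import kotlinx.coroutines.flow.Flow

// Stand-in types for illustration.
data class Prompt(val text: String)
data class Response(val text: String)

interface Llm {
    /** Suspends until inference is complete and returns the full response; no partial output. */
    suspend fun prompt(prompt: Prompt): Response

    /** Emits partial responses as they are produced; completes when inference finishes. */
    fun promptStreaming(prompt: Prompt): Flow<Response>
}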

@segunfamisa:

I really think we should consider using the streaming response, because with the non-streamed version we will have to wait until the model is done with inference before the user sees anything, and then we start to fake a "streaming" UI.

That's going to feel verrrryyyy slow.

FeatureStatus.DOWNLOADING -> {
    Llm.Response.InProgress("already downloading")
}
FeatureStatus.DOWNLOADABLE -> {
@segunfamisa:

Q from review-n-brew:

How is this going to be represented to the user? Especially since there will be a download state.

 * inference.
 */
class GeminiNanoLlm(
    private val buildModel: () -> GenerativeModel = { Generation.getClient() },
@segunfamisa, Jan 28, 2026:

I think this way, we stand the risk of calling buildModel more than once, and it seems like it could be expensive.

I think we could create a lazy member that calls it, to ensure we only build the model once.

class GeminiNanoLlm(....) {

    // buildModel() is invoked at most once, on first access of `model`.
    private val model by lazy { buildModel() }
}

Comment on lines 38 to 43
val result = model.download().onEach {
    logger("Download update: $it")
    yield()
}.first { status ->
    status == DownloadStatus.DownloadCompleted || status is DownloadStatus.DownloadFailed
}
@segunfamisa:

Some questions that came out:

  1. How long does this download take? Do we know the size of the download?
  2. Do we need to communicate to the user that we are going to need to download the model?
  3. Do we need to do things like ensure we're on WiFi or an unmetered network before we do a download?
  4. At what point in the user journey do we get here?

If we are going to need to let the user know that we are going to download, then we might need an intermediate state where we ask the user for permission, or some other way to communicate that we will need to download the models.

It seems we may need a sequence diagram to surface the user flow for us, but that may depend on UX.

@MatthewTighe (author):

This is a question I was grappling with quite a bit yesterday, and it was the motivation behind the Preparing availability status. Each of our implementations will have intermediate states, but they are different.

For Nano, this will be downloading the model.
For MLPA, this will be authentication.

These are probably both worth visualizing to the user, but that requires abstracting over some kind of visualization or messaging if we want to share a common API. I think for now we can just expose strings (resources, once we have copy) to distinguish between the different types of preparing.

I agree that we want user consent for network downloading (or at least strong network detection), but I think we could probably handle it in a follow-up and should get UX/product involvement.
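
For example, a minimal sketch of what that could look like (names are illustrative, not the actual types in this patch):

sealed interface AvailabilityStatus {
    object Available : AvailabilityStatus
    object Unavailable : AvailabilityStatus

    // Nano would report something like "downloading model"; MLPA "authenticating".
    data class Preparing(val detail: String) : AvailabilityStatus
}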

@MatthewTighe (author):

I've engaged UX for this in our Slack channel


/**
 * An abstract definition of an LLM that can receive prompts.
 */
interface Llm {
@segunfamisa:

I think we should consider adding the ability to warm up and close the LLM for cleaning up resources.
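
Something along these lines (illustrative additions only; the existing prompt() is elided):

interface Llm {
    // ... existing prompt() elided ...

    /** Eagerly prepares the underlying model so the first prompt is fast. */
    suspend fun warmUp()

    /** Releases any resources held by the underlying model. */
    fun close()
}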

@MatthewTighe (author):

Can we handle this in a follow-up if and when it becomes necessary? Or do you see a need for it now?

/**
 * A prompt request delivered to the LLM for inference.
 */
suspend fun prompt(prompt: Prompt): Response
@segunfamisa:

I feel like the naming of this API could box us in.

For instance, if we decide to pass various configurations into the LLM, then the parameter name prompt no longer holds true.

I think a Request is generic enough to hold a prompt and future extensions.

@MatthewTighe (author), Jan 28, 2026:

Would it make sense to extend the API at that point, rather than try to plan for the future? Or perhaps prompt means something specific that I'm not understanding? I take it to mean any arbitrary string instruction input into an LLM.

For example, I can imagine a future where:

interface Llm {
    suspend fun prompt(prompt: Prompt)
    
    suspend fun request(request: Request)

    data class Request(val startingImage: Image, val config: Config, val prompt: Prompt)
}

I am open to the suggestion; I guess I'm just not fully understanding what use cases you foresee us not being able to accommodate down the line.

@MatthewTighe (author):

Overall, I think handling arbitrary string prompts is already pretty open, and we can build additional functionality (image processing, etc.) on top. I can't immediately think of a way to limit the prompt to only do summarization while still keeping the underlying types abstract.

@segunfamisa:

That makes sense. I think it's fine to leave it as is. It's not a strongly held thought from my end. Just wanted to float the idea.

/**
 * An abstract definition of an LLM that can receive prompts.
 */
interface Llm {
@segunfamisa:

The more I think about it, the more the name Llm suggests something much more abstract than what the rest of the class accommodates.

For example, an LLM can respond to images, generate content, proofread, etc., but the GeminiNanoLlm only does summarization. If we needed a proof-reading Gemini Nano implementation, we would not be able to use that same class, despite it being a GeminiNanoLlm.

For the specifics of our case, I think what we are really doing is using its ability to summarize, and I think the name Summarizer is more fitting.

That way, the Gemini Nano one can be named OnDeviceGeminiSummarizer.

Future developers who are using other capabilities of an LLM will then not assume they can use this just because it's named an Llm, and the usage will be specific enough.

My proposal is something like:

@JvmInline
value class SummarizationRequest(val content: String)

interface Summarizer {

    /** This does an async summarization request. It does *not* stream the response */
    suspend fun summarize(request: SummarizationRequest): Result<SummarizationResponse>
}

@MatthewTighe (author):

For example, an LLM can respond to images, generate content, proofread, etc., but the GeminiNanoLlm only does summarization.

Technically, GeminiNanoLlm as written will handle any text prompt, so it could be used to generate content, proofread, etc. Yesterday I was wondering if it makes sense to build an even more opinionated API on top of it to handle summarization specifically. We will want to construct the actual Llm somewhere. I'm thinking now that our planned LlmProvider might be the place to do that, and that we may need another abstraction layer between them. Something like:

// stick this in feature-summarizer
class Summarizer(val llm: Llm) {
    private fun buildPrompt(content: String) =
        Prompt("Summarize the following, using all these rules.......: $content")

    suspend fun summarize(content: String) =
        llm.prompt(buildPrompt(content))
}

object SummarizationLlmProvider {
    fun buildSummarizer(config: Config): Summarizer {
        val llm = /* logic to determine nano or mlpa based on config */
        return Summarizer(llm)
    }
}

WDYT?

@segunfamisa:

Makes sense. I think I misunderstood the Gemini API at the time I was reviewing. I thought we were using this one https://developers.google.com/ml-kit/genai/summarization/android that is explicitly a summarizer.

That's why I was struggling with the "Llm" abstraction, because I thought we were using the Summarizer model.

@segunfamisa:

That then makes me ask: why are we using "prompt" over "summarizer"? Do they use different models?

Or is it just a wrapper around the "prompt" model?

@MatthewTighe (author):

The "Summarizer" package strictly controls the output - IIRC you can only get output in a 3 bullet point format and you can't specify additional instructions. I've seen in one of the iOS briefs things like "for recipes, summarize this way..." etc.

The "prompt" model allows for a more unrestricted, conversational style of interaction.

}
}

override suspend fun checkAvailability(): Llm.AvailabilityStatus = when (buildModel().checkStatus()) {
@segunfamisa:

What is this API going to be used for? My suspicion is that it is for us to know whether this "LLM" has the capability we want, for example if we are running on a device that does not have Gemini Nano.

And are we going to do anything specifically if it's "Preparing"?

If we want to use it to know whether we have this capability or not, then I wonder whether all we need from checkAvailability() is a simple yes or no.


My line of questioning is edging towards the idea of unifying both APIs into one flow, such that the result here represents the "status" of the prompt request.

something like this:

[diagram: a single flow whose emissions represent the status of the prompt request]
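
In code, the idea might look roughly like this (state names and types are illustrative, not from this patch):

import kotlinx.coroutines.flow.Flow

// Stand-in type for illustration.
data class Prompt(val text: String)

// One flow carries both preparation status and the eventual result.
sealed interface PromptState {
    data class Preparing(val detail: String) : PromptState   // e.g. downloading model, authenticating
    data class Responding(val partial: String) : PromptState // streamed chunks, if supported
    data class Complete(val text: String) : PromptState
    data class Failed(val reason: String) : PromptState
}

interface Llm {
    /** Single entry point: callers observe preparation, progress, and the response from one flow. */
    fun prompt(prompt: Prompt): Flow<PromptState>
}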

@MatthewTighe (author):

I think I was just overthinking this yesterday - combining the two APIs makes total sense. I was trying to allow for a second API consumer to inspect the model state and proactively choose not to engage with it - like if it were unavailable or already processing - but a flow already accomplishes that. Thanks for such a helpful diagram!

@segunfamisa left a comment:

Thanks a lot for this patch, Matt.

I appreciate the effort and the conversations that have come out of it.

As we discussed on Slack, I think we have gotten the value we can get from review - and it's still heading in the direction of our design, though specifics will evolve as the feature gets more defined.

We can proceed now and iterate, instead of being blocked on figuring everything out in one go.

Thank you!

@MatthewTighe force-pushed the s2s-concept branch 3 times, most recently from cb72eaa to d06df93 on January 30, 2026 at 18:56
@lando-prod-mozilla

Pull request closed by commit 423a0aa
