
Make stream_token can break prefill, too #94

Closed

Conversation

@zeerd (Contributor) commented Mar 12, 2024

When we give gemma a large input (e.g. a file), we need a way to stop it.

@austinvhuang (Collaborator) left a comment

Thanks, the need to be able to exit makes sense.

Probably beyond the scope of this PR, but I wonder if in the long term this starts to get messy trying to track this in stream_token. It's certainly doable, but there's probably a breaking point where stream_token is trying to accomplish too much control-flow logic.

There are some related design choices around customizable hooks/callbacks in the context of trainer abstractions (e.g. in PyTorch Lightning); it might be worth looking at how other frameworks deal with this.

Specific tactical items to get this merged:

  • Can you use a separate variable to track the stream_token return value instead of overwriting token? (unless there's a rationale I'm not seeing)
  • Also, may need to merge/test with the latest state of the dev branch.

gemma.cc (Outdated)

@@ -529,7 +529,7 @@ void GenerateImpl(GemmaImpl<TConfig>& gemma, const InferenceArgs& args,

   // Prefill stops before prompt.size() - 1 since the last prompt token is the
   // first input token for generation.
-  while (pos_offset < prompt.size() - 1) {
+  while (pos_offset < prompt.size() - 1 && token != EOS_ID) {
A Collaborator commented on the diff:

Is there a reason to override the token value vs. tracking the stream_token return state with its own bool variable? Keeping the two forms of state separate seems easier to reason about than having to track "Is token EOS_ID because that's what was generated, or because stream_token returned false?"
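For illustration, a minimal sketch of the separate-variable approach being suggested (it assumes the GenerateImpl loop context from the diff above, i.e. prompt, pos, pos_offset, and a bool-returning stream_token; names are hypothetical and may differ from the actual code):

  // Sketch only: relies on the surrounding GenerateImpl locals from the diff.
  bool keep_prefilling = true;
  while (pos_offset < prompt.size() - 1 && keep_prefilling) {
    const int prompt_token = prompt[pos_offset];
    // ... run the prefill step for prompt_token ...
    // Track the callback's verdict separately, so `token` still only ever
    // means "what the model emitted", never "the caller asked to stop".
    keep_prefilling = stream_token(prompt_token, 0.0f);
    ++pos;
    ++pos_offset;
  }
  if (!keep_prefilling) return;  // caller requested an early exit during prefill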

@zeerd (Contributor, Author) replied:

You are right. I wanted to change as little code as possible, but that is not a meaningful reason here.
I will think about it some more.

@zeerd (Contributor, Author) commented Mar 22, 2024

In fact, I don't like returning in the middle of a function. However, it also seems strange to wrap the second half of the entire function in a conditional.

On the other hand, I would suggest splitting GenerateImpl into two functions, but that might be a big change.

@jan-wassenberg (Member) replied:

Yes, GenerateImpl does not fit on a single screen and I agree it would make sense to split into two functions.
We just did that with Attention as well. Would you prefer lambdas or actual functions? A function probably only makes sense if there are relatively few parameters.

@zeerd (Contributor, Author) commented Mar 25, 2024

I don't have much knowledge about neural networks, but I am thinking about a use case:

There is a source file which has 5 functions. I want gemma to help me understand what the 3rd function does.

We know that if gemma reads the whole file first, it can summarize the 3rd function better.

So, maybe I could do:

  • 1st, let gemma read the whole file but produce no output (to save time, as I do not care about that output).
  • 2nd, let gemma read the 3rd function again, summarize it, and output the summary.

I want to know: if we split GenerateImpl into two parts, would that help me implement this case?

If the answer is no (or if it would not help me save time), I think a lambda is enough.
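Purely to illustrate the "read but do not output" half of this use case (a hypothetical caller-side helper, assuming the bool(int, float) callback shape used around the diff above; not gemma.cpp's actual API): the stream_token callback can stay silent while the prompt is being consumed and still return false to abort, which is what this PR makes prefill honor.

  #include <atomic>
  #include <cstddef>
  #include <cstdio>
  #include <functional>
  #include <memory>

  // Hypothetical helper: prints nothing for roughly the first prompt_size
  // calls (the prompt being prefilled), prints raw token ids afterwards (a
  // real caller would detokenize), and returns false once `stop` is set,
  // which this PR now also honors during prefill.
  std::function<bool(int, float)> MakeStreamCallback(size_t prompt_size,
                                                     const std::atomic<bool>& stop) {
    auto seen = std::make_shared<size_t>(0);
    return [seen, prompt_size, &stop](int token, float /*prob*/) -> bool {
      if (++*seen >= prompt_size) {
        std::printf("%d ", token);  // past the prompt: show output
      }
      return !stop.load();  // false asks generation (and now prefill) to stop
    };
  }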

@jan-wassenberg (Member) replied:

Hm, I do not believe splitting the function up helps this case specifically. GenerateImpl consists of the prefill loop, bounds checks, then the decode loop. Splitting it up could improve readability and make the "early return" in this pull request easier to understand. We can see how it looks as a lambda :)
