Make stream_token can break prefill, too #94
Conversation
Thanks, the need to be able to exit makes sense.
Probably beyond the scope of this PR, but I wonder if in the long term it gets messy trying to track this in stream_token. It's certainly doable, but there's probably a breaking point where stream_token is trying to accomplish too much control-flow logic.
There are some related design choices around customizable hooks/callbacks in the context of trainer abstractions (e.g. in PyTorch Lightning); it might be worth looking at how other frameworks deal with this.
Specific tactical items to get this merged:
- Can you use a separate variable to track the stream_token return value instead of overwriting token? (unless there's a rationale I'm not seeing)
- Also, may need to merge/test with the latest state of the dev branch.
gemma.cc
Outdated
@@ -529,7 +529,7 @@ void GenerateImpl(GemmaImpl<TConfig>& gemma, const InferenceArgs& args,
   // Prefill stops before prompt.size() - 1 since the last prompt token is the
   // first input token for generation.
-  while (pos_offset < prompt.size() - 1) {
+  while (pos_offset < prompt.size() - 1 && token != EOS_ID) {
Is there a reason to override the token value vs. tracking the stream_token return state with its own bool variable? Keeping the two forms of state separate seems easier to reason about than having to ask "Is token EOS_ID because that's what was generated, or because stream_token returned false?"
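For illustration, here is a minimal sketch of the separate-variable approach being suggested. The type names (StreamFunc, the float probability argument) and the loop shape are assumptions for the sketch, not the actual gemma.cc code:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the real gemma.cc types; the actual
// signatures may differ.
constexpr int EOS_ID = 1;
using StreamFunc = std::function<bool(int token, float prob)>;

// Prefill loop that keeps the stream_token return value in its own
// bool instead of overwriting `token`, so that "token == EOS_ID" can
// only mean the token really was EOS. Assumes a non-empty prompt.
size_t Prefill(const std::vector<int>& prompt, const StreamFunc& stream_token) {
  size_t pos_offset = 0;
  bool keep_going = true;  // separate state for the callback's decision
  while (pos_offset < prompt.size() - 1 && keep_going) {
    const int token = prompt[pos_offset];
    keep_going = stream_token(token, 1.0f);  // caller may ask us to stop
    ++pos_offset;
  }
  return pos_offset;  // how many prompt tokens were actually prefilled
}
```

With this shape, a callback that returns false aborts prefill without ever touching the token values themselves.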
You are right. I wanted to change as little code as possible, but that's not a meaningful reason on its own. I will think about it some more.
In fact, I don't like returning from the middle of a function. On the other hand, I would recommend splitting GenerateImpl into two functions.
Yes, GenerateImpl does not fit on a single screen and I agree it would make sense to split it into two functions.
I don't know much about neural networks. Suppose there is a source file with 5 functions, and I want gemma to help me understand what the 3rd function does. We know that if gemma reads the whole file first, it can summarize the 3rd function better. So, maybe I could do:
I want to know: if we split GenerateImpl into two parts, would that help me implement this case? If not (or if it would not save me time), I think a lambda is enough.
Hm, I do not believe splitting the function up helps this case specifically. GenerateImpl consists of the prefill loop, bounds checks, then the decode loop. Splitting it up could improve readability and make the early return in this pull request easier to understand. We can see how it looks as a lambda :)
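A rough sketch of what the lambda option could look like, with the prefill phase factored into a local lambda so the early exit lives in one place. Everything here (the types, the stubbed decode loop) is hypothetical scaffolding for illustration, not the real GenerateImpl:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

constexpr int EOS_ID = 1;  // assumed sentinel value for the sketch
using StreamFunc = std::function<bool(int token, float prob)>;

// Hypothetical split: prefill as a local lambda, then the decode loop.
std::vector<int> GenerateImpl(const std::vector<int>& prompt,
                              const StreamFunc& stream_token) {
  size_t pos_offset = 0;
  bool keep_going = true;

  // Prefill: feed all prompt tokens except the last, which seeds decode.
  auto prefill = [&] {
    while (pos_offset < prompt.size() - 1 && keep_going) {
      keep_going = stream_token(prompt[pos_offset], 1.0f);
      ++pos_offset;
    }
  };
  prefill();
  if (!keep_going) return {};  // caller aborted during prefill

  // Decode (stubbed): a real model would sample tokens here; we just
  // emit a fixed continuation until EOS or the callback stops us.
  std::vector<int> out;
  for (int token : {42, 43, EOS_ID}) {
    if (!stream_token(token, 1.0f)) break;
    if (token == EOS_ID) break;
    out.push_back(token);
  }
  return out;
}
```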
When we give gemma a large input (e.g. a file), we need a way to stop it.
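As a usage sketch, a hypothetical callback with a token budget shows how returning false from stream_token could cut off a large input during prefill. The BudgetedStream name and budget logic are invented for illustration:

```cpp
// Hypothetical callback with a token budget: once `budget` tokens have
// been streamed, it returns false, which (with this PR) stops prefill
// as well as generation.
struct BudgetedStream {
  int budget;
  int used = 0;
  bool operator()(int /*token*/, float /*prob*/) { return ++used <= budget; }
};

// Simulate streaming `n` prompt tokens through the callback; returns
// how many were accepted before the budget cut things off.
int RunPrefill(int n, BudgetedStream& stream) {
  int accepted = 0;
  for (int t = 0; t < n; ++t) {
    if (!stream(t, 1.0f)) break;
    ++accepted;
  }
  return accepted;
}
```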