new datasets + query reformulators #24

Open
wants to merge 6 commits into
base: main

@AlexandreMisrahi2005 (Collaborator) commented Sep 16, 2024

  • new datasets: bioasq-pubmed, coderagbench (humaneval + mbpp), apibench, syllabusqa
  • add query generators/reformulators, which reformulate the query using a generator (@nadiinchi)
  • update generators (wrapper to load GGUF-format models + wrapper to load a local model)
  • add some assertions (check that labels are lists and ids are strings)
  • scripts to generate oracles (bioasq + humaneval)
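
For reviewers, here is a minimal sketch of the kind of format check the assertions bullet describes. It is an illustration only, not the code in this PR: the function names and the `id` / `label` field names are assumptions.

```python
def check_example_format(example: dict) -> None:
    """Assert that one example has a string id and a list of labels.

    The field names "id" and "label" are assumed for illustration.
    """
    assert isinstance(example["id"], str), (
        f"id must be a string, got {type(example['id']).__name__}"
    )
    assert isinstance(example["label"], list), (
        f"label must be a list, got {type(example['label']).__name__}"
    )


def check_dataset_format(examples) -> int:
    """Check every example and return how many were checked."""
    count = 0
    for example in examples:
        check_example_format(example)
        count += 1
    return count
```

Failing early on a non-string id or a bare-string label avoids silent mismatches later, for instance when a metric iterates over a label string character by character.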

Please check utils.py and modules/metrics.py in particular, as they contain some potentially critical changes.

…h, syllabusqa), update generators (wrap gguf format, local model), update query generators, assertions (label format, id format), gen-oracles scripts (bioasq + humaneval)
@AlexandreMisrahi2005 AlexandreMisrahi2005 marked this pull request as ready for review September 16, 2024 14:52
@nadiinchi (Collaborator)

My review

  • create a folder in config/datasets for all the new datasets, e.g. multi-domain?
  • same suggestion for config/prompts
  • put all the new domain dataset processors into a separate file in processors, as is done for KILT? (I'd suggest this for both datastores and queries)
  • code rag bench: add separate configs per doc dataset, because they need to be run separately before using MergedDocDataset
  • remove the config config/generator/llama-3-8b-instruct-FT-NQ-2epochs.yaml
  • config/query_generator/unfold_api_query.yaml: are all the commented-out parts needed?
  • train code: is it possible to make wandb support optional? For now the import is mandatory. Alternatively, discuss this with the other contributors
  • train code: could we make num_saving_steps a config argument rather than hardcoded?

- created a multidomain folder in config/datasets and config/prompts containing all new datasets
- moved the dataset processors to a separate multidomain processor file and tested them
- code rag bench: added and tested separate configs per doc dataset
- removed the local model checkpoint config
- removed the comments in config/query_generator/unfold_api_query.yaml
- added and tested optional wandb support + added training documentation on how to use / disable wandb
- num_saving_steps is now a config argument
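
As a rough illustration of the last two items, the sketch below shows one way to keep wandb optional and to read num_saving_steps from the config. It is not the code in these commits: `TrainConfig`, `use_wandb`, `resolve_logger` and `saving_interval` are hypothetical names, and treating num_saving_steps as the number of checkpoints saved over a run is an assumption.

```python
import importlib.util
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Hypothetical config fields, named for illustration only.
    num_saving_steps: int = 10  # assumed meaning: checkpoints saved per run
    use_wandb: bool = False


def wandb_available() -> bool:
    """Return True if the wandb package can be imported, without importing it."""
    return importlib.util.find_spec("wandb") is not None


def resolve_logger(cfg: TrainConfig) -> str:
    """Pick "wandb" only when it is both requested and installed."""
    if cfg.use_wandb and wandb_available():
        return "wandb"
    return "none"


def saving_interval(total_steps: int, cfg: TrainConfig) -> int:
    """Steps between checkpoints, under the assumed meaning of num_saving_steps."""
    return max(1, total_steps // cfg.num_saving_steps)
```

With this shape, training still runs on a machine without wandb installed, and the checkpoint frequency can be changed from the config rather than by editing the training script.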

TODO: debug ignored error in vllm -> __del__ -> gc.collect()
@AlexandreMisrahi2005 (Collaborator, Author)

All comments from @nadiinchi have been addressed in the new commits.
