new datasets + query reformulators #24

AlexandreMisrahi2005 · 2024-09-16T10:29:20Z

new datasets: bioasq-pubmed, coderagbench (humaneval + mbpp), apibench, syllabusqa
add query generators/reformulators to reformulate the query with a generator @nadiinchi
update generators (wrapper to load gguf format + wrapper to load a local model)
add some assertions (check that label format is list and id format is string)
add scripts to generate oracles (bioasq + humaneval)

Please check utils.py and modules/metrics.py in particular, as they may contain critical changes.
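For reviewers, here is a minimal sketch of the kind of format check described above (label is a list, id is a string). The helper name `check_sample_format`, the `strict` flag, and the assumption that a sample is a dict with `id` and `label` keys are all hypothetical and are not taken from this PR's code; the `strict` switch only illustrates the assertion-versus-warning trade-off.

```python
import warnings


def check_sample_format(sample: dict, strict: bool = True) -> None:
    """Hypothetical helper: validate the id and label format of one sample.

    Assumes a sample is a dict with an 'id' and a 'label' key.
    """
    # ids are expected to be strings
    assert isinstance(sample["id"], str), (
        f"id must be a string, got {type(sample['id']).__name__}"
    )

    # labels are expected to be a list (possibly with several references)
    if not isinstance(sample["label"], list):
        message = f"label must be a list, got {type(sample['label']).__name__}"
        if strict:
            raise AssertionError(message)
        # non-strict mode only warns instead of stopping the run
        warnings.warn(message)


# Example: a well-formed sample passes silently
check_sample_format({"id": "q1", "label": ["answer a", "answer b"]})
```

An assertion stops a run early on a badly formatted dataset, while a warning lets existing configs keep running; which one is appropriate depends on how the downstream metrics consume the labels.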

nadiinchi · 2024-09-18T09:25:53Z

My review

create a folder in config/datasets with all new datasets? e.g. multi-domain?
same suggestion for config/prompts
put all new domain dataset processors into a separate file in processors, as is done for KILT? (I'd suggest both datastores and queries)
code rag bench: add separate configs per doc dataset, because they need to be run separately before using MergedDocDataset
remove config config/generator/llama-3-8b-instruct-FT-NQ-2epochs.yaml
config/query_generator/unfold_api_query.yaml: are all commented parts needed?
train code: is it possible to make wandb support optional? For now it is a mandatory import. Or discuss this with other contributors
train code: could we make num_saving_steps a config argument instead of hardcoding it?

- created folder multidomain in config/datasets and config/prompts with all new datasets
- moved dataset processors to a separate multidomain processor file and tested them
- code rag bench: added and tested separate configs per doc dataset
- removed local model checkpoint config
- removed comments in config/query_generator/unfold_api_query.yaml
- added optional support for wandb (and tested) + added training documentation on how to use / disable wandb
- num_saving_steps is now a config argument

TODO: debug ignored error in vllm -> __del__ -> gc.collect()
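A minimal sketch of how optional wandb support and a configurable num_saving_steps could fit together. Everything here is an assumption made for illustration: the `TrainLoggingConfig` dataclass, the `use_wandb` field, the helper names, and in particular the reading of num_saving_steps as "number of checkpoints saved over the whole run" are hypothetical and not taken from this PR's training code.

```python
from dataclasses import dataclass

# Optional dependency: training should still work when wandb is not installed.
try:
    import wandb
except ImportError:
    wandb = None


@dataclass
class TrainLoggingConfig:
    """Hypothetical config object; the field names are assumptions."""
    use_wandb: bool = False
    num_saving_steps: int = 10  # assumed meaning: checkpoints saved per run


def wandb_enabled(cfg: TrainLoggingConfig) -> bool:
    # Log to wandb only if the user asked for it AND the package is available.
    return cfg.use_wandb and wandb is not None


def save_interval(total_steps: int, cfg: TrainLoggingConfig) -> int:
    # Spread the checkpoints evenly over training; never return 0.
    return max(1, total_steps // cfg.num_saving_steps)


cfg = TrainLoggingConfig(use_wandb=False, num_saving_steps=10)
print(wandb_enabled(cfg), save_interval(1000, cfg))
```

Guarding the import keeps wandb out of the mandatory requirements, and reading the saving frequency from the config avoids editing the training script for each experiment.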

AlexandreMisrahi2005 · 2024-09-24T10:07:05Z

All comments from @nadiinchi are taken into account in the new commits

new datasets (bioasq-pubmed, coderagbench (humaneval + mbpp), apibenc…

c7739da

…h, syllabusqa), update generators (wrap gguf format, local model), update query generators, assertions (label format, id format), gen-oracles scripts (bioasq + humaneval)

AlexandreMisrahi2005 requested review from sclincha, nadiinchi and vnikouliNLE September 16, 2024 10:29

AlexandreMisrahi2005 marked this pull request as draft September 16, 2024 10:36

change label format assert to warning + edit config with wrong target…

cffa0e6

… causing Unit Tests to fail

AlexandreMisrahi2005 marked this pull request as ready for review September 16, 2024 14:52

change back to assertion

b0fe974

AlexandreMisrahi2005 added 2 commits September 24, 2024 12:08

comment vllm del method override to remove ignored error + edit gitig…

d92bd8f

…nore for wandb

correct syllabusQA splits

93bed67
