Questions about data #7

Open
pat-alt opened this issue Jan 15, 2024 · 0 comments


pat-alt commented Jan 15, 2024

Hi there,

This is a very interesting dataset, thank you for sharing. I have a few questions:

  1. In the filtered datasets, does the 'score' column correspond to the softmax output for the predicted label?
  2. For the training datasets, I assume the final digits in the filenames just indicate the seeds?
  3. Regarding seeds, were those used for splitting the data into train and test sets? In other words, does the union of train and test always contain the same manually annotated sentences?
  4. Am I right in assuming that the datasets on HF correspond to one of the 'lab-manual-split-combine-train-XXXX.xlsx' and 'lab-manual-split-combine-test-XXXX.xlsx' datasets? Which seed exactly?
  5. I am assuming I can use the URL as a unique document identifier? For press conferences, for example, I find 63 unique URLs, which matches the '# Files' count reported in the paper.
  6. Finally, when I locally concatenate all filtered speeches/mm/pc, I find a significant number of duplicate sentences, often in documents with different timestamps. My hunch is that some sentences simply reappear in multiple documents. For example, it's reasonable to assume that the sentence below gets recycled regularly. Still, I wanted to ask for your view on this.

"The Federal Open Market Committee seeks monetary and financial conditions that will foster price stability and promote sustainable growth in output."
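For concreteness, the checks behind questions 5 and 6 can be sketched roughly as follows. This is only a minimal sketch on toy data: the column names `sentence` and `url` and the helper `find_cross_document_duplicates` are assumptions made for illustration and may not match the actual files.

```python
import pandas as pd


def find_cross_document_duplicates(df, text_col="sentence", doc_col="url"):
    """Return normalized sentences that occur in more than one document.

    The result is a Series indexed by normalized sentence, with the
    number of distinct documents each sentence appears in.
    Column names are hypothetical defaults, not the dataset's schema.
    """
    # Normalize whitespace and case so trivially different copies match.
    normalized = df[text_col].str.strip().str.lower()
    counts = (
        df.assign(_norm=normalized)
        .groupby("_norm")[doc_col]
        .nunique()
        .rename("n_documents")
    )
    return counts[counts > 1].sort_values(ascending=False)


# Toy data standing in for the concatenated filtered files.
toy = pd.DataFrame(
    {
        "sentence": [
            "The Committee seeks price stability.",
            "the committee seeks price stability. ",
            "Inflation rose in March.",
        ],
        "url": ["doc-a", "doc-b", "doc-a"],
    }
)

dups = find_cross_document_duplicates(toy)

# Number of distinct documents, using the URL as the document identifier.
n_docs = toy["url"].nunique()
```

Grouping on a normalized copy of the text, not the raw string, is a deliberate choice here so that copies differing only in case or surrounding whitespace are counted as the same sentence.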

Sorry for the long list of questions and apologies if I have missed something obvious in some cases.

Many thanks for your help!
