Questions about data #7

pat-alt · 2024-01-15T08:45:05Z

Hi there,

This is a very interesting dataset, thank you for sharing. I have a few questions:

In the filtered datasets, does the 'score' column correspond to the softmax output for the predicted label?
For the training datasets, I assume the final digits in the filenames just indicate the seeds?
Regarding seeds, were those used for splitting the data into train and test sets? In other words, does the union of train and test always contain the same manually annotated sentences?
Am I right in assuming that the datasets on HF correspond to one of the 'lab-manual-split-combine-train-XXXX.xlsx' and 'lab-manual-split-combine-test-XXXX.xlsx' datasets? Which seed exactly?
I am assuming I can use the URL as a unique document identifier? For press conferences, for example, I find 63 unique URLs, which corresponds to the '# Files' count presented in the paper.
Finally, when I locally concatenate all filtered speeches/mm/pc, I actually find a significant number of duplicate sentences, often in documents with varying time stamps. My hunch is that some sentences simply reappear in multiple documents. For example, it's reasonable to assume that the sentence below gets recycled regularly. Still, I wanted to ask for your view on this.

"The Federal Open Market Committee seeks monetary and financial conditions that will foster price stability and promote sustainable growth in output."
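
A check along these lines can be sketched as follows. This is only a minimal illustration on toy rows: the column names `sentence` and `url` and the helper `cross_document_duplicates` are assumptions for the example, not the confirmed schema of the filtered files.

```python
import pandas as pd

# Toy stand-in for the concatenated filtered files. The column names
# ("sentence", "url") and the rows are hypothetical, for illustration only.
df = pd.DataFrame(
    {
        "sentence": ["A", "B", "A", "C", "C"],
        "url": ["doc1", "doc1", "doc2", "doc2", "doc2"],
    }
)


def cross_document_duplicates(frame, text_col="sentence", doc_col="url"):
    """Return sentences that occur in more than one distinct document."""
    # Count the distinct documents each sentence appears in.
    docs_per_sentence = frame.groupby(text_col)[doc_col].nunique()
    # Keep only sentences seen in at least two different documents.
    return docs_per_sentence[docs_per_sentence > 1].sort_values(ascending=False)


dupes = cross_document_duplicates(df)
print(dupes)
```

Counting distinct URLs per sentence separates sentences recycled across documents from repeats inside a single document, which a plain `duplicated()` call would lump together.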

Sorry for the long list of questions and apologies if I have missed something obvious in some cases.

Many thanks for your help!
