-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change database model for job data #155
Comments
@smathot @nclaidiere I finally found some time and started working on this. I think the new proposed way of storing and retrieving data will be an improvement overall. There is only one caveat:
I need to relationally link the data to at least one of these database records, and I have chosen studies. This implies though that if a study is deleted the stored data will go with it. For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain. @nclaidiere I remember you sent me this a while ago but I cannot find it anymore, but can you send me an export or dump of your current database? I want to convert the old data tabel structure to the new one in a migration, but I want to see if this works well with the production data, to be really sure nothing will break. |
@dschreij Thanks for picking this up. Can you hold this thought for a few days though? @nclaidiere @kristian-lange and I are having a meeting next Wednesday to discuss the next steps for OpenMonkeyMind, and this one of the things that's on the agenda. I'll get back to you! |
Hello, I'm one of the IT in the Laboratory of @nclaidiere. |
Hi @Loic-Bonnier, you are right in your assessment. I have already started with decoupling the job data with the job_states as you can see in the commits that are attached to this issue. Feel free to inspect them and improve them. The docker image of omm-server only contains the production build of the package to keep the image size lean, so it is likely libraries only used for development are missing. |
@smathot I'm looking at the database schema and stumbled over certain tables that have only one column: dtypes, job_statuses_, access_permission, user_types, participation_statuses_. They all have the same columns: an id and a name. I fail to grasp the purpose of them - why not just incorporate them into the table where they have relation to? E.g. Instead of having two tables for the job status one could have just one table and this would reduce the complexity of our schema, makes it easier to handle and develop. Or do I miss some purpose of them? |
Allow me to chime in. This was a deliberate design choice (and a discussion
that pops up once in a while again). But I am from the camp that believes
that enums are evil 👀. See this article that explains nicely why
http://komlenic.com/244/8-reasons-why-mysqls-enum-data-type-is-evil/
Especially point 4 has often bitten me, as has point 2 when you need to
change something to the member list.
Met vriendelijke groet / with kind regards
Daniël Schreij
…On Fri, Jul 12, 2024, 14:54 Kristian Lange ***@***.***> wrote:
@smathot <https://github.com/smathot> I'm looking at the database schema
and stumbled over certain tables that have only one column: *dtypes*,
job_statuses_, *access_permission*, *user_types*,
participation_statuses_. They all have the same columns: an *id* and a
*name*. I fail to grasp the purpose of them - why not just incorporate
them into the table where they have relation to?
E.g. Instead of having two tables for the job status
image.png (view on web)
<https://github.com/user-attachments/assets/a6f38f99-eb52-472f-9456-980e9f2ed802>
one could have just one table
image.png (view on web)
<https://github.com/user-attachments/assets/aa94d2a8-381c-41a2-b0d2-fe79fd125eb0>
and this would reduce the complexity of our schema, makes it easier to
handle and develop. Or do I miss some purpose of them?
—
Reply to this email directly, view it on GitHub
<#155 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFIHFE455DYN4HVFYQAYMLZL7GZLAVCNFSM6AAAAABF3EBPOOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRVGUYTQMJTGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@dschreij Hi Daniël, nice you chimed in :). I wasn't aware of this apparent war, people on one side supporting enum types and people on the other side supporting look-up tables. I belong to a third type of people, admittedly a minory, that argue one could just put your status/type data in a simple varchar or tinyint and keep it simple without an extra table or enum. Advantages
Disadvantages
For a small project (like OMM) I'm happy to live with the disadvantages and enjoy the advantages. E.g. I think this await ptcp.jobs()
.where('study_id', study.id)
.whereInPivot('status', ['pending', 'started'])
.withPivot(['status'])
.orderBy('pivot_status', 'desc')
.orderBy('position', 'asc')
.firstOrFail() is easier to read than this await ptcp.jobs()
.where('study_id', study.id)
.whereInPivot('status_id', [1, 2])
.withPivot(['status_id'])
.with('variables.dtype')
.orderBy('pivot_status_id', 'desc')
.orderBy('position', 'asc')
.firstOrFail() . |
The main disadvantage of using a reference table that I can see is that this may require an additional join and thus reduce performance. But this may not actually happen in practice, because we are generally not interested in the text description of these fields, and thus we are likely not actually joining the reference tables in our queries. (Or am I wrong about that? If we are regularly joining the reference tables, then that's a performance issue.) @kristian-lange You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table. The varchar approach I don't like because it makes the table bigger and potentially less performant. In other words: @kristian-lange, restructure as you see fit while considering that:
Is that helpful? |
I'm still trying to get familiar with the database schema and where it could potentially be improved to solve some problems we actually face. For me this discussion about the look-up table / reference table is more like an exercise before tackling the real problems. I agree with @smathot that there is no need to change the schema regarding this.
I actually meant only an tinyint and no lookup table. The meaning of the different int values would have been coded in the application. It's similar to what we are doing currently. We seldom actually look up the meaning in the look-up table and we do the database queries with the id, e.g. |
Hi @dschreij, could you please help me to understand why you have chosen a certain way. You wrote:
And from your commit (#4accc39) I can see that you planed a table study_data that both links the tables studies and participants. For me the natural choice to attach the table that contains the job's result data would be to link it to the job_states table where the job's result data are currently stored (in field data of type json). Why did you choose to link it to studies and participants.
Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data? What do you mean with 'embed' and 'data that is stored for the job'? Do you mean the jobs table (But this table does not have a data field)? And I have two partially related questions:
|
Hi Kristian,
Why did you choose to link it to studies and participants.
The main reason to no longer attach job results to the job_states table is that there is a (new) requirement that the data should be retained if the job table is wiped. Currently, if one wants to upload a new job table file, the old jobs are deleted first, along with their data, before the new job table is composed from the uploaded file. To mitigate deleting all result data, I had to decouple it from jobs and tie it to the study instead.
Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data?
Yes, if you view a participant record you will see one can specify extra data in yaml that pertain to the participant, like gender, age, etc. These need to be stored with each trial/job result, so they are exported for each job data row when the data is saved as excel.
Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons?
This is a pivot table, right? Those never really need ids as they have a composite primary key consisting of the participant ID and job ID if I remember correctly. Any combination of these are always unique. You will rarely use ids from a pivot table to retrieve data, as they are only used to be able to join 2 many-many related tables.
Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)
I can't clearly recall, but I think in the session table, the link to the participant is made using its identifier string, and not its ID. This is because the omm client needed to be able to manipulate the session data and using the identifier is easier that using the id for that case. I should have chosen a better name for this participant_id in the session table, so this is my bad
On Sun, Jul 14, 2024, 23:31 Kristian Lange ***@***.******@***.***>> wrote:
Hi @dschreij<https://github.com/dschreij>, could you please help me to understand why you have chosen a certain way. You wrote:
I need to relationally link the data to at least one of these database records, and I have chosen studies.
And from your commit (#4accc39) I can see that you planed a table study_data that both links the tables studies and participants.
image.png (view on web)<https://github.com/user-attachments/assets/87abd619-5506-4d7f-aa73-1bdb977fc152>
For me the natural choice to attach the table that contains the job's result data would be to link it to the job_states table where the job's result data are currently stored (in field data of type json). Why did you choose to link it to studies and participants.
For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.
Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data? What do you mean with 'embed' and 'data that is stored for the job'? Do you mean the jobs table (But this table does not have a data field)?
And I have two partially related questions:
1. Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons?
2. Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)?
—
Reply to this email directly, view it on GitHub<#155 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAFIHFB2WYPQXMQUBR6TRWTZMLU3XAVCNFSM6AAAAABF3EBPOOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRXGQ4DQNZTGE>.
You are receiving this because you were mentioned.
|
Just to add some formatting to Daniel's answer ;). Hi Kristian,
The main reason to no longer attach job results to the job_states table is that there is a (new) requirement that the data should be retained if the job table is wiped. Currently, if one wants to upload a new job table file, the old jobs are deleted first, along with their data, before the new job table is composed from the uploaded file. To mitigate deleting all result data, I had to decouple it from jobs and tie it to the study instead.
Yes, if you view a participant record you will see one can specify extra data in yaml that pertain to the participant, like gender, age, etc. These need to be stored with each trial/job result, so they are exported for each job data row when the data is saved as excel.
Those never really need ids as they have a composite primary key consisting of the participant ID and job ID if I remember correctly. Any combination of these are always unique. You will rarely use ids from a pivot table to retrieve data, as they are only used to be able to join 2 many-many related tables.
I can't clearly recall, but I think in the session table, the link to the participant is made using its identifier string, and not its ID. This is because the omm client needed to be able to manipulate the session data and using the identifier is easier that using the id for that case. I should have chosen a better name for this participant_id in the session table, so this is my bad |
Right now job data is stored such that one entry corresponds to one job and one participant. The actual data is stored as a JSON in a text field. This is problematic for two reasons:
To fix this, the database model should be refactored such that multiple runs of the same job are stored as separate entries, and that these entries remain accessible when linked experiments / participants are removed.
The text was updated successfully, but these errors were encountered: