
Change database model for job data #155

Open
smathot opened this issue Apr 7, 2024 · 12 comments


@smathot
Collaborator

smathot commented Apr 7, 2024

Right now job data is stored such that one entry corresponds to one job and one participant. The actual data is stored as a JSON in a text field. This is problematic for two reasons:

  1. If a job is repeated often, this results in a very large JSON text blob that contains the combined results of all repetitions. This causes database crashes, 'Out of sort memory' errors, and other symptoms that we've discussed elsewhere. (OMM can not be accessed anymore #152 , Unable to retrieve data from experiment #150, 'Out of sort memory' error when downloading data #92 )
  2. If an experiment is removed or changed, or if a participant is removed, the data is removed as well. This is undesirable because the data should remain available even after such changes.

To fix this, the database model should be refactored such that multiple runs of the same job are stored as separate entries, and such that these entries remain accessible when linked experiments / participants are removed.
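To make the difference between the two storage models concrete, here is a minimal sketch (all table and field names are illustrative, not the actual OMM schema):

```javascript
// Hypothetical sketch of the two storage models (names are illustrative,
// not the actual OMM schema).

// Current model: one row per (job, participant); all runs merged into one JSON blob.
const currentRow = {
  job_id: 1,
  participant_id: 7,
  data: JSON.stringify([{ run: 1, rt: 512 }, { run: 2, rt: 498 }]), // grows without bound
};

// Proposed model: one row per run; rows exist independently of each other.
const proposedRows = [
  { job_id: 1, participant_id: 7, run: 1, data: JSON.stringify({ rt: 512 }) },
  { job_id: 1, participant_id: 7, run: 2, data: JSON.stringify({ rt: 498 }) },
];

// Appending a third run adds one small row instead of rewriting a giant blob.
proposedRows.push({ job_id: 1, participant_id: 7, run: 3, data: JSON.stringify({ rt: 530 }) });
console.log(proposedRows.length); // 3
```

The key point is that each repetition only ever touches its own small row, so the database never has to load or sort one huge text field.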

@dschreij
Member

dschreij commented Apr 21, 2024

@smathot @nclaidiere I finally found some time and started working on this. I think the new proposed way of storing and retrieving data will be an improvement overall. There is only one caveat:

and that these entries remain accessible when linked experiments / participants are removed.

I need to relationally link the data to at least one of these database records, and I have chosen studies. This implies though that if a study is deleted the stored data will go with it. For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.
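The embedding idea described above could look roughly like this (a hypothetical sketch with made-up field names, not the actual implementation): participant attributes are copied into each stored job result at write time, so the result row no longer depends on the participants table.

```javascript
// Hypothetical sketch: participant attributes are "embedded" (snapshotted)
// into each stored job result, so deleting the participant record later
// does not lose the data. Field names are illustrative only.
function buildJobResultRow(study, participant, runData) {
  return {
    study_id: study.id,               // relational link: deleted along with the study
    participant: {                    // embedded snapshot: survives participant deletion
      identifier: participant.identifier,
      meta: { ...participant.meta },  // e.g. age, gender from the participant record
    },
    data: runData,
  };
}

const row = buildJobResultRow(
  { id: 3 },
  { identifier: 'subject-42', meta: { age: 9 } },
  { rt: 512 },
);
// Even if the participant record is deleted later, row.participant keeps the snapshot.
```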

@nclaidiere I remember you sent me this a while ago, but I cannot find it anymore: can you send me an export or dump of your current database? I want to convert the old data table structure to the new one in a migration, and I want to check that this works well with the production data, to be really sure nothing will break.

@smathot
Collaborator Author

smathot commented Apr 22, 2024

@dschreij Thanks for picking this up. Can you hold this thought for a few days though? @nclaidiere @kristian-lange and I are having a meeting next Wednesday to discuss the next steps for OpenMonkeyMind, and this is one of the things on the agenda. I'll get back to you!

@Loic-Bonnier

Hello, I'm one of the IT staff in the laboratory of @nclaidiere.
I did some investigating, and depending on the experiment's design, there is sometimes an UPDATE query on the table "job_states". I understand this table holds all the jobs, and sometimes one, two, or perhaps three JSON payloads end up in the same row. From my point of view, this is the problem we are looking for. I will try to correct that with the MySQL query rewrite plugin, but it is not included in the Docker image used by omm-server.

@dschreij
Member

dschreij commented Apr 24, 2024

Hi @Loic-Bonnier, you are right in your assessment. I have already started decoupling the job data from the job_states table, as you can see in the commits attached to this issue. Feel free to inspect them and improve them.

The docker image of omm-server only contains the production build of the package to keep the image size lean, so it is likely libraries only used for development are missing.

@kristian-lange
Collaborator

@smathot I'm looking at the database schema and stumbled over certain tables that have only two columns: dtypes, job_statuses_, access_permission, user_types, participation_statuses_. They all have the same columns: an id and a name. I fail to grasp their purpose: why not just incorporate them into the table they relate to?

E.g. Instead of having two tables for the job status

Image

one could have just one table

Image

and this would reduce the complexity of our schema and make it easier to handle and develop. Or am I missing some purpose of them?

@dschreij
Member

dschreij commented Jul 12, 2024 via email

@kristian-lange
Collaborator

@dschreij Hi Daniël, nice that you chimed in :). I wasn't aware of this apparent war between people who support enum types on one side and people who support look-up tables on the other. I belong to a third camp, admittedly a minority, that argues one could just put status/type data in a simple varchar or tinyint and keep it simple, without an extra table or enum.

Advantages

  • simple schema: no enum type, no extra look-up table
  • simple queries: no table joins
  • easy to read (everyone understands "pending" - not everyone "1")

Disadvantages

  • varchar uses (a bit) more disk space
  • people need to be disciplined and actually use only the appropriate values
  • renaming such status/type values means going through every row in the table and updating the value

For a small project (like OMM) I'm happy to live with the disadvantages and enjoy the advantages.

E.g.

I think this

await ptcp.jobs()
        .where('study_id', study.id)
        .whereInPivot('status', ['pending', 'started'])
        .withPivot(['status'])
        .orderBy('pivot_status', 'desc')
        .orderBy('position', 'asc')
        .firstOrFail()

is easier to read than this

await ptcp.jobs()
        .where('study_id', study.id)
        .whereInPivot('status_id', [1, 2])
        .withPivot(['status_id'])
        .with('variables.dtype')
        .orderBy('pivot_status_id', 'desc')
        .orderBy('position', 'asc')
        .firstOrFail()


@smathot
Collaborator Author

smathot commented Jul 13, 2024

The main disadvantage of using a reference table that I can see is that this may require an additional join and thus reduce performance. But this may not actually happen in practice, because we are generally not interested in the text description of these fields, and thus we are likely not actually joining the reference tables in our queries. (Or am I wrong about that? If we are regularly joining the reference tables, then that's a performance issue.)

@kristian-lange You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table. The varchar approach I don't like because it makes the table bigger and potentially less performant.

In other words: @kristian-lange, restructure as you see fit while considering that:

  • There is no need to change the database structure unless there is a clear reason for it. The main reason would be to reduce the number of operations and joins to increase performance.
  • We don't want to use varchar columns for enum-like data.
  • There may not be a fundamental difference between your tinyint approach and @dschreij 's reference tables.

Is that helpful?

@kristian-lange
Collaborator

I'm still trying to get familiar with the database schema and where it could potentially be improved to solve some problems we actually face. For me this discussion about the look-up table / reference table is more like an exercise before tackling the real problems. I agree with @smathot that there is no need to change the schema regarding this.

You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table.

I actually meant only a tinyint and no look-up table. The meaning of the different int values would be coded in the application. It's similar to what we are doing currently: we seldom actually look up the meaning in the look-up table, and we do the database queries with the id, e.g. .whereInPivot('status_id', [1, 2]).
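The "meaning coded in the application" approach could look like this (a hypothetical sketch, not the actual OMM code; status names and values are made up):

```javascript
// Hypothetical sketch: status meanings live in application code; the database
// column is just a tinyint. No enum type, no look-up table.
const JOB_STATUS = Object.freeze({
  PENDING: 1,
  STARTED: 2,
  FINISHED: 3,
});

// Reverse map, only needed when exporting human-readable labels.
const JOB_STATUS_NAMES = Object.fromEntries(
  Object.entries(JOB_STATUS).map(([name, id]) => [id, name.toLowerCase()]),
);

// Queries would then use the named constants instead of magic numbers, e.g.
// .whereInPivot('status_id', [JOB_STATUS.PENDING, JOB_STATUS.STARTED])
console.log(JOB_STATUS_NAMES[1]); // "pending"
```

This keeps the readability benefit ("pending" rather than a bare 1) in the application code while the column stays a small integer.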

@kristian-lange
Collaborator

Hi @dschreij, could you please help me understand why you chose a certain approach. You wrote:

I need to relationally link the data to at least one of these database records, and I have chosen studies.

And from your commit (#4accc39) I can see that you planned a table study_data that links both the studies and participants tables.

Image

For me, the natural choice would be to link the table containing the job's result data to the job_states table, where the result data is currently stored (in the field data of type json). Why did you choose to link it to studies and participants?

For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.

Can you please elaborate on this? Is there data for participants that has to be stored additionally? Is that data different from the job's result data? What do you mean by 'embed' and 'data that is stored for the job'? Do you mean the jobs table (but that table does not have a data field)?

And I have two partially related questions:

  1. Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons?

  2. Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)?

@dschreij
Member

dschreij commented Jul 15, 2024 via email

@kristian-lange
Collaborator

Just to add some formatting to Daniel's answer ;).

Hi Kristian,

Why did you choose to link it to studies and participants.

The main reason to no longer attach job results to the job_states table is that there is a (new) requirement that the data should be retained if the job table is wiped. Currently, if one wants to upload a new job table file, the old jobs are deleted first, along with their data, before the new job table is composed from the uploaded file. To avoid deleting all result data, I had to decouple it from the jobs and tie it to the study instead.

Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data?

Yes, if you view a participant record, you will see that one can specify extra data in YAML that pertains to the participant, like gender, age, etc. These need to be stored with each trial/job result, so they are exported for each job data row when the data is saved as Excel.
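The export step described above could be sketched like this (hypothetical field names, not the actual export code): each job-result row is flattened together with the participant attributes stored alongside it, producing one complete line per trial.

```javascript
// Hypothetical sketch of the export step: a job-result row is flattened
// together with the embedded participant attributes, so each exported
// Excel line is self-contained. Field names are illustrative only.
function flattenForExport(resultRow) {
  return {
    ...resultRow.data,                       // trial data, e.g. response time
    ...resultRow.participant.meta,           // embedded attributes, e.g. age, gender
    participant: resultRow.participant.identifier,
  };
}

const exported = flattenForExport({
  participant: { identifier: 'subject-42', meta: { age: 9, gender: 'f' } },
  data: { rt: 512, correct: 1 },
});
console.log(exported);
// { rt: 512, correct: 1, age: 9, gender: 'f', participant: 'subject-42' }
```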

Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons? This is a pivot table, right?

Those never really need ids, as they have a composite primary key consisting of the participant ID and job ID, if I remember correctly. Any combination of these is always unique. You will rarely use ids from a pivot table to retrieve data, since pivot tables only exist to join two tables in a many-to-many relation.
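The composite-key idea can be illustrated with a small sketch (hypothetical, just to show the uniqueness property):

```javascript
// Hypothetical sketch: a pivot table's rows are uniquely identified by the
// pair (participant_id, job_id), so a separate surrogate id adds nothing.
const pivot = new Map();
const key = (participantId, jobId) => `${participantId}:${jobId}`;

// "Inserting" the same pair twice just overwrites the row, mimicking the
// uniqueness guarantee of a composite primary key.
pivot.set(key(7, 1), { status_id: 1 });
pivot.set(key(7, 1), { status_id: 2 }); // same composite key: still one row

console.log(pivot.size); // 1
console.log(pivot.get(key(7, 1)).status_id); // 2
```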

Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)?

I can't clearly recall, but I think in the session table the link to the participant is made using its identifier string, not its ID. This is because the omm client needed to be able to manipulate the session data, and using the identifier is easier than using the id in that case. I should have chosen a better name than participant_id in the session table, so this is my bad.
