
Change database model for job data #155

Open
smathot opened this issue Apr 7, 2024 · 12 comments


@smathot
Collaborator

smathot commented Apr 7, 2024

Right now job data is stored such that one entry corresponds to one job and one participant. The actual data is stored as a JSON in a text field. This is problematic for two reasons:

  1. If a job is repeated often, this results in a very large JSON text blob that contains the combined results of all repetitions. This causes database crashes, 'Out of sort memory' errors, and other symptoms that we've discussed elsewhere. (OMM can not be accessed anymore #152 , Unable to retrieve data from experiment #150, 'Out of sort memory' error when downloading data #92 )
  2. If an experiment is removed or changed, or if a participant is removed, the data is removed as well. This is undesirable because the data should remain available even after such changes.

To fix this, the database model should be refactored such that multiple runs of the same job are stored as separate entries, and such that these entries remain accessible when linked experiments / participants are removed.
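To make the difference between the two storage models concrete, here is a minimal sketch (all table and field names are illustrative, not the actual OMM schema):

```javascript
// Hypothetical sketch of the two storage models (names are illustrative,
// not the actual OMM schema).

// Current model: one row per (job, participant); all runs merged into one JSON blob.
const currentRow = {
  job_id: 1,
  participant_id: 7,
  data: JSON.stringify([{ run: 1, rt: 512 }, { run: 2, rt: 498 }]), // grows without bound
};

// Proposed model: one row per run; rows exist independently of each other.
const proposedRows = [
  { job_id: 1, participant_id: 7, run: 1, data: JSON.stringify({ rt: 512 }) },
  { job_id: 1, participant_id: 7, run: 2, data: JSON.stringify({ rt: 498 }) },
];

// Appending a third run adds one small row instead of rewriting a giant blob.
proposedRows.push({ job_id: 1, participant_id: 7, run: 3, data: JSON.stringify({ rt: 530 }) });
console.log(proposedRows.length); // 3
```

The key point is that each repetition only ever touches its own small row, so the database never has to load or sort one huge text field.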

@dschreij
Member

dschreij commented Apr 21, 2024

@smathot @nclaidiere I finally found some time and started working on this. I think the new proposed way of storing and retrieving data will be an improvement overall. There is only one caveat:

and that these entries remain accessible when linked experiments / participants are removed.

I need to relationally link the data to at least one of these database records, and I have chosen studies. This implies though that if a study is deleted the stored data will go with it. For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.
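The embedding idea described above could look roughly like this (a hypothetical sketch with made-up field names, not the actual implementation): participant attributes are copied into each stored job result at write time, so the result row no longer depends on the participants table.

```javascript
// Hypothetical sketch: participant attributes are "embedded" (snapshotted)
// into each stored job result, so deleting the participant record later
// does not lose the data. Field names are illustrative only.
function buildJobResultRow(study, participant, runData) {
  return {
    study_id: study.id,               // relational link: deleted along with the study
    participant: {                    // embedded snapshot: survives participant deletion
      identifier: participant.identifier,
      meta: { ...participant.meta },  // e.g. age, gender from the participant record
    },
    data: runData,
  };
}

const row = buildJobResultRow(
  { id: 3 },
  { identifier: 'subject-42', meta: { age: 9 } },
  { rt: 512 },
);
// Even if the participant record is deleted later, row.participant keeps the snapshot.
```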

@nclaidiere I remember you sent me this a while ago, but I cannot find it anymore: can you send me an export or dump of your current database? I want to convert the old data table structure to the new one in a migration, and I want to check that this works well with the production data, to be really sure nothing will break.

@smathot
Collaborator Author

smathot commented Apr 22, 2024

@dschreij Thanks for picking this up. Can you hold this thought for a few days though? @nclaidiere @kristian-lange and I are having a meeting next Wednesday to discuss the next steps for OpenMonkeyMind, and this is one of the things on the agenda. I'll get back to you!

@Loic-Bonnier

Hello, I'm one of the IT staff in the laboratory of @nclaidiere.
I did some investigating, and depending on the experiment's design, there is sometimes an UPDATE query on the table "job_states". I understand this table holds all the jobs, and sometimes one, two, or perhaps three JSON payloads end up in the same row. From my point of view, this is the problem we are looking for. I will try to correct that with the MySQL query rewrite plugin, but it is not included in the Docker image used by omm-server.

@dschreij
Member

dschreij commented Apr 24, 2024

Hi @Loic-Bonnier, you are right in your assessment. I have already started decoupling the job data from the job_states table, as you can see in the commits attached to this issue. Feel free to inspect them and improve them.

The docker image of omm-server only contains the production build of the package to keep the image size lean, so it is likely libraries only used for development are missing.

@kristian-lange
Collaborator

@smathot I'm looking at the database schema and stumbled over certain tables that have only two columns: dtypes, job_statuses_, access_permission, user_types, participation_statuses_. They all have the same columns: an id and a name. I fail to grasp their purpose: why not just incorporate them into the table they relate to?

E.g. Instead of having two tables for the job status

Image

one could have just one table

Image

and this would reduce the complexity of our schema and make it easier to handle and develop. Or am I missing some purpose of them?

@dschreij
Member

dschreij commented Jul 12, 2024 via email

@kristian-lange
Collaborator

@dschreij Hi Daniël, nice that you chimed in :). I wasn't aware of this apparent war between people who support enum types on one side and people who support look-up tables on the other. I belong to a third camp, admittedly a minority, that argues one could just put status/type data in a simple varchar or tinyint and keep it simple, without an extra table or enum.

Advantages

  • simple schema: no enum type, no extra look-up table
  • simple queries: no table joins
  • easy to read (everyone understands "pending" - not everyone "1")

Disadvantages

  • varchar uses (a bit) more disk space
  • people need to be disciplined and actually use only the appropriate values
  • renaming such status/type values means going through every row in the table and updating the value

For a small project (like OMM) I'm happy to live with the disadvantages and enjoy the advantages.

E.g.

I think this

await ptcp.jobs()
        .where('study_id', study.id)
        .whereInPivot('status', ['pending', 'started'])
        .withPivot(['status'])
        .orderBy('pivot_status', 'desc')
        .orderBy('position', 'asc')
        .firstOrFail()

is easier to read than this

await ptcp.jobs()
        .where('study_id', study.id)
        .whereInPivot('status_id', [1, 2])
        .withPivot(['status_id'])
        .with('variables.dtype')
        .orderBy('pivot_status_id', 'desc')
        .orderBy('position', 'asc')
        .firstOrFail()


@smathot
Collaborator Author

smathot commented Jul 13, 2024

The main disadvantage of using a reference table that I can see is that this may require an additional join and thus reduce performance. But this may not actually happen in practice, because we are generally not interested in the text description of these fields, and thus we are likely not actually joining the reference tables in our queries. (Or am I wrong about that? If we are regularly joining the reference tables, then that's a performance issue.)

@kristian-lange You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table. The varchar approach I don't like because it makes the table bigger and potentially less performant.

In other words: @kristian-lange, restructure as you see fit while considering that:

  • There is no need to change the database structure unless there is a clear reason for it. The main reason would be to reduce the number of operations and joins to increase performance.
  • We don't want to use varchar columns for enum-like data.
  • There may not be a fundamental difference between your tinyint approach and @dschreij 's reference tables.

Is that helpful?

@kristian-lange
Collaborator

I'm still trying to get familiar with the database schema and where it could potentially be improved to solve some problems we actually face. For me this discussion about the look-up table / reference table is more like an exercise before tackling the real problems. I agree with @smathot that there is no need to change the schema regarding this.

You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table.

I actually meant only a tinyint and no look-up table. The meaning of the different int values would be coded in the application. It's similar to what we are doing currently: we seldom actually look up the meaning in the look-up table, and we do the database queries with the id, e.g. .whereInPivot('status_id', [1, 2]).
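The "meaning coded in the application" approach could look like this (a hypothetical sketch, not the actual OMM code; status names and values are made up):

```javascript
// Hypothetical sketch: status meanings live in application code; the database
// column is just a tinyint. No enum type, no look-up table.
const JOB_STATUS = Object.freeze({
  PENDING: 1,
  STARTED: 2,
  FINISHED: 3,
});

// Reverse map, only needed when exporting human-readable labels.
const JOB_STATUS_NAMES = Object.fromEntries(
  Object.entries(JOB_STATUS).map(([name, id]) => [id, name.toLowerCase()]),
);

// Queries would then use the named constants instead of magic numbers, e.g.
// .whereInPivot('status_id', [JOB_STATUS.PENDING, JOB_STATUS.STARTED])
console.log(JOB_STATUS_NAMES[1]); // "pending"
```

This keeps the readability benefit ("pending" rather than a bare 1) in the application code while the column stays a small integer.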

@kristian-lange
Collaborator

Hi @dschreij, could you please help me understand why you chose a certain approach. You wrote:

I need to relationally link the data to at least one of these database records, and I have chosen studies.

And from your commit (#4accc39) I can see that you planned a table study_data that links both the studies and participants tables.

Image

For me, the natural choice would be to link the table containing the job's result data to the job_states table, where the result data is currently stored (in the field data of type json). Why did you choose to link it to studies and participants?

For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.

Can you please elaborate on this? Is there data for participants that has to be stored additionally? Is that data different from the job's result data? What do you mean by 'embed' and 'data that is stored for the job'? Do you mean the jobs table (but that table does not have a data field)?

And I have two partially related questions:

  1. Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons?

  2. Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)?

@dschreij
Member

dschreij commented Jul 15, 2024 via email

@kristian-lange
Collaborator

Just to add some formatting to Daniel's answer ;).

Hi Kristian,

Why did you choose to link it to studies and participants.

The main reason to no longer attach job results to the job_states table is that there is a (new) requirement that the data should be retained if the job table is wiped. Currently, if one wants to upload a new job table file, the old jobs are deleted first, along with their data, before the new job table is composed from the uploaded file. To avoid deleting all result data, I had to decouple it from the jobs and tie it to the study instead.

Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data?

Yes, if you view a participant record, you will see that one can specify extra data in YAML that pertains to the participant, like gender, age, etc. These need to be stored with each trial/job result, so they are exported for each job data row when the data is saved as Excel.
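The export step described above could be sketched like this (hypothetical field names, not the actual export code): each job-result row is flattened together with the participant attributes stored alongside it, producing one complete line per trial.

```javascript
// Hypothetical sketch of the export step: a job-result row is flattened
// together with the embedded participant attributes, so each exported
// Excel line is self-contained. Field names are illustrative only.
function flattenForExport(resultRow) {
  return {
    ...resultRow.data,                       // trial data, e.g. response time
    ...resultRow.participant.meta,           // embedded attributes, e.g. age, gender
    participant: resultRow.participant.identifier,
  };
}

const exported = flattenForExport({
  participant: { identifier: 'subject-42', meta: { age: 9, gender: 'f' } },
  data: { rt: 512, correct: 1 },
});
console.log(exported);
// { rt: 512, correct: 1, age: 9, gender: 'f', participant: 'subject-42' }
```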

Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons? This is a pivot table, right?

Those never really need ids, as they have a composite primary key consisting of the participant ID and job ID, if I remember correctly. Any combination of these is always unique. You will rarely use ids from a pivot table to retrieve data, since pivot tables only exist to join two tables in a many-to-many relation.
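The composite-key idea can be illustrated with a small sketch (hypothetical, just to show the uniqueness property):

```javascript
// Hypothetical sketch: a pivot table's rows are uniquely identified by the
// pair (participant_id, job_id), so a separate surrogate id adds nothing.
const pivot = new Map();
const key = (participantId, jobId) => `${participantId}:${jobId}`;

// "Inserting" the same pair twice just overwrites the row, mimicking the
// uniqueness guarantee of a composite primary key.
pivot.set(key(7, 1), { status_id: 1 });
pivot.set(key(7, 1), { status_id: 2 }); // same composite key: still one row

console.log(pivot.size); // 1
console.log(pivot.get(key(7, 1)).status_id); // 2
```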

Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)?

I can't clearly recall, but I think in the session table the link to the participant is made using its identifier string, not its ID. This is because the omm client needed to be able to manipulate the session data, and using the identifier is easier than using the id in that case. I should have chosen a better name than participant_id in the session table, so this is my bad.
