[TPC-H] Query 11 and 13 memory issue #1382

Open
hendrikmakait opened this issue Feb 13, 2024 · 5 comments

@hendrikmakait
Member

hendrikmakait commented Feb 13, 2024

This looks like another instance of the problem in #1376. We end up with a groupby-aggregate that leaves us with ~30M groups in a single partition.

Edit (Patrick): Query 13 has exactly the same issue.

@phofl
Contributor

phofl commented Feb 13, 2024

This is a bit harder to detect, though. It is a case where the cardinality of the groupby keys is needed to make a more informed decision.
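
To illustrate why cardinality matters here, the following is a minimal stdlib-only sketch, not dask's actual logic. The helper names `choose_split_out` and `split_groups` and the `max_groups_per_partition` threshold are hypothetical. It shows how an estimated group count could drive the number of output partitions, and how hashing the keys spreads the groups across them.

```python
import math
from collections import defaultdict


def choose_split_out(n_groups: int, max_groups_per_partition: int = 1_000_000) -> int:
    """Hypothetical heuristic: pick enough output partitions so that no
    partition holds more than max_groups_per_partition groups on average."""
    return max(1, math.ceil(n_groups / max_groups_per_partition))


def split_groups(keys, n_out: int):
    """Assign each distinct key to one of n_out output partitions by hashing."""
    parts = defaultdict(set)
    for key in keys:
        parts[hash(key) % n_out].add(key)
    return parts


# With ~30M groups and the (assumed) 1M threshold, one partition is not enough.
n_out = choose_split_out(30_000_000)

# Small demo: 10,000 distinct integer keys spread over 4 output partitions.
demo_parts = split_groups(range(10_000), choose_split_out(10_000, 2_500))
largest = max(len(groups) for groups in demo_parts.values())
```

Without a cardinality estimate, there is no way to tell ahead of time whether a single output partition is fine (a handful of groups) or a memory hazard (tens of millions of groups).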

@hendrikmakait
Member Author

hendrikmakait commented Feb 13, 2024

This one turns out to be a bit trickier. There's an instance of a groupby with many unique values, as well as a join with a single value of the (partitioned) nations dataset (#1380). However, applying the trivial fixes (split_out=True and broadcast=True) does not solve the memory issue completely. I suspect that imbalanced partition sizes are also to blame (#1367 (comment)).

@phofl phofl changed the title [TPC-H] Query 11 memory issue [TPC-H] Query 11 and 13 memory issue Feb 13, 2024
@hendrikmakait
Member Author

As it turns out, broadcast=True does not work because of the bug addressed in dask/dask-expr#871.

@phofl
Contributor

phofl commented Feb 13, 2024

The broadcast flag wasn't properly preserved when pushing filters down. This is probably why the results looked weird for @hendrikmakait.

The PR to fix it is here: dask/dask-expr#871

We have to rerun after that one is merged.

@hendrikmakait
Member Author

@phofl: This looks much better now, thanks! https://cloud.coiled.io/clusters/383307/information?tab=Metrics
