
Optimize min_count when expected_groups is not provided. #236

Merged
merged 2 commits into main from optimize-more on May 2, 2024

Conversation

dcherian (Collaborator) commented Apr 26, 2023

xref #363

Skip reindexing when expected_groups is not provided. In this case, we detect all available groups anyway.

This is less impactful than it seems because Xarray always sets expected_groups = (pd.RangeIndex(...),).

A solution is to set min_count=0 upstream for UniqueGrouper.
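To make the two calling patterns concrete, here is a rough usage sketch from the caller's side (it assumes the public flox.groupby_reduce entry point with its standard keyword names; the optimization itself is internal to flox):

```python
import numpy as np
import pandas as pd
import flox

array = np.array([1.0, 2.0, 3.0, 4.0])
by = np.array([0, 0, 1, 1])

# No expected_groups: flox detects the groups from `by`, so every detected
# group has at least one member and the min_count=1 check adds nothing.
result, groups = flox.groupby_reduce(
    array, by, func="sum", min_count=1, fill_value=np.nan
)

# With expected_groups (Xarray passes a pd.RangeIndex over the detected
# groups), requested groups may be missing from `by` and must be filled,
# so the count accumulation still matters.
result, groups = flox.groupby_reduce(
    array, by, func="sum", expected_groups=pd.RangeIndex(3),
    min_count=1, fill_value=np.nan,
)
```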

dcherian force-pushed the optimize-more branch 3 times, most recently from 66010f7 to fe8d3c9 on April 28, 2023 15:38
dcherian marked this pull request as draft May 1, 2023 21:39
For pure numpy arrays, min_count=1 (the xarray default) is the same
as min_count=None with the right fill_value. This avoids
one useless pass over the data and one useless copy.

With dask, we always need to accumulate the count to make sure we
get the right values at the end.
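A hypothetical sketch of the check these commits describe (the helper name and arguments are illustrative, not flox's actual internals):

```python
from typing import Optional

def normalize_min_count(
    min_count: Optional[int], expected_groups, is_dask: bool
) -> Optional[int]:
    # With dask we don't know at graph-construction time how many values
    # each group will receive, so the count must always be accumulated.
    if is_dask:
        return min_count
    # Pure numpy with groups detected from the data: every group has at
    # least one member, so min_count=1 is a no-op given the right
    # fill_value. Dropping it skips one pass over the data and one copy.
    if expected_groups is None and min_count is not None and min_count <= 1:
        return None
    return min_count
```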
dcherian marked this pull request as ready for review May 2, 2024 14:33
dcherian changed the title from Optimizations to Optimize min_count when expected_groups is not provided. May 2, 2024
dcherian merged commit 0083ab2 into main May 2, 2024
15 checks passed
dcherian deleted the optimize-more branch May 2, 2024 14:43
dcherian added a commit that referenced this pull request May 2, 2024
* main: (64 commits)
  import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  Optimize `min_count` when `expected_groups` is not provided. (#236)
  Use threadpool for finding labels in chunk (#327)
  Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  Edits to climatology doc (#361)
  Fix benchmarks (#358)
  Trim CI (#355)
  [pre-commit.ci] pre-commit autoupdate (#350)
  Initial minimal working Cubed example for "map-reduce" (#352)
  Bump codecov/codecov-action from 4.1.0 to 4.1.1 (#349)
  `method` heuristics: Avoid dot product as much as possible (#347)
  Fix nanlen with strings (#344)
  Fix direct quantile reduction (#343)
  Fix upstream-dev CI, silence warnings (#341)
  Bump codecov/codecov-action from 4.0.0 to 4.1.0 (#338)
  Fix direct reductions of Xarray objects (#339)
  Test with py3.12 (#336)
  ...
dcherian added a commit that referenced this pull request Jun 30, 2024
* main:
  Bump codecov/codecov-action from 4.3.1 to 4.4.1 (#366)
  Cubed blockwise (#357)
  Remove errant print statement
  import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  Optimize `min_count` when `expected_groups` is not provided. (#236)
  Use threadpool for finding labels in chunk (#327)
  Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  Edits to climatology doc (#361)
  Fix benchmarks (#358)