Add slice tests! #111

mistermichaelll · 2024-08-07T17:53:27Z

Summary

This PR adds relevant tests for the @slice() macro based on the tests which are in the dplyr code, found here. I skipped quite a few tests that involve functionality that I don't think is implemented yet. I also didn't go crazy and add all the possible tests to keep the PR at a reasonable size for review.

I went ahead and put these under the tests/ folder rather than implementing as doctests. My thinking on this is that there are two types of things that I can foresee wanting to test:

basic examples that show how someone could use the functions (the doctests).
tests focused on on edge-cases or cases that test the robustness and correctness of the implementation.
- sometimes this can even be silly stuff that doesn't necessarily make sense to include in the docs, like "this code throws an error if someone tries to use a string as a slice index...right?"

This keeps the documentation examples concise and helpful while also thoroughly testing the underlying functionality in various scenarios. I think most if not all of these tests fall under the second case.

Overview of the Implemented Tests

The test summary currently looks like this:

Test Summary:                                           | Pass  Error  Total  Time
TidierData                                              |   20      4     24  8.1s
  pivots                                                |   12            12  3.0s
    pivot_wider                                         |    3             3  0.9s
    pivot_longer                                        |    9             9  2.1s
  @slice()                                              |    8      4     12  3.7s
    empty slice drops all rows                          |    2             2  0.6s
    slicing DataFrame yields DataFrame                  |    1             1  0.1s
    slice keeps positive indices, ignoring out of range |    1      1      2  2.3s
    slice keeps negative indices, ignoring out of range |    1      1      2  0.4s
    slice errors if positive and negative indices mixed |    1             1  0.1s
    slice errors if index is not numeric                |    2             2  0.1s
    slice keeps zero length groups                      |           1      1  0.1s
    slice retains labels for zero length groups         |           1      1  0.0s

I'll drop some descriptions of the failing tests below for discussion😄

Positive Indices Test

Here's the R code and output from the test I'm trying to replicate:

gf <- group_by(tibble(g = c(1, 2, 2, 3, 3, 3), id = 1:6), g)
out <- slice(gf, 2)
out

#> # A tibble: 2 × 2
#> # Groups:   g [2]
#>       g    id
#>   <dbl> <int>
#> 1     2     3
#> 2     3     5

Here's what happens in Julia:

gf = @group_by(DataFrame(g = [1, 2, 2, 3, 3, 3], id = 1:6), g)
out = @slice(gf, 2)

ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [[2]]
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:755
 [2] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [3] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [4] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [5] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [6] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [7] macro expansion
   @ ~/Documents/github/TidierData.jl/src/slice.jl:19 [inlined]
 [8] top-level scope
   @ REPL[22]:1

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:352 [inlined]
 [2] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:752
 [3] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [4] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [5] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [6] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [7] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [8] macro expansion
   @ ~/Documents/github/TidierData.jl/src/slice.jl:19 [inlined]
 [9] top-level scope
   @ REPL[22]:1

    nested task error: BoundsError: attempt to access 1-element Vector{Int64} at index [[2]]
    Stacktrace:
     [1] throw_boundserror(A::Vector{Int64}, I::Tuple{Vector{Int64}})
       @ Base ./abstractarray.jl:737
     [2] checkbounds
       @ ./abstractarray.jl:702 [inlined]
     [3] _getindex
       @ ./multidimensional.jl:888 [inlined]
     [4] getindex
       @ ./abstractarray.jl:1291 [inlined]
     [5] getindex
       @ ~/.julia/packages/DataFrames/58MUJ/src/subdataframe/subdataframe.jl:181 [inlined]
     [6] (::var"#51#53")(sdf::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}})
       @ Main ~/Documents/github/TidierData.jl/src/slice.jl:20
     [7] _combine_process_callable(wcs_i::Base.RefValue{…}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{…}, seen_cols::Dict{…}, trans_res::Vector{…}, idx_agg::Base.RefValue{…}, threads::Bool)
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:381
     [8] macro expansion
       @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:729 [inlined]
     [9] (::DataFrames.var"#759#768"{…})()
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/other/utils.jl:227
Some type information was truncated. Use `show(err)` to see complete types.

Negative Indices Test

Same for the negative indices, here's the R code:

gf <- group_by(tibble(g = c(1, 2, 2, 3, 3, 3), id = 1:6), g)
out <- slice(gf, -(1:2))
out

#> # A tibble: 1 × 2
#> # Groups:   g [1]
#>       g    id
#>   <dbl> <int>
#> 1     3     6

Here's the Julia code and resulting error (it's very similar to the one above)

gf = @group_by(DataFrame(g = [1, 2, 2, 3, 3, 3], id = 1:6), g)
out = @slice(gf, -(1:2))

ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [Not([1, 2])]
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:755
 [2] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [3] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [4] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [5] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [6] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [7] top-level scope
   @ ~/Documents/github/TidierData.jl/src/slice.jl:23

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:352 [inlined]
 [2] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:752
 [3] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [4] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [5] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [6] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [7] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [8] top-level scope
   @ ~/Documents/github/TidierData.jl/src/slice.jl:23

    nested task error: BoundsError: attempt to access 1-element Vector{Int64} at index [Not([1, 2])]
    Stacktrace:
     [1] throw_boundserror(A::Vector{Int64}, I::Tuple{InvertedIndices.InvertedIndexIterator{Int64, Vector{…}, Base.OneTo{…}}})
       @ Base ./abstractarray.jl:737
     [2] checkbounds
       @ ./abstractarray.jl:702 [inlined]
     [3] _getindex
       @ ./multidimensional.jl:888 [inlined]
     [4] getindex
       @ ./abstractarray.jl:1291 [inlined]
     [5] getindex
       @ ~/.julia/packages/DataFrames/58MUJ/src/subdataframe/subdataframe.jl:181 [inlined]
     [6] (::var"#56#58")(sdf::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}})
       @ Main ~/Documents/github/TidierData.jl/src/slice.jl:24
     [7] _combine_process_callable(wcs_i::Base.RefValue{…}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{…}, seen_cols::Dict{…}, trans_res::Vector{…}, idx_agg::Base.RefValue{…}, threads::Bool)
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:381
     [8] macro expansion
       @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:729 [inlined]
     [9] (::DataFrames.var"#759#768"{…})()
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/other/utils.jl:227
Some type information was truncated. Use `show(err)` to see complete types.

I suspect that these are failing because GroupedDataFrames don't behave quite the same as a grouped tibble does in R as far as things like indexing go, so I'm curious if this is a place where we'd expect the Tidier equivalent of dplyr code to behave exactly the same.

Zero-Length Groups

This is an interesting set of tests. Basically the gist is that in dplyr, if one of the grouping variables is a factor with defined levels, slice() should still retain this group when working with a grouped data frame. This is mostly for consistency/expected results when working with grouped data in R.

There isn't a group_size function in Julia, so I tried to get the size of each group for comparison with combine and nrow. The zero-length group doesn't seem to be accounted for, so this test fails as the size vector is of length 2 rather than one of length 3 that looks like [1, 1, 0] as it does in the R test.

That being said - this may not be a slice() issue, as the GroupedDataFrame itself seems to only have 2 groups and doesn't appear to account for the zero-length group even though it seems to recognize that there is one.

julia> df
GroupedDataFrame with 2 groups based on keys: e, f, g
First Group (2 rows): e = 1, f = CategoricalValue{Int64, UInt32} 1 (1/3), g = 1
 Row │ e      f     g      x     
     │ Int64  Cat…  Int64  Int64 
─────┼───────────────────────────
   1 │     1  1         1      1
   2 │     1  1         1      2
⋮
Last Group (2 rows): e = 1, f = CategoricalValue{Int64, UInt32} 2 (2/3), g = 2
 Row │ e      f     g      x     
     │ Int64  Cat…  Int64  Int64 
─────┼───────────────────────────
   1 │     1  2         2      1
   2 │     1  2         2      4

So this may be a case where we really don't expect the same behavior because DataFrames doesn't behave the same as R--and if so we could probably just drop these two testsets.

kdpsingh · 2024-09-03T17:01:27Z

Thank you so much for putting these together! Going to look through them soon and decide which ones are readily fixable and which ones would require rewriting underlying DataFrames.jl code.

mistermichaelll added 2 commits August 7, 2024 13:06

add initial slice tests

3a165f6

clearer to use function name

34f9817

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add slice tests! #111

Add slice tests! #111

mistermichaelll commented Aug 7, 2024 •

edited

Loading

kdpsingh commented Sep 3, 2024

Add slice tests! #111

Are you sure you want to change the base?

Add slice tests! #111

Conversation

mistermichaelll commented Aug 7, 2024 • edited Loading

Summary

Overview of the Implemented Tests

Positive Indices Test

Negative Indices Test

Zero-Length Groups

kdpsingh commented Sep 3, 2024

mistermichaelll commented Aug 7, 2024 •

edited

Loading