Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add slice tests! #111

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mistermichaelll
Copy link

@mistermichaelll mistermichaelll commented Aug 7, 2024

Summary

This PR adds relevant tests for the @slice() macro based on the tests which are in the dplyr code, found here. I skipped quite a few tests that involve functionality that I don't think is implemented yet. I also didn't go crazy and add all the possible tests to keep the PR at a reasonable size for review.

I went ahead and put these under the tests/ folder rather than implementing as doctests. My thinking on this is that there are two types of things that I can foresee wanting to test:

  • basic examples that show how someone could use the functions (the doctests).
  • tests focused on on edge-cases or cases that test the robustness and correctness of the implementation.
    • sometimes this can even be silly stuff that doesn't necessarily make sense to include in the docs, like "this code throws an error if someone tries to use a string as a slice index...right?"

This keeps the documentation examples concise and helpful while also thoroughly testing the underlying functionality in various scenarios. I think most if not all of these tests fall under the second case.

Overview of the Implemented Tests

The test summary currently looks like this:

Test Summary:                                           | Pass  Error  Total  Time
TidierData                                              |   20      4     24  8.1s
  pivots                                                |   12            12  3.0s
    pivot_wider                                         |    3             3  0.9s
    pivot_longer                                        |    9             9  2.1s
  @slice()                                              |    8      4     12  3.7s
    empty slice drops all rows                          |    2             2  0.6s
    slicing DataFrame yields DataFrame                  |    1             1  0.1s
    slice keeps positive indices, ignoring out of range |    1      1      2  2.3s
    slice keeps negative indices, ignoring out of range |    1      1      2  0.4s
    slice errors if positive and negative indices mixed |    1             1  0.1s
    slice errors if index is not numeric                |    2             2  0.1s
    slice keeps zero length groups                      |           1      1  0.1s
    slice retains labels for zero length groups         |           1      1  0.0s

I'll drop some descriptions of the failing tests below for discussion😄

Positive Indices Test

Here's the R code and output from the test I'm trying to replicate:

gf <- group_by(tibble(g = c(1, 2, 2, 3, 3, 3), id = 1:6), g)
out <- slice(gf, 2)
out

#> # A tibble: 2 × 2
#> # Groups:   g [2]
#>       g    id
#>   <dbl> <int>
#> 1     2     3
#> 2     3     5

Here's what happens in Julia:

gf = @group_by(DataFrame(g = [1, 2, 2, 3, 3, 3], id = 1:6), g)
out = @slice(gf, 2)

ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [[2]]
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:755
 [2] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [3] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [4] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [5] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [6] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [7] macro expansion
   @ ~/Documents/github/TidierData.jl/src/slice.jl:19 [inlined]
 [8] top-level scope
   @ REPL[22]:1

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:352 [inlined]
 [2] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:752
 [3] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [4] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [5] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [6] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [7] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [8] macro expansion
   @ ~/Documents/github/TidierData.jl/src/slice.jl:19 [inlined]
 [9] top-level scope
   @ REPL[22]:1

    nested task error: BoundsError: attempt to access 1-element Vector{Int64} at index [[2]]
    Stacktrace:
     [1] throw_boundserror(A::Vector{Int64}, I::Tuple{Vector{Int64}})
       @ Base ./abstractarray.jl:737
     [2] checkbounds
       @ ./abstractarray.jl:702 [inlined]
     [3] _getindex
       @ ./multidimensional.jl:888 [inlined]
     [4] getindex
       @ ./abstractarray.jl:1291 [inlined]
     [5] getindex
       @ ~/.julia/packages/DataFrames/58MUJ/src/subdataframe/subdataframe.jl:181 [inlined]
     [6] (::var"#51#53")(sdf::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}})
       @ Main ~/Documents/github/TidierData.jl/src/slice.jl:20
     [7] _combine_process_callable(wcs_i::Base.RefValue{…}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{…}, seen_cols::Dict{…}, trans_res::Vector{…}, idx_agg::Base.RefValue{…}, threads::Bool)
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:381
     [8] macro expansion
       @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:729 [inlined]
     [9] (::DataFrames.var"#759#768"{})()
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/other/utils.jl:227
Some type information was truncated. Use `show(err)` to see complete types.

Negative Indices Test

Same for the negative indices, here's the R code:

gf <- group_by(tibble(g = c(1, 2, 2, 3, 3, 3), id = 1:6), g)
out <- slice(gf, -(1:2))
out

#> # A tibble: 1 × 2
#> # Groups:   g [1]
#>       g    id
#>   <dbl> <int>
#> 1     3     6

Here's the Julia code and resulting error (it's very similar to the one above)

gf = @group_by(DataFrame(g = [1, 2, 2, 3, 3, 3], id = 1:6), g)
out = @slice(gf, -(1:2))

ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [Not([1, 2])]
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:755
 [2] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [3] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [4] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [5] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [6] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [7] top-level scope
   @ ~/Documents/github/TidierData.jl/src/slice.jl:23

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:352 [inlined]
 [2] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:752
 [3] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [4] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [5] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [6] combine
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857 [inlined]
 [7] #combine#775
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:846 [inlined]
 [8] top-level scope
   @ ~/Documents/github/TidierData.jl/src/slice.jl:23

    nested task error: BoundsError: attempt to access 1-element Vector{Int64} at index [Not([1, 2])]
    Stacktrace:
     [1] throw_boundserror(A::Vector{Int64}, I::Tuple{InvertedIndices.InvertedIndexIterator{Int64, Vector{…}, Base.OneTo{…}}})
       @ Base ./abstractarray.jl:737
     [2] checkbounds
       @ ./abstractarray.jl:702 [inlined]
     [3] _getindex
       @ ./multidimensional.jl:888 [inlined]
     [4] getindex
       @ ./abstractarray.jl:1291 [inlined]
     [5] getindex
       @ ~/.julia/packages/DataFrames/58MUJ/src/subdataframe/subdataframe.jl:181 [inlined]
     [6] (::var"#56#58")(sdf::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}})
       @ Main ~/Documents/github/TidierData.jl/src/slice.jl:24
     [7] _combine_process_callable(wcs_i::Base.RefValue{…}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{…}, seen_cols::Dict{…}, trans_res::Vector{…}, idx_agg::Base.RefValue{…}, threads::Bool)
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:381
     [8] macro expansion
       @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:729 [inlined]
     [9] (::DataFrames.var"#759#768"{})()
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/other/utils.jl:227
Some type information was truncated. Use `show(err)` to see complete types.

I suspect that these are failing because GroupedDataFrames don't behave quite the same as a grouped tibble does in R as far as things like indexing go, so I'm curious if this is a place where we'd expect the Tidier equivalent of dplyr code to behave exactly the same.

Zero-Length Groups

This is an interesting set of tests. Basically the gist is that in dplyr, if one of the grouping variables is a factor with defined levels, slice() should still retain this group when working with a grouped data frame. This is mostly for consistency/expected results when working with grouped data in R.

There isn't a group_size function in Julia, so I tried to get the size of each group for comparison with combine and nrow. The zero-length group doesn't seem to be accounted for, so this test fails as the size vector is of length 2 rather than one of length 3 that looks like [1, 1, 0] as it does in the R test.

That being said - this may not be a slice() issue, as the GroupedDataFrame itself seems to only have 2 groups and doesn't appear to account for the zero-length group even though it seems to recognize that there is one.

julia> df
GroupedDataFrame with 2 groups based on keys: e, f, g
First Group (2 rows): e = 1, f = CategoricalValue{Int64, UInt32} 1 (1/3), g = 1
 Row │ e      f     g      x     
     │ Int64  Cat…  Int64  Int64 
─────┼───────────────────────────
   1 │     1  1         1      1
   2 │     1  1         1      2
⋮
Last Group (2 rows): e = 1, f = CategoricalValue{Int64, UInt32} 2 (2/3), g = 2
 Row │ e      f     g      x     
     │ Int64  Cat…  Int64  Int64 
─────┼───────────────────────────
   1 │     1  2         2      1
   2 │     1  2         2      4

So this may be a case where we really don't expect the same behavior because DataFrames doesn't behave the same as R--and if so we could probably just drop these two testsets.

@kdpsingh
Copy link
Member

kdpsingh commented Sep 3, 2024

Thank you so much for putting these together! Going to look through them soon and decide which ones are readily fixable and which ones would require rewriting underlying DataFrames.jl code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants