Merge pull request #61 from TidierOrg/udf-and-agg-docs
UDF docs, intro `agg()`, `t()`, fix windows
drizk1 authored Sep 23, 2024
2 parents 4099414 + ae097ee commit 7712485
Showing 19 changed files with 467 additions and 83 deletions.
14 changes: 14 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,19 @@
# TidierDB.jl updates

## v0.3.4 - 2024-09-23
TidierDB works with nearly any existing SQL function, and now there are docs about it.
- Docs on using any existing SQL function in TidierDB
- Docs on user defined functions (UDFs) in TidierDB
- Adds `agg()`, which lets any aggregate function built into a database be used in `@mutate`. Adds support for `agg()` in `across`. (`@summarize` continues to allow all aggregate SQL functions without `agg()`.)
- Adds `t(query)` as a more efficient way to reference tables.
```
table = db_table(db, "name")
@chain t(table) ...
```
- Bugfix: fixes MsSQL joins
- Bugfix: window functions
- Bugfix: json paths supported for `json` DuckDB functions

## v0.3.3 - 2024-08-29
- Bugfix: `@mutate` allows type conversion as part of larger mutate expressions

4 changes: 2 additions & 2 deletions docs/examples/UserGuide/getting_started.jl
@@ -61,6 +61,6 @@
# end
# ```
# ---
# Tip: Setting `t(table) = from_query(table)` will save some keystrokes.
# This means after saving the results of `db_table` you can start all chains/refer to the data with `t(table)`
# Tip: `t()` is an alias for `from_query`
# This means that after saving the results of `db_table`, you can use `t(table)` to refer to the table or a prior query
# ---
144 changes: 144 additions & 0 deletions docs/examples/UserGuide/udfs_ex.jl
@@ -0,0 +1,144 @@
# TidierDB is unique in its statement-parsing flexibility. This means that any built-in SQL function or user-defined function (UDF) is readily available.
# To use any function built into a database in `@mutate` or in `@summarize`, simply write the function call as you would in SQL, but replace `'` with `"`. This also applies to any UDF. The examples below illustrate this, mostly in the context of DuckDB.


# ```
# # Set up the connection
# using TidierDB # reexports DuckDB
# db = DuckDB.DB()
# con = DuckDB.connect(db) # this will be important for UDFs
# mtcars_path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
# mtcars = db_table(con, mtcars_path);
# ```
# ## aggregate function in `@summarize`
# Let's use the DuckDB `kurtosis` aggregate function.
# ```
# @chain t(mtcars) begin
# @group_by cyl
# @summarize(kurt = kurtosis(mpg))
# @collect
# end
# 3×2 DataFrame
# Row │ cyl kurt
# │ Int64? Float64?
# ─────┼───────────────────
# 1 │ 4 -1.43411
# 2 │ 6 -1.82944
# 3 │ 8 0.330061
# ```

# ## aggregate functions in `@mutate`
# To use aggregate SQL functions in `@mutate` that are built into a database but exist outside of the TidierDB parser, simply wrap the function call in `agg()`.
# ```
# @chain t(mtcars) begin
# @group_by(cyl)
# @mutate(kurt = agg(kurtosis(mpg)))
# @select cyl mpg kurt
# @collect
# end

# 32×3 DataFrame
# Row │ cyl mpg kurt
# │ Int64? Float64? Float64?
# ─────┼─────────────────────────────
# 1 │ 8 18.7 0.330061
# 2 │ 8 14.3 0.330061
# 3 │ 8 16.4 0.330061
# 4 │ 8 17.3 0.330061
# 5 │ 8 15.2 0.330061
# 6 │ 8 10.4 0.330061
# 7 │ 8 10.4 0.330061
# ⋮ │ ⋮ ⋮ ⋮
# 27 │ 6 21.0 -1.82944
# 28 │ 6 21.4 -1.82944
# 29 │ 6 18.1 -1.82944
# 30 │ 6 19.2 -1.82944
# 31 │ 6 17.8 -1.82944
# 32 │ 6 19.7 -1.82944
# 19 rows omitted

# ```

# ## DuckDB function chaining
# In DuckDB, functions can be chained together with `.`. TidierDB lets you leverage this.
# ```
# @chain t(mtcars) begin
# @mutate(model2 = model.upper().string_split(" ").list_aggr("string_agg",".").concat("."))
# @select model model2
# @collect
# end
# 32×2 DataFrame
# Row │ model model2
# │ String? String?
# ─────┼───────────────────────────────────────
# 1 │ Mazda RX4 MAZDA.RX4.
# 2 │ Mazda RX4 Wag MAZDA.RX4.WAG.
# 3 │ Datsun 710 DATSUN.710.
# 4 │ Hornet 4 Drive HORNET.4.DRIVE.
# 5 │ Hornet Sportabout HORNET.SPORTABOUT.
# 6 │ Valiant VALIANT.
# 7 │ Duster 360 DUSTER.360.
# ⋮ │ ⋮ ⋮
# 27 │ Porsche 914-2 PORSCHE.914-2.
# 28 │ Lotus Europa LOTUS.EUROPA.
# 29 │ Ford Pantera L FORD.PANTERA.L.
# 30 │ Ferrari Dino FERRARI.DINO.
# 31 │ Maserati Bora MASERATI.BORA.
# 32 │ Volvo 142E VOLVO.142E.
# 19 rows omitted
# ```

# ## `rowid` and pseudocolumns
# When a table is not being read directly from a file, `rowid` is available for use. In general, TidierDB should support all pseudocolumns.
# ```
# copy_to(db, mtcars_path, "mtcars"); # copying table in for demonstration purposes
# @chain db_table(con, :mtcars) begin
# @filter(rowid == 4)
# @select(model:hp)
# @collect
# end
# 1×5 DataFrame
# Row │ model mpg cyl disp hp
# │ String? Float64? Int64? Float64? Int64?
# ─────┼───────────────────────────────────────────────────────
# 1 │ Hornet Sportabout 18.7 8 360.0 175
# ```

# ## UDF SQLite Example
# ```
# using SQLite
# sql = connect(sqlite());
# df = DataFrame(id = [string('A' + i ÷ 26, 'A' + i % 26) for i in 0:9],
# groups = [i % 2 == 0 ? "aa" : "bb" for i in 1:10],
# value = repeat(1:5, 2),
# percent = 0.1:0.1:1.0);
#
# copy_to(sql, df, "df_mem");
# SQLite.@register sql function diff_of_squares(x, y)
# x^2 - y^2
# end;
#
# @chain db_table(sql, "df_mem") begin
# @select(value, percent)
# @mutate(plus3 = diff_of_squares(value, percent))
# @collect
# end
# 10×3 DataFrame
# Row │ value percent plus3
# │ Int64 Float64 Float64
# ─────┼─────────────────────────
# 1 │ 1 0.1 0.99
# 2 │ 2 0.2 3.96
# 3 │ 3 0.3 8.91
# 4 │ 4 0.4 15.84
# 5 │ 5 0.5 24.75
# 6 │ 1 0.6 0.64
# 7 │ 2 0.7 3.51
# 8 │ 3 0.8 8.36
# 9 │ 4 0.9 15.19
# 10 │ 5 1.0 24.0
# ```

# ## How to create a UDF in DuckDB
# Example coming soon.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -126,4 +126,5 @@ nav:
- "Writing Functions/Macros with TidierDB Chains" : "examples/generated/UserGuide/functions_pass_to_DB.md"
- "Working With Larger than RAM Datasets" : "examples/generated/UserGuide/outofmemex.md"
- "TidierDB.jl vs Ibis" : "examples/generated/UserGuide/ibis_comp.md"
- "Flexible Syntax and UDFs" : "examples/generated/UserGuide/udfs_ex.md"
- "Reference" : "reference.md"
2 changes: 1 addition & 1 deletion src/TBD_macros.jl
@@ -775,6 +775,6 @@ end
"""
$docstring_show_tables
"""
function show_tables(con::DuckDB.DB)
function show_tables(con::Union{DuckDB.DB, DuckDB.Connection})
return DataFrame(DBInterface.execute(con, "SHOW TABLES"))
end
7 changes: 3 additions & 4 deletions src/TidierDB.jl
@@ -18,7 +18,7 @@ using GZip
@distinct, @left_join, @right_join, @inner_join, @count, @window_order, @window_frame, @show_query, @collect, @slice_max,
@slice_min, @slice_sample, @rename, copy_to, duckdb_open, duckdb_connect, @semi_join, @full_join,
@anti_join, connect, from_query, @interpolate, add_interp_parameter!, update_con, @head,
clickhouse, duckdb, sqlite, mysql, mssql, postgres, athena, snowflake, gbq, oracle, databricks, SQLQuery, show_tables
clickhouse, duckdb, sqlite, mysql, mssql, postgres, athena, snowflake, gbq, oracle, databricks, SQLQuery, show_tables, t

abstract type SQLBackend end

@@ -154,10 +154,9 @@ function finalize_query(sqlquery::SQLQuery)
complete_query = replace(complete_query, "&&" => " AND ", "||" => " OR ",
"FROM )" => ")" , "SELECT SELECT " => "SELECT ", "SELECT SELECT " => "SELECT ", "DISTINCT SELECT " => "DISTINCT ",
"SELECT SELECT SELECT " => "SELECT ", "PARTITION BY GROUP BY" => "PARTITION BY", "GROUP BY GROUP BY" => "GROUP BY", "HAVING HAVING" => "HAVING",
r"var\"(.*?)\"" => s"\1")
r"var\"(.*?)\"" => s"\1", r"\"\\\$" => "\"\$")

complete_query = replace(complete_query, ", AS " => " AS ")

if current_sql_mode[] == postgres() || current_sql_mode[] == duckdb() || current_sql_mode[] == mysql() || current_sql_mode[] == mssql() || current_sql_mode[] == clickhouse() || current_sql_mode[] == athena() || current_sql_mode[] == gbq() || current_sql_mode[] == oracle() || current_sql_mode[] == snowflake() || current_sql_mode[] == databricks()
complete_query = replace(complete_query, "\"" => "'", "==" => "=")
end
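
The effect of these replacements can be sketched with Base Julia's `replace` alone. The query fragment below is hypothetical (it is not output captured from TidierDB), and the multi-pair form of `replace` assumes Julia 1.7 or later. It also shows why string arguments are written with `"` in TidierDB expressions: they are converted to `'` at this stage.

```julia
# Hypothetical fragment, roughly as it might look before finalize_query cleans it up.
fragment = "WHERE model == \"Mazda RX4\"&&var\"cyl\" == 6"

# Julia's logical operators become SQL keywords.
step1 = replace(fragment, "&&" => " AND ", "||" => " OR ")

# The var"..." wrapper around a column name is stripped, keeping the name itself.
step2 = replace(step1, r"var\"(.*?)\"" => s"\1")

# For most backends, double quotes become single quotes and `==` becomes `=`.
final_fragment = replace(step2, "\"" => "'", "==" => "=")

println(final_fragment)
```

The three steps are split out here only for readability; in the diff the first two happen inside a single `replace` call.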
@@ -172,7 +171,7 @@ end


# DuckDB
function get_table_metadata(conn::DuckDB.DB, table_name::String)
function get_table_metadata(conn::Union{DuckDB.DB, DuckDB.Connection}, table_name::String)
set_sql_mode(duckdb());
query =
"""
60 changes: 50 additions & 10 deletions src/db_parsing.jl
@@ -244,14 +244,13 @@ end

function parse_across(expr, metadata)
columns_expr, funcs_expr = expr.args[2], expr.args[3]

# Existing column selection logic remains unchanged
if isa(columns_expr, String)
# Split the string on commas and trim any whitespace around the names
columns_exprs = map(Symbol, split(strip(columns_expr), ", "))
columns_exprs = map(Symbol, split(strip(columns_expr), ","))
elseif isa(columns_expr, Expr) && columns_expr.head == :tuple
# If columns_expr is a tuple expression, extract its arguments
columns_exprs = columns_expr.args
else
# Handle single columns or other expressions by wrapping in an array
columns_exprs = [columns_expr]
end
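
To see what the change of delimiter from `", "` to `","` does to a column string, here is a small Base Julia sketch. The input strings are made-up examples, and the last line (stripping each name) is an illustration only, not something the diff does at this point.

```julia
# Splitting on ", " leaves a string without spaces after the commas in one piece.
names_old = map(Symbol, split(strip("mpg,cyl"), ", "))

# Splitting on "," separates the names.
names_new = map(Symbol, split(strip("mpg,cyl"), ","))

# With a space after the comma, splitting on "," keeps a leading space in the name.
names_spaced = map(Symbol, split(strip("mpg, cyl"), ","))

# Illustration only: stripping each piece handles both spellings.
names_stripped = [Symbol(strip(s)) for s in split("mpg, cyl", ",")]

println(names_old)
println(names_new)
println(names_spaced)
println(names_stripped)
```

Whether the leading space matters depends on how the resolved names are handled later in `parse_across`, which this hunk does not show.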

@@ -261,13 +260,12 @@ function parse_across(expr, metadata)

for func in funcs
for col_name in resolved_columns
func_name = isa(func, Symbol) ? func : func.args[1]
result_name = Symbol(string(func_name), "_", col_name)

# Ensure column names are treated as symbols (identifiers)
col_symbol = Meta.parse(col_name) # Convert string back to symbol

new_expr = :($result_name = $func_name($col_symbol))
func_filled = insert_col_into_func(func, col_symbol)
# Specify "agg" to be skipped in the result name
func_name_str = generate_func_name(func, ["agg"])
result_name = Symbol(func_name_str, "_", col_name)
new_expr = Expr(:(=), result_name, func_filled)
push!(result_exprs, new_expr)
end
end
@@ -276,6 +274,48 @@ function parse_across(expr, metadata)
return combined_expr
end

function insert_col_into_func(func_expr, col_symbol)
if isa(func_expr, Symbol)
# Simple function name; create a call with the column symbol
return Expr(:call, func_expr, col_symbol)
elseif isa(func_expr, Expr) && func_expr.head == :call
# Function call; recursively insert the column symbol into arguments
func_name = func_expr.args[1]
args = func_expr.args[2:end]
new_args = [insert_col_into_func(arg, col_symbol) for arg in args]
return Expr(:call, func_name, new_args...)
else
# Other expressions; return as-is
return func_expr
end
end
function generate_func_name(func_expr, skip_funcs=String[])
if isa(func_expr, Symbol)
return string(func_expr)
elseif isa(func_expr, Expr) && func_expr.head == :call
func_name_expr = func_expr.args[1]
if isa(func_name_expr, Symbol)
func_name = string(func_name_expr)
else
func_name = generate_func_name(func_name_expr, skip_funcs)
end
# Process nested function names
nested_names = [generate_func_name(arg, skip_funcs) for arg in func_expr.args[2:end]]
# Exclude function names in skip_funcs
if func_name in skip_funcs
# Skip adding this function name
return join(nested_names, "_")
else
# Remove empty strings from nested_names
nested_names = filter(n -> n != "", nested_names)
return join([func_name; nested_names], "_")
end
else
return ""
end
end
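
As a rough illustration of how the two helpers above combine inside `parse_across`, the sketch below repeats their definitions (comments trimmed, otherwise as in the diff) so it runs on its own, then builds the expression for a single function and column. The inputs `agg(kurtosis)` and `mpg` are example values chosen to match the docs, not taken from a test in the repository.

```julia
function insert_col_into_func(func_expr, col_symbol)
    if isa(func_expr, Symbol)
        # Bare function name: call it on the column.
        return Expr(:call, func_expr, col_symbol)
    elseif isa(func_expr, Expr) && func_expr.head == :call
        # Existing call: recurse into its arguments.
        func_name = func_expr.args[1]
        new_args = [insert_col_into_func(arg, col_symbol) for arg in func_expr.args[2:end]]
        return Expr(:call, func_name, new_args...)
    else
        return func_expr
    end
end

function generate_func_name(func_expr, skip_funcs=String[])
    if isa(func_expr, Symbol)
        return string(func_expr)
    elseif isa(func_expr, Expr) && func_expr.head == :call
        func_name_expr = func_expr.args[1]
        func_name = isa(func_name_expr, Symbol) ? string(func_name_expr) :
                    generate_func_name(func_name_expr, skip_funcs)
        nested_names = [generate_func_name(arg, skip_funcs) for arg in func_expr.args[2:end]]
        if func_name in skip_funcs
            return join(nested_names, "_")
        else
            nested_names = filter(n -> n != "", nested_names)
            return join([func_name; nested_names], "_")
        end
    else
        return ""
    end
end

# Example inputs: the function given to across, and one resolved column.
func = :(agg(kurtosis))
col_name = "mpg"
col_symbol = Meta.parse(col_name)

func_filled = insert_col_into_func(func, col_symbol)
result_name = Symbol(generate_func_name(func, ["agg"]), "_", col_name)
new_expr = Expr(:(=), result_name, func_filled)

println(new_expr)
```

Passing `["agg"]` as `skip_funcs` keeps the wrapper out of the generated column name, so the result is named after the inner function and the column rather than after `agg`.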


function parse_interpolation2(expr)
MacroTools.postwalk(expr) do x
if @capture(x, !!variable_Symbol)
63 changes: 63 additions & 0 deletions src/docstrings.jl
@@ -1185,4 +1185,67 @@ julia> show_tables(db) # there are no tables in when first loading so df below i
│ String
─────┴────────
```
"""


const docstring_from_query =
"""
from_query(query)
This is an alias for `t()`. Refers to a SQL query without changing the underlying struct. It is an alternate and convenient way to refer to an existing DB table.
# Arguments
- `query`: The SQL query to reference
# Examples
```julia
julia> df = DataFrame(id = [string('A' + i ÷ 26, 'A' + i % 26) for i in 0:9],
groups = [i % 2 == 0 ? "aa" : "bb" for i in 1:10],
value = repeat(1:5, 2),
percent = 0.1:0.1:1.0);
julia> db = connect(duckdb());
julia> copy_to(db, df, "df_mem");
julia> df_mem = db_table(db, "df_mem");
julia> @chain t(df_mem) @collect
10×4 DataFrame
Row │ id groups value percent
│ String String Int64 Float64
─────┼────────────────────────────────
1 │ AA bb 1 0.1
2 │ AB aa 2 0.2
3 │ AC bb 3 0.3
4 │ AD aa 4 0.4
5 │ AE bb 5 0.5
6 │ AF aa 1 0.6
7 │ AG bb 2 0.7
8 │ AH aa 3 0.8
9 │ AI bb 4 0.9
10 │ AJ aa 5 1.0
julia> query_part = @chain t(df_mem) @select groups:percent;
julia> @chain t(query_part) @filter(value == 4) @collect
2×3 DataFrame
Row │ groups value percent
│ String? Int64? Float64?
─────┼───────────────────────────
1 │ aa 4 0.4
2 │ bb 4 0.9
julia> from_query(df_mem)
SQLQuery("", "df_mem", "", "", "", "", "", "", false, false, 4×4 DataFrame
Row │ name type current_selxn table_name
│ String? String? Int64 String
─────┼─────────────────────────────────────────────
1 │ id VARCHAR 1 df_mem
2 │ groups VARCHAR 1 df_mem
3 │ value BIGINT 1 df_mem
4 │ percent DOUBLE 1 df_mem, false, DuckDB.DB(":memory:"), TidierDB.CTE[], 0, nothing, "", "")
```
"""