Commit

Clean up README, index, and Project.toml. Bump version to 0.1.0.
kdpsingh committed Apr 7, 2024
1 parent 79ea80a commit d4868a9
Showing 3 changed files with 290 additions and 136 deletions.
16 changes: 14 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "TidierDB"
uuid = "86993f9b-bbba-4084-97c5-ee15961ad48b"
authors = ["Daniel Rizk <[email protected]> and contributors"]
version = "0.1.0-DEV"
version = "0.1.0"

[deps]
Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
@@ -19,9 +19,21 @@ SQLite = "0aa819cd-b072-5ff4-a722-6bc24af294d9"

[compat]
julia = "1.9"
Arrow = "2.7"
Chain = "0.6"
ClickHouse = "0.2"
DataFrames = "1.5"
Documenter = "1.3"
DuckDB = "0.10"
LibPQ = "1.17"
MacroTools = "0.5"
MySQL = "1.4"
ODBC = "1.1"
Reexport = "0.2, 1"
SQLite = "1.6"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
test = ["Test"]
205 changes: 138 additions & 67 deletions README.md
@@ -1,22 +1,22 @@
## What is TidierDB.jl
## What is TidierDB.jl?

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/TidierOrg/TidierDB.jl/blob/main/LICENSE)
[![Docs: Latest](https://img.shields.io/badge/Docs-Latest-blue.svg)](https://tidierorg.github.io/TidierDB.jl/latest)

TiderDB.jl is a 100% Julia implementation of the dbplyr R package (and similar to python's ibis package).
TiderDB.jl is a 100% Julia implementation of the dbplyr R package, and similar to Python's ibis package.

The main goal of TidierDB.jl is to bring the ease of use and simple syntax of Tidier.jl to mutliple SQL backends,
making data analysis smoother by abstracting away subtle syntax differences between backends.
The main goal of TidierDB.jl is to bring the syntax of Tidier.jl to multiple SQL backends, making it possible to analyze data directly on databases without needing to copy the entire database into memory.

## Currently supported backends include:

- DuckDB (the default) `set_sql_mode(:duckdb)`
- ClickHouse `set_sql_mode(:clickhouse)`
- SQLite `set_sql_mode(:lite)`
- MySQL `set_sql_mode(:mysql)`
- MSSQL `set_sql_mode(:mssql)`
- Postgres `set_sql_mode(:postgres)`

Change the backend by using `set_sql_mode()`
The style of SQL that is generated can be modified using `set_sql_mode()`.

## Installation

@@ -27,13 +27,12 @@ For the stable version:
```

TidierDB.jl currently supports the following top-level macros:

- `@arrange`
- `@group_by`
- `@filter`
- `@select`
- `@mutate` supports `across`
- `@summarize` / `@summarise` supports `across`
- `@mutate`, which supports `across()`
- `@summarize` and `@summarise`, which supports `across()`
- `@distinct`
- `@left_join`, `@right_join`, `@inner_join` (slight syntax differences from TidierData.jl)
- `@count`
@@ -42,7 +41,7 @@ TidierDB.jl currently supports the following top-level macros:
- `@show_query`
- `@collect`

Supported helper functions for most backends include
Supported helper functions for most backends include:
- `across()`
- `desc()`
- `if_else()` and `case_when()`
@@ -52,102 +51,174 @@ Supported helper functions for most backends include
- `is_missing()`
- `missing_if()` and `replace_missing()`

From TidierStrings.jl
From TidierStrings.jl:
- `str_detect`, `str_replace`, `str_replace_all`, `str_remove_all`, `str_remove`

From TidierDates.jl
From TidierDates.jl:
- `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`, `difftime`

Supported aggregate functions (as supported by the backend) with more to come
- `mean`, `minimium`, `maximum`, `std`, `sum`, `cumsum`, `cor`, `cov`, `var`

- `copy_to` (for DuckDB, MySQL, SQLite)

DuckDB specifically enables copy_to to directly reading in .parquet, .json, .csv, .arrow, and https file paths.
```
DuckDB specifically enables copy_to to directly reading in `.parquet`, `.json`, `.csv`, and `.arrow` file, including https file paths.

```julia
path = "file_path.parquet"
copy_to(conn, file_path, "table_name")
```

Bang bang `!!` Interpolation for columns and values is supported.
## What is the recommended way to use TidierDB?

There are a few subtle but important differences from Tidier.jl outlined [here](https://github.com/drizk1/TidierDB.jl/blob/main/docs/examples/UserGuide/key_differences.jl).
Typically, you will want to use TidierDB alongside TidierData because there are certain functionality (such as pivoting) which are only supported in TidierData and can only be performed on data frames.

Missing a function or backend?
Our recommended path for using TidierDB is to import the package so that there are no namespace conflicts with TidierData. Once TidierDB is integrated with Tidier, then Tidier will automatically load the packages in this fashion.

You can actually use any (non-agg) sql function in mutate with the correct sql syntax and it will will still run.
But open an issue, and we would be happy to address it.
First, let's develop and execute a query using TidierDB. Notice that all top-level macros and functions originating from TidierDB start with a `DB` prefix. Any functions defined within macros do *not* need to be prefixed within `DB` because they are actually pseudofunctions that are in actuality converted into SQL code.

Finally, some examples
```
using TidierDB
mem = duckdb_open(":memory:");
db = duckdb_connect(mem);
Even though the code reads similarly to TidierData, note that no computational work actually occurs until you run `DB.@collect()`, which runs the SQL query and instantiates the result as a DataFrame.

```julia
using TidierData
import TidierDB as DB

mem = DB.duckdb_open(":memory:");
db = DB.duckdb_connect(mem);
path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
copy_to(db, path, "mtcars2");
@chain db_table(db, :mtcars2) begin
@filter(model != starts_with("M"))
@group_by(cyl)
@summarize(mpg = mean(mpg))
@mutate(sqaured = mpg^2,
rounded = round(mpg),
efficiency = case_when(
mpg >= cyl^2 , 12,
mpg < 15.2 , 14,
44))
@filter(efficiency>12)
@arrange(rounded)
@show_query
#@collect
DB.copy_to(db, path, "mtcars");

@chain DB.db_table(db, :mtcars) begin
DB.@filter(!starts_with(model, "M"))
DB.@group_by(cyl)
DB.@summarize(mpg = mean(mpg))
DB.@mutate(mpg_squared = mpg^2,
mpg_rounded = round(mpg),
mpg_efficiency = case_when(
mpg >= cyl^2 , "efficient",
mpg < 15.2 , "inefficient",
"moderate"))
DB.@filter(mpg_efficiency in ("moderate", "efficient"))
DB.@arrange(desc(mpg_rounded))
DB.@collect
end
```

```
2×5 DataFrame
Row │ cyl mpg mpg_squared mpg_rounded mpg_efficiency
│ Int64? Float64? Float64? Float64? String?
─────┼────────────────────────────────────────────────────────────
1 │ 4 27.3444 747.719 27.0 efficient
2 │ 6 19.7333 389.404 20.0 moderate
```

## What if we wanted to pivot the result?

We cannot do this using TidierDB. However, we can call `@pivot_longer()` from TidierData *after* the result of the query has been instantiated as a DataFrame, like this:

```julia
@chain DB.db_table(db, :mtcars) begin
DB.@filter(!starts_with(model, "M"))
DB.@group_by(cyl)
DB.@summarize(mpg = mean(mpg))
DB.@mutate(mpg_squared = mpg^2,
mpg_rounded = round(mpg),
mpg_efficiency = case_when(
mpg >= cyl^2 , "efficient",
mpg < 15.2 , "inefficient",
"moderate"))
DB.@filter(mpg_efficiency in ("moderate", "efficient"))
DB.@arrange(desc(mpg_rounded))
DB.@collect
@pivot_longer(everything(), names_to = "variable", values_to = "value")
end
```

```
10×2 DataFrame
Row │ variable value
│ String Any
─────┼───────────────────────────
1 │ cyl 4
2 │ cyl 6
3 │ mpg 27.3444
4 │ mpg 19.7333
5 │ mpg_squared 747.719
6 │ mpg_squared 389.404
7 │ mpg_rounded 27.0
8 │ mpg_rounded 20.0
9 │ mpg_efficiency efficient
10 │ mpg_efficiency moderate
```

## What SQL query does TidierDB generate for a given piece of Julia code?

We can replace `DB.collect()` with `DB.@show_query` to reveal the underlying SQL query being generated by TidierDB. To handle complex queries, TidierDB makes heavy use of Common Table Expressions (CTE), which are a useful tool to organize long queries.

```julia
@chain DB.db_table(db, :mtcars) begin
DB.@filter(!starts_with(model, "M"))
DB.@group_by(cyl)
DB.@summarize(mpg = mean(mpg))
DB.@mutate(mpg_squared = mpg^2,
mpg_rounded = round(mpg),
mpg_efficiency = case_when(
mpg >= cyl^2 , "efficient",
mpg < 15.2 , "inefficient",
"moderate"))
DB.@filter(mpg_efficiency in ("moderate", "efficient"))
DB.@arrange(desc(mpg_rounded))
DB.@show_query
end
```

```
WITH cte_1 AS (
SELECT *
FROM mtcars2
WHERE NOT (model LIKE 'M%')),
FROM mtcars
WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
FROM cte_1
GROUP BY cyl),
cte_3 AS (
SELECT cyl, mpg, POWER(mpg, 2) AS sqaured, ROUND(mpg) AS rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 12 WHEN mpg < 15.2 THEN 14 ELSE 44 END AS efficiency
SELECT cyl, mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 'efficient' WHEN mpg < 15.2 THEN 'inefficient' ELSE 'moderate' END AS mpg_efficiency
FROM cte_2 ),
cte_4 AS (
SELECT *
FROM cte_3
WHERE efficiency > 12)
WHERE mpg_efficiency in ('moderate', 'efficient'))
SELECT *
FROM cte_4
ORDER BY rounded ASC
```
Now instead of ending the chain with `@show_query`, we use `@collect` to pull the df into the local environment
```
2×5 DataFrame
Row │ cyl mpg sqaured rounded efficiency
│ Int64 Float64 Float64 Float64 Int64
─────┼──────────────────────────────────────────────
1 │ 8 14.75 217.562 15.0 14
2 │ 6 19.7333 389.404 20.0 44
ORDER BY mpg_rounded DESC
```
`across` in `summarize`
```
@chain db_table(db, :mtcars2) begin
@group_by(cyl)
@summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
#@show_query
@collect

## TidierDB is already quite fully-featured, supporting advanced TidierData functions like `across()` for multi-column selection.

```julia
@chain DB.db_table(db, :mtcars) begin
DB.@group_by(cyl)
DB.@summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
DB.@collect
end
```

```
3×5 DataFrame
Row │ cyl mean_am mean_vs sum_am sum_vs
│ Int64 Float64 Float64 Int64 Int64
─────┼───────────────────────────────────────────
1 │ 4 0.727273 0.909091 8 10
2 │ 6 0.428571 0.571429 3 4
3 │ 8 0.142857 0.0 2 0
Row │ cyl mean_am mean_vs sum_am sum_vs
│ Int64? Float64? Float64? Int128? Int128?
─────┼──────────────────────────────────────────────
1 │ 4 0.727273 0.909091 8 10
2 │ 6 0.428571 0.571429 3 4
3 │ 8 0.142857 0.0 2 0
```

Bang bang `!!` interpolation for columns and values is also supported.

There are a few subtle but important differences from Tidier.jl outlined [here](https://tidierorg.github.io/TidierDB.jl/latest/examples/generated/UserGuide/key_differences/).

## Missing a function or backend?

You can use any existing SQL function within `@mutate` with the correct SQL syntax and it should just work.

This links to [examples](https://github.com/drizk1/TidierDB.jl/blob/main/src/olympics_examples_fromweb.jl) which achieve the same result as the SQL queries.
But if you run into problems please open an issue, and we will be happy to take a look!