Clean up README, index, and Project.toml. Bump version to 0.1.0.

TidierOrg · Apr 7, 2024 · d4868a9
1 parent 79ea80a
Showing 3 changed files with 290 additions and 136 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "TidierDB"
 uuid = "86993f9b-bbba-4084-97c5-ee15961ad48b"
 authors = ["Daniel Rizk <[email protected]> and contributors"]
-version = "0.1.0-DEV"
+version = "0.1.0"
 
 [deps]
 Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
@@ -19,9 +19,21 @@ SQLite = "0aa819cd-b072-5ff4-a722-6bc24af294d9"
 
 [compat]
 julia = "1.9"
+Arrow = "2.7"
+Chain = "0.6"
+ClickHouse = "0.2"
+DataFrames = "1.5"
+Documenter = "1.3"
+DuckDB = "0.10"
+LibPQ = "1.17"
+MacroTools = "0.5"
+MySQL = "1.4"
+ODBC = "1.1"
+Reexport = "0.2, 1"
+SQLite = "1.6"
 
 [extras]
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test"]
+test = ["Test"]
diff --git a/README.md b/README.md
@@ -1,22 +1,22 @@
-## What is TidierDB.jl
+## What is TidierDB.jl?
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/TidierOrg/TidierDB.jl/blob/main/LICENSE)
 [![Docs: Latest](https://img.shields.io/badge/Docs-Latest-blue.svg)](https://tidierorg.github.io/TidierDB.jl/latest)
 
-TiderDB.jl is a 100% Julia implementation of the dbplyr R package (and similar to python's ibis package).
+TidierDB.jl is a 100% Julia implementation of the dbplyr R package, and is similar to Python's ibis package.
 
-The main goal of TidierDB.jl is to bring the ease of use and simple syntax of Tidier.jl to mutliple SQL backends,
-making data analysis smoother by abstracting away subtle syntax differences between backends.
+The main goal of TidierDB.jl is to bring the syntax of Tidier.jl to multiple SQL backends, making it possible to analyze data directly on databases without needing to copy the entire database into memory.
 
 ## Currently supported backends include:
+
 - DuckDB (the default) `set_sql_mode(:duckdb)`
 - ClickHouse `set_sql_mode(:clickhouse)`
 - SQLite `set_sql_mode(:lite)`
 - MySQL `set_sql_mode(:mysql)`
 - MSSQL `set_sql_mode(:mssql)`
 - Postgres `set_sql_mode(:postgres)`
 
-Change the backend by using `set_sql_mode()`
+The style of SQL that is generated can be modified using `set_sql_mode()`.
 
 ## Installation
 
@@ -27,13 +27,12 @@ For the stable version:
 ```
 
 TidierDB.jl currently supports the following top-level macros:
-
 - `@arrange`
 - `@group_by` 
 - `@filter`
 - `@select`
-- `@mutate` supports `across` 
-- `@summarize` / `@summarise` supports `across` 
+- `@mutate`, which supports `across()` 
+- `@summarize` and `@summarise`, which support `across()` 
 - `@distinct`
 - `@left_join`, `@right_join`, `@inner_join` (slight syntax differences from TidierData.jl)
 - `@count`
@@ -42,7 +41,7 @@ TidierDB.jl currently supports the following top-level macros:
 - `@show_query`
 - `@collect`
 
-Supported helper functions for most backends include
+Supported helper functions for most backends include:
 - `across()`
 - `desc()`
 - `if_else()` and `case_when()`
@@ -52,102 +51,174 @@ Supported helper functions for most backends include
 - `is_missing()`
 - `missing_if()` and `replace_missing()`
 
-From TidierStrings.jl
+From TidierStrings.jl:
 - `str_detect`, `str_replace`, `str_replace_all`, `str_remove_all`, `str_remove`
 
-From TidierDates.jl
+From TidierDates.jl:
 -  `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`, `difftime`
 
 Supported aggregate functions (as supported by the backend) with more to come
 - `mean`, `minimium`, `maximum`, `std`, `sum`, `cumsum`, `cor`, `cov`, `var`
-
 - `copy_to` (for DuckDB, MySQL, SQLite)
 
-DuckDB specifically enables copy_to to directly reading in .parquet, .json, .csv, .arrow, and https file paths.
-```
+DuckDB specifically enables `copy_to` to directly read `.parquet`, `.json`, `.csv`, and `.arrow` files, including from https file paths.
+
+```julia
 path = "file_path.parquet"
 copy_to(conn, file_path, "table_name")
 ```
 
-Bang bang `!!` Interpolation for columns and values is supported.
+## What is the recommended way to use TidierDB?
 
-There are a few subtle but important differences from Tidier.jl outlined [here](https://github.com/drizk1/TidierDB.jl/blob/main/docs/examples/UserGuide/key_differences.jl).
+Typically, you will want to use TidierDB alongside TidierData because certain functionality (such as pivoting) is only supported in TidierData and can only be performed on data frames.
 
-Missing a function or backend?
+Our recommended path for using TidierDB is to import the package so that there are no namespace conflicts with TidierData. Once TidierDB is integrated with Tidier, then Tidier will automatically load the packages in this fashion.
 
-You can actually use any (non-agg) sql function in mutate with the correct sql syntax and it will will still run.
-But open an issue, and we would be happy to address it.
+First, let's develop and execute a query using TidierDB. Notice that all top-level macros and functions originating from TidierDB start with a `DB` prefix. Any functions used within macros do *not* need to be prefixed with `DB` because they are pseudofunctions that are converted into SQL code.
 
-Finally, some examples
-```
-using TidierDB
-mem = duckdb_open(":memory:");
-db = duckdb_connect(mem);
+Even though the code reads similarly to TidierData, note that no computational work actually occurs until you run `DB.@collect()`, which runs the SQL query and instantiates the result as a DataFrame.
+
+```julia
+using TidierData
+import TidierDB as DB
+
+mem = DB.duckdb_open(":memory:");
+db = DB.duckdb_connect(mem);
 path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
-copy_to(db, path, "mtcars2");
-@chain db_table(db, :mtcars2) begin
-    @filter(model != starts_with("M"))
-    @group_by(cyl)
-    @summarize(mpg = mean(mpg))
-    @mutate(sqaured = mpg^2, 
-               rounded = round(mpg), 
-               efficiency = case_when(
-                             mpg >= cyl^2 , 12,
-                             mpg < 15.2 , 14,
-                              44))            
-    @filter(efficiency>12)                       
-    @arrange(rounded)
-    @show_query
-    #@collect
+DB.copy_to(db, path, "mtcars");
+
+@chain DB.db_table(db, :mtcars) begin
+    DB.@filter(!starts_with(model, "M"))
+    DB.@group_by(cyl)
+    DB.@summarize(mpg = mean(mpg))
+    DB.@mutate(mpg_squared = mpg^2, 
+               mpg_rounded = round(mpg), 
+               mpg_efficiency = case_when(
+                                 mpg >= cyl^2 , "efficient",
+                                 mpg < 15.2 , "inefficient",
+                                 "moderate"))            
+    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
+    DB.@arrange(desc(mpg_rounded))
+    DB.@collect
+end
+```
+
+```
+2×5 DataFrame
+ Row │ cyl     mpg       mpg_squared  mpg_rounded  mpg_efficiency 
+     │ Int64?  Float64?  Float64?     Float64?     String?        
+─────┼────────────────────────────────────────────────────────────
+   1 │      4   27.3444      747.719         27.0  efficient
+   2 │      6   19.7333      389.404         20.0  moderate
+```
+
+## What if we wanted to pivot the result?
+
+We cannot do this using TidierDB. However, we can call `@pivot_longer()` from TidierData *after* the result of the query has been instantiated as a DataFrame, like this: 
+
+```julia
+@chain DB.db_table(db, :mtcars) begin
+    DB.@filter(!starts_with(model, "M"))
+    DB.@group_by(cyl)
+    DB.@summarize(mpg = mean(mpg))
+    DB.@mutate(mpg_squared = mpg^2, 
+               mpg_rounded = round(mpg), 
+               mpg_efficiency = case_when(
+                                 mpg >= cyl^2 , "efficient",
+                                 mpg < 15.2 , "inefficient",
+                                 "moderate"))            
+    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
+    DB.@arrange(desc(mpg_rounded))
+    DB.@collect
+    @pivot_longer(everything(), names_to = "variable", values_to = "value")
+end
+```
+
+```
+10×2 DataFrame
+ Row │ variable        value     
+     │ String          Any       
+─────┼───────────────────────────
+   1 │ cyl             4
+   2 │ cyl             6
+   3 │ mpg             27.3444
+   4 │ mpg             19.7333
+   5 │ mpg_squared     747.719
+   6 │ mpg_squared     389.404
+   7 │ mpg_rounded     27.0
+   8 │ mpg_rounded     20.0
+   9 │ mpg_efficiency  efficient
+  10 │ mpg_efficiency  moderate
+```
+
+## What SQL query does TidierDB generate for a given piece of Julia code?
+
+We can replace `DB.@collect` with `DB.@show_query` to reveal the underlying SQL query being generated by TidierDB. To handle complex queries, TidierDB makes heavy use of Common Table Expressions (CTEs), which are a useful tool to organize long queries.
+
+```julia
+@chain DB.db_table(db, :mtcars) begin
+    DB.@filter(!starts_with(model, "M"))
+    DB.@group_by(cyl)
+    DB.@summarize(mpg = mean(mpg))
+    DB.@mutate(mpg_squared = mpg^2, 
+               mpg_rounded = round(mpg), 
+               mpg_efficiency = case_when(
+                                 mpg >= cyl^2 , "efficient",
+                                 mpg < 15.2 , "inefficient",
+                                 "moderate"))            
+    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
+    DB.@arrange(desc(mpg_rounded))
+    DB.@show_query
 end
 ```
+
 ```
 WITH cte_1 AS (
 SELECT *
-        FROM mtcars2
-        WHERE NOT (model LIKE 'M%')),
+        FROM mtcars
+        WHERE NOT (starts_with(model, 'M'))),
 cte_2 AS (
 SELECT cyl, AVG(mpg) AS mpg
         FROM cte_1
         GROUP BY cyl),
 cte_3 AS (
-SELECT  cyl, mpg, POWER(mpg, 2) AS sqaured, ROUND(mpg) AS rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 12 WHEN mpg < 15.2 THEN 14 ELSE 44 END AS efficiency
+SELECT  cyl, mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 'efficient' WHEN mpg < 15.2 THEN 'inefficient' ELSE 'moderate' END AS mpg_efficiency
         FROM cte_2 ),
 cte_4 AS (
 SELECT *
         FROM cte_3
-        WHERE efficiency > 12)  
+        WHERE mpg_efficiency in ('moderate', 'efficient'))  
 SELECT *
         FROM cte_4  
-        ORDER BY rounded ASC
-```
-Now instead of ending the chain with `@show_query`, we use `@collect` to pull the df into the local environment
-```
-2×5 DataFrame
- Row │ cyl    mpg      sqaured  rounded  efficiency 
-     │ Int64  Float64  Float64  Float64  Int64      
-─────┼──────────────────────────────────────────────
-   1 │     8  14.75    217.562     15.0          14
-   2 │     6  19.7333  389.404     20.0          44
+        ORDER BY mpg_rounded DESC
 ```
-`across` in `summarize`
-```
-@chain db_table(db, :mtcars2) begin
-    @group_by(cyl)
-    @summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
-    #@show_query
-    @collect
+
+## TidierDB is already quite fully-featured, supporting advanced TidierData functions like `across()` for multi-column selection.
+
+```julia
+@chain DB.db_table(db, :mtcars) begin
+    DB.@group_by(cyl)
+    DB.@summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
+    DB.@collect
 end
 ```
+
 ```
 3×5 DataFrame
- Row │ cyl    mean_am   mean_vs   sum_am  sum_vs 
-     │ Int64  Float64   Float64   Int64   Int64  
-─────┼───────────────────────────────────────────
-   1 │     4  0.727273  0.909091       8      10
-   2 │     6  0.428571  0.571429       3       4
-   3 │     8  0.142857  0.0            2       0
+ Row │ cyl     mean_am   mean_vs   sum_am   sum_vs  
+     │ Int64?  Float64?  Float64?  Int128?  Int128? 
+─────┼──────────────────────────────────────────────
+   1 │      4  0.727273  0.909091        8       10
+   2 │      6  0.428571  0.571429        3        4
+   3 │      8  0.142857  0.0             2        0
 ```
 
+Bang bang `!!` interpolation for columns and values is also supported.
+
+There are a few subtle but important differences from Tidier.jl outlined [here](https://tidierorg.github.io/TidierDB.jl/latest/examples/generated/UserGuide/key_differences/).
+
+## Missing a function or backend?
+
+You can use any existing SQL function within `@mutate` with the correct SQL syntax and it should just work.
 
-This links to [examples](https://github.com/drizk1/TidierDB.jl/blob/main/src/olympics_examples_fromweb.jl) which achieve the same result as the SQL queries.
+But if you run into problems please open an issue, and we will be happy to take a look!