--- title: "Introduction to DBmaps" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to DBmaps} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(DBmaps) library(data.table) ``` ## Introduction This vignette provides an introduction to the `DBmaps` package and its approach to simplifying the analysis of data from relational databases within R. A common and often repetitive challenge in data analysis involves aggregating detailed "fact" tables (like transactions or event logs) before they can be meaningfully joined with descriptive "dimension" tables (like customers or products). `DBmaps` is designed to streamline and automate this entire workflow. The core principle of `DBmaps` is a metadata-driven approach. Instead of writing complex and manual `data.table` join and aggregation code, the user first describes the analytical potential of their tables in a structured format called metadata. The package then uses this metadata to automatically discover join paths, create executable plans, and produce the final merged dataset. This makes the entire process more efficient, scalable, and less prone to error. ## A Complete Workflow Example This vignette walks through a complete, end-to-end example based on a typical e-commerce dataset. We will perform the following steps: 1. Load the four raw data tables. 2. Define the metadata for each table using `table_info()`. 3. Functions to initialize and populate a registry that stores all your metadata definitions `create_metadata_registry() & add_table()`. 4. Automatically discover all possible join paths with `map_join_paths()`. 5. Define a desired final dataset and generate an execution plan with `create_join_plan()`. 6. Visualize the plan with `plot_join_plan()`. 7. Execute the plan to generate the final data.table with `execute_join_plan()`. ### 1. The Example Data The example uses four CSV files that represent common tables in an e-commerce database. You can download them directly from the project's GitHub repository: * [**customers.csv**](https://github.com/akshat09867/Retail-data/blob/main/customers.csv) * [**products.csv**](https://github.com/akshat09867/Retail-data/blob/main/products.csv) * [**transactions.csv**](https://github.com/akshat09867/Retail-data/blob/main/transactions.csv) * [**views.csv**](https://github.com/akshat09867/Retail-data/blob/main/views.csv) First, let's load these tables into our R session. ```{r load_data, message=FALSE} transactions[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")] views[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")] cat("--- All 4 Raw Data Tables Loaded Successfully ---\n") cat("---Customers Data---\n") print(head(customers, 5)) cat("---products Data---\n") print(head(products, 5)) cat("---Transactions Data---\n") print(head(transactions, 5)) cat("---Views Data---\n") print(head(views, 5)) ``` ### 2. Defining Metadata with `table_info()` The `table_info()` function is the primary tool for creating metadata. It captures descriptive information about a table and its analytical potential. The key argument is `key_outcome_specs`, a list that defines how a table can be aggregated. A crucial rule in `DBmaps` is that every aggregation method must include one or more `GroupingVariables`, as this is what makes the resulting aggregated data mergeable. 
```{r define_metadata}
customers_info <- table_info(
  table_name = "customers",
  source_identifier = "customers.csv",
  identifier_columns = "customer_id",
  key_outcome_specs = list(
    list(
      OutcomeName = "CustomerCount",
      ValueExpression = 1,
      AggregationMethods = list(
        list(AggregatedName = "CountByRegion",
             AggregationFunction = "sum",
             GroupingVariables = "region")
      )
    )
  )
)

products_info <- table_info(
  table_name = "products",
  source_identifier = "products.csv",
  identifier_columns = "product_id",
  key_outcome_specs = list(
    list(
      OutcomeName = "ProductCount",
      ValueExpression = 1,
      AggregationMethods = list(
        list(AggregatedName = "ProductsPerCategory",
             AggregationFunction = "sum",
             GroupingVariables = "category")
      )
    )
  )
)

transactions_info <- table_info(
  table_name = "transactions",
  source_identifier = "transactions.csv",
  identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(
      OutcomeName = "Revenue",
      ValueExpression = quote(price * quantity),
      AggregationMethods = list(
        list(AggregatedName = "RevenueByCustomer",
             AggregationFunction = "sum",
             GroupingVariables = "customer_id"),
        list(AggregatedName = "RevenueByProduct",
             AggregationFunction = "sum",
             GroupingVariables = "product_id"),
        list(AggregatedName = "DailyRevenueByCustomerProduct",
             AggregationFunction = "sum",
             GroupingVariables = c("customer_id", "product_id", "time"))
      )
    ),
    list(
      OutcomeName = "UnitsSold",
      ValueExpression = quote(quantity),
      AggregationMethods = list(
        list(AggregatedName = "TotalUnitsByCustomer",
             AggregationFunction = "sum",
             GroupingVariables = "customer_id")
      )
    )
  )
)

views_info <- table_info(
  table_name = "views",
  source_identifier = "views.csv",
  identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(
      OutcomeName = "ViewCount",
      ValueExpression = 1,
      AggregationMethods = list(
        list(AggregatedName = "ViewsByProduct",
             AggregationFunction = "count",
             GroupingVariables = "product_id"),
        list(AggregatedName = "ViewsByCustomer",
             AggregationFunction = "count",
             GroupingVariables = "customer_id")
      )
    )
  )
)

cat("--- Metadata for transactions ---\n")
print(transactions_info)
```

### Understanding the Metadata Columns

The output of `table_info()` has several key columns that drive the entire `DBmaps` system. Here is a brief explanation of what each one does:

* **table_name:** The conceptual name of the table (e.g., "transactions").
* **identifier_columns:** A list-column containing the name(s) of the column(s) that form the primary key of the raw table. This is what other tables can join TO.
* **outcome_name:** A high-level description of the metric being calculated (e.g., "Revenue").
* **value_expression:** The R code (as a string) that defines how to calculate the raw value for the outcome from the source table's columns (e.g., "price * quantity").
* **aggregated_name:** The final name for the new, aggregated column in the summarized table (e.g., "RevenueByCustomer"). This is the name you will use in your selections when creating a plan.
* **aggregation_function:** The function to apply to the value_expression during grouping (e.g., "sum").
* **grouping_variable:** The most critical column for joins. It is a list-column containing the name(s) of the column(s) to group by. The values in this column define the "join key" that can be used to join this aggregated data FROM, which is why every aggregation method must include at least one `GroupingVariables` entry.
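As a quick sanity check, the metadata can be queried like any other table in R. A minimal sketch, assuming `table_info()` returns a `data.table` with the columns described above:

```{r query_metadata, eval=FALSE}
# List every aggregation defined for the transactions table,
# together with the grouping key each one can be joined on
transactions_info[, .(aggregated_name, aggregation_function, grouping_variable)]
```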
### 3. Managing Metadata with `create_metadata_registry()` and `add_table()`

To manage metadata for multiple tables, `DBmaps` provides a simple registry system. `create_metadata_registry()` initializes a registry object, and `add_table()` adds the metadata for each table to it. This avoids the need to manually combine `data.table` objects.

```{r creating-registry, echo = TRUE}
meta <- create_metadata_registry()
meta <- add_table(meta, customers_info)
meta <- add_table(meta, products_info)
meta <- add_table(meta, views_info)
meta <- add_table(meta, transactions_info)

print(meta)
```

### 4. Discovering Join Paths with `map_join_paths()`

In any data analysis project involving multiple tables, a primary task is to identify how the tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The `map_join_paths()` function automates this discovery process, making it faster and less error-prone. This function operates in two powerful modes:

1. **Metadata-Driven Discovery**: It can find join paths based solely on a metadata definition, matching tables where a defined `grouping_variable` identically matches another table's `identifier_columns` (primary key). This is extremely fast and useful for well-documented schemas.
2. **Data-Driven Discovery**: Optionally, by providing the actual data, the function can scan column values to find "inferred" joins where key names *do not* match. This is invaluable for exploring new or messy datasets.

### Combining Both Methods

The true power of the function comes from combining both methods, which provides a complete and reliable map of all possible connections in your data ecosystem.

```{r map_paths, message=TRUE}
# Create a named list of the actual data tables for the functions to use
all_tables <- list(
  customers = customers,
  products = products,
  transactions = transactions,
  views = views
)

# Generate the join map
paths <- map_join_paths(meta, all_tables)
print(paths)
```

The resulting "Join Map" shows every valid, directional, many-to-one join that can be performed after aggregation.

### 5. Creating an Execution Plan with `create_join_plan()`

Now we define our analytical goal. Let's say we want to create a product-level summary table that includes each product's category, its total revenue, and the total number of times it was viewed. The `create_join_plan()` function translates this high-level request into a concrete, step-by-step recipe of `data.table` code.

```{r create_plan}
# Define our desired output
selections <- list(
  products = c("product_id", "category"),  # Base columns from the products table
  transactions = "RevenueByProduct",       # The aggregated revenue by product
  views = "ViewsByProduct"                 # The aggregated view count by product
)

# Generate the plan
plan <- create_join_plan(
  base_table = "products",
  selections = selections,
  metadata_dt = meta,
  join_map = paths
)

print(plan)
```

The output of the planner is a `data.table` that acts as an executable recipe. Each row represents a step, and the final column, `code`, is the most powerful feature: it contains the precise, executable R code for each step of the data manipulation process. Let's break down how the plan achieves our goal:

**1. Steps 1 & 2 (AGGREGATE):** The planner analyzed our request for `RevenueByProduct` and `ViewsByProduct`. It correctly determined from the metadata that these require aggregating the `transactions` and `views` tables, respectively. It then generated the exact `data.table` code to create two new, summarized tables: `agg_transactions` and `agg_views`.

**2. Steps 3 & 4 (MERGE):** The planner then constructed a sequence of join operations. It first merges the aggregated transactions (`agg_transactions`) onto our `products` base table. Then, it merges the aggregated views (`agg_views`) onto the result of the first merge. This sequential chaining correctly combines all the necessary data.

**3. Step 5 (SELECT):** Finally, the planner generates the code to select only the columns we originally requested from the final, fully merged table.

This demonstrates the core value of the planner: it automates the tedious and error-prone task of determining the correct sequence of aggregations and joins, providing a concrete set of steps that join the data in our preferred manner. For intuition, a hand-written equivalent of these steps is sketched below.
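Here is roughly what those steps amount to if written by hand with plain `data.table`. This is a sketch for intuition only; the planner's actual generated code and intermediate object names may differ.

```{r manual_equivalent, eval=FALSE}
# AGGREGATE: summarize the fact tables at the product level
agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)),
                                 by = product_id]
agg_views <- views[, .(ViewsByProduct = .N), by = product_id]

# MERGE: chain the aggregates onto the products base table
merged <- merge(products, agg_transactions, by = "product_id", all.x = TRUE)
merged <- merge(merged, agg_views, by = "product_id", all.x = TRUE)

# SELECT: keep only the requested columns
final <- merged[, .(product_id, category, RevenueByProduct, ViewsByProduct)]
```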
Steps 3 & 4 (MERGE):** The planner then constructed a sequence of join operations. It first merges the aggregated transactions (agg_transactions) onto our products base table. Then, it merges the aggregated views (agg_views) onto the result of the first merge. This sequential chaining correctly combines all the necessary data. **3. Step 5 (SELECT):** Finally, the planner generates the code to select only the columns we originally requested from the final, fully merged table. This demonstrates the core value of the planner: it automates the tedious and error-prone task of determining the correct sequence of aggregations and joins, providing a concrete set of steps that will join the data in our preferred manner. **Handling Invalid Requests** A key feature of the planner is its ability to validate user requests. What happens if we ask for an aggregation that cannot logically be joined to our base table? Let's ask for `RevenueByProduct` (grouped by `product_id`) to be joined to the `customers` table (keyed by `customer_id`). This is not a valid join. ```{r setup_invalid} # Add product metadata for this example products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category"))))) transactions_meta_v2 <- table_info("transactions", "t.csv", "trans_id", list( list(OutcomeName="Revenue", ValueExpression=quote(price*qty), AggregationMethods=list( # This aggregation is by product_id, not customer_id list(AggregatedName="RevenueByProduct", AggregationFunction="sum", GroupingVariables="product_id") )) )) invalid_metadata <- create_metadata_registry() invalid_metadata <- add_table(invalid_metadata, products_meta) invalid_metadata <- add_table(invalid_metadata, transactions_meta_v2) # The invalid request invalid_selections <- list( customers = "customer_id", transactions = "RevenueByProduct" ) ``` Instead of producing a faulty plan or a cryptic error, `create_join_plan` stops with a clear, informative message. ```{r run_invalid, error=TRUE} create_join_plan( base_table = "customers", selections = invalid_selections, metadata_dt = invalid_metadata ) ``` The reason this is invalid is that the join key of the selected aggregation does not match the join key of the base table. 1. The **base_table** is `customers`, whose primary join key is `customer_id`. 2. The selection asks for the **RevenueByProduct** aggregation from the `transactions` table. 3. According to our metadata, the **RevenueByProduct** aggregation is grouped by (and therefore keyed on) `product_id`. 4. The planner function, `create_join_plan()`, correctly sees that there is no direct path to join a table keyed by `product_id` to a table keyed on `customer_id`. This strict validation ensures that only logical and correct data manipulation plans are generated, preventing common data analysis errors. ### 6. Visualizing the Plan To make the plan even clearer, we can visualize it as a flowchart using `plot_join_plan()`. ```{r plot_plan, fig.align='center', results='asis', fig.width=10} # This requires the DiagrammeR package if (requireNamespace("DiagrammeR", quietly = TRUE)) { # Generate the plot object visualize <- plot_join_plan(plan) visualize } else { cat("Install the 'DiagrammeR' package to visualize the join plan.") } ``` The graph shows the flow of data from the source tables (blue boxes) through intermediate aggregations (gray ellipses) to the final merged result and selection (yellow diamond). 
### 7. Executing the Plan

The final step is to execute the plan using `execute_join_plan()`. This function takes the plan and the list of source data tables, runs the generated code, and returns the final dataset.

```{r execute_plan, message=TRUE}
# The executor runs the plan in a clean environment
final_dt <- execute_join_plan(plan, all_tables)

# Show the first few rows of the final, merged data.table
print(head(final_dt))
```

## Conclusion

The `DBmaps` workflow successfully automated the entire process. By investing a small amount of time to define metadata, we were able to automatically discover relationships, generate a complex join plan, visualize it, and execute it with just a few high-level function calls. This demonstrates a powerful, scalable, and reproducible approach to data preparation and analysis in R.