---
title: "Data Guide Creation and Level-0 Data Compilation"
author: "US EPA's Center for Computational Toxicology and Exposure ccte@epa.gov"
output:
  rmarkdown::html_vignette
bibliography: '`r system.file("REFERENCES.bib", package="invitroTKstats")`'
vignette: >
  %\VignetteIndexEntry{data_guide_and_LO}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = 'center'
)
```

# Overview

The `invitroTKstats` R package is designed to work as a data processing pipeline 
for *in vitro* assays measuring various chemical-specific toxicokinetic (TK) 
parameters used in high-throughput TK (HTTK) modeling. The `invitroTKstats` 
pipeline is meant to take mass spectrometry (MS) data and estimate the 
chemical-specific TK parameters from the relevant *in vitro* assays. These 
include but are not limited to:

- Fraction Unbound in Plasma $f_{up}$, either via Rapid Equilibrium Dialysis (RED) or Ultracentrifugation (UC) assays
- Intrinsic Hepatic Clearance $Cl_{int}$
- Membrane Permeability $P_{app}$ via the Caco-2 assay

The MS data are experimental measurements and the possibility for variation in 
reporting exists between laboratories and technicians.
Preserving the data as reported by the original laboratory allows 
reproducibility, transparency, and auditing of any numbers calculated from 
these data by the `invitroTKstats` R package.
We refer to the MS data files generated by the laboratory as "level-0" files.

*An important distinction between `invitroTkstats` and other approaches for 
determining chemical-specific in vitro TK parameters is that here we 
systematically analyze large data sets containing measurements for many 
(potentially hundreds) of chemicals.*

Here, we discuss the initial steps of the data processing pipeline. 
First, we create the data guide (also known as a data catalog). 
We then use the data guide to compile the level-0 (raw) data into a single R
`data.frame`. *Note, these steps are meant to be consistent across the various 
assays for TK parameters that may be pipelined by this package, that is, they 
are not assay specific. **However, the pipeline is not set up to simultaneously 
process data from different assays (for example $f_{up}$ from RED and $f_{up}$ 
from UC).***

In this vignette, we are going to use data from the intrinsic hepatic clearance 
assays ($Cl_{int}$) collected as part of the @kreutz2023category PFAS 
manuscript for demonstration purposes.

## Suggested packages for use with this vignette

```{r setup,message=FALSE,warning=FALSE}
# Primary Package #
library(invitroTKstats)
# Data Formatting Packages #
library(dplyr)
library(magrittr)
library(stringr)
library(readxl)
# Table Formatting Package #
library(flextable)
```

# Raw Data File Exploration

The first thing we need to do for the pipelining process is to construct a data guide itemizing which data files and sheets therein contain relevant MS data, as well as other relevant meta-data information. Before creating a data guide with `invitroTKstats` functions, let us first consider the raw data files and their structure in the relevant data directory, called "inst/extdata/Kreutz-Clint".

*NOTE: The terms "data catalog" and "data guide" are used interchangeably throughout the vignette.*

For users to replicate this vignette, it is necessary to clone the [`invitroTKstats` package repository](https://github.com/USEPA/invitroTKstats) which contains the "inst/extdata/Kreutz-Clint" sub-directory. 

The "Kreutz-Clint" sub-directory contains a subset of raw data files with gas chromatography-tandem mass spectrometry (GC-MS/MS) analyses of a subset of PFAS alcohols, amides, and acrylates (@kreutz2023category). The entire set of raw data files can be found within the "working/KreutzPFAS" sub-directory tracked in the [`invitrotkstats_dataproc` repository](https://github.com/USEPA/invitrotkstats_dataproc)  (i.e. not within the `invitroTKstats` package).  

```{r raw_data_dir}
# the path to the applicable sub-directory after the `invitroTKstats` package repository is cloned and the R project is opened
raw_data_dir <- system.file("extdata/Kreutz-Clint", package = "invitroTKstats")
```

More than one experiment (that is, data collection run) is typically necessary to obtain all the relevant data for an analysis. Thus, there is likely to be more than one Excel file containing raw data that needs to be processed.

For this exercise, it is known that the Excel (.xlsx) files containing intrinsic hepatic clearance data are denoted with prefixes of "Hep" and "G<number>". Thus, we can programmatically search in the sub-directory for all the relevant Excel files we need to pull and compile raw data from.

```{r raw_data_files}
# identify the hepatic clearance raw datasets
hep_clint_xlsx <- list.files(raw_data_dir,
                             pattern = paste(
                               paste(c("^Hep","^G\\d"),"[.]xlsx$",sep = ".+"),
                               collapse = "|"))
# show the data file names
hep_clint_xlsx
# add the raw data file directory path
hep_clint_xlsx <- paste(raw_data_dir,hep_clint_xlsx, sep = "/")
```

In cases where one is unfamiliar with the raw data Excel files, it may be useful to do some programmatic evaluation of the file structure. For example, listing the sheets within the Excel file to determine where data exists and should be pulled from. It should be noted this will likely, or may not always, replace the need to visually inspect the file. To demonstrate this type of investigation consider the first hepatic clearance dataset identified above. 

```{r rdfile1_sheets}
# for the first dataset, see the sheets contained in the raw data file
readxl::excel_sheets(hep_clint_xlsx[1])
```

Here, it is already known that the "Data063021" data sheet contains MS data for parameter estimation. However, other data sheets (for example, "Cover Sheet") may also provide valuable information about the original experiment, chemicals evaluated, meta data, etc. Clear communication between the lab staff and data analyst about the structure of the data files and general expectations is highly encouraged to enable accurate, efficient, and transparent data pipelining.

*It should be noted, other exploratory steps may be necessary, but we leave this up to the reader to determine the necessary steps for raw data exploration given their data and use case.*

# Data Catalog

Now that the Excel files and sheets containing the raw data are identified and we have a general idea of how to programatically explore the data, we need to create the data guide. The data guide creation step, as previously mentioned, can be thought of as logging all the relevant raw data files for a given toxicokinetic parameter we wish to pipeline (that is, data inventory). This is a crucial part of the pipeline in that it allows us to identify where all of the raw datafiles are coming from and where we are pulling raw (level-0) data from within those datasets. Thus, setting up a standard and transparent level-0 data compilation process. This also provides key meta-data information that can be used for tracking/evaluating data provenance.

Though the data guide (which is ultimately a `data.frame` format) may be created manually through an Excel file, the `invitroTKstats` package includes a function called `create_catalog` that automatically generates and documents the data guide creation. Furthermore, this function has built in checks to ensure all the necessary information is provided and in the correct format expected by the pipeline. This allows for maximum reproducibility, transparency, and efficiency in compiling the data guide. Thus, it is highly encouraged and considered best practice to utilize the `create_catalog` function for this step.

Typically, one will want to include all the datasets "delivered" to the data analyst. As mentioned before, the data catalog logs where the raw MS data is for the respective files and allows us to efficiently pull level-0 data (raw data) from those disparate files and compile them into a comprehensive level-0 dataset. In this sub-section, however, we are going to demonstrate how to put together the data catalog for a single Excel file.

Here, we are going to use the first Excel file (and only the "Data063021" sheet) to demonstrate putting together the data catalog programmatically using the `create_catalog` function within `invitroTKstats`. It should be noted, the information provided in `create_catalog` is obtained by visual inspection of the file of interest and identifying the corresponding input values for each of the arguments.

```{r data_catalog}
DC_kreutz.pfas <- create_catalog(
  # filename (no file path)
  file = c(rep("Hep_745_949_959_082421_final.xlsx",3)), 
  # sheet name (or sheet number)
  sheet = c(rep("Data063021",3)), 
  # number of rows to skip in L0 Excel file - start for compound/analyte samples
  skip.rows = c(44,74,92), 
  # number of rows to read in from L0 Excel file for compound/analyte samples
  num.rows = c(30,18,18), 
  # date the data was generated
  #   (MMDDYY: 2-digit month, 2-digit day, 2-digit year)
  date = "063021", 
  # chemical id
  compound = c("745","949","959"),
  # internal standard compound (corresponding to chemical id)
  istd = c("MFBET","MFOET","MFHET"), 
  # column name for sample names
  sample = "Name", 
  # column name for sample types
  type = "Type",
  # column name(s) for analyte MS peak areas
  peak = c("Area...13","Area...27","Area...20"), 
  # column name(s) for internal standard MS peak areas
  istd.peak = c("Resp....16","Resp....30","Resp....23"), 
  # column name(s) for experimental concentration
  conc = c("Final Conc....11","Final Conc....25","Final Conc....18"),
  # column name(s) with analysis parameters
  analysis.param = c("RT...12", "RT...26", "RT...19"),
  # column name - row locations
  col.names.loc = 2
  # note = "RT...12"
)
```

```{r echo=FALSE}
# show the data catalog
DC_kreutz.pfas %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(
        
      )
    )
  ) %>% 
  set_caption(caption = "Table 1: Data Catalog for the Kruetz et al. (2023) PFAS Cl~int~ Experiment.", 
              align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "all") %>% 
  theme_vanilla()
```

In cases where a data analyst is working with a set of raw data Excel files with a semi-standardized form (often lab-specific) it may be advantageous to have a wrapper function for streamlining the data guide creation process, such that the function automatically extracts the necessary meta-data to provide for populating the data catalog. This exercise is left up to the reader.

# Chemical ID Map

Once the data catalog is compiled, the only remaining experimental meta-data to collect prior to compiling the comprehensive level-0 data files is the chemical identification mapping table (that is chemical ID map). The chemical ID map provides the pipeline with the compound name used by the lab, EPA DSSTox identifier (see [Distributed Structure-Searchable Toxicity (DSSTox) Database](https://www.epa.gov/comptox-tools/distributed-structure-searchable-toxicity-dsstox-database)), and the common compound name. This typically will be provided by the lab within the Excel files with raw data and/or in a separate "meta-data" file, and this may be lab dependent given the potential for differing lab protocols.

For our purposes, the "Cover Sheet" contains a table with the chemical ID mapping along with some physico-chemical properties, the corresponding reference, and the corresponding internal standard. 

```{r assay_cover_sheet_metadata}
# obtain the chemical identification mapping information from the MS-data
#   cover sheet (that is raw data summary information)
assay_cover_sheet <-
  readxl::read_xlsx(
    paste(raw_data_dir,"Hep_745_949_959_082421_final.xlsx",sep = "/"),
    sheet = "Cover Sheet",skip = 35,n_max = 4) %>% 
  as.data.frame()
```

```{r echo=FALSE}
# show the assay cover sheet
assay_cover_sheet %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(
        
      )
    )
  ) %>% 
  set_caption(caption = "Table 2: Chemical Mapping Information from the Raw Excel File.", 
              align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "all") %>% 
  theme_vanilla()
```

The `invitroTKstats` package has a function that can take a `data.frame`, such as `assay_cover_sheet`, and generate the required chemical mapping table necessary for the `merge_level0` function. This function is called `create_chem_table`. Though the chemical mapping table can be manually constructed it is highly encouraged analysts using `invitroTKstats` leverage the utility of this function.

```{r chem_id_table}
# create a chemical table necessary for the L0 compilation function, using the
#   assay cover sheet chemical identification mapping information
chem.ids <- create_chem_table(
  # input table (data.frame class) with information
  input.table = assay_cover_sheet,
  # column name with DSSTox chemical ID's
  dtxsid.col = "Analyte",
  # column name with formal compound names
  compound.col = "Name",
  # column name with lab chemical ID's
  lab.compound.col = "Sample ID"
)
```

```{r echo=FALSE}
# show the chemical ID mapping data
chem.ids %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(
        
      )
    )
  ) %>% 
  set_caption(caption = "Table 3: Chemical ID Mapping Information for the `merge_level0` Function.", 
              align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "all") %>% 
  theme_vanilla()
```

# Level-0 (L0) Data Compilation

Once the data guide and chemical ID map `data.frame`s are compiled we can use the `merge_level0` function to pull in all of the 'raw' level-0 data and compile it into a single `data.frame`.

*It should be noted here we are <u> not </u> exporting the data catalog or level-0 data file, as denoted by the arguments `catalog.out` and `output.res` (respectively) both being set to `FALSE`. Furthermore, since we are not exporting either of these `data.frame`s the `FILENAME` argument is not required and thereby not provided.*

Any of the column names from the raw data not matching the anticipated column name convention are specified to their corresponding argument. For more details on using the `merge_level0` function we direct the user to the help file, which can be accessed via `help("merge_level0")` or `?merge_level0` in the R console.

```{r level0_comp,message=FALSE}
# compile the l0 files
kreutz.pfas_L0 <- merge_level0(
  level0.catalog = DC_kreutz.pfas, # data catalog
  INPUT.DIR = raw_data_dir, # the path to your raw data files
  num.rows.col = "Number.Data.Rows", 
  istd.col = "ISTD.Name",
  type.colname.col = "Type.ColName",
  chem.ids = chem.ids, # chemical ID mapping data
  chem.lab.id.col = "Lab.Compound.Name",
  chem.name.col = "Compound.Name",
  catalog.out = FALSE, # do not export the data catalog during this function
  output.res = FALSE # do not export the compiled L0 during this function
)
# show the dimension of the the Kruetz PFAS data
dim(kreutz.pfas_L0)
```

```{r echo=FALSE}
# show the resulting merge level 0 output data
head(kreutz.pfas_L0) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(
        
      )
    )
  ) %>% 
  set_caption(caption = "Table 4: Level-0 data resulting from the `merge_level0` Function.", 
              align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "all") %>% 
  theme_vanilla()
```

The level-0 data file is now ready to undergo the assay-specific processing steps, which can be found in the assay-specific vignettes:

- "Clint_vignette"
- "FupRED_vignette"
- "FupUC_vignette"
- "Caco2_vignette"

# Best Practices for Data Processing

- The data catalog can be manually constructed as done in this markdown. This is the minimum expectation. However, the best practice would be to create a wrapper function for your particular MS data to leverage the `create_catalog` function, but to auto-generate the input for this function programmatically (for example a lab-specific processing function).
- Users may manually construct data catalogs in Excel files, this is not recommended but may occur in cases of working with legacy data (that is data processed prior to 2025). In this scenario, it is recommended to use the `check_catalog` to ensure the data catalog meets the minimum meta-data reporting requirements to use `merge_level0`.
- In an ideal scenario, the lab will deliver the relevant data files for processing with the following guidelines:
  - All data sets to be processed together are given to the analyst at one time and stored in a common directory location.
  - A standardized filenaming convention is used to convey the assay type (that is toxicokinetic parameter), date of generation, and other relevant distinguishing information.
  - Excel files have a standardized structure enabling more programmatic and/or efficient processing by the analyst.
  - The raw data sheet(s) are direct computerized read-outs from the mass-spectrometer.
  - Notification/relaying of any necessary orientation or additional notes the data analyst may need to be aware of.
- In an ideal scenario, the analyst organized the data files and data processing code as such:
  - A single directory storing all the relevant raw data files, processing scripts, and results together.
  - All the raw data in their own sub-directory.
  - All data scripts in their own sub-directory.
  - Create a sub-directory to export the results output to.
- Raw data files are ideally constructed such that at least one sheet contains the raw read-out directly from the mass-spectrometer and this is the data that should be used in the analysis.
- Data analysts should generally take the approach of little "data touching", including but not limited to:
  - Evaluation of Excel files should be done programmatically (as far as is possible). ***Note**: In cases where this is not feasible, no manual changes to the Excel files should be made.*
  - Manipulation/editing of raw Excel files should be done programmatically and documented in a code script (for example `.R` script) and accompanied by the code script execution output (for example `.Rout` file).
  - Raw Excel files manipulated/edited via code should be saved as a separate files wiht a "tag" indicating the file has been edited from the originally delivered data, and any other relevant "tag" information (for example date and time of "new data" export).
  
# References

\insertRef{kreutz2023category}{invitroTKstats}