Title: | Tidy Verbs for Fast Data Manipulation |
Version: | 1.8.2 |
Depends: | R (≥ 4.1.0) |
Description: | A toolkit of tidy data manipulation verbs with 'data.table' as the backend. Combining the merits of syntax elegance from 'dplyr' and computing performance from 'data.table', 'tidyfst' intends to provide users with state-of-the-art data manipulation tools with least pain. This package is an extension of 'data.table'. While enjoying a tidy syntax, it also wraps combinations of efficient functions to facilitate frequently-used data operations. |
URL: | https://github.com/hope-data-science/tidyfst, https://hope-data-science.github.io/tidyfst/ |
BugReports: | https://github.com/hope-data-science/tidyfst/issues |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | data.table (≥ 1.15.0), fst (≥ 0.9.0), stringr (≥ 1.5.0) |
Suggests: | knitr, rmarkdown, nycflights13, pryr, tidyr, ggplot2, dplyr, bench, testthat, tidytext |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-05-07 04:55:13 UTC; Admin |
Author: | Tian-Yuan Huang |
Maintainer: | Tian-Yuan Huang <huang.tian-yuan@qq.com> |
Repository: | CRAN |
Date/Publication: | 2025-05-07 05:20:02 UTC |
Arrange entries in data.frame
Description
Order the rows of a data frame rows by the values of selected columns.
Usage
arrange_dt(.data, ...)
Arguments
.data |
data.frame |
... |
Arrange by what group? Minus symbol means arrange by descending order. |
Value
data.table
See Also
Examples
iris %>% arrange_dt(Sepal.Length)
# minus for decreasing order
iris %>% arrange_dt(-Sepal.Length)
# arrange by multiple variables
iris %>% arrange_dt(Sepal.Length,Petal.Length)
Save a data.frame as a fst table
Description
This function first export the data.frame to a temporal file, and then parse it back as a fst table (class name is "fst_table").
Usage
as_fst(.data)
Arguments
.data |
A data.frame |
Value
An object of class fst_table
Examples
## Not run:
iris %>%
as_fst() -> iris_fst
iris_fst
## End(Not run)
Bind multiple data frames by row
Description
Bind any number of data frames by row, making a longer result. Similar to 'dplyr::bind_rows', however, columns with same names but different data types would be coerced to a single proper data type.
Usage
bind_rows_dt(...)
Arguments
... |
Data frames to combine. Each argument can either be a data frame, a list that could be a data frame, or a list of data frames. Columns are matched by name, and any missing columns will be filled with 'NA'. |
Value
data.table
See Also
Examples
bind_rows_dt(iris[1:3,],iris[6:8,])
# data frames with same name but different type
# numeric data would be coerced to character data in this case
df1 <- data.frame(x = 1:2, y = letters[1:2])
df2 <- data.frame(x = 4:5, y = 1:2)
bind_rows_dt(df1, df2)
Compute TF–IDF Using data.table with Optional Counting and Grouping
Description
This function computes term frequency–inverse document frequency (tf–idf) on a dataset with one row per term occurrence (or pre-counted). It preserves original column names and returns new columns: - 'n': raw count (computed or user-supplied) - 'tf': term frequency per document - 'idf': inverse document frequency per group (or corpus) - 'tf_idf': tf × idf If 'group_col' is 'NULL', all documents are treated as a single group.
Usage
bind_tf_idf_dt(.data, group_col = NULL, doc_col, term_col, n_col = NULL)
Arguments
.data |
A data.frame or data.table of text data. |
group_col |
Character name of grouping column, or 'NULL' for no grouping. |
doc_col |
Character name of document identifier column. |
term_col |
Character name of term/word column. |
n_col |
(Optional) Character name of pre-counted term-frequency column. If 'NULL' (default), counts are computed via '.N'. |
Value
A data.table containing: - Original grouping, document, and term columns - 'n', 'tf', 'idf', and 'tf_idf'
See Also
Examples
# With groups
df <- data.frame(
category = rep(c("A","B"), each = 6),
doc_id = rep(c("d1","d2","d3"), times = 4),
word = c("apple","banana","apple","banana","cherry","apple",
"dog","cat","dog","mouse","cat","dog"),
stringsAsFactors = FALSE
)
result <- bind_tf_idf_dt(df, "category", "doc_id", "word")
result
# Without groups
df %>%
filter_dt(category == "A") %>%
bind_tf_idf_dt(doc_col = "doc_id",term_col = "word")
# With counts provided
df %>%
filter_dt(category == "A") %>%
count_dt() %>%
bind_tf_idf_dt(doc_col = "doc_id",term_col = "word",n_col = "n")
df %>%
count_dt() %>%
bind_tf_idf_dt(group_col = "category",
doc_col = "doc_id",
term_col = "word",n_col = "n")
Get the column name of the max/min number each row
Description
For a data.frame with numeric values, add a new column specifying the column name of the first max/min value each row.
Usage
col_max(.data, ..., .name = "max_col")
col_min(.data, ..., .name = "min_col")
Arguments
.data |
A data.frame with numeric column(s) |
... |
Variables for screening, could receive what 'select_dt' receives. When starts with '-'(minus symbol) or '!', return the negative columns. |
.name |
The column name of the new added column |
Value
A data.table
References
https://stackoverflow.com/questions/17735859/for-each-row-return-the-column-name-of-the-largest-value
Examples
set.seed(199057)
DT <- data.table(matrix(sample(10, 100, TRUE), ncol=10))
DT
col_max(DT)
col_max(DT,V1:V3)
col_max(DT,.name = "max_col_name")
col_min(DT)
col_min(DT,2:4)
col_max(iris)
Complete a data frame with missing combinations of data
Description
Turns implicit missing values into explicit missing values. All the combinations of column values (should be unique) will be constructed. Other columns will be filled with NAs or constant value.
Usage
complete_dt(.data, ..., fill = NA)
Arguments
.data |
data.frame |
... |
Specification of columns to expand.The selection of columns is
supported by the flexible |
fill |
Atomic value to fill into the missing cell, default uses |
Details
When the provided columns with addtion data are of different length, all the unique combinations would be returned. This operation should be used only on unique entries, and it will always returned the unique entries.
If you supply fill parameter, these values will also replace existing explicit missing values in the data set.
Value
data.table
See Also
Examples
df <- data.table(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
df %>% complete_dt(item_id,item_name)
df %>% complete_dt(item_id,item_name,fill = 0)
df %>% complete_dt("item")
df %>% complete_dt(item_id=1:3)
df %>% complete_dt(item_id=1:3,group=1:2)
df %>% complete_dt(item_id=1:3,group=1:3,item_name=c("a","b","c"))
Count observations by group
Description
Count the unique values of one or more variables.
Usage
count_dt(.data, ..., sort = TRUE, .name = "n")
add_count_dt(.data, ..., .name = "n")
Arguments
.data |
data.table/data.frame data.frame will be automatically converted to data.table. |
... |
Variables to group by, could receive what 'select_dt' receives. |
sort |
logical. If TRUE result will be sorted in desending order by resulting variable. |
.name |
character. Name of resulting variable. Default uses "n". |
Value
data.table
See Also
Examples
iris %>% count_dt(Species)
iris %>% count_dt(Species,.name = "count")
iris %>% add_count_dt(Species)
iris %>% add_count_dt(Species,.name = "N")
mtcars %>% count_dt(cyl,vs)
mtcars %>% count_dt("cyl|vs")
mtcars %>% count_dt(cyl,vs,.name = "N",sort = FALSE)
mtcars %>% add_count_dt(cyl,vs)
mtcars %>% add_count_dt("cyl|vs")
Cumulative mean
Description
Returns a vector whose elements are the cumulative mean of the elements of the argument.
Usage
cummean(x)
Arguments
x |
a numeric or complex object, or an object that can be coerced to one of these. |
Examples
cummean(1:10)
Select distinct/unique rows in data.frame
Description
Select only unique/distinct rows from a data frame.
Usage
distinct_dt(.data, ..., .keep_all = FALSE, fromLast = FALSE)
Arguments
.data |
data.frame |
... |
Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables. |
.keep_all |
If |
fromLast |
Logical indicating if duplication should be
considered from the reverse side. Defaults to |
Value
data.table
See Also
Examples
iris %>% distinct_dt()
iris %>% distinct_dt(Species)
iris %>% distinct_dt(Species,.keep_all = TRUE)
mtcars %>% distinct_dt(cyl,vs)
mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE)
mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE,fromLast = TRUE)
Dump, replace and fill missing values in data.frame
Description
A set of tools to deal with missing values in data.frames. It can dump, replace, fill (with next or previous observation) or delete entries according to their missing values.
Usage
drop_na_dt(.data, ...)
replace_na_dt(.data, ..., to)
delete_na_cols(.data, prop = NULL, n = NULL)
delete_na_rows(.data, prop = NULL, n = NULL)
fill_na_dt(.data, ..., direction = "down")
shift_fill(x, direction = "down")
Arguments
.data |
data.frame |
... |
Colunms to be replaced or filled. If not specified, use all columns. |
to |
What value should NA replace by? |
prop |
If proportion of NAs is larger than or equal to "prop", would be deleted. |
n |
If number of NAs is larger than or equal to "n", would be deleted. |
direction |
Direction in which to fill missing values. Currently either "down" (the default) or "up". |
x |
A vector with missing values to be filled. |
Details
drop_na_dt
drops the entries with NAs in specific columns.
fill_na_dt
fill NAs with observations ahead ("down") or below ("up"),
which is also known as last observation carried forward (LOCF) and
next observation carried backward(NOCB).
delete_na_cols
could drop the columns with NA proportion larger
than or equal to "prop" or NA number larger than or equal to "n",
delete_na_rows
works alike but deals with rows.
shift_fill
could fill a vector with missing values.
Value
data.table
References
https://stackoverflow.com/questions/23597140/how-to-find-the-percentage-of-nas-in-a-data-frame
https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
See Also
Examples
df <- data.table(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% drop_na_dt()
df %>% drop_na_dt(x)
df %>% drop_na_dt(y)
df %>% drop_na_dt(x,y)
df %>% replace_na_dt(to = 0)
df %>% replace_na_dt(x,to = 0)
df %>% replace_na_dt(y,to = 0)
df %>% replace_na_dt(x,y,to = 0)
df %>% fill_na_dt(x)
df %>% fill_na_dt() # not specified, fill all columns
df %>% fill_na_dt(y,direction = "up")
x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x
x %>% delete_na_cols()
x %>% delete_na_cols(prop = 0.75)
x %>% delete_na_cols(prop = 0.5)
x %>% delete_na_cols(prop = 0.24)
x %>% delete_na_cols(n = 2)
x %>% delete_na_rows(prop = 0.6)
x %>% delete_na_rows(n = 2)
# shift_fill
y = c("a",NA,"b",NA,"c")
shift_fill(y) # equals to
shift_fill(y,"down")
shift_fill(y,"up")
Fast creation of dummy variables
Description
Quickly create dummy (binary) columns from character and factor type columns in the inputted data (and numeric columns if specified.) This function is useful for statistical analysis when you want binary columns rather than character columns.
Usage
dummy_dt(.data, ..., longname = TRUE)
Arguments
.data |
data.frame |
... |
Columns you want to create dummy variables from. Very flexible, find in the examples. |
longname |
logical. Should the output column labeled with the
original column name? Default uses |
Details
If no columns provided, will return the original data frame. When NA exist in the input column, they would also be considered. If the input character column contains both NA and string "NA", they would be merged.
This function is inspired by fastDummies package, but provides
simple and precise usage, whereas fastDummies::dummy_cols
provides more
features for statistical usage.
Value
data.table
References
https://stackoverflow.com/questions/18881073/creating-dummy-variables-in-r-data-table
See Also
Examples
iris %>% dummy_dt(Species)
iris %>% dummy_dt(Species,longname = FALSE)
mtcars %>% head() %>% dummy_dt(vs,am)
mtcars %>% head() %>% dummy_dt("cyl|gear")
# when there are NAs in the column
df <- data.table(x = c("a", "b", NA, NA),y = 1:4)
df %>%
dummy_dt(x)
# when NA and "NA" both exist, they would be merged
df <- data.table(x = c("a", "b", NA, "NA"),y = 1:4)
df %>%
dummy_dt(x)
Read and write fst files
Description
Wrapper for read_fst
and write_fst
from fst, but use a different default. For data import, always return a data.table.
For data export, always compress the data to the smallest size.
Usage
export_fst(x, path, compress = 100, uniform_encoding = TRUE)
import_fst(
path,
columns = NULL,
from = 1,
to = NULL,
as.data.table = TRUE,
old_format = FALSE
)
Arguments
x |
a data frame to write to disk |
path |
path to fst file |
compress |
value in the range 0 to 100, indicating the amount of compression to use. Lower values mean larger file sizes. The default compression is set to 50. |
uniform_encoding |
If 'TRUE', all character vectors will be assumed to have elements with equal encoding. The encoding (latin1, UTF8 or native) of the first non-NA element will used as encoding for the whole column. This will be a correct assumption for most use cases. If 'uniform.encoding' is set to 'FALSE', no such assumption will be made and all elements will be converted to the same encoding. The latter is a relatively expensive operation and will reduce write performance for character columns. |
columns |
Column names to read. The default is to read all columns. |
from |
Read data starting from this row number. |
to |
Read data up until this row number. The default is to read to the last row of the stored dataset. |
as.data.table |
If TRUE, the result will be returned as a |
old_format |
must be FALSE, the old fst file format is deprecated and can only be read and converted with fst package versions 0.8.0 to 0.8.10. |
Value
'import_fst' returns a data.table with the selected columns and rows. 'export_fst' writes 'x' to a 'fst' file and invisibly returns 'x' (so you can use this function in a pipeline).
See Also
Examples
## Not run:
export_fst(iris,"iris_fst_test.fst")
iris_dt = import_fst("iris_fst_test.fst")
iris_dt
unlink("iris_fst_test.fst")
## End(Not run)
Filter entries in data.frame
Description
Choose rows where conditions are true.
Usage
filter_dt(.data, ...)
Arguments
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications functions. |
Value
data.table
See Also
Examples
iris %>% filter_dt(Sepal.Length > 7)
iris %>% filter_dt(Sepal.Length == max(Sepal.Length))
# comma is not supported in tidyfst after v0.9.8
# which means you can't use:
## Not run:
iris %>% filter_dt(Sepal.Length > 7, Sepal.Width > 3)
## End(Not run)
# use following code instead
iris %>% filter_dt(Sepal.Length > 7 & Sepal.Width > 3)
Parse,inspect and extract data.table from fst file
Description
A tookit of APIs for reading fst file as data.table, could select by column, row and conditional filtering.
Usage
parse_fst(path)
slice_fst(ft, row_no)
select_fst(ft, ...)
filter_fst(ft, ...)
summary_fst(ft)
Arguments
path |
path to fst file |
ft |
An object of class fst_table, returned by |
row_no |
An integer vector (Positive) |
... |
The filter conditions |
Details
summary_fst
could provide some basic information about
the fst table.
Value
parse_fst
returns a fst_table class.
select_fst
and filter_fst
returns a data.table.
See Also
Examples
## Not run:
fst::write_fst(iris,"iris_test.fst")
# parse the file but not reading it
parse_fst("iris_test.fst") -> ft
ft
class(ft)
lapply(ft,class)
names(ft)
dim(ft)
summary_fst(ft)
# get the data by query
ft %>% slice_fst(1:3)
ft %>% slice_fst(c(1,3))
ft %>% select_fst(Sepal.Length)
ft %>% select_fst(Sepal.Length,Sepal.Width)
ft %>% select_fst("Sepal.Length")
ft %>% select_fst(1:3)
ft %>% select_fst(1,3)
ft %>% select_fst("Se")
ft %>% select_fst("nothing")
ft %>% select_fst("Se|Sp")
ft %>% select_fst(cols = names(iris)[2:3])
ft %>% filter_fst(Sepal.Width > 3)
ft %>% filter_fst(Sepal.Length > 6 , Species == "virginica")
ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3)
unlink("iris_test.fst")
## End(Not run)
Group by variable(s) and implement operations
Description
Carry out data manipulation within specified groups. Different from group_dt
,
the implementation is split into two operations, namely grouping and implementation.
Using setkey
and setkeyv
in data.table
to carry out group_by
-like functionalities in dplyr. This is
not only convenient but also efficient in computation.
Usage
group_by_dt(.data, ..., cols = NULL)
group_exe_dt(.data, ...)
Arguments
.data |
A data frame |
... |
Variables to group by for |
cols |
A character vector of column names to group by. |
Details
group_by_dt
and group_exe_dt
are a pair of functions
to be used in combination. It utilizes the feature of key setting in data.table,
which provides high performance for group operations, especially when you have
to operate by specific groups frequently.
Value
A data.table with keys
Examples
# aggregation after grouping using group_exe_dt
as.data.table(iris) -> a
a %>%
group_by_dt(Species) %>%
group_exe_dt(head(1))
a %>%
group_by_dt(Species) %>%
group_exe_dt(
head(3) %>%
summarise_dt(sum = sum(Sepal.Length))
)
mtcars %>%
group_by_dt("cyl|am") %>%
group_exe_dt(
summarise_dt(mpg_sum = sum(mpg))
)
# equals to
mtcars %>%
group_by_dt(cols = c("cyl","am")) %>%
group_exe_dt(
summarise_dt(mpg_sum = sum(mpg))
)
Data manipulation within groups
Description
Carry out data manipulation within specified groups.
Usage
group_dt(.data, by = NULL, ...)
rowwise_dt(.data, ...)
Arguments
.data |
A data.frame |
by |
Variables to group by,unquoted name of grouping variable of list of unquoted names of grouping variables. |
... |
Any data manipulation arguments that could be implemented on a data.frame. |
Details
If you want to use summarise_dt
and mutate_dt
in
group_dt
, it is better to use the "by" parameter in those functions,
that would be much faster because you don't have to use .SD
(which takes
extra time to copy).
Value
data.table
References
https://stackoverflow.com/questions/36802385/use-by-each-row-for-data-table
Examples
iris %>% group_dt(by = Species,slice_dt(1:2))
iris %>% group_dt(Species,filter_dt(Sepal.Length == max(Sepal.Length)))
iris %>% group_dt(Species,summarise_dt(new = max(Sepal.Length)))
# you can pipe in the `group_dt`
iris %>% group_dt(Species,
mutate_dt(max= max(Sepal.Length)) %>%
summarise_dt(sum=sum(Sepal.Length)))
# for users familiar with data.table, you can work on .SD directly
# following codes get the first and last row from each group
iris %>%
group_dt(
by = Species,
rbind(.SD[1],.SD[.N])
)
#' # for summarise_dt, you can use "by" to calculate within the group
mtcars %>%
summarise_dt(
disp = mean(disp),
hp = mean(hp),
by = cyl
)
# but you could also, of course, use group_dt
mtcars %>%
group_dt(by =.(vs,am),
summarise_dt(avg = mean(mpg)))
# and list of variables could also be used
mtcars %>%
group_dt(by =list(vs,am),
summarise_dt(avg = mean(mpg)))
# examples for `rowwise_dt`
df <- data.table(x = 1:2, y = 3:4, z = 4:5)
df %>% mutate_dt(m = mean(c(x, y, z)))
df %>% rowwise_dt(
mutate_dt(m = mean(c(x, y, z)))
)
Read a fst file by chunks
Description
For 'import_fst_chunked', if a large fst file which could not be imported into the memory all at once, this function could read the fst file by chunks and preprocessed the chunk to ensure the results yielded by the chunks are small enough to be summarised in the end. For 'get_fst_chunk_size', this function can measure the memory used by a specified row number.
Usage
import_fst_chunked(
path,
chunk_size = 10000L,
chunk_f = identity,
combine_f = rbindlist
)
get_fst_chunk_size(path, nrows)
Arguments
path |
Path to fst file |
chunk_size |
Integer. The number of rows to include in each chunk |
chunk_f |
A function implemented on every chunk. |
combine_f |
A function to aggregate all the elements from the list of results from chunks. |
nrows |
Number of rows to test. |
Value
For 'import_fst_chunked', default to the whole data.frame in data.table. Could be adjusted to any type. For 'get_fst_chunk_size', return the file size.
See Also
Examples
## Not run:
# Generate some random data frame with 10 million rows and various column types
nr_of_rows <- 1e7
df <- data.frame(
Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)
# Write the file to disk
fst_file <- tempfile(fileext = ".fst")
write_fst(df, fst_file)
# Get the size of 10000 rows
get_fst_chunk_size(fst_file,1e4)
# File all rows that Integer == 7 by chunks
import_fst_chunked(fst_file,chunk_f = \(x) x[Integer==7])
## End(Not run)
Impute missing values with mean, median or mode
Description
Impute the columns of data.frame with its mean, median or mode.
Usage
impute_dt(.data, ..., .func = "mode")
Arguments
.data |
A data.frame |
... |
Columns to select |
.func |
Character, "mode" (default), "mean" or "median". Could also define it by oneself. |
Value
A data.table
Examples
Pclass <- c(3, 1, 3, 1, 3, 2, 2, 3, NA, NA)
Sex <- c('male', 'male', 'female', 'female', 'female',
'female', NA, 'male', 'female', NA)
Age <- c(22, 38, 26, 35, NA,
45, 25, 39, 28, 40)
SibSp <- c(0, 1, 3, 1, 2, 3, 2, 2, NA, 0)
Fare <- c(7.25, 71.3, 7.92, NA, 8.05, 8.46, 51.9, 60, 32, 15)
Embarked <- c('S', NA, 'S', 'Q', 'Q', 'S', 'C', 'S', 'C', 'S')
data <- data.frame('Pclass' = Pclass,
'Sex' = Sex, 'Age' = Age, 'SibSp' = SibSp,
'Fare' = Fare, 'Embarked' = Embarked)
data
data %>% impute_dt() # defalut uses "mode" as `.func`
data %>% impute_dt(is.numeric,.func = "mean")
data %>% impute_dt(is.numeric,.func = "median")
my_fun = function(x){
x[is.na(x)] = (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))/2
x
}
data %>% impute_dt(is.numeric,.func = my_fun)
Short cut to data.table
Description
To use facilities provided by data.table, but do not have to load data.table package.
Usage
in_dt(.data, ...)
as_dt(.data)
Arguments
.data |
A data.frame |
... |
Recieve |
Details
The as_dt
could turn any data frame to data.table class. If the data is
not a data frame, return error.
The in_dt
function creates a virtual environment in data.table, it could be
piped well because it still follows the principals of tidyfst, which are: (1) Never
use in place replacement and (2) Always recieves a data frame (data.frame/tibble/data.table)
and returns a data.table. Therefore, the in place functions like :=
will still
return the results.
See Also
Examples
iris %>% as_dt()
iris %>% in_dt(order(-Sepal.Length),.SD[.N],by=Species)
Set operations for data frames
Description
Wrappers of set operations in data.table. Only difference is it could be applied to non-data.table data frames by recognizing and coercing them to data.table automatically.
Usage
intersect_dt(x, y, all = FALSE)
union_dt(x, y, all = FALSE)
setdiff_dt(x, y, all = FALSE)
setequal_dt(x, y, all = TRUE)
Arguments
x |
A data.frame |
y |
A data.frame |
all |
Logical. When |
Value
A data.table
See Also
Examples
x = iris[c(2,3,3,4),]
x2 = iris[2:4,]
y = iris[c(3:5),]
intersect_dt(x, y) # intersect
intersect_dt(x, y, all=TRUE) # intersect all
setdiff_dt(x, y) # except
setdiff_dt(x, y, all=TRUE) # except all
union_dt(x, y) # union
union_dt(x, y, all=TRUE) # union all
setequal_dt(x, x2, all=FALSE) # setequal
setequal_dt(x, x2) # setequal all
Join tables
Description
The mutating joins add columns from 'y' to 'x', matching rows based on the keys:
* 'inner_join_dt()': includes all rows in 'x' and 'y'. * 'left_join_dt()': includes all rows in 'x'. * 'right_join_dt()': includes all rows in 'y'. * 'full_join_dt()': includes all rows in 'x' or 'y'.
Filtering joins filter rows from 'x' based on the presence or absence of matches in 'y':
* 'semi_join_dt()' return all rows from 'x' with a match in 'y'. * 'anti_join_dt()' return all rows from 'x' without a match in 'y'.
Usage
inner_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y"))
left_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y"))
right_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y"))
full_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y"))
anti_join_dt(x, y, by = NULL, on = NULL)
semi_join_dt(x, y, by = NULL, on = NULL)
Arguments
x |
A data.table |
y |
A data.table |
by |
(Optional) A character vector of variables to join by. If 'NULL', the default, '*_join_dt()' will perform a natural join, using all variables in common across 'x' and 'y'. A message lists the variables so that you can check they're correct; suppress the message by supplying 'by' explicitly. To join by different variables on 'x' and 'y', use a named vector. For example, 'by = c("a" = "b")' will match 'x$a' to 'y$b'. To join by multiple variables, use a vector with length > 1. For example, 'by = c("a", "b")' will match 'x$a' to 'y$a' and 'x$b' to 'y$b'. Use a named vector to match different variables in 'x' and 'y'. For example, 'by = c("a" = "b", "c" = "d")' will match 'x$a' to 'y$b' and 'x$c' to 'y$d'. |
on |
(Optional)
Indicate which columns in x should be joined with which columns in y.
Examples included:
1. |
suffix |
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
Value
A data.table
Examples
workers = fread("
name company
Nick Acme
John Ajax
Daniela Ajax
")
positions = fread("
name position
John designer
Daniela engineer
Cathie manager
")
workers %>% inner_join_dt(positions)
workers %>% left_join_dt(positions)
workers %>% right_join_dt(positions)
workers %>% full_join_dt(positions)
# filtering joins
workers %>% anti_join_dt(positions)
workers %>% semi_join_dt(positions)
# To suppress the message, supply 'by' argument
workers %>% left_join_dt(positions, by = "name")
# Use a named 'by' if the join variables have different names
positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions'
workers %>% inner_join_dt(positions2, by = c("name" = "worker"))
# the syntax of 'on' could be a bit different
workers %>% inner_join_dt(positions2,on = "name==worker")
Fast lead/lag for vectors
Description
Find the "next" or "previous" values in a vector.
It has wrapped data.table's shift
function.
Usage
lead_dt(x, n = 1L, fill = NA)
lag_dt(x, n = 1L, fill = NA)
Arguments
x |
A vector |
n |
a positive integer of length 1, giving the number of positions to lead or lag by. Default uses 1 |
fill |
Value to use for padding when the window goes beyond the input length.
Default uses |
Value
A vector
See Also
Examples
lead_dt(1:5)
lag_dt(1:5)
lead_dt(1:5,2)
lead_dt(1:5,n = 2,fill = 0)
Pivot data from wide to long
Description
Turning a wide table to its longer form. It takes multiple columns and collapses into key-value pairs.
Usage
longer_dt(.data, ..., name = "name", value = "value", na.rm = FALSE)
Arguments
.data |
A data.frame |
... |
Pattern for unchanged group or unquoted names. Pattern can accept
regular expression to match column names. It can recieve what |
name |
Name for the measured variable names column. The default name is 'name'. |
value |
Name for the molten data values column(s). The default name is 'value'. |
na.rm |
If |
Value
A data.table
See Also
Examples
## Example 1:
stocks = data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks
stocks %>%
longer_dt(time)
stocks %>%
longer_dt("ti")
# Example 2:
library(tidyr)
billboard %>%
longer_dt(
-"wk",
name = "week",
value = "rank",
na.rm = TRUE
)
# or use:
billboard %>%
longer_dt(
artist,track,date.entered,
name = "week",
value = "rank",
na.rm = TRUE
)
# or use:
billboard %>%
longer_dt(
1:3,
name = "week",
value = "rank",
na.rm = TRUE
)
Conversion between tidy table and named matrix
Description
Convenient fucntions to implement conversion between tidy table and named matrix.
Usage
mat_df(m)
df_mat(df, row, col, value)
Arguments
m |
A matrix |
df |
A data.frame with at least 3 columns, one for row name, one for column name, and one for values. The names for column and row should be unique. |
row |
Unquoted expression of column name for row |
col |
Unquoted expression of column name for column |
value |
Unquoted expression of column name for values |
Value
For mat_df
, a data.frame.
For df_mat
, a named matrix.
Examples
mm = matrix(c(1:8,NA),ncol = 3,dimnames = list(letters[1:3],LETTERS[1:3]))
mm
tdf = mat_df(mm)
tdf
mat = df_mat(tdf,row,col,value)
setequal(mm,mat)
tdf %>%
setNames(c("A","B","C")) %>%
df_mat(A,B,C)
Mutate columns in data.frame
Description
Adds or updates columns in data.frame.
Usage
mutate_dt(.data, ..., by)
transmute_dt(.data, ..., by)
Arguments
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications functions. |
by |
(Optional) Mutate by what group? |
Value
data.table
See Also
Examples
iris %>% mutate_dt(one = 1,Sepal.Length = Sepal.Length + 1)
iris %>% transmute_dt(one = 1,Sepal.Length = Sepal.Length + 1)
# add group number with symbol `.GRP`
iris %>% mutate_dt(id = 1:.N,grp = .GRP,by = Species)
Conditional update of columns in data.table
Description
Update or add columns when the given condition is met.
mutate_when
integrates mutate
and case_when
in dplyr and make a new tidy verb for data.table. mutate_vars
is
a super function to do updates in specific columns according to conditions.
Usage
mutate_when(.data, when, ..., by)
mutate_vars(.data, .cols = NULL, .func, ..., by)
Arguments
.data |
data.frame |
when |
An object which can be coerced to logical mode |
... |
Name-value pairs of expressions for |
by |
(Optional) Mutate by what group? |
.cols |
Any types that can be accepted by |
.func |
Function to be run within each column, should return a value or vectors with same length. |
Value
data.table
See Also
Examples
iris[3:8,]
iris[3:8,] %>%
mutate_when(Petal.Width == .2,
one = 1,Sepal.Length=2)
iris %>% mutate_vars("Pe",scale)
iris %>% mutate_vars(is.numeric,scale)
iris %>% mutate_vars(-is.factor,scale)
iris %>% mutate_vars(1:2,scale)
iris %>% mutate_vars(.func = as.character)
Nest and unnest
Description
Create or melt list columns in data.frame.
Analogous function for nest
and unnest
in tidyr.
unnest_dt
will automatically remove other list-columns except for the
target list-columns (which would be unnested later). Also, squeeze_dt
is
designed to merge multiple columns into list column.
Usage
nest_dt(.data, ..., mcols = NULL, .name = "ndt")
unnest_dt(.data, ...)
squeeze_dt(.data, ..., .name = "ndt")
chop_dt(.data, ...)
unchop_dt(.data, ...)
Arguments
.data |
data.table, nested or unnested |
... |
The variables for nest group(for |
mcols |
Name-variable pairs in the list, form like |
.name |
Character. The nested column name. Defaults to "ndt".
|
Details
In the nest_dt
, the data would be nested to a column named 'ndt',
which is short for nested data.table.
The squeeze_dt
would not remove the originial columns.
The unchop_dt
is the reverse operation of chop_dt
.
These functions are experiencing the experimental stage, especially
the unnest_dt
. If they don't work on some circumtances, try tidyr
package.
Value
data.table, nested or unnested
References
https://www.r-bloggers.com/much-faster-unnesting-with-data-table/
https://stackoverflow.com/questions/25430986/create-nested-data-tables-by-collapsing-rows-into-new-data-tables
See Also
Examples
# examples for nest_dt
# nest by which columns?
mtcars %>% nest_dt(cyl)
mtcars %>% nest_dt("cyl")
mtcars %>% nest_dt(cyl,vs)
mtcars %>% nest_dt(vs:am)
mtcars %>% nest_dt("cyl|vs")
mtcars %>% nest_dt(c("cyl","vs"))
# change the nested column name
mtcars %>% nest_dt(cyl,.name = "data")
# nest two columns directly
iris %>% nest_dt(mcols = list(petal="^Pe",sepal="^Se"))
# nest more flexibly
iris %>% nest_dt(mcols = list(ndt1 = 1:3,
ndt2 = "Pe",
ndt3 = Sepal.Length:Sepal.Width))
# examples for unnest_dt
# unnest which column?
mtcars %>% nest_dt("cyl|vs") %>%
unnest_dt(ndt)
mtcars %>% nest_dt("cyl|vs") %>%
unnest_dt("ndt")
df <- data.table(
a = list(c("a", "b"), "c"),
b = list(c(TRUE,TRUE),FALSE),
c = list(3,c(1,2)),
d = c(11, 22)
)
df
df %>% unnest_dt(a)
df %>% unnest_dt(2)
df %>% unnest_dt("c")
df %>% unnest_dt(cols = names(df)[3])
# You can unnest multiple columns simultaneously
df %>% unnest_dt(1:3)
df %>% unnest_dt(a,b,c)
df %>% unnest_dt("a|b|c")
# examples for squeeze_dt
# nest which columns?
iris %>% squeeze_dt(1:2)
iris %>% squeeze_dt("Se")
iris %>% squeeze_dt(Sepal.Length:Petal.Width)
iris %>% squeeze_dt(1:2,.name = "data")
# examples for chop_dt
df <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% chop_dt(y,z)
df %>% chop_dt(y,z) %>% unchop_dt(y,z)
Extract the nth value from a vector
Description
nth
get the value from a vector with its position,
while maxth
and minth
get the nth highest or lowest value
from the vector.
Usage
nth(v, n = 1)
maxth(v, n = 1)
minth(v, n = 1)
Arguments
v |
A vector |
n |
For |
Value
A single value.
References
https://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column/66367996#66367996
Examples
x = 1:10
nth(x, 1)
nth(x, 5)
nth(x, -2)
y = c(10,3,4,5,2,1,6,9,7,8)
maxth(y,3)
minth(y,3)
Nice printing of report the Space Allocated for an Object
Description
Provides an estimate of the memory that is being used to store an R object. A wrapper of 'object.size', but use a nicer printing unit.
Usage
object_size(object)
Arguments
object |
an R object. |
Value
An object of class "object_size"
Examples
iris %>% object_size()
Count pairs of items within a group
Description
Count the number of times each pair
of items appear together within a group.
For example, this could count the number of times two words appear within documents.
This function has referred to pairwise_count
in widyr package,
but with very different defaults on several parameters.
Usage
pairwise_count_dt(
.data,
.group,
.value,
upper = FALSE,
diag = FALSE,
sort = TRUE
)
Arguments
.data |
A data.frame. |
.group |
Column name of counting group. |
.value |
Item to count pairs, will end up in |
upper |
When |
diag |
Whether to include diagonal (V1==V2) in output. Default uses |
sort |
Whether to sort rows by counts. Default uses |
Value
A data.table with 3 columns (named as "V1","V2" and "n"), containing combinations in "V1" and "V2", and counts in "n".
See Also
Examples
dat <- data.table(group = rep(1:5, each = 2),
letter = c("a", "b",
"a", "c",
"a", "c",
"b", "e",
"b", "f"))
pairwise_count_dt(dat,group,letter)
pairwise_count_dt(dat,group,letter,sort = FALSE)
pairwise_count_dt(dat,group,letter,upper = TRUE)
pairwise_count_dt(dat,group,letter,diag = TRUE)
pairwise_count_dt(dat,group,letter,diag = TRUE,upper = TRUE)
# The column name could be specified using character.
pairwise_count_dt(dat,"group","letter")
Add percentage to counts in data.frame
Description
Add percentage for counts in the data.frame, both numeric and character with '
Usage
percent(x, digits = 1)
add_prop(.data, count_name = last(names(.data)), digits = 1)
Arguments
x |
A number (numeric). |
digits |
How many digits to keep in the percentage. Default uses 1. |
.data |
A data frame. |
count_name |
Column name of counts (Character). Default uses the last column of data.frame. |
References
https://stackoverflow.com/questions/7145826/how-to-format-a-number-as-percentage-in-r
Examples
percent(0.9057)
percent(0.9057,3)
iris %>%
count_dt(Species) %>%
add_prop()
iris %>%
count_dt(Species) %>%
add_prop(count_name = "n",digits = 2)
Load or unload R package(s)
Description
This function is a wrapper for require
and
detach
. pkg_load
checks to see if a
package is installed, if not it attempts to install the package
from CRAN. pkg_unload
can detach one or more loaded packages.
Usage
pkg_load(..., pkg_names = NULL)
pkg_unload(..., pkg_names = NULL)
Arguments
... |
Name(s) of package(s). |
pkg_names |
(Optional)Character vector containing packages to load
or unload. Default uses |
See Also
require
,
detach
,
p_load
,
p_unload
Examples
## Not run:
pkg_load(data.table)
pkg_unload(data.table)
pkg_load(stringr,fst)
pkg_unload(stringr,fst)
pkg_load(pkg_names = c("data.table","fst"))
p_unload(pkg_names = c("data.table","fst"))
pkg_load(data.table,stringr,fst)
pkg_unload("all") # shortcut to unload all loaded packages
## End(Not run)
Set global printing method for data.table
Description
This function allow user to define how data.table is printed.
Usage
print_options(
topn = 5,
nrows = 100,
class = TRUE,
row.names = TRUE,
col.names = "auto",
print.keys = TRUE,
trunc.cols = FALSE
)
Arguments
topn |
The number of rows to be printed from the beginning and
end of tables with more than |
nrows |
The number of rows which will be printed before truncation is enforced. |
class |
If |
row.names |
If |
col.names |
One of three flavours for controlling the display of column names in output. |
print.keys |
If |
trunc.cols |
If |
Details
Notice that tidyfst has a slightly different printing default for data.table, which is it always prints the keys and variable class (not like data.table).
Value
None. This function is used for its side effect of changing options.
See Also
Examples
iris %>% as.data.table()
print_options(topn = 3,trunc.cols = TRUE)
iris %>% as.data.table()
# set all settings to default in tidyfst
print_options()
iris %>% as.data.table()
Pull out a single variable
Description
Extract vector from data.frame, works likt '[['. Analogous function for pull
in dplyr
Usage
pull_dt(.data, col)
Arguments
.data |
data.frame |
col |
A name of column or index (should be positive). |
Value
vector
See Also
Examples
mtcars %>% pull_dt(2)
mtcars %>% pull_dt(cyl)
mtcars %>% pull_dt("cyl")
Recode number or strings
Description
Recode discrete variables, including numerice and character variable.
Usage
rec_num(x, rec, keep = TRUE)
rec_char(x, rec, keep = TRUE)
Arguments
x |
A numeric or character vector. |
rec |
String with recode pairs of old and new values. Find the usage in examples. |
keep |
Logical. Decide whether to keep the original values if not recoded.
Defaults to |
Value
A vector.
See Also
Examples
x = 1:10
x
rec_num(x, rec = "1=10; 4=2")
rec_num(x, rec = "1:3=1; 4:6=2")
rec_num(x, rec = "1:3=1; 4:6=2",keep = FALSE)
y = letters[1:5]
y
rec_char(y,rec = "a=A;b=B")
rec_char(y,rec = "a,b=A;c,d=B")
rec_char(y,rec = "a,b=A;c,d=B",keep = FALSE)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- data.table
%chin%
,%like%
,%notin%
,as.data.table
,between
,CJ
,copy
,data.table
,fcase
,fcoalesce
,fintersect
,fread
,frollapply
,fsetdiff
,fsetequal
,funion
,fwrite
,key
,like
,rbindlist
,rleid
,rleidv
,set
,setDF
,setDT
,setnames
,tables
,uniqueN
- stringr
Change column order
Description
Change the position of columns, using the same syntax as 'select_dt()'. Check similar function as 'relocate' in dplyr.
Usage
relocate_dt(.data, ..., how = "first", where = NULL)
Arguments
.data |
A data.frame |
... |
Columns to move |
how |
The mode of movement, including "first","last","after","before". Default uses "first". |
where |
Destination of columns selected by |
Value
A data.table with rearranged columns.
See Also
Examples
df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
df
df %>% relocate_dt(f)
df %>% relocate_dt(a,how = "last")
df %>% relocate_dt(is.character)
df %>% relocate_dt(is.numeric, how = "last")
df %>% relocate_dt("[aeiou]")
df %>% relocate_dt(a, how = "after",where = f)
df %>% relocate_dt(f, how = "before",where = a)
df %>% relocate_dt(f, how = "before",where = c)
df %>% relocate_dt(f, how = "after",where = c)
df2 <- data.table(a = 1, b = "a", c = 1, d = "a")
df2 %>% relocate_dt(is.numeric,
how = "after",
where = is.character)
df2 %>% relocate_dt(is.numeric,
how="before",
where = is.character)
Rename column in data.frame
Description
Rename one or more columns in the data.frame.
Usage
rename_dt(.data, ...)
rename_with_dt(.data, .fn, ...)
Arguments
.data |
data.frame |
... |
statements of rename, e.g. 'sl = Sepal.Length' means the column named as "Sepal.Length" would be renamed to "sl" |
.fn |
A function used to transform the selected columns. Should return a character vector the same length as the input. |
Value
data.table
See Also
Examples
iris %>%
rename_dt(sl = Sepal.Length,sw = Sepal.Width) %>%
head()
iris %>% rename_with_dt(toupper)
iris %>% rename_with_dt(toupper,"^Pe")
Fast value replacement in data frame
Description
While replace_na_dt
could replace all NAs to another
value, replace_dt
could replace any value(s) to another specific
value.
Usage
replace_dt(.data, ..., from = is.nan, to = NA)
Arguments
.data |
A data.frame |
... |
Colunms to be replaced. If not specified, use all columns. |
from |
A value, a vector of values or a function returns a logical value.
Defaults to |
to |
A value. Defaults to |
Value
A data.table.
See Also
Examples
iris %>% mutate_vars(is.factor,as.character) -> new_iris
new_iris %>%
replace_dt(Species, from = "setosa",to = "SS")
new_iris %>%
replace_dt(Species,from = c("setosa","virginica"),to = "sv")
new_iris %>%
replace_dt(Petal.Width, from = .2,to = 2)
new_iris %>%
replace_dt(from = .2,to = NA)
new_iris %>%
replace_dt(is.numeric, from = function(x) x > 3, to = 9999 )
Tools for working with row names
Description
The enhanced data.frame, including tibble and data.table, do not
support row names. To link to some base r facilities, there should be functions
to save information in row names. These functions are analogous to
rownames_to_column
and column_to_rownames
in tibble.
Usage
rn_col(.data, var = "rowname")
col_rn(.data, var = "rowname")
Arguments
.data |
A data.frame. |
var |
Name of column to use for rownames. |
Value
rn_col
returns a data.table,
col_rn
returns a data frame.
Examples
mtcars %>% rn_col()
mtcars %>% rn_col("rn")
mtcars %>% rn_col() -> new_mtcars
new_mtcars %>% col_rn() -> old_mtcars
old_mtcars
setequal(mtcars,old_mtcars)
Round a number and make it show zeros
Description
Rounds values in its first argument to the specified number of decimal places, returning character, ensuring first decimal digits are showed even when they are zeros.
Usage
round0(x, digits = 0)
Arguments
x |
A numeric vector. |
digits |
Integer indicating the number of decimal places. Defaults to 0. |
Value
A character vector.
References
https://stackoverflow.com/questions/42105336/how-to-round-a-number-and-make-it-show-zeros
See Also
Examples
a = 14.0034
round0(a,2)
b = 10
round0(b,1)
Sample rows randomly from a table
Description
Select a number or proportion of rows randomly from the data frame
sample_dt
is a merged version of sample_n_dt
and
sample_frac_dt
, this could be convenient.
Usage
sample_dt(.data, n = NULL, prop = NULL, replace = FALSE, by = NULL)
sample_n_dt(.data, size, replace = FALSE, by = NULL)
sample_frac_dt(.data, size, replace = FALSE, by = NULL)
Arguments
.data |
A data.frame |
n |
Number of rows to select |
prop |
Fraction of rows to select |
replace |
Sample with or without replacement? Default uses |
by |
(Optional) Character. Specify if you want to sample by group. |
size |
For |
Value
data.table
See Also
Examples
sample_n_dt(mtcars, 10)
sample_n_dt(mtcars, 50, replace = TRUE)
sample_frac_dt(mtcars, 0.1)
sample_frac_dt(mtcars, 1.5, replace = TRUE)
sample_dt(mtcars,n=10)
sample_dt(mtcars,prop = 0.1)
# sample by group(s)
iris %>% sample_n_dt(2,by = "Species")
iris %>% sample_frac_dt(.1,by = "Species")
mtcars %>% sample_n_dt(1,by = c("cyl","vs"))
Select column from data.frame
Description
Select specific column(s) via various ways. One can select columns by their column names, indexes or regular expression recognizing the column name(s).
Usage
select_dt(.data, ..., cols = NULL, negate = FALSE)
select_mix(.data, ..., rm.dup = TRUE)
Arguments
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications functions. It can also recieve conditional function to select columns. When starts with '-'(minus symbol) or '!', return the negative columns. |
cols |
(Optional)A numeric or character vector. |
negate |
Applicable when regular expression and "cols" is used.
If |
rm.dup |
Should duplicated columns be removed? Defaults to |
Value
data.table
See Also
Examples
iris %>% select_dt(Species)
iris %>% select_dt(Sepal.Length,Sepal.Width)
iris %>% select_dt(Sepal.Length:Petal.Length)
iris %>% select_dt(-Sepal.Length)
iris %>% select_dt(-Sepal.Length,-Petal.Length)
iris %>% select_dt(-(Sepal.Length:Petal.Length))
iris %>% select_dt(c("Sepal.Length","Sepal.Width"))
iris %>% select_dt(-c("Sepal.Length","Sepal.Width"))
iris %>% select_dt(1)
iris %>% select_dt(-1)
iris %>% select_dt(1:3)
iris %>% select_dt(-(1:3))
iris %>% select_dt(1,3)
iris %>% select_dt("Pe")
iris %>% select_dt(-"Se")
iris %>% select_dt(!"Se")
iris %>% select_dt("Pe",negate = TRUE)
iris %>% select_dt("Pe|Sp")
iris %>% select_dt(cols = 2:3)
iris %>% select_dt(cols = 2:3,negate = TRUE)
iris %>% select_dt(cols = c("Sepal.Length","Sepal.Width"))
iris %>% select_dt(cols = names(iris)[2:3])
iris %>% select_dt(is.factor)
iris %>% select_dt(-is.factor)
iris %>% select_dt(!is.factor)
# select_mix could provide flexible mix selection
select_mix(iris, Species,"Sepal.Length")
select_mix(iris,1:2,is.factor)
select_mix(iris,Sepal.Length,is.numeric)
# set rm.dup to FALSE could save the duplicated column names
select_mix(iris,Sepal.Length,is.numeric,rm.dup = FALSE)
Separate a character column into two columns using a regular expression separator
Description
Given either regular expression,
separate_dt()
turns a single character column into two columns.
Usage
separate_dt(
.data,
separated_colname,
into,
sep = "[^[:alnum:]]+",
remove = TRUE
)
Arguments
.data |
A data frame. |
separated_colname |
Column to be separated, can be a character or alias. |
into |
Character vector of length 2. |
sep |
Separator between columns. |
remove |
If |
See Also
Examples
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate_dt(x, c("A", "B"))
# equals to
df %>% separate_dt("x", c("A", "B"))
# If you just want the second variable:
df %>% separate_dt(x,into = c(NA,"B"))
Subset rows using their positions
Description
'slice_dt()' lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. It is accompanied by a number of helpers for common use cases:
* 'slice_head_dt()' and 'slice_tail_dt()' select the first or last rows. * 'slice_sample_dt()' randomly selects rows. * 'slice_min_dt()' and 'slice_max_dt()' select rows with highest or lowest values of a variable.
Usage
slice_dt(.data, ..., by = NULL)
slice_head_dt(.data, n, by = NULL)
slice_tail_dt(.data, n, by = NULL)
slice_max_dt(.data, order_by, n, by = NULL, with_ties = TRUE)
slice_min_dt(.data, order_by, n, by = NULL, with_ties = TRUE)
slice_sample_dt(.data, n, replace = FALSE, by = NULL)
Arguments
.data |
A data.table |
... |
Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. |
by |
Slice by which group(s)? |
n |
When larger than or equal to 1, the number of rows. When between 0 and 1, the proportion of rows to select. |
order_by |
Variable or function of variables to order by. |
with_ties |
Should ties be kept together? The default, 'TRUE', may return more rows than you request. Use 'FALSE' to ignore ties, and return the first 'n' rows. |
replace |
Should sampling be performed with ('TRUE') or without ('FALSE', the default) replacement. |
Value
A data.table
See Also
Examples
a = iris
slice_dt(a,1,2)
slice_dt(a,2:3)
slice_dt(a,141:.N)
slice_dt(a,1,.N)
slice_head_dt(a,5)
slice_head_dt(a,0.1)
slice_tail_dt(a,5)
slice_tail_dt(a,0.1)
slice_max_dt(a,Sepal.Length,10)
slice_max_dt(a,Sepal.Length,10,with_ties = FALSE)
slice_min_dt(a,Sepal.Length,10)
slice_min_dt(a,Sepal.Length,10,with_ties = FALSE)
slice_sample_dt(a,10)
slice_sample_dt(a,0.1)
# use by to slice by group
## following codes get the same results
slice_dt(a,1:3,by = "Species")
slice_dt(a,1:3,by = Species)
slice_dt(a,1:3,by = .(Species))
slice_head_dt(a,2,by = Species)
slice_tail_dt(a,2,by = Species)
slice_max_dt(a,Sepal.Length,3,by = Species)
slice_max_dt(a,Sepal.Length,3,by = Species,with_ties = FALSE)
slice_min_dt(a,Sepal.Length,3,by = Species)
slice_min_dt(a,Sepal.Length,3,by = Species,with_ties = FALSE)
# in `slice_sample_dt`, "by" could only take character class
slice_sample_dt(a,.1,by = "Species")
slice_sample_dt(a,3,by = "Species")
slice_sample_dt(a,51,replace = TRUE,by = "Species")
Case insensitive table joining like SQL
Description
Work like the '*_join_dt' series functions, joining tables with common or customized keys in various ways. The only difference is the joining is case insensitive like SQL.
Usage
sql_join_dt(x, y, by = NULL, type = "inner", suffix = c(".x", ".y"))
Arguments
x |
A data.table |
y |
A data.table |
by |
(Optional) A character vector of variables to join by. If 'NULL', the default, '*_join_dt()' will perform a natural join, using all variables in common across 'x' and 'y'. A message lists the variables so that you can check they're correct; suppress the message by supplying 'by' explicitly. To join by different variables on 'x' and 'y', use a named vector. For example, 'by = c("a" = "b")' will match 'x$a' to 'y$b'. To join by multiple variables, use a vector with length > 1. For example, 'by = c("a", "b")' will match 'x$a' to 'y$a' and 'x$b' to 'y$b'. Use a named vector to match different variables in 'x' and 'y'. For example, 'by = c("a" = "b", "c" = "d")' will match 'x$a' to 'y$b' and 'x$c' to 'y$d'. Notice that in 'sql_join', the joining variables would turn to upper case in the output table. |
type |
Which type of join would you like to use? Default uses "inner", other options include "left", "right", "full", "anti", "semi". |
suffix |
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
Value
A data.table
See Also
Examples
dt1 = data.table(x = c("A","b"),y = 1:2)
dt2 = data.table(x = c("a","B"),z = 4:5)
sql_join_dt(dt1,dt2)
Summarise columns to single values
Description
Summarise group of values into one value for each group. If there is only one group, then only one value would be returned. The summarise function should always return a single value.
Usage
summarise_dt(.data, ..., by = NULL)
summarize_dt(.data, ..., by = NULL)
summarise_when(.data, when, ..., by = NULL)
summarize_when(.data, when, ..., by = NULL)
summarise_vars(.data, .cols = NULL, .func, ..., by)
summarize_vars(.data, .cols = NULL, .func, ..., by)
Arguments
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications
functions for |
by |
unquoted name of grouping variable of list of unquoted names of grouping variables. For details see data.table |
when |
An object which can be coerced to logical mode |
.cols |
Columns to be summarised. |
.func |
Function to be run within each column, should return a value or vectors with same length. |
Details
summarise_vars
could complete summarise on specific columns.
Value
data.table
See Also
Examples
iris %>% summarise_dt(avg = mean(Sepal.Length))
iris %>% summarise_dt(avg = mean(Sepal.Length),by = Species)
mtcars %>% summarise_dt(avg = mean(hp),by = .(cyl,vs))
# the data.table way
mtcars %>% summarise_dt(cyl_n = .N, by = .(cyl, vs)) # `.` is short for list
iris %>% summarise_vars(is.numeric,min)
iris %>% summarise_vars(-is.factor,min)
iris %>% summarise_vars(1:4,min)
iris %>% summarise_vars(is.numeric,min,by ="Species")
mtcars %>% summarise_vars(is.numeric,mean,by = c("vs", "am"))
# use multiple functions on multiple columns
iris %>%
summarise_vars(is.numeric,.func = list(mean,sd,median))
iris %>%
summarise_vars(is.numeric,.func = list(mean,sd,median),by = Species)
Convenient print of time taken
Description
Convenient printing of time elapsed. A wrapper of
data.table::timetaken
, but showing the results more directly.
Usage
sys_time_print(expr)
pst(expr)
Arguments
expr |
Valid R expression to be timed. |
Value
A character vector of the form HH:MM:SS,
or SS.MMMsec if under 60 seconds (invisibly for show_time
). See examples.
See Also
Examples
sys_time_print(Sys.sleep(1))
a = iris
sys_time_print({
res = iris %>%
mutate_dt(one = 1)
})
res
Efficient transpose of data.frame
Description
An efficient way to transpose data frames(data.frame/data.table/tibble).
Usage
t_dt(.data)
Arguments
.data |
A data.frame/data.table/tibble |
Details
This function would return the original data.frame structure, keeping all the row names and column names. If the row names are not available or, "V1,V2..." will be provided.
Value
A transposed data.frame
Examples
t_dt(iris)
t_dt(mtcars)
"Uncount" a data frame
Description
Duplicating rows according to a weighting variable. This is the opposite operation of 'count_dt'. Analogous to 'tidyr::uncount'.
Usage
uncount_dt(.data, wt, .remove = TRUE)
Arguments
.data |
A data.frame |
wt |
A vector of weights. |
.remove |
Should the column for |
See Also
Examples
df <- data.table(x = c("a", "b"), n = c(1, 2))
uncount_dt(df, n)
uncount_dt(df,n,FALSE)
Unite multiple columns into one by pasting strings together
Description
Convenience function to paste together multiple columns into one.
Usage
unite_dt(
.data,
united_colname,
...,
sep = "_",
remove = FALSE,
na2char = FALSE
)
Arguments
.data |
A data frame. |
united_colname |
The name of the new column, string only. |
... |
A selection of columns. If want to select all columns, pass "" to the parameter. See example. |
sep |
Separator to use between values. |
remove |
If |
na2char |
If |
See Also
Examples
df <- expand.grid(x = c("a", NA), y = c("b", NA))
df
# Treat missing value as NA, default
df %>% unite_dt("z", x:y, remove = FALSE)
# Treat missing value as character "NA"
df %>% unite_dt("z", x:y, na2char = TRUE, remove = FALSE)
df %>%
unite_dt("xy", x:y)
# Select all columns
iris %>% unite_dt("merged_name",".")
Use UTF-8 for character encoding in a data frame
Description
fread
from data.table could not recognize the encoding
and return the correct form, this could be unconvenient for text mining tasks. The
utf8-encoding
could use "UTF-8" as the encoding to override the current
encoding of characters in a data frame.
Usage
utf8_encoding(.data)
Arguments
.data |
A data.frame. |
Value
A data.table with characters in UTF-8 encoding
Pivot data from long to wide
Description
Transform a data frame from long format to wide by increasing the number of columns and decreasing the number of rows.
Usage
wider_dt(.data, ..., name, value = NULL, fun = NULL, fill = NA)
Arguments
.data |
A data.frame |
... |
Optional. The unchanged group in the transformation.
Could use integer vector, could receive what |
name |
Chracter.One column name of class to spread |
value |
Chracter.One column name of value to spread.
If |
fun |
Should the data be aggregated before casting?
Defaults to |
fill |
Value with which to fill missing cells. Default uses |
Details
The parameter of 'name' and 'value' should always be provided and should be explicit called (with the parameter names attached).
Value
data.table
See Also
Examples
stocks = data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
) %>%
longer_dt(time) -> longer_stocks
longer_stocks
longer_stocks %>%
wider_dt("time",
name = "name",
value = "value")
longer_stocks %>%
mutate_dt(one = 1) %>%
wider_dt("time",
name = "name",
value = "one")
## using "fun" parameter for aggregation
DT <- data.table(v1 = rep(1:2, each = 6),
v2 = rep(rep(1:3, 2), each = 2),
v3 = rep(1:2, 6),
v4 = rnorm(6))
## for each combination of (v1, v2), add up all values of v4
DT %>%
wider_dt(v1,v2,
value = "v4",
name = ".",
fun = sum)