module Crysda
Overview
`CrysDA` is a {Crys}tal shard for {D}ata {A}nalysis. It provides a modern, functional-style API for data manipulation: filtering, transforming, aggregating and reshaping tabular data.
The core of the library is Crysda::DataFrame, an immutable data structure interface.
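As a minimal taste of the API, here is a sketch using the dataframe_of / values builder documented under Class Method Detail below (the figures are made up):
require "crysda"

df = Crysda.dataframe_of("quarter", "sales", "location").values(1, 300.01, "london", 2, 290, "chicago")
puts df  # every Crystal object responds to to_s, so this prints the frame's string representation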
Features
- [X] Filter, transform, aggregate and reshape tabular data
- [X] Modern, user-friendly and easy-to-learn data-science API
- [X] Reads plain and compressed tsv, csv, json, or any other delimited format, with or without a header, from local or remote sources
- [X] Supports grouped operations
- [X] Supports reading data from DB
- [X] Tables can contain atomic columns (Number, Float, Bool, String) as well as object columns
- [X] Reshape tables from wide to long and back
- [X] Table joins (left, right, semi, inner, outer)
- [X] Cross tabulation
- [X] Descriptive statistics (mean, min, max, median, ...)
- [X] Functional API inspired by dplyr, pandas
- [X] many more...
Defined in:
crysda.cr
crysda/builder.cr
crysda/columns.cr
crysda/context.cr
crysda/dataframe.cr
crysda/groupdf.cr
crysda/joins.cr
crysda/reshape.cr
crysda/select.cr
crysda/simpledf.cr
crysda/utils.cr
Constant Summary
- MISSING_VALUE = "NA"
- PRINT_MAX_DIGITS = 3
- PRINT_MAX_ROWS = 10
- PRINT_MAX_WIDTH = 100
- PRINT_ROW_NUMBERS = true
- VERSION = "0.1.1"
Class Method Summary
- .bind_cols(left : DataFrame, right : DataFrame, rename_duplicates = true) : DataFrame
  Binds dataframes by column.
- .bind_rows(*dfs : DataFrame) : DataFrame
  Adds new rows.
- .column_types(df : DataFrame) : Array(ColSpec)
  Returns column types as an array of ColSpec structs.
- .dataframe_of(rows : Iterable(Hash(String, Any)))
  Creates a new data-frame from an Array of {} of String => Any.
- .dataframe_of(rows : Iterable(DataFrameRow))
  Creates a new data-frame from an array of DataFrameRow.
- .dataframe_of(cols : Iterable(DataCol))
  Creates a data-frame from an Array of DataCol.
- .dataframe_of(*rows : Hash(String, Any))
  Creates a new data-frame from {} of String => Any.
- .dataframe_of(*header : String)
  Creates a new dataframe in place.
- .dataframe_of(*rows : DataFrameRow)
  Creates a new data-frame from records encoded as key-value maps. Column types will be inferred from the value types.
- .dataframe_of(*cols : DataCol)
  Creates a new data-frame from a list of DataCol instances.
- .empty_df
  Creates an empty dataframe with 0 observations.
- .from(resultset : DB::ResultSet)
  Builds a data-frame from a DB::ResultSet.
- .from_json(json : String)
  Builds a data-frame from a JSON string.
- .read_csv(file : String | IO, separator : Char = ',', quote_char : Char = '"', skip_blank_lines : Bool = true, skip : Int32 = 0, comment : Char? = '#', header : Int32? = 0, na_value : String = MISSING_VALUE, true_values = ["T", "TRUE"], false_values = ["F", "FALSE"])
  Reads a comma-separated value file/IO into a dataframe.
- .read_json(file : String | IO)
  Reads a JSON file or URL.
- .selector(&block : ColumnSelector)
  Helper method to return the block as a Proc.
Class Method Detail
.bind_cols(left : DataFrame, right : DataFrame, rename_duplicates = true) : DataFrame
Binds dataframes by column. Rows are matched by position, so all data frames must have the same number of rows.
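For illustration only, a sketch of column binding; df_sales and df_meta are hypothetical frames with the same number of rows:
# Appends the columns of the right frame to the left one; with
# rename_duplicates = true (the default) clashing column names are renamed.
combined = Crysda.bind_cols(df_sales, df_meta)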
.bind_rows(*dfs : DataFrame) : DataFrame
Adds new rows. Missing entries are set to null. The output of bind_rows will contain a column if that column appears in any of the inputs. When row-binding, columns are matched by name, and any missing columns will be filled with NA. Grouping is discarded when binding rows.
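A sketch of row binding, assuming df_2021 and df_2022 are hypothetical frames that share (most of) their columns:
# Columns are matched by name; columns missing from one input are filled
# with NA for that input's rows, and any grouping is discarded.
all_years = Crysda.bind_rows(df_2021, df_2022)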
.column_types(df : DataFrame) : Array(ColSpec)
Returns column types as an array of ColSpec structs.
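For example (df is any dataframe built earlier; the fields of ColSpec are not spelled out here):
Crysda.column_types(df).each do |spec|
  puts spec  # one ColSpec per column
end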
.dataframe_of(rows : Iterable(Hash(String, Any)))
Creates a new data-frame from an Array of {} of String => Any.
.dataframe_of(rows : Iterable(DataFrameRow))
Creates a new data-frame from an array of DataFrameRow.
.dataframe_of(cols : Iterable(DataCol))
Creates a data-frame from an Array of DataCol.
.dataframe_of(*rows : Hash(String, Any))
Creates a new data-frame from {} of String => Any.
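A hedged sketch of the hash-based overload. It assumes Any is the shard's value alias (Crysda::Any when called from outside the module) and that plain Int32 / Float64 / String values fit into it:
# Each hash is one row, keyed by column name.
row1 = Hash(String, Crysda::Any){"quarter" => 1, "sales" => 300.01, "location" => "london"}
row2 = Hash(String, Crysda::Any){"quarter" => 2, "sales" => 290.0, "location" => "chicago"}
df = Crysda.dataframe_of(row1, row2)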
.dataframe_of(*header : String)
Creates a new dataframe in place.
header - pass headers as a variadic parameter. Call values after this call to pass the values:
df = dataframe_of("quarter", "sales", "location").values(1, 300.01, "london", 2, 290, "chicago")
.dataframe_of(*rows : DataFrameRow)
Creates a new data-frame from records encoded as key-value maps. Column types will be inferred from the value types.
.dataframe_of(*cols : DataCol)
Creates a new data-frame from a list of DataCol instances.
.read_csv(file : String | IO, separator : Char = ',', quote_char : Char = '"', skip_blank_lines : Bool = true, skip : Int32 = 0, comment : Char? = '#', header : Int32? = 0, na_value : String = MISSING_VALUE, true_values = ["T", "TRUE"], false_values = ["F", "FALSE"])
Reads a comma-separated value file/IO into a dataframe.
file - a local file path or a URL. Compressed (gz, gzip) files are read transparently.
separator - defaults to ',' and can be changed to another separator (e.g. '\t' for tab-separated files).
skip_blank_lines - defaults to true; all blank lines are skipped.
skip - defaults to 0; skips this many lines from the start of the file.
comment - comment character, defaults to '#'; all lines starting with this character are ignored.
header - header line, defaults to 0 (the first row). If set to nil, column names are auto-generated starting with Col1. If skip_blank_lines and comment are enabled, the header is read after blank and comment lines have been removed.
na_value - defaults to NA. Strings matching this value are treated as Nil.
true_values - defaults to ["T", "TRUE"]; values to consider as boolean true.
false_values - defaults to ["F", "FALSE"]; values to consider as boolean false.
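A hypothetical call combining a few of the options above (the file name is made up; unspecified parameters keep the defaults documented here):
df = Crysda.read_csv("data/sales_2021.csv.gz",
  separator: ';',
  skip: 2,                            # ignore two leading banner lines
  na_value: "n/a",
  true_values: ["T", "TRUE", "yes"],
  false_values: ["F", "FALSE", "no"])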
.selector(&block : ColumnSelector)
Helper method to return the block as a Proc. Used when doing select with multiple criteria.
This is a workaround, as Crystal doesn't allow variadic blocks and a Proc definition requires the complete signature, like Crysda::ColumnSelector.new { |e| .... }. So instead of
df.select(
Crysda::ColumnSelector.new { |s| ... },
Crysda::ColumnSelector.new { |s| ... }
)
one can simply use this helper:
df.select(
Crysda.selector{|e| ....},
Crysda.selector{|e| ....},
)