module Crysda
Overview
`CrysDA` is a {Crys}tal shard for {D}ata {A}nalysis. It provides a modern, functional-style API for data manipulation: filtering, transforming, aggregating and reshaping tabular data.
The core of the library is Crysda::DataFrame, an immutable data structure interface.
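As a minimal taste of the API, here is a sketch using the dataframe_of / values builder documented under Class Method Detail below (the figures are made up):
require "crysda"

df = Crysda.dataframe_of("quarter", "sales", "location").values(1, 300.01, "london", 2, 290, "chicago")
puts df  # every Crystal object responds to to_s, so this prints the frame's string representation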
Features
- [X] Filter, transform, aggregate and reshape tabular data
- [X] Modern, user-friendly and easy-to-learn data-science API
- [X] Reads plain and compressed tsv, csv, json, or any other delimited format, with or without a header, from local or remote sources
- [X] Supports grouped operations
- [X] Supports reading data from DB
- [X] Tables can contain atomic columns (Number, Float, Bool, String) as well as object columns
- [X] Reshape tables from wide to long and back
- [X] Table joins (left, right, semi, inner, outer)
- [X] Cross tabulation
- [X] Descriptive statistics (mean, min, max, median, ...)
- [X] Functional API inspired by dplyr, pandas
- [X] many more...
Defined in:
crysda.cr
crysda/builder.cr
crysda/columns.cr
crysda/context.cr
crysda/dataframe.cr
crysda/groupdf.cr
crysda/joins.cr
crysda/reshape.cr
crysda/select.cr
crysda/simpledf.cr
crysda/utils.cr
Constant Summary
- MISSING_VALUE = "NA"
- PRINT_MAX_DIGITS = 3
- PRINT_MAX_ROWS = 10
- PRINT_MAX_WIDTH = 100
- PRINT_ROW_NUMBERS = true
- VERSION = "0.1.1"
Class Method Summary
- .bind_cols(left : DataFrame, right : DataFrame, rename_duplicates = true) : DataFrame
  Binds dataframes by column.
- .bind_rows(*dfs : DataFrame) : DataFrame
  Adds new rows.
- .column_types(df : DataFrame) : Array(ColSpec)
  Returns column types as an array of ColSpec structs.
- .dataframe_of(rows : Iterable(Hash(String, Any)))
  Creates a new data-frame from an Array of {} of String => Any.
- .dataframe_of(rows : Iterable(DataFrameRow))
  Creates a new data-frame from an array of DataFrameRow.
- .dataframe_of(cols : Iterable(DataCol))
  Creates a data-frame from an Array of DataCol.
- .dataframe_of(*rows : Hash(String, Any))
  Creates a new data-frame from {} of String => Any.
- .dataframe_of(*header : String)
  Creates a new dataframe in place.
- .dataframe_of(*rows : DataFrameRow)
  Creates a new data-frame from records encoded as key-value maps. Column types will be inferred from the value types.
- .dataframe_of(*cols : DataCol)
  Creates a new data-frame from a list of DataCol instances.
- .empty_df
  Creates an empty dataframe with 0 observations.
- .from(resultset : DB::ResultSet)
  Builds a data-frame from a DB::ResultSet.
- .from_json(json : String)
  Builds a data-frame from a JSON string.
- .read_csv(file : String | IO, separator : Char = ',', quote_char : Char = '"', skip_blank_lines : Bool = true, skip : Int32 = 0, comment : Char? = '#', header : Int32? = 0, na_value : String = MISSING_VALUE, true_values = ["T", "TRUE"], false_values = ["F", "FALSE"])
  Reads a comma-separated value file/IO into a dataframe.
- .read_json(file : String | IO)
  Reads a JSON file or URL.
- .selector(&block : ColumnSelector)
  Helper method to return the block as a Proc.
Class Method Detail
.bind_cols(left : DataFrame, right : DataFrame, rename_duplicates = true) : DataFrame
Binds dataframes by column. Rows are matched by position, so all data frames must have the same number of rows.
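For illustration only, a sketch of column binding; df_sales and df_meta are hypothetical frames with the same number of rows:
# Appends the columns of the right frame to the left one; with
# rename_duplicates = true (the default) clashing column names are renamed.
combined = Crysda.bind_cols(df_sales, df_meta)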
.bind_rows(*dfs : DataFrame) : DataFrame
Adds new rows. Missing entries are set to null. The output of bind_rows will contain a column if that column appears in any of the inputs. When row-binding, columns are matched by name, and any missing columns will be filled with NA. Grouping is discarded when binding rows.
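A sketch of row binding, assuming df_2021 and df_2022 are hypothetical frames that share (most of) their columns:
# Columns are matched by name; columns missing from one input are filled
# with NA for that input's rows, and any grouping is discarded.
all_years = Crysda.bind_rows(df_2021, df_2022)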
.column_types(df : DataFrame) : Array(ColSpec)
Returns column types as an array of ColSpec structs.
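For example (df is any dataframe built earlier; the fields of ColSpec are not spelled out here):
Crysda.column_types(df).each do |spec|
  puts spec  # one ColSpec per column
end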
.dataframe_of(rows : Iterable(Hash(String, Any)))
Creates a new data-frame from an Array of {} of String => Any.
.dataframe_of(rows : Iterable(DataFrameRow))
Creates a new data-frame from an array of DataFrameRow.
.dataframe_of(cols : Iterable(DataCol))
Creates a data-frame from an Array of DataCol.
.dataframe_of(*rows : Hash(String, Any))
Creates a new data-frame from {} of String => Any.
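A hedged sketch of the hash-based overload. It assumes Any is the shard's value alias (Crysda::Any when called from outside the module) and that plain Int32 / Float64 / String values fit into it:
# Each hash is one row, keyed by column name.
row1 = Hash(String, Crysda::Any){"quarter" => 1, "sales" => 300.01, "location" => "london"}
row2 = Hash(String, Crysda::Any){"quarter" => 2, "sales" => 290.0, "location" => "chicago"}
df = Crysda.dataframe_of(row1, row2)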
.dataframe_of(*header : String)
Creates a new dataframe in place.
header - pass headers as a variadic parameter. Call values after this call to pass the values:
df = dataframe_of("quarter", "sales", "location").values(1, 300.01, "london", 2, 290, "chicago")
.dataframe_of(*rows : DataFrameRow)
Creates a new data-frame from records encoded as key-value maps. Column types will be inferred from the value types.
.dataframe_of(*cols : DataCol)
Creates a new data-frame from a list of DataCol instances.
.read_csv(file : String | IO, separator : Char = ',', quote_char : Char = '"', skip_blank_lines : Bool = true, skip : Int32 = 0, comment : Char? = '#', header : Int32? = 0, na_value : String = MISSING_VALUE, true_values = ["T", "TRUE"], false_values = ["F", "FALSE"])
Reads a comma-separated value file/IO into a dataframe.
file - a local file path or a URL. Compressed (gz, gzip) files are read transparently.
separator - defaults to ',' and can be changed to another separator (e.g. '\t' for tab-separated files).
skip_blank_lines - defaults to true; all blank lines are skipped.
skip - defaults to 0; skips this many lines from the start of the file.
comment - comment character, defaults to '#'; all lines starting with this character are ignored.
header - header line, defaults to 0 (the first row). If set to nil, column names are auto-generated starting with Col1. If skip_blank_lines and comment are enabled, the header is read after blank and comment lines have been removed.
na_value - defaults to NA. Strings matching this value are treated as Nil.
true_values - defaults to ["T", "TRUE"]; values to consider as boolean true.
false_values - defaults to ["F", "FALSE"]; values to consider as boolean false.
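A hypothetical call combining a few of the options above (the file name is made up; unspecified parameters keep the defaults documented here):
df = Crysda.read_csv("data/sales_2021.csv.gz",
  separator: ';',
  skip: 2,                            # ignore two leading banner lines
  na_value: "n/a",
  true_values: ["T", "TRUE", "yes"],
  false_values: ["F", "FALSE", "no"])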
.selector(&block : ColumnSelector)
Helper method to return the block as a Proc. Used when doing select with multiple criteria.
This is a workaround, as Crystal doesn't allow variadic blocks and a Proc definition requires the complete signature, like Crysda::ColumnSelector.new { |e| .... }. So instead of
df.select(
Crysda::ColumnSelector.new { |s| ... },
Crysda::ColumnSelector.new { |s| ... }
)
one can simply use this helper:
df.select(
Crysda.selector{|e| ....},
Crysda.selector{|e| ....},
)