# CrysDA

CrysDA is a Crystal shard for Data Analysis. It provides a modern, functional-style API to filter, transform, aggregate, and reshape tabular data. Inspired by R's dplyr and Python's pandas, it is written in pure Crystal with no external dependencies.
## Installation

Add the dependency to your `shard.yml`:

```yaml
dependencies:
  crysda:
    github: naqvis/CrysDA
```

Then run `shards install`.
## Quick Example

```crystal
require "crysda"

flights = Crysda.read_csv("./data/nycflights.tsv.gz", separator: '\t')

flights
  .group_by("year", "month", "day")
  .summarize(
    "mean_arr_delay".with { |s| s["arr_delay"].mean(remove_na: true) },
    "mean_dep_delay".with { |s| s["dep_delay"].mean(remove_na: true) }
  )
  .filter { |f| (f["mean_arr_delay"] > 30).or(f["mean_dep_delay"] > 30) }
  .print
```
```
A DataFrame: 49 x 5

      year   month   day   mean_arr_delay   mean_dep_delay
 1    2013       1    16           34.247           24.613
 2    2013       1    31           32.603           28.658
 3    2013      10     7           39.017           39.147
...
```
## Features

### Data Operations

- **Filter, select, mutate** - Subset and transform data with expressive syntax
- **Group & summarize** - Aggregations with `mean`, `sum`, `min`, `max`, `median`, `sd`
- **Joins** - Left, right, inner, and outer joins on single or multiple keys
- **Reshape** - `gather`/`spread` for wide↔long transformations, `separate`/`unite` for column splitting
### I/O Support

- **CSV/TSV** - Plain or compressed (gzip), from local files or remote URLs
- **JSON** - Array-of-objects format
- **Database** - Direct SQL query results to a DataFrame
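As a sketch of the JSON path: the `read_json` entry point below is an assumption mirroring the `read_csv` reader shown elsewhere in this README, so check the API docs for the exact name and signature.

```crystal
require "crysda"

# Hypothetical reader name, mirroring read_csv; verify against the API docs.
# Expects a JSON array of objects, e.g. [{"name": "Alice", "age": 30}, ...]
df = Crysda.read_json("users.json")
df.print
```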
### Column Types

- **Numeric**: `Int32Col`, `Int64Col`, `Float64Col`
- **Text**: `StringCol`, with automatic interning for categorical data
- **Boolean**: `BoolCol`
- **DateTime**: `DateTimeCol`, with component extraction and formatting
- **Nested**: `DFCol`, for hierarchical data
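Column types are inferred when data is loaded; to reach type-specific methods you can down-cast a column to its concrete type, as the window-function examples below do. A minimal sketch (the inferred `Int32Col` type here is an assumption based on the values):

```crystal
require "crysda"

df = Crysda.dataframe_of("name", "age").values(
  "Alice", 30,
  "Bob", 25
)

df.schema                      # show the inferred column types
ages = df["age"].as(Int32Col)  # down-cast to the concrete column type
```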
### Missing Data

```crystal
df.dropna                                     # Remove rows with any nulls
df.dropna("col1", "col2")                     # Check only specific columns
df.fillna(0)                                  # Fill all nulls with a value
df.fillna({"age" => 0, "name" => "Unknown"})  # Column-specific fills
```
### String Operations

```crystal
# No casting needed - use the str_-prefixed methods
df["name"].str_contains("john")       # Check substring
df["name"].str_contains(/\d+/)        # Regex match
df["name"].str_starts_with("Dr.")     # Prefix check
df["name"].str_ends_with("Jr.")       # Suffix check
df["name"].str_extract(/(\d+)/)       # Extract regex groups
df["name"].str_replace("old", "new")  # Substitute patterns
df["name"].str_upcase                 # Convert to uppercase
df["name"].str_downcase               # Convert to lowercase
df["name"].str_strip                  # Trim whitespace
df["name"].str_len                    # String lengths

# Use in filters
df.filter { |e| e["email"].str_contains("@gmail.com") }

# Or use block-based matching for simple cases
df.filter { |e| e["name"].matching { |s| s.starts_with?("A") } }
```
### Window Functions

```crystal
col = df["price"].as(Float64Col)

col.rolling_mean(3)  # 3-period moving average
col.rolling_sum(5)   # 5-period moving sum
col.rolling_min(3)   # Moving minimum
col.rolling_max(3)   # Moving maximum
col.rolling_std(3)   # Moving standard deviation
col.ewm_mean(10)     # Exponentially weighted moving average
col.diff             # Difference from previous value
```
### Convenience APIs

Column operations:

```crystal
df["age"].unique                 # Distinct non-null values
df["age"].nunique                # Count of unique values
df["age"].in?([25, 30, 35])      # Membership check
df["age"].between(20, 40)        # Range check (inclusive)
df["score"].clip(0, 100)         # Clamp to range
df["value"].ffill                # Forward-fill nulls
df["value"].bfill                # Backward-fill nulls
df["value"].first                # First non-null value
df["value"].last                 # Last non-null value
df["price"].apply { |v| v.as(Float64) * 2 }  # Transform values
df["col1"].coalesce(df["col2"])  # First non-null from two columns

# Binning/discretization
df["age"].cut([0, 18, 35, 65, 100], labels: ["child", "young", "adult", "senior"])
df["score"].qcut(4)              # Quartile-based binning
df["score"].qcut(4, labels: ["Q1", "Q2", "Q3", "Q4"])
```
DataFrame operations:

```crystal
df.sample(10)                       # Random n rows
df.sample(0.1)                      # Random fraction
df.shuffle                          # Randomize order
df.value_counts("category")         # Count occurrences
df.describe                         # Summary statistics
df.coalesce("a", "b", "c")          # First non-null across columns
df.duplicated("id")                 # Find duplicates
df.drop_duplicates("id")            # Remove duplicates
df.nlargest(10, "sales")            # Top 10 rows by sales
df.nsmallest(5, "price")            # Bottom 5 rows by price
Crysda.concat([df1, df2])           # Vertical stack
Crysda.concat([df1, df2], axis: 1)  # Horizontal stack

# Row-wise operations
df.apply_rows("total") { |row| row["price"].as_f * row["qty"].as_i }

# Pivot tables
df.pivot_table("region", "product", "sales", "sum")

# JSON output
df.to_json                    # Convert to a JSON string
df.to_json(pretty: true)      # Pretty-printed JSON
df.write_json("output.json")  # Write to a file
```
## Performance

CrysDA uses optimized internal storage for efficient memory usage and fast operations:

- **Compact storage** - Numeric columns use a contiguous `Slice(T)` with a separate null bitmap (1 bit per element) instead of nullable union types
- **SIMD acceleration** - Bitmap operations use SSE (x86_64) or NEON (AArch64) instructions for large datasets
- **Zero-allocation aggregations** - `sum`, `mean`, `min`, and `max` operate directly on raw data without intermediate arrays
- **String interning** - Categorical string columns automatically deduplicate repeated values
- **Lazy iteration** - Column iterators avoid materializing full arrays when possible
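The compact-storage idea can be illustrated with a simplified model (an illustrative sketch, not CrysDA's actual internals): values live in a contiguous `Slice`, a separate bitmap marks nulls, and aggregations walk the raw buffer without allocating intermediate arrays.

```crystal
# Simplified illustration of Slice-plus-null-bitmap column storage.
# Not CrysDA's actual implementation.
struct BitmapColumn
  getter values : Slice(Float64)
  getter nulls : Slice(UInt8) # 1 bit per element; a set bit marks a null

  def initialize(@values, @nulls)
  end

  def null?(i : Int32) : Bool
    (@nulls[i >> 3] & (1_u8 << (i & 7))) != 0
  end

  # Zero-allocation sum: iterate the raw slice, skipping nulls via the bitmap.
  def sum : Float64
    total = 0.0
    @values.each_with_index { |v, i| total += v unless null?(i) }
    total
  end
end

values = Slice[1.0, 2.0, 3.0, 4.0]
nulls = Slice[0b0000_0100_u8] # bit 2 set => element 2 is null
col = BitmapColumn.new(values, nulls)
col.sum # => 7.0
```

Storing one bit per element instead of a nullable union keeps numeric data dense, which is also what makes the SIMD bitmap operations above possible.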
## Usage

### Reading Data

```crystal
# From CSV (auto-detects types)
df = Crysda.read_csv("data.csv")
df = Crysda.read_csv("http://example.com/data.csv")  # from a URL
df = Crysda.read_csv("data.tsv.gz", separator: '\t') # compressed TSV

# From code
df = Crysda.dataframe_of("name", "age", "score").values(
  "Alice", 30, 95.5,
  "Bob", 25, 87.0,
  "Carol", 35, 92.3
)
```
### Inspecting Data

```crystal
df.print    # Pretty-print the table
df.schema   # Show column types
df.num_row  # Row count
df.num_col  # Column count
df.names    # Column names
```
### Selecting & Filtering

```crystal
df.select("name", "score")       # Select columns
df.reject("age")                 # Exclude columns
df.filter { |e| e["age"] > 25 }  # Filter rows
df.filter { |e| e["name"].matching { |s| s.starts_with?("A") } }
```
### Transforming

```crystal
df.add_column("score_pct") { |e| e["score"] / 100.0 }
df.add_column("label") { "student" }
df.sort_by("age")
df.sort_desc_by("score")
```
### Aggregating

```crystal
df.summarize(
  "avg_score".with { |e| e["score"].mean },
  "max_age".with { |e| e["age"].max }
)

df.group_by("department").summarize(
  "count".with { |e| e.num_row },
  "avg_score".with { |e| e["score"].mean(remove_na: true) }
)

df.count("department")  # Cross-tabulation
```
### Joining

```crystal
left.left_join(right, "id")
left.inner_join(right, by: ["id", "year"])
left.outer_join(right, by: "key", suffices: {"_l", "_r"})
```
### Reshaping

```crystal
# Wide to long
df.gather("quarter", "value", selector { |c| c["q1".."q4"] })

# Long to wide
df.spread("quarter", "value")

# Split a column
df.separate("date", into: ["year", "month", "day"], sep: "-")

# Combine columns
df.unite("full_name", ["first", "last"], sep: " ")
```
## Plotting

For data visualization, see `crysda-plot`.
## Documentation
## Development

Run the test suite with:

```shell
crystal spec
```
## Contributing

1. Fork it (<https://github.com/naqvis/Crysda/fork>)
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
## Contributors

- Ali Naqvi - creator and maintainer