v1.0.0

Data Analyst


Data visualization, report generation, SQL queries, and spreadsheet automation. Transform your AI agent into a data-savvy analyst that turns raw data into actionable insights.

Downloads: 3.0k · Stars: 6 · Versions: 1 · Updated: 2026-02-23

Install

```bash
npx clawhub@latest install data-analyst
```

Documentation

Data Analyst Skill 📊

Turn your AI agent into a data analysis powerhouse.

Query databases, analyze spreadsheets, create visualizations, and generate insights that drive decisions.

---

What This Skill Does

SQL Queries — Write and execute queries against databases

Spreadsheet Analysis — Process CSV, Excel, Google Sheets data

Data Visualization — Create charts, graphs, and dashboards

Report Generation — Automated reports with insights

Data Cleaning — Handle missing data, outliers, formatting

Statistical Analysis — Descriptive stats, trends, correlations

---

Quick Start

1. Configure your data sources in TOOLS.md:

Data Sources

- Primary DB: [Connection string or description]
- Spreadsheets: [Google Sheets URL / local path]
- Data warehouse: [BigQuery/Snowflake/etc.]

2. Set up your workspace:

```bash
./scripts/data-init.sh
```

3. Start analyzing!

---

SQL Query Patterns

Common Query Templates

Basic Data Exploration

```sql
-- Row count
SELECT COUNT(*) FROM table_name;

-- Sample data
SELECT * FROM table_name LIMIT 10;

-- Column statistics
SELECT
  COUNT(*) AS total_rows,
  COUNT(column_name) AS non_null_values,
  COUNT(DISTINCT column_name) AS unique_values,
  MIN(column_name) AS min_val,
  MAX(column_name) AS max_val
FROM table_name;
```

Time-Based Analysis

```sql
-- Daily aggregation
SELECT
  DATE(created_at) AS date,
  COUNT(*) AS daily_count,
  SUM(amount) AS daily_total
FROM transactions
GROUP BY DATE(created_at)
ORDER BY date DESC;

-- Month-over-month comparison
SELECT
  DATE_TRUNC('month', created_at) AS month,
  COUNT(*) AS count,
  LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)) AS prev_month,
  (COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at))) * 100.0 /
    NULLIF(LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)), 0) AS growth_pct
FROM transactions
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;
```

Cohort Analysis

```sql
-- User cohorts by signup month
SELECT
  DATE_TRUNC('month', u.created_at) AS cohort_month,
  DATE_TRUNC('month', o.created_at) AS activity_month,
  COUNT(DISTINCT u.id) AS users
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY cohort_month, activity_month
ORDER BY cohort_month, activity_month;
```
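
The query returns cohorts in long format. A minimal pandas sketch for pivoting that output into the usual retention matrix (the file path is an assumption; the column names mirror the query's aliases):

```python
import pandas as pd

# Output of the cohort query above, exported to CSV (path is illustrative)
cohorts = pd.read_csv('cohorts.csv', parse_dates=['cohort_month', 'activity_month'])

# Months elapsed between signup and activity
cohorts['period'] = (
    (cohorts['activity_month'].dt.year - cohorts['cohort_month'].dt.year) * 12
    + (cohorts['activity_month'].dt.month - cohorts['cohort_month'].dt.month)
)

# One row per cohort, one column per month since signup
matrix = cohorts.pivot_table(index='cohort_month', columns='period', values='users')

# Divide by cohort size (period 0) to get retention rates
retention = matrix.div(matrix[0], axis=0).round(3)
print(retention)
```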

Funnel Analysis

```sql
-- Conversion funnel (last 30 days)
WITH funnel AS (
  SELECT
    COUNT(DISTINCT CASE WHEN event = 'page_view' THEN user_id END) AS views,
    COUNT(DISTINCT CASE WHEN event = 'signup' THEN user_id END) AS signups,
    COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) AS purchases
  FROM events
  WHERE date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
  views,
  signups,
  ROUND(signups * 100.0 / NULLIF(views, 0), 2) AS signup_rate,
  purchases,
  ROUND(purchases * 100.0 / NULLIF(signups, 0), 2) AS purchase_rate
FROM funnel;
```

---

Data Cleaning

Common Data Quality Issues

| Issue | Detection | Solution |
|-------|-----------|----------|
| Missing values | IS NULL or empty string | Impute, drop, or flag |
| Duplicates | GROUP BY with HAVING COUNT(*) > 1 | Deduplicate with rules |
| Outliers | Z-score > 3 or IQR method | Investigate, cap, or exclude |
| Inconsistent formats | Sample and pattern match | Standardize with transforms |
| Invalid values | Range checks, referential integrity | Validate and correct |

Data Cleaning SQL Patterns

```sql
-- Find duplicates
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Find nulls
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_emails,
  SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) AS null_names
FROM users;

-- Standardize text
UPDATE products
SET category = LOWER(TRIM(category));

-- Filter out outliers (IQR method)
WITH stats AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
  FROM data
)
SELECT * FROM data, stats
WHERE value BETWEEN q1 - 1.5 * (q3 - q1) AND q3 + 1.5 * (q3 - q1);
```
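
The same checks translate to pandas for file-based data; a minimal sketch, assuming a numeric value column (all file and column names are illustrative):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Find duplicates and nulls
print(f"Duplicate rows: {df.duplicated().sum()}")
print(df.isna().sum())  # null count per column

# Z-score outliers (|z| > 3), as in the detection table above
z = (df['value'] - df['value'].mean()) / df['value'].std()
outliers = df[z.abs() > 3]

# IQR filter, mirroring the SQL pattern
q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df['value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```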

Data Cleaning Checklist

Data Quality Audit: [Dataset]

Row-Level Checks

- [ ] Total row count: [X]
- [ ] Duplicate rows: [X]
- [ ] Rows with any null: [X]

Column-Level Checks

| Column | Type | Nulls | Unique | Min | Max | Issues |
|--------|------|-------|--------|-----|-----|--------|
| [col] | [type] | [n] | [n] | [v] | [v] | [notes] |

Data Lineage

- Source: [Where data came from]
- Last updated: [Date]
- Known issues: [List]

Cleaning Actions Taken

1. [Action and reason]

2. [Action and reason]

---

Spreadsheet Analysis

CSV/Excel Processing with Python

```python
import pandas as pd

# Load data
df = pd.read_csv('data.csv')  # or pd.read_excel('data.xlsx')

# Basic exploration
print(df.shape)       # (rows, columns)
df.info()             # Column types and nulls (prints directly)
print(df.describe())  # Numeric statistics

# Data cleaning
df = df.drop_duplicates()
df['date'] = pd.to_datetime(df['date'])
df['amount'] = df['amount'].fillna(0)

# Analysis
summary = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)

# Export
summary.to_csv('analysis_output.csv')
```

Common Pandas Operations

```python
# Filtering
filtered = df[df['status'] == 'active']
filtered = df[df['amount'] > 1000]
filtered = df[df['date'].between('2024-01-01', '2024-12-31')]

# Aggregation
by_category = df.groupby('category')['amount'].sum()
pivot = df.pivot_table(values='amount', index='month', columns='category', aggfunc='sum')

# Window functions
df['running_total'] = df['amount'].cumsum()
df['pct_change'] = df['amount'].pct_change()
df['rolling_avg'] = df['amount'].rolling(window=7).mean()

# Merging
merged = pd.merge(df1, df2, on='id', how='left')
```

---

Data Visualization

Chart Selection Guide

| Data Type | Best Chart | Use When |
|-----------|------------|----------|
| Trend over time | Line chart | Showing patterns/changes over time |
| Category comparison | Bar chart | Comparing discrete categories |
| Part of whole | Pie/Donut | Showing proportions (≤5 categories) |
| Distribution | Histogram | Understanding data spread |
| Correlation | Scatter plot | Relationship between two variables |
| Many categories | Horizontal bar | Ranking or comparing many items |
| Geographic | Map | Location-based data |

Python Visualization with Matplotlib/Seaborn

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Line chart (trends)
plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['value'], marker='o')
plt.title('Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('trend.png', dpi=150)

# Bar chart (comparisons)
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='category', y='amount')
plt.title('Amount by Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('comparison.png', dpi=150)

# Heatmap (correlations); numeric_only skips non-numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation.png', dpi=150)
```

ASCII Charts (Quick Terminal Visualization)

When you can't generate images, use ASCII:

```
Revenue by Month (in $K)
========================
Jan: ████████████████ 160
Feb: ██████████████████ 180
Mar: ████████████████████████ 240
Apr: ██████████████████████ 220
May: ██████████████████████████ 260
Jun: ████████████████████████████ 280
```
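
A few lines of Python can generate bars like these; a small sketch (the bar width and sample values are illustrative):

```python
def ascii_bars(data: dict, width: int = 28) -> None:
    """Print a horizontal ASCII bar chart scaled to the largest value."""
    peak = max(data.values())
    for label, value in data.items():
        bar = '█' * max(1, round(value / peak * width))
        print(f"{label}: {bar} {value}")

# Sample values, matching the chart above
ascii_bars({'Jan': 160, 'Feb': 180, 'Mar': 240, 'Apr': 220, 'May': 260, 'Jun': 280})
```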

---

Report Generation

Standard Report Template

[Report Name]

Period: [Date range] · Generated: [Date] · Author: [Agent/Human]

Executive Summary

[2-3 sentences with key findings]

Key Metrics

| Metric | Current | Previous | Change |
|--------|---------|----------|--------|
| [Metric] | [Value] | [Value] | [+/-X%] |

Detailed Analysis

[Section 1]

[Analysis with supporting data]

[Section 2]

[Analysis with supporting data]

Visualizations

[Insert charts]

Insights

1. [Insight]: [Supporting evidence]

2. [Insight]: [Supporting evidence]

Recommendations

1. [Actionable recommendation]

2. [Actionable recommendation]

Methodology

- Data source: [Source]
- Date range: [Range]
- Filters applied: [Filters]
- Known limitations: [Limitations]

Appendix

[Supporting data tables]
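
To fill the Key Metrics table above from a DataFrame, one hedged option is pandas' to_markdown, which needs the tabulate package installed; the metric names and numbers here are placeholders:

```python
import pandas as pd

# Placeholder values; in practice these come from your analysis step
metrics = pd.DataFrame({
    'Metric': ['Revenue ($K)', 'Orders'],
    'Current': [260, 1450],
    'Previous': [240, 1380],
})
metrics['Change'] = (
    (metrics['Current'] - metrics['Previous']) / metrics['Previous'] * 100
).round(1).map('{:+.1f}%'.format)

# to_markdown requires the tabulate package
print(metrics.to_markdown(index=False))
```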

Automated Report Script

```bash
#!/bin/bash
# generate-report.sh

# Pull latest data
python scripts/extract_data.py --output data/latest.csv

# Run analysis
python scripts/analyze.py --input data/latest.csv --output reports/

# Generate report
python scripts/format_report.py --template weekly --output reports/weekly-$(date +%Y-%m-%d).md

echo "Report generated: reports/weekly-$(date +%Y-%m-%d).md"
```

---

Statistical Analysis

Descriptive Statistics

| Statistic | What It Tells You | Use Case |
|-----------|-------------------|----------|
| Mean | Average value | Central tendency |
| Median | Middle value | Robust to outliers |
| Mode | Most common | Categorical data |
| Std Dev | Spread around mean | Variability |
| Min/Max | Range | Data boundaries |
| Percentiles | Distribution shape | Benchmarking |

Quick Stats with Python

```python
# Full descriptive statistics
stats = df['amount'].describe()
print(stats)

# Additional stats
print(f"Median: {df['amount'].median()}")
print(f"Mode: {df['amount'].mode()[0]}")
print(f"Skewness: {df['amount'].skew()}")
print(f"Kurtosis: {df['amount'].kurtosis()}")

# Correlation
correlation = df['sales'].corr(df['marketing_spend'])
print(f"Correlation: {correlation:.3f}")
```

Statistical Tests Quick Reference

| Test | Use Case | Python |
|------|----------|--------|
| T-test | Compare two means | scipy.stats.ttest_ind(a, b) |
| Chi-square | Categorical independence | scipy.stats.chi2_contingency(table) |
| ANOVA | Compare 3+ means | scipy.stats.f_oneway(a, b, c) |
| Pearson | Linear correlation | scipy.stats.pearsonr(x, y) |
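
A short sketch of how the first and last of these are called, on synthetic data (the group sizes and distributions are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=15, size=200)  # e.g., control group
b = rng.normal(loc=105, scale=15, size=200)  # e.g., treatment group

# T-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Pearson correlation between two paired series
r, p = stats.pearsonr(a, b)
print(f"r = {r:.3f}, p = {p:.4f}")
```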

---

Analysis Workflow

Standard Analysis Process

1. Define the Question
   - What are we trying to answer?
   - What decisions will this inform?

2. Understand the Data
   - What data is available?
   - What's the structure and quality?

3. Clean and Prepare
   - Handle missing values
   - Fix data types
   - Remove duplicates

4. Explore
   - Descriptive statistics
   - Initial visualizations
   - Identify patterns

5. Analyze
   - Deep dive into findings
   - Statistical tests if needed
   - Validate hypotheses

6. Communicate
   - Clear visualizations
   - Actionable insights
   - Recommendations

Analysis Request Template

Analysis Request

Question

[What are we trying to answer?]

Context

[Why does this matter? What decision will it inform?]

Data Available

- [Dataset 1]: [Description]
- [Dataset 2]: [Description]

Expected Output

- [Deliverable 1]
- [Deliverable 2]

Timeline

[When is this needed?]

Notes

[Any constraints or considerations]

---

Scripts

data-init.sh

Initialize your data analysis workspace.

query.sh

Quick SQL query execution.

```bash
# Run a query from a file
./scripts/query.sh --file queries/daily-report.sql

# Run an inline query
./scripts/query.sh "SELECT COUNT(*) FROM users"

# Save output to a file
./scripts/query.sh --file queries/export.sql --output data/export.csv
```

analyze.py

Python analysis toolkit.

```bash
# Basic analysis
python scripts/analyze.py --input data/sales.csv

# With a specific analysis type
python scripts/analyze.py --input data/sales.csv --type cohort

# Generate a report
python scripts/analyze.py --input data/sales.csv --report weekly
```

---

Integration Tips

With Other Skills

| Skill | Integration |
|-------|-------------|
| Marketing | Analyze campaign performance, content metrics |
| Sales | Pipeline analytics, conversion analysis |
| Business Dev | Market research data, competitor analysis |

Common Data Sources

- Databases: PostgreSQL, MySQL, SQLite
- Warehouses: BigQuery, Snowflake, Redshift
- Spreadsheets: Google Sheets, Excel, CSV
- APIs: REST endpoints, GraphQL
- Files: JSON, Parquet, XML
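
Most of these load straight into pandas. A minimal sketch for a PostgreSQL source, assuming SQLAlchemy and a driver such as psycopg2 are installed (the connection string is a placeholder):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials; substitute your own
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Any of the SQL patterns above come back as a DataFrame
df = pd.read_sql('SELECT * FROM transactions LIMIT 1000', engine)

# File formats load just as directly
orders = pd.read_parquet('orders.parquet')  # requires pyarrow or fastparquet
events = pd.read_json('events.json')
```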

---

Best Practices

1. Start with the question — Know what you're trying to answer

2. Validate your data — Garbage in = garbage out

3. Document everything — Queries, assumptions, decisions

4. Visualize appropriately — Right chart for right data

5. Show your work — Methodology matters

6. Lead with insights — Not just data dumps

7. Make it actionable — "So what?" → "Now what?"

8. Version your queries — Track changes over time

---

Common Mistakes

Confirmation bias — Looking for data to support a conclusion

Correlation ≠ causation — Be careful with claims

Cherry-picking — Using only favorable data

Ignoring outliers — Investigate before removing

Over-complicating — Simple analysis often wins

No context — Numbers without comparison are meaningless

---

License

License: MIT — use freely, modify, distribute.

---

*"The goal is to turn data into information, and information into insight." — Carly Fiorina*
