Module 09 // Insight Engine

Data Analysis

> From raw rows to decisions. Load, clean, profile, aggregate, model — the analyst's loop. Pandas for in-memory power, SQL for big tables, NumPy for the math underneath.

The Analyst's Loop

Real analysis is iterative. You rarely move in a straight line — you load, peek, find a surprise, clean, re-aggregate, and loop again until the story holds up.

  1. Acquire — pull from a CSV, database, or API.
  2. Profile — shape, dtypes, nulls, ranges, cardinality.
  3. Clean — fix types, handle missing values, drop duplicates.
  4. Transform — group, pivot, join, window.
  5. Visualize — sanity-check distributions and trends.
  6. Communicate — narrow to the few numbers a decision actually needs.
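The loop above can be sketched end to end in a few lines of pandas. This is a minimal illustration with a small hypothetical in-memory frame standing in for a real CSV; the column names are assumptions, not a fixed schema.

```python
import pandas as pd

# Step 1: acquire — a tiny stand-in for pd.read_csv("orders.csv")
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-05", "2024-02-10"]),
    "region": ["east", "west", "east"],
    "amount": [120.0, None, 80.0],
})

# Step 2: profile — shape and total null count
print(df.shape, df.isna().sum().sum())

# Step 3: clean — drop rows missing the critical field
df = df.dropna(subset=["amount"])

# Step 4: transform — monthly revenue via resample
monthly = (df.set_index("order_date")
             .resample("MS")["amount"].sum())

# Steps 5–6: eyeball the trend, then report the few numbers that matter
print(monthly)
```

In practice you would run this loop several times, tightening the cleaning and transform steps as each pass surfaces a new surprise.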

Loading & Profiling Data

Start every dataset with a quick fingerprint. info(), describe(), and isna().sum() answer "what am I even looking at?".

profile.py
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

print(df.shape)              # (rows, cols)
print(df.dtypes)             # column types
print(df.head())             # first 5 rows
print(df.describe())         # numeric summary
print(df.isna().sum())       # null count per column
print(df["region"].value_counts())  # cardinality

Cleaning: Nulls, Types, Duplicates

Most analysis bugs are data bugs. Make missingness explicit, cast types intentionally, and check for duplicate keys before you trust any aggregate.

# Missing values
df = df.dropna(subset=["amount"])             # drop rows missing critical field
df["channel"] = df["channel"].fillna("unknown")

# Type coercion
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"])

# Duplicates
print(df.duplicated(subset=["order_date", "customer_id"]).sum())
df = df.drop_duplicates(subset=["order_date", "customer_id"], keep="last")

# Outliers via IQR
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["amount"] >= q1 - 1.5*iqr) & (df["amount"] <= q3 + 1.5*iqr)
clean = df[mask]

Aggregation & Group-By

groupby is the workhorse of analysis. Pair it with agg to compute multiple metrics in one pass.
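A minimal sketch of the pattern, using a small hypothetical orders frame — named aggregations label the output columns in the same call:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [100.0, 150.0, 80.0, 120.0, 60.0],
})

# One pass over the data, several metrics per group
summary = orders.groupby("region").agg(
    n_orders=("amount", "count"),
    revenue=("amount", "sum"),
    avg_order=("amount", "mean"),
)
print(summary)
```

The tuple syntax `name=(column, func)` is pandas' named aggregation; it keeps the result flat instead of producing a multi-level column index.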


SQL Aggregations

When data is too big for memory, push the work to the database. Same mental model — group, aggregate, filter — different syntax. Run the queries below against a sample orders table.
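As a stand-in for the playground, here is the same group-aggregate-filter shape in SQLite, driven from Python's built-in sqlite3 module. The table and columns are hypothetical, mirroring the orders frame used elsewhere in this module.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("east", 100), ("east", 150), ("west", 80), ("west", 120)])

# WHERE filters rows before grouping; HAVING filters the groups after
rows = con.execute("""
    SELECT region, COUNT(*) AS n, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 150
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('east', 2, 250.0), ('west', 2, 200.0)]
```

`GROUP BY region` plays the role of pandas' `groupby("region")`, and the aggregate expressions in the SELECT list are the `agg` step.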


Joins: Combining Tables

The same join semantics apply in SQL and pandas. Pick the smaller side as the lookup, validate the join keys, and always check row counts before and after.

# pandas merge mirrors SQL JOIN
joined = orders.merge(
    customers,
    how="left",        # inner | left | right | outer
    on="customer_id",
    validate="m:1",    # explode if assumption is wrong
    indicator=True,    # adds _merge column for QA
)

# Sanity: any orphaned orders?
print(joined["_merge"].value_counts())

Time Series & Resampling

Time-indexed data unlocks resampling, rolling windows, and shift-based comparisons (week over week, month over month).
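A minimal sketch of all three, on a hypothetical two weeks of daily revenue:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.Series(range(14), index=idx, dtype=float, name="revenue")

weekly = daily.resample("W").sum()        # resample: calendar-week totals
smooth = daily.rolling(window=7).mean()   # rolling: 7-day moving average
wow = weekly.pct_change()                 # shift-based compare: week over week
print(weekly)
```

`resample` changes the grain of the index, `rolling` smooths within the existing grain, and `pct_change` (a shift under the hood) turns levels into growth rates.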


NumPy: The Math Underneath

Pandas columns are backed by NumPy arrays. Vectorize whenever you can — a Python for loop over a million rows is often on the order of 100× slower than the equivalent NumPy expression.
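A small sketch of the vectorized style, on a million simulated order amounts (the distribution and column name are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
amount = rng.uniform(10, 500, size=1_000_000)

# Loop version (slow): tax = [a * 0.07 for a in amount]
# Vectorized version: one C-level pass over the whole array
tax = amount * 0.07

# Boolean masks are arrays too — their mean is a fraction
over_100 = (amount > 100).mean()
print(round(float(over_100), 3))
```

The same idea carries straight into pandas: `df["amount"] * 0.07` and `(df["amount"] > 100).mean()` dispatch to these NumPy operations.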


Pivots & Cross-Tabs

Reshape long data into a matrix view for quick pattern-spotting.
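A minimal sketch with `pivot_table`, using a small hypothetical region-by-quarter frame:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100.0, 150.0, 80.0, 120.0],
})

# Long → wide: one row per region, one column per quarter
wide = df.pivot_table(index="region", columns="quarter",
                      values="amount", aggfunc="sum", fill_value=0)
print(wide)
```

`pd.crosstab` does the same reshaping for counts, and `wide.stack()` goes back to long form when you need it.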


Correlation & Quick Stats

Before reaching for a model, look at correlations and distributions. Most insights live in the simple stuff.

# Pearson correlation matrix between numeric columns
df[["amount", "items", "discount", "ship_days"]].corr().round(2)

# Categorical relationship
pd.crosstab(df["region"], df["churned"], normalize="index").round(3)

# Quick segment comparison
df.groupby("segment")["ltv"].agg(["count", "mean", "median", "std"])

Visualization Fundamentals

Match the chart to the question. Distributions → histogram or box plot. Trend over time → line. Comparing categories → bar. Two numeric variables → scatter.

charts.py
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Distribution
axes[0,0].hist(df["amount"], bins=30, edgecolor="white")
axes[0,0].set_title("Order amount distribution")

# Trend
daily = df.groupby("order_date")["amount"].sum()
axes[0,1].plot(daily.index, daily.values)
axes[0,1].set_title("Daily revenue")

# Categorical comparison
df.groupby("region")["amount"].sum().plot.bar(ax=axes[1,0])
axes[1,0].set_title("Revenue by region")

# Relationship
axes[1,1].scatter(df["items"], df["amount"], alpha=0.5)
axes[1,1].set_title("Items vs amount")

plt.tight_layout()
plt.savefig("dashboard.png", dpi=120)

BI Tools: Power BI & Tableau

Notebooks are great for exploration; BI tools are how the rest of the org consumes your work. Power BI and Tableau both turn modeled data into interactive dashboards — same goal, different philosophies.

Power BI (Microsoft)

Tight Microsoft-stack integration. Strong for enterprises already on Azure, Excel, and Office 365. DAX is the formula language; Power Query (M) handles ETL.

  • Power Query → connect & shape
  • Data Model → relationships & measures
  • DAX → calculated columns & KPIs
  • Report View → visuals & slicers
  • Publish to Power BI Service
Tableau (Salesforce)

Best-in-class visual analytics. Drag-and-drop is faster for exploration; calculated fields use Tableau's own expression language. Strong storytelling features.

  • Data Source → connect & join
  • Worksheets → build single visuals
  • Calculated Fields → LOD expressions
  • Dashboards → assemble worksheets
  • Publish to Tableau Server / Cloud

The BI Workflow (Same in Both Tools)

  1. Connect — to a database, warehouse, CSV, or API.
  2. Model — define relationships between fact and dimension tables (star schema).
  3. Measure — write reusable calculations (DAX in Power BI, Calculated Fields in Tableau).
  4. Visualize — pick the chart that matches the question.
  5. Interact — add filters, slicers, parameters, and drill-throughs.
  6. Publish — share via Power BI Service or Tableau Server / Cloud.

DAX vs Tableau Calculations

Both languages express the same ideas — totals, ratios, time intelligence — with different syntax. Below: revenue, YoY growth, and a running total in each.

measures.dax
-- Power BI / DAX
Total Revenue = SUM(Orders[Amount])

Revenue PY =
CALCULATE(
  [Total Revenue],
  SAMEPERIODLASTYEAR('Date'[Date])
)

YoY % =
DIVIDE([Total Revenue] - [Revenue PY], [Revenue PY])

Running Total =
CALCULATE(
  [Total Revenue],
  FILTER(
    ALLSELECTED('Date'[Date]),
    'Date'[Date] <= MAX('Date'[Date])
  )
)
calcs.tableau
// Tableau Calculated Fields
// Total Revenue
SUM([Amount])

// Revenue PY  (LOOKUP one step back — prior year when the view is partitioned by year)
LOOKUP(SUM([Amount]), -1)

// YoY %
(SUM([Amount]) - LOOKUP(SUM([Amount]), -1))
/ LOOKUP(SUM([Amount]), -1)

// Running Total
RUNNING_SUM(SUM([Amount]))

// LOD: revenue per customer, regardless of view filters
{ FIXED [Customer ID] : SUM([Amount]) }

Choosing Between Them

  • Pick Power BI if your org runs on Microsoft 365 / Azure, your team knows Excel formulas, and licensing cost matters.
  • Pick Tableau if visual exploration speed is the priority, you need advanced chart types out of the box, or you're already on Salesforce.
  • Both connect to the same warehouses (Snowflake, BigQuery, Postgres) — your SQL and modeling skills transfer either way.

Dashboard Design Principles

  • One question per dashboard — "How is revenue trending?" not "everything about sales".
  • Top-left = most important — readers scan in a Z-pattern.
  • KPI cards first, detail below — headline number, then the breakdown.
  • Limit colors — use color to encode meaning, not decoration.
  • Consistent date grain — don't mix daily and monthly on the same view without labeling.

Reproducibility Checklist

An analysis that can't be rerun is just a screenshot. Build for repeatability from day one.

  • Pin versions: requirements.txt or pyproject.toml.
  • Seed every random source: np.random.default_rng(42).
  • Parameterize dates and inputs — no hard-coded "today".
  • Save raw + processed data separately; never overwrite the source.
  • Notebook → script → scheduled job, in that order of maturity.
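The checklist items translate into a small script skeleton. The function name, date, and modeling step below are all hypothetical — the point is the shape: seeded randomness, parameterized inputs, no hard-coded "today".

```python
import numpy as np

def monthly_report(as_of: str, seed: int = 42) -> float:
    """Parameterized entrypoint: the report date and seed arrive as
    arguments, so rerunning with the same inputs reproduces the same
    number — today or six months from now."""
    rng = np.random.default_rng(seed)   # seed every random source
    # Stand-in for a real modeling step (sampling, bootstrap, etc.)
    sample = rng.normal(loc=100, scale=15, size=1_000)
    return float(sample.mean())

# Same inputs → identical output on every run
print(monthly_report("2024-06-30"))
```

From here the maturity path in the last bullet applies: this function lives first in a notebook cell, then in a script with a CLI flag for `as_of`, then in a scheduled job that passes the date in.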