Data Analysis
> From raw rows to decisions. Load, clean, profile, aggregate, model — the analyst's loop. Pandas for in-memory power, SQL for big tables, NumPy for the math underneath.
The Analyst's Loop
Real analysis is iterative. You rarely move in a straight line — you load, peek, find a surprise, clean, re-aggregate, and loop again until the story holds up.
- Acquire — pull from a CSV, database, or API.
- Profile — shape, dtypes, nulls, ranges, cardinality.
- Clean — fix types, handle missing values, drop duplicates.
- Transform — group, pivot, join, window.
- Visualize — sanity-check distributions and trends.
- Communicate — narrow to the few numbers a decision actually needs.
Loading & Profiling Data
Start every dataset with a quick fingerprint. info(), describe(), and isna().sum() answer "what am I even looking at?".
import pandas as pd
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
print(df.shape)                     # (rows, cols)
print(df.dtypes)                    # column types
print(df.head())                    # first 5 rows
print(df.describe())                # numeric summary
print(df.isna().sum())              # null count per column
print(df["region"].value_counts())  # cardinality

Cleaning: Nulls, Types, Duplicates
Most analysis bugs are data bugs. Make missingness explicit, cast types intentionally, and check for duplicate keys before you trust any aggregate.
# Missing values
df = df.dropna(subset=["amount"])   # drop rows missing critical field
df["channel"] = df["channel"].fillna("unknown")

# Type coercion
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"])

# Duplicates
print(df.duplicated(subset=["order_date", "customer_id"]).sum())
df = df.drop_duplicates(subset=["order_date", "customer_id"], keep="last")

# Outliers via IQR
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["amount"] >= q1 - 1.5*iqr) & (df["amount"] <= q3 + 1.5*iqr)
clean = df[mask]

Aggregation & Group-By
groupby is the workhorse of analysis. Pair it with agg to compute multiple metrics in one pass.
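A minimal sketch of the pattern, using a toy frame with illustrative column names — named aggregation keeps the result columns flat and readable:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [10.0, 20.0, 5.0, 15.0, 30.0],
    "items":  [1, 2, 1, 3, 2],
})

# One pass, several metrics per group
summary = df.groupby("region").agg(
    orders=("amount", "count"),
    revenue=("amount", "sum"),
    avg_items=("items", "mean"),
)
print(summary)
```

Each keyword becomes an output column: `orders=("amount", "count")` means "count the amount column, call the result orders".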
SQL Aggregations
When data is too big for memory, push the work to the database. Same mental model — group, aggregate, filter — different syntax. Run the queries below against a sample orders table.
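As a stand-in for a real warehouse, the same group-aggregate-filter shape can be exercised against an in-memory SQLite table (table and column names here are assumptions, not the document's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 20.0), ("west", 5.0), ("west", 45.0)],
)

# GROUP BY aggregates per region; HAVING filters on the aggregate,
# the way WHERE filters on raw rows
rows = con.execute("""
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 40
    ORDER BY revenue DESC
""").fetchall()
print(rows)
```

The mental model transfers directly: `GROUP BY region` is `df.groupby("region")`, the SELECT aggregates are `agg`, and `HAVING` is a filter on the aggregated result.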
Joins: Combining Tables
The same join semantics apply in SQL and pandas. Pick the smaller side as the lookup, validate the join keys, and always check row counts before and after.
# pandas merge mirrors SQL JOIN
joined = orders.merge(
    customers,
    how="left",       # inner | left | right | outer
    on="customer_id",
    validate="m:1",   # explode if assumption is wrong
    indicator=True,   # adds _merge column for QA
)

# Sanity: any orphaned orders?
print(joined["_merge"].value_counts())

Time Series & Resampling
Time-indexed data unlocks resampling, rolling windows, and shift-based comparisons (week over week, month over month).
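A compact sketch of all three moves on a synthetic daily series (the two-week range and values are illustrative):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=idx, dtype=float)

weekly = daily.resample("W").sum()       # downsample: daily -> weekly totals
smooth = daily.rolling(window=7).mean()  # 7-day moving average
wow = weekly.pct_change()                # week-over-week growth
print(weekly)
```

`resample` needs a datetime index; if dates live in a column, `df.set_index("order_date")` first, or use `df.resample("W", on="order_date")`.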
NumPy: The Math Underneath
Pandas columns are backed by NumPy arrays. Vectorize whenever you can — a Python for loop over a million rows can easily be 100× slower than the equivalent NumPy expression.
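A small illustration of the two styles computing the same quantity — the loop iterates element by element at the Python level, while the vectorized form is a single C-level pass:

```python
import numpy as np

rng = np.random.default_rng(0)
amounts = rng.uniform(0, 100, size=100_000)

# Loop version: one Python-level iteration per element
total_loop = 0.0
for a in amounts:
    total_loop += a * 1.1

# Vectorized version: one array expression, evaluated in C
total_vec = (amounts * 1.1).sum()

# Vectorized branching: np.where replaces an if/else inside the loop
tier = np.where(amounts > 50, "high", "low")
```

The exact speedup depends on the operation and array size, but the shape of the win is always the same: move the loop out of Python and into the array library.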
Pivots & Cross-Tabs
Reshape long data into a matrix view for quick pattern-spotting.
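A minimal pivot, assuming long rows keyed by region and quarter (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [10.0, 20.0, 30.0, 40.0],
})

# Long rows -> region x quarter matrix; fill_value avoids NaN holes
matrix = df.pivot_table(
    index="region", columns="quarter",
    values="amount", aggfunc="sum", fill_value=0,
)
print(matrix)
```

`pivot_table` aggregates duplicates with `aggfunc`; plain `pivot` raises on duplicate index/column pairs, which makes it a useful integrity check when you expect exactly one row per cell.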
Correlation & Quick Stats
Before reaching for a model, look at correlations and distributions. Most insights live in the simple stuff.
# Pearson correlation matrix between numeric columns
df[["amount", "items", "discount", "ship_days"]].corr().round(2)

# Categorical relationship
pd.crosstab(df["region"], df["churned"], normalize="index").round(3)

# Quick segment comparison
df.groupby("segment")["ltv"].agg(["count", "mean", "median", "std"])

Visualization Fundamentals
Match the chart to the question. Distributions → histogram or box plot. Trend over time → line. Comparing categories → bar. Two numeric variables → scatter.
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Distribution
axes[0, 0].hist(df["amount"], bins=30, edgecolor="white")
axes[0, 0].set_title("Order amount distribution")

# Trend
daily = df.groupby("order_date")["amount"].sum()
axes[0, 1].plot(daily.index, daily.values)
axes[0, 1].set_title("Daily revenue")

# Categorical comparison
df.groupby("region")["amount"].sum().plot.bar(ax=axes[1, 0])
axes[1, 0].set_title("Revenue by region")

# Relationship
axes[1, 1].scatter(df["items"], df["amount"], alpha=0.5)
axes[1, 1].set_title("Items vs amount")

plt.tight_layout()
plt.savefig("dashboard.png", dpi=120)

BI Tools: Power BI & Tableau
Notebooks are great for exploration; BI tools are how the rest of the org consumes your work. Power BI and Tableau both turn modeled data into interactive dashboards — same goal, different philosophies.
Power BI
Tight Microsoft-stack integration. Strong for enterprises already on Azure, Excel, and Office 365. DAX is the formula language; Power Query (M) handles ETL.
- Power Query → connect & shape
- Data Model → relationships & measures
- DAX → calculated columns & KPIs
- Report View → visuals & slicers
- Publish to Power BI Service
Tableau
Best-in-class visual analytics. Drag-and-drop is faster for exploration; calculated fields use Tableau's own expression language. Strong storytelling features.
- Data Source → connect & join
- Worksheets → build single visuals
- Calculated Fields → LOD expressions
- Dashboards → assemble worksheets
- Publish to Tableau Server / Cloud
The BI Workflow (Same in Both Tools)
- Connect — to a database, warehouse, CSV, or API.
- Model — define relationships between fact and dimension tables (star schema).
- Measure — write reusable calculations (DAX in Power BI, Calculated Fields in Tableau).
- Visualize — pick the chart that matches the question.
- Interact — add filters, slicers, parameters, and drill-throughs.
- Publish — share via Power BI Service or Tableau Server / Cloud.
DAX vs Tableau Calculations
Both languages express the same ideas — totals, ratios, time intelligence — with different syntax. Below: revenue, YoY growth, and a running total in each.
-- Power BI / DAX
Total Revenue = SUM(Orders[Amount])
Revenue PY =
CALCULATE(
    [Total Revenue],
    SAMEPERIODLASTYEAR('Date'[Date])
)

YoY % =
DIVIDE([Total Revenue] - [Revenue PY], [Revenue PY])

Running Total =
CALCULATE(
    [Total Revenue],
    FILTER(
        ALLSELECTED('Date'[Date]),
        'Date'[Date] <= MAX('Date'[Date])
    )
)

// Tableau Calculated Fields
// Total Revenue
SUM([Amount])
// Revenue PY (LOOKUP across date partition)
LOOKUP(SUM([Amount]), -1)
// YoY %
(SUM([Amount]) - LOOKUP(SUM([Amount]), -1))
/ LOOKUP(SUM([Amount]), -1)
// Running Total
RUNNING_SUM(SUM([Amount]))
// LOD: revenue per customer, regardless of view filters
{ FIXED [Customer ID] : SUM([Amount]) }

Choosing Between Them
- Pick Power BI if your org runs on Microsoft 365 / Azure, your team knows Excel formulas, and licensing cost matters.
- Pick Tableau if visual exploration speed is the priority, you need advanced chart types out of the box, or you're already on Salesforce.
- Both connect to the same warehouses (Snowflake, BigQuery, Postgres) — your SQL and modeling skills transfer either way.
Dashboard Design Principles
- One question per dashboard — "How is revenue trending?" not "everything about sales".
- Top-left = most important — readers scan in a Z-pattern.
- KPI cards first, detail below — headline number, then the breakdown.
- Limit colors — use color to encode meaning, not decoration.
- Consistent date grain — don't mix daily and monthly on the same view without labeling.
Reproducibility Checklist
An analysis that can't be rerun is just a screenshot. Build for repeatability from day one.
- Pin versions: requirements.txt or pyproject.toml.
- Seed every random source: np.random.default_rng(42).
- Parameterize dates and inputs — no hard-coded "today".
- Save raw + processed data separately; never overwrite the source.
- Notebook → script → scheduled job, in that order of maturity.
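The seeding and parameterization items can be condensed into one script skeleton — the `as_of` parameter, column names, and data here are purely illustrative:

```python
import numpy as np
import pandas as pd

def run(as_of: str, seed: int = 42) -> pd.DataFrame:
    """Parameterized entry point: no hidden 'today', no global RNG state."""
    rng = np.random.default_rng(seed)  # seeded: reruns give identical output
    cutoff = pd.Timestamp(as_of)       # the date comes in as a parameter
    orders = pd.DataFrame({
        "order_date": pd.date_range("2024-01-01", periods=10, freq="D"),
        "amount": rng.uniform(10, 100, size=10).round(2),
    })
    return orders[orders["order_date"] <= cutoff]

a = run(as_of="2024-01-05")
b = run(as_of="2024-01-05")  # identical frame: same seed, same parameters
```

Because both the date and the seed are explicit arguments, the same call always produces the same frame — which is exactly what a scheduled job needs.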