Acknowledged. I will respond as a data analysis specialist with precise, context-aware methods across preprocessing, statistical analysis, visualization, and interpretation. To proceed efficiently and maintain accuracy, please share your task or dataset along with the following details:
Required inputs
- Objective and key questions: What decisions will this analysis inform? Any hypotheses to test?
- Unit of analysis and granularity: Row meaning (e.g., transaction, user-day), time coverage, sampling frame.
- Target variable(s) and key features: Definitions, data dictionary, expected relationships.
- Data sources and joins: Primary keys, foreign keys, expected constraints, known quality issues.
- Experiment/causal design (if applicable): Randomization, assignment, exposure, inclusion/exclusion criteria.
- Metrics/KPIs: Business metrics to optimize or monitor; success thresholds.
- Context constraints: Regulatory, operational, domain-specific rules.
- Preferred environment: Python/R/SQL; whether you can share sample data (schema + 10–50 rows).
Standard analysis workflow
1. Problem framing
- Define hypotheses, metrics, unit of analysis, cohorts/segments, and time windows.
- Identify confounders and potential sources of bias.
2. Data audit and preprocessing
- Validate schema, types, uniqueness, referential integrity, and time coverage.
- Handle missingness (MCAR/MAR/MNAR) using appropriate strategies: listwise deletion (if MCAR and small), simple imputation (median/mode), multiple imputation (MICE), or model-based imputation; document the rationale.
- Detect and treat outliers using robust statistics (IQR/median absolute deviation) or domain rules; consider winsorization and robust models (see the sketch after this list).
- Address duplicates, inconsistent labels, and timezone issues; standardize units and encodings.
- Prevent target leakage by ensuring features are available at prediction time.
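A minimal pandas sketch of the missingness audit and IQR-based outlier check described above (the `amount` column, values, and thresholds are hypothetical placeholders):

```python
import numpy as np
import pandas as pd

def audit_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and rates, sorted by severity."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing / len(df),
    }).sort_values("pct_missing", ascending=False)

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy data with a hypothetical 'amount' column.
df = pd.DataFrame({"amount": [10, 12, 11, 9, 13, 250, np.nan, 10]})
print(audit_missingness(df))

amounts = df["amount"].dropna()
print(amounts[iqr_outlier_mask(amounts)])  # the 250 value is flagged

# Simple median imputation; under MAR patterns, prefer multiple imputation (MICE).
df["amount"] = df["amount"].fillna(df["amount"].median())
```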
3. Exploratory data analysis (EDA)
- Univariate: distributions, tails, missingness maps.
- Bivariate/multivariate: correlations (Pearson/Spearman), mutual information, pivoted group summaries, pairwise plots.
- Temporal: trend/seasonality, stationarity diagnostics (ADF/KPSS), lag features, cohort retention curves (see the sketch after this list).
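A compact sketch of two of these checks (a rank-based Spearman correlation and an ADF stationarity test) on synthetic data; statsmodels is assumed to be available:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 0.6 * df["x"] + rng.normal(scale=0.5, size=n)

# Spearman correlation is robust to monotone nonlinearity and outliers.
print(df.corr(method="spearman"))

# ADF test: the null hypothesis is a unit root (non-stationary series).
random_walk = np.cumsum(rng.normal(size=n))
adf_stat, p_value, *_ = adfuller(random_walk)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
```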
4. Statistical inference
- Variable-type and design-appropriate tests:
- Two-group numeric: t-test (if normal, equal variances) or Welch’s t-test; Mann–Whitney U if non-normal.
- Multi-group numeric: ANOVA or Kruskal–Wallis; post-hoc Tukey/Holm.
- Categorical associations: chi-square or Fisher’s exact (small counts).
- Correlation: Pearson (linear, normal), Spearman (monotonic), partial correlations controlling for confounders.
- Regression: linear/logistic/Poisson/negative binomial; regularization (L1/L2); robust SEs; check assumptions (linearity, homoscedasticity, multicollinearity, residual diagnostics).
- Multiple testing control via FDR (Benjamini–Hochberg) or Bonferroni where appropriate.
- Report effect sizes and confidence intervals, not just p-values (illustrated in the sketch below).
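To illustrate the two-group case, a minimal sketch of Welch's t-test with a Cohen's d effect size and Benjamini–Hochberg correction over a hypothetical family of p-values (SciPy and statsmodels assumed):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treated = rng.normal(loc=10.4, scale=2.5, size=480)

# Welch's t-test does not assume equal variances.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

# Cohen's d with a simple pooled standard deviation as the effect size.
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")

# Benjamini-Hochberg FDR control across a (hypothetical) family of tests.
p_values = [p_value, 0.003, 0.20, 0.04]
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(np.round(p_values, 4), np.round(p_adj, 4), reject)))
```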
5. Modeling (if predictive or forecasting)
- Baselines: simple mean (regression) or class-prevalence/plain logistic baseline (classification); naive or seasonal naive for time series.
- Feature engineering: domain-driven features, interactions, lags/rolling stats; avoid leakage.
- Models:
- Classification: logistic regression, tree-based (RF/GBM), calibrated probabilities (Platt/Isotonic).
- Regression: linear/regularized, tree-based (GBM/XGBoost/LightGBM).
- Time series: ARIMA/SARIMA, ETS; for complex seasonality consider TBATS; cross-validated ML with time-aware splits.
- Panel/causal: fixed/random effects, difference-in-differences, synthetic controls; check parallel trends.
- Evaluation:
- Classification: ROC-AUC, PR-AUC (for imbalance), log-loss, precision/recall/F1, calibration (reliability curves), decision curves (see the sketch after this list).
- Regression: MAE, RMSE, MAPE/sMAPE (mind zero-valued targets).
- Forecasting: sMAPE, MASE, rolling-origin validation.
- Interpretability: SHAP/feature importance, partial dependence; stability across folds.
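As a sketch of the baseline-versus-model comparison with calibrated probabilities and the classification metrics listed above (synthetic data; scikit-learn assumed):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic classification problem (about 10% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Simple baseline: regularized logistic regression.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Tree-based model with isotonic calibration of predicted probabilities.
gbm = CalibratedClassifierCV(HistGradientBoostingClassifier(random_state=0),
                             method="isotonic", cv=5).fit(X_train, y_train)

for name, model in [("logistic baseline", baseline), ("calibrated GBM", gbm)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}, "
          f"Brier = {brier_score_loss(y_test, proba):.3f}")
```

For time series, the same comparison would use time-aware splits (e.g., scikit-learn's TimeSeriesSplit) rather than a random train/test split, to avoid leakage from the future.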
6. Visualization
- Choose charts aligned with data type and message: distributions (histograms/ECDF/box/violin), relationships (scatter with CI/binned heatmaps), time series (decomposition, anomaly overlays), categorical comparisons (bar/point-range with CIs).
- Always include axis units, uncertainty bands where relevant, and clear labeling of cohorts/time windows (see the example below).
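For example, a minimal matplotlib sketch of a daily metric with a 95% confidence band and labeled axes (synthetic data; the metric and units are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(60)
# Hypothetical daily metric: slow upward trend plus noise, 200 replicate series.
samples = 50 + 0.2 * days + rng.normal(scale=3, size=(200, days.size))
mean = samples.mean(axis=0)
ci = 1.96 * samples.std(axis=0, ddof=1) / np.sqrt(samples.shape[0])

fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(days, mean, label="daily mean")
ax.fill_between(days, mean - ci, mean + ci, alpha=0.3, label="95% CI")
ax.set_xlabel("Day in observation window")
ax.set_ylabel("Metric value (units)")
ax.legend()
fig.tight_layout()
plt.show()
```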
7. Results interpretation and recommendations
- Summarize key findings with quantified effects and uncertainty.
- Identify limitations, assumptions, sensitivity analyses, and robustness checks.
- Translate findings into actionable steps and expected impact on KPIs.
Quality and bias considerations
- Confounding: Include known covariates; stratify or adjust via regression/propensity methods.
- Selection bias: Verify sampling frame; compare sample vs population.
- Measurement error: Assess reliability of key variables; conduct sensitivity analyses.
- Data drift: For ongoing models, monitor feature distributions and performance over time (a drift-metric sketch follows this list).
- Reproducibility: Version data, code, parameters; provide seeds and environment details.
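One common way to quantify feature drift is the Population Stability Index (PSI); below is a minimal sketch, with illustrative binning and a conventional (not universal) alert threshold noted in a comment:

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference (e.g., training) sample and a current sample of one feature."""
    # Bin edges come from the reference distribution (deciles by default).
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero and log(0).
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
shifted = rng.normal(0.3, 1.2, 10_000)    # production values with drift
psi = population_stability_index(baseline, shifted)
print(f"PSI = {psi:.3f}")  # values above ~0.25 are often treated as significant drift
```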
Deliverables (tailored to your task)
- Cleaned dataset summary and data quality report.
- EDA visuals and statistical test results with effect sizes and CIs.
- Model artifacts (if applicable): training pipeline, metrics, interpretability outputs.
- Executive summary: findings, limitations, and recommended actions.
- Reproducible notebook/script with instructions.
To begin, please provide:
- The exact question(s) you want answered and the decision context.
- Schema or sample rows, and a brief data dictionary.
- Time range, granularity, and any inclusion/exclusion rules.
- Target variable definition (if modeling) and key KPIs.
- Any known constraints, experimental setup, or suspected issues.
- Preferred tools (Python/R/SQL) and any formatting requirements for outputs.
Once I have these, I will propose a concise analysis plan and proceed with rigorous, verifiable results.