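For a data problem supplied by the user, this prompt generates structured recommendations for data-cleaning steps covering validation, cleaning, analysis, and monitoring. It aims to ensure data accuracy, completeness, and reliability, suits data preprocessing and quality management tasks, and improves both the efficiency of data analysis and trust in its results.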
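Define unified standards and the target table schema
Encoding and file preprocessing
String cleaning and standardization
Type conversion and numeric precision
Date parsing and time zone normalization
Currency and amount normalization
Deduplication strategy (within and across tables)
Identifier validation and repair
Customer linkage and missing-value handling
Address and postal code standardization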
{
  "context": {
    "dataset": {
      "rows": 1200000,
      "columns": 26,
      "core_fields": [
        "lead_id", "name", "email", "phone", "country_code", "consent_flag", "consent_ts",
        "source_channel", "utm_campaign", "utm_source", "page_url", "event_ts", "session_id", "agent_notes"
      ],
      "sources": ["website_form", "ads_export", "call_center_api"],
      "formats": ["CSV", "XLSX", "API_JSON"],
      "languages": ["zh", "en", "es"],
      "update_mode": ["daily_batch", "hourly_increment"],
      "pii_requirement": ["normalize", "mask"]
    },
    "assumptions": [
      "Time zone canonicalization target is UTC.",
      "Lead status may exist in source data even if not listed as a core field; resolve if present.",
      "Fuzzy matching will be conservative to avoid false merges.",
      "Consent must not be inferred; missing consent is treated as not granted."
    ]
  },
  "controlled_vocabularies": {
    "source_channel": {
      "allowed": ["ads", "web_form", "call_center", "email", "social", "affiliate", "unknown"],
      "mapping_examples": { "ad": "ads", "Ads": "ads", "ADS": "ads", "web": "web_form", "call": "call_center" }
    },
    "country_code": { "standard": "ISO_3166_1_alpha2", "case": "upper" },
    "lead_status": {
      "allowed": ["new", "qualified", "won", "lost", "disqualified"],
      "priority": ["won", "qualified", "new"]
    }
  },
  "standardized_output_schema": [
    "lead_id", "name_raw", "name_norm", "name_ascii",
    "email_raw", "email_norm", "email_domain", "email_domain_valid_flag", "email_canonical_key",
    "phone_raw", "phone_e164", "phone_valid_flag", "phone_country_inferred", "country_code_iso2",
    "consent_flag_std", "consent_ts_utc", "consent_valid_flag",
    "source_channel_raw", "source_channel_std",
    "utm_source_raw", "utm_source_std", "utm_campaign_raw", "utm_campaign_std", "utm_valid_flag",
    "page_url_norm", "event_ts_raw", "event_ts_utc", "timezone_source", "session_id",
    "agent_notes_sanitized", "lead_status_std", "dedupe_cluster_id", "master_lead_id",
    "pii_mask_email", "pii_mask_phone", "quality_issue_flags"
  ],
  "pipeline": [
    {
      "id": "ingest_01",
      "name": "Unified Ingestion & Encoding Normalization",
      "objective": "Load all sources into a common staging area with consistent encoding.",
      "operations": [
        "Convert CSV/XLSX/JSON to a unified table format.",
        "Detect and enforce UTF-8 (normalize BOM, reject invalid sequences).",
        "Persist raw columns with a raw suffix."
      ],
      "outputs": ["stg_leads_raw"]
    },
    {
      "id": "schema_02",
      "name": "Schema Alignment & Typing",
      "objective": "Map disparate schemas to canonical fields and apply types.",
      "operations": [
        "Field mapping from source-specific names to core fields.",
        "Type casting: email/name/page_url to string; event_ts/consent_ts to timestamp; consent_flag to boolean; country_code/source_channel to string.",
        "Create missing fields with nulls where absent."
      ],
      "outputs": ["stg_leads_aligned"]
    },
    {
      "id": "unicode_03",
      "name": "Unicode Normalization",
      "objective": "Normalize multi-language text to consistent form.",
      "operations": [
        "Apply Unicode NFKC to name, email, page_url, agent_notes.",
        "Trim leading/trailing whitespace; collapse internal multiple spaces."
      ],
      "outputs": ["stg_leads_unicode"]
    },
    {
      "id": "email_04",
      "name": "Email Standardization & Validation",
      "objective": "Resolve case, aliases, illegal domains, and produce a canonical key.",
      "rules": {
        "normalization": [
          "Lowercase entire email.",
          "Split local-part and domain; trim spaces.",
          "Remove plus-tags in local-part (e.g., user+tag@example.com → user@example.com).",
          "For gmail/googlemail: remove dots in local-part; map domain googlemail.com → gmail.com.",
          "Convert internationalized domains to punycode before validation."
        ],
        "validation": [
          "RFC 5322 pattern check (conservative).",
          "Domain validation via public suffix list; flag disposable/invalid TLDs.",
          "Optional: MX lookup (record flag only, do not block pipeline)."
        ],
        "canonical_key": "email_canonical_key = domain + '|' + normalized local-part"
      },
      "outputs": ["email_norm", "email_domain", "email_domain_valid_flag", "email_canonical_key"]
    },
    {
      "id": "phone_05",
      "name": "Phone Normalization to E.164",
      "objective": "Normalize diverse formats, apply country, and validate.",
      "rules": {
        "cleaning": [
          "Remove spaces, hyphens, parentheses, dots, and non-digits except leading '+'.",
          "If country_code present and phone lacks '+', prepend corresponding country calling code."
        ],
        "validation": [
          "Parse using E.164 rules (e.g., libphonenumber class of validators).",
          "Flag invalid lengths or impossible numbers."
        ],
        "inference": [
          "If country_code missing, infer from phone country or page_url TLD when confidence is high.",
          "Record inference source; do not overwrite explicit country_code."
        ]
      },
      "outputs": ["phone_e164", "phone_valid_flag", "phone_country_inferred"]
    },
    {
      "id": "country_06",
      "name": "Country Code Standardization",
      "objective": "Ensure ISO alpha-2 uppercase and reconcile with phone inference.",
      "rules": [
        "Uppercase and validate against ISO 3166-1 alpha-2.",
        "If missing, use phone_country_inferred; else leave null.",
        "Flag mismatch between explicit country_code and phone country for review."
      ],
      "outputs": ["country_code_iso2"]
    },
    {
      "id": "consent_07",
      "name": "Consent Normalization",
      "objective": "Standardize consent flag and timestamp safely.",
      "rules": [
        "consent_flag_std = true only if explicit true from source; otherwise false.",
        "Parse consent_ts with source timezone; convert to UTC.",
        "consent_valid_flag = true if consent_flag_std=true and consent_ts_utc not null; else false."
      ],
      "outputs": ["consent_flag_std", "consent_ts_utc", "consent_valid_flag"]
    },
    {
      "id": "utm_08",
      "name": "UTM Recovery & Canonicalization",
      "objective": "Fix missing/misspelled UTM parameters and standardize.",
      "operations": [
        "Parse page_url query for utm_source and utm_campaign if fields are null.",
        "Lowercase, trim, and percent-decode UTM values.",
        "Correct common misspellings (e.g., utm_srouce→utm_source; cmpaign→campaign).",
        "Restrict to alphanumerics, '-', '_', and '.'; replace other chars with '_'.",
        "Set utm_valid_flag=false if both utm_source_std and utm_campaign_std are null."
      ],
      "outputs": ["utm_source_std", "utm_campaign_std", "utm_valid_flag"]
    },
    {
      "id": "channel_09",
      "name": "Source Channel Standardization",
      "objective": "Map scattered enumerations to controlled vocabulary.",
      "rules": [
        "Lowercase, trim.",
        "Apply mapping table; default to 'unknown' if unmapped."
      ],
      "outputs": ["source_channel_std"]
    },
    {
      "id": "time_10",
      "name": "Event Timestamp & Timezone Reconciliation",
      "objective": "Resolve inconsistencies and normalize to UTC.",
      "rules": [
        "Detect timezone from explicit offset in event_ts if present.",
        "If absent, derive timezone_source by precedence: session metadata > page_url (tz or locale) > country_code.",
        "Convert event_ts to event_ts_utc using derived timezone.",
        "Flag records where derivation is low confidence or contradicts source."
      ],
      "outputs": ["event_ts_utc", "timezone_source"]
    },
    {
      "id": "name_11",
      "name": "Name Normalization & Encoding Unification",
      "objective": "Normalize multi-language names without losing original form.",
      "operations": [
        "name_norm: NFKC + trim + collapse spaces.",
        "name_ascii: transliterate diacritics for Latin scripts; preserve CJK as-is.",
        "Optionally split into heuristics (given/family) only for downstream modeling; keep single name field canonical."
      ],
      "outputs": ["name_norm", "name_ascii"]
    },
    {
      "id": "status_12",
      "name": "Lead Status Resolution",
      "objective": "Resolve overlapping statuses (New, Qualified, Won).",
      "rules": [
        "Normalize to lowercase and map to allowed statuses.",
        "For multiple statuses across duplicates, pick highest priority: won > qualified > new.",
        "Record source of status and resolution action."
      ],
      "outputs": ["lead_status_std"]
    },
    {
      "id": "dedupe_13",
      "name": "Cross-Source Deduplication",
      "objective": "Cluster duplicate leads across CRM, ads, and call center.",
      "method": {
        "blocking": [
          "Block A: email_canonical_key exact match.",
          "Block B: phone_e164 exact match.",
          "Block C: name_norm + country_code_iso2 with phonetic key (for records lacking email/phone)."
        ],
        "similarity": [
          "Email near-duplicate: same domain AND Levenshtein distance ≤1 on local-part after alias normalization.",
          "Name similarity: Jaro-Winkler ≥0.92 on name_norm.",
          "Temporal proximity: event_ts_utc within 30 days increases match confidence."
        ],
        "merge_thresholds": [
          "Auto-merge if (email exact OR phone exact) OR (email near-duplicate AND name similarity AND same country).",
          "Manual review queue if similarity scores are borderline."
        ],
        "survivorship": [
          "Primary key: master_lead_id = earliest valid lead_id or generated UUID.",
          "Prefer record with consent_valid_flag=true.",
          "Prefer Won > Qualified > New for lead_status_std.",
          "For each field, choose the most complete and most recent non-null value; keep provenance."
        ]
      },
      "outputs": ["dedupe_cluster_id", "master_lead_id"]
    },
    {
      "id": "notes_14",
      "name": "Agent Notes Sanitization",
      "objective": "Remove PII and normalize text.",
      "operations": [
        "Redact patterns resembling emails and phone numbers.",
        "Normalize whitespace and control characters."
      ],
      "outputs": ["agent_notes_sanitized"]
    },
    {
      "id": "pii_15",
      "name": "PII Masking",
      "objective": "Protect sensitive data while preserving utility.",
      "rules": [
        "pii_mask_email: first char + '@' + domain (e.g., j@example.com).",
        "pii_mask_phone: show last 4 digits; mask others (e.g., ******1234).",
        "Retain full normalized values in restricted tables; expose masked values in analytics outputs."
      ],
      "outputs": ["pii_mask_email", "pii_mask_phone"]
    },
    {
      "id": "quality_16",
      "name": "Quality Flags & Issue Cataloging",
      "objective": "Annotate records with detected issues for monitoring and remediation.",
      "operations": [
        "quality_issue_flags: JSON array of codes (e.g., EMAIL_INVALID_DOMAIN, PHONE_INVALID, CONSENT_MISSING_TS, UTM_MISSING, TZ_DERIVED_LOW_CONF, STATUS_CONFLICT)."
      ],
      "outputs": ["quality_issue_flags"]
    },
    {
      "id": "publish_17",
      "name": "Publish Standardized Outputs",
      "objective": "Produce curated tables for downstream use.",
      "outputs": [
        "cur_leads_master (one row per master_lead_id)",
        "cur_leads_touchpoints (normalized events with event_ts_utc)",
        "dq_issues_log (record-level issues)"
      ]
    }
  ],
  "validation_rules": {
    "completeness": [
      "lead_id not null in staging; master_lead_id not null in curated.",
      "event_ts_utc not null for publishable touchpoints.",
      "If consent_flag_std=true then consent_ts_utc must not be null."
    ],
    "validity": [
      "email_domain_valid_flag must be true for email_norm to be considered valid.",
      "phone_valid_flag must be true for phone_e164 to be considered valid.",
      "country_code_iso2 must be in ISO list when present.",
      "source_channel_std must be in controlled vocabulary."
    ],
    "consistency": [
      "Timezone_source derivation should be consistent with country_code when confidence is high; otherwise flag.",
      "lead_status_std must reflect survivorship priority after dedupe."
    ],
    "uniqueness": [
      "master_lead_id unique in cur_leads_master.",
      "session_id uniqueness scoped per master_lead_id; duplicates indicate repeated events."
    ]
  },
  "monitoring": {
    "cadence": ["hourly_increment", "daily_batch"],
    "metrics": [
      "Email invalid rate (%)",
      "Phone invalid rate (%)",
      "Consent missing timestamp count",
      "UTM missing/invalid rate (%)",
      "Timezone derivation low-confidence rate (%)",
      "Dedup merge rate (%) and manual review queue size",
      "Source_channel unmapped rate (%)",
      "PII masking coverage (%)"
    ],
    "thresholds_examples": {
      "email_invalid_rate_pct": "<= 5%",
      "phone_invalid_rate_pct": "<= 8%",
      "utm_invalid_rate_pct": "<= 10%",
      "timezone_low_conf_pct": "<= 3%"
    },
    "alerting": [
      "Trigger alerts when thresholds are exceeded.",
      "Log top offending sources (by source_channel_raw and source system)."
    ]
  },
  "error_handling": {
    "quarantine_buckets": [
      "EMAIL_ILLEGAL_DOMAIN", "PHONE_UNPARSABLE", "CONSENT_FLAG_TRUE_TS_NULL",
      "UTM_IRRECOVERABLE", "TZ_CONFLICT", "STATUS_AMBIGUOUS"
    ],
    "actions": [
      "Exclude quarantined records from master publish until resolved.",
      "Provide remediation reports to source owners."
    ]
  },
  "implementation_notes": {
    "scalability": [
      "Use distributed processing for 1.2M rows (e.g., Spark or equivalent) for dedupe clustering.",
      "Index on email_canonical_key and phone_e164 for blocking joins."
    ],
    "provenance": [
      "Track source_system, ingestion_ts, and field-level lineage.",
      "Maintain pre-merge snapshots for auditability."
    ],
    "security": [
      "Restrict access to unmasked PII tables.",
      "Encrypt at rest and in transit."
    ]
  }
}
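The email_04 and pii_15 rules in the plan above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`normalize_email`, `mask_email`, `mask_phone`) are hypothetical, the RFC 5322 check is reduced to a single "exactly one @" test, and the public suffix and MX checks are left out entirely. Punycode conversion uses Python's built-in `idna` codec.

```python
import unicodedata
from typing import Optional


def normalize_email(raw: str) -> Optional[dict]:
    """Apply the email_04 normalization rules; return None if the shape is unusable."""
    # NFKC (per unicode_03), trim, and lowercase the whole address.
    text = unicodedata.normalize("NFKC", raw).strip().lower()
    if text.count("@") != 1:  # simplified stand-in for the RFC 5322 check
        return None
    local, domain = (part.strip() for part in text.split("@"))
    local = local.split("+", 1)[0]  # drop plus-tags: user+tag -> user
    if domain == "googlemail.com":
        domain = "gmail.com"
    if domain == "gmail.com":
        local = local.replace(".", "")  # dots are ignored for gmail addresses
    if not local or not domain:
        return None
    try:
        # Internationalized domains become punycode before any validation.
        domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    return {
        "email_norm": f"{local}@{domain}",
        "email_domain": domain,
        "email_canonical_key": f"{domain}|{local}",
    }


def mask_email(email_norm: str) -> str:
    """pii_mask_email: first character of the local part + '@' + domain."""
    local, domain = email_norm.split("@")
    return f"{local[0]}@{domain}"


def mask_phone(phone_e164: str) -> str:
    """pii_mask_phone: keep the last 4 digits and mask the rest."""
    digits = "".join(ch for ch in phone_e164 if ch.isdigit())
    return "*" * max(0, len(digits) - 4) + digits[-4:]


result = normalize_email("  John.Doe+promo@GoogleMail.com ")
print(result["email_canonical_key"])        # gmail.com|johndoe
print(mask_email(result["email_norm"]))     # j@gmail.com
print(mask_phone("+14155551234"))           # *******1234
```

Building the key as `domain|local` keeps addresses from the same domain adjacent when sorted, which may help the Block A exact-match join in dedupe_13. Real phone validation would still need a libphonenumber-class parser, as the plan notes.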
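| Order | Data issue / goal | Detection rule | Cleaning / transformation | Validation criterion | Notes / implementation points |
|---|---|---|---|---|---|
| 1 | Unified encoding and basic parsing | Detect non-UTF-8 byte sequences; confirm CSV delimiters and escaping; check that JSON is parseable | Transcode everything to UTF-8; for unparseable JSON, keep the original text in payload_json_raw and set error_code; repair common illegal characters (replace with U+FFFD) | Every text field is valid UTF-8; JSON parse rate ≥99.9%, or per the source's quality target | For edge-case CSVs, use an explicit locale/encoding (e.g., CP1252→UTF-8); keep the raw payload for audit |
| 2 | JSON flattening and schema unification (firmware version differences) | Match firmware_ver against the payload_json field-name mapping table; detect field renames (e.g., temp↔temperature) | Unify column names via the schema-version mapping (temp, humidity, battery, etc.); add source_schema_version; flatten nested JSON into flat columns | Fields with the same meaning map to one standard column; unknown fields go to payload_json_extra | Maintain a "version → field mapping" dictionary; keep unmappable fields but exclude them from core metrics |
| 3 | Device ID normalization | Detect case differences, leading zeros, and non-alphanumeric characters | Generate canonical_device_id: strip whitespace, unify case (uppercase recommended), remove nonstandard separators, standardize length (e.g., left-pad with zeros to the agreed length); keep original_device_id | canonical_device_id is unique and stable; a mapping table keeps the change auditable | Handle leading zeros with care: if their business meaning is unclear, pad to the unified length rather than stripping them |
| 4 | Time zone parsing and UTC standardization | Sources mix UTC and local time zones; when the zone is missing, infer it from gateway metadata | Parse into timezone-aware datetimes; convert to ts_utc (UTC); add tz_source, tz_offset_minutes, ts_source (ingest/gateway/device) | ts_utc values are comparable; tz_offset matches the source | Records whose time zone cannot be inferred: ts_utc=null, is_ts_unusable=true |
| 5 | Missing or out-of-order timestamps | ts missing or non-monotonic; out of order across sources | Missing ts: approximate with gateway_ingest_time and set ts_utc_imputed=true; when sorting, partition by device and order by ts_utc (or ingest_time); keep the sequence number if one exists | Events are emitted in the correct order within each device; imputed times never overwrite the original field | Use ts_utc_imputed only in analysis views; keep the original ts |
| 6 | Deduplication of same-second events | Same device, same second, multiple records; payloads identical or slightly different | Define the dedupe key: canonical_device_id + floor(ts_utc to the second) + payload_hash; prefer the record with the newer ingest_time or higher signal_strength; set is_duplicate | After deduplication, at most one record per second (per policy); duplicate flags are accurate | For devices that genuinely emit multiple events, relax the rule to "dedupe only when payloads are identical"; make this configurable |
| 7 | Mixed temperature units (°C/°F) | Value range contradicts region/firmware hints; e.g., a value >60 reported from a region that uses °F | Identify the unit by precedence: device/region metadata > firmware version indication > value-range heuristic; output a unified temp_c; conversion: temp_c=(temp_f−32)/1.8; add temp_unit_source | temp_c falls within a plausible range (device spec or empirical distribution) | The value-range heuristic alone risks misclassification; for uncertain records keep the original value and flag unit_ambiguous |
| 8 | Humidity out of range | humidity<0 or >100 (relative humidity, %) | Set out-of-range values to null and flag humidity_out_of_range=true; if a known calibration bias exists, apply the correction before checking the range | 0≤humidity≤100 or null | If a device reports something other than % (e.g., absolute humidity), distinguish it in the version mapping |
| 9 | Implausible battery-level jumps | battery changes sharply in an instant; out of range | Cleaning: constrain the range to [0,100]; compute Δbattery against a per-device rolling quantile threshold (e.g., 99.9th percentile) to set battery_jump_flag; optionally provide a smoothed view (original values unchanged) | battery∈[0,100]; abnormal jumps are flagged correctly | Do not assume monotonicity (the device may be charging); thresholds adapt to each device's distribution rather than being fixed |
| 10 | Geolocation standardization and missing values | lat/lng missing, or in DMS or mixed formats; out of range | Unify coordinates to WGS84 decimal degrees; parse DMS→decimal degrees; validate lat∈[−90,90], lng∈[−180,180]; standardize location_precision (meters) | Coordinates are valid, or null with geo_missing=true | When missing, LOCF may be used (with a maximum age, e.g., ≤24h) to produce lat_imputed/lng_imputed, flagged as such |
| 11 | Noise spikes and drift | Short-lived extremes (spikes); slow offsets (drift) | Spike detection: rolling-window Hampel/MAD or z-score; drift: long-window linear trend or low-frequency filtering; add spike/drift to quality_flags | Raw values are never overwritten; analysis views provide smoothed columns (e.g., temp_c_smoothed) | Typical parameter: a window of k times the device's median sampling interval (k≈10); thresholds need per-device tuning |
| 12 | Range checks for signal strength and other numeric fields | Anomalies in signal_strength, location_precision, etc. | Define device-level or general ranges and units; set out-of-range values to null and flag them; unify units (e.g., dBm) | All numeric fields use consistent units and are within range or null | Keep raw values in *_raw columns for traceback |
| 13 | Normalizing uneven sampling frequency (optional) | Uneven distribution affects aggregation and modeling | Provide an analysis view resampled per device to a target granularity (e.g., 1 min), using forward fill or median interpolation; leave the raw detail table unchanged | Gaps in the resampled view are filled sensibly and interpolated values are flagged imputed=true | Avoid using interpolated values in alerting scenarios |
| 14 | Partitioning and storage consistency | Partitioning differs between object storage and the time-series database | Rewrite partitions by the date/hour of ts_utc; unify cross-time-zone data into UTC partitions; generate partition_key | After synchronization, queries by UTC partition hit the right partitions more often | Late-arriving data: allow backfill within T+N hours and trigger repartitioning |
| 15 | Audit and data quality monitoring fields | Lack of traceability | Add row-level audit fields: source_system, ingest_time, processing_version, quality_flags, error_code; build DQ metrics (completeness, validity, duplicate rate, etc.) | DQ metrics can be generated per day and per device and can trigger alerts | Example metrics: ts missing rate, humidity out-of-range rate, duplicate event rate, JSON parse failure rate |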
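
A few of the rows above (6, 7, 8, and 11) are compact enough to sketch in standard-library Python. The function names, the `half_window` and `n_sigmas` defaults, and the use of 60 as the °F heuristic cutoff are illustrative assumptions; the table only gives 60 as an example and says thresholds need per-device tuning. The Hampel window here counts samples rather than multiples of the sampling interval.

```python
import hashlib
import json
import statistics
from datetime import datetime, timezone
from typing import List, Optional, Sequence, Tuple


def resolve_temp_unit(metadata_unit: Optional[str], firmware_unit: Optional[str],
                      value: float, f_threshold: float = 60.0) -> Tuple[str, str]:
    """Row 7: pick a unit by precedence metadata > firmware > value-range heuristic."""
    if metadata_unit in ("C", "F"):
        return metadata_unit, "metadata"
    if firmware_unit in ("C", "F"):
        return firmware_unit, "firmware"
    # Heuristic only: callers should flag such records as unit_ambiguous.
    return ("F" if value > f_threshold else "C"), "value_heuristic"


def to_celsius(value: float, unit: str) -> float:
    """Row 7: temp_c = (temp_f - 32) / 1.8; Celsius values pass through."""
    return (value - 32.0) / 1.8 if unit == "F" else value


def clean_humidity(value: Optional[float]) -> Tuple[Optional[float], bool]:
    """Row 8: return (humidity, humidity_out_of_range); out-of-range becomes None."""
    if value is None:
        return None, False
    if 0.0 <= value <= 100.0:
        return value, False
    return None, True


def dedupe_key(device_id: str, ts_utc: datetime, payload: dict) -> Tuple[str, datetime, str]:
    """Row 6: canonical_device_id + ts_utc floored to the second + payload hash."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return device_id, ts_utc.replace(microsecond=0), hashlib.sha256(encoded).hexdigest()


def hampel_flags(values: Sequence[float], half_window: int = 3,
                 n_sigmas: float = 3.0) -> List[bool]:
    """Row 11: flag points far from the rolling median, scaled by the window MAD."""
    flags = []
    for i, x in enumerate(values):
        window = values[max(0, i - half_window): i + half_window + 1]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        # 1.4826 scales the MAD to a standard deviation under normality.
        flags.append(abs(x - med) > n_sigmas * 1.4826 * mad)
    return flags


readings = [20.0, 20.1, 19.9, 20.0, 35.0, 20.1, 20.0, 19.9, 20.0]
print(hampel_flags(readings))  # only the 35.0 reading is flagged
print(resolve_temp_unit(None, None, 75.0))  # ('F', 'value_heuristic')
```

Each function returns a flag or source label alongside the value instead of mutating the input, which matches the table's recurring advice to keep raw values and record how each cleaned value was derived.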
Implementation notes (scale-related):
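Built around a "draft data-cleaning recommendations" output and aimed at business, analytics, and operations teams, this prompt quickly generates a cleaning action list for a specific data problem. It helps standardize data processing workflows, reduce the risk of missed or incorrect cleaning, improve the trustworthiness and usability of data, and shorten the time from investigation to launch. It supports multilingual output for cross-region collaboration, and its clear, actionable recommendations let trial users see the efficiency and quality gains directly, moving teams from trial to paid use.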
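Quickly devise cleaning plans for different data sources, verify the effect of corrections, and produce reusable guidelines, improving the reliability of reports and insights.
Standardize dimension definitions and metric validation workflows, reduce dashboard errors, and set up continuous monitoring and alerting to keep displayed data quality stable.
Clean up duplicates and gaps in ad-delivery and CRM data, refine user segmentation and attribution, and improve the accuracy of campaign evaluation and the efficiency of budget use.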
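Copy the prompt generated from the template into the chat application you normally use (such as ChatGPT or Claude) and start the conversation directly, with no extra development. This suits individuals who want a quick trial and lightweight use.
Turn the prompt template into an API: your program can modify the template parameters freely and call it through the interface for automation and batch processing. This suits developers integrating it into applications and business systems.
Configure the corresponding server address in your MCP client so that your AI application calls the prompt template automatically. This suits advanced users and team collaboration, letting prompts move seamlessly between AI tools.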