Popular roles are more than a source of inspiration; they are productivity assistants. With carefully curated role prompts you can quickly generate high-quality content, spark new ideas, and find the solution that best fits your needs. Creation made easier, value made more direct!
We keep the role library updated for different user needs, so you can always find a suitable starting point for inspiration.
This prompt walks users through a complete professional workflow, from data description and method selection to clustering implementation and result interpretation, with visualization guidance along the way. It supports efficient cluster analysis and helps keep data-science projects accurate and interpretable.
The plan below systematically covers concepts, implementation, method selection, result interpretation and visualization, performance optimization, and presentation best practices, to help you segment e-commerce users into 5 clusters of similar behavior with K-Means. A runnable code skeleton (Python/scikit-learn) is given at the end, covering your stated preprocessing strategy, PCA, stratified-sample tuning, and the 5-cluster output.
I. What clustering is and why it matters
II. Implementation workflow (aligned with your data)
III. Clustering methods and when to use them (K=5 is the primary choice, but know the alternatives)
IV. List of identified clusters (illustrative business personas; the final output depends on your actual data). Based on the feature set you provided, common interpretable K=5 segment archetypes are as follows (illustrative names and directions, described by whether each standardized feature mean is above or below the global mean):
V. Key characteristics of each cluster (how to derive them systematically from the data)
VI. Brief description of the clustering method used
VII. Cluster visualization (guidance and reusable code)
VIII. Optimizing performance and handling challenges
IX. Output and presentation best practices
X. Reference code (runnable after adjusting the feature names). Notes:
Python example (sklearn/umap):
Code snippet:
# Environment and tools
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import umap.umap_ as umap
import matplotlib.pyplot as plt
import seaborn as sns
# Custom 99th-percentile clipper
class ClipP99(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.p99_ = None

    def fit(self, X, y=None):
        self.p99_ = np.nanpercentile(X, 99, axis=0)
        return self

    def transform(self, X):
        return np.minimum(X, self.p99_)
# Core feature definitions (example)
num_features = ['R_30', 'F_180', 'M_180', 'AOV_7d', 'refund_rate', 'session_pages_90d',
                'days_since_first_purchase',
                'cat_share_1', 'cat_share_2', 'cat_share_3', 'cat_share_4', 'cat_share_5']
cat_features = ['region']  # restricted to the Top-10 regions during one-hot encoding

# log1p transform for M_180
def log1p_col(x):
    return np.log1p(x)

log_transformer = ColumnTransformer(
    transformers=[('logM', FunctionTransformer(log1p_col), [num_features.index('M_180')])],
    remainder='passthrough'
)
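# The preprocessor below references numeric_pipeline, which the original snippet leaves
# undefined. A minimal sketch consistent with the stated strategy (median imputation,
# P99 clipping, log1p on M_180, standardization); note remainder='passthrough' in
# log_transformer moves the M_180 column to the first output position:
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('clip', ClipP99()),
    ('log', log_transformer),
    ('scaler', StandardScaler())
])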
# Keep the Top-10 regions and map the rest to 'OTHER'
# (the original called an undefined get_top_k_series helper; value_counts is equivalent)
top10_regions = df['region'].value_counts().head(10).index
df['region_top'] = df['region'].where(df['region'].isin(top10_regions), 'OTHER')
cat_features = ['region_top']
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop=None))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, num_features),
        ('cat', categorical_pipeline, cat_features)
    ],
    remainder='drop'
)
# Full pipeline: preprocessing -> PCA -> KMeans
pipe = Pipeline(steps=[
    ('prep', preprocessor),
    ('pca', PCA(n_components=0.95, svd_solver='full', random_state=42)),
    ('kmeans', KMeans(n_clusters=5, init='k-means++', n_init='auto',
                      max_iter=400, tol=1e-4, random_state=42))
])
# Stratified sampling (by R/F/M quintiles)
df['_R_q'] = pd.qcut(df['R_30'], q=5, duplicates='drop')
df['_F_q'] = pd.qcut(df['F_180'], q=5, duplicates='drop')
df['_M_q'] = pd.qcut(df['M_180'], q=5, duplicates='drop')
df['_strata'] = df['_R_q'].astype(str) + '_' + df['_F_q'].astype(str) + '_' + df['_M_q'].astype(str)
sample_idx = df.groupby('_strata', group_keys=False).apply(
    lambda x: x.sample(frac=min(1.0, 10000 / len(df)), random_state=42)
).index
df_s = df.loc[sample_idx].copy()
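# evaluate_k_range is not defined in the original snippet; a minimal sketch, assuming
# it scans candidate K values on the stratified sample and reports internal metrics
# computed in the PCA space that KMeans actually sees:
from sklearn.base import clone

def evaluate_k_range(df_sample, k_range=range(3, 9)):
    rows = []
    X = df_sample[num_features + cat_features]
    for k in k_range:
        p = Pipeline(steps=[
            ('prep', clone(preprocessor)),
            ('pca', PCA(n_components=0.95, svd_solver='full', random_state=42)),
            ('kmeans', KMeans(n_clusters=k, n_init='auto', random_state=42))
        ])
        labels_k = p.fit_predict(X)
        Xt = p[:-1].transform(X)  # preprocessing + PCA output
        rows.append({'k': k,
                     'silhouette': silhouette_score(Xt, labels_k),
                     'davies_bouldin': davies_bouldin_score(Xt, labels_k),
                     'calinski_harabasz': calinski_harabasz_score(Xt, labels_k)})
    return pd.DataFrame(rows)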
metrics_df = evaluate_k_range(df_s)
print(metrics_df.sort_values('silhouette', ascending=False))
# Final K=5 training (on the full data)
X_full = df[num_features + cat_features]
pipe.fit(X_full)
labels = pipe.named_steps['kmeans'].labels_
df['cluster'] = labels
# Cluster profiles: standardized mean differences (z-scores)
z_inputs = df[num_features].copy()
z_inputs = (z_inputs - z_inputs.mean()) / z_inputs.std(ddof=0)
profile = z_inputs.join(df[['cluster']]).groupby('cluster').mean().T
print(profile)  # behavior-profile matrix (rows = features, columns = clusters)
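# Optional sketch (not in the original snippet): a heatmap of the profile matrix is a
# quick way to read each cluster's deviation from the global mean.
plt.figure(figsize=(8, 6))
sns.heatmap(profile, center=0, cmap='coolwarm')
plt.title('Cluster profiles (z-scored feature means)')
plt.tight_layout(); plt.show()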
# Project the PCA features (the space KMeans clusters in) to 2-D with UMAP.
# Xt_pca was not computed in the original snippet; it follows from the fitted pipeline:
Xt_pca = pipe[:-1].transform(X_full)
reducer = umap.UMAP(n_neighbors=30, min_dist=0.1, random_state=42)
Xt_umap = reducer.fit_transform(Xt_pca)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=Xt_umap[:, 0], y=Xt_umap[:, 1], hue=df['cluster'],
                palette='tab10', s=6, linewidth=0)
plt.title('UMAP of PCA features colored by KMeans clusters')
plt.show()
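# plot_radar (called below) is not defined in the original snippet; a minimal matplotlib
# sketch, assuming `profile` has features as rows and clusters as columns:
def plot_radar(profile_df):
    angles = np.linspace(0, 2 * np.pi, len(profile_df.index), endpoint=False)
    angles = np.concatenate([angles, angles[:1]])  # close the polygon
    fig, ax = plt.subplots(figsize=(7, 7), subplot_kw={'polar': True})
    for cluster in profile_df.columns:
        vals = profile_df[cluster].to_numpy()
        vals = np.concatenate([vals, vals[:1]])
        ax.plot(angles, vals, label=f'cluster {cluster}')
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(profile_df.index, fontsize=7)
    ax.legend(loc='upper right', bbox_to_anchor=(1.25, 1.05))
    plt.show()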
plot_radar(profile)
Output format (example)
Final reminders
The plan below centers on DBSCAN with density calibration and noise reduction, targeting the minute-level, multi-site IoT smart-building data you described. It provides a complete roadmap from concepts and implementation to interpretation and presentation, along with reusable code snippets and visualization suggestions.
I. What clustering is & why it matters
II. Implementation workflow (end to end)
Data alignment and cleaning (process site by site in batches for parallelism)
Feature engineering (your chosen 18 dimensions)
Scaling and weighting
Dimensionality reduction and the visualization base
DBSCAN parameter selection and scalable computation
Quality evaluation and the tuning loop (see the grid-search sketch below)
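The reference code in Section VIII fixes a single (eps, min_samples) pair; a minimal grid-search sketch for this tuning loop, assuming X_sample is a PCA-space subsample as in that section, might look like this:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_grid(X_sample, eps_grid, min_samples_grid):
    # Scan the grid on a subsample; score silhouette on non-noise points only.
    results = []
    for ms in min_samples_grid:
        for eps in eps_grid:
            labels = DBSCAN(eps=eps, min_samples=ms, n_jobs=-1).fit_predict(X_sample)
            mask = labels != -1
            n_clusters = len(set(labels[mask]))
            sil = (silhouette_score(X_sample[mask], labels[mask])
                   if n_clusters >= 2 else float('nan'))
            results.append({'eps': eps, 'min_samples': ms, 'n_clusters': n_clusters,
                            'noise_ratio': float(np.mean(~mask)), 'silhouette': sil})
    return results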
III. Method comparison and the choice for this project
IV. Output (example form)
List of identified clusters (6 "archetype clusters"; DBSCAN noise is labeled -1 and can be folded into an anomaly class when needed)
Key characteristics of each cluster (recommended presentation: median ± IQR; see the sketch after this list)
Brief description of the clustering method used
Cluster visualization (suggestions and code snippets)
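A minimal sketch of the median ± IQR profile recommended above, assuming df holds the 18 window-level features (feature_cols) plus the cluster_db label column produced in Section VIII:

grouped = df.groupby('cluster_db')[feature_cols]
med = grouped.median()
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
profile = med.round(3).astype(str) + ' ± ' + iqr.round(3).astype(str)
print(profile.T)  # rows = features, columns = clusters (-1 = noise)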
V. Practical tips for performance and challenges
VI. Validation and interpretation (aligning metrics with the business)
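Internal metrics for DBSCAN should exclude noise points; a minimal sketch, reusing X_pca and labels from Section VIII below (sample_size caps the cost of silhouette on large data):

from sklearn.metrics import silhouette_score, davies_bouldin_score

mask = labels != -1  # drop noise before scoring
if len(set(labels[mask])) >= 2:
    print('silhouette (non-noise):',
          silhouette_score(X_pca[mask], labels[mask], sample_size=100_000, random_state=42))
    print('davies-bouldin (non-noise):', davies_bouldin_score(X_pca[mask], labels[mask]))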
VII. Presentation best practices
VIII. Code example (Python/scikit-learn + optional RAPIDS). Notes: the example shows the CPU pipeline and parameter selection; for full-scale fitting, prefer GPU or chunked processing.
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
# Custom feature weighting (multiply by sqrt(weight) so the factor acts as `weight` on squared Euclidean distance)
class FeatureWeighter(BaseEstimator, TransformerMixin):
    def __init__(self, columns, weight=1.5, all_columns=None):
        self.columns = columns
        self.weight = weight
        self.all_columns = all_columns
        self.col_idx_ = None

    def fit(self, X, y=None):
        # list() so this works for both a plain list and a pandas Index
        cols = list(self.all_columns) if self.all_columns is not None else list(X.columns)
        self.col_idx_ = [cols.index(c) for c in self.columns]
        return self

    def transform(self, X):
        X = X.copy()
        w = np.sqrt(self.weight)
        if isinstance(X, pd.DataFrame):
            X.iloc[:, self.col_idx_] = X.iloc[:, self.col_idx_] * w
        else:
            X[:, self.col_idx_] = X[:, self.col_idx_] * w
        return X
# Assume df already holds the 18 window-level features; 'site_id' is kept for evaluation and excluded from distances
feature_cols = [c for c in df.columns if c not in ['site_id']]
power_cols = ['power_mean','power_std','power_rate','switch_freq']  # example mapping to your power-related column names
scaler = RobustScaler()
weighter = FeatureWeighter(columns=power_cols, weight=1.5, all_columns=feature_cols)
pca = PCA(n_components=8, random_state=42)
# The scaler may be fit per site and the transforms then concatenated; below is the simplified global version
X_feat = df[feature_cols].values
X_scaled = scaler.fit_transform(X_feat)
X_weighted = weighter.fit_transform(pd.DataFrame(X_scaled, columns=feature_cols))
X_pca = pca.fit_transform(X_weighted)
# Subsample to estimate eps
idx = np.random.choice(len(X_pca), size=min(400_000, len(X_pca)), replace=False)
X_sample = X_pca[idx]
min_samples = 16  # one of the grid candidates
nbrs = NearestNeighbors(n_neighbors=min_samples, algorithm='auto', n_jobs=-1).fit(X_sample)
dists, _ = nbrs.kneighbors(X_sample)
k_dists = np.sort(dists[:, -1])
# Visualization: plot the k-distance curve and estimate eps at the knee
# plt.plot(k_dists); plt.ylim(0, np.percentile(k_dists, 99)); plt.show()
eps = float(np.percentile(k_dists, 95))  # can be combined with knee detection / further tuning
db = DBSCAN(eps=eps, min_samples=min_samples, n_jobs=-1).fit(X_pca)
labels = db.labels_  # -1 means noise
df['cluster_db'] = labels
# Quick quality indicators
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_ratio = np.mean(labels == -1)
print({'n_clusters': n_clusters, 'noise_ratio': noise_ratio})
from sklearn.cluster import KMeans
core_mask = db.core_sample_indices_  # indices of the core samples
core_idx = np.zeros(len(labels), dtype=bool)
core_idx[core_mask] = True
kmeans = KMeans(n_clusters=6, random_state=42, n_init='auto').fit(X_pca[core_idx])
# Snap every point to its nearest centroid (or use core points only downstream)
assign_all = kmeans.predict(X_pca)
df['cluster_6'] = assign_all
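# Assumed diagnostic (not in the original snippet): cross-tabulate raw DBSCAN labels
# against the 6 snapped classes; the -1 row shows where KMeans absorbs DBSCAN noise.
print(pd.crosstab(df['cluster_db'], df['cluster_6']))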
import umap
import matplotlib.pyplot as plt
reducer = umap.UMAP(n_neighbors=50, min_dist=0.1, random_state=42)
emb = reducer.fit_transform(X_pca)
plt.figure(figsize=(7,6))
plt.scatter(emb[:,0], emb[:,1], c=df['cluster_db'], s=1, cmap='tab20', alpha=0.4)
plt.title('DBSCAN clusters (UMAP 2D)')
plt.show()
IX. How to reconcile the "6 classes" with DBSCAN
X. Summary
The plan below targets Chinese consumer-product user reviews (about 5,200 of them), using sentence embeddings plus hierarchical clustering (average linkage + cosine distance) to discover and explain 8 topic clusters. It covers concept explanations, the implementation workflow, method selection, visualization, and best practices for presenting results, and provides a complete code skeleton for reproducibility and business interpretability.
I. What clustering is and why it matters
II. Implementation workflow (end to end)
III. Comparison of clustering methods and their use cases (brief)
IV. The clustering method for this project (brief description)
V. Output format and examples
VI. Executable code skeleton (Python/sklearn/Sentence-Transformers). Dependencies:
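A typical install for this skeleton (package names inferred from the imports below):

pip install sentence-transformers scikit-learn scipy jieba pandas matplotlib seaborn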
Pseudocode / code snippets
import re, unicodedata, numpy as np, pandas as pd
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import jieba
# 0. Load data: df must contain ['text','star','has_image','sentiment','text_len','question_ratio']
# text_len and question_ratio can be precomputed; if absent, fill them in later (see the sketch after the cleaning step)
df = pd.read_csv('comments.csv')
# 1. Text cleaning
def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = re.sub(r'<[^>]+>', ' ', s)           # strip HTML tags
    s = unicodedata.normalize('NFKC', s)     # unify full-/half-width characters
    s = re.sub(r'http\S+|www\.\S+', ' ', s)  # strip URLs
    s = re.sub(r'[@#]\S+', ' ', s)           # strip @mentions / hashtags
    s = re.sub(r'\s+', ' ', s).strip()
    return s
df['clean'] = df['text'].astype(str).map(clean_text)
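# If text_len / question_ratio were not precomputed, one assumed definition (character
# length; share of question marks, full- or half-width) can be filled in here:
if 'text_len' not in df.columns:
    df['text_len'] = df['clean'].str.len()
if 'question_ratio' not in df.columns:
    df['question_ratio'] = df['clean'].str.count(r'[??]') / df['text_len'].clip(lower=1)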
# 2. Sentence embeddings (Chinese MiniLM)
model = SentenceTransformer('shibing624/text2vec-base-chinese')  # or paraphrase-multilingual-MiniLM-L12-v2
emb = model.encode(df['clean'].tolist(), batch_size=64, normalize_embeddings=True)  # L2-normalized
# 3. Near-duplicate removal (similarity > 0.95)
# Radius-based neighbor search: cosine distance < 0.05
nn = NearestNeighbors(metric='cosine', radius=0.05, n_jobs=-1)
nn.fit(emb)
radii = nn.radius_neighbors(emb, return_distance=False)
to_drop = set()
seen = set()
for i, neigh in enumerate(radii):
    if i in to_drop:
        continue
    group = [j for j in neigh if j != i]
    for j in group:
        if j not in seen:
            to_drop.add(j)
    seen.add(i)
mask = ~df.index.isin(to_drop)  # boolean ndarray
df = df[mask].reset_index(drop=True)
emb = emb[mask]  # df.index.isin returns an ndarray, so no .values here
# 4. PCA down to 50 dims, then L2-normalize again
pca = PCA(n_components=50, random_state=42)
X50 = pca.fit_transform(emb)
X50 = normalize(X50)  # keeps cosine distances stable
# 5. Hierarchical clustering (average linkage + cosine)
Z = linkage(X50, method='average', metric='cosine')
labels = fcluster(Z, t=8, criterion='maxclust')
df['cluster'] = labels
# 6. Cluster interpretation: keywords (aggregated TF-IDF) and representative review (medoid)
def jieba_tokenizer(s):
    return [w for w in jieba.lcut(s) if w.strip()]
tfidf = TfidfVectorizer(tokenizer=jieba_tokenizer, ngram_range=(1,2),
min_df=5, max_df=0.8, sublinear_tf=True)
X_tfidf = tfidf.fit_transform(df['clean'])
vocab = np.array(tfidf.get_feature_names_out())
cluster_info = []
X50_centroids = np.vstack([X50[labels==(k+1)].mean(axis=0) for k in range(8)])
for k in range(1, 9):
    idx = np.where(labels == k)[0]
    sub = df.iloc[idx]
    # Keywords: sum the cluster's TF-IDF rows and rank
    tfidf_sum = X_tfidf[idx].sum(axis=0).A1
    top_idx = tfidf_sum.argsort()[::-1][:15]
    keywords = vocab[top_idx].tolist()
    # Representative review: medoid (closest to the cluster centroid)
    centroid = X50_centroids[k-1].reshape(1, -1)
    dists = cdist(X50[idx], centroid, metric='cosine').ravel()
    medoid_i = idx[dists.argmin()]
    rep = df.loc[medoid_i, 'text']  # use the original text
    # Metadata profile
    stats = {
        'size': len(idx),
        'star_mean': sub['star'].mean() if 'star' in sub else np.nan,
        'img_ratio': sub['has_image'].mean() if 'has_image' in sub else np.nan,
        'sentiment_mean': sub['sentiment'].mean() if 'sentiment' in sub else np.nan,
        'len_mean': sub['text_len'].mean() if 'text_len' in sub else np.nan,
        'question_ratio_mean': sub['question_ratio'].mean() if 'question_ratio' in sub else np.nan
    }
    cluster_info.append({'cluster': k, 'keywords': keywords[:10], 'representative': rep, 'stats': stats})
# 7. Visualization: 2-D t-SNE projection (cosine)
tsne = TSNE(n_components=2, perplexity=30, init='pca', learning_rate='auto',
metric='cosine', random_state=42)
X2 = tsne.fit_transform(X50)
df['tsne_x'], df['tsne_y'] = X2[:,0], X2[:,1]
# 8. Evaluation metrics
sil = silhouette_score(X50, labels, metric='cosine')
dbi = davies_bouldin_score(X50, labels)
chi = calinski_harabasz_score(X50, labels)
print('Silhouette(cosine)=', round(sil,4), ' DBI=', round(dbi,4), ' CH=', int(chi))
# 9. Export results
# Per cluster: keywords, representative review, statistics
for c in cluster_info:
    print(f"Cluster {c['cluster']} | size={c['stats']['size']}")
    print('Keywords:', ', '.join(c['keywords']))
    print('Representative:', c['representative'][:120], '...')
    print('Stats:', c['stats'])
    print('-' * 60)
# Optional: plotting (matplotlib)
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(9,7))
sns.scatterplot(x='tsne_x', y='tsne_y', hue='cluster', data=df, palette='tab10', s=12, linewidth=0)
plt.title('t-SNE of Comments (Average-Linkage Cosine Clusters)')
plt.legend(title='Cluster', bbox_to_anchor=(1.02,1), loc='upper left')
plt.tight_layout(); plt.show()
# Dendrogram (for large datasets, plot a sample only)
# dendrogram(Z, p=50, truncate_mode='lastp', no_labels=True); plt.show()
VII. Interpretation and visualization guidance
VIII. Optimizing clustering performance and common challenges
IX. Best practices for presenting results (business-facing)
X. Summary
Would you like me to run this on your real data and return concrete per-cluster keywords and representative reviews? Once you confirm the data fields, I can generate the final output tables and visualizations directly from the code above.
Provide users with efficient, professional clustering solutions: help them choose the right algorithm, master the implementation process, and interpret clustering results accurately, meeting data-analysis needs and improving productivity.
Through clear, approachable explanations and guidance, quickly grasp the core ideas of clustering algorithms and start hands-on practice, improving learning outcomes.
Quickly build a cluster-analysis workflow that groups scattered data efficiently, giving existing projects strong support for uncovering new insights.
Use clustering outputs to gain clear data-segmentation insights and improve decisions in business scenarios such as market segmentation and customer tiering.
Copy the prompt generated by the template into your favorite chat app (such as ChatGPT or Claude) and use it directly in conversation, with no extra development. Suited to quick personal trials and lightweight use.
Turn the prompt template into an API: your program can modify template parameters freely and call it through the interface, enabling automation and batch processing. Suited to developer integration and embedding in business systems.
Configure the corresponding server address in your MCP client so your AI application can invoke the prompt template automatically. Suited to advanced users and team collaboration, letting prompts move seamlessly across AI tools.