The plan below covers concepts, implementation, method selection, result interpretation and visualization, performance optimization, and presentation best practices, systematically helping you cluster e-commerce users into 5 groups of similar behavior with K-Means. A directly runnable code framework (Python/scikit-learn) appears at the end, covering your specified preprocessing strategy, PCA, stratified-sampling-based tuning, and the 5-cluster output.
1. What clustering is and why it matters
2. Implementation workflow (consistent with your data)
3. Clustering methods and when to use them (K=5 is the primary choice, but know the alternatives)
4. Identified cluster list (illustrative business personas; the final result depends on your actual data)
Based on the feature set you provided, five common interpretable segment archetypes under K=5 are outlined below (names and directions are illustrative; each is described by whether its standardized cluster mean is high or low relative to the global mean):
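To make the high/low-vs-global-mean naming above concrete, here is a minimal sketch; the profile values, feature names, and the 0.5 z-score threshold are illustrative:

```python
import pandas as pd

# Hypothetical z-score profile (rows = features, columns = cluster ids), as produced
# by the profiling step in section 5; values are cluster means in global z-units.
profile = pd.DataFrame(
    {0: [1.2, 0.8, -0.3], 1: [-0.9, -1.1, 0.2]},
    index=['F_180', 'M_180', 'refund_rate'],
)

def describe_cluster(col, threshold=0.5):
    """Name a cluster by the features it is strongly high or low on (|z| > threshold)."""
    return {
        'high': col[col > threshold].index.tolist(),
        'low': col[col < -threshold].index.tolist(),
    }

for k in profile.columns:
    print(k, describe_cluster(profile[k]))
```

A cluster that is high on F_180 and M_180 would then be named something like "high-frequency, high-value loyalists", while the mirror-image cluster suggests a "low-engagement" segment.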
5. Key characteristics of each cluster (how to derive them systematically from the data)
6. Brief description of the clustering method used
7. Cluster visualization (guidance and reusable code)
8. Performance optimization and common challenges
9. Output and presentation best practices
10. Reference code (runnable after adjusting feature names to yours). Notes:
Python example (sklearn/umap):
Code snippet:
# Environment and tools
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import umap.umap_ as umap
import matplotlib.pyplot as plt
import seaborn as sns
# Custom 99th-percentile clipper
class ClipP99(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.p99_ = None

    def fit(self, X, y=None):
        self.p99_ = np.nanpercentile(X, 99, axis=0)
        return self

    def transform(self, X):
        return np.minimum(X, self.p99_)
# Core feature definitions (example)
num_features = ['R_30', 'F_180', 'M_180', 'AOV_7d', 'refund_rate', 'session_pages_90d',
                'days_since_first_purchase',
                'cat_share_1', 'cat_share_2', 'cat_share_3', 'cat_share_4', 'cat_share_5']
cat_features = ['region']  # restricted to the top 10 levels before one-hot encoding
# log1p for M_180 (applied directly on the dataframe before the pipeline, which is
# simpler than routing a single column through a ColumnTransformer)
df['M_180'] = np.log1p(df['M_180'])

# Restrict region to its top-10 levels; everything else becomes 'OTHER'
top10_regions = df['region'].value_counts().nlargest(10).index
df['region_top'] = df['region'].where(df['region'].isin(top10_regions), 'OTHER')
cat_features = ['region_top']

# Numeric pipeline (referenced below): impute -> clip at P99 -> standardize
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('clip', ClipP99()),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop=None))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, num_features),
        ('cat', categorical_pipeline, cat_features)
    ],
    remainder='drop'
)
# Full pipeline: preprocessing -> PCA -> KMeans
pipe = Pipeline(steps=[
    ('prep', preprocessor),
    ('pca', PCA(n_components=0.95, svd_solver='full', random_state=42)),
    ('kmeans', KMeans(n_clusters=5, init='k-means++', n_init='auto',
                      max_iter=400, tol=1e-4, random_state=42))
])
# Stratified sampling (by R/F/M quintiles)
df['_R_q'] = pd.qcut(df['R_30'], q=5, duplicates='drop')
df['_F_q'] = pd.qcut(df['F_180'], q=5, duplicates='drop')
df['_M_q'] = pd.qcut(df['M_180'], q=5, duplicates='drop')
df['_strata'] = df['_R_q'].astype(str) + '_' + df['_F_q'].astype(str) + '_' + df['_M_q'].astype(str)

sample_idx = df.groupby('_strata', group_keys=False).apply(
    lambda x: x.sample(frac=min(1.0, 10000 / len(df)), random_state=42)
).index
df_s = df.loc[sample_idx].copy()
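The `evaluate_k_range` helper used next is not defined in the surrounding snippet. A minimal sketch, assuming K is scanned on an already-preprocessed matrix (so the call would look like `evaluate_k_range(pipe[:-1].fit_transform(df_s[num_features + cat_features]))`); the blob data at the bottom is synthetic and for demonstration only:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def evaluate_k_range(X, k_range=range(3, 9), random_state=42):
    """Scan candidate K values on a preprocessed matrix and collect internal metrics."""
    rows = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        rows.append({
            'k': k,
            'silhouette': silhouette_score(X, km.labels_),
            'davies_bouldin': davies_bouldin_score(X, km.labels_),
            'calinski_harabasz': calinski_harabasz_score(X, km.labels_),
        })
    return pd.DataFrame(rows)

# Demo on three synthetic, well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 4)) for c in (0.0, 3.0, 6.0)])
metrics_demo = evaluate_k_range(X_demo, k_range=range(2, 6))
print(metrics_demo)
```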
# evaluate_k_range: user-defined helper that scans candidate K values on the sample
metrics_df = evaluate_k_range(df_s)
print(metrics_df.sort_values('silhouette', ascending=False))
# Final K=5 fit (full data)
X_full = df[num_features + cat_features]
pipe.fit(X_full)
labels = pipe.named_steps['kmeans'].labels_
df['cluster'] = labels
# Cluster profiling: standardized mean differences (z-scores)
z_inputs = df[num_features].copy()
z_inputs = (z_inputs - z_inputs.mean()) / z_inputs.std(ddof=0)
profile = z_inputs.join(df[['cluster']]).groupby('cluster').mean().T
print(profile)  # behavior profile matrix (rows = features, columns = clusters)
# PCA-space features for visualization (all pipeline steps except KMeans)
Xt_pca = pipe[:-1].transform(X_full)
reducer = umap.UMAP(n_neighbors=30, min_dist=0.1, random_state=42)
Xt_umap = reducer.fit_transform(Xt_pca)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=Xt_umap[:, 0], y=Xt_umap[:, 1], hue=df['cluster'],
                palette='tab10', s=6, linewidth=0)
plt.title('UMAP of PCA features colored by KMeans clusters')
plt.show()
plot_radar(profile)  # user-defined radar-chart helper
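`plot_radar` is not defined in the snippet above; here is a minimal matplotlib sketch, assuming `profile` is the feature-by-cluster z-score matrix built earlier (the demo profile at the bottom is illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_radar(profile):
    """Radar chart of per-cluster z-score profiles (rows = features, columns = clusters)."""
    features = list(profile.index)
    angles = np.linspace(0, 2 * np.pi, len(features), endpoint=False).tolist()
    angles += angles[:1]  # close the polygon
    fig, ax = plt.subplots(subplot_kw={'polar': True}, figsize=(7, 7))
    for c in profile.columns:
        values = profile[c].tolist() + [profile[c].iloc[0]]  # close the polygon
        ax.plot(angles, values, label=f'cluster {c}')
        ax.fill(angles, values, alpha=0.1)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(features, fontsize=8)
    ax.legend(loc='upper right', bbox_to_anchor=(1.25, 1.05))
    return fig

# Hypothetical z-score profile with 3 features and 2 clusters
demo_profile = pd.DataFrame({0: [1.0, -0.5, 0.2], 1: [-1.0, 0.5, 0.0]},
                            index=['F_180', 'M_180', 'refund_rate'])
fig = plot_radar(demo_profile)
```

Call `plt.show()` on the returned figure in an interactive session, or `fig.savefig(...)` in a script.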
Output format (example)
Final reminders
The plan below is built around DBSCAN with density calibration and denoising, targeting the multi-site, minute-level data from your IoT smart-building deployment. It lays out a complete route from concepts and implementation to interpretation and presentation, along with reusable code snippets and visualization suggestions.
1. What clustering is & why it matters
2. Implementation workflow (end to end)
Data alignment and cleaning (processed per site in batches for parallelism)
Feature engineering (the 18 dimensions you selected)
Standardization and weighting
Dimensionality reduction and a visualization foundation
DBSCAN parameter selection and scalable computation
Quality evaluation and the tuning loop
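The tuning loop above can be sketched as a small grid over (eps, min_samples) scored by cluster count, noise ratio, and silhouette; the grid values and demo data below are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_grid(X, eps_grid, min_samples_grid):
    """Score a small (eps, min_samples) grid by cluster count, noise ratio, and silhouette."""
    results = []
    for eps in eps_grid:
        for ms in min_samples_grid:
            labels = DBSCAN(eps=eps, min_samples=ms).fit_predict(X)
            mask = labels != -1
            n_clusters = len(set(labels[mask]))
            sil = (silhouette_score(X[mask], labels[mask])
                   if n_clusters >= 2 else float('nan'))
            results.append({'eps': eps, 'min_samples': ms,
                            'n_clusters': n_clusters,
                            'noise_ratio': 1.0 - mask.mean(),
                            'silhouette': sil})
    return results

# Demo: three synthetic 2-D blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 0.15, (60, 2)) for c in (0.0, 2.0, 4.0)])
grid_results = dbscan_grid(X_demo, eps_grid=[0.2, 0.5], min_samples_grid=[5, 10])
for r in grid_results:
    print(r)
```

In practice, pick the configuration whose noise ratio and cluster count match domain expectations, then break ties by silhouette.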
3. Method comparison and the choice for this project
4. Output (example form)
Identified cluster list (6 "prototype clusters"; DBSCAN noise is labeled -1 and can be folded into an anomaly class when needed)
Key characteristics of each cluster (median ± IQR is the recommended presentation)
Brief description of the clustering method used
Cluster visualization (suggestions and code snippets)
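The median ± IQR per-cluster profiling recommended above can be sketched as follows; the column names stand in for the 18 features and are illustrative:

```python
import numpy as np
import pandas as pd

def cluster_profile(df, cluster_col='cluster_db', feature_cols=None):
    """Per-cluster median and IQR (q75 - q25) per feature; noise (-1) stays its own row."""
    feature_cols = feature_cols or [c for c in df.columns if c != cluster_col]
    g = df.groupby(cluster_col)[feature_cols]
    med = g.median().add_suffix('_median')
    iqr = (g.quantile(0.75) - g.quantile(0.25)).add_suffix('_iqr')
    return med.join(iqr)

# Demo frame with hypothetical power features and DBSCAN labels (-1 = noise)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'power_mean': rng.normal(50, 5, 200),
    'switch_freq': rng.poisson(3, 200).astype(float),
    'cluster_db': rng.integers(-1, 3, 200),
})
profile_demo = cluster_profile(demo)
print(profile_demo)
```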
5. Practical tips for performance and challenges
6. Validation and interpretation (metrics aligned with the business)
7. Presentation best practices
8. Code example (Python/scikit-learn + optional RAPIDS). Note: the example shows the CPU pipeline and parameter selection; for a full-scale fit, prefer GPU or chunked processing.
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
# Custom feature weighting (multiply by sqrt(weight) so the weight acts on squared Euclidean distance)
class FeatureWeighter(BaseEstimator, TransformerMixin):
    def __init__(self, columns, weight=1.5, all_columns=None):
        self.columns = columns
        self.weight = weight
        self.all_columns = all_columns
        self.col_idx_ = None

    def fit(self, X, y=None):
        # X.columns only exists for DataFrames, so an explicit column list is also accepted
        cols = list(self.all_columns) if self.all_columns is not None else list(X.columns)
        self.col_idx_ = [cols.index(c) for c in self.columns]
        return self

    def transform(self, X):
        X = X.copy()
        w = np.sqrt(self.weight)
        if isinstance(X, pd.DataFrame):
            X.iloc[:, self.col_idx_] = X.iloc[:, self.col_idx_] * w
        else:
            X[:, self.col_idx_] = X[:, self.col_idx_] * w
        return X
# Assume df already holds the window-level 18-dim features; ['site_id'] is kept for evaluation and excluded from distances
feature_cols = [c for c in df.columns if c not in ['site_id']]
power_cols = ['power_mean', 'power_std', 'power_rate', 'switch_freq']  # example mapping to your power-related column names
scaler = RobustScaler()
weighter = FeatureWeighter(columns=power_cols, weight=1.5, all_columns=feature_cols)
pca = PCA(n_components=8, random_state=42)
# The scaler can be fit per site and the transforms concatenated; below is the simplified global version
X_feat = df[feature_cols].values
X_scaled = scaler.fit_transform(X_feat)
X_weighted = weighter.fit_transform(pd.DataFrame(X_scaled, columns=feature_cols))
X_pca = pca.fit_transform(X_weighted)
# Subsample to estimate eps
idx = np.random.choice(len(X_pca), size=min(400_000, len(X_pca)), replace=False)
X_sample = X_pca[idx]
min_samples = 16  # one of the grid candidates
nbrs = NearestNeighbors(n_neighbors=min_samples, algorithm='auto', n_jobs=-1).fit(X_sample)
dists, _ = nbrs.kneighbors(X_sample)
k_dists = np.sort(dists[:, -1])
# Visualization: plot the k-distance curve and estimate eps at the elbow
# plt.plot(k_dists); plt.ylim(0, np.percentile(k_dists, 99)); plt.show()
eps = float(np.percentile(k_dists, 95))  # combine with knee detection / further tuning
db = DBSCAN(eps=eps, min_samples=min_samples, n_jobs=-1).fit(X_pca)
labels = db.labels_  # -1 = noise
df['cluster_db'] = labels
# Brief quality metrics
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_ratio = np.mean(labels == -1)
print({'n_clusters': n_clusters, 'noise_ratio': noise_ratio})
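Internal metrics such as silhouette are only meaningful on the non-noise points for DBSCAN; here is a hedged sketch that also subsamples for speed (the demo data at the bottom is synthetic):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_silhouette(X, labels, max_points=50_000, random_state=42):
    """Silhouette on clustered (non-noise) points only, subsampled for speed."""
    mask = labels != -1
    if np.unique(labels[mask]).size < 2:
        return float('nan')  # silhouette needs at least two clusters
    Xc, yc = X[mask], labels[mask]
    if len(Xc) > max_points:
        rng = np.random.default_rng(random_state)
        idx = rng.choice(len(Xc), size=max_points, replace=False)
        Xc, yc = Xc[idx], yc[idx]
    return silhouette_score(Xc, yc)

# Demo: two well-separated blobs plus one far-off noise point
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.2, (80, 2)),
                    rng.normal(5, 0.2, (80, 2)),
                    [[50.0, 50.0]]])
db_demo = DBSCAN(eps=0.8, min_samples=5).fit(X_demo)
print(dbscan_silhouette(X_demo, db_demo.labels_))
```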
from sklearn.cluster import KMeans
core_mask = db.core_sample_indices_  # indices of core samples
core_idx = np.zeros(len(labels), dtype=bool)
core_idx[core_mask] = True
kmeans = KMeans(n_clusters=6, random_state=42, n_init='auto').fit(X_pca[core_idx])
# Snap every point to its nearest centroid (or use core points only downstream)
assign_all = kmeans.predict(X_pca)
df['cluster_6'] = assign_all
import umap
import matplotlib.pyplot as plt
reducer = umap.UMAP(n_neighbors=50, min_dist=0.1, random_state=42)
emb = reducer.fit_transform(X_pca)
plt.figure(figsize=(7,6))
plt.scatter(emb[:,0], emb[:,1], c=df['cluster_db'], s=1, cmap='tab20', alpha=0.4)
plt.title('DBSCAN clusters (UMAP 2D)')
plt.show()
9. How to reconcile the "6 classes" with DBSCAN
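One straightforward way to reconcile the density-based partition with the fixed six prototypes is a contingency table plus a majority mapping; the label arrays below are random stand-ins for df['cluster_db'] and df['cluster_6'] from the code above:

```python
import numpy as np
import pandas as pd

# Hypothetical label arrays standing in for df['cluster_db'] and df['cluster_6']
rng = np.random.default_rng(2)
cluster_db = rng.integers(-1, 4, 500)  # DBSCAN labels, -1 = noise
cluster_6 = rng.integers(0, 6, 500)    # KMeans-on-cores labels

# Cross-tabulate: how each density cluster distributes over the 6 prototypes
ct = pd.crosstab(pd.Series(cluster_db, name='dbscan'),
                 pd.Series(cluster_6, name='kmeans6'))
print(ct)

# Majority mapping: the dominant prototype per DBSCAN cluster (the -1 row shows
# where noise points would land if forced into a prototype)
mapping = ct.idxmax(axis=1)
print(mapping)
```

A clean reconciliation shows each DBSCAN cluster concentrated in one prototype column; a diffuse row signals a density cluster that the 6-class view splits apart.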
10. Summary
The plan below targets Chinese consumer-product user reviews (about 5,200), using sentence embeddings plus hierarchical clustering (average linkage + cosine distance) to discover and interpret 8 topic clusters. It includes concept explanations, the implementation workflow, method selection, visualization, and result-presentation best practices, and provides a complete code skeleton for reproducibility and business interpretability.
1. What clustering is and why it matters
2. Implementation workflow (end to end)
3. Comparison of clustering methods and use cases (brief)
4. The clustering method for this project (brief description)
5. Output format and examples
6. Executable code skeleton (Python/sklearn/Sentence-Transformers). Dependencies:
Pseudocode / code snippet:
import re, unicodedata, numpy as np, pandas as pd
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import jieba
# 0. Load data: df must contain ['text','star','has_image','sentiment','text_len','question_ratio']
# text_len and question_ratio can be precomputed; if absent, add them later
df = pd.read_csv('comments.csv')
# 1. Text cleaning
def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = re.sub(r'<[^>]+>', ' ', s)            # strip HTML
    s = unicodedata.normalize('NFKC', s)      # unify full-/half-width characters
    s = re.sub(r'http\S+|www\.\S+', ' ', s)   # strip URLs
    s = re.sub(r'[@#]\S+', ' ', s)            # strip @mentions / hashtags
    s = re.sub(r'\s+', ' ', s).strip()
    return s
df['clean'] = df['text'].astype(str).map(clean_text)
# 2. Sentence embeddings (Chinese model)
model = SentenceTransformer('shibing624/text2vec-base-chinese')  # or paraphrase-multilingual-MiniLM-L12-v2
emb = model.encode(df['clean'].tolist(), batch_size=64, normalize_embeddings=True)  # L2-normalized
# 3. Near-duplicate removal (similarity > 0.95)
# Radius neighbor search: cosine distance < 0.05
nn = NearestNeighbors(metric='cosine', radius=0.05, n_jobs=-1)
nn.fit(emb)
radii = nn.radius_neighbors(emb, return_distance=False)
to_drop = set()
seen = set()
for i, neigh in enumerate(radii):
    if i in to_drop:
        continue
    group = [j for j in neigh if j != i]
    for j in group:
        if j not in seen:
            to_drop.add(j)
    seen.add(i)
mask = ~df.index.isin(to_drop)   # df.index is a RangeIndex here, matching row positions
df = df[mask].reset_index(drop=True)
emb = emb[mask]                  # mask is already a NumPy boolean array
# 4. PCA down to 50 dims, then L2-normalize again
pca = PCA(n_components=50, random_state=42)
X50 = pca.fit_transform(emb)
X50 = normalize(X50)  # keeps cosine distances stable
# 5. Hierarchical clustering (average linkage + cosine)
Z = linkage(X50, method='average', metric='cosine')
labels = fcluster(Z, t=8, criterion='maxclust')
df['cluster'] = labels
# 6. Cluster interpretation: keywords (aggregated TF-IDF) and representative reviews (medoids)
def jieba_tokenizer(s):
    return [w for w in jieba.lcut(s) if w.strip()]

tfidf = TfidfVectorizer(tokenizer=jieba_tokenizer, ngram_range=(1, 2),
                        min_df=5, max_df=0.8, sublinear_tf=True)
X_tfidf = tfidf.fit_transform(df['clean'])
vocab = np.array(tfidf.get_feature_names_out())
cluster_info = []
X50_centroids = np.vstack([X50[labels == (k + 1)].mean(axis=0) for k in range(8)])
for k in range(1, 9):
    idx = np.where(labels == k)[0]
    sub = df.iloc[idx]
    # Keywords: sum TF-IDF within the cluster and rank
    tfidf_sum = X_tfidf[idx].sum(axis=0).A1
    top_idx = tfidf_sum.argsort()[::-1][:15]
    keywords = vocab[top_idx].tolist()
    # Representative review: medoid (closest to the cluster centroid)
    centroid = X50_centroids[k - 1].reshape(1, -1)
    dists = cdist(X50[idx], centroid, metric='cosine').ravel()
    medoid_i = idx[dists.argmin()]
    rep = df.loc[medoid_i, 'text']  # use the original text
    # Metadata profile
    stats = {
        'size': len(idx),
        'star_mean': sub['star'].mean() if 'star' in sub else np.nan,
        'img_ratio': sub['has_image'].mean() if 'has_image' in sub else np.nan,
        'sentiment_mean': sub['sentiment'].mean() if 'sentiment' in sub else np.nan,
        'len_mean': sub['text_len'].mean() if 'text_len' in sub else np.nan,
        'question_ratio_mean': sub['question_ratio'].mean() if 'question_ratio' in sub else np.nan
    }
    cluster_info.append({'cluster': k, 'keywords': keywords[:10], 'representative': rep, 'stats': stats})
# 7. Visualization: 2-D t-SNE projection (cosine metric)
tsne = TSNE(n_components=2, perplexity=30, init='pca', learning_rate='auto',
            metric='cosine', random_state=42)
X2 = tsne.fit_transform(X50)
df['tsne_x'], df['tsne_y'] = X2[:, 0], X2[:, 1]
# 8. Evaluation metrics
sil = silhouette_score(X50, labels, metric='cosine')
dbi = davies_bouldin_score(X50, labels)
chi = calinski_harabasz_score(X50, labels)
print('Silhouette(cosine)=', round(sil, 4), ' DBI=', round(dbi, 4), ' CH=', int(chi))
# 9. Export results
# Per cluster: keywords, representative review, statistics
for c in cluster_info:
    print(f"Cluster {c['cluster']} | size={c['stats']['size']}")
    print('Keywords:', ', '.join(c['keywords']))
    print('Representative:', c['representative'][:120], '...')
    print('Stats:', c['stats'])
    print('-' * 60)
# Optional: plotting (matplotlib)
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(9,7))
sns.scatterplot(x='tsne_x', y='tsne_y', hue='cluster', data=df, palette='tab10', s=12, linewidth=0)
plt.title('t-SNE of Comments (Average-Linkage Cosine Clusters)')
plt.legend(title='Cluster', bbox_to_anchor=(1.02,1), loc='upper left')
plt.tight_layout(); plt.show()
# Dendrogram (sample or truncate for large data)
# dendrogram(Z, p=50, truncate_mode='lastp', no_labels=True); plt.show()
7. Interpretation and visualization guidance
8. Optimizing clustering performance and common challenges
9. Best practices for presenting results (business-facing)
10. Summary
Would you like me to run this on your real data and return the concrete per-cluster keywords and representative reviews? Once you confirm the data fields, I can generate the final output tables and visualizations directly with the code above.