Descriptive summary for an urban mobility trajectories dataset with anomalous samples
Scope and unit of analysis
- Objective: Quantify the characteristics of urban mobility trajectories, assess data quality, and summarize the prevalence and nature of anomalous samples.
- Units: GPS points and derived trip segments/trajectories. If anomaly labels are present, summarize both point-level and trip-level anomalies.
- Coverage to report: observation period (start–end), cities/regions covered, coordinate system (expected WGS84), and time zone.
Core schema (expected fields)
- Identifiers: device_id, trip_id (or session_id), point_id (optional).
- Spatiotemporal: timestamp (UTC or local), latitude, longitude, optional altitude, horizontal_accuracy (m), heading, speed (m/s or km/h).
- Attributes (if available): mode_label, road segment or matched link_id, POI/zone IDs.
- Anomalies: anomaly_flag (0/1), anomaly_type (categorical), anomaly_score (numeric), annotator/source (rule, model, human).
Data volume and coverage
Report:
- Total points; total trips; unique devices.
- Average points per trip; median sampling interval (s) and its variability.
- Spatial footprint: number of covered grid cells (e.g., 500 m), share of city area covered.
- Temporal footprint: days covered; distribution by weekday/weekend; hour-of-day counts.
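A minimal sketch of the volume and coverage figures above. It assumes fixes arrive as (trip_id, epoch_seconds, lat, lon) tuples, which is a hypothetical layout rather than the dataset's confirmed schema, and it approximates the grid with an equirectangular projection around a reference latitude (adequate at city scale only). The names grid_cell and coverage_summary are illustrative.

```python
import math
from collections import defaultdict
from statistics import median

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude


def grid_cell(lat, lon, ref_lat, cell_m=500.0):
    """Approximate square-cell index for a WGS84 point (city scale only)."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    return (math.floor(lat * M_PER_DEG_LAT / cell_m),
            math.floor(lon * m_per_deg_lon / cell_m))


def coverage_summary(fixes, ref_lat, cell_m=500.0):
    """fixes: iterable of (trip_id, epoch_s, lat, lon); this layout is an assumption."""
    by_trip = defaultdict(list)
    cells = set()
    for trip_id, ts, lat, lon in fixes:
        by_trip[trip_id].append(ts)
        cells.add(grid_cell(lat, lon, ref_lat, cell_m))

    # Inter-point gaps are computed within each trip, never across trips
    gaps = []
    for stamps in by_trip.values():
        stamps.sort()
        gaps.extend(b - a for a, b in zip(stamps, stamps[1:]))

    n_points = sum(len(s) for s in by_trip.values())
    return {
        "points": n_points,
        "trips": len(by_trip),
        "avg_points_per_trip": n_points / len(by_trip) if by_trip else 0.0,
        "median_interval_s": median(gaps) if gaps else None,
        "covered_cells": len(cells),
    }


# Synthetic demo data, not taken from any real dataset
demo = [
    ("t1", 0, 0.001, 0.001),
    ("t1", 10, 0.001, 0.002),
    ("t1", 25, 0.010, 0.002),
    ("t2", 100, 0.010, 0.003),
]
summary = coverage_summary(demo, ref_lat=0.0)
```

The share of city area covered would then be covered_cells divided by the number of cells intersecting the city boundary, which requires a boundary polygon not shown here.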
Data quality profile
- Completeness: missing values per field (%), particularly timestamp, lat/lon, accuracy, speed.
- Validity:
  - Coordinates within the study-area bounding box; flag or remove (lat, lon) pairs outside plausible ranges (|lat| ≤ 90°, |lon| ≤ 180°).
  - Timestamp monotonicity within device/trip; share of non-increasing or duplicated timestamps (%).
  - Duplicated points (exact duplicates and near-duplicates within 1–5 m).
- Positional quality: distribution of horizontal_accuracy (median, IQR, 90th percentile); share > 30 m and > 100 m.
- Sampling regularity: distribution of inter-point intervals; share > 60 s between consecutive fixes; prolonged gaps per trip.
- Device stability: per-device data volume and gap patterns; identify devices contributing disproportionately high errors.
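The completeness and validity checks above could be sketched as follows. The sketch assumes points are dicts keyed by the expected schema fields, listed in arrival order, with missing values stored as None; quality_profile is a hypothetical helper name. Near-duplicate detection and accuracy percentiles are left out for brevity.

```python
from collections import Counter

FIELDS = ("timestamp", "latitude", "longitude", "horizontal_accuracy", "speed")


def quality_profile(points):
    """points: list of dicts in arrival order; missing values are None (assumed)."""
    n = len(points)

    def pct(k):
        return 100.0 * k / n if n else 0.0

    # Completeness: share of missing values per field
    missing = {f: pct(sum(1 for p in points if p.get(f) is None)) for f in FIELDS}

    # Validity checks only make sense where time and position are present
    usable = [p for p in points
              if None not in (p.get("timestamp"), p.get("latitude"), p.get("longitude"))]

    out_of_range = sum(
        1 for p in usable
        if not (-90 <= p["latitude"] <= 90 and -180 <= p["longitude"] <= 180)
    )

    # Exact duplicates: same device, time and position
    keys = Counter((p["device_id"], p["timestamp"], p["latitude"], p["longitude"])
                   for p in usable)
    exact_dupes = sum(c - 1 for c in keys.values())

    # Non-increasing timestamps relative to the previous fix of the same device
    last_ts, non_increasing = {}, 0
    for p in usable:
        prev = last_ts.get(p["device_id"])
        if prev is not None and p["timestamp"] <= prev:
            non_increasing += 1
        last_ts[p["device_id"]] = p["timestamp"]

    return {
        "missing_pct": missing,
        "out_of_range_pct": pct(out_of_range),
        "exact_duplicate_pct": pct(exact_dupes),
        "non_increasing_ts_pct": pct(non_increasing),
    }


# Synthetic demo data
demo = [
    {"device_id": "d1", "timestamp": 0, "latitude": 10.0, "longitude": 20.0,
     "horizontal_accuracy": 5.0, "speed": 1.0},
    {"device_id": "d1", "timestamp": 5, "latitude": 10.0, "longitude": 20.0,
     "horizontal_accuracy": 8.0, "speed": None},
    {"device_id": "d1", "timestamp": 5, "latitude": 10.0, "longitude": 20.0,
     "horizontal_accuracy": 8.0, "speed": None},
    {"device_id": "d1", "timestamp": 3, "latitude": 95.0, "longitude": 20.0,
     "horizontal_accuracy": 40.0, "speed": 2.0},
]
profile = quality_profile(demo)
```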
Trajectory-level descriptors
Compute per trip (map-match if available; otherwise use great-circle distances):
- Duration: end_time − start_time; report median, IQR, 90th percentile.
- Distance: sum of segment distances (Haversine); report median, IQR, 90th percentile.
- Speeds: segment speed = distance/Δt; trip median speed; 95th percentile speed.
- Acceleration: change in segment speed/Δt; summarize typical ranges; cap extreme values to limit the influence of GPS noise.
- Dwell behavior: number of stops (speed < threshold for ≥ t_stop), median dwell time.
- Shape metrics: detour index (path length / straight-line distance), radius of gyration.
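As an illustration of the duration, distance, speed, and detour-index descriptors above, here is a minimal sketch without map-matching. It assumes each trip is a time-sorted list of (epoch_seconds, lat, lon) tuples and uses a mean Earth radius of 6,371 km; trip_descriptors is a hypothetical name. Stops, acceleration, and radius of gyration are not covered.

```python
import math
from statistics import median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_descriptors(fixes):
    """fixes: time-sorted list of (epoch_s, lat, lon) with at least two entries."""
    pairs = list(zip(fixes, fixes[1:]))
    seg_d = [haversine_m(a[1], a[2], b[1], b[2]) for a, b in pairs]
    seg_dt = [b[0] - a[0] for a, b in pairs]
    # Skip zero or negative time steps; they are a data-quality issue, not a speed
    speeds = [d / dt for d, dt in zip(seg_d, seg_dt) if dt > 0]

    path = sum(seg_d)
    straight = haversine_m(fixes[0][1], fixes[0][2], fixes[-1][1], fixes[-1][2])
    return {
        "duration_s": fixes[-1][0] - fixes[0][0],
        "distance_m": path,
        "median_speed_mps": median(speeds) if speeds else None,
        # Undefined for round trips whose start and end coincide
        "detour_index": path / straight if straight > 0 else None,
    }


# Synthetic trip along the equator, one fix every 10 s
trip = [(0, 0.0, 0.000), (10, 0.0, 0.001), (20, 0.0, 0.002)]
desc = trip_descriptors(trip)
```

A straight-line trip such as the demo yields a detour index of about 1; values well above 1 indicate circuitous paths or GPS noise inflating the path length.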
Recommended summary outputs:
- ECDFs and histograms for trip distance, duration, median speed.
- Cross-tabs by hour-of-day and weekday/weekend for trip counts and speeds.
- Segment-level speed distribution by road class or area type (if available).
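The summary statistics and ECDFs named above can be produced with the standard library alone. This is a sketch, with ecdf and summarize as illustrative names; the quantiles use linear interpolation between order statistics (the "inclusive" method), which is one of several common conventions.

```python
from statistics import median, quantiles


def ecdf(values):
    """Return (x, F(x)) pairs suitable for a step plot."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def summarize(values):
    """Median, IQR and 90th percentile; needs at least two values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    p90 = quantiles(values, n=10, method="inclusive")[-1]
    return {"median": median(values), "iqr": q3 - q1, "p90": p90}


# Synthetic trip distances in km
distances_km = [1, 2, 3, 4, 5, 6, 7, 8, 9]
stats = summarize(distances_km)
curve = ecdf(distances_km)
```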
Spatial distribution and OD structure
- Density maps of points and trip starts/ends; identify hotspots.
- OD analysis: top OD pairs and their share; flows by zone (e.g., TAZ or hex bins).
- Spatial imbalance: compare inflow vs outflow by zone and time (peak hours vs off-peak).
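For the OD structure and inflow/outflow comparison above, a minimal sketch follows. It assumes trip origins and destinations have already been assigned to zones (TAZ, hex bin, or grid cell); the zone IDs and the od_summary name are hypothetical.

```python
from collections import Counter


def od_summary(trips, top_n=3):
    """trips: list of (origin_zone, destination_zone) pairs."""
    od = Counter(trips)
    total = sum(od.values())
    # Top OD pairs with their share of all trips
    top_pairs = [(pair, count / total) for pair, count in od.most_common(top_n)]

    outflow = Counter(o for o, _ in trips)
    inflow = Counter(d for _, d in trips)
    zones = set(outflow) | set(inflow)
    # Positive net flow means the zone attracts more trips than it generates
    net = {z: inflow[z] - outflow[z] for z in zones}
    return top_pairs, net


# Synthetic demo trips
trips = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
top_pairs, net_flow = od_summary(trips)
```

Running the same function on peak-hour and off-peak subsets separately gives the time-split imbalance.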
Temporal patterns
- Diurnal patterns: trip counts and median speed by hour; peak periods (AM/PM).
- Weekly patterns: weekday vs weekend volumes and speed differences.
- Seasonality across the observation window (if multi-week/month).
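The diurnal and weekly counts above could be computed as in this sketch. It assumes trip start times are UTC epoch seconds and converts them with a fixed UTC offset, which ignores daylight saving; a named time zone would be needed where DST applies. temporal_profile is an illustrative name.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def temporal_profile(start_times, utc_offset_hours=0):
    """start_times: trip start times as UTC epoch seconds (assumed)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    by_hour = Counter()
    by_daytype = Counter()
    for ts in start_times:
        local = datetime.fromtimestamp(ts, tz=tz)
        by_hour[local.hour] += 1
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday
        by_daytype["weekend" if local.weekday() >= 5 else "weekday"] += 1
    return by_hour, by_daytype


# Synthetic start times: Thu 1970-01-01 00:00, Thu 08:00, Sat 1970-01-03 13:00 (UTC)
starts = [0, 8 * 3600, 2 * 86400 + 13 * 3600]
by_hour, by_daytype = temporal_profile(starts)
```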
Anomalous samples summary
Clearly define anomaly taxonomy; typical categories:
- Implausible speed: segment speed above urban feasibility (e.g., > 160 km/h) or above the plausible limit for the matched road link after map-matching.
- Teleportation/jumps: large displacement in short time (e.g., > 1 km in < 5 s) or extreme acceleration.
- GPS drift/noise: large horizontal_accuracy values (i.e., poor positional accuracy) with oscillations around a point; zig-zag in low-speed contexts.
- Timestamp issues: non-monotonic time, duplicated timestamps, large gaps within a trip.
- Route deviation: significant divergence from expected route (if route plan exists).
- Mode inconsistency: speed/trajectory features inconsistent with mode_label.
- Duplicates: overlapping trajectories from the same device_id/trip_id.
- Stationarity anomalies: long dwell without expected context (may be false positives indoors or in urban canyons).
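Three of the categories above (timestamp issues, jumps, implausible speed) lend themselves to simple segment-level rules. The sketch below uses the example thresholds from the taxonomy (160 km/h; more than 1 km in under 5 s) as tunable defaults. The function name, the label strings, and the (epoch_seconds, lat, lon) input layout are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def flag_segments(fixes, max_kmh=160.0, jump_m=1000.0, jump_s=5.0):
    """Return one label (or None) per consecutive pair of fixes.

    fixes: list of (epoch_s, lat, lon) in recorded order.
    """
    flags = []
    for a, b in zip(fixes, fixes[1:]):
        dt = b[0] - a[0]
        dist = haversine_m(a[1], a[2], b[1], b[2])
        if dt <= 0:
            flags.append("timestamp_issue")      # duplicated or non-monotonic time
        elif dist > jump_m and dt < jump_s:
            flags.append("jump")                 # large displacement in a short time
        elif dist / dt * 3.6 > max_kmh:
            flags.append("implausible_speed")    # m/s converted to km/h
        else:
            flags.append(None)
    return flags


# Synthetic fixes along the equator
fixes = [
    (0, 0.0, 0.000),
    (10, 0.0, 0.001),   # about 111 m in 10 s: plausible
    (12, 0.0, 0.020),   # about 2.1 km in 2 s: jump
    (12, 0.0, 0.020),   # repeated timestamp
    (72, 0.0, 0.050),   # about 3.3 km in 60 s: roughly 200 km/h
]
labels = flag_segments(fixes)
```

The rule order matters: timestamp problems are checked first because speed is undefined when Δt is zero or negative.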
Report:
- Prevalence:
  - Point-level: % of points flagged anomalous; breakdown by type.
  - Trip-level: % of trips with ≥1 anomaly; median count of anomalies per affected trip.
- Severity:
  - Share of anomalies exceeding hard physical limits vs soft statistical thresholds.
  - Distribution of anomaly_score (if provided) and suggested operating points (precision/recall trade-offs).
- Concentration:
  - By device: top decile of devices by anomaly rate; Gini coefficient or Lorenz curve of anomaly contribution.
  - By time: anomaly rate by hour/day; spikes during rush hours or nighttime.
  - By location: hotspots where anomalies cluster (e.g., tunnels, high-rise canyons, near water).
- Root-cause indicators:
  - Correlation with horizontal_accuracy and sampling gaps.
  - Associations with specific phone models, OS versions, or app builds (if metadata available).
  - Map-matching residuals vs speed anomalies.
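The prevalence and device-concentration figures in this report list can be sketched as below. The input layouts ((trip_id, anomaly_flag) pairs and per-device anomaly counts) and the function names are assumptions; the Gini coefficient uses the standard sorted-rank formula, where 0 means anomalies are spread evenly across devices and values near 1 mean a few devices account for most of them.

```python
from collections import Counter
from statistics import median


def prevalence(points):
    """points: list of (trip_id, anomaly_flag) with flag in {0, 1} (assumed)."""
    per_trip = Counter()
    for trip_id, flag in points:
        per_trip[trip_id] += flag  # trips with no anomalies keep a count of 0
    affected = [c for c in per_trip.values() if c > 0]
    n_points, n_trips = len(points), len(per_trip)
    return {
        "point_rate_pct": 100.0 * sum(affected) / n_points if n_points else 0.0,
        "trip_rate_pct": 100.0 * len(affected) / n_trips if n_trips else 0.0,
        "median_per_affected_trip": median(affected) if affected else None,
    }


def gini(counts):
    """Gini coefficient of non-negative per-device anomaly counts."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Synthetic demo data
points = [("t1", 0), ("t1", 1), ("t1", 1), ("t2", 0), ("t3", 1), ("t3", 0)]
prev = prevalence(points)
even = gini([1, 1, 1, 1])            # every device contributes equally
concentrated = gini([0, 0, 0, 10])   # one device contributes everything
```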
Robust thresholds and methods
- Use robust statistics for outlier thresholds: median ± k·MAD (median absolute deviation) or IQR-based fences; avoid mean ± z·SD in heavy-tailed distributions.
- Speed ceiling examples (tunable to context):
  - Pedestrian: > 15 km/h sustained indicates mislabel or bicycle.
  - Bicycle: > 50 km/h indicates motorized transport or GPS error.
  - Urban motor vehicle: > 160 km/h indicates error.
- Teleportation: flag if the change in implied speed (displacement/Δt) between consecutive segments corresponds to a sustained acceleration beyond 4–6 m/s², or if segment distance exceeds 300–500 m with Δt < 2–3 s.
- Positional noise: flag oscillations with low net displacement and high variance in bearing at low speeds.
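A minimal sketch of the median ± k·MAD fence mentioned above. The default k = 3 and the optional scale factor are assumptions to tune: setting scale to 1.4826 makes the MAD comparable to a standard deviation under normality. mad_fences is an illustrative name.

```python
from statistics import median


def mad_fences(values, k=3.0, scale=1.0):
    """Return (lower, upper) outlier fences as median ± k * scale * MAD."""
    vals = list(values)
    med = median(vals)
    mad = median([abs(v - med) for v in vals])
    half_width = k * scale * mad
    return med - half_width, med + half_width


# Synthetic segment speeds in m/s with one extreme value
speeds = [10, 11, 12, 13, 14, 200]
lower, upper = mad_fences(speeds)
outliers = [s for s in speeds if s < lower or s > upper]
```

Unlike a mean ± z·SD rule, the single extreme value here barely moves the fences, which is why the robust version is preferred for heavy-tailed speed distributions. If more than half the values are identical the MAD is zero and the fences collapse, so a fallback such as IQR fences is worth having.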
Bias and representativeness
- Device and platform bias: compare trip rates by device type if available.
- Spatial bias: under-representation in suburban or low-density areas.
- Temporal bias: uneven sampling across hours/days; adjust analyses with weights if needed.
- Mode bias: if labels are crowd-sourced, estimate label noise rates using cross-validation or spot audits.
Quality controls and cleaning recommendations
- Standardize time zone and sort by device_id, timestamp.
- Remove or correct impossible coordinates; interpolate small gaps only when justifiable.
- Cap extreme speeds, or smooth them with Kalman or Savitzky–Golay filters where appropriate; document impacts.
- Use map-matching to road or path networks before computing route-based metrics; report match rate and average residual.
- Maintain an audit trail: pre-cleaning vs post-cleaning metrics and counts of removed/altered points.
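One way to keep the audit trail described above is to have each cleaning step record how many points it removed. The sketch below covers only range filtering, sorting, and exact-duplicate removal. It assumes points are dicts with device_id, timestamp (already standardized to UTC epoch seconds), latitude, and longitude; clean_points is a hypothetical name.

```python
def clean_points(points):
    """Return (cleaned_points, audit) where audit counts what each step removed."""
    audit = {"input": len(points)}

    # Step 1: drop impossible coordinates
    in_range = [p for p in points
                if -90 <= p["latitude"] <= 90 and -180 <= p["longitude"] <= 180]
    audit["removed_invalid_coords"] = len(points) - len(in_range)

    # Step 2: sort by device, then time
    in_range.sort(key=lambda p: (p["device_id"], p["timestamp"]))

    # Step 3: drop exact duplicates (same device, time and position)
    seen, cleaned = set(), []
    for p in in_range:
        key = (p["device_id"], p["timestamp"], p["latitude"], p["longitude"])
        if key not in seen:
            seen.add(key)
            cleaned.append(p)
    audit["removed_exact_duplicates"] = len(in_range) - len(cleaned)

    audit["output"] = len(cleaned)
    return cleaned, audit


# Synthetic demo data
raw = [
    {"device_id": "d2", "timestamp": 5, "latitude": 10.0, "longitude": 20.0},
    {"device_id": "d1", "timestamp": 7, "latitude": 10.0, "longitude": 20.0},
    {"device_id": "d1", "timestamp": 3, "latitude": 10.0, "longitude": 20.0},
    {"device_id": "d1", "timestamp": 3, "latitude": 10.0, "longitude": 20.0},
    {"device_id": "d1", "timestamp": 4, "latitude": 120.0, "longitude": 20.0},
]
cleaned, audit = clean_points(raw)
```

The audit dict can be stored next to the pre- and post-cleaning metrics so that the counts of removed points are reproducible.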
KPIs to include in the final report
- Coverage: total points, trips, devices; observation period; median sampling interval.
- Trip metrics: median distance, duration, median and 95th percentile speeds; detour index distribution.
- Quality: % points with accuracy > 30 m; % duplicate points; % trips with internal time gaps > 5 minutes.
- Anomalies: overall anomaly rate (points and trips), top 3 anomaly types with shares, top 5 spatial hotspots, top decile devices by anomaly contribution.
- Reliability uplift after cleaning: change in speed distribution tails, reduction in anomaly rate.
Assumptions and dependencies to confirm
- Coordinate system: WGS84; altitude usage if available.
- Time zone and clock synchronization; daylight saving handling.
- Definition of trip segmentation (gap threshold, stop duration).
- Source and meaning of anomaly labels (rule-based, model-based, or human annotation).
Next steps
- Share a data dictionary and a 1% sample to finalize thresholds and compute exact metrics.
- Produce baseline descriptive tables and visual summaries, then iterate on anomaly taxonomy with domain stakeholders.
- Establish monitoring dashboards to track KPIs over time and detect drift in anomaly rates or data quality.
If you provide the schema and a small sample (or aggregate counts), I can populate the specific metrics and generate the final concise report.