Synthetify — Synthetic Time Series Data Generation and Imputation Tool

2024–2025 | Python, Streamlit | Documentation

Interactive tool for imputing missing values and generating synthetic data in time series datasets. Users can upload a dataset, preprocess it, and generate synthetic data using statistical and ML models.

Problem / Motivation

Time series datasets often have missing values, irregular timestamps, and outliers, making analysis and modeling difficult. Manually imputing missing data or generating synthetic datasets is tedious and error-prone.

Synthetify addresses this by providing:

  • Automated preprocessing for time series data
  • Multiple imputation techniques to fill missing values accurately
  • Synthetic data generation for future predictions
  • Transparent model evaluation to ensure data consistency

This project demonstrates practical applications of statistical methods, imputation models, and synthetic data generation in time series workflows.

Core Functionalities

Data Upload and Preprocessing

  • Upload CSV files containing time series data.
  • Automatic identification of timestamp columns and frequency inference.
  • Preprocessing includes:
      • Handling missing values with KNN imputation
      • Outlier detection and replacement (Z-score, IQR)
      • Removal of duplicate values and timestamps
      • Dropping columns with more than 50% null values
  • Option to reintegrate missing timestamps
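
The preprocessing steps above can be sketched with pandas alone. This is a minimal illustration, not Synthetify's actual code: the function name `preprocess`, its parameters, the 1.5 × IQR rule, and the choice to replace outliers with the column median are all assumptions. KNN imputation is left out to keep the sketch dependency-free.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, ts_col: str, freq: str = "D",
               max_null_frac: float = 0.5) -> pd.DataFrame:
    """Hypothetical cleaning pipeline (illustrative only)."""
    out = df.copy()
    out[ts_col] = pd.to_datetime(out[ts_col])

    # Drop columns where more than max_null_frac of the values are null.
    out = out.loc[:, out.isna().mean() <= max_null_frac].copy()

    # Remove duplicate timestamps, keeping the first occurrence.
    out = out.drop_duplicates(subset=ts_col).set_index(ts_col).sort_index()

    # Reintegrate missing timestamps as NaN rows on a regular grid.
    full_index = pd.date_range(out.index.min(), out.index.max(), freq=freq)
    out = out.reindex(full_index)

    # Replace IQR outliers with the column median (assumed replacement rule).
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (out[col] < q1 - 1.5 * iqr) | (out[col] > q3 + 1.5 * iqr)
        out.loc[mask, col] = out[col].median()
    return out
```

Rows added for missing timestamps stay as NaN here, so a later imputation step has something explicit to fill.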

Dataset Analysis and Visualization

  • Provides dataset shape, data types, and statistical summary (df.describe())
  • Identifies timestamp column and inferred frequency
  • Overview of statistical properties (mean, median, mode)
  • Visualizations for null values and outliers
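
A summary like the one described above might be assembled as follows. The helper name `summarize` and the shape of the returned dictionary are assumptions made for illustration; only `df.describe()` is named in the original description.

```python
import pandas as pd


def summarize(df: pd.DataFrame, ts_col: str) -> dict:
    """Hypothetical dataset overview (illustrative only)."""
    ts = pd.DatetimeIndex(pd.to_datetime(df[ts_col])).sort_values()
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        # infer_freq returns None when the timestamps are irregular.
        "inferred_freq": pd.infer_freq(ts),
        "null_counts": df.isna().sum().to_dict(),
        "describe": numeric.describe(),
        "median": numeric.median().to_dict(),
        # mode() can return several rows; take the first (smallest) mode.
        "mode": numeric.mode().iloc[0].to_dict(),
    }
```

Note that `pd.infer_freq` needs at least three timestamps and raises a `ValueError` otherwise.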

Synthetic Data Generation

Imputation of Missing Values

  • Evaluates multiple imputation models to select the best fit
  • Models include:
      • Kernel density estimation (KDE), Inverse Transform Sampling (ITS), Copula, Monte Carlo simulation, Markov Chain Monte Carlo (MCMC)
      • Forward Fill, Backward Fill, Linear Interpolation
      • KNN, MICE, Random Forest, Iterative Imputers
  • Applies the selected model to impute missing values
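
As a rough illustration of comparing imputers, the sketch below hides a fraction of the known values, lets three of the simpler candidates fill them back in, and ranks them by mean absolute error. The holdout-and-MAE procedure is an assumption chosen for brevity, and the names `CANDIDATES` and `compare_imputers` are hypothetical; the tool's own selection criterion is the one described under Model Evaluation and Selection.

```python
import numpy as np
import pandas as pd

# Simple candidate imputers; each maps a Series with NaNs to a filled Series.
CANDIDATES = {
    "forward_fill": lambda s: s.ffill().bfill(),   # bfill covers a leading gap
    "backward_fill": lambda s: s.bfill().ffill(),  # ffill covers a trailing gap
    "linear": lambda s: s.interpolate(method="linear", limit_direction="both"),
}


def compare_imputers(series: pd.Series, holdout_frac: float = 0.2, seed: int = 0):
    """Score each candidate on artificially hidden values (illustrative only)."""
    rng = np.random.default_rng(seed)
    known = np.flatnonzero(series.notna().to_numpy())
    n_hold = max(1, int(len(known) * holdout_frac))
    held = rng.choice(known, size=n_hold, replace=False)

    # Hide the held-out positions on top of the gaps that already exist.
    masked = series.copy()
    masked.iloc[held] = np.nan

    scores = {}
    for name, impute in CANDIDATES.items():
        filled = impute(masked)
        error = (filled.iloc[held] - series.iloc[held]).abs().mean()
        scores[name] = float(error)
    best = min(scores, key=scores.get)
    return best, scores
```

The winning function can then be applied to the original series, for example `CANDIDATES[best](series)`.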

Generation of Future Data

  • Determines optimal model for generating future data
  • Models include KDE, Inverse Transform Sampling (ITS), Copula, Monte Carlo simulation, and Markov Chain Monte Carlo (MCMC)
  • Generates synthetic data for a user-defined number of future days
  • Option to calculate confidence intervals for generated data
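
One of the listed approaches, inverse transform sampling, can be sketched with NumPy by pushing uniform random draws through the empirical quantile function of the observed values. Repeating the draw many times gives a percentile band that can serve as a confidence interval. The function name `generate_future`, its parameters, the daily frequency, and the decision to sample values independently per day are assumptions, not a description of how Synthetify does it.

```python
import numpy as np
import pandas as pd


def generate_future(series: pd.Series, n_days: int, n_sims: int = 1000,
                    ci: float = 0.95, seed: int = 0) -> pd.DataFrame:
    """Hypothetical ITS-based generator; expects a DatetimeIndex."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()

    # Inverse transform sampling: uniform draws -> empirical quantiles.
    u = rng.uniform(size=n_sims * n_days)
    sims = np.quantile(observed, u).reshape(n_sims, n_days)

    # Percentile band across simulations for each future day.
    alpha = (1.0 - ci) / 2.0
    future_index = pd.date_range(series.index[-1], periods=n_days + 1, freq="D")[1:]
    return pd.DataFrame(
        {
            "synthetic": sims[0],  # one simulated path
            "lower": np.quantile(sims, alpha, axis=0),
            "upper": np.quantile(sims, 1.0 - alpha, axis=0),
        },
        index=future_index,
    )
```

Because each day is drawn independently from the same distribution, this sketch preserves the marginal distribution of the data but not trends, seasonality, or autocorrelation.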

Model Evaluation and Selection

  • Evaluates each model by comparing the statistical properties of its output with those of the original data, expressed as percentage changes
  • Selects the model that best preserves original data characteristics
  • Reports which model was chosen and why
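
A percentage-change comparison of this kind could look like the sketch below. The set of statistics, the equal weighting, and the names `stat_profile`, `pct_change_score`, and `select_best` are assumptions for illustration; the source does not specify the exact metric.

```python
import numpy as np
import pandas as pd


def stat_profile(s: pd.Series) -> pd.Series:
    """Summary statistics used as the comparison basis (assumed set)."""
    return pd.Series({
        "mean": s.mean(), "median": s.median(), "std": s.std(),
        "min": s.min(), "max": s.max(),
    })


def pct_change_score(original: pd.Series, candidate: pd.Series) -> float:
    """Mean absolute percentage change across the statistics; lower is better."""
    ref = stat_profile(original.dropna())
    new = stat_profile(candidate.dropna())
    # Statistics that are exactly zero in the original are skipped
    # to avoid division by zero.
    denom = ref.abs().replace(0, np.nan)
    return float(((new - ref).abs() / denom * 100).mean())


def select_best(original: pd.Series, candidates: dict):
    """Return the best candidate name and all scores, for transparency."""
    scores = {name: pct_change_score(original, c) for name, c in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

Returning the full score dictionary alongside the winner makes it straightforward to show the user why a model was selected.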

User Interface and Interaction

  • Streamlit-based interactive interface
  • Progress bars and success/error messages for better UX
  • Download options for preprocessed, imputed, and generated datasets

Potential Applications

  • Forecasting and prediction in time series data
  • Handling missing data in real-world datasets
  • Generating realistic synthetic datasets for research and development
  • Augmenting datasets for machine learning model training

Future Enhancements

  • Support for more advanced time series models
  • Integration with interactive visualization libraries
  • Additional evaluation metrics for model performance
  • Support for categorical time series data
  • Prediction and forecasting capabilities
  • Synthetic text data generation using NLP techniques