Synthetify — Synthetic Time Series Data Generation and Imputation Tool
2024–2025 | Python, Streamlit | Documentation
Interactive ML tool for imputing missing values and generating synthetic data in time series datasets. Users can upload their data, preprocess it, and generate synthetic data using statistical and ML models.
Problem / Motivation
Time series datasets often have missing values, irregular timestamps, and outliers, making analysis and modeling difficult. Manually imputing missing data or generating synthetic datasets is tedious and error-prone.
Synthetify addresses this by providing:
- Automated preprocessing for time series data
- Multiple imputation techniques to fill missing values accurately
- Synthetic data generation for future time periods
- Transparent model evaluation to ensure data consistency
This project demonstrates practical applications of statistical methods, imputation models, and synthetic data generation in time series workflows.
Core Functionalities
Data Upload and Preprocessing
- Upload CSV files containing time series data.
- Automatic identification of timestamp columns and frequency inference.
- Preprocessing includes:
  - Handling missing values with KNN imputation
  - Outlier detection and replacement (Z-score, IQR)
  - Removal of duplicate values and timestamps
  - Dropping columns with >50% null values
- Option to reintegrate missing timestamps
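The cleaning steps above can be sketched with pandas alone. This is a minimal illustration, not Synthetify's actual code: the function name `preprocess`, the column names, the 1.5 x IQR fences, and the choice to replace outliers with NaN (so that a later imputation step fills them) are all assumptions. The KNN imputation step is left out here.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, ts_col: str, max_null_frac: float = 0.5) -> pd.DataFrame:
    """Hypothetical cleaning pass; names and thresholds are illustrative."""
    out = df.copy()
    out[ts_col] = pd.to_datetime(out[ts_col])

    # Drop exact duplicate rows, then keep the first row per timestamp.
    out = out.drop_duplicates().drop_duplicates(subset=ts_col, keep="first")
    out = out.sort_values(ts_col).set_index(ts_col)

    # Drop columns whose share of null values exceeds the threshold.
    out = out.loc[:, out.isna().mean() <= max_null_frac]

    # Replace IQR outliers with NaN (assumed replacement strategy).
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        is_outlier = (out[col] < q1 - 1.5 * iqr) | (out[col] > q3 + 1.5 * iqr)
        out[col] = out[col].mask(is_outlier)

    # Reintegrate missing timestamps: use the most common gap between
    # consecutive timestamps as the step of a regular grid.
    step = out.index.to_series().diff().mode().iloc[0]
    full_index = pd.date_range(out.index.min(), out.index.max(), freq=step)
    return out.reindex(full_index).rename_axis(ts_col)


# Small made-up dataset: one duplicate row, one missing day, one outlier.
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04",
             "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", "2024-01-10"],
    "value": [10, 11, 11, 12, 1000, 13, 14, 12, 11, 10],
    "sparse": [1.0] + [np.nan] * 9,
})
clean = preprocess(raw, ts_col="date")
print(clean)
```

In this sketch the mostly empty `sparse` column is dropped, the duplicated timestamp is collapsed, and both the outlier and the reinserted day end up as NaN, ready for imputation.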
Dataset Analysis and Visualization
- Provides dataset shape, data types, and statistical summary (df.describe())
- Identifies timestamp column and inferred frequency
- Overview of statistical properties (mean, median, mode)
- Visualizations for null values and outliers
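As a rough illustration of the analysis step, the summary below collects the same quantities with pandas. The function name `summarize`, the returned dictionary layout, and the Z-score threshold of 3 are assumptions made for this sketch; the tool's own report may be organised differently.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame, ts_col: str, z_thresh: float = 3.0) -> dict:
    """Hypothetical dataset overview; keys and threshold are illustrative."""
    timestamps = pd.DatetimeIndex(pd.to_datetime(df[ts_col]))
    numeric = df.select_dtypes(include="number")
    # Z-scores per column; NaN values stay NaN and are never counted as outliers.
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        # infer_freq returns None when the timestamps are not regularly spaced.
        "frequency": pd.infer_freq(timestamps),
        "describe": numeric.describe(),
        "median": numeric.median().to_dict(),
        # First mode per column (assumes each numeric column has some data).
        "mode": numeric.mode().iloc[0].to_dict(),
        "null_counts": df.isna().sum().to_dict(),
        "outlier_counts": (z.abs() > z_thresh).sum().to_dict(),
    }


# Made-up daily data with one null and one extreme value.
values = np.arange(30.0)
values[5] = np.nan
values[10] = 1000.0
demo = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=30, freq="D"),
                     "value": values})
report = summarize(demo, ts_col="date")
print(report["frequency"], report["null_counts"], report["outlier_counts"])
```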
Synthetic Data Generation
Imputation of Missing Values
- Evaluates multiple imputation models to select the best fit
- Models include:
  - Kernel Density Estimation (KDE), Inverse Transform Sampling (ITS), Copula, Monte Carlo Simulation, Markov Chain Monte Carlo (MCMC)
  - Forward Fill, Backward Fill, Linear Interpolation
  - KNN, MICE, Random Forest, Iterative Imputers
- Applies the selected model to impute missing values
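To make the simplest of these candidates concrete, here is a small sketch that fills the same series by forward fill, backward fill, and linear interpolation. The helper name `impute_candidates` is hypothetical, and the fallback for gaps at the start or end of the series is an assumption; the statistical and ML-based imputers in the list are not shown.

```python
import numpy as np
import pandas as pd


def impute_candidates(series: pd.Series) -> dict:
    """Return the series filled by three simple methods (illustrative helper)."""
    return {
        # Carry the last observation forward; back-fill any leading gap.
        "forward_fill": series.ffill().bfill(),
        # Carry the next observation backward; forward-fill any trailing gap.
        "backward_fill": series.bfill().ffill(),
        # Straight line between neighbouring observations, treating rows as
        # equally spaced; gaps at the edges take the nearest observed value.
        "linear": series.interpolate(method="linear", limit_direction="both"),
    }


gappy = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
filled = impute_candidates(gappy)
for name, candidate in filled.items():
    print(name, candidate.tolist())
```

Each candidate can then be compared against the original data before one is applied.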
Generation of Future Data
- Determines optimal model for generating future data
- Models include KDE, ITS, Copula, Monte Carlo, MCMC
- Generates synthetic data for a user-defined number of future days
- Option to calculate confidence intervals for generated data
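One way to picture KDE-based generation with intervals is the NumPy sketch below. It rests on several assumptions that the description above does not confirm: values are treated as independent draws (temporal dependence is ignored), the kernel is Gaussian with a Scott's-rule bandwidth, and the "confidence interval" is a percentile band over many simulated paths. The function name `generate_future_kde` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def generate_future_kde(series: pd.Series, n_days: int,
                        n_paths: int = 500, seed: int = 0) -> pd.DataFrame:
    """Sample future daily values from a Gaussian KDE of the observed values."""
    rng = np.random.default_rng(seed)
    values = series.dropna().to_numpy(dtype=float)

    # Scott's rule for a one-dimensional Gaussian kernel bandwidth.
    bandwidth = values.std(ddof=1) * values.size ** (-1 / 5)

    # Drawing from a Gaussian KDE is equivalent to resampling an observation
    # and adding Gaussian noise with standard deviation equal to the bandwidth.
    picks = rng.choice(values, size=(n_paths, n_days), replace=True)
    paths = picks + rng.normal(0.0, bandwidth, size=(n_paths, n_days))

    # Percentile band across simulated paths (an assumed interval definition).
    lower, upper = np.percentile(paths, [2.5, 97.5], axis=0)
    start = series.index.max() + pd.Timedelta(days=1)
    index = pd.date_range(start, periods=n_days, freq="D")
    return pd.DataFrame(
        {"generated": paths[0], "lower_95": lower, "upper_95": upper}, index=index
    )


# Made-up history: 60 days of noisy values around 100.
history = pd.Series(
    100 + np.random.default_rng(1).normal(0, 5, size=60),
    index=pd.date_range("2024-01-01", periods=60, freq="D"),
)
future = generate_future_kde(history, n_days=7)
print(future)
```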
Model Evaluation and Selection
- Evaluates models by comparing statistical properties of the original and generated data, including percentage changes
- Selects the model that best preserves original data characteristics
- Reports which model was chosen and the reasoning behind the choice
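A percentage-change comparison of this kind could look like the following sketch. The choice of statistics (mean, median, standard deviation), the equal-weight average, and the names `percent_change_score` and `select_best` are assumptions for illustration only; the tool's actual scoring rule is not specified here.

```python
import numpy as np
import pandas as pd

STATS = ["mean", "median", "std"]


def percent_change_score(original: pd.Series, candidate: pd.Series) -> float:
    """Average absolute percentage change of summary statistics (lower is better)."""
    base = original.agg(STATS)   # NaN values are skipped by pandas
    new = candidate.agg(STATS)
    # Note: a statistic that is exactly zero in the original yields inf or NaN.
    pct = ((new - base) / base).abs() * 100
    return float(pct.mean())


def select_best(original: pd.Series, candidates: dict) -> tuple:
    """Score every candidate and return the best name with all scores."""
    scores = {name: percent_change_score(original, c) for name, c in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


original = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0])
candidates = {
    "linear": original.interpolate(method="linear"),
    "zero_fill": original.fillna(0.0),
}
best, scores = select_best(original, candidates)
print(best, scores)
```

Returning the full score table alongside the winner is what lets the interface show why a model was chosen.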
User Interface and Interaction
- Streamlit-based interactive interface
- Progress bars and success/error messages for better UX
- Download options for preprocessed, imputed, and generated datasets
Potential Applications
- Forecasting and prediction in time series data
- Handling missing data in real-world datasets
- Generating realistic synthetic datasets for research and development
- Augmenting datasets for machine learning model training
Future Enhancements
- Support for more advanced time series models
- Integration with interactive visualization libraries
- Additional evaluation metrics for model performance
- Support for categorical time series data
- Prediction and forecasting capabilities
- Synthetic text data generation using NLP techniques
Links
- Documentation: Synthetify Documentation