
Enabling Rapid Analytics from Diverse Data Sources with Data Lake

  • Writer: Pevatrons Engineer
  • Mar 26, 2023
  • 1 min read

Updated: Apr 24, 2024






Challenges with the existing system

The client is a medium-scale apparel company that wanted to automate data collection and storage and to improve data quality for faster, more accurate analysis. Their data previously had to be sourced and manually ingested from over 30 different entities, and there was no consistent template for collecting it. Their marketing team also wanted user-friendly data visualizations to get a holistic view of advertising spend, generate insights faster, and optimize marketing ROI.

Our Solution

We built product-based data models to standardize the input files from different data providers and channels. The datasets, which arrived on different schedules, were then transformed to fit their respective models. We set up ETL pipelines orchestrated with Apache Airflow; the data was pre-processed and transformed using AWS Batch and AWS Glue, then served through Amazon Athena. We also built automated pipelines to pull the data from SFTP and batch data sources. Superset dashboards gave data partners a complete view of the advertising spend and let them monitor data metrics.
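To illustrate the standardization step, here is a minimal sketch of mapping provider-specific files onto one shared model. The provider names, column names, date formats, and the `standardize` function are all hypothetical and chosen for illustration; they are not the client's actual schema or our production code.

```python
import csv
import io
from datetime import datetime
from typing import Dict, List

# Hypothetical per-provider column mappings: source column -> model field.
PROVIDER_MAPPINGS = {
    "provider_a": {"Date": "spend_date", "Channel": "channel", "Cost (USD)": "spend_usd"},
    "provider_b": {"day": "spend_date", "media": "channel", "amount": "spend_usd"},
}

# Hypothetical date format used by each provider.
DATE_FORMATS = {"provider_a": "%m/%d/%Y", "provider_b": "%Y-%m-%d"}


def standardize(provider: str, raw_csv: str) -> List[Dict[str, object]]:
    """Map one provider's CSV rows onto the shared advertising-spend model."""
    mapping = PROVIDER_MAPPINGS[provider]
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        # Rename the provider's columns to the model's field names.
        row = {target: record[source].strip() for source, target in mapping.items()}
        # Normalize types so every provider produces identical output.
        parsed = datetime.strptime(row["spend_date"], DATE_FORMATS[provider])
        row["spend_date"] = parsed.date().isoformat()
        row["channel"] = row["channel"].lower()
        row["spend_usd"] = float(row["spend_usd"])
        row["provider"] = provider
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample_a = "Date,Channel,Cost (USD)\n03/26/2023,Social,120.50\n"
    sample_b = "day,media,amount\n2023-03-26,search,75\n"
    for name, data in (("provider_a", sample_a), ("provider_b", sample_b)):
        print(standardize(name, data))
```

Keeping the mappings in data rather than in code means that onboarding a new provider is mostly a matter of adding one dictionary entry, which matters when there are more than 30 sources.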

Challenges faced while building a centralized data lake

  • Data was sourced from different providers, so we had to write transformations for each source.

  • Multiple testing approaches had to be used in the data ingestion and integration stages.

  • Data profiling had to be applied at each stage to better understand the data.

  • In-pipeline observability was needed to identify anomalies and fraud events quickly.

  • Intelligent pipelines were needed to discover, mask, obfuscate, and encrypt sensitive data in motion.

  • The data pipelines had to be built, tested, and performance-managed so that they met the requirements at scale once in production.
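The profiling and masking points above can be sketched roughly as follows. This is an illustrative example using only the Python standard library; the function names (`profile`, `mask`, `mask_fields`), the field names, and the salt are assumptions, not the pipeline we deployed. Note that a salted hash is a form of pseudonymization, not encryption, so it covers only the masking part of that requirement.

```python
import hashlib
from typing import Dict, Iterable, List


def profile(rows: List[dict], fields: Iterable[str]) -> Dict[str, dict]:
    """Return row, missing, and distinct counts for each field."""
    report = {}
    for field in fields:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[field] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def mask(value: str, salt: str) -> str:
    """Replace a value with a truncated, salted SHA-256 digest (one-way)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_fields(rows: List[dict], sensitive: Iterable[str], salt: str) -> List[dict]:
    """Return copies of the rows with the sensitive fields masked."""
    sensitive = set(sensitive)
    masked = []
    for row in rows:
        masked.append({
            key: mask(str(value), salt)
            if key in sensitive and value not in (None, "")
            else value
            for key, value in row.items()
        })
    return masked


if __name__ == "__main__":
    # Hypothetical records; "email" stands in for any sensitive field.
    rows = [
        {"email": "a@example.com", "channel": "social", "spend_usd": 120.5},
        {"email": "", "channel": "search", "spend_usd": 75.0},
        {"email": "a@example.com", "channel": "social", "spend_usd": 30.0},
    ]
    print(profile(rows, ["email", "channel"]))
    print(mask_fields(rows, ["email"], salt="example-salt"))
```

Running the profile at each stage and comparing the reports is a cheap way to notice when a transformation drops rows or introduces missing values. Because the hash is deterministic for a given salt, masked values can still be joined and counted downstream.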

Architecture

[Image: architecture diagram]

Results

[Image: results]



© 2024 By PeVatrons
