The JValue Project

Making open data safe, easy, and reliable to use

Final Thesis: ETL Data Pipelines Configurations in Spark

Abstract: The JValue Open Data Service (ODS) is an ETL data pipeline that extracts data from different source systems (Extract), transforms the extracted data (Transform), and loads the result into a target database (Load). Various stream processing engines exist to cope with data of high volume, variety, and velocity. Existing ETLs cannot be applied across different streaming services, and the use of various frameworks and programming languages adds complexity. Among the different streaming services, Apache Spark offers accelerated, reusable, and scalable ETLs. This thesis suggests an approach to compile and configure a data pipeline so that it runs on Apache Spark.

Keywords: ETL pipeline, stream processing

PDF: Bachelor Thesis

Reference: Gizem Batmaci. ETL Data Pipelines Configurations in Spark. Bachelor Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.