Introduction:
In this blog we will discuss Spark ETL (Extract, Transform and Load) and ELT (Extract, Load and Transform). To connect Spark to different data sources we need to install the appropriate libraries, so we will cover how to install all the required libraries, how to connect to the different data sources, and how to extract, transform and load data.
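As a taste of what that looks like in practice, here is a minimal sketch of pulling a connector library into a Spark session and extracting data from one source. This is only an illustration: the MySQL JDBC package coordinates, host, database, table and credentials below are all placeholder assumptions, not part of the setup described in this series.

```python
from pyspark.sql import SparkSession

# Pull the MySQL JDBC driver at session start-up via Maven coordinates
# (package version and connection details below are illustrative).
spark = (
    SparkSession.builder
    .appName("spark-etl-demo")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

# Extract: read a table over JDBC (hypothetical host, database and credentials)
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/demo_db")
    .option("dbtable", "customers")
    .option("user", "demo_user")
    .option("password", "demo_password")
    .load()
)
df.show(5)
```

Each chapter in this series will walk through the same pattern (install the library, connect, then extract, transform and load) for its own data source.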
System Setup:
Before starting, if you don’t have your Data Engineering setup ready, please go through the blog and video below so that your system is ready to execute the Spark ETL pipelines that follow.
https://developershome.blog/2023/01/30/data-engineering-tool-suite/
Spark ETL Pipelines:
In the coming days, we will cover the Spark ETL topics and data source connections listed below (I will add a link for each ETL process once its video and blog are available):
0. Chapter0 -> Spark ETL with Files (CSV | JSON | Parquet) (a minimal sketch follows this list)
1. Chapter1 -> Spark ETL with SQL Database (MySQL | PostgreSQL)
2. Chapter2 -> Spark ETL with NoSQL Database (MongoDB)
3. Chapter3 -> Spark ETL with Azure (Blob | ADLS)
4. Chapter4 -> Spark ETL with AWS (S3 bucket)
5. Chapter5 -> Spark ETL with Hive tables
6. Chapter6 -> Spark ETL with APIs
7. Chapter7 -> Spark ETL with Lakehouse (Delta)
8. Chapter8 -> Spark ETL with Lakehouse (HUDI)
9. Chapter9 -> Spark ETL with Lakehouse (Apache Iceberg)
10. Chapter10 -> Spark ETL with Lakehouse (Delta vs Iceberg vs HUDI)
11. Chapter11 -> Spark ETL with Lakehouse (Delta table Optimization)
12. Chapter12 -> Spark ETL with Lakehouse (Apache Kafka)
13. Chapter13 -> Spark ETL with GCP (BigQuery)
14. Chapter14 -> Spark ETL with Hadoop (Apache Sqoop)
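To preview the simplest case from Chapter0 (file-based ETL), here is a minimal PySpark sketch. The file paths, column names and filter condition are placeholder assumptions for illustration only; the actual chapter will use its own dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("file-etl-sketch").getOrCreate()

# Extract: read a CSV file with a header row (path is a placeholder)
orders = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/data/raw/orders.csv")
)

# Transform: keep completed orders and add a load timestamp (columns are illustrative)
completed = (
    orders.filter(F.col("status") == "COMPLETED")
          .withColumn("load_ts", F.current_timestamp())
)

# Load: write the result as Parquet, partitioned by order date
completed.write.mode("overwrite").partitionBy("order_date").parquet("/data/curated/orders/")
```

The same extract, transform and load structure carries through every chapter; only the connectors, formats and configuration change.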