Data Engineering Problem 2 (Employees with Sales Department)

Please find earlier blogs to have understanding of our Data Engineering Learning plan and system setup for Data Engineering. Today we are solving and learning one more Data engineering problem and learning new concepts.

Problem Statement

We have a table with employee’s tables in which we have employee details with salary and department id of the employees. We have one more table in which we have department id and department name. Provide below queries

Use both these tables and list all the employees working in the marketing department with the highest to lowest salary order.
Provide a count of employees in each department with the department name.

Problem Difficulty Level: Easy

Data Structure

Employee table

Department table

Data for employee salary table

In CSV Format

Data for department table

In CSV Format

Please clone below GitHub repo in which we have all the data and problem solution with PostgreSQL, Apache Spark and Spark SQL.

https://github.com/developershomes/DataEngineeringProblems/tree/main/Problem%202

Solve using SQL (PostgreSQL)

First create tables into PostgreSQL using create script and also import data into tables as we discussed in earlier blogs. And list both tables’ data so we have understanding of data.

Employee Salary table

Department table

Here, we see that the employee salary table is having department id and department details are stored into the department master table. So, we need to join both tables if we need depart name details.

Going to our problem statement

“Use both these tables and list all the employees working in the marketing department with the highest to lowest salary order.”

We will divide this problem into multiple steps so that it is easy to solve.

First, we will join both tables so that we have department name with each employee
Second, we will filter with department “Marketing”
Thirdly, we will order data based on salary in ascending order.

First part joining two tables

We have department_id as a key in both tables
And we need department details with respect to each employee and that’s we will use left outer join.

SELECT emp.*, department.*
	FROM public.employee_salary as emp
	LEFT OUTER JOIN public.department as department
	ON emp.department_id = department.department_id;

Now, we have department name with each employee.

Second part filter with “Marketing” department.

SELECT emp.*, department.*
	FROM public.employee_salary as emp
	LEFT OUTER JOIN public.department as department
	ON emp.department_id = department.department_id
	WHERE department.department_name = 'Marketing';

Third and final step is ordering based on salary

SELECT emp.*, department.*
	FROM public.employee_salary as emp
	LEFT OUTER JOIN public.department as department
	ON emp.department_id = department.department_id
	WHERE department.department_name = 'Marketing'
	ORDER BY salary DESC;

Solve Problem using Spark

In Spark, we will solve this problem using PySpark functions and also using Spark SQL.

First for both, we need to load data into Dataframe. First open Jupyter Lab and create one file and paste CSV data into that file.

As we discussed earlier, we will first load both the CSV files into dataframe and select data and also check count of rows so that we are familiar with data.

For Employee Salary CSV file

Same for Department CSV file

Solve using PySpark functions

As we did in PostgreSQL, here we will also divide this problem into multiple steps so it will be easy to solve.

First, we will join two data frames using department_id as it is key column in both

Second step is filtering this using “Marketing” and display salary in highest to lowest order

Solve using Spark SQL

Before using Spark SQL, we need to create a temp table (or HIVE table) from data frame.

Employee table

Department table

Now, we will use same query which we used in PostgreSQL with changing table names with temp table names in Spark

This is how we have solved this problem using PostgreSQL, Apache Spark function and Spark SQL.

We also learned below concepts

Join function
ORDER BY function and
Filtering on data

Please find below video for understanding using Video. I have not explained the second part here for that use Github query or below video. Please comment if you have any query on this one.

In below video we are solving seocnd part of this problem.

Data Engineering Problem 2 (Employees with Sales Department)

Problem Statement

Solve using SQL (PostgreSQL)

Solve Problem using Spark

Solve using PySpark functions

Solve using Spark SQL

One thought on “Data Engineering Problem 2 (Employees with Sales Department)”

Add yours

Leave a reply to Yaw Cancel reply

Problem Statement

Solve using SQL (PostgreSQL)

Solve Problem using Spark

Solve using PySpark functions

Solve using Spark SQL

Share this:

Related

One thought on “Data Engineering Problem 2 (Employees with Sales Department)”

Add yours

Leave a reply to Yaw Cancel reply