Data Engineering Problem 3 (Find diff between count of cities and distinct count of cities)

Please find earlier blogs to have understanding of our Data Engineering Learning plan and system setup for Data Engineering. Today we are solving and learning one more Data engineering problem and learning new concepts.

Problem Statement

Find the difference between the total number of CITY entries in the table and the number of distinct CITY entries in the table. The STATION table is described as follows:

Problem Difficulty Level: Easy

Data Structure

ID
City
State
Lattitude
Longitude

Data for station table

In CSV Format

Please clone below GitHub repo in which we have all the data and problem solution with PostgreSQL, Apache Spark and Spark SQL.

https://github.com/developershomes/DataEngineeringProblems/tree/main/Problem%203

Solve using SQL (PostgreSQL)

First create tables into PostgreSQL using create script and also import data into tables as we discussed in earlier blogs.

Reference blog:

Data Engineering Problem 0 (Employees with salary more than 100K)

Once data is loaded in table, we will do select query to check data,

As usual, we will divide this problem into multiple steps.

Get count of all cities
Get distinct count of all cities
Create query which will subtract from total count of cities and distinct city count

For getting count of cities, we will use below query

SELECT count(city) FROM public. Station;

Now, we will count distinct cities, for that we will use below query.

SELECT count(distinct(city)) FROM public. Station;

Now, we will have final query as below

SELECT count(city) as citycount, count(distinct(city)) as distinctcitycount,(count(city)  - count(distinct(city))) as diffbetweenboth
	FROM public. Station;

Solve Problem using Spark

In Spark, we will solve this problem using PySpark functions and also using Spark SQL.

First for both, we need to load data into Dataframe. First open Jupyter Lab and create one file and paste CSV data into that file.

Now, create Jupyter notebook using python as engine (kernel).

The first step is to load data into data frame.

Once data loaded as below, also check schema so we are sure that all the column is having proper datatypes.

Solve using PySpark functions

In PySpark, we have function count and countDistinct.

count function -> It gives count of rows

countDistinct function -> This function gives distinct count of rows

We first need to import these functions from pyspark.sql

#Solving Problem using PySpark 
# ind the difference between the total number of CITY entries in the table and the number of distinct CITY entries in the table. 
from pyspark.sql.functions import countDistinct
from pyspark.sql.functions import count
stationdf.select(count("City") - countDistinct("City")).show()

Solve using Spark SQL

Before using Spark SQL, we need to create a temp table (or HIVE table) from data frame.

We have already solved this problem using PostgreSQL earlier and we know that Spark SQL is always works with almost all ANSI SQL functions. So we will direct use that query by changing table name with temp table.

# Now we will write query to get max salary for each employee 
# so we will use SQL Group by and SQL Order by functions 
sqlContext.sql("SELECT count(city) as citycount, count(distinct(city)) as distinctcitycount \
                ,(count(city)  - count(distinct(city))) as diffbetweenboth \
                FROM tmpStation").show(n=100)

Conclusion:

This is how we solved this problem in SQL and Apache Spark.

Today, we have learned the functions below.

count -> For getting count

distinct -> to get distinct values.

Please also find the video to understand this problem in more detail. And also follow this blog to learn more.