Please read the earlier blogs in this series to understand our Data Engineering learning plan and the system setup for Data Engineering. Today we are solving one more data engineering problem and learning new concepts along the way.
Problem Statement
Query the two cities in STATION with the shortest and longest CITY names, as well as their respective lengths (i.e., the number of characters in the name). If there is more than one smallest or largest city, choose the one that comes first alphabetically. The STATION table is described as follows:
Problem Difficulty Level: Hard
Data Structure
- ID
- City
- State
- Latitude
- Longitude

Data for the STATION table
Sample Input
For example, CITY has four entries: DEF, ABC, PQRS and WXY.
Sample Output:
ABC 3
PQRS 4
ABC, DEF, and WXY are all three characters long; ABC comes first alphabetically, so it is chosen as the shortest. PQRS, with four characters, is the longest.
Please clone the GitHub repo below, which contains all the data and the problem solutions in PostgreSQL, Apache Spark, and Spark SQL.
https://github.com/developershomes/DataEngineeringProblems/tree/main/Problem%204
Solve using SQL (PostgreSQL)
First, create the table in PostgreSQL using the create script and import the data into it, as we discussed in earlier blogs.
Once the data is loaded in the table, we run a SELECT query to check the data and get an understanding of it.
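If you prefer to do this setup from Python rather than a SQL client, here is a minimal sketch using psycopg2. The column types, the connection parameters, and the station.csv file name are assumptions, so adjust them to your setup and the create script in the repo.

```python
import psycopg2

# Assumed connection parameters -- replace with your own
conn = psycopg2.connect(dbname="postgres", user="postgres",
                        password="postgres", host="localhost")
cur = conn.cursor()

# Table layout based on the data structure listed above; types are assumptions
cur.execute("""
    CREATE TABLE IF NOT EXISTS station (
        id        INT,
        city      VARCHAR(50),
        state     VARCHAR(50),
        latitude  NUMERIC,
        longitude NUMERIC
    );
""")

# Import the CSV from the cloned repo (file name is hypothetical)
with open("station.csv") as f:
    cur.copy_expert("COPY station FROM STDIN WITH CSV HEADER", f)

conn.commit()
cur.close()
conn.close()
```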

As usual, we will divide this problem into multiple steps:
- We need the shortest and longest city names along with their lengths.
- So first we will get the length of every city name.
- If several cities share the same length, we will order them alphabetically and rank them.
- Then we will find the minimum and maximum city-name lengths.
- Finally, we will filter to the two cities with the shortest and longest names, ordered alphabetically.
First step: we will get the length of each city name. For that we will use the LENGTH() function in SQL.
SELECT city, LENGTH(city) FROM public.station;

In the next step, if we have two cities with the same length, we will prioritize them alphabetically. In other words, we will rank cities of the same length in alphabetical order.
Here we will use the window function RANK() to rank the cities. We want the ranks computed separately within each group of cities that share the same length, so we use PARTITION BY, and since we want to order by city name we also use ORDER BY. Our query is as below.
SELECT CITY
,LENGTH(CITY) as citylength
,RANK() OVER (PARTITION BY LENGTH(CITY) ORDER BY LENGTH(CITY), CITY) as actualrank
FROM STATION;

We can see that with length 3 we have two cities, Amo and Roy, and RANK() gave Amo rank 1 and Roy rank 2.
So, to keep only the first city at each length, we can simply say that we only want the rows with actualrank = 1.
In the next step, we will get the minimum and maximum name lengths across all cities.
For that we will use the MIN() and MAX() functions in SQL. Our query is as below.
SELECT MIN(LENGTH(CITY)) as minimumLength, MAX(LENGTH(CITY)) as maximumLength FROM STATION;

Now for the final part, where we merge all of the queries above.
First, we filter to all the cities with rank 1 (we named this column actualrank).
SELECT q1.city, q1.citylength
FROM
(SELECT CITY, LENGTH(CITY) as citylength, RANK() OVER (PARTITION BY LENGTH(CITY) ORDER BY LENGTH(CITY), CITY) as actualrank
FROM STATION) q1
WHERE q1.actualrank = 1;

Next step: filter on the minimum and maximum lengths. Note the parentheses around the OR condition: AND binds more tightly than OR, so without them the actualrank = 1 filter would only apply to the minimum-length branch and we would get every city with the maximum length.
SELECT q1.city, q1.citylength
FROM
(SELECT CITY, LENGTH(CITY) as citylength, RANK() OVER (PARTITION BY LENGTH(CITY) ORDER BY LENGTH(CITY), CITY) as actualrank
FROM STATION) q1
WHERE q1.actualrank = 1
AND (q1.citylength = (SELECT MIN(LENGTH(CITY)) FROM STATION)
OR q1.citylength = (SELECT MAX(LENGTH(CITY)) FROM STATION));

This is how we solved this complex problem by dividing it into small steps.
Solve Problem using Spark
In Spark, we will solve this problem using PySpark functions and also using Spark SQL.
For both, we first need to load the data into a DataFrame. Open JupyterLab, create a file, and paste the CSV data into it.
Now, create a Jupyter notebook using Python as the kernel.

We have already loaded the data into a DataFrame.
Now we check the schema so we are sure that all the columns have proper datatypes, and we also take a look at the data. A minimal sketch of these steps follows.

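This sketch assumes the file is named station.csv and the DataFrame is named df; the app name is also an assumption.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("Problem4").getOrCreate()

# Load the CSV file into a DataFrame; inferSchema guesses the column datatypes
df = spark.read.csv("station.csv", header=True, inferSchema=True)

# Check the schema so we are sure the columns have proper datatypes
df.printSchema()

# Check the data
df.show(n=10)
```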
Solve using Spark SQL
Before using Spark SQL, we need to create a temp view (or Hive table) from the DataFrame.

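A one-line sketch, assuming the DataFrame from the load step is named df; the view name matches the tmpStation table used in the query below.

```python
# Register the DataFrame as a temporary view so Spark SQL can query it
df.createOrReplaceTempView("tmpStation")
```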
We have already solved this problem using PostgreSQL, and we know that Spark SQL supports almost all ANSI SQL functions. So we can use that query directly, changing only the table name to the temp view.
The same LENGTH(), RANK(), MIN(), and MAX() functions are available in Spark SQL.
# Query the two cities with the shortest and longest names and their lengths
# using LENGTH(), RANK() over a window, and MIN()/MAX() subqueries
sqlContext.sql("SELECT q1.city, q1.citylength FROM \
(SELECT CITY,LENGTH(CITY) as citylength, RANK() OVER (PARTITION BY LENGTH(CITY) ORDER BY LENGTH(CITY),CITY) as actualrank \
FROM tmpStation) q1 \
WHERE q1.actualrank = 1 \
AND (q1.citylength = (SELECT MIN(LENGTH(CITY)) FROM tmpStation) \
OR q1.citylength = (SELECT MAX(LENGTH(CITY)) FROM tmpStation))").show(n=100)
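We also said we would solve this with PySpark functions. Below is a minimal sketch using the DataFrame API; it is an illustration under the same assumptions (a DataFrame named df with a city column), not the repo's original solution.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank cities alphabetically within each group of equal name length
w = Window.partitionBy(F.length("city")).orderBy("city")
ranked = (df
          .withColumn("citylength", F.length("city"))
          .withColumn("actualrank", F.rank().over(w)))

# Find the minimum and maximum name lengths across all cities
bounds = ranked.agg(F.min("citylength").alias("mn"),
                    F.max("citylength").alias("mx")).first()

# Keep the first city alphabetically at the shortest and longest lengths
(ranked
 .filter((F.col("actualrank") == 1) &
         F.col("citylength").isin(bounds["mn"], bounds["mx"]))
 .select("city", "citylength")
 .show())
```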

Conclusion
Today, we learned some very important and widely used SQL functions:
LENGTH() -> returns the length of a string
MIN() -> returns the minimum value of a column
MAX() -> returns the maximum value of a column
RANK() -> assigns a rank to each row within its window. ORDER BY inside the OVER clause determines the ranking order (ascending or descending), and an optional PARTITION BY restarts the ranking within each group.
Please also watch the videos below to understand this problem in more detail, and follow this blog to learn more.
Part 1 (Using PostgreSQL)
Using Apache Spark (Spark SQL)