SQL window functions in data science interviews conducted by Airbnb, Netflix, Twitter and Uber

Window functions are a group of functions that perform calculations across a set of rows related to the current row. They are considered advanced SQL and are often asked about during data science interviews. They are also used a lot at work to solve many different types of problems. Let’s summarize the 4 different types of window functions and explain why and when you would use them.

4 types of window functions

1. Regular aggregate functions

o These are aggregates such as AVG, MIN/MAX, COUNT, SUM

o You will want to use these to aggregate your data and group it by another column like month or year

2. Ranking functions

o ROW_NUMBER, RANK, DENSE_RANK

o These are functions that help you rank your data. You can rank your entire data set or rank it within groups, such as by month or country.

o Extremely useful for generating ranking indices within groups

3. Generation of statistics

o These are great if you need to generate simple statistics with NTILE (percentiles, quartiles, medians)

o You can use this for your entire dataset or by group

4. Handling time series data

o A very common use of window functions, especially if you need to calculate trends like a monthly moving average or growth metric.

o LAG and LEAD are the two functions that allow you to do this.

1. Regular aggregate function

Regular aggregate functions are functions like average, count, sum, and min/max that are applied to columns. Use them as window functions when you want to apply an aggregation to different groups in the dataset, such as each month.

This is similar to the type of calculation that can be done with an aggregate function in the SELECT clause. But unlike regular aggregate functions, window functions do not group multiple rows into a single output row; each row retains its own identity.

AVG() Example:
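To make the contrast concrete, here is a minimal sketch that runs both versions against a small in-memory table. The `trips` table and its values are made up for illustration, and it assumes Python’s bundled SQLite is version 3.25 or newer, which is when SQLite gained window functions.

```python
import sqlite3

# Hypothetical toy table: three trips across two months.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (request_month TEXT, dist_to_cost REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2020-01", 2.0), ("2020-01", 4.0), ("2020-02", 6.0)],
)

# GROUP BY collapses each month into a single output row.
grouped = conn.execute(
    """
    SELECT request_month, AVG(dist_to_cost) AS month_avg
    FROM trips
    GROUP BY request_month
    ORDER BY request_month
    """
).fetchall()

# The window version keeps every input row and attaches the month average to it.
windowed = conn.execute(
    """
    SELECT request_month,
           dist_to_cost,
           AVG(dist_to_cost) OVER (PARTITION BY request_month) AS month_avg
    FROM trips
    ORDER BY request_month, dist_to_cost
    """
).fetchall()

print(grouped)   # one row per month
print(windowed)  # one row per trip
```

The grouped query returns two rows, while the windowed query returns all three trips, each carrying its month’s average alongside its own value.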

Let’s take a look at an example of an avg() window function implemented to answer a data analysis question. You can view the question and write the code at the following link:

platform.stratascratch.com/coding-question?id=10302&python=

This is a perfect example of using a window function and applying avg() to groups of months. Here we are trying to calculate the average distance per dollar by month, which is hard to do in SQL without a window function. We apply the avg() window function in the third column, where each row gets the average for its month-year. We can use this metric to calculate the difference between the monthly average and the value for each request date in the table.

The code to implement the window function would look like this:

    SELECT a.request_date,
           a.dist_to_cost,
           -- average of dist_to_cost over all rows in the same month
           AVG(a.dist_to_cost) OVER (PARTITION BY a.request_month) AS avg_dist_to_cost
    FROM
      (SELECT *,
              -- derive a year-month key to partition on
              to_char(request_date::date, 'YYYY-MM') AS request_month,
              (distance_to_trip/cost_monetary) AS dist_to_cost
       FROM uber_request_logs) a
    ORDER BY request_date
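The last step mentioned above, comparing each request date against its monthly average, could look roughly like the sketch below. The `request_logs` table and its values are hypothetical, and because the sketch runs on SQLite (3.25+) rather than PostgreSQL, it uses `strftime` instead of `to_char` to derive the month.

```python
import sqlite3

# Hypothetical toy data: one distance-per-dollar value per request date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE request_logs (request_date TEXT, dist_to_cost REAL)")
conn.executemany(
    "INSERT INTO request_logs VALUES (?, ?)",
    [("2020-01-05", 2.0), ("2020-01-20", 4.0), ("2020-02-10", 6.0)],
)

# Subtract the month's average from each row's own value.
diffs = conn.execute(
    """
    SELECT request_date,
           dist_to_cost
             - AVG(dist_to_cost) OVER (
                 PARTITION BY strftime('%Y-%m', request_date)
               ) AS diff_from_month_avg
    FROM request_logs
    ORDER BY request_date
    """
).fetchall()

print(diffs)
```

Because the window function leaves every row in place, the subtraction can happen in the same SELECT without a self-join.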

2. Ranking functions

Ranking functions are an important utility for a data scientist. You are always ranking and indexing your data to better understand which rows are the best in your data set. SQL window functions give you 3 ranking utilities, RANK(), DENSE_RANK(), and ROW_NUMBER(), and which one you pick depends on your exact use case. These functions help you list your data in order and within groups based on what you want.

RANK() Example:

Let’s take a look at a ranking window function example to see how we can rank data within groups using SQL window functions. Follow along interactively with this link: platform.stratascratch.com/coding-question?id=9898&python=

Here we want to find the highest salaries by department. We can’t just take the top 3 salaries without a window function, because that would only give us the top 3 salaries across all departments, so we need to rank the salaries within each department individually. This is done with rank() partitioned by department. From there, it’s very easy to filter for the top 3 in each department.
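The three functions differ only in how they treat ties, which is easiest to see side by side. The sketch below uses a made-up `salaries` table and assumes SQLite 3.25 or newer; the extra `employee` sort key on ROW_NUMBER() is there only to make the tie-break deterministic.

```python
import sqlite3

# Hypothetical toy table with a tie at salary 200.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (employee TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [("a", 300), ("b", 200), ("c", 200), ("d", 100)],
)

ranked = conn.execute(
    """
    SELECT employee,
           salary,
           RANK()       OVER (ORDER BY salary DESC)           AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)           AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY salary DESC, employee) AS row_num
    FROM salaries
    ORDER BY salary DESC, employee
    """
).fetchall()

for row in ranked:
    print(row)
```

RANK() gives tied rows the same rank and then skips ahead (1, 2, 2, 4), DENSE_RANK() gives tied rows the same rank without a gap (1, 2, 2, 3), and ROW_NUMBER() always numbers rows uniquely (1, 2, 3, 4).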

Here is the code to generate this table. You can copy and paste into the SQL editor at the link above and see the same result.

    SELECT department,
           salary,
           -- rank salaries within each department, highest first
           RANK() OVER (PARTITION BY a.department
                        ORDER BY a.salary DESC) AS rank_id
    FROM
      -- keep one row per distinct department/salary pair
      (SELECT department, salary
       FROM twitter_employee
       GROUP BY department, salary
       ORDER BY department, salary) a
    ORDER BY department,
             salary DESC
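The query above stops at the ranking step. One way the final top 3 filter might be written is sketched below on a hypothetical `employee` table (SQLite 3.25+ assumed). The rank is computed in a subquery because a window function cannot be referenced directly in the WHERE clause of the same SELECT.

```python
import sqlite3

# Hypothetical toy table: four salaries in one department, two in another.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [
        ("eng", 100), ("eng", 90), ("eng", 80), ("eng", 70),
        ("ops", 60), ("ops", 50),
    ],
)

# Rank inside the subquery, then filter on the rank in the outer query.
top3 = conn.execute(
    """
    SELECT department, salary, rank_id
    FROM (SELECT department,
                 salary,
                 RANK() OVER (PARTITION BY department
                              ORDER BY salary DESC) AS rank_id
          FROM employee) ranked
    WHERE rank_id <= 3
    ORDER BY department, salary DESC
    """
).fetchall()

for row in top3:
    print(row)
```

The fourth-highest salary in the larger department is dropped, while the smaller department keeps both of its rows.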

3. NTILE

NTILE is a very useful function for those in data analysis, business analytics, and data science. When working with data, you often need to create robust statistics like quartiles, quintiles, medians, and deciles in your daily work, and NTILE makes it easy to generate these results.

NTILE takes the number of buckets as its argument and splits your data into that many buckets of roughly equal size. You set how the data is ordered and, if you want additional groupings, how it is partitioned.
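As a small illustration of the bucketing, the sketch below splits eight made-up scores into quartiles with NTILE(4). The `scores` table is hypothetical and SQLite 3.25 or newer is assumed.

```python
import sqlite3

# Hypothetical toy table: eight scores from 10 to 80.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?)",
    [(s,) for s in range(10, 90, 10)],
)

# Bucket 1 holds the highest scores because the window is ordered descending.
quartiles = conn.execute(
    """
    SELECT score,
           NTILE(4) OVER (ORDER BY score DESC) AS quartile
    FROM scores
    ORDER BY score DESC
    """
).fetchall()

for row in quartiles:
    print(row)
```

With eight rows and four buckets, each bucket receives exactly two rows. Adding a PARTITION BY clause would restart the bucketing inside each group.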

NTILE(100) Example

In this example, we’ll learn how to use NTILE to categorize our data into percentiles. You can follow it interactively at the link here: platform.stratascratch.com/coding-question?id=10303&python=

What you’re trying to do here is identify the top 5 percent of claims based on the score that an algorithm generates. But you can’t just sort the whole table and take the top 5%, because you want to find the top 5% within each state. So one way to do this is to use the NTILE() window function and PARTITION BY state. You can then apply a filter in the WHERE clause to get the top 5%.

Here is the code that produces the full result table. You can copy and paste it at the link above.

    SELECT policy_number,
           state,
           claim_cost,
           fraud_score,
           percentile
    FROM
      (SELECT *,
              -- split each state's claims into 100 buckets, highest score first
              NTILE(100) OVER (PARTITION BY state
                               ORDER BY fraud_score DESC) AS percentile
       FROM fraud_score) a
    WHERE percentile <= 5

4. Handling time series data

LAG and LEAD are two window functions that are useful for handling time series data. The only difference between LAG and LEAD is whether you want to get data from previous or next rows, almost like sampling previous or future data.

You can use LAG and LEAD to calculate monthly growth or moving averages. As a data scientist or business analyst, you are always dealing with time series data and creating these time-based metrics.
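Here is a minimal sketch of both functions on a made-up `yearly_hosts` table, with a growth percentage built from LAG. It assumes SQLite 3.25 or newer; the table, column names, and values are hypothetical.

```python
import sqlite3

# Hypothetical toy table: one host count per year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearly_hosts (year INTEGER, hosts INTEGER)")
conn.executemany(
    "INSERT INTO yearly_hosts VALUES (?, ?)",
    [(2019, 100), (2020, 150), (2021, 120)],
)

# LAG pulls the previous row's value onto the current row, LEAD the next row's.
# The first row has no previous row and the last has no next row, so those are NULL.
trend = conn.execute(
    """
    SELECT year,
           hosts,
           LAG(hosts, 1)  OVER (ORDER BY year) AS prev_hosts,
           LEAD(hosts, 1) OVER (ORDER BY year) AS next_hosts,
           ROUND((hosts - LAG(hosts, 1) OVER (ORDER BY year)) * 100.0
                 / LAG(hosts, 1) OVER (ORDER BY year)) AS growth_pct
    FROM yearly_hosts
    ORDER BY year
    """
).fetchall()

for row in trend:
    print(row)
```

Multiplying by 100.0 before dividing keeps the arithmetic in floating point, so the integer counts are not truncated by integer division.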

LAG() Example:

In this example, we want to find the year-over-year percentage growth, which is a very common question that data scientists and business analysts answer on a daily basis. The problem statement, data, and SQL editor are at the following link if you’d like to try coding the solution yourself: platform.stratascratch.com/coding-question?id=9637&python=

The tricky thing about this problem is how the data is set up: you need to use the value from the previous row in your metric. But SQL isn’t designed to do that. SQL is designed to calculate whatever you want, as long as the values are in the same row. So we can use the window function lag() or lead(), which takes the value from a previous or following row and puts it on your current row, which is what this question requires.

Here is the code that produces the full result table. You can copy and paste the code into the SQL editor at the link above:

    SELECT year,
           current_year_host,
           prev_year_host,
           round(((current_year_host - prev_year_host)/(cast(prev_year_host AS numeric)))*100) estimated_growth
    FROM
      (SELECT year,
              current_year_host,
              -- bring the previous year's host count onto the current row
              LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
       FROM
         -- count hosts by the year they joined
         (SELECT extract(year FROM host_since::date) AS year,
                 count(id) current_year_host
          FROM airbnb_search_details
          WHERE host_since IS NOT NULL
          GROUP BY extract(year FROM host_since::date)
          ORDER BY year) t1) t2
