Note: This feature is currently enabled for our enterprise customers. If the below functionality interests you, get in touch.
People like automating repetitive tasks because of the convenience it brings; in that spirit, we introduced the DIVE keyword to eliminate a repetitive process faced by developers automating the web.
When traversing the internet for content from different points in time, you may find it necessary to extend the depth of the web you’re presently exploring, whether as part of a crawl or simply a repetitive extraction.
Say, for example, I’m interested in getting all artists and genres labeled by Spotify as of the end of 2023. The pseudo-code would look something like the following:
For each genre on everynoise:
    For each artist on a genre page:
        Grab each artist name and other metadata...
However, anyone who’s spent time with scraping or browser automation will know that the code that actually enacts the heuristic described above is rarely that simple, unless it happens to be a plain HTML website that can be wrangled with a standard solution like BeautifulSoup.
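For concreteness, here’s roughly what that hand-rolled version looks like in Python with requests and BeautifulSoup. The selectors are borrowed from the SQL later in this post, but the everynoise markup, how the genre name is paired with its link, and the URL handling are all simplified assumptions for illustration:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Sketch of the conventional nested-loop scrape; selectors mirror the SQL
# below and the page structure is assumed for illustration.
BASE = "https://everynoise.com"
rows = []

index = BeautifulSoup(requests.get(BASE).text, "html.parser")
for link in index.select("div.canvas a.navlink"):
    genre = link.get_text(strip=True)
    genre_url = urljoin(BASE, link.get("href", ""))

    # Each genre page is a fresh context: a new request, a new parse,
    # and the parent genre has to be threaded through by hand.
    page = BeautifulSoup(requests.get(genre_url).text, "html.parser")
    for artist in page.select("div.genre.scanme"):
        rows.append({
            "genre": genre,
            "genre_link": genre_url,
            "artist_name": artist.get_text(strip=True),
        })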
The complexity comes from treating different pages as intrinsically different contexts, even though you’re the same person tackling the same problem. Instead, what if our pseudo-code looked something like the following?
Starting at everynoise:
    Grab each genre link and dive into the subsequent pages:
        Grab content with parent context implicitly inserted
That’s what’s done with the SQL below:
everynoise <| https://everynoise.com |
genre <| div.genre |
genre_link <| div.canvas a.navlink@href |
artist_name <| div.genre.scanme |
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
|> GROUP BY artist_name
|> SELECT artist_name
To see corresponding screenshots of this, see this Tweet; for how the above SQL works, keep reading for a walkthrough.
To repeat, here’s the SQL we’ll be walking through:
everynoise <| https://everynoise.com |
genre <| div.genre |
genre_link <| div.canvas a.navlink@href |
artist_name <| div.genre.scanme |
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
|> GROUP BY artist_name
|> SELECT artist_name
The first four lines are ASSIGNing variables that’ll make the subsequent statement easier to read and edit.
everynoise <| https://everynoise.com |
genre <| div.genre |
genre_link <| div.canvas a.navlink@href |
artist_name <| div.genre.scanme |
The lines below then comprise the actual logic of the SQL, starting with the FROM expression, which declares that we are interested in content from the source identified as everynoise, defined above as https://everynoise.com.
FROM everynoise
Next we are GROUPing and SELECTing the fields we’d like to carry over into the DIVE operation, which are simply the genres plus links to their respective pages.
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
At this point, if we were to leave the query as-is, we’d end up with a two-column result that looks something like this:
┌───────┬────────────┐
│ genre │ genre_link │
├───────┼────────────┤
│ pop │ https... │
├───────┼────────────┤
│ jazz │ https... │
├───────┼────────────┤
│ rock │ https... │
└───────┴────────────┘
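In Python terms, you can picture this intermediate result as a list of row dicts, one per genre (purely illustrative):
# The two-column intermediate result pictured as Python rows (illustrative).
genre_rows = [
    {"genre": "pop",  "genre_link": "https..."},
    {"genre": "jazz", "genre_link": "https..."},
    {"genre": "rock", "genre_link": "https..."},
]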
But we wanted to fetch not only the genres but also the artists associated with them; that’s something we’ll get from the pages available at genre_link.
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
And, since just going to a page doesn’t magically accomplish anything on its own, we’ll also GROUP and SELECT from the context of the dived-into page:
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
|> GROUP BY artist_name
|> SELECT artist_name
When run, this now results in a table that looks like the following:
┌───────┬────────────┬─────────────┐
│ genre │ genre_link │ artist_name │
├───────┼────────────┼─────────────┤
│ pop │ https... │ Cotton │
├───────┼────────────┼─────────────┤
│ jazz │ https... │ Eyed │
├───────┼────────────┼─────────────┤
│ rock │ https... │ Joe │
└───────┴────────────┴─────────────┘
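Conceptually, you can think of DIVE as doing something like the following, with the parent row’s columns implicitly carried into every row extracted from the dived-into page. This is only a Python sketch of the semantics, not the actual implementation; dive and extract_child_rows are names made up for illustration.
# Illustrative model of DIVE's row semantics: for each parent row, extract
# rows from the linked page and copy the parent's columns into each of them.
def dive(parent_rows, link_column, extract_child_rows):
    result = []
    for parent in parent_rows:
        for child in extract_child_rows(parent[link_column]):
            result.append({**parent, **child})
    return result

# e.g. dive(genre_rows, "genre_link", artists_on_page), where artists_on_page
# is a hypothetical scraper returning [{"artist_name": ...}, ...], yields rows
# shaped like {"genre": ..., "genre_link": ..., "artist_name": ...}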
Another example where the DIVE keyword is helpful is when pertinent information is contained within links accessible from the “top-level” context.
Let’s say we’re interested in not just the headlines on Ask HN but also the full question provided by the author. Doing this by hand means starting from the top-level Ask HN page, clicking on a link, copying the text at the top of the page, and then repeating for the other links.
So, why don’t we just get the Ask HN page and then pluck the authors’ questions after diving into each of the relevant links?
-- A shorthand for Ask HN
ask_hn <| https://news.ycombinator.com/ask |
-- Selectors for fields related to posts' discussions
subtitle_container <| span.subline |
author <| span.subline a.hnuser |
discussion_link <| span.subline span.age a@href |
-- Selectors for fields related to posts' contents
question <| div.toptext |
-- A function defined to obtain post conversations
get_post_conversation <|> <|
FROM ask_hn
|> GROUP BY subtitle_container
|> SELECT author, discussion_link |
-- Dive into each discussion link and pluck the author's question
get_post_conversation
|> DIVE INTO discussion_link
|> SELECT question
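For comparison, the hand-rolled equivalent of that query might look roughly like this in Python with requests and BeautifulSoup. It’s a sketch reusing the selectors above; pagination, error handling, and rate limiting are omitted:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ASK_HN = "https://news.ycombinator.com/ask"

# Fetch the Ask HN index, follow each discussion link, and pluck the
# author's question text; selectors mirror the SQL above.
index = BeautifulSoup(requests.get(ASK_HN).text, "html.parser")
results = []
for subline in index.select("span.subline"):
    author_el = subline.select_one("a.hnuser")
    link_el = subline.select_one("span.age a")
    if author_el is None or link_el is None:
        continue

    discussion_link = urljoin(ASK_HN, link_el["href"])
    page = BeautifulSoup(requests.get(discussion_link).text, "html.parser")
    question_el = page.select_one("div.toptext")

    results.append({
        "author": author_el.get_text(strip=True),
        "discussion_link": discussion_link,
        "question": question_el.get_text(strip=True) if question_el else "",
    })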