Note: This feature is currently enabled for our enterprise customers. If the below functionality interests you, get in touch.
People like automating repetitive tasks because of the convenience it brings; in that spirit, we introduced the DIVE keyword to eliminate a repetitive process faced by developers automating the web.
When traversing the internet for content from different points in time, you may find it necessary to extend the depth of the web you’re presently exploring, whether as part of a crawl or simply a repetitive extraction.
Say, for example, I’m interested in getting all artists and genres labeled by Spotify as of the end of 2023. The pseudo-code would look something like the following:
For each genre on everynoise:
    For each artist on a genre page:
        Grab each artist name and other metadata...
However, anyone who’s spent time with scraping or browser automation will know that the code that actually enacts the heuristic described above is rarely that simple, unless it happens to be a plain HTML website that can be wrangled with a standard solution like BeautifulSoup.
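For concreteness, here’s roughly what that hand-rolled version looks like in Python with requests and BeautifulSoup. The selectors are borrowed from the SQL later in this post, but the everynoise markup, how the genre name is paired with its link, and the URL handling are all simplified assumptions for illustration:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Sketch of the conventional nested-loop scrape; selectors mirror the SQL
# below and the page structure is assumed for illustration.
BASE = "https://everynoise.com"
rows = []

index = BeautifulSoup(requests.get(BASE).text, "html.parser")
for link in index.select("div.canvas a.navlink"):
    genre = link.get_text(strip=True)
    genre_url = urljoin(BASE, link.get("href", ""))

    # Each genre page is a fresh context: a new request, a new parse,
    # and the parent genre has to be threaded through by hand.
    page = BeautifulSoup(requests.get(genre_url).text, "html.parser")
    for artist in page.select("div.genre.scanme"):
        rows.append({
            "genre": genre,
            "genre_link": genre_url,
            "artist_name": artist.get_text(strip=True),
        })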
The complexity comes from treating different pages as intrinsically different contexts, even though you’re the same person tackling the same problem. Instead, what if our pseudo-code looked something like the following?
Starting at everynoise:
    Grab each genre link and dive into the subsequent pages:
        Grab content with parent context implicitly inserted
That’s what’s done with the SQL below:
everynoise <| https://everynoise.com |
genre <| div.genre |
genre_link <| div.canvas a.navlink@href |
artist_name <| div.genre.scanme |
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
|> GROUP BY artist_name
|> SELECT artist_name
To see corresponding screenshots of this, see this Tweet; for how the above SQL works, keep reading for a walkthrough.
To repeat, here’s the SQL we’ll be walking through:
everynoise <| https://everynoise.com |
genre <| div.genre |
genre_link <| div.canvas a.navlink@href |
artist_name <| div.genre.scanme |
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
|> GROUP BY artist_name
|> SELECT artist_name
The first four lines are ASSIGNing variables that’ll make the subsequent statement easier to read and edit.
everynoise <| https://everynoise.com |
genre <| div.genre |
genre_link <| div.canvas a.navlink@href |
artist_name <| div.genre.scanme |
The lines below then comprise the actual logic of the SQL, starting with the FROM expression, which declares that we are interested in content from the source identified as everynoise, defined above as https://everynoise.com.
FROM everynoise
Next we are GROUPing and SELECTing the fields we’d like to carry over into the DIVE operation, which are simply the genres plus links to their respective pages.
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
At this point, if we were to leave the query as-is, we’d end up with a two-column result that looks something like this:
┌───────┬────────────┐
│ genre │ genre_link │
├───────┼────────────┤
│ pop │ https... │
├───────┼────────────┤
│ jazz │ https... │
├───────┼────────────┤
│ rock │ https... │
└───────┴────────────┘
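In Python terms, you can picture this intermediate result as a list of row dicts, one per genre (purely illustrative):
# The two-column intermediate result pictured as Python rows (illustrative).
genre_rows = [
    {"genre": "pop",  "genre_link": "https..."},
    {"genre": "jazz", "genre_link": "https..."},
    {"genre": "rock", "genre_link": "https..."},
]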
But we wanted to fetch not only the genres but also the artists associated with them; that’s something we’ll get from the pages available at genre_link.
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
And, since just going to a page doesn’t magically accomplish anything on its own, we’ll also GROUP and SELECT from the context of the dived-into page:
FROM everynoise
|> GROUP BY genre
|> SELECT genre, genre_link
|> DIVE INTO genre_link
|> GROUP BY artist_name
|> SELECT artist_name
When run, this now results in a table that looks like the following:
┌───────┬────────────┬─────────────┐
│ genre │ genre_link │ artist_name │
├───────┼────────────┼─────────────┤
│ pop │ https... │ Cotton │
├───────┼────────────┼─────────────┤
│ jazz │ https... │ Eyed │
├───────┼────────────┼─────────────┤
│ rock │ https... │ Joe │
└───────┴────────────┴─────────────┘
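Conceptually, you can think of DIVE as doing something like the following, with the parent row’s columns implicitly carried into every row extracted from the dived-into page. This is only a Python sketch of the semantics, not the actual implementation; dive and extract_child_rows are names made up for illustration.
# Illustrative model of DIVE's row semantics: for each parent row, extract
# rows from the linked page and copy the parent's columns into each of them.
def dive(parent_rows, link_column, extract_child_rows):
    result = []
    for parent in parent_rows:
        for child in extract_child_rows(parent[link_column]):
            result.append({**parent, **child})
    return result

# e.g. dive(genre_rows, "genre_link", artists_on_page), where artists_on_page
# is a hypothetical scraper returning [{"artist_name": ...}, ...], yields rows
# shaped like {"genre": ..., "genre_link": ..., "artist_name": ...}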
Another example where the DIVE keyword is helpful is when pertinent information is contained within links accessible from the “top-level” context.
Let’s say we’re interested in not just the headlines on Ask HN but also the full question provided by the author. Doing this by hand means starting from the top-level Ask HN page, clicking on a link, copying the text at the top of the page, and then repeating for the other links.
So, why don’t we just get the Ask HN page and then pluck the authors’ questions after diving into each of the relevant links?
-- A shorthand for Ask HN
ask_hn <| https://news.ycombinator.com/ask |
-- Selectors for fields related to posts' discussions
subtitle_container <| span.subline |
author <| span.subline a.hnuser |
discussion_link <| span.subline span.age a@href |
-- Selectors for fields related to posts' contents
question <| div.toptext |
-- A function defined to obtain post conversations
get_post_conversation <|> <|
FROM ask_hn
|> GROUP BY subtitle_container
|> SELECT author, discussion_link |
-- Dive into each discussion link and pluck the author's question
get_post_conversation
|> DIVE INTO discussion_link
|> SELECT question
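For comparison, the hand-rolled equivalent of that query might look roughly like this in Python with requests and BeautifulSoup. It’s a sketch reusing the selectors above; pagination, error handling, and rate limiting are omitted:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ASK_HN = "https://news.ycombinator.com/ask"

# Fetch the Ask HN index, follow each discussion link, and pluck the
# author's question text; selectors mirror the SQL above.
index = BeautifulSoup(requests.get(ASK_HN).text, "html.parser")
results = []
for subline in index.select("span.subline"):
    author_el = subline.select_one("a.hnuser")
    link_el = subline.select_one("span.age a")
    if author_el is None or link_el is None:
        continue

    discussion_link = urljoin(ASK_HN, link_el["href"])
    page = BeautifulSoup(requests.get(discussion_link).text, "html.parser")
    question_el = page.select_one("div.toptext")

    results.append({
        "author": author_el.get_text(strip=True),
        "discussion_link": discussion_link,
        "question": question_el.get_text(strip=True) if question_el else "",
    })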