Syntax

What LSD SQL takes advantage of is how a SELECT INTO statement implies the desired structure of out the output hence the following query

SELECT
  a AS post
, a@href AS link
FROM
  https://news.ycombinator.com
GROUP BY
  span.titleline;

If you were to read it from the perspective of someone familiar with SQL without knowing beforehand what "tables" are in the database, you'd figure the result of this query should return a two-column result with the columns post and link. If you click the button above to run the query in our workbench you'd see that it's exactly what it's doing!

By not needing tables to be created beforehand this effectively yields SQL without ontology

Selecting

The syntax for selecting is the same as a SELECT statement in postgres except for two differences:

In code that resembles the following

SELECT {css selector} AS {some alias} FROM {url}

For the psql export to work, the resulting column column identifier (including the alias provided) must be a valid postgres identifier. Hence in the example query above, what's happening in the SELECT portion is the following:

SELECT
  a AS post -- Getting [a] elements and saving their text as [post]
, a@href AS link -- Getting [href] attribute from [a] tags as [link]
FROM
  https://news.ycombinator.com
GROUP BY
span.titleline;

To select more than one thing from a page, simply separate the select expressions with a comma like shown above

The @ doesn't serve a purpose for selecting elements in CSS so it is used in LSD SQL as a mnemonic for "attribute". In order to get both the post text and post link from hacker news, we simply grab the @href attribute of the a elements being selected

Providing sources

As mentioned earlier, URLs are the table identifiers. Like how you'd typically get columns from a table in an expression similar to the following

SELECT col FROM table_;

The way to get things from a particular page is by specifying the URL in the from clause

SELECT
  a AS post
, a@href AS link
FROM
  https://news.ycombinator.com
GROUP BY
  span.titleline;

To get the same elements from pages that are similar in structure, you can simply separate URLs with a comma

SELECT
  a AS post
, a@href AS link
FROM
  https://news.ycombinator.com
, https://news.ycombinator.com/newest
GROUP BY
  span.titleline

Filtering results

To filter out results, simply provide a WHERE clause

SELECT
  a AS portco
FROM
  https://www.abstractvc.com/companies
WHERE
  portco != '' AND portco != 'About' AND portco != '
GROUP BY
  div.MuiGrid-item;

You can refer to either the selector or alias when describing a conditional clause

Aggregating details

In an ordinary PostgreSQL database, a GROUP BY clause usually denotes how to aggregate rows in a given query. In LSD, the clause tells what the containing elements are of the selected columns. If the provided selector is not found, it then will look to see if the GROUP BY'ed element is the one being selected; lastly it will look to see if the selector applies to the parent document

SELECT
  a AS post
, a@href AS link
FROM
  https://news.ycombinator.com
GROUP BY
  span.titleline;

|> syntax

The pipe operator syntax in LSD draws inspiration from Google's approach and you can see how theirs looks

What is the pipe operator?

A syntactically close use of the pipe operator is Elixir's pipe operator and the motivation for introducing it to SQL is very similar - imperative programming paradigms feel more meticulous than resemblant of business logic as complexity increases. If you consider how you'd vocalize getting user or item rows, you'd likely say something along the lines of "from the users table, get these columns, oh and group by this column to deduplicate records"; why not have a syntax that resembles how you'd think about the problem?

FROM {some_table}
|> SELECT {some_column}, {another_column}, {woah_another_column}
|> GROUP BY {some_column};

Each block that's separated by a pipe operator signifies a functional instruction looking to be performed on the table in context

Translating standard SQL to |> syntax

To translate a standard SQL statement, simply consider each block relating to keywords and break up by pipe operators while starting the expression with a FROM clause

SELECT
  {some_column},
  {another_column},
  {woah_another_column}
FROM
  {some_table}
GROUP BY
  {some_column};

Here's what the separate blocks look like

[SELECT
  {some_column},
  {another_column},
  {woah_another_column}]

[FROM
  {some_table}]

[GROUP BY
  {some_column}];

Stitching together into the |> syntax then gets you the following

FROM {some_table}
|> SELECT {some_column}
,         {another_column}
,         {woah_another_column}
|> GROUP BY {some_column}

Examples

If you'd like to read a code script that's written similar in style to Learn X in Y minutes then check out this arxiv example

Otherwise, here's a script for getting every artist and genre from everynoise. To see screenshots of getting and viewing queried data, check out the corresponding tweet

FROM https://everynoise.com
|> SELECT div.genre AS genre, div.canvas a.navlink@href AS genre_link
|> GROUP BY div.genre
|> DIVE INTO genre_link PRESERVING genre
|> SELECT div.genre.scanme AS artist_name

Examples

Football headlines

SELECT
    h3.title.h5 AS headline
    , span.date_time__KhlCV.time AS how_long_ago
    FROM
    https://www.goal.com/en-us/category/transfers/1/k94w8e1yy9ch14mllpf4srnks
    GROUP BY
    div.content-wrapper;

Sequoia's Fintech investments

SELECT
    td.company-listing__cell-wide.company-listing__text.u-md-hide AS company_tagline
    , th.company-listing__cell-wide.company-listing__head AS company_name
    , td.u-lg-hide AS company_stage
    , li AS investor
    FROM
    https://www.sequoiacap.com/our-companies/?_categories=fintech&_sort=stage_current-asc#all-panel
    GROUP BY
    tr.aos-init.aos-animate;

Sequoia companies that have gotten acquired

SELECT
    th.company-listing__cell-wide.company-listing__head AS company_name
    , td.company-listing__cell-wide.company-listing__text.u-md-hide AS company_function
    , td.u-lg-hide AS company_status
    , li AS investor
    FROM
    https://www.sequoiacap.com/our-companies/?_stage_current=acquired&_sort=stage_current-asc#all-panel
    GROUP BY
    tr.aos-init;

Hummingbird VC portfolio company names

SELECT
    p.paragraph.break-spaces.hide-tablet AS what_does_it_do
    , p.text-sm.hide-tablet AS location
    , p.text-sm AS name
    FROM
    https://www.hummingbird.vc/portfolio
    GROUP BY
    div.grid-item-row;

Propel VC portfolio company names

SELECT
    a.company-t AS portco
    FROM
    https://www.propel.vc/investments
    GROUP BY
    div.company_link_cover;

Abstract VC portfolio company names

SELECT
    a AS portco
    FROM
    https://www.abstractvc.com/companies
    WHERE portco != '' AND portco != 'About' AND portco != '
    GROUP BY
    div.MuiGrid-item;

Abstract VC's portfolio company logos

SELECT
    img.featured-logo AS portco
    FROM
    https://abstractvc.com/companies
    GROUP BY
    div.featured-inner;

Obligatory demo getting posts from Hacker News

SELECT
    a AS post_title,
    a@href AS post_link
    FROM
    https://news.ycombinator.com
    GROUP BY
    span.titleline;

Scraping a list of web scraping services from a scraping service website

SELECT
    a@href AS scraping_tool
    FROM
    https://www.octoparse.com/blog/top-30-free-web-scraping-software
    GROUP BY
    strong;