$ ls contents
What LSD SQL takes advantage of is how a SELECT INTO statement implies the desired structure of out the output hence the following query
SELECT
a AS post
, a@href AS link
FROM
https://news.ycombinator.com
GROUP BY
span.titleline;
If you were to read it from the perspective of someone familiar with SQL without knowing beforehand what "tables" are in the database, you'd figure the result of this query should return a two-column result with the columns post and link. If you click the button above to run the query in our workbench you'd see that it's exactly what it's doing!
By not needing tables to be created beforehand this effectively yields SQL without ontology
The syntax for selecting is the same as a SELECT statement in postgres except for two differences:
In code that resembles the following
SELECT {css selector} AS {some alias} FROM {url}
For the psql export to work, the resulting column column identifier (including the alias provided) must be a valid postgres identifier. Hence in the example query above, what's happening in the SELECT portion is the following:
SELECT
a AS post -- Getting [a] elements and saving their text as [post]
, a@href AS link -- Getting [href] attribute from [a] tags as [link]
FROM
https://news.ycombinator.com
GROUP BY
span.titleline;
To select more than one thing from a page, simply separate the select expressions with a comma like shown above
The @ doesn't serve a purpose for selecting elements in CSS so it is used in LSD SQL as a mnemonic for "attribute". In order to get both the post text and post link from hacker news, we simply grab the @href attribute of the a elements being selected
As mentioned earlier, URLs are the table identifiers. Like how you'd typically get columns from a table in an expression similar to the following
SELECT col FROM table_;
The way to get things from a particular page is by specifying the URL in the from clause
SELECT
a AS post
, a@href AS link
FROM
https://news.ycombinator.com
GROUP BY
span.titleline;
To get the same elements from pages that are similar in structure, you can simply separate URLs with a comma
SELECT
a AS post
, a@href AS link
FROM
https://news.ycombinator.com
, https://news.ycombinator.com/newest
GROUP BY
span.titleline
To filter out results, simply provide a WHERE clause
SELECT
a AS portco
FROM
https://www.abstractvc.com/companies
WHERE
portco != '' AND portco != 'About' AND portco != '
GROUP BY
div.MuiGrid-item;
You can refer to either the selector or alias when describing a conditional clause
In an ordinary PostgreSQL database, a GROUP BY clause usually denotes how to aggregate rows in a given query. In LSD, the clause tells what the containing elements are of the selected columns. If the provided selector is not found, it then will look to see if the GROUP BY'ed element is the one being selected; lastly it will look to see if the selector applies to the parent document
SELECT
a AS post
, a@href AS link
FROM
https://news.ycombinator.com
GROUP BY
span.titleline;
The pipe operator syntax in LSD draws inspiration from Google's approach and you can see how theirs looks
A syntactically close use of the pipe operator is Elixir's pipe operator and the motivation for introducing it to SQL is very similar - imperative programming paradigms feel more meticulous than resemblant of business logic as complexity increases. If you consider how you'd vocalize getting user or item rows, you'd likely say something along the lines of "from the users table, get these columns, oh and group by this column to deduplicate records"; why not have a syntax that resembles how you'd think about the problem?
FROM {some_table}
|> SELECT {some_column}, {another_column}, {woah_another_column}
|> GROUP BY {some_column};
Each block that's separated by a pipe operator signifies a functional instruction looking to be performed on the table in context
To translate a standard SQL statement, simply consider each block relating to keywords and break up by pipe operators while starting the expression with a FROM clause
SELECT
{some_column}, {another_column}, {woah_another_column}FROM
{some_table}GROUP BY
{some_column};
Here's what the separate blocks look like
[SELECT
{some_column}, {another_column}, {woah_another_column}] [FROM
{some_table}] [GROUP BY
{some_column}];
Stitching together into the |> syntax then gets you the following
FROM
{some_table} |>SELECT
{some_column} , {another_column} , {woah_another_column} |>GROUP BY
{some_column}
If you'd like to read a code script that's written similar in style to Learn X in Y minutes then check out this arxiv example
Otherwise, here's a script for getting every artist and genre from everynoise. To see screenshots of getting and viewing queried data, check out the corresponding tweet
FROM
https://everynoise.com|> SELECT
div.genreAS
genre, div.canvas a.navlink@hrefAS
genre_link|> GROUP BY
div.genre|> DIVE INTO
genre_linkPRESERVING
genre|> SELECT
div.genre.scanmeAS
artist_name
SELECT
h3.title.h5 AS headline
, span.date_time__KhlCV.time AS how_long_ago
FROM
https://www.goal.com/en-us/category/transfers/1/k94w8e1yy9ch14mllpf4srnks
GROUP BY
div.content-wrapper;
SELECT
td.company-listing__cell-wide.company-listing__text.u-md-hide AS company_tagline
, th.company-listing__cell-wide.company-listing__head AS company_name
, td.u-lg-hide AS company_stage
, li AS investor
FROM
https://www.sequoiacap.com/our-companies/?_categories=fintech&_sort=stage_current-asc#all-panel
GROUP BY
tr.aos-init.aos-animate;
SELECT
th.company-listing__cell-wide.company-listing__head AS company_name
, td.company-listing__cell-wide.company-listing__text.u-md-hide AS company_function
, td.u-lg-hide AS company_status
, li AS investor
FROM
https://www.sequoiacap.com/our-companies/?_stage_current=acquired&_sort=stage_current-asc#all-panel
GROUP BY
tr.aos-init;
SELECT
p.paragraph.break-spaces.hide-tablet AS what_does_it_do
, p.text-sm.hide-tablet AS location
, p.text-sm AS name
FROM
https://www.hummingbird.vc/portfolio
GROUP BY
div.grid-item-row;
SELECT
a.company-t AS portco
FROM
https://www.propel.vc/investments
GROUP BY
div.company_link_cover;
SELECT
a AS portco
FROM
https://www.abstractvc.com/companies
WHERE portco != '' AND portco != 'About' AND portco != '
GROUP BY
div.MuiGrid-item;
SELECT
img.featured-logo AS portco
FROM
https://abstractvc.com/companies
GROUP BY
div.featured-inner;
SELECT
a AS post_title,
a@href AS post_link
FROM
https://news.ycombinator.com
GROUP BY
span.titleline;
SELECT
a@href AS scraping_tool
FROM
https://www.octoparse.com/blog/top-30-free-web-scraping-software
GROUP BY
strong;