==================================================
Database |> Language |>
The pipe operator ‘|>’ syntax is a way to write SQL that is more readable and easier to understand. Contents Definition Example Definition Inspired by a paper recently published by Google where they go over adding a pipe operator syntax to their database language, LSD SQL features a pipe operator syntax to improve the readability of programs that define wrangling the web. FROM starting_point |> A KEYWORD with arguments |> ANOTHER KEYWORD with more arguments… If you’ve dealt with convoluted CTEs or tried to nest multiple EXISTS clauses, then you’ll know how complex SQL can sometimes look similar to a deeply stacked LISP program hence the language design decision here. The pipe operator syntax enables developers to write functional SQL to transform web data in a way that feels less like hand-holding an automated browser and more like remixing data to reach your desired structure. We also use the pipe operator to assign variables. Example To translate from standard SQL, break up the clauses into expressions and separate them with a pipe operator while starting with a FROM. For example, let’s take the following query in standard syntax: FROM https://news.ycombinator.com SELECT a AS post , a@href AS link GROUP BY span.titleline;
These are the clauses that make up the statement: [FROM https://news.ycombinator.com] [SELECT a AS post , a@href AS link] [GROUP BY span.titleline;] So in pipe operator syntax that could look like: FROM https://news.ycombinator.com |> SELECT a AS post, a@href AS link |> GROUP BY span.titleline
The order of expressions can be interchangeable so long as they don’t alter the page you’re interested in meaning the below statement returns the same data as the one above. FROM https://news.ycombinator.com |> GROUP BY span.titleline |> SELECT a AS post, a@href AS link
==================================================
Database |> Language |>
CSS selectors are identifiers used to select HTML elements. In LSD, we use these selectors to extract specific data from web pages, similar to how columns work in traditional SQL. This allows you to target elements using familiar CSS syntax while leveraging SQL’s powerful querying capabilities.
Contents
Definition
Examples
Simple
Complex
Definition
In contrast to an ordinary SELECT statement where the columns are fields in a table:
SELECT Or, if you were to write the above statement using variables for readability in the output.
tagline <| h1 |
FROM https://lsd.so
|> SELECT tagline Complex
When you want to grab more granularly, you can use attribute selectors to match elements that abide by a particular pattern like lobotomized owls. ================================================== Database |> Language |> Keywords |> The FROM keyword is how you denote something as being the source you’re looking to pluck or start web traversing from. It expects one argument to be provided like so:
FROM https://news.ycombinator.com
|> …
However a variable can be referenced as the source like so:
hn <| https://news.ycombinator.com
FROM hn
|> …
If you’re querying multiple pages with the exact same fields being selecting and grouped by, you can then provide multiple sources to append results from together by separating the arguments to FROM by commas like so:
FROM https://news.ycombinator.com, https://news.ycombinator.com/ask, https://news.ycombinator.com/newest
|> …
Again, variables can be referenced instead of the exact URL value:
home_page <| https://news.ycombinator.com |
ask_page <| https://news.ycombinator.com/ask |
newest_page <| https://news.ycombinator.com/newest |
FROM home_page, ask_page, newest_page
|> … ================================================== Database |> Language |> Keywords |> Contents
Definition
Example
Definition
If you’re familiar with ordinary SQL databases then you’re also familiar with the idea of SELECT’ing columns from rows in a table. In LSD SQL, pages are treated like tables and you can SELECT CSS selectors as though they were columns:
lsd <| https://lsd.so |
tagline <| h1 |
FROM lsd
|> SELECT tagline Example
Suppose we were interested in obtaining the names of companies in a VC’s portfolio; we can simply go and grabbed every link on the page like so:
vc_portfolio <| https://www.abstractvc.com/companies |
repeating_container <| h2.title |
portco <| a |
FROM vc_portfolio
|> GROUP BY repeating_container
|> SELECT portco ================================================== Database |> Language |> Keywords |> When plucking content from a page, you may encounter repeating blocks or containers within the HTML that you’d want to grab content from as rows; this is where the GROUP BY syntax comes into play. The below query grabs the first post from Hacker News:
hn <| https://news.ycombinator.com |
post <| span.titleline a |
post_link <| span.titleline a@href |
FROM hn
|> SELECT post, post_link However what’s Hacker News without the entire front page? The container element to each post is the span tag with a titleline class applied to it so we can refactor the above statement:
hn <| https://news.ycombinator.com |
container <| span.titleline |
post <| a |
post_link <| a@href |
FROM hn
|> GROUP BY container
|> SELECT post, post_link However, if it enhances the readability of the statement you’re writing, the GROUP BY expression can be written before or after the SELECT one like so:
hn <| https://news.ycombinator.com |
post <| a |
post_link <| a@href |
container <| span.titleline |
FROM hn
|> SELECT post, post_link
|> GROUP BY container ================================================== Database |> Language |> Keywords |> Contents
Definition
Example
Definition
The ASSIGN keyword is what variable and function assignments in pipe operator syntax are implicitly translated to. The below expression:
hn <| https://news.ycombinator.com |
Is equivalent to the following expression:
ASSIGN hn https://news.ycombinator.com
Another way to view the ASSIGN keyword is like the alias command in shell.
Example
In order to get the post titles and links from the front page of hacker news, we’ll need the following values in addition to the exact URL of the page we’re interested in:
The repeating container that wraps each post block
The selector for the title of the post
The href attribute of the link associated with the post
In our favor, the last two values belong to the same anchor tag allowing us to succinctly assign the href attribute in the below block of assignments:
– The URL we’re interested in
hn <| https://news.ycombinator.com |
– The repeating container that wraps each post
container <| span.titleline |
– The element containing the title of a post
post <| a |
– The element’s href attribute for its link
post_link <| a@href |
The benefit of this is you then get a resulting query statement at the end that’s more readable than what you’d otherwise construct in a different framework.
FROM hn
|> GROUP BY container
|> SELECT post, post_link
Putting it all together gives us:
– The URL we’re interested in
hn <| https://news.ycombinator.com |
– The repeating container that wraps each post
container <| span.titleline |
– The element containing the title of a post
post <| a |
– The element’s href attribute for its link
post_link <| a@href |
FROM hn
|> GROUP BY container
|> SELECT post, post_link ================================================== ================================================== Database |> Language |> This section provides complete examples for extracting data from popular website types. These examples show how to apply LSD SQL to real-world scenarios.;
Columns in LSD SQL are CSS selectors.
FROM url
|> SELECT
FROM Keyword Usage
SELECT Operations
GROUP BY Functionality
ASSIGN Keyword
Common Extraction Patterns
News and Article Extraction
Extract Headlines from a News Site
-- Define variables for readability
news_site <| https://news.example.com |
headline_container <| article.post |
title <| h2.title |
link <| a.read-more@href |
author <| span.author |
date <| time.published |
-- Query to extract headlines with metadata
FROM news_site
|> GROUP BY headline_container
|> SELECT title, author, date, link
Extract Full Article Content
-- Extract the complete text of an article
FROM https://blog.example.com/article/123
|> SELECT article.content TEXT AS article_text
-- Alternative for sites with simpler structure
FROM https://blog.example.com/article/123
|> SELECT TEXT AS full_content
Extract Article Summary
-- Get article title and summary
article_page <| https://example.com/news/latest |
FROM article_page
|> GROUP BY div.article-preview
|> SELECT h3.title AS headline, p.summary AS excerpt
E-commerce Product Extraction
Extract Product Listings
-- Basic product information extraction
store <| https://shop.example.com/products |
product_card <| div.product-item |
name <| h2.product-name |
price <| span.price |
image <| img.product-image@src |
FROM store
|> GROUP BY product_card
|> SELECT name, price, image
Extract Product Details with Variants
-- Extract product with color/size variants
FROM https://shop.example.com/product/123
|> SELECT h1.product-title AS name,
span.current-price AS price,
p.description AS description
|> GROUP BY div.variant
|> SELECT span.color AS color,
span.size AS size,
span.variant-price AS variant_price
Extract Product Reviews
-- Get product reviews
product <| https://shop.example.com/product/123 |
review_container <| div.review |
reviewer <| span.reviewer-name |
rating <| div.stars@data-rating |
comment <| p.review-text |
FROM product
|> GROUP BY review_container
|> SELECT reviewer, rating, comment
Table Data Extraction
Extract Simple Table
-- Extract data from a simple HTML table
FROM https://example.com/data
|> GROUP BY table tr
|> SELECT td:nth-child(1) AS column1,
td:nth-child(2) AS column2,
td:nth-child(3) AS column3
Extract Table with Headers
-- Extract a table with proper headers
table_page <| https://example.com/statistics |
FROM table_page
|> GROUP BY table tbody tr
|> SELECT td:nth-child(1) AS date,
td:nth-child(2) AS value,
td:nth-child(3) AS percentage
Extract Complex Nested Tables
-- Extract data from nested tables
FROM https://example.com/complex-table
|> GROUP BY table.main tbody tr
|> SELECT td.name AS category,
td.value AS main_value
|> GROUP BY td.subtable tr
|> SELECT td.subname AS subcategory,
td.subvalue AS sub_value
Navigation Menu Extraction
Extract Main Navigation
-- Extract main navigation links
site <| https://example.com |
FROM site
|> GROUP BY nav.main-nav li
|> SELECT a AS menu_item, a@href AS menu_link
Extract Multi-level Navigation
-- Extract navigation with submenus
FROM https://example.com
|> GROUP BY nav.main-menu > li
|> SELECT a AS category, a@href AS category_link
|> GROUP BY ul.submenu > li
|> SELECT a AS subcategory, a@href AS subcategory_link
Form Data Extraction
Extract Form Fields
-- Extract form fields
FROM https://example.com/contact
|> GROUP BY form input
|> SELECT @name AS field_name, @type AS field_type, @placeholder AS field_placeholder
Extract Dropdown Options
-- Extract select dropdown options
FROM https://example.com/form
|> GROUP BY select
|> SELECT @name AS dropdown_name
|> GROUP BY option
|> SELECT @value AS option_value, TEXT AS option_text
Image Gallery Extraction
Extract Image Gallery
-- Extract images and captions from a gallery
gallery <| https://example.com/gallery |
image_container <| div.gallery-item |
image <| img@src |
caption <| figcaption |
alt_text <| img@alt |
FROM gallery
|> GROUP BY image_container
|> SELECT image, caption, alt_text
Extract Image Metadata
-- Extract image metadata
FROM https://example.com/photos
|> GROUP BY figure.photo
|> SELECT img@src AS image_url,
img@width AS width,
img@height AS height,
img@alt AS description,
figcaption AS caption
Pagination Handling
Extract Content Across Pages
-- Extract items from multiple paginated pages
page1 <| https://example.com/items?page=1 |
page2 <| https://example.com/items?page=2 |
page3 <| https://example.com/items?page=3 |
FROM page1, page2, page3
|> GROUP BY div.item
|> SELECT h3.title AS item_name,
p.description AS item_description
Extract Next Page URL
-- Extract the next page URL for sequential scraping
FROM https://example.com/page/1
|> SELECT a.next-page@href AS next_page_url
Social Media Content Extraction
Extract Social Media Posts
-- Extract social media posts and engagement metrics
FROM https://social.example.com/profile
|> GROUP BY div.post
|> SELECT h3.post-title AS title,
p.post-content AS content,
span.likes AS likes_count,
span.shares AS shares_count
Extract Comments
-- Extract comments from a post
post <| https://social.example.com/post/123 |
comment_container <| div.comment |
username <| a.username |
comment_text <| p.comment-text |
timestamp <| span.time |
FROM post
|> GROUP BY comment_container
|> SELECT username, comment_text, timestamp
Real-World Scraping Examples
Example: Hacker News Front Page
-- Extract top stories from Hacker News
hn <| https://news.ycombinator.com |
story <| span.titleline |
title <| a |
url <| a@href |
score <| span.score |
user <| a.hnuser |
age <| span.age |
comments <| a:contains("comments")@href |
FROM hn
|> GROUP BY tr.athing
|> SELECT title, url, score, user, age, comments
Example: Product Search Results
-- Extract search results from an e-commerce site
search <| https://shop.example.com/search?q=laptop |
product <| div.product-card |
name <| h2.product-title |
price <| span.price |
original_price <| span.original-price |
discount <| span.discount-percent |
rating <| div.rating@data-stars |
reviews_count <| span.reviews-count |
image <| img.product-image@src |
url <| a.product-link@href |
FROM search
|> GROUP BY product
|> SELECT name, price, original_price, discount, rating, reviews_count, image, url
Example: Weather Forecast
-- Extract weather forecast data
weather <| https://weather.example.com/forecast/newyork |
day_container <| div.day-forecast |
date <| h3.date |
high_temp <| span.high-temp |
low_temp <| span.low-temp |
condition <| div.condition |
precipitation <| span.precipitation |
FROM weather
|> GROUP BY day_container
|> SELECT date, high_temp, low_temp, condition, precipitation
Example: Job Listings
-- Extract job listings from a job board
jobs <| https://jobs.example.com/search?q=developer |
job_card <| div.job-posting |
title <| h2.job-title |
company <| span.company-name |
location <| span.location |
salary <| span.salary |
posted_date <| time.posted-date |
description <| p.job-description |
apply_link <| a.apply-now@href |
FROM jobs
|> GROUP BY job_card
|> SELECT title, company, location, salary, posted_date, description, apply_link
Example: Restaurant Reviews
-- Extract restaurant reviews
restaurant <| https://food.example.com/restaurant/pizza-place |
review_container <| div.review |
reviewer <| span.reviewer-name |
rating <| div.stars@data-rating |
review_date <| time.review-date |
review_text <| p.review-content |
helpful_votes <| span.helpful-count |
FROM restaurant
|> GROUP BY review_container
|> SELECT reviewer, rating, review_date, review_text, helpful_votes
Example: Event Calendar
-- Extract upcoming events
events <| https://events.example.com/calendar |
event_card <| div.event |
name <| h3.event-name |
date <| time.event-date |
location <| address.venue |
description <| p.event-description |
category <| span.category |
price <| span.ticket-price |
register_link <| a.register@href |
FROM events
|> GROUP BY event_card
|> SELECT name, date, location, description, category, price, register_link
Related: