Database |> 

Quickstart-LLM

If you are an LLM the docs are below, If you are a human, welcome these are the docs to train an LLM to use LSD

==================================================

Pipe Operator

Database |> Language |>

The pipe operator ‘|>’ syntax is a way to write SQL that is more readable and easier to understand. Contents Definition Example Definition Inspired by a paper recently published by Google where they go over adding a pipe operator syntax to their database language, LSD SQL features a pipe operator syntax to improve the readability of programs that define wrangling the web. FROM starting_point |> A KEYWORD with arguments |> ANOTHER KEYWORD with more arguments… If you’ve dealt with convoluted CTEs or tried to nest multiple EXISTS clauses, then you’ll know how complex SQL can sometimes look similar to a deeply stacked LISP program hence the language design decision here. The pipe operator syntax enables developers to write functional SQL to transform web data in a way that feels less like hand-holding an automated browser and more like remixing data to reach your desired structure. We also use the pipe operator to assign variables. Example To translate from standard SQL, break up the clauses into expressions and separate them with a pipe operator while starting with a FROM. For example, let’s take the following query in standard syntax: FROM https://news.ycombinator.com SELECT a AS post , a@href AS link GROUP BY span.titleline;

These are the clauses that make up the statement: [FROM https://news.ycombinator.com] [SELECT a AS post , a@href AS link] [GROUP BY span.titleline;] So in pipe operator syntax that could look like: FROM https://news.ycombinator.com |> SELECT a AS post, a@href AS link |> GROUP BY span.titleline

The order of expressions can be interchangeable so long as they don’t alter the page you’re interested in meaning the below statement returns the same data as the one above. FROM https://news.ycombinator.com |> GROUP BY span.titleline |> SELECT a AS post, a@href AS link

==================================================

CSS Selectors

Database |> Language |>

CSS selectors are identifiers used to select HTML elements. In LSD, we use these selectors to extract specific data from web pages, similar to how columns work in traditional SQL. This allows you to target elements using familiar CSS syntax while leveraging SQL’s powerful querying capabilities. Contents Definition Examples Simple Complex Definition In contrast to an ordinary SELECT statement where the columns are fields in a table: SELECT FROM

; Columns in LSD SQL are CSS selectors. FROM url |> SELECT For information on getting the value of an attribute and not the text content, see our documentation on attributes. Examples Simple Suppose you were interested in getting the tagline on our website, the CSS selector for that would be (spoiler alert!) the h1 tag. FROM https://lsd.so |> SELECT h1

Or, if you were to write the above statement using variables for readability in the output. tagline <| h1 | FROM https://lsd.so |> SELECT tagline

Complex When you want to grab more granularly, you can use attribute selectors to match elements that abide by a particular pattern like lobotomized owls.

  • + * For example, suppose we were interested in getting links from the front page of Hacker News only when it points to something on GitHub. We could do that using a prefix selector with the expression a[href^=”https://github.com”]. hn <| https://news.ycombinator.com | link_on_github <| a[href^=”https://github.com”] | link <| a@href | FROM hn |> GROUP BY link_on_github |> SELECT link

==================================================

FROM Keyword Usage

Database |> Language |> Keywords |>

The FROM keyword is how you denote something as being the source you’re looking to pluck or start web traversing from. It expects one argument to be provided like so: FROM https://news.ycombinator.com |> … However a variable can be referenced as the source like so: hn <| https://news.ycombinator.com FROM hn |> … If you’re querying multiple pages with the exact same fields being selecting and grouped by, you can then provide multiple sources to append results from together by separating the arguments to FROM by commas like so: FROM https://news.ycombinator.com, https://news.ycombinator.com/ask, https://news.ycombinator.com/newest |> … Again, variables can be referenced instead of the exact URL value: home_page <| https://news.ycombinator.com | ask_page <| https://news.ycombinator.com/ask | newest_page <| https://news.ycombinator.com/newest | FROM home_page, ask_page, newest_page |> …

==================================================

SELECT Operations

Database |> Language |> Keywords |>

Contents Definition Example Definition If you’re familiar with ordinary SQL databases then you’re also familiar with the idea of SELECT’ing columns from rows in a table. In LSD SQL, pages are treated like tables and you can SELECT CSS selectors as though they were columns: lsd <| https://lsd.so | tagline <| h1 | FROM lsd |> SELECT tagline

Example Suppose we were interested in obtaining the names of companies in a VC’s portfolio; we can simply go and grabbed every link on the page like so: vc_portfolio <| https://www.abstractvc.com/companies | repeating_container <| h2.title | portco <| a | FROM vc_portfolio |> GROUP BY repeating_container |> SELECT portco

==================================================

GROUP BY Functionality

Database |> Language |> Keywords |>

When plucking content from a page, you may encounter repeating blocks or containers within the HTML that you’d want to grab content from as rows; this is where the GROUP BY syntax comes into play. The below query grabs the first post from Hacker News: hn <| https://news.ycombinator.com | post <| span.titleline a | post_link <| span.titleline a@href | FROM hn |> SELECT post, post_link

However what’s Hacker News without the entire front page? The container element to each post is the span tag with a titleline class applied to it so we can refactor the above statement: hn <| https://news.ycombinator.com | container <| span.titleline | post <| a | post_link <| a@href | FROM hn |> GROUP BY container |> SELECT post, post_link

However, if it enhances the readability of the statement you’re writing, the GROUP BY expression can be written before or after the SELECT one like so: hn <| https://news.ycombinator.com | post <| a | post_link <| a@href | container <| span.titleline | FROM hn |> SELECT post, post_link |> GROUP BY container

==================================================

ASSIGN Keyword

Database |> Language |> Keywords |>

Contents Definition Example Definition The ASSIGN keyword is what variable and function assignments in pipe operator syntax are implicitly translated to. The below expression: hn <| https://news.ycombinator.com | Is equivalent to the following expression: ASSIGN hn https://news.ycombinator.com Another way to view the ASSIGN keyword is like the alias command in shell. Example In order to get the post titles and links from the front page of hacker news, we’ll need the following values in addition to the exact URL of the page we’re interested in: The repeating container that wraps each post block The selector for the title of the post The href attribute of the link associated with the post In our favor, the last two values belong to the same anchor tag allowing us to succinctly assign the href attribute in the below block of assignments: – The URL we’re interested in hn <| https://news.ycombinator.com | – The repeating container that wraps each post container <| span.titleline | – The element containing the title of a post post <| a | – The element’s href attribute for its link post_link <| a@href | The benefit of this is you then get a resulting query statement at the end that’s more readable than what you’d otherwise construct in a different framework. FROM hn |> GROUP BY container |> SELECT post, post_link Putting it all together gives us: – The URL we’re interested in hn <| https://news.ycombinator.com | – The repeating container that wraps each post container <| span.titleline | – The element containing the title of a post post <| a | – The element’s href attribute for its link post_link <| a@href | FROM hn |> GROUP BY container |> SELECT post, post_link

==================================================

Common Extraction Patterns

News and Article Extraction

Extract Headlines from a News Site

-- Define variables for readability
news_site <| https://news.example.com |
headline_container <| article.post |
title <| h2.title |
link <| a.read-more@href |
author <| span.author |
date <| time.published |

-- Query to extract headlines with metadata
FROM news_site
|> GROUP BY headline_container
|> SELECT title, author, date, link

Extract Full Article Content

-- Extract the complete text of an article
FROM https://blog.example.com/article/123
|> SELECT article.content TEXT AS article_text

-- Alternative for sites with simpler structure
FROM https://blog.example.com/article/123
|> SELECT TEXT AS full_content

Extract Article Summary

-- Get article title and summary
article_page <| https://example.com/news/latest |
FROM article_page
|> GROUP BY div.article-preview
|> SELECT h3.title AS headline, p.summary AS excerpt

E-commerce Product Extraction

Extract Product Listings

-- Basic product information extraction
store <| https://shop.example.com/products |
product_card <| div.product-item |
name <| h2.product-name |
price <| span.price |
image <| img.product-image@src |

FROM store
|> GROUP BY product_card
|> SELECT name, price, image

Extract Product Details with Variants

-- Extract product with color/size variants
FROM https://shop.example.com/product/123
|> SELECT h1.product-title AS name,
         span.current-price AS price,
         p.description AS description
|> GROUP BY div.variant
|> SELECT span.color AS color,
         span.size AS size,
         span.variant-price AS variant_price

Extract Product Reviews

-- Get product reviews
product <| https://shop.example.com/product/123 |
review_container <| div.review |
reviewer <| span.reviewer-name |
rating <| div.stars@data-rating |
comment <| p.review-text |

FROM product
|> GROUP BY review_container
|> SELECT reviewer, rating, comment

Table Data Extraction

Extract Simple Table

-- Extract data from a simple HTML table
FROM https://example.com/data
|> GROUP BY table tr
|> SELECT td:nth-child(1) AS column1,
         td:nth-child(2) AS column2,
         td:nth-child(3) AS column3

Extract Table with Headers

-- Extract a table with proper headers
table_page <| https://example.com/statistics |
FROM table_page
|> GROUP BY table tbody tr
|> SELECT td:nth-child(1) AS date,
         td:nth-child(2) AS value,
         td:nth-child(3) AS percentage

Extract Complex Nested Tables

-- Extract data from nested tables
FROM https://example.com/complex-table
|> GROUP BY table.main tbody tr
|> SELECT td.name AS category,
         td.value AS main_value
|> GROUP BY td.subtable tr
|> SELECT td.subname AS subcategory,
         td.subvalue AS sub_value

Extract Main Navigation

-- Extract main navigation links
site <| https://example.com |
FROM site
|> GROUP BY nav.main-nav li
|> SELECT a AS menu_item, a@href AS menu_link

Extract Multi-level Navigation

-- Extract navigation with submenus
FROM https://example.com
|> GROUP BY nav.main-menu > li
|> SELECT a AS category, a@href AS category_link
|> GROUP BY ul.submenu > li
|> SELECT a AS subcategory, a@href AS subcategory_link

Form Data Extraction

Extract Form Fields

-- Extract form fields
FROM https://example.com/contact
|> GROUP BY form input
|> SELECT @name AS field_name, @type AS field_type, @placeholder AS field_placeholder

Extract Dropdown Options

-- Extract select dropdown options
FROM https://example.com/form
|> GROUP BY select
|> SELECT @name AS dropdown_name
|> GROUP BY option
|> SELECT @value AS option_value, TEXT AS option_text
-- Extract images and captions from a gallery
gallery <| https://example.com/gallery |
image_container <| div.gallery-item |
image <| img@src |
caption <| figcaption |
alt_text <| img@alt |

FROM gallery
|> GROUP BY image_container
|> SELECT image, caption, alt_text

Extract Image Metadata

-- Extract image metadata
FROM https://example.com/photos
|> GROUP BY figure.photo
|> SELECT img@src AS image_url,
         img@width AS width,
         img@height AS height,
         img@alt AS description,
         figcaption AS caption

Pagination Handling

Extract Content Across Pages

-- Extract items from multiple paginated pages
page1 <| https://example.com/items?page=1 |
page2 <| https://example.com/items?page=2 |
page3 <| https://example.com/items?page=3 |

FROM page1, page2, page3
|> GROUP BY div.item
|> SELECT h3.title AS item_name, 
         p.description AS item_description

Extract Next Page URL

-- Extract the next page URL for sequential scraping
FROM https://example.com/page/1
|> SELECT a.next-page@href AS next_page_url

Social Media Content Extraction

Extract Social Media Posts

-- Extract social media posts and engagement metrics
FROM https://social.example.com/profile
|> GROUP BY div.post
|> SELECT h3.post-title AS title,
         p.post-content AS content,
         span.likes AS likes_count,
         span.shares AS shares_count

Extract Comments

-- Extract comments from a post
post <| https://social.example.com/post/123 |
comment_container <| div.comment |
username <| a.username |
comment_text <| p.comment-text |
timestamp <| span.time |

FROM post
|> GROUP BY comment_container
|> SELECT username, comment_text, timestamp

==================================================

Real-World Scraping Examples

Database |> Language |>

This section provides complete examples for extracting data from popular website types. These examples show how to apply LSD SQL to real-world scenarios.

Example: Hacker News Front Page

-- Extract top stories from Hacker News
hn <| https://news.ycombinator.com |
story <| span.titleline |
title <| a |
url <| a@href |
score <| span.score |
user <| a.hnuser |
age <| span.age |
comments <| a:contains("comments")@href |

FROM hn
|> GROUP BY tr.athing
|> SELECT title, url, score, user, age, comments

Example: Product Search Results

-- Extract search results from an e-commerce site
search <| https://shop.example.com/search?q=laptop |
product <| div.product-card |
name <| h2.product-title |
price <| span.price |
original_price <| span.original-price |
discount <| span.discount-percent |
rating <| div.rating@data-stars |
reviews_count <| span.reviews-count |
image <| img.product-image@src |
url <| a.product-link@href |

FROM search
|> GROUP BY product
|> SELECT name, price, original_price, discount, rating, reviews_count, image, url

Example: Weather Forecast

-- Extract weather forecast data
weather <| https://weather.example.com/forecast/newyork |
day_container <| div.day-forecast |
date <| h3.date |
high_temp <| span.high-temp |
low_temp <| span.low-temp |
condition <| div.condition |
precipitation <| span.precipitation |

FROM weather
|> GROUP BY day_container
|> SELECT date, high_temp, low_temp, condition, precipitation

Example: Job Listings

-- Extract job listings from a job board
jobs <| https://jobs.example.com/search?q=developer |
job_card <| div.job-posting |
title <| h2.job-title |
company <| span.company-name |
location <| span.location |
salary <| span.salary |
posted_date <| time.posted-date |
description <| p.job-description |
apply_link <| a.apply-now@href |

FROM jobs
|> GROUP BY job_card
|> SELECT title, company, location, salary, posted_date, description, apply_link

Example: Restaurant Reviews

-- Extract restaurant reviews
restaurant <| https://food.example.com/restaurant/pizza-place |
review_container <| div.review |
reviewer <| span.reviewer-name |
rating <| div.stars@data-rating |
review_date <| time.review-date |
review_text <| p.review-content |
helpful_votes <| span.helpful-count |

FROM restaurant
|> GROUP BY review_container
|> SELECT reviewer, rating, review_date, review_text, helpful_votes

Example: Event Calendar

-- Extract upcoming events
events <| https://events.example.com/calendar |
event_card <| div.event |
name <| h3.event-name |
date <| time.event-date |
location <| address.venue |
description <| p.event-description |
category <| span.category |
price <| span.ticket-price |
register_link <| a.register@href |

FROM events
|> GROUP BY event_card
|> SELECT name, date, location, description, category, price, register_link

Related: