til: datasette_crawling-datasette-with-datasette.md


# Crawling Datasette with Datasette

I wanted to add the new tutorials on https://datasette.io/tutorials to the search index that is used by the https://datasette.io/-/beta search engine.

To do this, I needed the content of those tutorials in a SQLite database table. But the tutorials are implemented as static pages in [templates/pages/tutorials](https://github.com/simonw/datasette.io/tree/9dffe361b0210b9d8b1f2fb820a3f2193f0f2fc7/templates/pages/tutorials) - so I needed to crawl that content and insert it into a table.

I ended up using a combination of the `datasette.client` mechanism ([documented here](https://docs.datasette.io/en/stable/internals.html#internals-datasette-client)), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [sqlite-utils](https://sqlite-utils.readthedocs.io/) - all wrapped up in [a Python script](https://github.com/simonw/datasette.io/blob/9dffe361b0210b9d8b1f2fb820a3f2193f0f2fc7/index_tutorials.py) that's now called as part of [the GitHub Actions build process](https://github.com/simonw/datasette.io/blob/9dffe361b0210b9d8b1f2fb820a3f2193f0f2fc7/scripts/build.sh#L35) for the site.

I'm also using [configuration directory mode](https://docs.datasette.io/en/stable/settings.html#config-dir).

Here's the annotated script:

```python
import asyncio
from bs4 import BeautifulSoup as Soup
from datasette.app import Datasette
import pathlib
import sqlite_utils


# This is an async def function because it needs to call await ds.client
async def main():
    db = sqlite_utils.Database("content.db")
    # We need to simulate the full https://datasette.io/ site - including all
    # of its custom templates and plugins. On the command-line we would do this
    # by running "datasette ." - using configuration directory mode. This is
    # the equivalent of that when constructing the Datasette object directly:
    ds = Datasette(config_dir=pathlib.Path("."))
    # Equivalent of fetching the HTML from https://datasette.io/tutorials
    index_response = await ds.client.get("/tutorials")
    index_soup = Soup(index_response.text, "html5lib")
    # We want to crawl the links inside <div class="content"><ul>...<a href="">
    tutorial_links = index_soup.select(".content ul a")
    for link in tutorial_links:
        # For each one fetch the HTML, e.g. from /tutorials/learn-sql
        tutorial_response = await ds.client.get(link["href"])
        # The script should fail loudly if it encounters a broken link
        assert tutorial_response.status_code == 200
        # Now we can parse the page and extract the <h1> and <div class="content">
        soup = Soup(tutorial_response.text, "html5lib")
        # Beautiful Soup makes extracting text easy:
        title = soup.select("h1")[0].text
        body = soup.select(".content")[0].text
        # Insert this into the "tutorials" table, creating it if it does not exist
        db["tutorials"].insert(
            {
                "path": link["href"],
                "title": title,
                "body": body.strip(),
            },
            # Treat path, e.g. /tutorials/learn-sql, as the primary key
            pk="path",
            # This will over-write any existing records with the same path
            replace=True,
        )


if __name__ == "__main__":
    # This idiom executes the async function in an event loop:
    asyncio.run(main())
```

It's then added to the search index by this [Dogsheep Beta](https://datasette.io/tools/dogsheep-beta) search configuration [fragment](https://github.com/simonw/datasette.io/blob/9dffe361b0210b9d8b1f2fb820a3f2193f0f2fc7/templates/dogsheep-beta.yml#L209-L229):

```yaml
content.db:
  tutorials:
    sql: |-
      select
        path as key,
        title,
        body as search_1,
        1 as is_public
      from
        tutorials
    display_sql: |-
      select
        highlight(
          body, :q
        ) as snippet
      from
        tutorials
      where
        tutorials.path = :key
    display: |-
      <h3>Tutorial: <a href="{{ key }}">{{ title }}</a></h3>
      <p>{{ display.snippet|safe }}</p>
```

See [Building a search engine for datasette.io](https://simonwillison.net/2020/Dec/19/dogsheep-beta/) for more details on exactly how this works.
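The `pk="path"` plus `replace=True` combination is what makes the crawl safe to re-run: crawling the same tutorial again overwrites the old row rather than raising a constraint error. A minimal sketch of the same semantics using only the standard library - assuming sqlite-utils compiles this to SQLite's `INSERT OR REPLACE` (the table and values here are illustrative):

```python
import sqlite3

# In-memory stand-in for content.db; "path" is the primary key,
# mirroring db["tutorials"].insert(..., pk="path", replace=True)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tutorials (path TEXT PRIMARY KEY, title TEXT, body TEXT)")

conn.execute(
    "INSERT OR REPLACE INTO tutorials VALUES (?, ?, ?)",
    ("/tutorials/learn-sql", "Learn SQL", "First crawl"),
)
# Re-crawling the same path replaces the old record instead of failing:
conn.execute(
    "INSERT OR REPLACE INTO tutorials VALUES (?, ?, ?)",
    ("/tutorials/learn-sql", "Learn SQL", "Second crawl"),
)

rows = conn.execute("SELECT path, body FROM tutorials").fetchall()
print(rows)  # [('/tutorials/learn-sql', 'Second crawl')]
```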
Created: 2022-02-28T06:37:16+00:00
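The two-argument `highlight(body, :q)` in the `display_sql` above comes from Dogsheep Beta's search machinery rather than from SQLite itself. The underlying SQLite FTS5 `highlight()` primitive, which this kind of snippet generation builds on, takes a table, a column index and open/close tags. A minimal standard-library sketch - assuming your Python's SQLite was compiled with FTS5, and using an illustrative `docs` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny full-text index standing in for the indexed tutorials
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.execute("INSERT INTO docs (body) VALUES ('Learn SQL with Datasette tutorials')")

# highlight(table, column_index, open_tag, close_tag) wraps each matched term
snippet = conn.execute(
    "SELECT highlight(docs, 0, '<b>', '</b>') FROM docs WHERE docs MATCH ?",
    ("datasette",),
).fetchone()[0]
print(snippet)  # Learn SQL with <b>Datasette</b> tutorials
```

Note that the default FTS5 tokenizer matches case-insensitively, so the query `datasette` still finds and wraps the original token `Datasette`.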