strip-tags

Strip tags from HTML, optionally from areas identified by CSS selectors

See llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs for more on this project.

Installation

Install this tool using pip:

pip install strip-tags

Usage

Pipe content into this tool to strip tags from it:

cat input.html | strip-tags > output.txt

Or pass a filename:

strip-tags -i input.html > output.txt

To run against just specific areas identified by CSS selectors:

strip-tags '.content' -i input.html > output.txt

This can be called with multiple selectors:

cat input.html | strip-tags '.content' '.sidebar' > output.txt

To return just the first element on the page that matches one of the selectors, use --first:

cat input.html | strip-tags .content --first > output.txt

To remove content contained by specific selectors - e.g. the <nav> section of a page, use -r or --remove:

cat input.html | strip-tags -r nav > output.txt
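To make the idea of removing a region concrete, here is a minimal standard-library sketch that drops everything inside named tags and keeps the remaining text. It is not how strip-tags is implemented, and it matches plain tag names only rather than full CSS selectors; the names RemoveTagContent and text_without are hypothetical.

```python
from html.parser import HTMLParser


class RemoveTagContent(HTMLParser):
    """Collect text, skipping everything nested inside the named tags."""

    def __init__(self, remove_tags):
        super().__init__(convert_charrefs=True)
        self.remove_tags = set(remove_tags)
        self.skip_depth = 0  # > 0 while inside a removed region
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumes the removed tags are containers with a closing tag.
        if tag in self.remove_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def text_without(html, remove_tags):
    """Return the text of html with the content of remove_tags dropped."""
    parser = RemoveTagContent(remove_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(text_without("<nav><a href='/'>Home</a></nav><p>Body text</p>", ["nav"]))
```

The depth counter is what lets nested markup inside the removed region disappear along with it.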

To minify whitespace - reducing runs of space and tab characters to a single space and removing any remaining blank lines - add -m or --minify:

cat input.html | strip-tags -m > output.txt
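As a rough illustration of the whitespace rules described above, the following standard-library sketch collapses runs of spaces and tabs and then drops blank lines. The function name minify_whitespace is hypothetical, and this is not the code strip-tags itself runs; it only mirrors the behavior the description states.

```python
import re


def minify_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and drop blank lines."""
    # Replace each run of spaces and tabs with a single space.
    collapsed = re.sub(r"[ \t]+", " ", text)
    # Trim each line, then keep only the lines that still have content.
    lines = [line.strip() for line in collapsed.splitlines()]
    return "\n".join(line for line in lines if line)


print(minify_whitespace("Title  \t here\n\n\n   Body   text\n"))
```

Handling the horizontal whitespace first means a line made up only of spaces or tabs is treated as blank and removed in the second step.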

You can also run this command using python -m like this:

python -m strip_tags --help

Keeping the markup for specified tags

When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags - `<h1>This is the heading</h1>` for example - to provide extra hints to the model.

The -t/--keep-tag option can be passed multiple times to specify tags that should be kept.

This example looks at the <header> section of https://datasette.io/ and keeps the tags around the list items and <h1> elements:

curl -s https://datasette.io/ | strip-tags header -t h1 -t li

<li>Uses</li>
<li>Documentation Docs</li>
<li>Tutorials</li>
<li>Examples</li>
<li>Plugins</li>
<li>Tools</li>
<li>News</li>
<h1>
    Datasette
</h1>
Find stories in data

All attributes will be removed from the tags, except for the id= and class= attributes, since those may provide further useful hints to the language model.

The href attribute on links, the alt attribute on images and the name and value attributes on meta tags are kept as well.
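The attribute rules above can be sketched with the standard library. This is an illustration under stated assumptions, not the implementation strip-tags uses: the names KeepTagsStripper, strip_except, ALWAYS_KEPT and KEPT_PER_TAG are hypothetical, and the sketch matches tag names only.

```python
from html import escape
from html.parser import HTMLParser

# Attribute rules as described in the text above (hypothetical names).
ALWAYS_KEPT = {"id", "class"}
KEPT_PER_TAG = {"a": {"href"}, "img": {"alt"}, "meta": {"name", "value"}}


class KeepTagsStripper(HTMLParser):
    """Drop all markup except the kept tags and their allowed attributes."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep_tags:
            return
        allowed = ALWAYS_KEPT | KEPT_PER_TAG.get(tag, set())
        kept = ""
        for name, value in attrs:
            if name in allowed:
                safe = escape(value or "", quote=True)
                kept += f' {name}="{safe}"'
        self.parts.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img/> get a start tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_except(html, keep_tags):
    """Return html with every tag removed except those in keep_tags."""
    parser = KeepTagsStripper(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<div><h1 id="top" style="color:red">Hi</h1><p>see <a href="/docs" target="_blank">docs</a></p></div>'
print(strip_except(sample, ["h1", "a"]))
```

In this sketch the style and target attributes are dropped, while id survives on the heading and href survives on the link.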

You can also specify a bundle of tags. For example, strip-tags -t hs will keep the tag markup for all levels of headings.

The following bundles can be used:

-t hs: <h1>, <h2>, <h3>, <h4>, <h5>, <h6>
-t metadata: <title>, <meta>
-t structure: <header>, <nav>, <main>, <article>, <section>, <aside>, <footer>
-t tables: <table>, <tr>, <td>, <th>, <thead>, <tbody>, <tfoot>, <caption>, <colgroup>, <col>
-t lists: <ul>, <ol>, <li>, <dl>, <dd>, <dt>

As a Python library

You can use strip-tags from Python code too. The function signature looks like this:

def strip_tags(
    input: str,
    selectors: Optional[Iterable[str]]=None,
    *,
    removes: Optional[Iterable[str]]=None,
    minify: bool=False,
    remove_blank_lines: bool=False,
    first: bool=False,
    keep_tags: Optional[Iterable[str]]=None,
    all_attrs: bool=False
) -> str:

Here's an example:

from strip_tags import strip_tags

html = """
<div>
<h1>This has tags</h1>

<p>And whitespace too</p>
</div>
Ignore this bit.
"""
stripped = strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
print(stripped)

Output:

<h1>This has tags</h1>

And whitespace too

Use remove_blank_lines=True to remove any remaining blank lines from the output.

strip-tags --help

Usage: strip-tags [OPTIONS] [SELECTORS]...

  Strip tags from HTML, optionally from areas identified by CSS selectors

  Example usage:

      cat input.html | strip-tags > output.txt

  To run against just specific areas identified by CSS selectors:

      cat input.html | strip-tags .entry .footer > output.txt

Options:
  --version             Show the version and exit.
  -r, --remove TEXT     Remove content in these selectors
  -i, --input FILENAME  Input file
  -m, --minify          Minify whitespace
  -t, --keep-tag TEXT   Keep these <tags>
  --all-attrs           Include all attributes on kept tags
  --first               First element matching the selectors
  --help                Show this message and exit.

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd strip-tags
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

strip-tags by simonw
