Strip tags from HTML, optionally from areas identified by CSS selectors
See llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs for more on this project.
Install this tool using pip:
pip install strip-tags
Pipe content into this tool to strip tags from it:
cat input.html | strip-tags > output.txt
Or pass a filename:
strip-tags -i input.html > output.txt
To run against just specific areas identified by CSS selectors:
strip-tags '.content' -i input.html > output.txt
This can be called with multiple selectors:
cat input.html | strip-tags '.content' '.sidebar' > output.txt
To return just the first element on the page that matches one of the selectors, use --first:
cat input.html | strip-tags .content --first > output.txt
To remove content contained by specific selectors, e.g. the <nav> section of a page, use -r or --remove:
cat input.html | strip-tags -r nav > output.txt
To minify whitespace - reducing multiple space and tab characters to a single space, removing any remaining blank lines - add -m or --minify:
cat input.html | strip-tags -m > output.txt
You can also run this command using python -m like this:
python -m strip_tags --help
When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags, for example <h1>This is the heading</h1>, to provide extra hints to the model.
The -t/--keep-tag option can be passed multiple times to specify tags that should be kept.
This example looks at the <header> section of https://datasette.io/ and keeps the tags around the list items and <h1> elements:
curl -s https://datasette.io/ | strip-tags header -t h1 -t li
<li>Uses</li>
<li>Documentation Docs</li>
<li>Tutorials</li>
<li>Examples</li>
<li>Plugins</li>
<li>Tools</li>
<li>News</li>
<h1>
Datasette
</h1>
Find stories in data
All attributes will be removed from the tags, except for the id= and class= attributes, since those may provide further useful hints to the language model.
The href attribute on links, the alt attribute on images and the name and value attributes on meta tags are kept as well.
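The attribute rule above can be illustrated with a minimal sketch that uses only Python's standard-library html.parser. This is not how strip-tags itself is implemented, and it ignores selectors, bundles and whitespace handling. The names KeepTagsSketch, keep_tags_sketch, ALWAYS_KEPT and PER_TAG_KEPT are hypothetical and exist only for this illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: the attribute rules as described in the text.
# id and class survive on every kept tag; a few tags keep extra attributes.
ALWAYS_KEPT = {"id", "class"}
PER_TAG_KEPT = {"a": {"href"}, "img": {"alt"}, "meta": {"name", "value"}}


class KeepTagsSketch(HTMLParser):
    """Hypothetical helper: drop all markup except the tags in keep_tags."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep_tags:
            return  # the tag is stripped, its text content is still emitted
        allowed = ALWAYS_KEPT | PER_TAG_KEPT.get(tag, set())
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in allowed and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def keep_tags_sketch(html, keep_tags):
    """Return html with every tag removed except those named in keep_tags."""
    parser = KeepTagsSketch(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<h1 id="top" style="color: red">Datasette</h1><p>Find <b>stories</b> in data</p>'
result = keep_tags_sketch(sample, ["h1"])
print(result)
```

In this sketch the style attribute is dropped while id is preserved, and the <p> and <b> tags disappear but their text remains.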
You can also specify a bundle of tags. For example, strip-tags -t hs will keep the tag markup for all levels of headings.
The following bundles can be used:
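-t hs: <h1>, <h2>, <h3>, <h4>, <h5>, <h6>
-t metadata: <title>, <meta>
-t structure: <header>, <nav>, <main>, <article>, <section>, <aside>, <footer>
-t tables: <table>, <tr>, <td>, <th>, <thead>, <tbody>, <tfoot>, <caption>, <colgroup>, <col>
-t lists: <ul>, <ol>, <li>, <dl>, <dd>, <dt>
You can use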
strip-tags from Python code too. The function signature looks like this:
def strip_tags(
    input: str,
    selectors: Optional[Iterable[str]]=None,
    *,
    removes: Optional[Iterable[str]]=None,
    minify: bool=False,
    remove_blank_lines: bool=False,
    first: bool=False,
    keep_tags: Optional[Iterable[str]]=None,
    all_attrs: bool=False
) -> str:
Here's an example:
from strip_tags import strip_tags

html = """
<div>
<h1>This has tags</h1>

<p>And whitespace too</p>
</div>
Ignore this bit.
"""
stripped = strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
print(stripped)
Output:
<h1>This has tags</h1>

And whitespace too
Use
remove_blank_lines=True to remove any remaining blank lines from the output.
Usage: strip-tags [OPTIONS] [SELECTORS]...

  Strip tags from HTML, optionally from areas identified by CSS selectors

  Example usage:

      cat input.html | strip-tags > output.txt

  To run against just specific areas identified by CSS selectors:

      cat input.html | strip-tags .entry .footer > output.txt

Options:
  --version             Show the version and exit.
  -r, --remove TEXT     Remove content in these selectors
  -i, --input FILENAME  Input file
  -m, --minify          Minify whitespace
  -t, --keep-tag TEXT   Keep these <tags>
  --all-attrs           Include all attributes on kept tags
  --first               First element matching the selectors
  --help                Show this message and exit.
To contribute to this tool, first check out the code. Then create a new virtual environment:
cd strip-tags
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest