til: bash_finding-bom-csv-files-with-ripgrep.md

path: bash_finding-bom-csv-files-with-ripgrep.md
topic: bash
title: Finding CSV files that start with a BOM using ripgrep
url: https://github.com/simonw/til/blob/main/bash/finding-bom-csv-files-with-ripgrep.md
shot: <Binary: 66,397 bytes>
created: 2021-05-28T22:23:45-07:00
created_utc: 2021-05-29T05:23:45+00:00
updated: 2021-05-28T22:23:45-07:00
updated_utc: 2021-05-29T05:23:45+00:00
shot_hash: 708508f8876dcdb33cc2e58461643886
slug: finding-bom-csv-files-with-ripgrep
body:

For [sqlite-utils issue 250](https://github.com/simonw/sqlite-utils/issues/250) I needed to locate some test CSV files that start with a UTF-8 BOM.

Here's how I did that using [ripgrep](https://github.com/BurntSushi/ripgrep):

```
$ rg --multiline --encoding none '^(?-u:\xEF\xBB\xBF)' --glob '*.csv' .
```

The `--multiline` option lets the search span multiple lines. I only want to match files that begin with my search term, so this means that `^` will match the start of the file, not the start of individual lines.

`--encoding none` runs the search against the raw bytes of the file, disabling ripgrep's default BOM detection.

`--glob '*.csv'` causes ripgrep to search only CSV files.

The regular expression itself looks like this:

    ^(?-u:\xEF\xBB\xBF)

This is [Rust regex](https://docs.rs/regex/1.5.4/regex/#syntax) syntax.

`(?-u:` means "turn OFF the `u` flag for the duration of this block". The `u` flag, which is on by default, causes the Rust regex engine to interpret input as Unicode. So within the rest of that `(...)` block we can use escaped byte sequences.

Finally, `\xEF\xBB\xBF` is the byte sequence for the UTF-8 BOM itself.
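As a cross-check on the ripgrep search above, the same test (does a CSV file's raw content begin with the three bytes `EF BB BF`?) can be sketched in Python. This is an illustrative alternative, not part of the original recipe: the function names `starts_with_bom` and `find_bom_csvs` are hypothetical, and unlike ripgrep this sketch does not skip hidden files or honour `.gitignore` rules.

```python
import codecs
from pathlib import Path

# The UTF-8 BOM is the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def starts_with_bom(path):
    """Return True if the file's first three raw bytes are the UTF-8 BOM."""
    # Open in binary mode so no decoding (or BOM stripping) happens,
    # which mirrors what `--encoding none` does for ripgrep.
    with open(path, "rb") as f:
        return f.read(len(UTF8_BOM)) == UTF8_BOM


def find_bom_csvs(root):
    """Return a sorted list of *.csv files under root that start with a BOM."""
    # rglob("*.csv") plays the role of `--glob '*.csv'` with a recursive search.
    return sorted(
        p for p in Path(root).rglob("*.csv")
        if p.is_file() and starts_with_bom(p)
    )
```

Calling `find_bom_csvs(".")` returns the matching paths for the current directory. Only the first three bytes of each file are read, so the check stays cheap even for large CSV files.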