home / tils / til

Menu
  • GraphQL API

til: bash_skip-csv-rows-with-odd-numbers.md

This data as json

path topic title url body html shot created created_utc updated updated_utc shot_hash slug
bash_skip-csv-rows-with-odd-numbers.md bash Skipping CSV rows with odd numbers of quotes using ripgrep https://github.com/simonw/til/blob/main/bash/skip-csv-rows-with-odd-numbers.md I'm working with several huge CSV files - over 5 million rows total - and I ran into a problem: it turned out there were a few lines in those files that imported incorrectly because they were not correctly escaped. Here's an example of an invalid line: SAI Exempt,"Patty B"s Hats & Tees,LLC",,26 Broad St The apostrophe in `Patty B's Hats & Tees` is incorrectly represented here as a double quote, and since that's in a double quoted string it breaks that line of CSV. I decided to filter out any rows that had an odd number of quotation marks in them - saving those broken lines to try and clean up later. ## Finding rows with odd numbers of quotes StackOverflow [offered this regular expression](https://stackoverflow.com/a/16863999) for finding lines with an odd number of quotation marks: ``` [^"]*" # Match any number of non-quote characters, then a quote (?: # Now match an even number of quotes by matching: [^"]*" # any number of non-quote characters, then a quote [^"]*" # twice )* # and repeat any number of times. [^"]* # Finally, match any remaining non-quote characters ``` I translated this into a `ripgrep` expression, adding `^` to the beginning and `$` to the end in order to match whole strings. This command counted the number of invalid lines: rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count 04.csv:52 03.csv:42 02.csv:24 01.csv:29 Adding `--invert-match` showed me the count of lines that did NOT have an odd number of quotes: rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count --invert-match 05.csv:2829 04.csv:812351 03.csv:961311 02.csv:994265 01.csv:995404 This shows that the invalid lines are a tiny subset of the overall files. Removing `--count` shows the actual content. ## Importing into SQLite with sqlite-utils I used this for loop to import only the valid lines into a SQLite database: ```bash for file in *.csv; do rg --invert-match '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' $file | \ sqlite-utils insert my.db rows - --csv; done; ``` ## Saving the broken lines for later To save the lines that contained odd numbers of double quotes I used this command: ```bash rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' \ --glob '*.csv' \ --no-line-number \ --no-filename > saved.txt ``` Since I don't actually care which file they lived in - all of these CSV files share the same structure - I used `--no-filename` to omit the filename from the results and `--no-line-number` to omit the line number. The result is a `saved.txt` file containing just the raw CSV data that I skipped from the import. <p>I'm working with several huge CSV files - over 5 million rows total - and I ran into a problem: it turned out there were a few lines in those files that imported incorrectly because they were not correctly escaped.</p> <p>Here's an example of an invalid line:</p> <pre><code>SAI Exempt,"Patty B"s Hats &amp; Tees,LLC",,26 Broad St </code></pre> <p>The apostrophe in <code>Patty B's Hats &amp; Tees</code> is incorrectly represented here as a double quote, and since that's in a double quoted string it breaks that line of CSV.</p> <p>I decided to filter out any rows that had an odd number of quotation marks in them - saving those broken lines to try and clean up later.</p> <h2> <a id="user-content-finding-rows-with-odd-numbers-of-quotes" class="anchor" href="#finding-rows-with-odd-numbers-of-quotes" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Finding rows with odd numbers of quotes</h2> <p>StackOverflow <a href="https://stackoverflow.com/a/16863999" rel="nofollow">offered this regular expression</a> for finding lines with an odd number of quotation marks:</p> <pre><code>[^"]*" # Match any number of non-quote characters, then a quote (?: # Now match an even number of quotes by matching: [^"]*" # any number of non-quote characters, then a quote [^"]*" # twice )* # and repeat any number of times. [^"]* # Finally, match any remaining non-quote characters </code></pre> <p>I translated this into a <code>ripgrep</code> expression, adding <code>^</code> to the beginning and <code>$</code> to the end in order to match whole strings.</p> <p>This command counted the number of invalid lines:</p> <pre><code>rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count 04.csv:52 03.csv:42 02.csv:24 01.csv:29 </code></pre> <p>Adding <code>--invert-match</code> showed me the count of lines that did NOT have an odd number of quotes:</p> <pre><code>rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count --invert-match 05.csv:2829 04.csv:812351 03.csv:961311 02.csv:994265 01.csv:995404 </code></pre> <p>This shows that the invalid lines are a tiny subset of the overall files.</p> <p>Removing <code>--count</code> shows the actual content.</p> <h2> <a id="user-content-importing-into-sqlite-with-sqlite-utils" class="anchor" href="#importing-into-sqlite-with-sqlite-utils" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Importing into SQLite with sqlite-utils</h2> <p>I used this for loop to import only the valid lines into a SQLite database:</p> <div class="highlight highlight-source-shell"><pre><span class="pl-k">for</span> <span class="pl-smi">file</span> <span class="pl-k">in</span> <span class="pl-k">*</span>.csv<span class="pl-k">;</span> <span class="pl-k">do</span> rg --invert-match <span class="pl-s"><span class="pl-pds">'</span>^[^"]*"(?:[^"]*"[^"]*")*[^"]*$<span class="pl-pds">'</span></span> <span class="pl-smi">$file</span> <span class="pl-k">|</span> \ sqlite-utils insert my.db rows - --csv<span class="pl-k">;</span> <span class="pl-k">done</span><span class="pl-k">;</span></pre></div> <h2> <a id="user-content-saving-the-broken-lines-for-later" class="anchor" href="#saving-the-broken-lines-for-later" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Saving the broken lines for later</h2> <p>To save the lines that contained odd numbers of double quotes I used this command:</p> <div class="highlight highlight-source-shell"><pre>rg <span class="pl-s"><span class="pl-pds">'</span>^[^"]*"(?:[^"]*"[^"]*")*[^"]*$<span class="pl-pds">'</span></span> \ --glob <span class="pl-s"><span class="pl-pds">'</span>*.csv<span class="pl-pds">'</span></span> \ --no-line-number \ --no-filename <span class="pl-k">&gt;</span> saved.txt</pre></div> <p>Since I don't actually care which file they lived in - all of these CSV files share the same structure - I used <code>--no-filename</code> to omit the filename from the results and <code>--no-line-number</code> to omit the line number. The result is a <code>saved.txt</code> file containing just the raw CSV data that I skipped from the import.</p> <Binary: 68,046 bytes> 2020-12-11T19:50:58-08:00 2020-12-12T03:50:58+00:00 2021-01-18T17:27:54-08:00 2021-01-19T01:27:54+00:00 79abed69911556279dfa18b015588d8c skip-csv-rows-with-odd-numbers
Powered by Datasette · How this site works · Code of conduct