How to sort lines matching certain criteria into separate files?

December 2018

I have a CSV file, file1.csv, which has a custom format, with three columns, like this:

This is some data. [text] This is some more data.
  • Everything before the first [ is in the first column.
  • Everything between the first set of square brackets is in the second column.
  • Everything after the first ] is in the third column, no matter what content follows.

E.g.:

First. [second] Third.
      ^        ^

I want to sort the lines of the file into two files, withnumbers.csv and withoutnumbers.csv: lines whose third column contains a number go to the first file, and lines whose third column does not go to the second.

Later square brackets might appear, but they do not start a new column. They are still part of the third column's data, e.g.:

First. [second] Third. [some more text] This is still in the third column.
      ^        ^
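
To make the column rule concrete, here is a minimal sketch in POSIX shell that splits one line into its three columns with parameter expansion. The helper name split_columns is hypothetical, and the sketch assumes every line contains at least one [ followed later by a ].

```shell
# Split one line of the custom format into its three columns.
# split_columns is a hypothetical helper name, used only for illustration.
split_columns() {
  line=$1
  col1=${line%%\[*}   # everything before the first [
  rest=${line#*\[}    # drop everything up to and including the first [
  col2=${rest%%\]*}   # everything before the first ] that follows
  col3=${rest#*\]}    # everything after that ], later brackets included
  printf '1=<%s>\n2=<%s>\n3=<%s>\n' "$col1" "$col2" "$col3"
}

split_columns 'First. [second] Third. [some more text] This is still in the third column.'
```

The later bracketed text stays inside the third value, which is the behaviour described above.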

A line counts as containing a number if it contains any digit, i.e. it matches a pattern like *0*, *1*, *2*, etc. These all contain numbers:

Water is H20.
The bear ate 2,120 fish.
The Wright Flyer flew in 1903.

Numbers found anywhere within a pair of square brackets in the third column do not count as a match, e.g., these lines would be sent to withoutnumbers.csv:

First. [second] Some text. [This has the number 1.]
First. [second] Some more text. [The Wright Flyer flew in 1903.]

These would be sent to withnumbers.csv, because they still have a number outside of the square brackets, but inside the third column:

First. [second] Some text with 1. [This has the number 1.]
First. [second] Some more text with the number 3. [The Wright Flyer flew in 1903.]
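
The rule can be restated as a two-step check: drop columns one and two, drop every bracketed group that remains, then look for a digit in what is left. Below is a minimal POSIX shell sketch of that check, not a full solution. The helper name classify is hypothetical, and it assumes brackets are balanced and not nested.

```shell
# Print "with" if the third column has a digit outside square brackets,
# otherwise print "without". classify is a hypothetical helper name.
classify() {
  # First expression: delete everything through the first ] (columns one and two).
  # Second expression: delete every remaining [...] group from the third column.
  stripped=$(printf '%s\n' "$1" | sed -e 's/^[^]]*]//' -e 's/\[[^]]*]//g')
  case $stripped in
    *[0-9]*) echo with ;;
    *)       echo without ;;
  esac
}

classify 'First. [second] Some text. [This has the number 1.]'         # prints: without
classify 'First. [second] Some text with 1. [This has the number 1.]'  # prints: with
```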

How can I sort the lines of the file into those containing numbers in the third column, not considering those numbers found within square brackets, and those lines not containing numbers?

3 answers

This splits each line on the first closing square bracket, removes every bracketed group from the part after it (the third column), and then checks what is left for a digit. Lines with a digit outside square brackets are written to withnumbers.csv; all other lines are written to withoutnumbers.csv.

perl -lne 'BEGIN { open ND, ">", "withoutnumbers.csv" or die $!; open D, ">", "withnumbers.csv" or die $! } @fields = split(/]/, $_, 2); ($col3 = $fields[1] // "") =~ s/\[[^\]]*\]//g; $col3 =~ /\d/ ? print D $_ : print ND $_' file1.csv

Well, I'm not going to lie, I'm not loving the solution I came up with. However, your problem is rather peculiar and desperate times call for desperate measures. So, give this a try:

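awk -F'\[[^\]]*\]' '{
  printed = 0
  # Field 1 is the first column; fields 2..NF are the parts of the
  # third column that lie outside square brackets.
  for (i = 2; i <= NF; i++) {
    if ($i ~ /[0-9]/) {
      print $0 > "withnumbers.csv"
      printed = 1
      break
    }
  }

  if (! printed) {
    print $0 > "withoutnumbers.csv"
  }
}' file1.csv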
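
To see why the loop starts at field 2, it can help to print the fields awk produces when every bracketed group acts as the field separator. This is only an illustrative sketch. The helper name show_fields is hypothetical, and the separator is spelled with bracket expressions, which is a substitution made here to avoid backslash escapes and is intended to match the same text as the separator in the command above.

```shell
# Show how awk splits a line when each [...] group is the field separator.
# [[] matches a literal [, [^]]* matches any run of non-] characters,
# and []] matches a literal ]. show_fields is a hypothetical helper name.
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[[][^]]*[]]' '{ for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }'
}

show_fields 'First. [second] Some text with 1. [This has the number 1.]'
# 1=<First. >
# 2=< Some text with 1. >
# 3=<>
```

The digit inside the second bracketed group never shows up in any field, because that whole group is consumed as a separator.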

Here's a go in bash:

shopt -s extglob
rm -f withnumbers.csv withoutnumbers.csv   # -f: no error if the files do not exist yet
touch withnumbers.csv withoutnumbers.csv

while IFS= read -r line; do
  col3=${line#*\]}            # remove everything before and including the first ]
  col3=${col3//\[*([^]])\]/}  # remove all bracketed items
  if [[ $col3 == *[[:digit:]]* ]]; then
    printf "%s\n" "$line" >> withnumbers.csv
  else
    printf "%s\n" "$line" >> withoutnumbers.csv
  fi
done < file1.csv