Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

0 votes

5 answers

1060 views

command-line tool to sum the values in a column of a CSV file

I am looking for a command-line tool to calculate the sum of the values in a specified column of a CSV file. (**Update**: The CSV file might have quoted fields, so a simple solution that just splits on a delimiter (',') does not work.) Given the following sample CSV file:

description A,description B,data 1, data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22

I want to build the sum, for example, over the column data 1 with the result **19**. I have tried to use [csvkit] for this but didn't get very far. Are there other command-line tools specialised in this CSV operation?
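
As a rough illustration of the task described above, here is a minimal sketch that uses Python's standard `csv` module rather than csvkit (a swapped-in technique, not something the asker tried). The names `SAMPLE` and `sum_column` are hypothetical. The point is that a real CSV parser keeps `"banana,apple"` as one field, so the column positions stay intact.

```python
import csv
import io

# The sample data from the question, inlined so the sketch is self-contained.
SAMPLE = '''description A,description B,data 1, data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
'''


def sum_column(stream, column):
    """Sum one named column; quoted fields are handled by the csv parser."""
    reader = csv.DictReader(stream)
    return sum(float(row[column]) for row in reader)


total = sum_column(io.StringIO(SAMPLE), "data 1")
print(total)  # 19.0
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `open(path, newline="")`.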

halloleo (649 rep)

Jul 2, 2024, 01:41 AM • Last activity: Apr 23, 2025, 03:04 PM

0 votes

2 answers

1296 views

CSV fields max length error and setting quoting=csv.QUOTE_NONE

quoting csv csvkit

After running csvcut on a comma-delimited .csv file:

    [root@server files]# csvcut -c title,mpn,overview,techspecs2,image_carousel_elargesrc syn_multi-image.csv > syn_scraped_cut.csv

I get the error:

> CSV contains fields longer than maximum length of 131072 characters.
> Try raising the maximum with the field_size_limit parameter, or try
> setting quoting=csv.QUOTE_NONE.

Though large, I can tell you for sure that my longest field is only 65535 characters long, which is under the maximum allowed length by a pretty safe margin.  

I have no idea what setting quoting=csv.QUOTE_NONE refers to.  I have only been using simple csvkit commands and that is all I know.

Reading similar threads and answers such as [here](https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072) and [here](https://stackoverflow.com/a/18408911/9095603), I am unable to extract any kind of solution in the context of csvkit, specifically. I'm not adept at programming in general and am limited to using csvkit, its commands and options.

How do I fix this error?
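
The error text refers to settings of Python's `csv` module, which csvkit is built on. One possible cause, not confirmed by the question, is an unbalanced quote character: the parser then treats everything up to the next quote (or the end of the file) as a single field, which can exceed the limit even when no real field is that long. The sketch below demonstrates the effect on made-up data; `longest_field` and the sample rows are hypothetical.

```python
import csv
import io


def longest_field(text, quoting=csv.QUOTE_MINIMAL):
    """Return the length of the longest field the csv parser produces."""
    reader = csv.reader(io.StringIO(text), quoting=quoting)
    return max(len(field) for row in reader for field in row)


# Hypothetical data: every real field is short, but row 1 opens a quote
# that is never closed.
rows = ["id,notes", '1,"unclosed quote'] + [f"{i},ok" for i in range(2, 1000)]
text = "\n".join(rows) + "\n"

# Default quoting: the open quote swallows the rest of the file into one field.
print(longest_field(text))

# QUOTE_NONE: quote characters are treated as ordinary text, so fields stay short.
print(longest_field(text, quoting=csv.QUOTE_NONE))
```

In plain Python the limit itself can be raised with `csv.field_size_limit(new_limit)`. csvkit's common options appear to expose both settings as `-z`/`--maxfieldsize` and `-u`/`--quoting`; check `csvcut --help` for the installed version before relying on them.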

ptrcao (5995 rep)

Jul 25, 2019, 11:14 PM • Last activity: Dec 17, 2023, 10:32 PM

7 votes

2 answers

515 views

Deduplicate CSV rows based on a specific column, with a CSV parser

linux csv csvkit miller

I searched for this task, and found the following older questions:

 - https://unix.stackexchange.com/questions/681059/removing-duplicates-from-a-csv-based-on-specified-columns 
 - https://unix.stackexchange.com/questions/444476/identify-unique-records-on-csv-based-on-specific-columns 

But I can't use awk because my data is a complex CSV file with multiple nested double quotes.

Let's say I want to deduplicate the following (simplified case):

    Ref,xxx,zzz
    ref1,"foo, bar, base",qux
    ref1,"foo, bar, base",bar
    ref2,aaa,bbb

In the output I need it as follows:

    Ref,xxx,zzz
    ref1,"foo, bar, base",qux
    ref2,aaa,bbb

No awk solution, please, only with any CSV parser.

I tried the following:

    mlr --csv uniq -a -g Ref file.csv

But it fails with an error.
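
A minimal sketch of the requested behaviour (keep the first row seen for each value of `Ref`), assuming Python's standard `csv` module as the CSV parser instead of Miller. The function name `dedupe` and the inlined `SAMPLE` are hypothetical.

```python
import csv
import io

# The simplified case from the question.
SAMPLE = '''Ref,xxx,zzz
ref1,"foo, bar, base",qux
ref1,"foo, bar, base",bar
ref2,aaa,bbb
'''


def dedupe(stream, key, out):
    """Copy rows to `out`, keeping only the first row for each `key` value."""
    reader = csv.DictReader(stream)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    seen = set()
    for row in reader:
        if row[key] not in seen:
            seen.add(row[key])
            writer.writerow(row)


result = io.StringIO()
dedupe(io.StringIO(SAMPLE), "Ref", result)
print(result.getvalue(), end="")
```

Only the set of keys is held in memory, so the rows themselves are streamed. Fields containing commas are re-quoted by the writer on output.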
                                

Mévatlavé Kraspek (541 rep)

May 29, 2023, 01:34 AM • Last activity: May 29, 2023, 06:50 PM

6 votes

3 answers

513 views

Truncate a CSV column using csvkit

csv csvkit

How can I truncate the length of a column using CSVKit? The definition looks like this:

* Column 1: no length restriction
* Column 2:

This should properly handle escaped (quoted) columns and new lines. For example:

First Header,Second Header
foo,
foo,b
foo,bar
foo,"bar"
foo,"""bar"
foo,"
bar"

should become

First Header,Second Header
foo,
foo,b
foo,ba
foo,ba
foo,"""b"
foo,"
b"
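
The transformation above can be sketched with Python's standard `csv` module (not csvkit). The limit of 2 characters is an assumption inferred from the expected output, and `truncate_column`, `MAX_LEN`, and `SAMPLE` are hypothetical names. Truncation is applied to the parsed value, so quoting and embedded newlines are re-applied by the writer afterwards.

```python
import csv
import io

# The example input from the question, including an embedded newline.
SAMPLE = '''First Header,Second Header
foo,
foo,b
foo,bar
foo,"bar"
foo,"""bar"
foo,"
bar"
'''

MAX_LEN = 2  # assumption: inferred from the expected output, not stated


def truncate_column(stream, out, index, max_len):
    """Truncate the parsed value of one column; the header row is left as-is."""
    reader = csv.reader(stream)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))
    for row in reader:
        row[index] = row[index][:max_len]
        writer.writerow(row)


result = io.StringIO()
truncate_column(io.StringIO(SAMPLE, newline=""), result, 1, MAX_LEN)
print(result.getvalue(), end="")
```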

patstuart (163 rep)

Jul 14, 2022, 06:10 PM • Last activity: Jul 15, 2022, 07:44 AM

3 votes

2 answers

4044 views

how to install csvkit in bash

software-installation mingw csvkit

Kusalananda nicely recommends using csvformat from [csvkit](https://csvkit.readthedocs.io/en/latest/) to format jq @csv into a csv format without double quotes " [answering how to parse json with jq](https://unix.stackexchange.com/a/506790/530603).

This answer does not seem to involve the use of python. But the csvkit [installation tutorial](https://csvkit.readthedocs.io/en/latest/tutorial/1_getting_started.html#installing-csvkit) and its [installation troubleshooting](https://csvkit.readthedocs.io/en/latest/tricks.html#troubleshooting) do seem to rely on, perhaps require, the use of python. This makes me, a newbie, confused:

Is it possible to install csvkit in git bash without using python (read: open spyder or anaconda, let's say)? How?

**Edit.** MINGW64 (git bash) displays bash: pip: command not found. Same for conda.
How do you recommend moving on from there?

Python is installed, with pip.exe in ...\Anaconda\Scripts. There are several suggested solutions on other sites, e.g. adding the directory of pip.exe to PATH in various ways, [here](https://www.stackoverflow.com/questions/6318156/adding-python-to-path-on-windows) and [here](https://www.stackoverflow.com/question/32597209/python-not-working-in-the-command-line-of-git-bash).

Johan (439 rep)

Jun 20, 2022, 05:40 PM • Last activity: Jun 22, 2022, 12:17 PM

1 votes

1 answer

152 views

How can I separate these two columns in this csv file in Linux/Bash?

bash data csvkit

I am looking to separate these two columns, each into their own separate text files. This data is from a csv file on Kaggle that contains Titanic passenger data. The first column is the number of passengers, and the second column is the age of those passengers, i.e. 10 one-year-olds, 12 two-year-olds, etc. I want to separate these and put them into a simple graph in the command line. I have used csvkit so far to manipulate the data set. Thanks! I am new to Linux and this is my first dabble into tapping into the community!
 

Tyler Young (11 rep)

May 8, 2021, 08:50 PM • Last activity: May 8, 2021, 09:09 PM

0 votes

2 answers

2262 views

Concatenating columns of the same csv file to create a new column with a new heading

awk sed csv csvkit

What I have is a CSV file to this effect:


    +------------+--------------+
    | Category I | Sub-Category |
    +------------+--------------+
    |       1144 |          128 |
    |       1144 |          128 |
    |       1000 |          100 |
    |       1001 |          100 |
    |       1002 |          100 |
    |       1002 |          100 |
    |       1011 |          102 |
    |       1011 |          102 |
    |       1011 |          102 |
    |       1011 |          102 |
    |       1011 |          102 |
    |       1011 |          102 |
    |       1013 |          103 |
    |       1013 |          103 |
    |       1013 |          103 |
    |       1013 |          103 |
    |       1013 |          103 |
    |       1013 |          103 |
    |       1013 |          103 |
    +------------+--------------+



I wish to concatenate the first and second columns above to form a third, new column with a new arbitrary heading, to this effect:


    +-------------+--------------+-----------------------+
    | Category ID | Sub-Category | Arbitrary New Heading |
    +-------------+--------------+-----------------------+
    |        1144 |          128 |               1144128 |
    |        1144 |          128 |               1144128 |
    |        1000 |          100 |               1000100 |
    |        1001 |          100 |               1001100 |
    |        1002 |          100 |               1002100 |
    |        1002 |          100 |               1002100 |
    |        1011 |          102 |               1011102 |
    |        1011 |          102 |               1011102 |
    |        1011 |          102 |               1011102 |
    |        1011 |          102 |               1011102 |
    |        1011 |          102 |               1011102 |
    |        1011 |          102 |               1011102 |
    |        1013 |          103 |               1013103 |
    |        1013 |          103 |               1013103 |
    |        1013 |          103 |               1013103 |
    |        1013 |          103 |               1013103 |
    |        1013 |          103 |               1013103 |
    |        1013 |          103 |               1013103 |
    |        1013 |          103 |               1013103 |
    +-------------+--------------+-----------------------+

My usual go-to utility, csvkit, does not have the means to achieve this, afaik - see https://github.com/wireservice/csvkit/issues/930 .

What is a simple solution not requiring advanced programming knowledge, which can achieve this?

I'm vaguely aware of awk and sed as potential solutions, but I don't want to limit the enquiry to those just in case there is a better (i.e. simpler) solution.

The solution must be efficient for very large files, i.e., containing 120,000+ lines.

Edit: I have included the sample data for the convenience of those wanting to take a crack at it; download here: https://www.dropbox.com/s/achtyxg7qi1629k/category-subcat-test.csv?dl=0 
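
A minimal sketch of the concatenation described above, assuming the real file is plain comma-separated text with exactly two columns (the bordered tables in the question are taken to be a rendering). It uses Python's standard `csv` module rather than awk, sed, or csvkit; `add_concat_column` and the inlined `SAMPLE` are hypothetical.

```python
import csv
import io

# A few rows in the shape the question describes (assumed plain CSV).
SAMPLE = '''Category ID,Sub-Category
1144,128
1000,100
1011,102
'''


def add_concat_column(stream, out, new_heading):
    """Append a column holding column 1 and column 2 joined as text."""
    reader = csv.reader(stream)
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + [new_heading])
    for row in reader:
        # Rows are processed one at a time, so memory use stays flat.
        writer.writerow(row + [row[0] + row[1]])


result = io.StringIO()
add_concat_column(io.StringIO(SAMPLE), result, "Arbitrary New Heading")
print(result.getvalue(), end="")
```

Because the rows are streamed rather than loaded at once, a file of 120,000+ lines should not be a problem for this approach.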
                                

ptrcao (5995 rep)

Dec 21, 2019, 09:56 AM • Last activity: Dec 25, 2019, 06:24 AM

0 votes

1 answer

142 views

Syntactical error with csvsql query?

mysql csv csvkit

I have a csv file attributes.csv from which I want to retrieve all records to a new file attributes_withoutPIDate.csv, excluding records for which the Name column has "PI Date" as the value. Running csvsql in this manner

    csvsql -d ',' -I --query 'select * where Name "PI Date" from attributes' attributes.csv > attributes_withoutPIDate.csv

yields an error

(sqlite3.OperationalError) near "from": syntax error
[SQL: select * where Name  "PI Date" from attributes]
(Background on this error at: http://sqlalche.me/e/e3q8)

I suspect a syntactical error. Can someone advise how to fix it?
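
In SQL the FROM clause comes before WHERE, and the condition needs a comparison operator. The sketch below assumes the intended comparison was "not equal" (no operator appears in the query as shown) and runs the reordered query directly against an in-memory SQLite database through Python's `sqlite3` module. The table contents and the `Value` column are made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for attributes.csv loaded into SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (Name TEXT, Value TEXT)")
conn.executemany(
    "INSERT INTO attributes VALUES (?, ?)",
    [("PI Date", "2019-01-01"), ("Colour", "red"), ("Size", "L")],
)

# FROM precedes WHERE; <> is the assumed "not equal" comparison, and the
# string literal uses single quotes as standard SQL expects.
QUERY = "SELECT * FROM attributes WHERE Name <> 'PI Date'"

rows = conn.execute(QUERY).fetchall()
print(rows)
conn.close()
```

The same query text could then be tried as the `--query` argument of csvsql, wrapping it in double quotes on the shell side so the inner single quotes survive.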

ptrcao (5995 rep)

Dec 22, 2019, 04:35 AM • Last activity: Dec 23, 2019, 03:45 AM

0 votes

1 answer

1109 views

How to write a csvcut script to cut column by header with multiple files?

scripting csv csvkit

Since csvcut (from [csvkit](http://csvkit.readthedocs.io/)) does not take more than a single file at a time, I need to write a script to process multiple files using it. The first parameter should be the delimiter, the second parameter should be the header of the column to extract, and the remaining arguments are the filenames. If the file names are missing, the script should read standard input. It should be something like this:

csvcut ';' Measure calories.csv

I'm not really familiar with csvkit. Can anyone help?
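
A minimal sketch of the argument handling described above, written with Python's standard `csv` module instead of wrapping csvcut (a swapped-in technique). The names `cut_column` and `main` and the demo data are hypothetical. The header is written once, then the chosen column from each file in turn, or from standard input when no file names are given.

```python
import csv
import io
import sys


def cut_column(stream, delimiter, header, writer):
    """Write the values of one named column from a delimited stream."""
    for row in csv.DictReader(stream, delimiter=delimiter):
        writer.writerow([row[header]])


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    """argv is: delimiter, column header, then zero or more file names."""
    delimiter, header, *filenames = argv
    writer = csv.writer(stdout, delimiter=delimiter, lineterminator="\n")
    writer.writerow([header])
    if not filenames:
        # No file names given: fall back to standard input.
        cut_column(stdin, delimiter, header, writer)
    for name in filenames:
        with open(name, newline="") as handle:
            cut_column(handle, delimiter, header, writer)


# Demo with made-up data supplied in place of standard input.
demo = io.StringIO("Food;Measure;Calories\nApple;1 medium;95\nRice;1 cup;206\n")
result = io.StringIO()
main([";", "Measure"], stdin=demo, stdout=result)
print(result.getvalue(), end="")
```

To use this as a command-line script, `main(sys.argv[1:])` would be called at the bottom of the file in place of the demo lines.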

amV (75 rep)

Aug 12, 2019, 07:46 AM • Last activity: Aug 12, 2019, 09:18 AM

Showing page 1 of 9 total questions