<2024-04-30 Tue> tech linux

Datamash

I have used awk many times for numerical operations, but I only recently came across datamash, which takes away some of the tedium and goes a few steps further.

There are other tools in the same space, such as Miller, but datamash seems quite light and easy to use.

For example, in the scenario I had, the data was in the form of:

1 10 <number>
1 10 <number>
1 10 <number>
2 10 <number>
2 10 <number>
2 10 ...
1 20 ...
1 20 ...
1 20 ...
2 20 ...
2 20 ...
...

You get the idea. The point is, I wanted the fastest way to average the 3rd column for every distinct pair of values in the first and second columns.
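For comparison, this is roughly what the same aggregation looks like in plain awk. It is only a sketch: the function name group_mean is made up for illustration, and the input rows are invented values in the three-column layout shown above.

```shell
# Hypothetical helper: mean of column 3 for each (column 1, column 2) pair.
group_mean() {
    awk '{ key = $1 " " $2; sum[key] += $3; n[key]++ }
         END { for (k in sum) print k, sum[k] / n[k] }' "$@" |
        sort -k1,1n -k2,2n   # awk's array order is unspecified, so sort the result
}

# Usage on a few made-up rows; prints:
#   1 10 2
#   1 20 5
#   2 10 4
printf '%s\n' '1 10 1' '1 10 3' '2 10 4' '1 20 5' | group_mean
```

It works, but the bookkeeping (keys, sums, counts, ordering) is all manual, and every extra statistic adds another array.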

The corresponding Datamash command looks like the following:

datamash -t ' ' groupby 1,2 mean 3 max 3 < input.txt

which:

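  1. Uses " " (a single space) as the field separator.
  2. Groups rows by the first AND second columns together. Rows of a group must be adjacent in the input; add -s to sort first if they are not.
  3. Prints the mean of the 3rd column (within each group).
  4. Prints the max of the 3rd column (within each group).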
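To try the steps above end to end, here is a small sketch. The file name sample.txt and its values are made up for illustration, and -s is added on the assumption that the rows of a group may not be adjacent in the file. The check for datamash is only there so the snippet does nothing on machines where it is not installed.

```shell
# Made-up sample data in the same three-column layout (values are illustrative).
cat > sample.txt <<'EOF'
1 10 1.0
1 10 2.0
1 10 3.0
2 10 4.0
2 10 6.0
1 20 5.0
EOF

# -s sorts the input by the group columns first, so each (col1, col2) pair
# ends up as exactly one output line: group keys, then mean, then max.
if command -v datamash >/dev/null 2>&1; then
    datamash -s -t ' ' groupby 1,2 mean 3 max 3 < sample.txt
fi
```

With this input there are three distinct pairs, so three output lines are expected, one per pair.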

The man page of the command should have everything else needed.