Everything that I’m showing here can be done with regular code. You can load the file, parse the CSV data, and then transform it using regular JavaScript, Python, or any other language. But there are a few reasons why I reach for command-line interfaces (CLIs) whenever I need to transform data.

The main reason I love Miller is that it’s a standalone tool. There are many great tools for data manipulation, but every other tool I found was part of a specific ecosystem. The tools written in Python required knowing how to use pip and virtual environments; for those written in Rust, it was cargo, and so on.

On top of that, it’s fast. The data files are streamed, not held in memory, which means that you can perform operations on large files without freezing your computer.

Once Miller is installed, you should have the mlr command available in your terminal.

Run mlr help topics to see if it worked. This will give you instructions to navigate the built-in documentation. You shouldn’t need it, though; that’s what this tutorial is for!

Miller commands work the following way:

Example: mlr --csv filter '$color != "red"' example.csv

Let’s deconstruct:
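As a rough sketch, the example command breaks down as shown in the comments below. The file name example.csv and its color column come from the example itself; the shape column and the three rows are made up purely so the command has something to run against.

```shell
# Hypothetical sample data so the command can run as-is.
cat > example.csv <<'EOF'
color,shape
red,square
blue,circle
yellow,triangle
EOF

# mlr                 the Miller executable
# --csv               read CSV input and write CSV output
# filter              the verb, i.e. the operation to apply
# '$color != "red"'   the verb's argument: keep records whose color is not "red"
# example.csv         the file to read
mlr --csv filter '$color != "red"' example.csv
```

The single quotes around the expression keep the shell from interpreting $color as a shell variable.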

We can use verbs to run specific operations on our data. There’s a lot we can do, so let’s explore.

Note: For the sake of brevity, I’ve renamed the file from IMDb_Economist_tv_ratings.csv to tv_ratings.csv.

As mentioned above, every command contains a specific operation, or verb. Let’s learn our first one: head. It shows the beginning of the file (the “head”) rather than printing the entire file to the console.

You can run the following command:

And this is the output you’ll see:

This is a bit hard to read, so let’s make it easier on the eye by adding --opprint.
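Here is a sketch of the same command with --opprint added. It runs against a small made-up file, sample_tv_ratings.csv, whose column names (apart from titleId) are assumptions; replace it with ./tv_ratings.csv for the real data.

```shell
# Made-up rows standing in for the real dataset (column names are assumptions).
cat > sample_tv_ratings.csv <<'EOF'
titleId,seasonNumber,title,av_rating
tt0000001,1,Alpha,7.2
tt0000001,2,Alpha,7.9
tt0000002,1,Bravo,8.6
tt0000003,1,Charlie,6.4
EOF

# --opprint switches the output to aligned, space-separated columns;
# the input is still read as CSV.
mlr --csv --opprint head ./sample_tv_ratings.csv
```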

The resulting output will be the following:

Much better, isn’t it?

Note: Rather than typing --csv --opprint every time, we can use the --c2p option, which is a shortcut.
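To illustrate the shortcut, the two commands in this sketch should print the same thing. The file sample_tv_ratings.csv and its rows are made up for the example.

```shell
# Made-up rows standing in for the real dataset (column names are assumptions).
cat > sample_tv_ratings.csv <<'EOF'
titleId,seasonNumber,title,av_rating
tt0000001,1,Alpha,7.2
tt0000002,1,Bravo,8.6
EOF

# Long form: CSV in, pretty-printed columns out.
mlr --csv --opprint head ./sample_tv_ratings.csv

# Short form: --c2p means "CSV to pretty print".
mlr --c2p head ./sample_tv_ratings.csv
```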

That’s where the fun begins. Rather than run multiple commands, we can chain the verbs together by using the then keyword.

You can see that there’s a titleId column that isn’t very useful. Let’s get rid of it using the cut verb.
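A sketch of how cut and then can be combined, run against a made-up stand-in file (sample_tv_ratings.csv, with assumed column names other than titleId):

```shell
# Made-up rows standing in for the real dataset (column names are assumptions).
cat > sample_tv_ratings.csv <<'EOF'
titleId,seasonNumber,title,av_rating
tt0000001,1,Alpha,7.2
tt0000001,2,Alpha,7.9
tt0000002,1,Bravo,8.6
tt0000003,1,Charlie,6.4
EOF

# cut -f selects fields; adding -x inverts the selection, so titleId is dropped.
# "then" hands the result of cut over to head.
mlr --c2p cut -x -f titleId then head ./sample_tv_ratings.csv
```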

It gives you the following output:

The filter verb is the one I showed in the first example. It removes all the rows that don’t match a specific expression, letting us clean our data with only a few characters.

If we only want the rating of the first seasons of every series in the dataset, this is how you do it:
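As a sketch, assuming the season is stored in a column named seasonNumber (an assumption, as are the made-up rows in sample_tv_ratings.csv):

```shell
# Made-up rows standing in for the real dataset (column names are assumptions).
cat > sample_tv_ratings.csv <<'EOF'
titleId,seasonNumber,title,av_rating
tt0000001,1,Alpha,7.2
tt0000001,2,Alpha,7.9
tt0000002,1,Bravo,8.6
tt0000003,1,Charlie,6.4
EOF

# Keep only the records whose seasonNumber equals 1, then show the first few.
mlr --c2p filter '$seasonNumber == 1' then head ./sample_tv_ratings.csv
```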

We can sort our data by a specific column, just as we would in a spreadsheet UI like Excel or macOS Numbers. Here’s how to sort the series from the highest rating to the lowest:

The resulting output will be the following:

We can see that Parenthood, from 1990, has the highest rating on IMDb — who knew!

By default, Miller only prints your processed data to the console. If we want to save it to another CSV file, we can use the > operator.

If we wanted to save our sorted data to a new CSV file, this is what the command would look like:

Most of the time, you don’t use CSV data directly in your application. You convert it to a format that is easier to read or doesn’t require additional dependencies, like JSON.

Miller gives you the --c2j option to convert your data from CSV to JSON. Here’s how to do this for our sorted data:

Let’s apply everything we learned above to a real-world use case. Let’s say that you have a detailed dataset of every athlete who participated in the 2016 Olympic Games in Rio, and you want to know which 5 athletes won the most medals.

Let’s open up the following file:

The resulting output will be something like the following:

The CSV file has a few fields we don’t need. Let’s clean it up by removing the info, id, weight, and date_of_birth columns.

Now we can move to our original problem: we want to find who won the highest number of medals. We have how many of each medal (bronze, silver, and gold) the athletes won, but not the total number of medals per athlete.

Let’s compute a new value called medals which corresponds to this total number (bronze, silver, and gold added together).

It gives you the following output:

Sort by the highest number of medals by adding a sort.

The resulting output will be the following:

Restrict to the top 5 by adding -n 5 to your head operation.

You will end up with the following output:

As a final step, let’s convert this into a JSON file with the --c2j option.

Here is our final command:
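A sketch of what such a command can look like, using a made-up stand-in file (athletes_sample.csv, with assumed name, sex, and sport columns) in place of the real dataset:

```shell
# Made-up athletes; name, sex, and sport are assumed column names.
cat > athletes_sample.csv <<'EOF'
id,name,sex,date_of_birth,weight,sport,gold,silver,bronze,info
1,Ada,female,1990-01-01,60,swimming,4,1,0,x
2,Ben,male,1991-02-02,80,swimming,5,1,0,x
3,Cleo,female,1992-03-03,55,gymnastics,0,2,1,x
4,Dan,male,1993-04-04,75,cycling,0,0,0,x
5,Eve,female,1994-05-05,62,athletics,2,0,2,x
6,Finn,male,1995-06-06,70,rowing,1,1,0,x
7,Gia,female,1996-07-07,58,fencing,0,0,1,x
EOF

# Same pipeline as before; only the format flag changes from --c2p to --c2j.
mlr --c2j cut -x -f info,id,weight,date_of_birth \
  then put '$medals = $bronze + $silver + $gold' \
  then sort -nr medals \
  then head -n 5 ./athletes_sample.csv
```

Appending a redirection such as > followed by a file name would save the JSON to disk instead of printing it.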

With a single command, we’ve computed new data, sorted the result, truncated it, and converted it to JSON.

Bonus: If you wanted to show the top 5 women, you could add a filter.

You would end up with the following output:
