FFmpeg concepts

Here is a quick intro to how FFmpeg actually works!

For those who are just joining in: please get the example assets if you want to test out the commands shown in this chapter!

FFmpeg opens the file, decodes it into memory, then encodes the in-memory frames back and puts them into some container: some output file. The term “codec” is a blend of the words “coder” and “decoder”. Those are the magic parts before and after the “decoded frames”.

The decoded frames are uncompressed images in memory. For example, the most basic pixel format for video frames is called “rgb24”. It simply stores the red, green, and blue values one after another in 3x8 bits (3x1 byte), which can represent about 16.7 million colors.
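To make the rgb24 layout concrete, here is a minimal Python sketch. It is only an illustration of the byte layout described above, not FFmpeg code; the helper names pack_rgb24 and frame_size_bytes are made up for this example, and row padding that real decoders may add is ignored.

```python
# Sketch of the rgb24 layout: 3 bytes per pixel, in R, G, B order.
# Helper names are hypothetical; this is not part of FFmpeg.

BYTES_PER_PIXEL = 3  # one byte each for red, green, and blue


def pack_rgb24(pixels):
    """Pack an iterable of (r, g, b) tuples into a flat rgb24 byte string."""
    out = bytearray()
    for r, g, b in pixels:
        out += bytes((r, g, b))  # bytes() rejects values outside 0..255
    return bytes(out)


def frame_size_bytes(width, height):
    """Size of one uncompressed rgb24 frame (ignoring any row padding)."""
    return width * height * BYTES_PER_PIXEL


# Two pixels: pure red, then pure blue.
raw = pack_rgb24([(255, 0, 0), (0, 0, 255)])
print(raw.hex())                     # ff00000000ff
print(2 ** 24)                       # 16777216 possible colors
print(frame_size_bytes(1920, 1080))  # 6220800 bytes per 1080p frame
```

The last number shows why compression matters: a single uncompressed 1080p frame already takes about 6 MB.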

This matters because, with a few exceptions, you can only manipulate or encode decoded frames. So when we get to different audio/video filters or transcoding, you’ll need the decoded frames for all that. But don’t worry, FFmpeg does this automatically for you.

Inputs

As you have probably guessed, FFmpeg must access the input data somehow. FFmpeg knows how to handle most media files, as the awesome people who develop FFmpeg and the related libraries have made encoders and decoders available for most formats!

Don’t think that this is a trivial thing. Many formats were reverse engineered, a hard task requiring brilliant people.

So although we often refer to input files, the input could come from many sources, such as the network, a hardware device and so on. We’ll learn more about that later on in this article.

Many media files are containers for different streams, meaning that a single file might contain multiple streams of content.

For example, a .mov file might contain one or more streams:

  • video tracks
  • audio tracks (e.g. for the different languages or audio formats such as stereo or 5.1)
  • subtitle tracks
  • thumbnails

All these are streams of data from the viewpoint of FFmpeg. Input files and their streams are numerically differentiated with a 0-based index. So, for example, 1:0 means the first (0) stream of the second (1) input file. We’ll learn more about that later too!
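The 0-based file:stream notation can be spelled out in a few lines of Python. This is a minimal sketch that only handles the plain numeric form such as 1:0; FFmpeg's own stream specifiers are richer than this, and the function names here are invented for the example.

```python
def parse_stream_index(spec):
    """Split a 'file:stream' pair such as '1:0' into two 0-based integers."""
    file_part, stream_part = spec.split(":")
    return int(file_part), int(stream_part)


def describe(spec):
    """Render the pair in human terms (1-based positions)."""
    file_idx, stream_idx = parse_stream_index(spec)
    # Add 1 only for the human-readable position; FFmpeg itself counts from 0.
    return f"stream #{stream_idx + 1} of input file #{file_idx + 1}"


print(parse_stream_index("1:0"))  # (1, 0)
print(describe("1:0"))            # stream #1 of input file #2
```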

It is important to note that FFmpeg can open any number of input files simultaneously, and the filtering and mapping decide what it does with them. Again, more on that later!

Streams

As we have seen in the previous section, streams are the fundamental building blocks of containers, so every input file must have at least one stream. Those streams are what a simple ffmpeg -i command lists, for example.

A stream might contain an audio format such as MP3, or a video format such as H.264.

Also, a stream, depending on the codec, might contain multiple “things”. For example, an MP3 or a WAV stream might include several audio channels.

So the building block hierarchy in this case is: File → Stream → Channels.
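The File → Stream → Channels hierarchy can be pictured as nested data. The sketch below is purely a mental model written in Python; the class names, the file name, and the codecs are hypothetical and do not correspond to FFmpeg's internal structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stream:
    kind: str          # "video", "audio", "subtitle", ...
    codec: str         # e.g. "h264" or "aac"
    channels: int = 0  # only meaningful for audio streams


@dataclass
class MediaFile:
    name: str
    streams: List[Stream] = field(default_factory=list)


# A hypothetical file with one video stream and two audio streams.
movie = MediaFile("movie.mov", [
    Stream("video", "h264"),
    Stream("audio", "aac", channels=2),  # stereo
    Stream("audio", "aac", channels=6),  # 5.1 surround
])

# Print each stream with its 0-based index, as if this were input file 0.
for index, stream in enumerate(movie.streams):
    print(f"0:{index} {stream.kind} {stream.codec} channels={stream.channels}")
```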

Outputs

Of course, an output could be a local file, but it doesn’t need to be. It could be a socket, a stream and so on. In the same way as with inputs, you could have multiple outputs, and the mapping determines what goes into which output file.

The output also must have some format or container. Most of the time FFmpeg can and will guess that for us, mostly from the extension, but we can specify it too.
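As a rough illustration of guessing versus specifying the format, the Python sketch below builds two FFmpeg argument lists without running them. The file names are placeholders, and build_command is a helper invented for this example; the -f option and the matroska format name are the only FFmpeg-specific parts.

```python
def build_command(input_path, output_path, container=None):
    """Return an ffmpeg argument list; nothing is executed here."""
    args = ["ffmpeg", "-i", input_path]
    if container is not None:
        # -f placed before the output path forces the output format.
        args += ["-f", container]
    args.append(output_path)
    return args


# Format guessed from the .mkv extension:
print(build_command("input.mp4", "output.mkv"))

# Format stated explicitly, because ".bin" tells FFmpeg nothing:
print(build_command("input.mp4", "output.bin", container="matroska"))
```

Building the command as a list rather than one string avoids shell quoting problems if you later pass it to something like subprocess.run.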

Mapping

Mapping refers to the act of connecting input file streams with output file streams. So if you give 3 input files and 4 output files to FFmpeg, you must also define what should go to where.

If you give a single input and a single output, FFmpeg will choose the streams for you without any explicit mapping, but make sure you know exactly how that happens to avoid surprises. More on all that later!
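To give a feel for explicit mapping, here is a small Python sketch that assembles a command with -map options. It only builds the argument list and does not run FFmpeg. The file names are placeholders, build_map_command is a made-up helper, and the example assumes that stream 0 of each input is the one you want.

```python
def build_map_command(inputs, outputs):
    """
    inputs:  list of input paths
    outputs: list of (output_path, [stream specs like "0:0"]) pairs
    Returns an ffmpeg argument list; nothing is executed here.
    """
    args = ["ffmpeg"]
    for path in inputs:
        args += ["-i", path]
    for out_path, specs in outputs:
        for spec in specs:
            args += ["-map", spec]  # -map options apply to the next output file
        args.append(out_path)
    return args


# First stream of the first input plus first stream of the second input.
cmd = build_map_command(
    ["video.mp4", "music.mp3"],
    [("combined.mp4", ["0:0", "1:0"])],
)
print(" ".join(cmd))
# ffmpeg -i video.mp4 -i music.mp3 -map 0:0 -map 1:0 combined.mp4
```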

Filtering

Filtering is FFmpeg’s feature for modifying the decoded frames (audio or video). Other applications might call them effects, but I’m sure there is a reason why FFmpeg calls them filters.

There are two kinds of filtering supported by FFmpeg: simple and complex. In this article we’ll only discuss complex filters, as they are a superset of the simple filters; this way, we avoid confusion and redundant content.

Simple filters are a single chain of filters between a single input and a single output. Complex filters can have multiple chains of filters, with any number of inputs and outputs.

The following figure extends the previous overview image with the filtering module:

A complex filter graph is built from filter chains, which are built from filters.

So a single filter does a single thing, for example, changes the volume. This filter is quite trivial: it has a single input, changes the volume, and has a single output.

For video, we could check out the scale filter, which is also quite straightforward: it has a single input, scales the incoming frames, and it has a single output too.

You can chain these filters, meaning that you connect the output of one to the input of the next one! So you can have a volume filter after an echo filter, for example, and this way, you’ll add echo and then change the volume.

This way, your chain will have a single input, and it will do several things with it and will output something at the end.
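In FFmpeg's filter syntax, the filters of one chain are written one after another, separated by commas. The Python sketch below assembles such a chain for the echo-then-volume example. It assumes FFmpeg's aecho and volume audio filters; the numeric settings and file names are example values only, and chain is a helper invented here. Nothing is executed.

```python
def chain(*filters):
    """Join filter descriptions into one linear chain (comma-separated)."""
    return ",".join(filters)


# Echo first, then volume; the numeric values are example settings only.
audio_chain = chain("aecho=0.8:0.88:60:0.4", "volume=0.5")
print(audio_chain)  # aecho=0.8:0.88:60:0.4,volume=0.5

# The chain is passed as a single argument after -filter_complex.
cmd = ["ffmpeg", "-i", "input.mp3", "-filter_complex", audio_chain, "output.mp3"]
print(cmd)
```

The order in the string is the order of processing: swapping the two filters would change the volume first and add the echo afterwards.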

Now, the “complex” comes in when you have multiple chains of these filters!

But before we go there, you should also know that some single filters might have multiple inputs or outputs!

For example:

  • The overlay filter puts 2 video streams on top of each other and outputs a single video stream.
  • The split filter splits a single video stream into 2+ video streams (by copying).

So let’s discuss a complex example from a bird’s eye view! I have two video files, I want to put them above each other, and I want the output in two files/sizes, 720p and 1080p.

Now, that’s where complex filtering will be faithful to its name: to achieve this, you’ll need several filter chains!

  • Chain 1: [input1.mp4] [input2.mp4] → overlay → split → [overlaid1] [overlaid2]
  • Chain 2: [overlaid1] → scale → [720p_output]
  • Chain 3: [overlaid2] → scale → [1080p_output]

As you see, you can connect chains to each other, and you can connect chains to output files. There is a rule that you can only consume a chain’s output once, and that’s why we used split instead of feeding the same output into both chains 2 and 3.
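The three chains above could be written in FFmpeg's filtergraph syntax roughly as follows. This is a hedged Python sketch that only builds the strings and the argument list without running anything. The link helper, the label names out720 and out1080, the scale arguments, and the output file names are assumptions made for this example. Only the video streams are handled; audio is left out.

```python
def link(inputs, filters, outputs):
    """One chain: [input labels] + comma-joined filters + [output labels]."""
    ins = "".join(f"[{name}]" for name in inputs)
    outs = "".join(f"[{name}]" for name in outputs)
    return ins + ",".join(filters) + outs


chains = [
    # Chain 1: overlay the two input videos, then copy the result twice.
    link(["0:v", "1:v"], ["overlay", "split"], ["overlaid1", "overlaid2"]),
    # Chain 2 and 3: each copy is consumed exactly once.
    link(["overlaid1"], ["scale=-2:720"], ["out720"]),
    link(["overlaid2"], ["scale=-2:1080"], ["out1080"]),
]
graph = ";".join(chains)  # chains are separated by semicolons
print(graph)

cmd = [
    "ffmpeg", "-i", "input1.mp4", "-i", "input2.mp4",
    "-filter_complex", graph,
    "-map", "[out720]", "output_720p.mp4",    # chain 2 goes to the first file
    "-map", "[out1080]", "output_1080p.mp4",  # chain 3 goes to the second file
]
```

Here the labels in square brackets do the connecting: 0:v and 1:v refer to the video of the first and second input, while overlaid1, overlaid2, out720, and out1080 are names chosen freely for the links between chains and outputs.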

The takeaway is this: with complex filter graphs (and mapping), you can:

  • build individual chains of filters
  • connect input files to filter chains
  • connect filter chains to filter chains
  • connect filter chains to output files