Pipeline (Unix)
In UNIX and other Unix-like operating systems, a pipeline is a set of processes chained by their standard streams, so that the output of each process ("stdout") feeds directly into the input ("stdin") of the next one. Filter programs are often used in this way. The concept was named by analogy to a physical pipeline.
This feature of UNIX was borrowed by other operating systems, such as Taos and MS-DOS, and eventually became the pipes and filters design pattern of software engineering. Unix pipelines should not be confused with other data processing pipelines found in modern computer systems, although the general concept is quite similar.
Pipelines in the CLI
Most Unix shells have a special syntax construct for the creation of pipelines. Typically, one simply writes the filter commands in sequence, separated by the ASCII vertical bar character "|" (which, for this reason, is often called the "pipe character" by Unix users). The shell starts the processes and arranges for the necessary connections between their standard streams (including some amount of buffer storage). A long pipeline can be entered on several lines by ending each line with a backslash ("\") placed after the pipe character.
Error stream
By default, the standard error streams ("stderr") of the processes are not passed on through the pipe; instead, they are merged and directed to the console. However, many shells have additional syntax for changing this behaviour. In the csh shell, for instance, using "|&" instead of "|" signifies that the standard error stream too should be merged with the standard output and fed to the next process.
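The difference can be seen with a small experiment. The sketch below is written for a Bourne-style shell, where merging the error stream into the pipe is commonly spelled "2>&1 |" rather than the csh "|&". The function name "emit" is a hypothetical helper introduced only for this illustration.

```sh
#!/bin/sh
# Hypothetical helper: writes one line to stdout and one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Plain pipe: only stdout enters the pipe, so grep counts one line.
# The stderr line bypasses the pipe and appears on the console.
plain=$(emit | grep -c 'to')

# With 2>&1, stderr is merged into stdout before the pipe,
# so both lines reach grep.
merged=$(emit 2>&1 | grep -c 'to')

echo "plain pipe: $plain line(s) reached grep"
echo "with 2>&1:  $merged line(s) reached grep"
```

Running the script also prints a stray "to stderr" line from the first pipeline, which is exactly the output that did not travel through the pipe.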
Example
Below is an example of a pipeline that implements a kind of spell checker for the web resource indicated by a URL [1].
curl http://www.wikipedia.org/wiki/Pipeline | \
sed 's/[^a-zA-Z ]//g' | \
tr 'A-Z ' 'a-z\n' | \
grep '[a-z]' | \
sort -u | \
comm -23 - /usr/dict/words
Here is an explanation of the pipeline:
- First the curl program obtains the HTML contents of a web page.
- The contents of this page are piped through sed, which removes all characters which are not spaces or letters.
- tr then changes all of the uppercase letters into their corresponding lowercase counterparts, and converts the spaces in the lines of text to newlines.
- Each 'word' is now on a separate line.
- grep keeps only the lines that contain at least one letter, which removes the empty lines.
- sort sorts the list of 'words' into alphabetical order, and removes duplicates.
- Finally, comm finds which of the words in the list are not in the given dictionary file (in this case, /usr/dict/words).
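To try the same chain of filters without network access or a system dictionary, a self-contained variant can be sketched as follows. The sample sentence, the six-word dictionary and the variable names are assumptions made for this illustration; the filter commands themselves are those of the example above. The locale is set to C so that sort and comm agree on the ordering, and the "mktemp" utility is assumed to be available.

```sh
#!/bin/sh
# Use one collation order for both sort and comm.
LC_ALL=C
export LC_ALL

# Hypothetical miniature dictionary, one word per line, already sorted.
dict=$(mktemp)
printf '%s\n' a is of pipeline processes set > "$dict"

# Sample input containing one misspelled word ("proceses").
text='A pipeline is a set of proceses.'

misspelled=$(printf '%s\n' "$text" |
    sed 's/[^a-zA-Z ]//g' |   # drop everything except letters and spaces
    tr 'A-Z ' 'a-z\n' |       # lowercase, and put each word on its own line
    grep '[a-z]' |            # discard empty lines
    sort -u |                 # sort and remove duplicates
    comm -23 - "$dict")       # keep words that are absent from the dictionary

rm -f "$dict"
echo "$misspelled"
```

With this input the script prints the single word "proceses", the only one missing from the miniature dictionary.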
Creating pipelines by program
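Pipelines can also be created under program control. On Unix-like systems a program typically does this with the pipe() system call, which returns a pair of connected file descriptors that can then be handed to child processes.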
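A compiled program would normally build such a connection itself through system calls. As a rough shell-level analogue, a script can wire two processes together explicitly through a named pipe (FIFO) instead of using the "|" syntax. The sketch below assumes the "mktemp" and "mkfifo" utilities are available; the directory, file and variable names are hypothetical.

```sh
#!/bin/sh
# Create a private directory holding a named pipe.
dir=$(mktemp -d)
fifo="$dir/pipe"
mkfifo "$fifo"

# Sender: runs in the background and writes into the FIFO.
# Opening the FIFO blocks until a reader opens the other end.
printf 'banana\napple\ncherry\n' > "$fifo" &

# Receiver: reads from the FIFO as its standard input.
sorted=$(sort < "$fifo")

wait            # reap the background sender
rm -r "$dir"    # remove the FIFO and its directory

echo "$sorted"
```

The two processes are started separately and meet only through the FIFO, yet the data flows exactly as it would in "printf ... | sort".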
Implementation
In most Unix-like systems, all processes of a pipeline are started at the same time, with their streams appropriately connected, and are managed by the scheduler together with all other processes running on the machine. An important aspect of this, which sets Unix pipes apart from other pipe implementations, is the concept of buffering: a sending program may produce 1000 bytes per second while a receiving program can accept only 100 bytes per second, but the operating system holds the data in a buffer, or queue, so that the receiving program need not worry about dropping data because it is too busy to receive it. The buffer has a finite size; when it is full, the sender is suspended until the receiver has drained some of it. Buffers also collect data from their senders as soon as it is made available, so a sender need not finish its job, or exit, before a receiver can start working on its output.
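The concurrent behaviour can be observed directly. In the sketch below, which assumes the standard "yes" and "head" utilities, the sender would never finish on its own, since "yes" prints lines forever. The pipeline still completes, because "head" starts consuming data immediately and exits after three lines, at which point "yes" can no longer write to the pipe and is terminated.

```sh
#!/bin/sh
# "yes" is an endless sender; "head" is a receiver that stops after 3 lines.
# If the receiver had to wait for the sender to exit, this would never return.
out=$(yes | head -n 3)

echo "$out"
```

A shell that first stored the complete output of the sender before starting the receiver could not run this pipeline at all.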
Other implementations of pipes have provided pipe-like functionality without multitasking. Under MS-DOS, for example, only one process could run at a time; it could start another process, but that process then had to complete before the initial one regained control. So when pipes were used, the command.com shell would first create a temporary buffer file and make it its own standard output. It would then start the "sending" process, which inherited the buffer file as its standard output and wrote its entire output to it. Once the sending process had terminated, the shell would close the buffer file, reopen it in read mode as its standard input, and run the second, "receiving" process, which in turn inherited the file as its standard input. This provided functionality similar to Unix shell pipes, but required each process to complete its work before handing its output off to the receiver. This was impractical for long-running processes, so MS-DOS programs that produced a lot of text often offered their own output "pagination" rather than relying on the user to pipe them through the standard more.exe utility.
Many, if not most, non-trivial MS-DOS programs did not use MS-DOS file handles for either input or output; instead they read the keyboard directly at the BIOS level, performed BIOS calls directly for output, or even wrote directly to video memory. Such programs could not be redirected through pipes in either direction.
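The temporary-file scheme can be imitated in a Unix shell, which makes the contrast with a true pipe easy to see. The following is only a sketch of the idea in POSIX shell syntax, not a reproduction of what command.com actually executed; the sample data and variable names are hypothetical, and the "mktemp" utility is assumed to be available.

```sh
#!/bin/sh
# Step 1: create the temporary buffer file.
tmp=$(mktemp)

# Step 2: run the sender to completion, with the file as its standard output.
printf 'pear\napple\nfig\n' > "$tmp"

# Step 3: only now run the receiver, with the same file as its standard input.
result=$(sort < "$tmp")

# Step 4: discard the buffer file.
rm -f "$tmp"

echo "$result"
```

The result equals that of "printf ... | sort", but the receiver cannot start until the sender has exited, and the complete intermediate output must fit on disk.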
Tools like netcat can connect pipes to TCP/IP sockets, following the Unix philosophy of "everything is a file".
CMS Pipelines is a port of the pipeline idea to VM/CMS and MVS systems. It supports much more complex pipeline structures than Unix shells, with steps taking multiple input streams and producing multiple output streams. (Such functionality is supported by the Unix kernel, but few programs use it.) Due to the different nature of IBM mainframe operating systems, CMS Pipelines implements internally many steps that in Unix are separate external programs, but it can also call separate external programs for their functionality. Also, due to the record-oriented nature of files on IBM mainframes, the pipelines operate in a record-oriented rather than stream-oriented manner.
History
The pipeline concept and the vertical-bar notation were invented by Douglas McIlroy, one of the authors of the early command shells, after he noticed that much of the time they were processing the output of one program as the input to another. The idea was eventually ported to other operating systems, such as DOS, OS/2, Windows NT, BeOS, and Mac OS X, often with the same notation.
See also
- Tee (Unix) for fitting together two pipes
- Pipeline (software) for the general software engineering concept.
- Pipeline (computer) for other computer-related pipelines.
- Hartmann pipeline
