Bash scripts, part 7: sed and text processing

Using sed in Linux

sed (from Stream EDitor) is a stream text editor (and also a programming language) that applies various predefined text transformations to a sequential stream of text data. sed can be used like grep, printing the lines that match a basic regular expression:
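A minimal sketch of this grep-like use, run on a small made-up input rather than a real file: the -n option suppresses the default output, and the p command prints only the lines that match the pattern.

```shell
# Sample input is made up for illustration.
matches=$(printf 'alpha\nerror: disk full\nbeta\n' | sed -n '/error/p')
echo "$matches"
```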

It can also be used to delete lines (here, deleting all empty lines):
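As a sketch of such a deletion, again on made-up input: the address /^$/ selects lines with nothing between their start and end, and the d command drops them.

```shell
# Delete every empty line from a made-up three-word input.
cleaned=$(printf 'one\n\ntwo\n\n\nthree\n' | sed '/^$/d')
echo "$cleaned"
```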

The main tool for working with sed is an expression of the form:

So, for example, if you run the command:
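The most common sed expression is the substitution s/pattern/replacement/flags. A minimal sketch, with a sample string made up for illustration, shows the difference between replacing the first match on a line and replacing all of them:

```shell
# Without a flag, only the first match on each line is replaced;
# the g flag replaces every match on the line.
first=$(echo 'cat cat cat' | sed 's/cat/dog/')
all=$(echo 'cat cat cat' | sed 's/cat/dog/g')
echo "$first"
echo "$all"
```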

The differences between grep, egrep and fgrep were covered above. Despite the differences in the set of regular expressions they support and in execution speed, the command-line parameters are the same for all three versions of grep.

xargs and cut

$ cut -d: -f1 < /etc/passwd | sort | xargs echo

And a command of the form

file * | grep ASCII | cut -d":" -f1 | xargs -p vim

will open the files for editing in vim.
Note the -p option. It makes the command run interactively: xargs asks for confirmation (y/n) before each invocation of the command.

To conclude, here is one more complex and interesting example: a recursive search for the largest files in a directory:

$ find . -type f -printf '%20s %p\n' | sort -n | cut -b22- | tr '\n' '\000' | xargs -0 ls -laSr

Nontrivial examples

We have recalled the basics and looked at typical use cases. Let us now move on to more complex and nontrivial examples. We came up with some of them ourselves while working on everyday tasks, and picked up others from http://www.commandlinefu.com (we strongly recommend that anyone who wants to learn the finer points of the command line visit it from time to time; it often has very useful tips).

Banning IP addresses from a list

To ban the IP addresses in a list, add them to iptables with a DROP rule. This is done with the command:

$ cat bad_ip_list | xargs -I IP iptables -A INPUT -s IP -j DROP

You can also perform a more complex operation and ban all addresses belonging to an AS:

$ /usr/bin/whois -H -h whois.ripe.net -T route -i origin AS | egrep '^route' | awk '{print $2}' | xargs -I NET iptables -A INPUT -s NET -j DROP

Changing the URL format

A URL of the form "http%3A%2F%2Fwww.google.com" can be converted into "http://www.google.com" with the command:

$ echo "http%3A%2F%2Fwww.google.com" | sed -e 's/%\([0-9A-F][0-9A-F]\)/\\\\\x\1/g' | xargs echo -e

Generating a 10-character password

A strong password can be generated with a command like:

$ tr -dc A-Za-z0-9_
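A complete pipeline along these lines could look like the following sketch. It assumes /dev/urandom as the source of random bytes and uses head -c to cut the stream to 10 characters; both are assumptions rather than part of the command shown above.

```shell
# Assumes /dev/urandom is available (typical on Linux).
# tr -dc deletes every byte that is NOT in the listed set;
# head -c 10 keeps the first 10 surviving characters.
password=$(LC_ALL=C tr -dc 'A-Za-z0-9_' < /dev/urandom | head -c 10)
echo "$password"
```

Piping the result through xargs instead of using echo would have the same effect of appending a trailing newline.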

Passwords can also be generated without xargs: there is a dedicated utility for this, pwgen. There are other ways of generating passwords as well.

Finding binaries installed without dpkg

This may be needed if, for example, the machine has fallen victim to an attack and malicious software has been installed on it. The following command helps reveal which programs the attackers installed (it looks for running binaries that were installed without the dpkg package manager):

$ cat /var/lib/dpkg/info/*.list > /tmp/listin ; ls /proc/*/exe | xargs -l readlink | grep -xvFf /tmp/listin; rm /tmp/listin

Removing outdated kernel packages

The problem of removing old kernels has already been discussed on Habr, where some interesting example commands can also be found.

Converting a script into a single line

Sometimes a large script needs to be converted into a single line. This can be done as follows:

$ (sed 's/#.*//g' | sed '/^ *$/d' | tr '\n' ';' | xargs echo)
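As a sketch of how this pipeline behaves, the example below feeds it a throwaway script.sh created only for the demonstration (the file name and its contents are assumptions). Comments and blank lines are stripped, and the remaining lines are joined with semicolons.

```shell
# script.sh is a throwaway example file created just for this demonstration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/script.sh" <<'EOF'
# print two words
echo one

echo two
EOF

# Strip comments, drop blank lines, join lines with ';', print as one line.
oneliner=$( (sed 's/#.*//g' | sed '/^ *$/d' | tr '\n' ';' | xargs echo) < "$tmpdir/script.sh")
echo "$oneliner"
rm -r "$tmpdir"
```

Note that xargs interprets quotes and backslashes in its input, and the first sed removes everything after any # character, so a script that contains quoted strings or a # inside a string will not survive this pipeline unchanged.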

Handling Files with Special Characters in the Names

A filename can contain special characters like single quotes, double quotes, or spaces.

Let’s see how to handle those files when passed to xargs.

Filename Contains Spaces

Let’s check if xargs passes the file file 1.log, to the rm command, despite the space in the name:
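A minimal sketch of this check, using a temporary directory instead of ./log so that nothing real is touched (the directory layout is an assumption made for the demonstration):

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
mkdir "$workdir/log"
touch "$workdir/log/file 1.log"

# xargs splits the path at the space and hands rm two nonexistent names,
# so rm reports "No such file or directory" for both of them.
find "$workdir/log" -type f | xargs rm || true

# The file is still there.
survivors=$(find "$workdir/log" -type f | wc -l)
echo "files left: $survivors"
rm -r "$workdir"
```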

As we can see, xargs sends two arguments ./log/file and 1.log to the rm command instead of a single argument ./log/file 1.log. Of course, there are no files with name file and 1.log, so the rm command gives an error message No such file or directory.

To make xargs recognize the files with spaces in their names, again, we will use the -I option. But we must quote the placeholder:
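A sketch of the -I variant, again in a temporary directory (an assumption made for the demonstration). The quotes around {} only protect the placeholder from the shell; what keeps the name intact is that with -I xargs substitutes each whole input line for the placeholder.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/log"
touch "$workdir/log/file 1.log"

# With -I, each input line replaces {} as a single argument,
# so the embedded space no longer splits the name.
find "$workdir/log" -type f | xargs -I '{}' rm '{}'

remaining=$(find "$workdir/log" -type f | wc -l)
echo "files left: $remaining"
rm -r "$workdir"
```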

The file 1.log has been deleted now.

Filename Contains Quotes

Let’s start with another example:

We have created two log files. The file file'1.log has a single quote in its name, while file"2.log has a double quote character in its name.

Next, let’s pass the output of find to the xargs command and try to remove the two files:

Let's see if quoting the placeholder of xargs's -I option helps:

The test shows that if a filename contains a single or double quote character, quoting the placeholder does not help.

To deal with this problem, both xargs and the command passing the output to xargs must use a null character as the record separator.

We can pass the option -print0 to the find command to ask it to output filenames separated by a null character instead of the default newline.

On the xargs command side, since the default record separator is a newline character as well, we need the -0 option to use a null character as a record separator:
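The combination of -print0 and -0 can be sketched as follows, using a temporary directory and the two file names from the example above (the directory itself is an assumption made for the demonstration):

```shell
workdir=$(mktemp -d)
mkdir "$workdir/log"
touch "$workdir/log/file'1.log" "$workdir/log/file\"2.log"

# find emits NUL-terminated names; xargs -0 splits only on NUL,
# so quotes (and spaces) are passed through literally.
find "$workdir/log" -type f -print0 | xargs -0 rm

remaining=$(find "$workdir/log" -type f | wc -l)
echo "files left: $remaining"
rm -r "$workdir"
```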

The output above shows the two files have been deleted.

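Note, however, that some commands cannot use a null character as a separator (for example, traditional implementations of head, tail, ls, echo, sed, and wc, although recent GNU versions of some of them, such as head, tail and sed, accept a -z option). When no such option is available, xargs cannot handle the output of these commands if the output contains quote marks.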

Command-line options

-0	Use the NULL character ("\0") as the separator in the input stream instead of spaces and newlines; it combines well with the -print0 option of the find command.

-l[number]	Execute the command for each group of the given number of non-empty argument lines read from standard input. The last invocation may get fewer lines. A line is considered to end at the first newline character, unless that newline is preceded by a space or tab; a trailing space or tab signals that the next non-empty line continues the current one. If number is omitted, it is taken to be 1. The -l option implies the -x option.

-I replace_str, -i[replace_str]	Insert mode: the command is executed for each line of standard input, with the whole line treated as a single argument and substituted into initial_arguments in place of every occurrence of replace_str. At most 5 initial_arguments may contain one or more occurrences of replace_str. Spaces and tabs at the start of input lines are discarded. The constructed arguments cannot be longer than 255 characters. If replace_str is not given, it defaults to {}. The -I option implies the -x option.

-n number	Execute the command using as many arguments read from standard input as possible, but no more than the given number. Fewer arguments are used if their total length exceeds size (see the -s option below), or if fewer than number remain for the last invocation. If -x is also given, each group of number arguments must fit within the size limit; otherwise xargs terminates.

-t	Trace mode: the command and each constructed argument list are written to standard error before execution.

-p	Prompt mode: xargs asks for confirmation before each invocation of the command. Trace mode (-t) is turned on, so the invocation about to be executed is printed, followed by the prompt ?.... An answer of y (optionally followed by anything) causes the command to be executed; with any other answer, including a bare carriage return, that invocation is skipped.

-x	Terminate if the next argument list turns out to be longer than size (in characters). The -x option is turned on by the -i and -l options. If none of -i, -l or -n is given, the total length of all arguments must fit within the size limit.

-s size	Set the maximum total size (in characters) of each argument list to size. The size must be a positive number not exceeding 470 (the default). When choosing size, keep in mind that one character is added to each argument; the number of characters in the command name is counted as well.

-e[eof_str]	The string eof_str is treated as the logical end-of-file marker. If -e is not given, an underscore (_) is taken as the end marker. -e without eof_str disables the logical end-of-file feature (the underscore is then treated as an ordinary character). xargs reads standard input until it reaches end of file or encounters eof_str.

xargs terminates if it receives exit code -1 from the command or if the command cannot be executed. If the command is a shell program, it should explicitly call exit with an appropriate argument to avoid accidentally returning -1.

Encoding problem

The argument separator processing of xargs is not the only problem with using the program in its default mode. Most Unix tools which are often used to manipulate filenames are text processing tools. However, Unix path names are not really text. Consider a path name /aaa/bbb/ccc. The /aaa directory and its bbb subdirectory can in general be created by different users with different environments. That means these users could have a different locale setup, and that means that aaa and bbb do not even necessarily have to have the same character encoding. For example, aaa could be in UTF-8 and bbb in Shift JIS. As a result, an absolute path name in a Unix system may not be correctly processable as text under a single character encoding. Tools which rely on their input being text may fail on such strings.

One workaround for this problem is to run such tools in the C locale, which essentially processes the bytes of the input as-is. However, this will change the behavior of the tools in ways the user may not expect (for example, some of the user’s expectations about case-folding behavior may not be met).
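A small sketch of both the workaround and its side effect, using sort on made-up input: in the C locale the comparison is done on raw bytes, so all uppercase letters come before all lowercase ones, which may differ from the ordering a user expects in their usual locale.

```shell
# In the C locale, sort compares raw bytes: 'A' and 'B' (65, 66)
# come before 'a' and 'b' (97, 98).
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
echo "$sorted"
```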

Placement of arguments

-I option: single argument

The xargs command offers options to insert the listed arguments at some position other than the end of the command line. The -I option to xargs takes a string that will be replaced with the supplied input before the command is executed. A common choice is %.

$ mkdir ~/backups
$ find /path -type f -name '*~' -print0 | xargs -0 -I % cp -a % ~/backups

The string to replace may appear multiple times in the command part. Using -I at all limits the number of lines used each time to one.
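A minimal sketch of a replace string used more than once, with echo standing in for a real command and made-up file names as input:

```shell
# % is the replace string; it appears twice in the command.
# Each input line triggers one separate invocation of echo.
out=$(printf 'a.txt\nb.txt\n' | xargs -I % echo "copy % to %.bak")
echo "$out"
```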

Shell trick: any number

Another way to achieve a similar effect is to use a shell as the launched command, and deal with the complexity in that shell, for example:

$ mkdir ~/backups
$ find /path -type f -name '*~' -print0 | xargs -0 sh -c 'for filename; do cp -a "$filename" ~/backups; done' sh

The word sh at the end of the line is for the POSIX shell to fill in for $0, the "executable name" part of the positional parameters (argv). If it weren't present, the name of the first matched file would instead be assigned to $0 and that file wouldn't be copied to ~/backups. One can also use any other word to fill in that blank.

Since cp accepts multiple files at once, one can also simply do the following:

$ find /path -type f -name '*~' -print0 | xargs -0 sh -c 'if [ $# -gt 0 ]; then cp -a "$@" ~/backups; fi' sh

This script runs cp with all the files given to it when there are any arguments passed. Doing so is more efficient since only one invocation of cp is done for each invocation of sh.
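The role of the trailing placeholder word can be seen in a small sketch, with echo standing in for cp and two made-up items as input:

```shell
# With the trailing "sh", the placeholder fills $0 and every input item
# lands in the positional parameters that the for loop iterates over.
with_placeholder=$(printf 'a\0b\0' | xargs -0 sh -c 'for f; do echo "$f"; done' sh)

# Without it, the first item ("a") is consumed as $0 and silently skipped.
without_placeholder=$(printf 'a\0b\0' | xargs -0 sh -c 'for f; do echo "$f"; done')

echo "with:    $with_placeholder"
echo "without: $without_placeholder"
```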

Examples

One use case of the xargs command is to remove a list of files using the rm command. POSIX systems have an ARG_MAX for the maximum total length of the command line, so a command that hands rm a very long file list directly, whether built by a shell glob or by command substitution, may fail with an error message of "Argument list too long" (meaning that the exec system call's limit on the length of a command line was exceeded). (The command-substitution form is also incorrect, as it may expand globs in the output.)

This can be rewritten using the xargs command to break the list of arguments into sublists small enough to be acceptable:

find /path -type f -print | xargs rm

In the above example, the find utility feeds the input of xargs with a long list of file names. xargs then splits this list into sublists and calls rm once for every sublist.
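The splitting into sublists can be made visible with a small sketch. Here -n 2 artificially caps each invocation at two arguments, and echo rm prints the command lines that would be run instead of deleting anything (both are choices made only for the demonstration):

```shell
# Without -n, xargs packs as many arguments as the size limit allows;
# -n 2 forces small batches so the separate invocations can be seen.
batches=$(printf '%s\n' 1 2 3 4 5 | xargs -n 2 echo rm)
echo "$batches"
```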

xargs can also be used to parallelize operations with the -P argument, which specifies how many parallel processes should be used to execute the commands over the input argument lists. However, the output streams may not be synchronized. This can be overcome by giving each process its own output file where possible (as with the -o argument in the example below), and then combining the results after processing. The following example queues 24 processes and waits on each to finish before launching another.

find /path -name '*.foo' | xargs -P 24 -I '{}' /cpu/bound/process '{}' -o '{}'.out

xargs often covers the same functionality as the command substitution feature of many shells, denoted by the notation `...` or $(...). xargs is also a good companion for commands that output long lists of files such as find, locate and grep, but only if one uses -0 (or equivalently --null), since xargs without -0 deals badly with file names containing ', " and space. GNU Parallel is a similar tool that offers better compatibility with find, locate and grep when file names may contain ', ", and space (newline still requires -0).

Examples

find /tmp -name core -type f -print | xargs /bin/rm -f

Find files named core in or below the directory /tmp and delete them. Note that this will work incorrectly if there are any file names containing newlines or spaces.

find /tmp -name core -type f -print0 | xargs -0 /bin/rm -f

Find files named core in or below the directory /tmp and delete them, processing file names in such a way that file or directory names containing spaces or newlines are correctly handled.

find /tmp -depth -name core -type f -delete

Find files named core in or below the directory /tmp and delete them, but more efficiently than in the previous example (because we avoid the need to use fork and exec rm, and we don’t need the extra xargs process).

cut -d: -f1 < /etc/passwd | sort | xargs echo

Uses cut to generate a compact listing of all the users on the system.

xargs sh -c 'emacs "$@" < /dev/tty' emacs

Launches the minimum number of copies of Emacs needed, one after the other, to edit the files listed on xargs' standard input.

Options

--arg-file=file, -a file	Read items from file instead of standard input. If you use this option, stdin remains unchanged when commands are run. Otherwise, stdin is redirected from /dev/null.
--null, -0	Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are not special (every character is taken literally). Disables the end-of-file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The find -print0 option produces input suitable for this mode.
--delimiter=delim, -d delim	Input items are terminated by the specified character. Quotes and backslash are not special; every character in the input is taken literally. Disables the end-of-file string, which is treated like any other argument. This can be used when the input consists of newline-separated items, although it is almost always better to design your program to use --null where this is possible. The specified delimiter may be a single character, a C-style character escape such as \n, or an octal or hexadecimal escape code. Octal and hexadecimal escape codes are understood as for the printf command. Multibyte characters are not supported.
-E eof-str	Set the end-of-file string to eof-str. If the end-of-file string occurs as a line of input, the rest of the input is ignored. If neither -E nor -e is used, no end-of-file string is used.
--eof[=eof-str], -e[eof-str]	This option is a synonym for the -E option. Use -E instead, because it is POSIX compliant while this option is not. If eof-str is omitted, there is no end-of-file string. If neither -E nor -e is used, no end-of-file string is used.
--help	Display a help message summarizing xargs options, and exit.
-I replace-str	Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character. Implies -x and -L 1.
--replace[=replace-str], -i[replace-str]	This option is a synonym for -Ireplace-str if replace-str is specified, and for -I{} otherwise. This option is deprecated; use -I instead.
-L max-lines	Use at most max-lines nonblank input lines per command line. Trailing blanks cause an input line to be logically continued on the next input line. Implies -x.
--max-lines[=max-lines], -l[max-lines]	Synonym for the -L option. Unlike -L, the max-lines argument is optional. If max-lines is not specified, it defaults to one. The -l option is deprecated since the POSIX standard specifies -L instead.
--max-args=max-args, -n max-args	Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.
--interactive, -p	Prompt the user about whether to run each command line and read a line from the terminal. Only run the command line if the response starts with "y" or "Y". Implies -t.
--max-chars=max-chars, -s max-chars	Use at most max-chars characters per command line, including the command and initial-arguments and the terminating nulls at the ends of the argument strings. The largest allowed value is system-dependent, and is calculated as the argument length limit for exec, less the size of your environment, less 2048 bytes of headroom. If this value is more than 128 KiB, 128 KiB is used as the default value; otherwise, the default value is the maximum. 1 KiB is 1024 bytes.
--verbose, -t	Print the command line on the standard error output before executing it.
--version	Print the version number of xargs, and exit.
--show-limits	Display the limits on the command-line length that are imposed by the operating system, xargs' choice of buffer size and the -s option. Pipe the input from /dev/null (and perhaps specify --no-run-if-empty) if you don't want xargs to do anything.
--exit, -x	Exit if the size (see the -s option) is exceeded.
--max-procs=max-procs, -P max-procs	Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done.

Separator problem

Many Unix utilities are line-oriented. These may work with xargs as long as the lines do not contain ', ", or a space. Some Unix utilities can use NUL as a record separator (e.g. Perl, or find with its -print0 option). Using -0 for xargs deals with the problem, but many Unix utilities cannot use NUL as separator (e.g. head, tail, ls, echo, sed, wc).

But often people forget this and assume xargs is also line-oriented, which is not the case (by default xargs separates on newlines and blanks within lines; substrings with blanks must be single- or double-quoted).

The separator problem is illustrated here:

# Make some targets to practice on
touch important_file
touch 'not important_file'
mkdir -p '12" records'

find . -name not\* | tail -1 | xargs rm
find \! -name . -type d | tail -1 | xargs rmdir

Running the above will cause important_file to be removed but will remove neither the directory called 12" records, nor the file called not important_file.

The proper fix is to use the GNU-specific -print0 option, but tail (and other tools) do not support NUL-terminated strings:

# use the same preparation commands as above
find . -name not\* -print0 | xargs -0 rm
find \! -name . -type d -print0 | xargs -0 rmdir

When using the -print0 option, entries are separated by a null character instead of an end-of-line. A more verbose equivalent converts the newlines to null characters before they reach xargs; a shorter one switches xargs to (non-POSIX) line-oriented mode with the -d (delimiter) option. In general, though, using -0 with -print0 should be preferred, since newlines in filenames are still a problem for both alternatives.
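The two alternatives can be sketched as follows. The exact commands are assumptions built from the description (tr for the newline-to-NUL conversion, GNU xargs for -d), run in a temporary directory with the file name from the example above:

```shell
workdir=$(mktemp -d)

# Variant 1: translate newlines to NUL characters, then let xargs -0 split on NUL.
touch "$workdir/not important_file"
find "$workdir" -name 'not*' | tr '\n' '\000' | xargs -0 rm

# Variant 2 (GNU xargs): treat newline as the only delimiter.
touch "$workdir/not important_file"
find "$workdir" -name 'not*' | xargs -d '\n' rm

left=$(find "$workdir" -type f | wc -l)
echo "files left: $left"
rm -r "$workdir"
```

Both variants still break on a file name that itself contains a newline, which is why the -print0 and -0 pair remains the safer choice.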

GNU parallel is an alternative to xargs that is designed to have the same options, but is line-oriented. Thus, using GNU Parallel instead, the above would work as expected.

For Unix environments where xargs does not support the -0 nor the -d option (e.g. Solaris, AIX), the POSIX standard states that one can simply backslash-escape every character. Alternatively, one can avoid using xargs at all, either by using GNU parallel or by using the built-in functionality of find (for example -exec or -delete).
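A sketch of the backslash-escaping approach, assuming sed is used to do the escaping (that choice, and the temporary directory, are assumptions made for the demonstration). Every character of each line is prefixed with a backslash, so xargs takes spaces and quotes literally:

```shell
workdir=$(mktemp -d)
touch "$workdir/not important_file"
mkdir "$workdir/12\" records"

# s/./\\&/g puts a backslash in front of every character on the line.
find "$workdir" -name 'not*' | sed 's/./\\&/g' | xargs rm
find "$workdir" -type d -name '12*' | sed 's/./\\&/g' | xargs rmdir

left=$(ls -A "$workdir" | wc -l)
echo "entries left: $left"
rm -r "$workdir"
```

File names containing newlines are still not handled by this approach, since the input remains line-based.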

A solution with GNU Parallel

Below is a translation of the introduction from the utility's manual:

A command-line utility for running jobs in parallel on one or more computers. A job in this context is a single command or script that has to be run for each input argument. A typical set of arguments is a list of files, hosts, users, URLs or tables. Arguments can also be passed through a pipe. GNU parallel can split the arguments and pass them to commands in parallel.

If you use xargs, you will easily be able to use parallel, since it supports the same command-line arguments as xargs. If you use loops in shell scripts, parallel will probably help you get rid of them and speed up execution by running commands in parallel.

GNU parallel returns the results of the commands in the same order as if they had been run sequentially. This makes it possible to use the output of parallel as input for other programs.

For each input line GNU parallel runs command and passes it that line as arguments. If command is not given, the input line itself is executed. Several lines are executed at the same time. GNU parallel can be used as a replacement for xargs and cat | bash.

This utility has at least 2 visible advantages over xargs:

it can run commands not just on one server but on several at once,
the manual promises that results will be output in order.

Let's try it. We will trust the promise that parallel accepts the same arguments as xargs and simply swap one utility name for the other in the command we used earlier:

$ time echo {1..20} | parallel -n 1 -P 4 ./do-something.sh -x
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

real 0m0.562s
user 0m0.135s
sys 0m0.096s

Result 4

It works! The command finished in roughly the same 0.5 seconds as with xargs, and the results came back in the correct order.

Now let's bring back the random delay: in do-something.sh we replace sleep 0.1 with sleep $rnd and run it again. The results are again returned in the correct order, even though, because of the varying delays, commands started later may finish before earlier ones (this is clearly visible in the second result above).

The only drawback is that xargs returns results as soon as they are ready, while parallel returns them only once all commands have finished. But that is the price of getting results in the correct order. If parallel is run with the --bar argument, a progress bar showing the percentage of completed commands is displayed while it works.

Now let's try another killer feature of parallel: the ability to run a command on several servers at once. For this we will use an example from the documentation: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallel-grep .

# Add the list of servers to the config. In my case the servers are named dev and test
(echo dev; echo test) > .parallel/my_cluster
# Make sure the file .ssh/config exists and back it up
touch .ssh/config
cp .ssh/config .ssh/config.backup
# Temporarily disable StrictHostKeyChecking
(echo "Host *"; echo StrictHostKeyChecking no) >> .ssh/config
parallel --slf my_cluster --nonall true
# Roll back the StrictHostKeyChecking changes in the SSH config
mv .ssh/config.backup .ssh/config

The servers from .parallel/my_cluster have now been added to .ssh/known_hosts.

Finally, the do-something.sh script has to be copied to the current user's home directory on the remote servers (test and dev in my example).

With the preparation done, we can run the command on the dev and test servers by adding the --sshlogin dev,test option to the parallel invocation.

Let's try:

$ time echo {1..3200} | parallel -n 1 -P 4 --sshlogin test,dev ./do-something.sh -x
real 0m0.334s
user 0m0.080s
sys 0m0.032s

Result 5

A speed-up is visible even on such an elementary operation, despite the overhead of setting up network connections. With really heavy commands that take tens of seconds or minutes to run, the gain from such distributed execution may be even more noticeable.


Description

xargs reads items from the standard input, delimited by blanks (which can be protected with double or single quotes or a backslash) or newlines, and executes the command (the default command is echo, located at /bin/echo) one or more times with any initial-arguments followed by items read from standard input. Blank lines on the standard input are ignored.

Because Unix file names can contain blanks and newlines, this default behaviour is often problematic; file names containing blanks or newlines are incorrectly processed by xargs. In these situations it is better to use the -0 option (that’s a zero, not a capital o), which prevents such problems. When using this option you will need to ensure that the program which produces the input for xargs also uses a null character as a separator. If that program is find for example, the -print0 option does this for you.

If any invocation of the command exits with a status of 255, xargs will stop immediately without reading any further input. An error message is issued on stderr when this happens.
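This stop-on-255 behaviour can be seen in a small sketch, where a hypothetical inline helper exits with status 255 when it receives the item 2:

```shell
# The helper prints its argument, except that it exits with 255 on item 2.
# xargs then aborts, so item 3 is never processed.
processed=$(printf '%s\n' 1 2 3 |
  xargs -n 1 sh -c '[ "$1" = 2 ] && exit 255; echo "$1"' sh 2>/dev/null) || true
echo "$processed"
```

Only the first item is printed; the error message that xargs writes to stderr is discarded here to keep the output short.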

This documentation is specific to the GNU version of xargs, which is commonly distributed with most variants of Linux.