Parsing files in Linux

Parsing a CSV file with bash and awk

Good day, Habr reader!

I needed to translate the interface of a certain system. The translation for each form is stored in a separate XML file, and the files are scattered in groups across folders, which is very inconvenient. I decided to build a single dictionary so that the translation of all the forms could be worked on in Excel. This task in turn breaks down into two subtasks: extract the information from all the XML files into one CSV file, and, once the translation is done, generate XML files with the original structure from the CSV file. bash and awk were chosen as the tools. There is no point in describing the first subtask, as it is fairly trivial. But how do you parse the CSV file?

You can find plenty of information on this topic on the Internet. Most of the examples, however, cope only with the simple cases. I found nothing that could handle, for example, this:

./web/analyst/xml/list.template.xml;test;"t ""test""; est"
./web/analyst/xml/list.template.xml;%1 _s found. Displaying %2 through %3;Найдено объектов: %1. Отображено с %2 по %3

In Excel these rows look like this:

File | Tag | Translation
./web/analyst/xml/list.template.xml | test | t "test"; est
./web/analyst/xml/list.template.xml | %1 _s found. Displaying %2 through %3 | Найдено объектов: %1. Отображено с %2 по %3

Taking an example from OpenNET as a basis, I decided to adapt it. Here is the awk program:
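A sketch of such a parser: fields are separated by ;, quoted fields may contain the separator, and a doubled "" inside quotes stands for a literal quote (this illustrates the approach rather than reproducing the original listing):

    #!/usr/bin/awk -f
    # parse_csv.awk: split ;-delimited CSV with quoted fields and "" escapes
    {
        line = $0
        nf = 0
        while (length(line) > 0) {
            if (substr(line, 1, 1) == "\"") {
                # quoted field: consume up to the closing quote, honouring "" escapes
                line = substr(line, 2)
                field = ""
                while (1) {
                    p = index(line, "\"")
                    if (p == 0) { field = field line; line = ""; break }
                    field = field substr(line, 1, p - 1)
                    line = substr(line, p + 1)
                    if (substr(line, 1, 1) == "\"") {  # doubled quote -> literal "
                        field = field "\""
                        line = substr(line, 2)
                    } else break
                }
                sub(/^;/, "", line)
            } else {
                # unquoted field: runs to the next ; or to end of line
                p = index(line, ";")
                if (p == 0) { field = line; line = "" }
                else { field = substr(line, 1, p - 1); line = substr(line, p + 1) }
            }
            fields[++nf] = field
        }
        # emit the parsed fields, tab-separated
        out = fields[1]
        for (i = 2; i <= nf; i++) out = out "\t" fields[i]
        print out
    }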

And here is a fragment of the bash script (XML_PATH is a variable holding the path under which the folders with the XML files are located):
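A fragment along these lines could feed the parsed rows back into the per-file XML generation (script and file names here are illustrative; the generation step itself is omitted):

    # parse the translated CSV and fan the rows out by source file
    awk -f parse_csv.awk dictionary.csv |
    while IFS=$'\t' read -r file tag translation; do
        # "$file" is stored relative (./web/...), so anchor it at $XML_PATH
        printf '%s\t%s\n' "$tag" "$translation" >> "$XML_PATH/${file#./}.dict"
    done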


How to parse JSON with shell scripting in Linux?

I have a JSON output from which I need to extract a few parameters in Linux.

This is the shape of the JSON output (field names and values below are illustrative):
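Assume it is saved as instances.json (loosely modeled on EC2 API output):

    {
      "Instances": [
        {
          "InstanceID": "i-0123456789abcdef0",
          "ImageID": "ami-12345678",
          "Tags": [
            { "Key": "Name",  "Value": "web-server-1" },
            { "Key": "Owner", "Value": "ops-team" }
          ]
        }
      ]
    }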

I want to write a file that contains headings such as instance ID, and tags such as name, cost center, and owner, with the corresponding values from the JSON output below them. The output given here is just an example.

How can I do that using sed and awk?

Expected output:
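Continuing the illustrative sample above, something like:

    InstanceID           Name          CostCenter  Owner
    i-0123456789abcdef0  web-server-1  cc-1234     ops-team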

12 Answers

The availability of parsers in nearly every programming language is one of the advantages of JSON as a data-interchange format.

Rather than trying to implement a JSON parser, you are likely better off using either a tool built for JSON parsing such as jq or a general-purpose scripting language that has a JSON library.

For example, using jq, you could pull out the ImageID from the first item of the Instances array as follows:
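With the sample above saved as instances.json:

    jq -r '.Instances[0].ImageID' instances.json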

Alternatively, to get the same information using Ruby’s JSON library:
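For instance:

    ruby -rjson -e 'puts JSON.parse(STDIN.read)["Instances"][0]["ImageID"]' < instances.json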

I won’t answer all of your revised questions and comments but the following is hopefully enough to get you started.

Suppose that you had a Ruby script that could read JSON from STDIN and output the second line in your example output[0]. That script might look something like:
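A sketch of such a script (field names follow the illustrative sample above):

    #!/usr/bin/env ruby
    require 'json'

    data = JSON.parse($stdin.read)
    inst = data['Instances'][0]
    tags = Hash[(inst['Tags'] || []).map { |t| [t['Key'], t['Value']] }]
    # cost-center isn't in the JSON; it's made up, per footnote [0]
    puts [inst['InstanceID'], tags['Name'], 'cc-1234', tags['Owner']].join("\t")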

How could you use such a script to accomplish your whole goal? Well, suppose you already had the following:

  • a command to list all your instances
  • a command to get the JSON above for any instance on your list and output it to STDOUT

One way would be to use your shell to combine these tools:
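For example (list_instances, get_instance_json and extract_fields.rb are hypothetical names standing in for the pieces above):

    printf 'InstanceID\tName\tCostCenter\tOwner\n' > report.tsv
    list_instances | while IFS= read -r id; do
        get_instance_json "$id" | ruby extract_fields.rb >> report.tsv
    done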

Now, maybe you have a single command that gives you one JSON blob for all instances, with more items in that "Instances" array. Well, if that is the case, you'll just need to modify the script a bit to iterate through the array rather than simply using the first item.

In the end, the way to solve this problem is the way to solve many problems in Unix. Break it down into easier problems. Find or write tools to solve the easier problems. Combine those tools with your shell or other operating system features.

[0] Note that I have no idea where you get cost-center from, so I just made it up.


How can I parse CSV files on the Linux command line? [closed]


How can I parse CSV files on the Linux command line?

To do things like:
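Something like this hypothetical invocation:

    parse_csv -c 2,5,6 data.csv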

to extract fields from columns 2, 5 and 6 from all rows.

It should be able to handle the CSV file format (https://www.rfc-editor.org/rfc/rfc4180), which means quoting fields and escaping inner quotes as appropriate. So, for an example row with 3 fields:
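For instance, a row along these lines:

    Jones,"500 High St, Anytown","Says ""hi"" a lot"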

so that if I request field 2 for the row above I get:
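With the illustrative row above, that would be:

    500 High St, Anytown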

I appreciate that there are numerous solutions, Perl, Awk (etc.) to this problem but I would like a native bash command line tool that does not require me to invoke some other scripting environment or write any additional code(!).

12 Answers

csvtool is really good. Available in Debian / Ubuntu ( apt-get install csvtool ). Example:
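For instance, pulling the columns from the question:

    csvtool col 2,5,6 data.csv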

See the CSVTool manual page for usage tips.

My FOSS CSV stream editor CSVfix does exactly what you want. There is a binary installer for Windows, and a compilable version (via a makefile) for UNIX/Linux.

As suggested by @Jonathan in a comment, there is a module for python that provides the command line tool csvfilter. It works like cut, but properly handles CSV column quoting:
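For instance (csvfilter counts fields from 0, so these pick columns 2, 5 and 6):

    csvfilter -f 1,4,5 data.csv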

If you have python (and you should), you can install it simply like this:
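    pip install csvfilter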

I found csvkit to be useful; it is based on Python's csv module and has quite a lot of options for parsing complex CSV files.

It seems to be a bit slow, though: I get about 4 MB/s (at 100% CPU) when extracting one field from a 7 GB CSV with 5 columns.

To extract the 4th column from file.csv:
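    csvcut -c 4 file.csv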

Try crush-tools, they are great at manipulating delimited data. It sounds like exactly what you’re looking for.

My gut reaction would be to write a script wrapper around Python’s csv module (if there isn’t already such a thing).

I wrote one of these tools too (UNIX only), called csvprintf. It can also convert to XML in an online fashion.

For a super lightweight wrapper around Python’s csv module, you could look at pluckr.

This sounds like a job for awk.
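As a sketch, GNU awk can do this with FPAT, which defines what a field is rather than what separates fields; it copes with quoted fields containing commas, though not with doubled "" escapes:

    gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" } { print $2, $5, $6 }' data.csv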

You will most likely need to write your own script for your specific needs, but this site has some dialogue about how to go about doing this.

You could also use the cut utility to strip the fields out.
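For example (cut knows nothing about CSV quoting, so this is for simple files only):

    cut -d , -f 2,5,6 data.csv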

where the -f argument is the field you want and -d is the delimiter you want. You could then sort these results, find the unique ones, or use any other bash utility. There is a cool video here about working with CSV files from the command line. Only about a minute; I'd take a look.

However, I suppose you might lump the cut utility in with awk and not want to use it either. I don't really know exactly what you mean by a native bash command, though, so I'll suggest it anyway.


How can I parse a YAML file from a Linux shell script?

I wish to provide a structured configuration file which is as easy as possible for a non-technical user to edit (unfortunately it has to be a file), and so I wanted to use YAML. However, I can't find any way of parsing this from a Unix shell script.

21 Answers

Here is a bash-only parser that leverages sed and awk to parse simple yaml files:
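The widely circulated version of this function (later answers here refer to it as Stefan Farestam's parse_yaml) looks like this:

    function parse_yaml {
       local prefix=$2
       local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
       sed -ne "s|^\($s\):|\1|" \
            -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
            -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
       awk -F$fs '{
          indent = length($1)/2;
          vname[indent] = $2;
          for (i in vname) {if (i > indent) {delete vname[i]}}
          if (length($3) > 0) {
             vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
             printf("%s%s%s=\"%s\"\n", "'$prefix'", vn, $2, $3);
          }
       }'
    }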

It understands files such as:
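For example, a sample.yml such as:

    ## global definitions
    global:
      debug: yes
      verbose: no
      debugging:
        detailed: no
        header: "debugging started"

    ## output
    output:
       file: "yes"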

Which, when parsed using:
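    parse_yaml sample.yml

produces output along these lines:

    global_debug="yes"
    global_verbose="no"
    global_debugging_detailed="no"
    global_debugging_header="debugging started"
    output_file="yes"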

It also understands YAML files generated by Ruby, which may include Ruby symbols, and will output the same as in the previous example.

Typical use within a script is:
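For instance:

    eval $(parse_yaml sample.yml)
    echo "$global_debug"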

parse_yaml accepts a prefix argument so that imported settings all have a common prefix (which will reduce the risk of namespace collisions).

Note that previous settings in a file can be referred to by later settings:
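For example (variable names follow the prefixing scheme above; the expansion happens when the output is eval'd):

    # with eval $(parse_yaml config.yml), logging_dir expands to /srv/app/logs
    base_dir: "/srv/app"
    logging:
      dir: "$base_dir/logs"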

Another nice usage is to first parse a defaults file and then the user settings, which works since the latter settings override the earlier ones:
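For instance:

    eval $(parse_yaml defaults.yml)
    eval $(parse_yaml project.yml)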

I've written shyaml in Python for YAML query needs from the shell command line.

An example YAML file (with complex features):
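Something along these lines, saved as test.yaml (values are illustrative):

    name: "MyName !!"
    subvalue:
        how-much: 1.1
        things:
            - first
            - second
            - third
        maintainer: "Valentin Lab"

A simple query then looks like:

    cat test.yaml | shyaml get-value subvalue.maintainer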

More complex looping query on complex values:
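A sketch using the \0-padded mode:

    cat test.yaml | shyaml get-values-0 subvalue.things |
    while IFS='' read -r -d '' item; do
        echo "item: $item"
    done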

A few key points:

  • all YAML types and syntax oddities are correctly handled, such as multiline and quoted strings and inline sequences.
  • \0-padded output is available for solid multiline entry manipulation.
  • simple dotted notation to select sub-values (i.e. subvalue.maintainer is a valid key).
  • access by index is provided to sequences (i.e. subvalue.things.-1 is the last element of the subvalue.things sequence).
  • access to all sequence/struct elements in one go for use in bash loops.
  • you can output whole subparts of a YAML file as YAML itself, which blends well with further manipulation by shyaml.

More samples and documentation are available on the shyaml GitHub page or the shyaml PyPI page.

My use case may or may not be quite the same as what this original post was asking, but it’s definitely similar.

I need to pull in some YAML as bash variables. The YAML will never be more than one level deep.

YAML looks like so:
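For the sake of the example (one level deep, with a URL to exercise the :// exclusion below):

    app_name: myapp
    app_env: production
    api_url: https://api.example.com/v1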

Output like-a dis:
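    app_name="myapp"
    app_env="production"
    api_url="https://api.example.com/v1"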

I achieved the output with this line:
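    # the three expressions are itemized below; file names are illustrative
    sed -e 's/:[^:\/\/]/="/g;s/$/"/g;s/ *=/=/g' config.yml > config.sh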

  • s/:[^:\/\/]/="/g finds : and replaces it with =", while ignoring :// (for URLs)
  • s/$/"/g appends " to the end of each line
  • s/ *=/=/g removes all spaces before =

yq is a lightweight and portable command-line YAML processor

The aim of the project is to be the jq or sed of yaml files.

As an example (stolen straight from the documentation), given a sample.yaml file of:
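    b:
      c: 2

you can read the value with (v3 syntax; v4 changed the expression language):

    yq r sample.yaml b.c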

Given that Python3 and PyYAML are quite easy dependencies to meet nowadays, the following may help:
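A one-liner sketch (reusing sample.yaml from above):

    python3 -c 'import sys, yaml; print(yaml.safe_load(open(sys.argv[1]))["b"]["c"])' sample.yaml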

It’s possible to pass a small script to some interpreters, like Python. An easy way to do so using Ruby and its YAML library is the following:
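A sketch, again against sample.yaml from above:

    ruby -ryaml -e 'data = YAML.load($stdin.read); puts data["b"]["c"]' < sample.yaml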

Here data is a hash (or array) with the values from the YAML.

As a bonus, it’ll parse Jekyll’s front matter just fine.

Here is an extended version of Stefan Farestam's answer (the full code is in the GitHub repository mentioned at the end):

This version supports the - notation and the short notation for dictionaries and lists. Input along these lines:
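    global:
      input:
        - "main.c"
        - "main.h"
      flags: [ "-O3", "-fpic" ]
      sample_input:
        -  { property1: value, property2: "value2" }
        -  { property1: "value3", property2: 'value 4' }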

produces output along these lines:
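    global_input_1="main.c"
    global_input_2="main.h"
    global_flags_1="-O3"
    global_flags_2="-fpic"
    global_sample_input_1_property1="value"
    global_sample_input_1_property2="value2"
    global_sample_input_2_property1="value3"
    global_sample_input_2_property2="value 4"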

As you can see, the - items automatically get numbered in order to obtain different variable names for each item. Bash has no multidimensional arrays, so this is one way to work around that. Multiple levels are supported. To work around the problem with trailing white space mentioned by @briceburg, one should enclose the values in single or double quotes. However, there are still some limitations: expansion of the dictionaries and lists can produce wrong results when values contain commas, and more complex structures, like values spanning multiple lines (such as ssh keys), are not (yet) supported.

A few words about the code: The first sed command expands the short form of dictionaries { key: value, ... } into regular multi-line YAML style. The second sed call does the same for the short notation of lists, converting [ entry, ... ] into an itemized list with the - notation. The third sed call is the original one that handled normal dictionaries, now extended to handle lists with - and indentation. The awk part introduces an index for each indentation level and increases it when the variable name is empty (i.e. when processing a list). The current value of each counter is used instead of the empty vname. When going up one level, the counters are zeroed.

Edit: I have created a GitHub repository for this.

Moving my answer from How to convert a json response into yaml in bash, since this seems to be the authoritative post on dealing with YAML text parsing from the command line.

I would like to add details about the yq YAML implementations. Since there are two implementations of this YAML parser lying around, both named yq, it is hard to tell which one is in use without looking at the implementations' DSL. The two available implementations are:

  1. kislyuk/yq — the more often talked about version, which is a wrapper over jq, written in Python using the PyYAML library for YAML parsing
  2. mikefarah/yq — A Go implementation, with its own dynamic DSL using the go-yaml v3 parser.

Both are available for installation via standard package managers on almost all major distributions.

Both versions have some pros and cons over the other; a few valid points to highlight (adapted from their repo instructions):

kislyuk/yq

  1. Since the DSL is adopted wholesale from jq, parsing and manipulation are quite straightforward for users familiar with the latter
  2. Supports a mode to preserve YAML tags and styles, but loses comments during the conversion: since jq doesn't preserve comments, they are lost in the round trip
  3. As part of the package, XML support is built in: an executable, xq, transcodes XML to JSON using xmltodict and pipes it to jq, so you can apply the same DSL to perform CRUD operations on the objects and round-trip the output back to XML
  4. Supports in-place edit mode with the -i flag (similar to sed -i)

mikefarah/yq

  1. Prone to frequent changes in its DSL (migration from 2.x to 3.x)
  2. Rich support for anchors, styles and tags, but look out for the occasional bug
  3. A relatively simple Path expression syntax to navigate and match yaml nodes
  4. Supports YAML->JSON, JSON->YAML formatting and pretty printing YAML (with comments)
  5. Supports in-place edit mode with -i flag (similar to sed -i )
  6. Supports coloring the output YAML with -C flag (not applicable for JSON output) and indentation of the sub elements (default at 2 spaces)
  7. Supports Shell completion for most shells — Bash, zsh (because of powerful support from spf13/cobra used to generate CLI flags)

My take on the following YAML (referenced in other answers as well) with both implementations:
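An illustrative sample.yaml shaped for the operations listed below:

    root_key1: value one
    root_key2: value two
    food:
      - coffee:
          sugar: none
      - orange_juice:
          pulp: light
          sugar: two spoons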

Various actions to be performed with both implementations (some frequently used operations):

  1. Modifying node value at root level — Change value of root_key2
  2. Modifying array contents, adding value — Add property to coffee
  3. Modifying array contents, deleting value — Delete property from orange_juice
  4. Printing key/value pairs with paths — For all items under food

Using kislyuk/yq
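Sketches against the sample above (kislyuk/yq accepts jq filters; -y emits YAML, -r raw strings):

    # 1. modify a root-level value
    yq -y '.root_key2 = "new value"' sample.yaml
    # 2. add a property to coffee
    yq -y '.food[0].coffee.milk = "skim"' sample.yaml
    # 3. delete a property from orange_juice
    yq -y 'del(.food[1].orange_juice.sugar)' sample.yaml
    # 4. print key/value pairs with their paths for everything under food
    yq -r '.food[] | leaf_paths as $p | "\($p | join(".")): \(getpath($p))"' sample.yaml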

Which is pretty straightforward. All you need is to transcode jq JSON output back into YAML with the -y flag.

Using mikefarah/yq
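Equivalent sketches in v3 syntax (v4 replaced this CLI, as noted below):

    # 1. modify a root-level value
    yq w sample.yaml root_key2 "new value"
    # 2. add a property to coffee
    yq w sample.yaml 'food[0].coffee.milk' skim
    # 3. delete a property from orange_juice
    yq d sample.yaml 'food[1].orange_juice.sugar'
    # 4. print key/value pairs with their paths for everything under food
    yq r --printMode pv sample.yaml 'food.**'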

As of today, Dec 21st 2020, yq v4 is in beta and supports much more powerful path expressions, with a DSL similar to jq's. Read the transition notes: Upgrading from V3.

Hard to say, because it depends on what you want the parser to extract from your YAML document. For simple cases, you might be able to use grep, cut, awk, etc. For more complex parsing you would need to use a full-blown parsing library such as Python's PyYAML or YAML::Perl.

I just wrote a parser that I called Yay! (Yaml ain’t Yamlesque!) which parses Yamlesque, a small subset of YAML. So, if you’re looking for a 100% compliant YAML parser for Bash then this isn’t it. However, to quote the OP, if you want a structured configuration file which is as easy as possible for a non-technical user to edit that is YAML-like, this may be of interest.

It's inspired by the earlier answer but writes associative arrays (yes, it requires Bash 4.x) instead of basic variables. It does so in a way that allows the data to be parsed without prior knowledge of the keys, so that data-driven code can be written.

As well as the key/value array elements, each array has a keys array containing a list of key names, a children array containing names of child arrays and a parent key that refers to its parent.

This is an example of Yamlesque:
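For instance (an illustrative file, example.yml; two-space indents, simple key: value pairs):

    config:
      log_level: info
      server:
        host: localhost
        port: "8080"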

Here is an example showing how to use it:
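A sketch consistent with the description below (array names are derived from the dataset prefix, here example):

    eval "$(yay_parse example.yml example)"

    echo "${example_config[log_level]}"   # -> info
    echo "${example_server[port]}"        # -> 8080

    # data-driven iteration over a collection's keys
    for key in ${example_server_keys[@]}; do
        printf '%s=%s\n' "$key" "${example_server[$key]}"
    done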

And here is the parser. The full version lives in the linked source file; a reduced sketch in the same spirit:
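    # Reduced sketch of yay_parse (bash + sed + awk), not the full original
    yay_parse() {
       local input="$1"
       [ -f "$input" ] || return 1

       # dataset prefix: explicit second argument or derived from the file name
       local dataset="${2:-$(basename "$input" .yml)}"
       local fs=$(printf '\034')   # ASCII File Separator

       echo "declare -g -A $dataset;"

       # keep only "key:" / "key: value" lines, FS-delimit the fields,
       # and strip double-quotes around the value
       sed -n -e "s|^\( *\)\([^ :]*\): *\"\(.*\)\" *$|\1$fs\2$fs\3|p" \
              -e "s|^\( *\)\([^ :]*\): *\(.*\) *$|\1$fs\2$fs\3|p" "$input" |
       awk -F"$fs" -v prefix="$dataset" '{
          indent = length($1)/2
          key    = $2
          value  = $3

          root_prefix = prefix "_"
          keys[indent] = key
          for (i in keys) if (i > indent) delete keys[i]

          parent = (indent == 0) ? prefix : keys[indent-1]
          arr    = (indent == 0) ? prefix : root_prefix parent

          if (value != "") {
             printf("%s[%s]=\"%s\";\n", arr, key, value)
             printf("%s_keys+=(%s);\n", arr, key)
          } else {
             printf("%s_children+=(%s);\n", arr, key)
             printf("declare -g -A %s%s;\n", root_prefix, key)
          }
       }'
    }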

There is some documentation in the linked source file and below is a short explanation of what the code does.

The yay_parse function first locates the input file or exits with an exit status of 1. Next, it determines the dataset prefix, either explicitly specified or derived from the file name.

It writes valid bash commands to its standard output that, if executed, define arrays representing the contents of the input data file. The first of these defines the top-level array:
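For a dataset named example:

    declare -g -A example;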

Note that array declarations are associative (-A), which is a feature of Bash version 4. Declarations are also global (-g) so they can be executed in a function but be available to the global scope, like the yay helper:
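A helper along these lines:

    # eval the parser's output so the declarations land in the caller's scope
    yay() { eval "$(yay_parse "$@")"; }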

The input data is initially processed with sed . It drops lines that don’t match the Yamlesque format specification before delimiting the valid Yamlesque fields with an ASCII File Separator character and removing any double-quotes surrounding the value field.

The two expressions are similar; they differ only because the first one picks out quoted values whereas the second one picks out unquoted ones.

The File Separator (28/hex 12/octal 034) is used because, as a non-printable character, it is unlikely to be in the input data.

The result is piped into awk which processes its input one line at a time. It uses the FS character to assign each field to a variable:
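In the sketch above, that is:

    indent = length($1)/2
    key    = $2
    value  = $3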

All lines have an indent (possibly zero) and a key, but they don't all have a value. It computes an indent level for the line by dividing the length of the first field, which contains the leading whitespace, by two. Top-level items without any indent are at indent level zero.

Next, it works out what prefix to use for the current item. This is what gets added to a key name to make an array name. There’s a root_prefix for the top-level array which is defined as the data set name and an underscore:
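In the sketch:

    root_prefix = prefix "_"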

The parent_key is the key at the indent level above the current line’s indent level and represents the collection that the current line is part of. The collection’s key/value pairs will be stored in an array with its name defined as the concatenation of the prefix and parent_key .

For the top level (indent level zero) the data set prefix is used as the parent key so it has no prefix (it's set to ""). All other arrays are prefixed with the root prefix.

Next, the current key is inserted into an (awk-internal) array containing the keys. This array persists throughout the whole awk session and therefore contains keys inserted by prior lines. The key is inserted into the array using its indent as the array index.

Because this array contains keys from previous lines, any keys with an indent level greater than the current line's indent level are removed:
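In the sketch:

    for (i in keys) if (i > indent) delete keys[i]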

This leaves the keys array containing the key-chain from the root at indent level 0 to the current line. It removes stale keys that remain when the prior line was indented deeper than the current line.

The final section outputs the bash commands: an input line without a value starts a new indent level (a collection in YAML parlance) and an input line with a value adds a key to the current collection.

The collection’s name is the concatenation of the current line’s prefix and parent_key .

When a key has a value, a key with that value is assigned to the current collection like this:
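In the sketch, two printf statements emit those commands:

    printf("%s[%s]=\"%s\";\n", arr, key, value)
    printf("%s_keys+=(%s);\n", arr, key)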

The first statement outputs the command to assign the value to an associative array element named after the key and the second one outputs the command to add the key to the collection’s space-delimited keys list:
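For the example data, that generates commands like:

    example_server[port]="8080";
    example_server_keys+=(port);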

When a key doesn’t have a value, a new collection is started like this:
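Again from the sketch:

    printf("%s_children+=(%s);\n", arr, key)
    printf("declare -g -A %s%s;\n", root_prefix, key)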

The first statement outputs the command to add the new collection to the current’s collection’s space-delimited children list and the second one outputs the command to declare a new associative array for the new collection:
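Generating, for the example data:

    example_config_children+=(server);
    declare -g -A example_server;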

All of the output from yay_parse can be parsed as bash commands by the bash eval or source built-in commands.

Another option is to convert the YAML to JSON, then use jq to interact with the JSON representation either to extract information from it or edit it.

I wrote a simple bash script that contains this glue — see Y2J project on GitHub

If you need a single value you could use a tool which converts your YAML document to JSON and feeds it to jq, for example yq.

Content of sample.yaml:
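For instance (illustrative contents):

    myKey: myValue
    myList:
      - item1
      - item2

and a query:

    yq -r '.myList[0]' sample.yaml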

A quick way to do it now (the previous ones didn't work for me):

Example asd.yaml:
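    # illustrative contents
    root:
      key1: value1
      key2: value2
    key3: value3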

parsing root:
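A quick grep/sed hack rather than a real parser:

    sed -n '/^root:/,/^[^ ]/p' asd.yaml | grep '^ '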

parsing key3:
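Similarly:

    grep '^key3:' asd.yaml | sed 's/^key3: *//'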

I know this is very specific, but I think my answer could be helpful for certain users.
If you have node and npm installed on your machine, you can use js-yaml.
First, install it:
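    npm install -g js-yaml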

Then, in your bash script:
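    # js-yaml prints the document as JSON; file name is illustrative
    json=$(js-yaml config.yml)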

Also, if you are using jq, you can do something like this:
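For instance (the key name is illustrative):

    js-yaml config.yml | jq -r '.myKey'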

This works because js-yaml converts a YAML file to a JSON string literal, which you can then use with any JSON parser on your Unix system.

If you have Python 2 and PyYAML, you can use this parser I wrote called parse_yaml.py. Some of the neater things it does are letting you choose a prefix (in case you have more than one file with similar variables) and picking a single value from a YAML file.

For example if you have these yaml files:
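For instance (illustrative contents):

    # staging.yml
    db:
      host: staging-db.example.com

    # production.yml
    db:
      host: db.example.com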

You can load both without conflict.

And even cherry pick the values you want.

You could use an equivalent of yq that is written in golang:
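For instance, mikefarah/yq (shown here with v3 syntax; key names are illustrative):

    yq r config.yml 'db.host'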

Whenever you need a solution for «How to work with YAML/JSON/compatible data from a shell script» which works on just about every OS with Python (*nix, OSX, Windows), consider yamlpath, which provides several command-line tools for reading, writing, searching, and merging YAML, EYAML, JSON, and compatible files. Since just about every OS either comes with Python pre-installed or it is trivial to install, this makes yamlpath highly portable. Even more interesting: this project defines an intuitive path language with very powerful, command-line-friendly syntax that enables accessing one or more nodes.

To your specific question and after installing yamlpath using Python’s native package manager or your OS’s package manager (yamlpath is available via RPM to some OSes):
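For a single scalar, something like this (the query follows yamlpath's path language; key names are illustrative):

    yaml-get --query='db.host' config.yml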

You didn’t specify that the data was a simple Scalar value though, so let’s up the ante. What if the result you want is an Array? Even more challenging, what if it’s an Array-of-Hashes and you only want one property of each result? Suppose further that your data is actually spread out across multiple YAML files and you need all the results in a single query. That’s a much more interesting question to demonstrate with. So, suppose you have these two YAML files:

File: data1.yaml
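Illustrative contents for the demonstration:

    inventory:
      - sku: 4399
        name: widget
      - sku: 8012
        name: gadget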

File: data2.yaml
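Likewise illustrative:

    inventory:
      - sku: 601
        name: gizmo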

How would you report only the sku of every item in inventory after applying the changes from data2.yaml to data1.yaml, all from a shell script? Try this:
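A sketch (exact flags may differ; see the project documentation):

    yaml-merge data1.yaml data2.yaml | yaml-get --query='inventory.*.sku' -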

You get exactly what you need from only a few lines of code:
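With the sample data above:

    4399
    8012
    601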

As you can see, yamlpath turns very complex problems into trivial solutions. Note that the entire query was handled as a stream; no YAML files were changed by the query and there were no temp files.

I realize this is "yet another tool to solve the same question" but after reading the other answers here, yamlpath appears more portable and robust than most alternatives. It also fully understands YAML/JSON/compatible files and it does not need to convert YAML to JSON to perform requested operations. As such, comments within the original YAML file are preserved whenever you need to change data in the source YAML file. Like some alternatives, yamlpath is also portable across OSes. More importantly, yamlpath defines a query language that is extremely powerful, enabling very specialized/filtered data queries. It can even operate against results from disparate parts of the file in a single query.

If you want to get or set many values in the data at once — including complex data like hashes/arrays/maps/lists — yamlpath can do that. Want a value but don't know precisely where it is in the document? yamlpath can find it and give you the exact path(s). Need to merge multiple data files together, including from STDIN? yamlpath does that, too. Further, yamlpath fully comprehends YAML anchors and their aliases, always giving or changing exactly the data you expect, whether it is a concrete or referenced value.
