csv-data-tools

Tools for working with CSV data (or other basic tabular data formats)

Table of Contents

chars_in_field.rb
check_structure.rb
clean_convert_report.rb
convert_encoding.rb
file_info.rb
profiler.rb
- When --details file
- When --details compile
reformat_csv.pl
reformat_tables.rb
show_rows_with_column_ct.rb
tables_and_columns.rb
table_preview.rb

chars_in_field.rb

A silly utility script to output a list of unique characters used in specified field(s) of a CSV, or the whole CSV if no fields are specified.

Initial use case: find all characters used in CollectionSpace media file names, so I could replace any characters unsafe for use in S3 key names with something else.

Do ruby chars_in_field.rb -h for usage details.

`check_structure.rb`

ℹ️	Supports comma, tab, or pipe delimited files currently

Writes out a report on whether the rows of a CSV all have the same number of columns
If a file has any ragged rows, it tells you how many columns the weird rows have and gives you the row numbers of up to 3 rows having that number of columns.
Where the example field for a non-ok file is blank, that’s the expected/correct number of columns based on the number of headers

`clean_convert_report.rb`

This was a mess and I deleted it. It’s being replaced by simpler scripts that do specific parts of what this did.

`convert_encoding.rb`

⚠️	script not finished. Do not use yet

❗	If you have not installed or used the `charlock_holmes` gem before, you need to: * install icu4c if you do not have it: `brew install icu4c` run the following command in your terminal before running this script: `bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c`

`file_info.rb`

Given path to a directory, a file suffix, and an optional minimum file size (in bytes)…

Prints to screen a the following info for all files with suffix in the directory:

size
result of unix file command (file encoding, presence of "very long lines," line terminators)

Do ruby file_info.rb for parameters/options.

Example output:

dbo_ZLocalUseAttribs.txt:
  size: 814242
  info: Little-endian UTF-16 Unicode text, with CRLF line terminators


dbo_ZResultFields.txt:
  size: 13274
  info: Little-endian UTF-16 Unicode text, with CRLF line terminators


dbo_dtproperties.txt:
  size: 34462
  info: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

`profiler.rb`

Usage example: Defaults:

ruby profiler.rb -i ~/data/export/ -o ~/data/profiled

All .csv files in ~/data/export directory are included, and the default details mode is files. Default options sent to Ruby standard library CSV parser are:

{headers: true, header_converters: [:downcase], converters: [:stripplus],
  skip_blanks: true, empty_value: nil}

Usage example: compiled details for tab-separated .tsv files:

ruby profiler.rb -i ~/data/export/ -o ~/data/profiled -s tsv -c '{col_sep: "\t"} -d compiled'

All .tsv files in ~/data/export directory are included. The Ruby standard library CSV parser option col_sep: "\t" is merged into the default option hash shown above.

When `--details file`

One .csv file written to output directory per table column.

For example, if source file addresses.csv has a :city column, there is an address_city.csv file written.

The output CSV has one row per unique value found in the source column. The first column is the occurrence count of the value in the source column. The second column is the value.

When `--details compile`

Given a directory containing CSV files, writes out two CSV reports:

summary - a row for each column in source CSVs, with the following columns:
- table - source CSV name
- column - column name
- column index - for putting them in the order in which they appear in source document
- uniq vals - count of unique values found in column
- null vals - count of empty cells in column
details - a row for each unique value in each column in source CSVs, with the following columns:
- table - source CSV name
- column - column name
- column index - for putting them in the order in which they appear in source document
- value - a unique value found in column (puts "NULL VALUE/EMPTY FIELD" to represent that)
- occurrences - number of time value occurs in column

⚠️	There’s a known bug where not all apparently empty fields are getting counted as "NULL VALUE/EMPTY FIELD". The number that get left out is small and I didn’t have time to chase this down now, but will try to the next time I need this thing.

`reformat_csv.pl`

Reformats a list of CSVs, allowing you to change the separator and escape characters. Output is to STDOUT.

💡

This can handle parsing \n inside quoted fields that contain unescaped quotes. We did not find a Ruby CSV parsing solution that handled this particular flavor of CSV horror. (Thanks, potential client legacy system which shall not be named…)

To run this script on all files in a directory, writing the reformatted files to another directory, see reformat_tables.rb.

Usage: reformat_csv.pl [options] FILES

Usage example: perl reformat_csv.pl --input_sep ';' --input_esc '#' ~/data/test.csv > ~/data/test_fix.csv

Options:

input_sep - Separator character in input CSVs (default: ,)
input_esc - Escape character in input CSVs (default: ")
output_sep - Separator character in output CSVs (default: ,)
output_esc - Escape character in output CSVs (default: ")

💡	To pass TAB as `input_sep` or `output_sep`, use the literal tab character by typing `Ctrl-v`, then `TAB` on the command line.

While handy, this program primarily exists to take adavantage of Text::CSV_XS’s ability to deal with unescaped quotes in fields. To do this, set input_esc to anything other than '"', for instance '#'.

Requires you have the Text::CSV_XS Perl module installed

`reformat_tables.rb`

This is a wrapper around reformat_csv.pl. It requires you have Perl and the Text::CSV_XS module installed.

The input/output sep and esc options are the same as described for reformat_csv.pl

The only required argument is --input (or -i), which specifies the directory containing the tabular data files you wish to reformat.

If no --output/-o value is given, a new directory called reformatted is created in your --input directory, and reformatted files are saved in new directory. Any other directory value can be provided. If the directory does not exist at run time, it will be created.

File suffix (--suffix/-s) defaults to csv.

Usage example:

ruby reformat_tables.rb -i ~/data/lafayette/export --input_sep ';' --input_esc "#" --output_sep ' '

Writes semicolon delimited .csv files with unescaped quotes to tab-delimited.

`show_rows_with_column_ct.rb`

Meant to be used to investigate specific files reported by clean_convert_report.rb as having bad structure (i.e. ragged columns: some row having different number of columns than other rows)

Given path to file, delimiter name, number of columns you want to see rows for, and option number of rows you want to see…

Outputs to screen rows with the given number of columns.

This is useful for coming up with the specific find/replace mess you are going to have to implement to keep rows from being broken up in a ragged way.

Generally I use this iteratively with edits made to a migration-specific copy of clean_convert_report.rb to eliminate or minimize the number of ragged-column files I end up having to manually fix for a migration.

`tables_and_columns.rb`

Utility script for creating data review spreadsheet.

Given a directory containing tabular data files, outputs two CSVs:

tables.csv
- table/filename
- column count
- row count
columns.csv
- table/filename
- column name

These become two tabs in a data review tracking/mapping Excel sheet.

Do ruby tables_and_columns.rb -h for parameters/options.

`table_preview.rb`

Useful for initial data review work.

Reads all files with given file suffix in the given directory. For each, prints out the file/table name, headers, and the first 25 rows of data, nicely formatted, in one text file you can scroll/search through. You don’t have to open a million files to get your head around the general shape and character of the data.

Requires csvlook from csvkit to be installed and available in your PATH

Do ruby table_preview.rb -h for parameters/options.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
aggregator		aggregator
tabular_data_naraka		tabular_data_naraka
value_aggregator		value_aggregator
value_replacer		value_replacer
.gitignore		.gitignore
LICENSE		LICENSE
README.adoc		README.adoc
chars_in_field.rb		chars_in_field.rb
check_structure.rb		check_structure.rb
convert_encoding.rb		convert_encoding.rb
csv_to_json.rb		csv_to_json.rb
file_info.rb		file_info.rb
profiler.rb		profiler.rb
reformat_csv.pl		reformat_csv.pl
reformat_tables.rb		reformat_tables.rb
show_rows_with_column_ct.rb		show_rows_with_column_ct.rb
table_preview.rb		table_preview.rb
tables_and_columns.rb		tables_and_columns.rb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

csv-data-tools

chars_in_field.rb

`check_structure.rb`

`clean_convert_report.rb`

`convert_encoding.rb`

`file_info.rb`

`profiler.rb`

When `--details file`

When `--details compile`

`reformat_csv.pl`

`reformat_tables.rb`

`show_rows_with_column_ct.rb`

`tables_and_columns.rb`

`table_preview.rb`

About

Releases

Packages

Contributors 3

Languages

License

lyrasis/csv-data-tools

Folders and files

Latest commit

History

Repository files navigation

csv-data-tools

chars_in_field.rb

check_structure.rb

clean_convert_report.rb

convert_encoding.rb

file_info.rb

profiler.rb

When --details file

When --details compile

reformat_csv.pl

reformat_tables.rb

show_rows_with_column_ct.rb

tables_and_columns.rb

table_preview.rb

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

`check_structure.rb`

`clean_convert_report.rb`

`convert_encoding.rb`

`file_info.rb`

`profiler.rb`

When `--details file`

When `--details compile`

`reformat_csv.pl`

`reformat_tables.rb`

`show_rows_with_column_ct.rb`

`tables_and_columns.rb`

`table_preview.rb`

Packages