-
Notifications
You must be signed in to change notification settings - Fork 3
natgaertner/ersatzpg
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
ERSATZ A tool for loading csvs into postgresql databases quickly postgresql's copy command is just fine for bulk-loading CSVs into a table if those CSVs are already formatted to exactly match the table columns and data types. It even lets you load a subset of the table's columns from the CSV. Unfortunately, it doesn't let you: Load a subset of the CSV's columns Transform columns from the CSV before insertion Combine columns Insert directly into partition tables Define default column values Load multiple tables from different subsets of a CSV (and define a sequential key for linking those tables later) An old module, pgloader (http://pgfoundry.org/projects/pgloader), will let you do some of these things but is somewhat limited and seemed kind of slow. Ersatz is an attempt to provide all the functionality above and achieve fast copy rates. It is dependent on psycopg2 (http://www.initd.org/psycopg/) for its postrgres interface. An example of the arcane configuration is included. Look in the examples directory. test_part.py is the configuration file. import.py is a script that runs ersatz pointed at the configuration. Some config explanation: Most of the configuration is in dictionaries, which allows you to create default values and update copies of the default dictionary with specific config values for tables. univ_settings.py contains some basic stuff: database connection values (user, db name, password. (assumes the existence of a password file that contains a passwd value. This can easily be changed to an environment variable setting or whatever password storing scheme you prefer) test_part.py contains the table definitions. Table-specific config values: dict_reader is whether or not to use a header-name based dictionary when referencing columns. Basically whether you're using csv.DictReader or csv.reader skip_head_lines is how much header to skip. Will be ignored if dict_reader is True udcs is a dictionary with column names as keys and default column values as values table is the table name partitions is an ordered dict of column names to values to partitiion on partition_table_pattern is the python format string to create partition table names with filename is the csv to import from field_sep is the csv field separator columns is a dict of column names or in a special case a tuple of column names to one of: integer: a column in the csv (only available if dict_reader is False) string: a column name in the csv (only available if dict_reader is True) dict with one key: 'key' and a value indicating the externally defined sequential key to use for this field dict with keys: 'function': a function returning a tuple of values, 'columns': a tuple of integers (dict_reader False) or strings (dict_reader True) indicating which columns of the csv to pass to the function, 'defaults': a dict of default values and their parameter names to pass to the function. In this case the column dict key can be a single column name or a tuple of column names depending on how many values are in the function's return tuple. universal config values: tables: a dict of table names to table definition dicts parallel_load: a tuple of dicts, each dict with keys: 'tables': a tuple of table names to load in parallel 'keys': a dict of key names to key source names key_sources: a dict of key source names to initial values please see http://neworganizing.com/content/blog/ersatz1 and http://neworganizing.com/content/blog/ersatz2 for more detailed discussion
About
A version of pgloader stripped down to components I found essential. Sacrifices a lot of flexibility, but throws out some overhead
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published