Contents
Welcome to the pg_bulkload Project Home Page
pg_bulkload is a high speed data loading utility for PostgreSQL. Here is the pgFoundry page, where you can find downloads, documentation, bug reports, mailing lists, and a whole lot more.
Documentation
-
High speed data loader: pg_bulkload documentation
- pg_bulkload provides high-speed data loading capability to PostgreSQL users.
-
Optional tool : pg_timestamp documentation
- pg_timestamp ia an optional tool to skip parsing overhead of timestamp string.
Performance Results
Here are comparisons between COPY and pg_bulkload. Performance was measured with basic-tuned PostgreSQL server.
- Table definition
- Customer table in DBT-2 benchmark, that is an implementation of TPC-C.
- Index definition
- There are 2 indexes. The first one is a primary key with one ascending integer column. The second one is a non-unique index with one random integer column.
There are the following measurement patterns.
- Initial data loading to an empty table for 4GB of data
- Appended data loading to a table with 4GB of data for 1GB of new data
- Performance efficiencies by maintenance_work_mem and FILTER features
Result 1: Initial data loading
Pg_bulkload on WRITER = PARALLEL mode can load data with an almost half time against COPY. In PARALLEL mode, performance will be improved on multi-CPU server because reading an input file and writing rows to a table are done with two processes.
COPY also can be more fast by loading into table without indexes and create indexes after it.
Item | Duration | Duration comparison |
---|---|---|
COPY with indexes | 500 sec | - |
COPY without indexes + CREATE INDEX |
333 sec (229 sec + 51 sec+ 53 sec) |
66.7 % |
pg_bulkload (DIRECT) with indexes |
334 sec | 66.8 % |
pg_bulkload (PARALLEL) with indexes |
221 sec | 44.2 % |
Result 2: Appended data loading
Pg_bulkload on WRITER = PARALLEL mode can load data with an almost half time against COPY in this case also. COPY without indexes is not faster than COPY with indexes. because it has to create indexes for total records of the table from initial.
Item | Duration | Duration comparison |
---|---|---|
COPY with indexes | 140 sec | - |
COPY without indexes + CREATE INDEX |
187 sec (62 sec + 60 sec + 65 sec) |
133.6 % |
pg_bulkload (DIRECT) with indexes |
93 sec | 66.4 % |
pg_bulkload (PARALLEL) with indexes |
70 sec | 50.0 % |
Result 3: Influence from parameters and features
The maintenece_work_mem, PostgreSQL parameter, affects the performance of pg_bulkload. The duration becomes almost 15 % shorter if this parameter is changed from 64 MB to 1 GB.
FILTER feature transforms input data in various operations, but it's not free. The actual measurement shows the loading time is increased to almost 240 % with SQL functions and almost 140 % with C functions.
項目 | 時間 | ||
---|---|---|---|
Initial (4GB) | Appended (1GB) | ||
MWM = 64 MB | DIRECT | 397 sec | 109 sec |
MWM = 1 GB | |||
DIRECT | 334 sec | 93 sec | |
DIRECT with SQL FILTER | 801 sec | 216 sec | |
DIRECT with C FILTER | 456 sec | 126 sec |
Conditions
Item | Value |
---|---|
Server | Dell PowerEdge R410 |
CPU | Intel Xeon E5645 (2.4GHz) 12 core * 2 |
Memory | 32GB |
Disks | SAS 10000rpm 2TB * 4 |
OS | CentOS 6.2 (64bit) |
PostgreSQL version | 9.3.4 |
pg_bulkload version | 3.1.6 |
shared_buffers | 3210MB |
checkpoint_segments | 300 |
checkpoint_timeout | 15min |
work_mem | 1MB |
maintenance_work_mem | 1GB |
Table definition | DBT-2 customer table |
Indexed columns | c_id (PRIMARY KEY) |
c_d_id (non-unique B-Tree) | |
Constraints | NOT NULL for all columns |
Input file format | CSV |
Copyright (c) 2007-2015, NIPPON TELEGRAPH AND TELEPHONE CORPORATION