Use of the COPY command as an alternative way to migrate PostgreSQL data:
- Recover the data structure on the new server using pg_dump and psql
- Generate COPY command pipes
- Copy the real data through the COPY command pipes
Advantages
- COPY through a pipe requires no staging space on the client side
- Increased performance with parallelism
Disadvantages
- The COPY method does not take care of indexes or foreign keys, so additional steps are required for a complete migration mechanism using this approach (see the sketch below)
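A minimal sketch of one such additional step, reusing this article's placeholder connection values: after the data has been copied, replay only the post-data section (indexes, constraints, triggers) from the source.
# Hedged sketch: recreate indexes, constraints and triggers after the data load
PGPASSWORD=<sourcePassword> pg_dump --section=post-data -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests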
Preparation
pgsource.sh = bash script wrapping psql with the connection string to the source tests database
pgdestination.sh = bash script wrapping psql with the connection string to the destination tests database
cat <<EOF > pgsource.sh
PGPASSWORD=<sourcePassword> psql -t -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests
EOF
chmod +x pgsource.sh
cat <<EOF > pgdestination.sh
PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
EOF
chmod +x pgdestination.sh
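To verify that both helper scripts work before migrating anything (a quick smoke test, not part of the original procedure):
echo "select version();"|./pgsource.sh
echo "select version();"|./pgdestination.sh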
Migrating a Basic tier Azure Database for PostgreSQL server to a General Purpose tier server
Data preparation
The source database has a single table "pru" with 1 million rows and no indexes or foreign keys; we will generate some other tables to test with:
for j in {1..20}
do
echo "create table pru$j as select * from pru;"|./pgsource.sh
done
Test full restoration with pg_dump/pg_restore (no parallel)
Internally, pg_dump and pg_restore use the COPY command.
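This is easy to confirm with a plain-format dump, where each table's data is emitted as a COPY ... FROM stdin block (an illustrative check, separate from the timed runs):
PGPASSWORD=<sourcePassword> pg_dump -Fp -t pru -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|grep -m 1 "^COPY"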
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres
real 6m15.310s
user 0m22.976s
sys 0m9.393s
– Total Time: 6m15sec
Test full restoration with pg_dump/pg_restore (parallel)
Parallel dump requires the directory output format, which dumps to regular files (see: https://www.postgresql.org/docs/10/app-pgrestore.html, https://www.postgresql.org/docs/10/app-pgdump.html)
time PGPASSWORD=<sourcePassword> pg_dump -C -Fd -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests -j 4 -f testsdir
real 3m23.004s
user 0m20.291s
sys 0m13.888s
time PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres -j 4 testsdir
real 0m56.175s
user 0m1.671s
sys 0m0.478s
– Total Time: 4m19sec (dump + restore)
Test alternative restoration with COPY (no parallel)
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc --schema-only -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> --schema-only -C -c -d postgres
real 0m15.288s
user 0m0.144s
sys 0m0.054s
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo "copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh
done
real 7m47.179s
user 0m19.518s
sys 0m23.467s
– Total Time: 8m2sec (schema restore + data copy)
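Note that WITH BINARY avoids text-parsing overhead but requires identical column types on both sides; a text-format variant of the same per-table pipe (a sketch for a single table, using the same helper scripts) would be:
echo "\copy pru FROM PROGRAM 'echo copy pru to STDOUT|./pgsource.sh';"|./pgdestination.sh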
Test alternative restoration with COPY (parallel)
There is no need to dump to regular files; this approach keeps the pipe mechanism.
export maxParal=4
>curParal.tmp
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo $j >>curParal.tmp
nohup $SHELL <<EOF &
echo "\copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh; sed -i '/^$j$/d' curParal.tmp
EOF
# throttle: wait while maxParal table copies are already in flight
while [ `wc -l curParal.tmp|cut -f 1 -d " "` -ge $maxParal ]
do
sleep 1
done
done
real 2m59.491s
user 0m24.388s
sys 0m40.200s
– Total Time: 3m14sec (schema restore + parallel data copy)
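One caveat (our observation, not part of the original measurements): the timed loop can end while up to maxParal background copies are still running, so wait for them to finish before using the destination:
# block until all background table copies spawned above have completed
wait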
Results
Without parallelism:
pg_dump/pg_restore won, with 6m15sec vs 8m2sec for the COPY approach
With parallelism:
the COPY approach won, with 3m14sec vs 4m19sec for pg_dump/pg_restore
Although pg_dump/pg_restore use the COPY command internally, a manual parallelism mechanism can obtain better execution times, because it streams everything through pipes and never needs to save data to a staging area, as pg_dump/pg_restore require when using parallelism.
Using the COPY command to ingest from HTTP data sources
Using the COPY command we can ingest not only .csv data or copies from another database through pipes; we can create authentic ETL-like processes with a single command.
This type of pipeline should be executed with the "\copy" psql meta-command, which is different from the server-side "COPY" command, because the resources must be accessible from the psql client side, not from the database server. https://www.postgresql.org/docs/11/sql-copy.html
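For illustration, with a hypothetical table mytable and file /tmp/data.csv, the two variants differ as follows:
-- Server-side COPY: the program runs on the database server and requires
-- superuser (or pg_execute_server_program), not available on Azure Database for PostgreSQL
COPY mytable FROM PROGRAM 'cat /tmp/data.csv' CSV;
-- Client-side \copy: psql runs the program locally and streams the rows to the server
\copy mytable FROM PROGRAM 'cat /tmp/data.csv' CSV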
In this case we are going to ingest the open dataset "COVID-19 cases worldwide" in csv format from https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data
First, create the table covid19casesww:
echo "create table covid19casesww (dateRep date,day integer,month integer,year integer,cases integer,deaths integer,countriesAndTerritories text,geoId text,countryterritoryCode text,popData2019 integer,continentExp text);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Then pipe from wget to the \copy command:
echo "set datestyle to SQL,DMY;COPY covid19casesww FROM PROGRAM 'wget -q -O - "$@" "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"|tail -n +2' CSV DELIMITER ',';"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Now we can access the data and calculate, for example, the 10 territories with the lowest number of cases:
echo "select sum(cases) as cases, countriesAndTerritories, continentexp from covid19casesww group by countriesAndTerritories,continentExp order by 1 asc limit 10;"|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
cases | countriesandterritories | continentexp
-------+-----------------------------------+--------------
3 | Anguilla | America
7 | Bonaire, Saint Eustatius and Saba | America
8 | British_Virgin_Islands | America
11 | Montserrat | America
11 | Papua_New_Guinea | Oceania
11 | Seychelles | Africa
12 | Holy_See | Europe
13 | Falkland_Islands_(Malvinas) | America
13 | Greenland | America
15 | Saint_Kitts_and_Nevis | America
(10 rows)
Using the COPY command to ingest from other databases
Suppose we want to migrate data from an Azure Database for MySQL server to an Azure Database for PostgreSQL server.
First, create the table prumysql:
echo "create table prumysql (a text, b integer);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Then pipe from mysql to the \copy command:
echo "COPY prumysql FROM PROGRAM 'echo select a,b from prumysql|mysql -h <sourceServerName>.mysql.database.azure.com -u <username>@<sourceServerName> -p<sourcePassword> -sN --ssl tests -B' ;"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests