Use of the COPY command as an alternative way to migrate PostgreSQL data:
- Recover the data structure on the new server using pg_dump and psql
- Generate COPY command pipes
- Copy the real data through the COPY command pipes
Advantages
- COPY through a pipe requires no staging space on the client side
- Increased performance with parallelism
Disadvantages
- The COPY method does not take care of indexes or foreign keys, so additional steps are required for a complete migration mechanism using this approach (see the sketch below)
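A minimal sketch of one such additional step, reusing this article's placeholder connection values: after the data has been copied, replay only the post-data section (indexes, constraints, triggers) from the source.
# Hedged sketch: recreate indexes, constraints and triggers after the data load
PGPASSWORD=<sourcePassword> pg_dump --section=post-data -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests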
Preparation
pgsource.sh = bash script wrapping psql with the connection string to the source tests database
pgdestination.sh = bash script wrapping psql with the connection string to the destination tests database
cat <<EOF > pgsource.sh
PGPASSWORD=<sourcePassword> psql -t -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests
EOF
chmod +x pgsource.sh
cat <<EOF > pgdestination.sh
PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
EOF
chmod +x pgdestination.sh
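To verify that both helper scripts work before migrating anything (a quick smoke test, not part of the original procedure):
echo "select version();"|./pgsource.sh
echo "select version();"|./pgdestination.sh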
Migrating a Basic tier Azure Database for PostgreSQL server to a General Purpose tier server
Data preparation
The source database has a single table "pru" with 1 million rows and no indexes or foreign keys; we will generate some other tables to test with:
for j in {1..20}
do
echo "create table pru$j as select * from pru;"|./pgsource.sh
done
Test full restoration with pg_dump/pg_restore (no parallel)
Internally, pg_dump and pg_restore use the COPY command.
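This is easy to confirm with a plain-format dump, where each table's data is emitted as a COPY ... FROM stdin block (an illustrative check, separate from the timed runs):
PGPASSWORD=<sourcePassword> pg_dump -Fp -t pru -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|grep -m 1 "^COPY"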
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres
real 6m15.310s
user 0m22.976s
sys 0m9.393s
– Total Time: 6m15sec
Test full restoration with pg_dump/pg_restore (parallel)
Parallel dump requires the directory output format, which dumps to regular files (see: https://www.postgresql.org/docs/10/app-pgrestore.html, https://www.postgresql.org/docs/10/app-pgdump.html)
time PGPASSWORD=<sourcePassword> pg_dump -C -Fd -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests -j 4 -f testsdir
real 3m23.004s
user 0m20.291s
sys 0m13.888s
time PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres -j 4 testsdir
real 0m56.175s
user 0m1.671s
sys 0m0.478s
– Total Time: 4m19sec (dump + restore)
Test alternative restoration with COPY (no parallel)
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc --schema-only -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> --schema-only -C -c -d postgres
real 0m15.288s
user 0m0.144s
sys 0m0.054s
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo "copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh
done
real 7m47.179s
user 0m19.518s
sys 0m23.467s
– Total Time: 8m2sec (schema restore + data copy)
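Note that WITH BINARY avoids text-parsing overhead but requires identical column types on both sides; a text-format variant of the same per-table pipe (a sketch for a single table, using the same helper scripts) would be:
echo "\copy pru FROM PROGRAM 'echo copy pru to STDOUT|./pgsource.sh';"|./pgdestination.sh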
Test alternative restoration with COPY (parallel)
There is no need to dump to regular files; this approach keeps the pipe mechanism.
export maxParal=4
>curParal.tmp
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo $j >>curParal.tmp
nohup $SHELL <<EOF &
echo "\copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh; sed -i '/^$j$/d' curParal.tmp
EOF
# throttle: wait while maxParal table copies are already in flight
while [ `wc -l curParal.tmp|cut -f 1 -d " "` -ge $maxParal ]
do
sleep 1
done
done
real 2m59.491s
user 0m24.388s
sys 0m40.200s
– Total Time: 3m14sec (schema restore + parallel data copy)
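One caveat (our observation, not part of the original measurements): the timed loop can end while up to maxParal background copies are still running, so wait for them to finish before using the destination:
# block until all background table copies spawned above have completed
wait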
Results
Without parallelism:
pg_dump/pg_restore won, with 6m15sec vs 8m2sec for the COPY approach
With parallelism:
the COPY approach won, with 3m14sec vs 4m19sec for pg_dump/pg_restore
Although pg_dump/pg_restore use the COPY command internally, a manual parallelism mechanism can obtain better execution times, because it streams everything through pipes and never needs to save data to a staging area, as pg_dump/pg_restore require when using parallelism.
Using the COPY command to ingest from HTTP data sources
Using the COPY command we can ingest not only .csv data or copies from another database through pipes; we can create authentic ETL-like processes with a single command.
This type of pipeline should be executed with the "\copy" psql meta-command, which is different from the server-side "COPY" command, because the resources must be accessible from the psql client side, not from the database server. https://www.postgresql.org/docs/11/sql-copy.html
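For illustration, with a hypothetical table mytable and file /tmp/data.csv, the two variants differ as follows:
-- Server-side COPY: the program runs on the database server and requires
-- superuser (or pg_execute_server_program), not available on Azure Database for PostgreSQL
COPY mytable FROM PROGRAM 'cat /tmp/data.csv' CSV;
-- Client-side \copy: psql runs the program locally and streams the rows to the server
\copy mytable FROM PROGRAM 'cat /tmp/data.csv' CSV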
In this case we are going to ingest the open dataset "COVID-19 cases worldwide" in csv format from https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data
First, create the table covid19casesww:
echo "create table covid19casesww (dateRep date,day integer,month integer,year integer,cases integer,deaths integer,countriesAndTerritories text,geoId text,countryterritoryCode text,popData2019 integer,continentExp text);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Then pipe from wget to the \copy command:
echo "set datestyle to SQL,DMY;COPY covid19casesww FROM PROGRAM 'wget -q -O - "$@" "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"|tail -n +2' CSV DELIMITER ',';"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Now we can access the data and calculate, for example, the 10 territories with the lowest number of cases:
echo "select sum(cases) as cases, countriesAndTerritories, continentexp from covid19casesww group by countriesAndTerritories,continentExp order by 1 asc limit 10;"|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
cases | countriesandterritories | continentexp
-------+-----------------------------------+--------------
3 | Anguilla | America
7 | Bonaire, Saint Eustatius and Saba | America
8 | British_Virgin_Islands | America
11 | Montserrat | America
11 | Papua_New_Guinea | Oceania
11 | Seychelles | Africa
12 | Holy_See | Europe
13 | Falkland_Islands_(Malvinas) | America
13 | Greenland | America
15 | Saint_Kitts_and_Nevis | America
(10 rows)
Using the COPY command to ingest from other databases
Suppose we want to migrate data from an Azure Database for MySQL server to an Azure Database for PostgreSQL server.
First, create the table prumysql:
echo "create table prumysql (a text, b integer);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Then pipe from mysql to the \copy command:
echo "COPY prumysql FROM PROGRAM 'echo select a,b from prumysql|mysql -h <sourceServerName>.mysql.database.azure.com -u <username>@<sourceServerName> -p<sourcePassword> -sN --ssl tests -B' ;"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests