1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
6 <title>pgpool-II Tutorial</title>
7 <meta http-equiv="Content-Type" content="text/html" />
8 <link href="pgpool.css" rel="stylesheet" type="text/css" />
12 <h1>pgpool-II Tutorial</h1>
13 <p>Welcome to the Tutorial for pgpool-II. From here, you can learn how
14 to install, setup, and run parallel queries or do replication using
15 pgpool-II. We assume that you already know basic PostreSQL operations,
16 so please refer to the PostgreSQL document if needed.</p>
19 <dt><em>Table of Contents</em></dt>
20 <dt>1. <a href="#start">Let's Begin!</a></dt>
21 <dd>1.1. <a href="#install">Installing pgpool-II</a></dd>
22 <dd>1.2. <a href="#config">Configuration Files</a></dd>
23 <dd>1.3. <a href="#pcp-config">Configuring PCP commands</a></dd>
24 <dd>1.4. <a href="#db-node">Preparing Database Nodes</a></dd>
25 <dd>1.5. <a href="#start-shutdown">Starting/Stopping pgpool-II</a></dd>
26 <dt>2. <a href="#replication">Your First Replication</a></dt>
27 <dd>2.1. <a href="#replication-config">Configuring Replication</a></dd>
28 <dd>2.2. <a href="#replication-check">Checking Replication</a></dd>
29 <dt>3. <a href="#parallel">Your First Parallel Query</a></dt>
30 <dd>3.1. <a href="#parallel-config">Configuring Parallel Query</a></dd>
31 <dd>3.2. <a href="#system-db">Configuring the System Database</a></dd>
32 <dd>3.3. <a href="#dist-def">Partitioning Rule Definition</a></dd>
33 <dd>3.4. <a href="#replicate-def">Replication Rule Definition</a></dd>
34 <dd>3.5. <a href="#parallel-check">Checking Parallel Query</a></dd>
38 <h2>1. <a name="start">Let's Begin!</a></h2>
40 <p>First, we must learn how to install, configure pgpool-II and
41 database nodes before using replication or parallel query.</p>
43 <h3>1.1. <a name="install">Installing pgpool-II</a></h3>
45 <p>Installing pgpool-II is very easy. In the directory which you have
46 extracted the source tar ball, execute the following commands.</p>
54 <p><code>configure</code> script collects your system information
55 and use it for the compilation procedure. You can pass command line
56 arguments to <code>configure</code> script to change the default
57 behavior, such as the installation directory. pgpool-II will be
58 installed to <code>/usr/local</code> directory by default.</p>
60 <p><code>make</code> command compiles the source code, and <code>make
61 install</code> will install the executables. You must have write
62 permission on the installation directory.</p>
64 <p>In this tutorial, we will install pgpool-II in the default
65 <code>/usr/local</code> directory.</p>
67 <p><em>Note</em>: pgpool-II requires libpq library in PostgreSQL 7.4
68 or later (version 3 protocol). If <code>configure</code> script
69 displays the following error message, libpq library may not be
70 installed, or it is not of version 3.</p>
72 <pre>configure: error: libpq is not installed or libpq is old</pre>
74 <p>If the library is version 3, but above message is still displayed,
75 your libpq library is probably not recognized by
76 <code>configure</code> script.</p>
78 <p><code>configure</code> script searches for libpq library under
79 <code>/usr/local/pgsql</code> libaray. If you have installed
80 PostgreSQL to a directory other than <code>/usr/local/pgsql</code>,
81 use <code>--with-pgsql</code>, or <code>--with-pgsql-includedir</code>
82 and <code>--with-pgsql-libdir</code> command line options when you
83 execute <code>configure</code>.</p>
85 <h3>1.2. <a name="config">Configuration Files</a></h3>
87 <p>pgpool-II configuration parameters are saved in
88 <code>pgpool.conf</code> file. The file is in "<code>parameter =
89 value</code>" per line format. When you install pgpool-II,
90 <code>pgpool.conf.sample</code> is automatically created. We recommend
91 copying and renaming it to <code>pgpool.conf</code>, and edit it as
94 <pre>$ cp /usr/local/etc/pgpool.conf.sample /usr/local/etc/pgpool.conf</pre>
96 <p>pgpool-II only accepts connections from the local host using port
97 9999. If you wish to receive conenctions from other hosts, set
98 <code>listen_addresses</code> to '*'.</code></p>
101 listen_addresses = 'localhost'
105 <p>We will use the default parameters in thie tutorial.</p>
107 <h3>1.3. <a name="pcp-config">Configuring PCP Commands</a></h3>
109 <p>pgpool-II has an interface for administration purpose to retrieve
110 information on database nodes, shutdown pgpool-II, etc. via
111 network. To use PCP commands, user authentication is required. This
112 authentication is different from PostgreSQL's user authentication. A
113 username and password need to be defined in <code>pcp.conf</code>
114 file. In the file, a username and password are listed as a pair on
115 each line, and they are separated by a colon (:). Passwords are
116 encrypted in md5 hash format.</p>
118 <pre>postgres:e8a48653851e28c69d0506508fb27fc5</pre>
120 When you install pgpool-II, <code>pcp.conf.sample</code> is
121 automatically created. We recommend copying and renaming it to
122 <code>pcp.conf</code>, and edit it.
124 <pre>$ cp /usr/local/etc/pcp.conf.sample /usr/local/etc/pcp.conf</pre>
126 <p>To encrypt your password into md5 hash format, use pg_md5 command,
127 which is installed as a part of pgpool-II
128 executables. <code>pg_md5</code> takes text as an command line
129 argument, and displays its md5-hashed text.</p>
131 <p>For example, give "postgres" as the command line argument, at
132 <code>pg_md5</code> displays md5-hashed text to the standard
136 $ /usr/bin/pg_md5 postgres
137 e8a48653851e28c69d0506508fb27fc5
140 <p>PCP commands are executed via network, so the port number must be
141 configured with <code>pcp_port</code> parameter in
142 <code>pgpool.conf</code> file.</p>
144 <p>We will use the default 9898 for <code>pcp_port</code> in this tutorial.</p>
146 <pre>pcp_port = 9898</pre>
148 <h3>1.4. <a name="db-node">Preparing Database Nodes</a></h3>
150 <p>Now, we need to set up backend PostgreSQL servers for
151 pgpool-II. These servers can be placed within the same host as
152 pgpool-II, or on separate machines. If you decide to place the servers
153 on the same host, different port numbers must be assigned for each
154 server. If the servers are placed on separate machines, they must be
155 configured properly so that they can accept network connections from
158 <p>In this tutorial, we will place three servers within the same host
159 as pgpool-II, and assign 5432, 5433, 5434 port numbers
160 respectively. To configure pgpool-II, edit <code>pgpool.conf</code> as
164 backend_hostname0 = 'localhost'
167 backend_hostname1 = 'localhost'
170 backend_hostname2 = 'localhost'
175 <p>For <code>backend_hostname</code>, <code>backend_port</code>,
176 <code>backend_weight</code>, set the node's hostname, port number, and
177 ratio for load balancing. At the end of each parameter string, node ID
178 must be specified by adding positive integers starting with 0 (i.e. 0,
181 <p><code>backend_weight</code> parameters are all 1, meaning that
182 SELECT queries are equally distributed among three servers.</p>
184 <h3>1.5. <a name="start-shutdown">Starting/Stopping pgpool-II</a></h3>
186 <p>To fire up pgpool-II, execute the following command on a terminal.</p>
190 <p>The above command, however, prints no log messages because pgpool
191 detaches the terminal. If you want to show pgpool log messages, you
192 pass <code>-n</code> option to pgpool command. pgpool-II is executed
193 as non-daemon process, and the terminal will not be detached.
201 The log messages are printed on the terminal, so the recommended
202 options to use are like the following.
205 <pre>$ pgpool -n -d > /tmp/pgpool.log 2>&1 &</pre>
207 <p><code>-d</code> option enables debug messages to be generated.</p>
210 The above command keeps appending log messages to /tmp/pgpool.log. If
211 you need to rotate log files, pass the logs to a external command
212 which have log rotation function.
214 For example, you can use rotatelogs coming with Apache2:
216 $ pgpool -n 2>&1 | /usr/local/apache2/bin/rotatelogs \
217 -l -f /var/log/pgpool/pgpool.log.%A 86400 &
219 This will generate a log file named "pgpool.log.Thursday" everyday then
220 rotate it 00:00 at midnight. Rotatelogs adds log to a file if it already
221 exists. To delete old log files before rotation, you could use cron:
223 55 23 * * * /usr/bin/find /var/log/pgpool -type f -mtime +5 -exec /bin/rm -f '{}' \;
225 Please note that rotaelogs may exist as /usr/sbin/rotatelogs2 in some
227 -f option generates a log file as soon as rotatelogs starts and is
228 available apache2 2.2.9 or greater.
232 href="http://www.cronolog.org"><code>cronolog</code></a> helps you.
236 $ pgpool -n 2>&1 | /usr/sbin/cronolog \
237 --hardlink=/var/log/pgsql/pgpool.log \
238 '/var/log/pgsql/%Y-%m-%d-pgpool.log' &
241 <p>To stop pgpool-II process, execute the following command.</p>
243 <pre>$ pgpool stop</pre>
245 <p>If any client is still connected, pgpool-II waits for them to
246 disconnect, and then terminate itself. Execute the following command
247 instead if you want to shutdown pgpool-II forcibly.
249 <pre>$ pgpool -m fast stop</pre>
251 <h2>2. <a name="replication">Your First Replication</a></h2>
253 <p>Replication enables the same data to be copied to multiple database
256 <p>In this section, we'll use three database nodes, which we have
257 already set up in section "1. <a href="#start">Let's Begin!</a>", and
258 takes you step by step to create a database replication system. Sample
259 data to be replicated will be generated by pgbench benchmark
262 <h3>2.1. <a name="replication-config">Configuring Replication</a></h3>
264 <p>To enable the database replication function, set
265 <code>replication_mode</code> to true in <code>pgpool.conf</code>
268 <pre>replication_mode = true</pre>
270 <p>When <code>replication_mode</code> is set to true, pgpool-II will send a
271 copy of a received query to all the database nodes.</p>
273 <p>When <code>load_balance_mode</code> is set to true, pgpool-II will
274 distribute SELECT queries among the database nodes.</p>
276 <pre>load_balance_mode = true</pre>
278 <p>In this section, we enable both <code>replication_mode</code> and
279 <code>load_balance_mode</code>.</p>
281 <h3>2.2. <a name="replication-check">Checking Replication</a></h3>
283 <p>To reflect the changes in <code>pgpool.conf</code>, pgpool-II must
284 be restarted. Please refer to section "1.5 <a
285 href="#start-shutdown">Starting/Stopping pgpool-II</a>".</p>
287 <p>After configuring <code>pgpool.conf</code> and restarting
288 pgpool-II, let's try the actual replication and see if everything is
291 <p>First, we need to create a database to be replicated. We will name
292 it "bench_replication". This database needs to be created on all the
293 nodes. Use <code>createdb</code> commands through pgpool-II, and the
294 database will be created on all the nodes.</p>
296 <pre>$ createdb -p 9999 bench_replication</pre>
298 <p>Then, we'll execute pgbench with <code>-i</code>
299 option. <code>-i</code> option initializes the database with
300 pre-defined tables and data.</p>
302 <pre>$ pgbench -i -p 9999 bench_replication</pre>
304 <p>The following table is the summary of tables and data, which will
305 be created by <code>pgbench -i</code>. If, on all the nodes, the
306 listed tables and data are created, replication is working correctly.
308 <table border="1" align="center">
311 <th>Number of Rows</th>
331 <p>Let's use a simple shell script to check the above on all the
332 nodes. The following script will display the number of rows in
333 branches, tellers, accounts, and history tables on all the nodes (5432,
336 <pre>$ for port in 5432 5433 5434; do
338 > for table_name in branches tellers accounts history; do
339 > echo $table_name
340 > psql -c "SELECT count(*) FROM $table_name" -p $port bench_replication
345 <h2>3. <a name="parallel">Your First Parallel Query</a></h2>
347 Data within the different range is stored in two or more data base nodes in a parallel Query. This is called a partitioning. Moreover, the same data as two or more data base nodes can be reproduced with partitioning.
350 <p>To enable parallel query in pgpool-II, you must set up another
351 database called "System Database" (we will denote it as SystemDB from
354 <p>SystemDB holds the user-defined rules to decide what data will be
355 saved in which database node. Another use of SystemDB is to merge
356 results sent back from the database nodes using dblink.</p>
358 <p>In this section, we will use three database nodes which we have set
359 up in section "1. <a href="#start">Let's Begin!</a>", and takes you
360 step by step to create a parallel query database system. We will use
361 pgbench again to create sample data.</p>
363 <h3>3.1. <a name="parallel-config">Configuring Parallel Query</a></h3>
365 <p>To enable the parallel query function, set <code>parallel_mode</code> to true in <code>pgpool.conf</code> file.</p>
367 <pre>parallel_mode = true</pre>
369 <p>Setting <code>paralle_mode</code> to true does not start parallel
370 query automatically. pgpool-II needs SystemDB and the rules
371 to know how to distribute data to the database nodes.</p>
373 <p>Also, dblink used by SystemDB makes connections to
374 pgpool-II. Therefore, <code>listen_addresses</code> needs to be
375 configured so that pgpool-II accepts those connections.</p>
377 <pre>listen_addresses = '*'</pre>
380 Attention: The replication is not done for the table that does the partitioning
381 though a parallel Query and the replication can be made effective at the same time.
382 The data base made by Moreover, because the composition of the data stored in
383 the data base is different in a parallel Query and the replication, that is,
384 "bench_replication" created in section "2. <a href="#replication">Your
385 First Replication</a>" cannot be reused.</p>
388 replication_mode = true
389 load_balance_mode = false
395 replication_mode = false
396 load_balance_mode = true
399 <p>In this section, we will set <code>parallel_mode and load_balance_mode</code> to true,
400 <code>listen_addresses</code> to '*', <code>replication_mode</code>to false.</p>
402 <h3>3.2. <a name="system-db">Configuring SystemDB</a></h3>
405 There is no difference in the data base the system data base and usually.
406 However, the function of dblink is defined in the system data base, and
407 the table that stores a distribution rule.. It is necessary to define dist_def. Moreover,
408 the data base node One can make the system data base, and pgpool-II can be
409 distributed in the load by connecting the cascade.
412 <p>In this section, we will create SystemDB within the 5432 port
413 node. The following list is the configuration parameters for
417 system_db_hostname = 'localhost'
418 system_db_port = 5432
419 system_db_dbname = 'pgpool'
420 system_db_schema = 'pgpool_catalog'
421 system_db_user = 'pgpool'
422 system_db_password = ''
425 <p>Actually, the above are the default settings of
426 <code>pgpool.conf</code>. Now, we must create a user called "pgpool",
427 and a database called "pgpool" owned by user "pgpool".</p>
430 $ createuser -p 5432 pgpool
431 $ createdb -p 5432 -O pgpool pgpool
434 <h4><p>3.2.1. Installing dblink</p></h4>
436 <p>Next, we must install dblink into "pgpool" database. dblink is one
437 of the tools included in <code>contrib</code> directory in the
438 PostgreSQL source code.</p>
440 <p>To install dblink to your system, execute the following commands.</p>
443 $ USE_PGXS=1 make -C contrib/dblink
444 $ USE_PGXS=1 make -C contrib/dblink install
447 <p>After dblink has been installed into your system, we will define
448 dblink functions in "pgpool" database. If PostgreSQL is installed in
449 <code>/usr/local/pgsql</code>, <code>dblink.sql</code> (a file with
450 function definitions) should have been installed in
451 <code>/usr/local/pgsql/share/contrib</code>. Now, execute the
452 following command to define dblink functions.</p>
454 <pre>$ psql -f /usr/local/pgsql/share/contrib/dblink.sql -p 5432 pgpool</pre>
456 <h4><p>3.2.2. Defining dist_def table</p></h4>
458 <p>Next, we will define a table called "dist_def" to hold the
459 distribution rules. When pgpool-II was installed, a file called
460 <code>system_db.sql</code> should have been installed in
461 <code>/usr/local/share/system_db.sql</code> (note that in this
462 tutorial, we are using the default installation directory,
463 <code>/usr/local</code>). <code>systeym_db.sql</code> contains
464 definitions to create special tables including "dist_def"
465 table. Execute the following command to define "dist_def" table.</p>
467 <pre>$ psql -f /usr/local/share/system_db.sql -p 5432 -U pgpool pgpool</pre>
469 <p>In <code>system_db.sql</code>, tables including "dist_def" are
470 installed in "pgpool_catalog" schema. If you have configured
471 <code>system_db_schema</code> to use other schema, you need to edit
472 <code>system_db.sql</code> accordingly.</p>
474 <p>The definition for "dist_def" is as shown here, and the table name
475 cannot be changed.</p>
478 CREATE TABLE pgpool_catalog.dist_def (
479 dbname text, -- database name
480 schema_name text, -- schema name
481 table_name text, -- table name
482 col_name text NOT NULL CHECK (col_name = ANY (col_list)), -- distribution key-column
483 col_list text[] NOT NULL, -- list of column names
484 type_list text[] NOT NULL, -- list of column types
485 dist_def_func text NOT NULL, -- distribution function name
486 PRIMARY KEY (dbname, schema_name, table_name)
490 <p>A tuple stored in "dist_def" can be classified into two types.</p>
493 <li>Distribution Rule (col_name, dist_def_func)</li>
494 <li>Table's meta-information (dbname, schema_name, table_name, col_list, type_list)</li>
497 <p>A distribution rule decides how to distribute data to a
498 particular node. Data will be distributed depending on the value of
499 "col_name" column. "dist_def_func" is a function that takes the value
500 of "col_name" as its argument, and returns an integer which
501 points to the appropriate database node ID where the data should be
504 <p>A meta-information is used to rewrite queries. Parallel query
505 must rewrite queries so that the results sent back from the backend
506 nodes can be merged into one result.</p>
508 <h4><p>3.2.2. Defining replicate_def table</p></h4>
510 When the table that does the replication to one SQL sentence with the table
511 registered in dist_def by uniting tables is specified, information on the
512 table that does the replication (replication rule) is registered in the table
513 named replicate_def beforehand.
514 The replicate_def table has already been made when making it from the
515 system_db.sql file when dist_def is defined.
516 The replicate_def table is defined as follows.
519 CREATE TABLE pgpool_catalog.replicate_def (
520 dbname text, -- database name
521 schema_name text, -- schema name
522 table_name text, -- table name
523 col_list text[] NOT NULL, -- list of column names
524 type_list text[] NOT NULL, -- list of column types
525 PRIMARY KEY (dbname, schema_name, table_name)
530 <h3>3.3. <a name="dist-def">Defining Distribution Rules</a></h3>
532 <p>In this tutorial, we will define rules to distribute pgbench's
533 sample data into three database nodes. The sample data will be created
534 by "pgbench -i -s 3" (i.e. scale factor of 3). We will create a new
535 database called "bench_parallel" for this section.</p>
537 <p>In pgpool-II's source code, you can find
538 <code>dist_def_pgbench.sql</code> file in <code>sample</code>
539 directoy. We will use this sample file here to create distribution
540 rules for pgbench. Execute the following command in extracted
541 pgpool-II source code directory.</p>
543 <pre>$ psql -f sample/dist_def_pgbench.sql -p 5432 pgpool</pre>
545 <p>Here is the explanation of <code>dist_def_pgbench.sql</code>.</p>
547 <p>Inside <code>dist_def_pgbench.sql</code>, we are inserting one
548 row into "dist_def" table. There is a distribution
549 function for accounts table.
550 For key-column, aid is defined for accounts respectively (which is primary keys)</p>
553 INSERT INTO pgpool_catalog.dist_def VALUES (
558 ARRAY['aid', 'bid', 'abalance', 'filler'],
559 ARRAY['integer', 'integer', 'integer', 'character(84)'],
560 'pgpool_catalog.dist_def_accounts'
564 <p>Now, we must define the distribution function for accounts table. Note
565 that you can use the same function from different tables. Also, you
566 can define functions using languages other than SQL (e.g. PL/pgSQL,
570 The accounts table when data is initialized specifying 3 scale factor,
571 The value of the aid is 1 to 300000.
572 The function is defined so that data is evenly distributed to three data base nodes.
575 SQL function will be defined as the return of the number of the data base node.
579 <pre>CREATE OR REPLACE FUNCTION pgpool_catalog.dist_def_branches(anyelement)
580 RETURNS integer AS $$
581 SELECT CASE WHEN $1 > 0 AND $1 <= 1 THEN 0
582 WHEN $1 > 1 AND $1 <= 2 THEN 1
588 <h3>3.4. <a name="replicate-def">Defining Replication Rules</a></h3>
590 The replication rule is the one that which table decides the replication whether to be done.
594 Here, it is made with pgbench With the branches table and tellers table are registered.
596 As a result, the accounts table and the inquiry that uses the branches table
597 and the tellers table become possible.
600 INSERT INTO pgpool_catalog.replicate_def VALUES (
604 ARRAY['bid', 'bbalance', 'filler'],
605 ARRAY['integer', 'integer', 'character(88)']
608 INSERT INTO pgpool_catalog.replicate_def VALUES (
612 ARRAY['tid', 'bid', 'tbalance', 'filler'],
613 ARRAY['integer', 'integer', 'integer', 'character(84)']
617 Replicate_def_pgbench.sql is prepared in sample directory.
619 In the directory that progresses the source code to define a replicate rule by using this as follows The psql command is executed.
622 $ psql -f sample/replicate_def_pgbench.sql -p 5432 pgpool
625 <h3>3.5. <a name="parallel-check">Checking Parallel Query</a></h3>
627 <p>To reflect the changes in <code>pgpool.conf</code>, pgpool-II must
628 be restarted. Please refer to section "1.5 <a
629 href="#start-shutdown">Starting/Stopping pgpool-II</a>".</p>
631 <p>After configuring <code>pgpool.conf</code> and restarting
632 pgpool-II, let's try and see if parallel query is working OK.</p>
634 <p>First, we need to create a database to be distributed. We will name
635 it "bench_parallel". This database needs to be created on all the
636 nodes. Use <code>createdb</code> commands through pgpool-II, and the
637 database will be created on all the nodes.</p>
639 <pre>$ createdb -p 9999 bench_parallel</pre>
641 <p>Then, we'll execute pgbench with <code>-i -s 3</code>
642 options. <code>-i</code> option initializes the database with
643 pre-defined tables and data. <code>-s</code> option specifies the
644 scale factor for initialization.</p>
646 <pre>$ pgbench -i -s 3 -p 9999 bench_parallel</pre>
648 <p>The tables and data created are shown in "3.3. <a
649 href="#dist-def">Defining Distribution Rules</a>".</p>
651 <p>One way to check if the data have been distributed correctly is to
652 execute a SELECT query via pgpool-II and directly on the backend, and
653 compare two results. If everything is configured right,
654 "bench_parallel" should be distributed as follows.</p>
656 <table border="1" align ="center">
659 <th >the number of lines</th>
679 <p>Let's use a simple shell script to check the above on all the nodes
680 and via pgpool-II. The following script will display the minimum and
681 maximum values in accounts table using port 5432, 5433, 5434, and
684 <pre>$ for port in 5432 5433 5434 9999; do
686 > psql -c "SELECT min(aid), max(aid) FROM accounts" -p $port bench_parallel
690 <div class="copyright">
693 Copyright © 2003 – 2008 PgPool Global Development Group