Good evening/day/morning, dear habrapeople! We continue to develop and expand the blog about my beloved open-source RDBMS, PostgreSQL. Miraculously, it turns out that today's topic has never been raised here before. I must say that partitioning in PostgreSQL is very well described in the documentation, but will that really stop me? :)

Introduction

In general, partitioning is understood not as a specific technology, but rather as an approach to database design that appeared long before DBMSs began to support partitioned tables natively. The idea is very simple: divide a table into several smaller parts. There are two subtypes: horizontal and vertical partitioning.
Horizontal partitioning
Parts of the table contain different rows. Say we have a log table for some abstract application, LOGS. We can split it into parts: one for the January 2009 logs, another for February 2009, and so on.
Vertical partitioning
Parts of the table contain different columns. Finding a use case where vertical partitioning is actually justified is somewhat harder than for horizontal partitioning. As a spherical-horse-in-a-vacuum example, consider this: the NEWS table has columns ID, SHORTTEXT, LONGTEXT, and the LONGTEXT field is used much less frequently than the first two. In this case it makes sense to split the NEWS table by columns: create two tables, one for SHORTTEXT and one for LONGTEXT, joined by the primary key, plus a NEWS view combining both columns. Then, when we only need the short description of a news item, the DBMS does not have to read its full text from disk.
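A minimal sketch of this vertical split, using the table and column names from the example above (the exact types are assumptions for illustration):

```sql
-- Hypothetical vertical split of the NEWS table described above.
CREATE TABLE news_short (
    id        BIGINT PRIMARY KEY,
    shorttext VARCHAR(1024) NOT NULL
);

CREATE TABLE news_long (
    id       BIGINT PRIMARY KEY REFERENCES news_short(id),
    longtext TEXT NOT NULL
);

-- A view that reassembles the original NEWS table for convenience.
CREATE VIEW news AS
SELECT s.id, s.shorttext, l.longtext
FROM news_short s
LEFT JOIN news_long l ON l.id = s.id;
```

Queries that only touch id and shorttext now read just the narrow news_short table.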
Support for partitioning in modern DBMSs
Most modern DBMSs support table partitioning in one form or another.
  • Oracle: supports partitioning since version 8. Working with partitions is, on the one hand, very simple (you don't have to think about them at all and can work as with a regular table*), and on the other hand very flexible: partitions can be split into subpartitions, dropped, split, and moved. Various indexing options for a partitioned table are supported (global and partitioned indexes). Link to lengthy description.
  • Microsoft SQL Server: support for partitioning appeared relatively recently, in SQL Server 2005. The first impression is "Well, finally!! :)", the second: "It works, everything seems to be OK." Documentation is on MSDN.
  • MySQL: supported since version 5.1. There is a very good description on Habré.
  • And so on…
* I'm lying, of course: there is the standard set of chores (creating a new partition on time, dropping the old one, and so on), but somehow it is all simple and clear.

Partitioning in Postgresql

Partitioning tables in PostgreSQL differs slightly in implementation from other databases. It is based on table inheritance (a distinctive PostgreSQL feature): we have a master table, and its partitions are child tables that inherit from it. We will look at partitioning using a task that is close to real life.
Formulation of the problem
The database is used to collect and analyze data about visitors to a site or sites. The data volumes are large enough to make partitioning worth considering. In most cases, analysis uses data from the last day.
1. Create the main table:
CREATE TABLE analytics.events (
    user_id UUID NOT NULL,
    event_type_id SMALLINT NOT NULL,
    event_time TIMESTAMP DEFAULT now() NOT NULL,
    url VARCHAR(1024) NOT NULL,
    referrer VARCHAR(1024),
    ip INET NOT NULL
);

2. We will partition by day on the event_time field, creating a new partition for each day. Partitions will be named according to the pattern analytics.events_DDMMYYYY. Here, for example, is the partition for January 1, 2010:
CREATE TABLE analytics.events_01012010 (
    event_id BIGINT DEFAULT nextval('analytics.seq_events') PRIMARY KEY,
    CHECK (event_time >= TIMESTAMP '2010-01-01 00:00:00' AND event_time < TIMESTAMP '2010-01-02 00:00:00')
) INHERITS (analytics.events);



When creating a partition, we explicitly define the event_id field (PRIMARY KEY is not inherited) and add a CHECK constraint on the event_time field so that nothing irrelevant gets in. The sequence analytics.seq_events is assumed to have been created beforehand.

3. Create an index on the event_time field. Since we partition the table expecting that most queries against events will filter on event_time, an index on this field will help us a lot.

CREATE INDEX events_01012010_event_time_idx ON analytics.events_01012010 USING btree(event_time);



4. We want data inserted into the master table to end up in the partition intended for it. To do this we use the following trick: a trigger on the master table that routes the rows.
CREATE OR REPLACE FUNCTION analytics.events_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF (NEW.event_time >= TIMESTAMP '2010-01-01 00:00:00' AND
        NEW.event_time < TIMESTAMP '2010-01-02 00:00:00') THEN
        INSERT INTO analytics.events_01012010 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date % is out of range. Fix analytics.events_insert_trigger', NEW.event_time;
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;



CREATE TRIGGER events_before_insert
BEFORE INSERT ON analytics.events
FOR EACH ROW EXECUTE PROCEDURE analytics.events_insert_trigger();


5. Everything is ready: we now have a partitioned table named analytics.events and can start furiously analyzing its data. By the way, we created the CHECK constraints not only to protect partitions from incorrect data. PostgreSQL can also use them when building a query plan (though with a live index on event_time the gain will be minimal); just make sure the constraint_exclusion parameter is enabled:

SET constraint_exclusion = on;
SELECT * FROM analytics.events WHERE event_time > CURRENT_DATE;

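One caveat worth knowing: plan-time partition exclusion only works when the planner can evaluate the WHERE condition during planning, so a literal timestamp is a safer bet than an expression such as CURRENT_DATE. A quick way to check what the planner actually does (a sketch; the exact plan output depends on your data and PostgreSQL version):

```sql
SET constraint_exclusion = on;

-- With literal bounds, the planner can compare the condition against each
-- partition's CHECK constraint and skip partitions that cannot match.
EXPLAIN SELECT count(*)
FROM analytics.events
WHERE event_time >= TIMESTAMP '2010-01-01 00:00:00'
  AND event_time <  TIMESTAMP '2010-01-02 00:00:00';
-- Expect the plan to touch only the (empty) master table and
-- analytics.events_01012010, not the other daily partitions.
```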

End of the first part
So what do we have? Let's go point by point:
1. The events table is now split into partitions, and analyzing the last day's data becomes simpler and faster.
2. The creeping horror of realizing that all of this has to be maintained somehow: new partitions must be created on time, and the trigger must be updated accordingly.
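As a taste of how that maintenance burden can be automated, here is one possible sketch (not the author's part-two solution) of a helper that creates a daily partition on demand. It follows the naming conventions above and assumes a reasonably modern PostgreSQL (format() appeared in 9.1, IF NOT EXISTS for indexes in 9.5):

```sql
-- Hypothetical helper: creates the partition and index for a given day,
-- following the analytics.events_DDMMYYYY naming rule from this article.
CREATE OR REPLACE FUNCTION analytics.create_events_partition(day DATE)
RETURNS void AS $$
DECLARE
    part_name TEXT := 'events_' || to_char(day, 'DDMMYYYY');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS analytics.%I (
             event_id BIGINT DEFAULT nextval(''analytics.seq_events'') PRIMARY KEY,
             CHECK (event_time >= %L AND event_time < %L)
         ) INHERITS (analytics.events)',
        part_name, day::timestamp, (day + 1)::timestamp);
    EXECUTE format(
        'CREATE INDEX IF NOT EXISTS %I ON analytics.%I USING btree (event_time)',
        part_name || '_event_time_idx', part_name);
END;
$$ LANGUAGE plpgsql;
```

The routing trigger would still have to be regenerated to know about the new partition; that is exactly the bookkeeping the second part addresses.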

I'll tell you how to work with partitioned tables easily and without worry in the second part.

UPD1: Replaced the term "segmentation" with "partitioning" throughout.
UPD2:
Based on a comment from one of the readers who, unfortunately, does not have an account on Habré:
There are several inheritance-related issues that should be taken into account during design. Partitions do not inherit the primary key or foreign keys on their columns, so when creating a partition you need to create PRIMARY KEY and FOREIGN KEY constraints on its columns explicitly. I would add on my own behalf that creating a FOREIGN KEY on the columns of a partitioned table is not the best approach; in most cases the partitioned table is a "fact table" that itself references "dimension" tables.

In this article, I'm going to demonstrate the specifics of query execution plans when accessing partitioned tables. Note that there is a big difference between partitioned tables (which only became available in SQL Server 2005) and partitioned views (which were available in SQL Server 2000 and are still available in SQL Server 2005 and later versions). I will demonstrate the specifics of query plans for partitioned views in another article.

Table scan

Let's create a simple partitioned table:

create partition function pf(int) as range for values (0, 10, 100)

create partition scheme ps as partition pf all to ()

create table t (a int, b int) on ps(a)

This script creates a table with four partitions. SQL Server assigns the four partition IDs as shown in the table:

PtnId  Values
1      t.a <= 0
2      0 < t.a <= 10
3      10 < t.a <= 100
4      100 < t.a

Now let's look at a query plan that would force the optimizer to use a Table Scan:


|–Nested Loops(Inner Join, OUTER REFERENCES:() PARTITION ID:())
……|–Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
……|–Table Scan(OBJECT:([t]))

In the plan above, SQL Server explicitly lists all the partition IDs in the Constant Scan operator, which feeds the Nested Loops join; the Table Scan runs on its inner side. Remember that a Nested Loops join executes its inner input (here, a scan of one partition of the table) once for each row from its outer input (here, the Constant Scan). So the table is scanned four times, once for each partition ID.

Note also that the Nested Loops join shows explicitly that its outer input supplies the partition IDs. Although it is not always visible in the text view of the execution plan (unfortunately, we sometimes lose this information there), the Table Scan uses the partition-ID column from the outer input to determine which partition to scan. This information is always available in the graphical execution plan (look at the properties of the Table Scan operator), as well as in the XML representation of the plan.

Static partition elimination

Consider the following query:

select * from t where a < 100

|–Nested Loops(Inner Join, OUTER REFERENCES:() PARTITION ID:())
……|–Constant Scan(VALUES:(((1)),((2)),((3))))
……|–Table Scan(OBJECT:([t]), WHERE:([t].[a]<(100)) PARTITION ID:())

The predicate a < 100 explicitly excludes all rows of the partition with ID 4. In this case there is no point in scanning that partition, since none of its rows can satisfy the predicate. The optimizer takes this fact into account and removes the partition from the query plan: only three partitions are listed in the Constant Scan. We call this static partition elimination, because we know at compile time that the list of partitions to scan is static.

If static elimination removes all but one partition, the Constant Scan and Nested Loops join operators are not needed at all:

select * from t where a < 0

|–Table Scan(OBJECT:([t]), WHERE:([t].[a]<(0)) PARTITION ID:((1)))

Note that the "PARTITION ID:((1))" statement, which specifies the ID of the partition to be scanned, is now part of the Table Scan statement.

Dynamic partition elimination

In some cases, SQL Server cannot determine at compile time exactly which partitions will need to be scanned, but it can see that some partitions may be eliminated.

select * from t where a < @i

|–Nested Loops(Inner Join, OUTER REFERENCES:() PARTITION ID:())
…….|–Filter(WHERE:(<=RangePartitionNew([@i],(0),(0),(10),(100))))
…….| |–Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
…….|–Table Scan(OBJECT:([t]), WHERE:([t].[a]<[@i]) PARTITION ID:())

This is a parameterized query. Since the parameter value is not known until execution (the fact that I assign it from a constant in the same batch does not change the situation), the partition IDs for the Constant Scan operator cannot be determined at compile time. Perhaps only partition 1 needs to be scanned, or partitions 1 and 2, and so on. Therefore all four partition IDs are listed in this operator, and the partition IDs are filtered at run time. We call this dynamic partition elimination.

The Filter compares each partition ID with the result of the special function RangePartitionNew, which computes the result of applying the partitioning function to the parameter value. The arguments of this function, from left to right, are:

  • the value (in this case the parameter @i) that we want to map to a partition ID;
  • a boolean flag indicating whether the partitioning function assigns boundary values to the left (0) or right (1) partition;
  • the partition boundary values (in this case 0, 10, and 100).

In this example, since @i is 0, RangePartitionNew returns 1, so only the partition with ID 1 is scanned. Note that, unlike the static elimination example, although we scan only one partition we still have the Constant Scan and Nested Loops operators. The reason we need them is that the set of partitions to scan is not known until execution.

In some cases, the optimizer can determine at compile time that only one partition will be scanned, even if it cannot determine which one. For example, if a query uses an equality predicate on the partition key, we know that only one partition can satisfy the condition. So although partition elimination is dynamic, the Constant Scan and Nested Loops operators are no longer needed. Example:

select * from t where a = @i

|–Table Scan(OBJECT:([t]), WHERE:([t].[a]=[@i]) PARTITION ID:(RangePartitionNew([@i],(0),(0),(10),(100))))

Combining static and dynamic partition elimination

SQL Server can combine static and dynamic partition elimination in a single query plan:

select * from t where a > 0 and a < @i

|–Nested Loops(Inner Join, OUTER REFERENCES:() PARTITION ID:())
……|–Filter(WHERE:(<=RangePartitionNew([@i],(0),(0),(10),(100))))
……| |–Constant Scan(VALUES:(((2)),((3)),((4))))
……|–Table Scan(OBJECT:([t]), WHERE:([t].[a]<[@i] AND [t].[a]>(0)) PARTITION ID:())

Note that in this last plan partition 1 is eliminated statically (it is absent from the Constant Scan), while the remaining partitions are filtered dynamically according to the predicates.

$partition

You can explicitly call the RangePartitionNew function using $partition:

select *, $partition.pf(a) from t

|–Compute Scalar(DEFINE:(=RangePartitionNew([t].[a],(0),(0),(10),(100))))
……|–Nested Loops(Inner Join, OUTER REFERENCES:() PARTITION ID:())
………..|–Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
………..|–Table Scan(OBJECT:([t]))

A distinctive feature of this query execution plan is the appearance of the Compute Scalar operator.

Additional Information


Range Partitioning - Sales Details

Usage patterns for sales data vary over time. Generally, the current month's data is live data, while data from previous months is mostly used for analysis. Analysis is most often performed monthly, quarterly, or annually. Because different analysts may need significant amounts of different analytical data at the same time, partitioning is well suited to isolating their activities. In the scenario below, data is collected from 283 nodes and delivered as two standard ASCII files. All files are sent to the central file server no later than 3:00 AM on the first day of each month. File sizes fluctuate, but each node averages approximately 86,000 orders per month; at an average of 2.63 line items per order, its OrderDetails file averages 226,180 rows. Across all 283 nodes, approximately 25 million new orders and 64 million order detail rows are added each month. The history analysis server keeps data for the last 2 years: just under 600 million orders and over 1.5 billion rows in the OrderDetails table. Since data is often analyzed by comparing months of the same quarter, or the same months of previous years, range partitioning is chosen, with month as the range size.

Based on Diagram 11 (Steps to Create a Partitioned Table), we decided to partition the table using range partitioning on the OrderDate column. Our analysts mainly combine and analyze data from the last 6 months, or the last 3 months of the current and last year (for example, January-March 2003 plus January-March 2004). To maximize disk striping, and at the same time isolate most data groupings, several file groups will be located on one physical disk, but they will be offset by six months in order to reduce the number of conflicts when sharing resources. The current month is October 2004, and all 283 stand-alone offices manage their current sales locally. The server stores data from October 2002 to September 2004 inclusive. To take advantage of the new 16-processor system and SAN (Storage Area Network), each month's data will be in its own filegroup file, and located on a set of striped mirrors (RAID 1+0). Figure 12 illustrates the placement of data on logical drives.


Figure 12: Orders partitioned table

Each of the 12 logical disks uses a RAID 1+0 configuration, so the total number of disks required for the Orders and OrderDetails tables is 48. However, the SAN supports up to 78 disks, so the remaining 30 disks are used for transaction log, TempDB, system databases and other small tables such as Customers (9 million records) and Products (386,750 records), etc. The Orders and OrderDetails tables will use the same boundary conditions and the same disk placement; in fact, they will use the same partitioning scheme. As a result (look at the two logical drives E:\ and F:\ in Figure 13), data from the Orders and OrderDetails tables for the same months will be located on the same drives:


Figure 13: Allocation of range partition extents on disk arrays

Although it looks complicated, it is all quite simple to implement. The hardest part about creating our partitioned table is delivering data from a large number of sources - 283 stores must have a standard delivery mechanism. However, there is only one Orders table and one OrderDetails table on the central server. To turn both tables into partitioned tables, we must first create a function and a partitioning scheme. The partition scheme determines the physical layout of the partitions on the disks, so filegroups must also exist. Since our tables require filegroups, the next step is to create them. The syntax for creating each filegroup is identical to the following, however, all twenty-four filegroups must be created this way. You can change the drive names/locations to a single drive to test and learn the syntax. Make sure you correct the file sizes to MB instead of GB, and select a smaller initial file size based on your available disk space. Twenty-four files and filegroups will be created in the SalesDB database. All will have similar syntax, except for the location, filename, and filegroup name:

ALTER DATABASE SalesDB
ADD FILE
    (NAME = N'SalesDBFG1File1',
    FILENAME = N'E:\SalesDB\SalesDBFG1File1.ndf',
    SIZE = 20GB,
    MAXSIZE = 35GB,
    FILEGROWTH = 5GB)
TO FILEGROUP
GO

Once all twenty-four files and filegroups have been created, you can define the partitioning function and scheme. You can verify that your files and filegroups are created by using the sp_helpfile and sp_helpfilegroup system stored procedures.
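For example (these are standard SQL Server system procedures; the specific filegroup name passed to sp_helpfilegroup is an assumption based on the file-naming pattern above):

```sql
USE SalesDB;
GO
-- List all files in the current database.
EXEC sp_helpfile;
-- List all filegroups, or the details of a single one.
EXEC sp_helpfilegroup;
EXEC sp_helpfilegroup 'SalesDBFG1';  -- filegroup name assumed from the pattern above
GO
```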

The partition function will be defined on the OrderDate column, whose data type is datetime. For both tables to be partitioned on OrderDate, this column must be present in both tables. In effect, the partition key values of the two tables (when both are partitioned on the same key) duplicate each other; however, this is necessary to take advantage of alignment, and in most cases the key column is relatively small (a datetime field is only 8 bytes). As already described in the "CREATE PARTITION FUNCTION for range partitions" chapter, our function will be a range partition function whose boundary values fall into the left (LEFT) partition.

CREATE PARTITION FUNCTION TwoYearDateRangePFN(datetime)
AS
RANGE LEFT FOR VALUES ('20021031 23:59:59.997', -- Oct 2002
    '20021130 23:59:59.997', -- Nov 2002
    '20021231 23:59:59.997', -- Dec 2002
    '20030131 23:59:59.997', -- Jan 2003
    '20030228 23:59:59.997', -- Feb 2003
    '20030331 23:59:59.997', -- Mar 2003
    '20030430 23:59:59.997', -- Apr 2003
    '20030531 23:59:59.997', -- May 2003
    '20030630 23:59:59.997', -- Jun 2003
    '20030731 23:59:59.997', -- Jul 2003
    '20030831 23:59:59.997', -- Aug 2003
    '20030930 23:59:59.997', -- Sep 2003
    '20031031 23:59:59.997', -- Oct 2003
    '20031130 23:59:59.997', -- Nov 2003
    '20031231 23:59:59.997', -- Dec 2003
    '20040131 23:59:59.997', -- Jan 2004
    '20040229 23:59:59.997', -- Feb 2004
    '20040331 23:59:59.997', -- Mar 2004
    '20040430 23:59:59.997', -- Apr 2004
    '20040531 23:59:59.997', -- May 2004
    '20040630 23:59:59.997', -- Jun 2004
    '20040731 23:59:59.997', -- Jul 2004
    '20040831 23:59:59.997', -- Aug 2004
    '20040930 23:59:59.997') -- Sep 2004
GO

Since both the leftmost and rightmost edge cases are covered, this partition function effectively creates 25 partitions. The table will support a 25th partition, which will remain empty. This empty partition does not require a dedicated filegroup, since no data should ever land in it; to guarantee that, a CHECK constraint on the table will limit the data range. To direct data to the appropriate disks, a partition scheme is used that maps partitions to filegroups. The partition scheme uses an explicit filegroup for each of the 24 filegroups containing data, and PRIMARY for the 25th, empty partition.

CREATE PARTITION SCHEME TwoYearDateRangePScheme
AS
PARTITION TwoYearDateRangePFN TO
    -- NB: the 24 filegroup names here are assumed (SalesDBFG1 .. SalesDBFG24,
    -- inferred from the file-naming pattern above); substitute your own.
    (SalesDBFG1, SalesDBFG2, SalesDBFG3, SalesDBFG4,
     SalesDBFG5, SalesDBFG6, SalesDBFG7, SalesDBFG8,
     SalesDBFG9, SalesDBFG10, SalesDBFG11, SalesDBFG12,
     SalesDBFG13, SalesDBFG14, SalesDBFG15, SalesDBFG16,
     SalesDBFG17, SalesDBFG18, SalesDBFG19, SalesDBFG20,
     SalesDBFG21, SalesDBFG22, SalesDBFG23, SalesDBFG24,
     [PRIMARY])
GO

The table can be created with the same syntax that previous releases of SQL Server supported, using a default or user-defined filegroup (which creates a NON-partitioned table), or using a partition scheme (which creates a partitioned table). Which option is preferable (even if the table will become partitioned later) depends on how the table will be populated and how many partitions you intend to manipulate. Populating a heap and then creating a clustered index on it will likely perform better than loading into a table that already has a clustered index. In addition, on multiprocessor systems you can load data into a table in parallel and then build the indexes in parallel as well. As an example, let's create the Orders table and load it with INSERT ... SELECT statements. To create the Orders table as partitioned, specify the partition scheme in the ON clause of the CREATE TABLE statement.

CREATE TABLE SalesDB..Orders (
    -- NB: most of the column list was lost from this copy and is elided here;
    -- the table is later populated from AdventureWorks.Purchasing.PurchaseOrderHeader.
    ...,
    OrderDate DATETIME NOT NULL
        CONSTRAINT OrdersRangeYear
            CHECK (OrderDate >= '20021001'
               AND OrderDate < '20041001'),
    ...
) ON TwoYearDateRangePScheme(OrderDate)
GO

Because the OrderDetails table is going to use the same partition scheme, it must also include an OrderDate column.

CREATE TABLE SalesDB..OrderDetails (
    -- NB: most of the column list was lost from this copy and is elided here;
    -- the table is later populated from AdventureWorks.Purchasing.PurchaseOrderDetail.
    ...,
    OrderDate DATETIME NOT NULL
        CONSTRAINT OrderDetailsRangeYearCK
            CHECK (OrderDate >= '20021001'
               AND OrderDate < '20041001'),
    ...
) ON TwoYearDateRangePScheme(OrderDate)
GO

The next step is to load the tables with data from the new AdventureWorks training database. Make sure you have installed the AdventureWorks database.

INSERT dbo.Orders
SELECT o.*  -- NB: the original statement listed all thirteen columns explicitly;
            -- the column names were lost from this copy
FROM AdventureWorks.Purchasing.PurchaseOrderHeader AS o
WHERE (o.OrderDate >= '20021001'
   AND o.OrderDate < '20041001')
GO

INSERT dbo.OrderDetails
SELECT od.PurchaseOrderID
    , od.LineNumber
    , od.ProductID
    , od.UnitPrice
    , od.OrderQty
    , od.ReceivedQty
    , od.RejectedQty
    , o.OrderDate
    , od.DueDate
    , od.ModifiedDate
FROM AdventureWorks.Purchasing.PurchaseOrderDetail AS od
JOIN AdventureWorks.Purchasing.PurchaseOrderHeader AS o
    ON o.PurchaseOrderID = od.PurchaseOrderID
WHERE (o.OrderDate >= '20021001'
   AND o.OrderDate < '20041001')
GO

Now that you have loaded the data into a partitioned table, you can use a new built-in system function to determine the partition a row resides in. The following queries return, for each partition that contains data, the minimum and maximum values of the OrderDate field (the second query also counts the rows per partition). A partition that contains no rows does not appear in the result.

-- NB: the column aliases were lost from this copy; the names below are placeholders.
SELECT $partition.TwoYearDateRangePFN(o.OrderDate) AS PartitionID
    , min(o.OrderDate) AS MinOrderDate
    , max(o.OrderDate) AS MaxOrderDate
FROM dbo.Orders AS o
GROUP BY $partition.TwoYearDateRangePFN(o.OrderDate)
ORDER BY PartitionID
GO

SELECT $partition.TwoYearDateRangePFN(od.OrderDate) AS PartitionID
    , min(od.OrderDate) AS MinOrderDate
    , max(od.OrderDate) AS MaxOrderDate
    , count(*) AS NumRows
FROM dbo.OrderDetails AS od
GROUP BY $partition.TwoYearDateRangePFN(od.OrderDate)
ORDER BY PartitionID
GO

Finally, now that the data is loaded, you can create the clustered indexes and a FOREIGN KEY between the OrderDetails and Orders tables. In this case the clustered index is built on the primary key, since both tables are identified by their partition key plus the order ID (for OrderDetails, the LineNumber column is added to the key for uniqueness). By default, indexes built on a partitioned table are aligned with it using the same partition scheme, so the scheme does not have to be specified explicitly.

ALTER TABLE Orders
ADD CONSTRAINT OrdersPK
    PRIMARY KEY CLUSTERED (OrderDate, OrderID)
GO

ALTER TABLE dbo.OrderDetails
ADD CONSTRAINT OrderDetailsPK
    PRIMARY KEY CLUSTERED (OrderDate, OrderID, LineNumber)
GO

The complete syntax defining the partitioning scheme would look like this:

ALTER TABLE ORDERS
ADD CONSTRAINT OrdersPK
PRIMARY KEY CLUSTERED (OrderDate, OrderID)
ON TwoYearDateRangePScheme(OrderDate)
GO

ALTER TABLE dbo.OrderDetails
ADD CONSTRAINT OrderDetailsPK
PRIMARY KEY CLUSTERED (OrderDate, OrderID, LineNumber)
ON TwoYearDateRangePScheme(OrderDate)
GO

The Oracle DBMS also supports partitioned views. The basic idea is simple: the physical table is split into several tables (optionally using the table-partitioning methods described earlier) according to a partitioning criterion, which makes query processing more efficient. We will call this criterion the partitioning predicate. You can then create views that make it easier for users to access the data in these tables. A view partition is defined by a range of partition-key values; queries that restrict the partition key to a range of values will access only the partitions whose ranges match.

View partitions can be defined by partitioning predicates specified either with a CHECK constraint or with a WHERE clause. Let's show both techniques using a slightly modified version of the Sales table we looked at in the previous section. Suppose that sales data for a calendar year is placed in four separate tables, one per quarter: Q1_Sales, Q2_Sales, Q3_Sales, and Q4_Sales.

Example 20.14.

Using the CHECK constraint. With the ALTER TABLE command, we add a constraint on the sale-date column (s_date) of each table so that its rows correspond to one quarter of the year. The sales view we then create lets us access these tables, either individually or all together.

ALTER TABLE Q1_Sales ADD CONSTRAINT C0 CHECK (s_date BETWEEN 'jan-1-2002' AND 'mar-31-2002');
ALTER TABLE Q2_Sales ADD CONSTRAINT C1 CHECK (s_date BETWEEN 'apr-1-2002' AND 'jun-30-2002');
ALTER TABLE Q3_Sales ADD CONSTRAINT C2 CHECK (s_date BETWEEN 'jul-1-2002' AND 'sep-30-2002');
ALTER TABLE Q4_Sales ADD CONSTRAINT C3 CHECK (s_date BETWEEN 'oct-1-2002' AND 'dec-31-2002');

CREATE VIEW sales_v AS
SELECT * FROM Q1_Sales
UNION ALL SELECT * FROM Q2_Sales
UNION ALL SELECT * FROM Q3_Sales
UNION ALL SELECT * FROM Q4_Sales;

The advantage of this kind of view partitioning is that the CHECK predicate is not evaluated for every row at query time. The constraints also prevent rows that violate the predicate from being inserted into the wrong table, and rows matching the partitioning predicate are retrieved from the database faster.
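The payoff of the CHECK-constraint approach is partition pruning: the optimizer can skip any branch of the view whose constraint contradicts the query predicate. A minimal sketch, assuming the sales_v view defined above:

```sql
-- Only Q2_Sales needs to be scanned: the C0, C2, and C3 constraints
-- contradict the predicate, so the optimizer can prune those branches.
SELECT *
FROM sales_v
WHERE s_date BETWEEN 'apr-15-2002' AND 'may-15-2002';
```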

Example 20.15.

Partitioning views using the WHERE clause. Let's create a view over the same tables as in the example above.

CREATE VIEW sales_v AS
SELECT * FROM Q1_Sales WHERE s_date BETWEEN 'jan-1-2002' AND 'mar-31-2002'
UNION ALL SELECT * FROM Q2_Sales WHERE s_date BETWEEN 'apr-1-2002' AND 'jun-30-2002'
UNION ALL SELECT * FROM Q3_Sales WHERE s_date BETWEEN 'jul-1-2002' AND 'sep-30-2002'
UNION ALL SELECT * FROM Q4_Sales WHERE s_date BETWEEN 'oct-1-2002' AND 'dec-31-2002';

Partitioning views with a WHERE clause has some disadvantages. First, the partitioning criterion is evaluated at runtime for all rows in all partitions covered by the query. Second, users may mistakenly insert a row into the wrong partition (for example, insert a first-quarter row into the third-quarter table), which leads to incorrect results for both quarters.
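To illustrate the second disadvantage, here is a hypothetical misplaced insert (the amount column is assumed for illustration). Without a CHECK constraint, the wrong-quarter row is silently accepted, and the Q3 branch's WHERE clause then hides it from the view:

```sql
-- A first-quarter row inserted into the third-quarter table is accepted:
INSERT INTO Q3_Sales (s_date, amount)
VALUES ('feb-15-2002', 100);
-- But the Q3_Sales branch of sales_v
-- (WHERE s_date BETWEEN 'jul-1-2002' AND 'sep-30-2002')
-- filters this date out, so the row never appears in the view at all.
```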

This technique also has an advantage over using the CHECK constraint: a partition that matches the WHERE predicate can be placed in a remote database.

When deciding whether to create partitioned views, keep the following factors in mind.

  • Operations such as data loading and data deletion work at the partition level rather than on the entire base table.
  • Access to one partition does not affect the data in the other partitions.
  • The Oracle DBMS has the necessary built-in capabilities for recognizing partitioned views.
  • Partitioned views are very useful when working with tables containing a large amount of historical data.

Partitioning tables in a DBMS of the MS SQL Server family

Creating Partitioned Tables

DBMSs of the MS SQL Server family also support partitioning of tables, indexes, and views. However, unlike the Oracle family, partitioning in MS SQL Server is performed according to a unified scheme.

In MS SQL Server, all tables and indexes in the database are considered partitioned, even if they consist of only one section. In fact, partitions represent the basic organizational unit in the physical architecture of tables and indexes. This means that the logical and physical architecture of tables and indexes that include multiple partitions closely mirrors the architecture of tables and indexes that consist of a single partition.

Partitioning of tables and indexes is strictly horizontal, at the row level (column-wise partitioning is not supported), and access goes through a single entry point (the table name or index name), so application code does not need to know the number of partitions. Partitioning can be applied to a base table as well as to the indexes associated with it.

Each value range in a partition has boundaries, which are defined in the FOR VALUES clause. If the date of sale is June 23, 2006, the row will be stored in partition 2 (P2).
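The partition function itself, created in an example preceding this excerpt, is not shown here; consistent with the filegroup boundaries described below, it could be declared along these lines (a RANGE RIGHT sketch, with names taken from the surrounding text):

```sql
-- Three boundary values produce four partitions; with RANGE RIGHT each
-- boundary value belongs to the partition on its right:
--   P1: < 2005-01-01          P2: [2005-01-01, 2007-01-01)
--   P3: [2007-01-01, 2009-01-01)   P4: >= 2009-01-01
CREATE PARTITION FUNCTION MyPartitionFunction (date)
AS RANGE RIGHT FOR VALUES ('2005-01-01', '2007-01-01', '2009-01-01')
```

Under these boundaries, a sale dated June 23, 2006 indeed falls into partition 2.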

Now let's create the partition scheme. A partition scheme maps partitions to filegroups (here named MyFilegroup1, MyFilegroup2, MyFilegroup3, MyFilegroup4), as shown in the following command:

CREATE PARTITION SCHEME MyPartitionScheme AS PARTITION MyPartitionFunction TO (MyFilegroup1, MyFilegroup2, MyFilegroup3, MyFilegroup4)

MyPartitionScheme is the name of the partition scheme, and MyPartitionFunction is the partition function it uses. This command maps each partition to one or more filegroups. Rows whose Date_of_Event column value is before 1/1/05 are assigned to MyFilegroup1. Rows with values greater than or equal to 1/1/05 and before 1/1/07 go to MyFilegroup2. Rows with values greater than or equal to 1/1/07 and before 1/1/09 go to MyFilegroup3. All remaining rows, with values greater than or equal to 1/1/09, go to MyFilegroup4.
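Note that the filegroups named in the scheme must already exist in the database. A minimal sketch for one of them (the SalesDB database name, logical file name, and file path are assumptions for illustration; MyFilegroup2 through MyFilegroup4 would be created the same way):

```sql
-- Create a filegroup and attach a data file to it.
ALTER DATABASE SalesDB ADD FILEGROUP MyFilegroup1;
ALTER DATABASE SalesDB
ADD FILE (NAME = SalesData1, FILENAME = 'C:\Data\SalesData1.ndf')
TO FILEGROUP MyFilegroup1;
```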

For each set of boundary values (specified in the FOR VALUES clause of the partition function), the number of partitions equals the number of boundary values + 1. The preceding CREATE PARTITION SCHEME statement involves three boundary values and four partitions. Regardless of whether partitions are created with RANGE RIGHT or RANGE LEFT, the number of partitions is always the number of boundary values + 1, up to a maximum of 1,000 partitions per table.
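The difference between RANGE LEFT and RANGE RIGHT lies only in which side of a boundary the boundary value itself belongs to; the partition count is the same. A sketch with hypothetical integer boundaries:

```sql
-- RANGE LEFT: each boundary value goes to the partition on its left.
--   P1: <= 10   P2: 11..20   P3: 21..30   P4: > 30
CREATE PARTITION FUNCTION pfLeft (int) AS RANGE LEFT FOR VALUES (10, 20, 30)

-- RANGE RIGHT: each boundary value goes to the partition on its right.
--   P1: < 10    P2: 10..19   P3: 20..29   P4: >= 30
CREATE PARTITION FUNCTION pfRight (int) AS RANGE RIGHT FOR VALUES (10, 20, 30)
```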

Now we can create the partitioned fact table "Sales" (SALES). Creating a partitioned table is not much different from creating a regular table; you only need to reference the partition scheme name in the ON clause, as shown in the command below.

CREATE TABLE SALES (
    Sales_ID bigint identity(1, 1) PRIMARY KEY NONCLUSTERED NOT NULL,
    Cust_ID bigint NULL,
    Prod_ID bigint NULL,
    Store_ID bigint NULL,
    REG_ID char(10) NULL,
    Time_of_Event time NULL,
    Quantity integer NOT NULL,
    Amount dec(8, 2) NOT NULL,
    Date_of_Event date NOT NULL
) ON MyPartitionScheme (Date_of_Event)

By referencing the name of a partition scheme, the designer specifies that the table is partitioned. This makes it possible to design the index structure around the partitioned data rather than the table as a whole. Creating partitioned indexes means building a separate balanced tree for each partition. The resulting smaller indexes are easier for the database or data warehouse administrator to maintain as data is changed, added, and deleted.

When creating partitioned indexes, you can create aligned or unaligned indexes. An aligned index corresponds directly to the partitioned data in the table; an unaligned index uses a different partition scheme.

Of the two methods, the aligned index is preferred and is chosen by default when indexes are created on a partitioned table without specifying a different partition scheme. Aligned indexes provide the flexibility needed to create additional partitions in the table and to switch a particular partition out to another table. For most partitioning tasks, it is enough to apply the table's own partition scheme to its indexes.
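For example, an index created on SALES without an ON clause is aligned automatically (the index name IX_Sales_Cust is hypothetical):

```sql
-- No partition scheme is specified, so the index inherits
-- MyPartitionScheme(Date_of_Event) from the SALES table and is aligned.
CREATE NONCLUSTERED INDEX IX_Sales_Cust ON SALES (Cust_ID)
```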

Example 20.19.

Let's create a partitioned nonclustered index on the partitioned "Sales" (SALES) table from the previous example 20.18.

CREATE PARTITION SCHEME Index_primary_Left_Scheme AS PARTITION Index_Left_Partition ALL TO ([PRIMARY])

Now let's execute the create index command as shown below.

CREATE NONCLUSTERED INDEX cl_multiple_partition ON SALES (Cust_ID) ON Index_primary_Left_Scheme (Cust_ID)

In this nonclustered index, the "Customer ID" (Cust_ID) column is used as the index key; it is not the partition key of the "Sales" (SALES) table, so the index is unaligned.

Decisions about index partitioning are made by the data warehouse designer at the design stage, or by the data warehouse administrator during operation. The purpose of index partitioning is either to improve query performance or to simplify index maintenance procedures.
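As an example of simplified maintenance, an aligned partitioned index can be rebuilt one partition at a time rather than as a whole (the index name IX_Sales_Cust is hypothetical):

```sql
-- Rebuild only partition 2 of the index, leaving the
-- other partitions untouched.
ALTER INDEX IX_Sales_Cust ON SALES REBUILD PARTITION = 2
```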