Archiving
There is the idea that data is sacred and needs to be stored forever. However, if you keep your data forever, you will, sooner or later, have a very large database.
To keep operations running smoothly, it is useful to allocate frequently used data to its own tables and keep infrequently used data in archive tables.
Let us discuss some examples:
You have a large table that stores transactions and its size is 200 GB. It got that way because your company has been around for 5 years, and in the last year the company has been successful in acquiring new users, so your data has doubled.
You now have data in that table for the whole 5-year period, but you only really care about the last 1-3 months. There may be a use case where someone needs data about a customer from a year ago, and there may also be a reporting request to aggregate data for the last 3 years. Therefore, to play it safe, everything stays in one table.
However, this greatly affects performance. It would be more helpful to separate those 3 concerns into 3 different data models:
- A table for a 3 month period for frequently used data.
- An archive table that keeps all old and infrequently used data.
- A summary table for reporting.
With these, we comply with the single-responsibility principle and greatly improve performance for each purpose.
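As a sketch, the three models might look like the following (the table and column names here are illustrative, not taken from a real schema):

```sql
-- Hot data: only the last ~3 months, kept small and fast.
CREATE TABLE transactions (
  id bigint unsigned NOT NULL AUTO_INCREMENT,
  dateCreated datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  amount decimal(12,2) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Cold data: same structure, filled by a periodic archiving job.
CREATE TABLE transactions_archive LIKE transactions;

-- Reporting: pre-aggregated daily totals, refreshed on a schedule.
CREATE TABLE transactions_summary (
  summaryDate date NOT NULL,
  txCount bigint unsigned NOT NULL,
  txTotal decimal(16,2) NOT NULL,
  PRIMARY KEY (summaryDate)
) ENGINE=InnoDB;
```

Each table can then be indexed and sized for its own access pattern, instead of one table serving all three.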
I would also add that having a 'main' table with only the last 3 months' worth of data makes it much easier to scale. For example, even if your data doubles every year for the next 3-5 years, you still only have to manage a subset of it. So if those 3 months once took 20 GB to store, the following year they would take 40 GB and the year after 80 GB: these sizes are still very manageable by today's standards. In addition, hardware and software improve over time, so there is a legitimate expectation that simply by upgrading and updating, you can keep humming along.
Taking the effort to identify 'hot' and 'cold' data and allocating each to the right tables can mean that your scalability concerns are addressed for the long term.
How to implement Archiving?
Table _archive
One way to implement archiving is by having a second table whose name ends with _archive.
To enable this, you need to be able to redirect your queries (mainly from your code, or via a proxy that can do it) to either the main or the archive table, based on particular criteria.
For example, if the date is earlier than today's date minus 30 days, send the query to the archive table; if not, to the main table. Another example: if the status column equals 'inactive', send it to the archive table.
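The rows also have to be moved across on a schedule. A minimal sketch of such an archiving job, assuming a transactions table and a matching transactions_archive table with a dateCreated column (names are assumptions), could be:

```sql
-- Move rows older than 30 days into the archive table in bounded
-- batches, so each run holds locks only briefly. The ORDER BY makes
-- the INSERT and DELETE target the same set of rows.
START TRANSACTION;

INSERT INTO transactions_archive
SELECT * FROM transactions
WHERE dateCreated < NOW() - INTERVAL 30 DAY
ORDER BY dateCreated, id
LIMIT 10000;

DELETE FROM transactions
WHERE dateCreated < NOW() - INTERVAL 30 DAY
ORDER BY dateCreated, id
LIMIT 10000;

COMMIT;
```

A cron job can repeat this until no rows match, keeping the main table trimmed to its hot window.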
Partitioning by Date
While this is not a different physical data model, partitioning does split the table into several underlying tables and achieves the desired purpose without application code changes.
It is very common to partition a table by date, so that old data is automatically placed in the right partition.
mysql> CREATE TABLE `largetable` (
-> `id` bigint unsigned NOT NULL AUTO_INCREMENT,
-> `dateCreated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
-> `status` int default 1,
-> `sometext` text,
-> PRIMARY KEY (`id`,`dateCreated`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.03 sec)
mysql> alter table largetable partition by RANGE(YEAR(dateCreated)) (
-> PARTITION p2016 VALUES LESS THAN (2017),
-> PARTITION p2017 VALUES LESS THAN (2018),
-> PARTITION p2018 VALUES LESS THAN (2019),
-> PARTITION p2019 VALUES LESS THAN (2020),
-> PARTITION p2020 VALUES LESS THAN (2021),
-> PARTITION pmax VALUES LESS THAN MAXVALUE);
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
The above example allocates rows by the year in which they were created. Please note that after 2020, this sort of manual partitioning requires adding partitions for new years to the table. If you do it in advance, this can be done without disrupting operations.
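Adding the next year in advance means splitting the catch-all pmax partition before any rows for that year arrive; done early, the operation has nothing to move and is cheap. A sketch, continuing the table above:

```sql
-- Carve a 2021 partition out of the empty catch-all partition.
ALTER TABLE largetable REORGANIZE PARTITION pmax INTO (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- Partitioning by date also makes purging old data very cheap:
-- dropping a partition discards all its rows at once, with no
-- row-by-row DELETE.
ALTER TABLE largetable DROP PARTITION p2016;
```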
Partitioning by Status
You can also partition (as mentioned above) on a status column with active/inactive values, and simply by using UPDATE to change the value, MySQL will move the row over to the right partition. REPLACE or INSERT + DELETE will work as well.
mysql> CREATE TABLE `largetable` (
-> `id` bigint unsigned NOT NULL AUTO_INCREMENT,
-> `dateCreated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
-> `status` int default 1, -- default active
-> `sometext` text,
-> PRIMARY KEY (`id`,`status`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)
mysql> alter table largetable partition by list(status) (
-> partition pactive values in (1), -- active
-> partition pinactive values in (2) -- inactive
-> );
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> select * from largetable partition (pactive);
Empty set (0.00 sec)
mysql> select * from largetable partition (pinactive);
Empty set (0.00 sec)
mysql> insert into largetable(sometext) values ('hello');
Query OK, 1 row affected (0.01 sec)
mysql> select * from largetable partition (pinactive);
Empty set (0.00 sec)
mysql> select * from largetable partition (pactive);
+----+---------------------+--------+----------+
| id | dateCreated | status | sometext |
+----+---------------------+--------+----------+
| 1 | 2017-10-30 10:04:03 | 1 | hello |
+----+---------------------+--------+----------+
1 row in set (0.00 sec)
mysql> update largetable set status = 2 where id =1 ;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from largetable partition (pactive);
Empty set (0.00 sec)
mysql> select * from largetable partition (pinactive);
+----+---------------------+--------+----------+
| id | dateCreated | status | sometext |
+----+---------------------+--------+----------+
| 1 | 2017-10-30 10:04:03 | 2 | hello |
+----+---------------------+--------+----------+
1 row in set (0.00 sec)
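To see how rows are distributed across partitions without querying each one, you can consult the data dictionary. Note that for InnoDB the row counts are statistics-based estimates, not exact counts:

```sql
-- Approximate row count per partition of largetable.
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'largetable';
```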
Partitioning by ID
And lastly, you can partition on the range of your auto-incrementing id key.
mysql> CREATE TABLE `largetable` (
-> `id` bigint unsigned NOT NULL AUTO_INCREMENT,
-> `dateCreated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
-> `status` int default 1,
-> `sometext` text,
-> PRIMARY KEY (`id`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)
mysql> alter table largetable partition by RANGE(id) (
-> PARTITION p1 VALUES LESS THAN (500000000),
-> PARTITION p2 VALUES LESS THAN (1000000000),
-> PARTITION p3 VALUES LESS THAN (1500000000),
-> PARTITION p4 VALUES LESS THAN (2000000000),
-> PARTITION p5 VALUES LESS THAN (2500000000),
-> PARTITION pmax VALUES LESS THAN MAXVALUE);
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0
The above example specifies which partition a row should go to based on the range its id number falls into. This approach is most useful if your system does a lot of primary key lookups. It also distributes the partition sizes more evenly than partitioning by date does, since recent years tend to hold more data.
A word of caution
Partitioning on the right key is absolutely crucial, and not easy. You need to analyse the queries the application sends to that specific table and come up with a partitioning key (or keys) that works well and does not slow the table down, at least not for the top 80% of the slowest queries.
The partitioning key needs to be part of the PRIMARY KEY, and in order for the optimiser to send each query to the right partition, that key should ideally be included in all SELECT/UPDATE/DELETE queries. Otherwise, your query will scan every partition in the table sequentially.
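You can check whether the optimiser is pruning partitions with EXPLAIN, which in MySQL 5.7+ includes a partitions column by default (older versions use EXPLAIN PARTITIONS). Continuing the id-range example above:

```sql
-- The partitions column names the partitions that will be read.
-- A lookup on the partitioning key touches a single partition:
EXPLAIN SELECT * FROM largetable WHERE id = 42;

-- A query that omits the partitioning key lists every partition,
-- meaning all of them must be scanned:
EXPLAIN SELECT * FROM largetable WHERE sometext = 'hello';
```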