Data location informing service to quickly find out the data distributed in remote sites

ITER UDA structure and the replicated repository: A single indexing DB would be a single point of failure (SPOF). A remote site’s resilience to accidental loss of long-distance network connectivity or to planned power outages improves if the replicated indexing DB can continue operating independently of the primary indexing DB, even while real-time synchronization is temporarily lost. The indexer process should always register a new data entry synchronously with the data migration process that makes a new copy of the data or moves it to another place, locally or remotely.
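
The coupling between data migration and index registration described above can be illustrated with a minimal Python sketch. It is not the ITER UDA or LHD implementation: the `location` table, the `open_index` and `migrate_and_register` helpers, and the use of SQLite in place of PostgreSQL are all assumptions made purely for illustration.

```python
import shutil
import sqlite3
from pathlib import Path


def open_index(path=":memory:"):
    """Open a toy index DB and create the hypothetical location table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS location ("
        "data_id TEXT, site TEXT, path TEXT, "
        "PRIMARY KEY (data_id, site))"
    )
    return db


def migrate_and_register(db, data_id, src, dst_dir, site):
    """Copy src into dst_dir and register the new copy in the same step."""
    dst = Path(dst_dir) / Path(src).name
    shutil.copy2(src, dst)  # raises on failure, so nothing gets registered
    with db:                # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO location VALUES (?, ?, ?)",
            (data_id, site, str(dst)),
        )
    return dst
```

Because registration happens only after the copy succeeds, and inside a single transaction, the index in this sketch never points at a copy that was not actually made.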


The ITER Remote Experimentation Centre (REC) in Japan is planning to replicate all the data, as fast as possible, from ITER in France, which is more than 10 000 km away. The data location database and its information service will play an important role in finding and retrieving data efficiently from the multiple locations created by the replication. To replicate the database information between ITER in France and the REC in Japan despite the long delays inherent in such long-distance telecommunication, it is necessary to adopt a "multi-master asynchronous replication" configuration between these databases. In this study, we examined several database products and found that Postgres BDR can provide the expected replication functionality.

As a result of bi-directional replication tests using the LHD experimental database and a domestic network of about 1 000 km distance, it was found that the replication performance was sufficient for remote fusion experiments such as between ITER and REC.

In modern fusion experiments, remote data access has already come into wide use in both domestic and international research collaborations. The SNET fusion data exchanging platform in Japan interconnects four fusion experimental sites, LHD, QUEST, GAMMA10, and TST-2, over a 1 000 km distance. SNET enables remote collaborators to seamlessly access each site’s data as if they were at a local site.

Similarly, the Remote Experimentation Centre (REC) in Rokkasho, Japan, where high-performance computing resources are ready for off-site analyses of ITER physics data, is planning to replicate the full ITER dataset over a 10 000 km distance. Prior to this study, our group performed a series of inter-continental massive data replication tests between the ITER and REC sites.

In such multi-site data repository environments, the data location informing “locator” service will be essential for finding the best data server, that is, the one from which users can retrieve the data most efficiently. Considering that the latency of inter-continental network transactions exceeds 100 milliseconds, not only the data repositories but also the locator servers should be distributed across multiple sites. Since the data location information will be stored and served by a relational database (RDB), such as PostgreSQL or MySQL, real-time synchronization between the distributed RDBs will be necessary to provide a consistent data locator service around the world.
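
As a rough illustration of what such a locator lookup might do, the following Python sketch picks, among all known copies of a dataset, the one with the lowest round-trip time. The `Replica` record, the host names, the latency values, and the choice of round-trip time as the only selection criterion are assumptions for illustration, not part of the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str      # hypothetical site label, e.g. "ITER" or "REC"
    server: str    # hypothetical host name of the data server
    rtt_ms: float  # round-trip time measured from the client, in ms


def pick_best_replica(replicas):
    """Return the replica with the lowest round-trip time, or None if empty."""
    return min(replicas, key=lambda r: r.rtt_ms, default=None)


def locate(index, data_id):
    """Look up every known copy of data_id and choose the closest one."""
    return pick_best_replica(index.get(data_id, []))


# Illustrative index: one dataset stored at two sites (values are made up).
EXAMPLE_INDEX = {
    "shot-0001": [
        Replica("ITER", "data.iter.example", 180.0),
        Replica("REC", "data.rec.example", 2.0),
    ],
}
```

For a client located in Japan, `locate(EXAMPLE_INDEX, "shot-0001")` would select the REC copy in this made-up example. A real service would presumably also weigh server load and available bandwidth.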

From years of operational experience with the SNET remote sites, we have also found that the operational independence of a remote site is quite important in the face of unexpected and planned service outages at the original site. A typical example is the annual electricity inspection mandated by law, which requires a power outage of the whole site; this should never interrupt work at the remote sites, such as data access for analyses.

To make widely distributed data repositories practical for off-site data analyses, the data location informing service should run at each repository site so that each site can keep operating independently through any accidental or scheduled stops at other sites. To satisfy the conflicting requirements of site independence and mutual data synchronization, asynchronous multi-master, i.e., bi-directional, replication should be applied between the cooperating relational databases that serve the data location indexes.
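
The idea of asynchronous multi-master replication can be sketched with a toy in-memory model in Python. This is a simplified illustration and not how any particular RDBMS implements it: the `LocatorNode` class, the `synchronize` helper, and the last-update-wins rule used to settle conflicting writes are assumptions chosen to keep the example short.

```python
class LocatorNode:
    """Toy in-memory stand-in for one site's locator database."""

    def __init__(self, name):
        self.name = name
        self.rows = {}    # data_id -> (timestamp, origin, location)
        self.outbox = []  # local changes not yet shipped to the peer

    def write(self, data_id, location, timestamp):
        """Commit locally and queue the change; never waits for the peer."""
        change = (data_id, (timestamp, self.name, location))
        self._apply(change)
        self.outbox.append(change)

    def _apply(self, change):
        data_id, version = change
        current = self.rows.get(data_id)
        # Assumed rule: last update wins, ties broken by origin name.
        if current is None or version[:2] > current[:2]:
            self.rows[data_id] = version

    def lookup(self, data_id):
        row = self.rows.get(data_id)
        return row[2] if row else None


def synchronize(a, b):
    """Exchange queued changes in both directions once the link is back."""
    for src, dst in ((a, b), (b, a)):
        pending, src.outbox = src.outbox, []
        for change in pending:
            dst._apply(change)
```

Each node accepts writes while the link is down, which gives the site independence discussed above; calling `synchronize` afterwards brings both nodes to the same state, which gives the mutual synchronization.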

In this study, bi-directional replication between multi-master locator RDBs was tested using the LHD data system and SNET. Since the SNET distributed data system currently adopts a master-slave RDB replication architecture, its structure must be redesigned to apply multi-master bi-directional replication. Such structural refinement of the distributed databases is a common issue not only for SNET’s multiple sites but also for bi-directional replication between ITER and the REC.

Through a functional survey of popular open-source and commercial RDBMS software, we found PostgreSQL to be the most promising solution, at the moment, for such multi-site DB replication between ITER and the REC, and also between the LHD and SNET. Since ITER CODAC and the LHD have already adopted PostgreSQL as their standard RDBMS, its extension for bi-directional replication should cause fewer compatibility issues. Other popular RDB products, such as Oracle and MySQL, seem to be aimed at densely coupled cluster configurations and oriented towards mission-critical applications rather than collaborative research use.

Consequently, we decided to carry out performance tests using PostgreSQL version 9.4 with the bi-directional replication (BDR) extension. Technical surveys and comparisons of other RDB replication methods were also made before the actual performance tests.

To verify the replication performance, the throughputs of Postgres BDR have been measured on SNET using the LHD indexing databases. The result shows that the replication speeds are adequately fast compared to the network round-trip time. Thus, we can conclude that Postgres BDR is a very practical solution from both functional and performance viewpoints, for our multi-site database replication in fusion experiments.

Although BDR 1.x was open-source software based on PostgreSQL 9.4, its successors, BDR 2.0 on PostgreSQL 9.6 and BDR 3.0 on PostgreSQL 11 and 12, became non-free software licensed by a commercial company. We therefore look to further development of the standard PostgreSQL logical replication method, to which the BDR developer continues to contribute. Our investigations on this matter will also continue in the near future.

Another kind of verification test is also planned on relaying replication through more than two sites. In such a case, some selection schemes would be necessary to choose the most efficient repository and also the indexing database.

This research was conducted by Drs. Hideya Nakanishi, Noriyoshi Nakajima, and Masahiko Emoto with their research group at the National Institute for Fusion Science (NIFS), in cooperation with Dr. Kenjiro Yamanaka et al. at the National Institute of Informatics (NII), and Drs. Shinsuke Tokunaga and Yasutomo Ishii et al. at the Rokkasho Fusion Institute of the National Institutes for Quantum and Radiological Science and Technology (QST).

The research result was published in Fusion Engineering and Design, an academic journal of Elsevier BV, on 31 January 2021.