HAJournalServer

STATUS

HA is available in bigdata 1.3.0.

Background

The HAJournalServer provides a highly available replication cluster for the scale-up database deployment architecture (the Journal). The HAJournalServer provides horizontal scaling of query, not data. Since the data is fully replicated on each node, query throughput scales linearly in the size of the replication cluster. Because query relies entirely on local indices, the HAJournalServer offers the same low latency for query as the Journal. In contrast, in a scale-out deployment, the data is partitioned and distributed across the nodes of a cluster. Because of this partitioning, query evaluation must be coordinated across multiple nodes in a scale-out deployment. Due to this higher coordination overhead, scale-out query has higher latency, but can achieve higher throughput by doing more parallel work. Due to the combination of low latency query, horizontal scaling of query, and high availability, the HAJournalServer deployment model should be preferred when the data scale will be less than 50 billion triples and when low latency for individual queries is more important than throughput on high data volume queries.

High Availability

High availability is based on a quorum model and the low-level replication of write cache blocks across a pipeline of services. A highly available service exposes an RMI interface using Apache River and establishes watchers (which reflect) and actors (which influence) the distributed quorum state in Apache ZooKeeper. Sockets are used for efficient transfer of write cache blocks along the write pipeline. The services publish themselves through ZooKeeper. Services register with the quorum for a given logical service. A majority of services must form a consensus around the last commit point on the database. One of those services is elected as the leader and the others are elected as followers (collectively, these are referred to as the joined services – the services that are joined with the met quorum). Once a quorum meets, the leader services write requests while reads may be served by the leader or any of the followers. The followers are fully consistent with the leader at each commit point. If a follower cannot commit, it will drop out of the quorum and resynchronize before re-entering the quorum.

[Image: HAQuorum.jpg]

Write replication occurs at the level of 1MB cache blocks. Each cache block typically contains many records, as well as an indication of which records have been released. Writes are coalesced in the cache on the leader, leading to a very significant reduction in disk and network IO. Followers receive and relay write cache blocks and also lay them down on the local backing store. In addition, both the leader and the followers write the cache blocks onto an HALog file. The write pipeline is flushed before each commit to ensure that all services are synchronized at each commit point. A 2-phase commit protocol is used. If a majority of the joined services votes for a commit, then the root blocks are applied. Otherwise the write set is discarded. This provides an ACID guarantee for the highly available replication cluster.

HALog files play an important role in the HA architecture. Each HALog file contains the entire write set for a commit point, together with the opening and closing root blocks for that commit point. HALog files provide the basis for incremental backup, online resynchronization of services after a temporary disconnect, and online disaster recovery of a service from the other services in a quorum. HALog files are retained until the later of (a) their capture by an online backup mechanism, and (b) a fully met quorum.

Online resynchronization is achieved by replaying the HALog files from the leader for the missing commit points. The service will go through a local commit point for each HALog file it replays. Once it catches up it will join the already met quorum. If any HALog files are unavailable or corrupt, then an online rebuild replicates the leader's committed state, after which the service enters the resynchronization protocol. These processes are automatic.

Online backup uses the same mechanisms. Incremental backups request any new HALog files, and write them into a locally accessible directory. Full backups request a copy of the leader’s backing store. The replication cluster remains online during backups. Restore is an offline process. The desired full backup and any subsequent HALog files are copied into the data directory of the service. When the service starts, it will apply all HALog files for commit points more recent than the last commit point on the Journal. Once the HALog files have been replayed, the service will seek a consensus (if no quorum is met) or attempt to resynchronize and join an already met quorum.

Requirements

what description
Java Java 1.6+
clocks Clocks should be synchronized using an appropriate mechanism, e.g., ntp. The leader is responsible for assigning timestamps for transactions, so those timestamps will always go forward on a given node. However, if clocks are not synchronized, then log messages can be difficult to interpret and it may introduce artifacts around the visibility periods for historical commit points. Synchronize the clocks.

Configuration

Configuration is performed primarily through the HAJournal.config file. This is a jini/river configuration file. It is based on a subset of the Java language.

  src/resources/HAJournal/HAJournal.config

The critical fields are all listed near the top of the file and are documented in the file. Typical installations will only need to edit the 'bigdata' and 'org.apache.zookeeper.ZooKeeper' sections.

org.apache.zookeeper.ZooKeeper

This is the configuration section for Apache Zookeeper. Zookeeper is used to coordinate the distributed state transitions in the quorum.

field description
zroot Root znode for the federation instance.
servers A comma separated list of host:port pairs, where the port is the client port for the zookeeper server instance.
sessionTimeout The Apache ZooKeeper session timeout (optional property)
acl ACL for the zookeeper nodes created by the bigdata service. zookeeper ACLs are not transmitted over secure channels and are placed into plain text Configuration files.

See the Apache Zookeeper Administrator Manual for more information.

com.bigdata.service.jini.JiniClient

This provides the information required to connect to the jini/river service registrars. You should not need to edit this section:

field description
groups The discovery groups. For either unicast discovery or setups with multiple federations, you MUST specify this field.
locators An array of one or more unicast LookupLocators. Each LookupLocator accepts a URI of the form jini://host/ or jini://host:port/. This field MAY be an empty array if you want to use multicast discovery AND you have specified the groups as LookupDiscovery.ALL_GROUPS (a null).
entries Optional metadata entries for the service.

com.bigdata.journal.jini.ha.HAJournalServer

This is the configuration section for the HAJournalServer.

field description
serviceDir This directory has the configuration for the HAJournalServer process. By default, it is also the parent of the other directories. It should be replicated.
logicalServiceId This field identifies the logical service to which this HAJournalServer instance belongs. Instances that belong to different logical services are not part of the same HA Replication Cluster.
writePipelineAddr The internet address at which this service will listen for replicated writes.
replicationFactor The number of services that make up the quorum. For High Availability fail-over features, this must be an odd integer GTE THREE (3). The quorum will meet when a majority of the services are in consensus. To use the HAJournalService to support HA Backup features without High Availability, you may run in a HA(1) configuration by setting this value to 1.
jettyXml The location of the jetty.xml file that will be used to configure jetty (since v1.3.1).
haLogDir This directory contains the write set for each commit point. It should be replicated to durable media (e.g., offsite). Old HALog files will be automatically purged when they are no longer required to satisfy the restorePolicy.
snapshotDir This directory contains the snapshot files. It should be replicated to durable media (e.g., offsite). New snapshot files are written periodically and are only read during disaster recovery. Old snapshot files will be automatically purged when they are no longer required to satisfy the restorePolicy.
snapshotPolicy An ISnapshotPolicy that specifies when a new snapshot (aka a full backup) will be made. Snapshots provide checkpoints that can be quickly recovered. Offsite replication of snapshots ensures that data is not lost in the case of a single site disaster. The ISnapshotPolicy is responsible for scheduling snapshots and also specifies constraints on whether a snapshot needs to be taken. By default, a snapshot will be taken the first time the HAJournalServer joins with a met quorum. This provides the initial restore point. Subsequent restore points are provided by HALogs. As new snapshots are taken, old snapshots and old HALogs may be released based on the IRestorePolicy (see below). The HARestore utility may be used to apply HALog files to a snapshot, rolling it forward to some desired commit point.

Use DefaultSnapshotPolicy if you want to take automatic snapshots. For example, the following requests a full backup every day at 0200, but only if the size of the HALog files on the disk is at least 20 percent of the size of the Journal file on the disk.

 snapshotPolicy = new DefaultSnapshotPolicy( 200 /*hhmm*/ , 20 /*percentLogSize*/ );

A percentLogSize of ZERO (0) will trigger a snapshot regardless of the size of the journal and the bytes on the disk for the HALog files (but not if a snapshot already exists (or is being prepared) for the current commit point). A value of 100 will trigger the snapshot if the HALog files occupy as much space on the disk as the Journal. Other values may be used as appropriate. However, since the HALogs can be applied to an older snapshot in order to roll the snapshot forward to a desired commit point, a new snapshot should only be taken when there is an explicit desire to consolidate the database state in a snapshot file, when the time required to replay the HALogs would cause a noticeable delay on startup, or when the space on the disk for the HALogs becomes significant.

Use NoSnapshotPolicy if you want to schedule snapshots manually. As a special case, the NoSnapshotPolicy NEVER takes a snapshot. You are completely responsible for scheduling all snapshots, including the snapshots of the initial service deployment.

 snapshotPolicy = new NoSnapshotPolicy();

Once a snapshot exists, HALogs WILL NOT be pruned until you either (a) take another snapshot; or (b) remove the existing snapshot. Thus, if you want to operate without backups most of the time, but occasionally will make a snapshot to copy to a remote location, you must remove the snapshot in order to release storage associated with the snapshot and to stop the accumulation of HALogs.

Snapshots can also be scheduled using the REST API:

 GET /status?snapshot[=percentLogSize]

This makes it easy to schedule snapshots from cron, etc. The percentLogSize value is optional and defaults to ZERO (0). This mechanism may be used in combination with the NoSnapshotPolicy to take complete control over when snapshots are scheduled.

Note: Snapshots should be staggered across the nodes in a replication cluster in order to avoid a simultaneous increase in resource utilization on all nodes. For example, with an HA3 cluster they could be scheduled at 12am, 2am, and 4am on the different nodes.
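
For example, staggered snapshot requests can be scheduled from cron using the REST API described above. The following is a minimal sketch only: the host, port, and URL prefix are assumptions (9090 is the documented default jetty.port; the path prefix must match your deployment).

  # crontab entry on the first node (the second and third nodes would use "0 2 * * *" and "0 4 * * *" to stagger the load):
  # request a snapshot at 12am, but only if the HALogs are at least 20% of the size of the journal on the disk.
  0 0 * * * curl -s "http://localhost:9090/status?snapshot=20" > /dev/null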

restorePolicy Once a snapshot is taken, the IRestorePolicy specifies how long that snapshot will be retained. Each snapshot makes it possible to restore a specific commit point. HALogs captured after that snapshot (and before the next snapshot) are used to roll forward the snapshot to recover transactions after that commit point. Thus, you can always restore any commit point between the oldest snapshot and the most recent commit point on the database. The most recent snapshot is NEVER released. The HARestore utility may be used to apply HALog files to a snapshot, rolling it forward to some desired commit point.

The DefaultRestorePolicy retains snapshots and/or HALogs as necessary in order to preserve the ability to restore commit points up to some specified age. For example, the following will retain backups sufficient to restore any commit point up to 7 days old.

 restorePolicy = new DefaultRestorePolicy(ConfigMath.d2ms(7));

You may also specify the minimum number of snapshots to retain and the minimum number of commit points to retain. Thus, the fully qualified constructor invocation could look as follows:

 restorePolicy = new DefaultRestorePolicy(
  ConfigMath.d2ms(7), // The minimum age before a commit point can no longer be restored from local backups.
  3, // The minimum number of snapshots to retain.
  500 // The minimum number of commit points that must be restorable from the retained snapshots and HALogs.
  );

When a snapshot is released, the HALog files for commit points LTE the commit counter on the next snapshot are also released since those HALog files can no longer be applied to the deleted snapshot. However, HALog files are NOT deleted until the quorum is fully met, thus allowing disconnected follower(s) to automatically resynchronize with the quorum leader (if we violated this principle then a service that was not joined with the met quorum might not be able to automatically resynchronize and would instead need to undergo a disaster rebuild). Once the quorum is fully met, HALog files older than the earliest retained snapshot will be deleted.

Regardless of the IRestorePolicy in effect, snapshots and HALog files will not be removed unless the quorum is fully met (e.g., 3 out of 3 HAJournalServers are up and running).

com.bigdata.rdf.sail.webapp.NanoSparqlServer

The configuration section for the NanoSparqlServer.

This has been replaced in 1.3.1 by the jettyXml property on the HAJournalServer component.

field description
namespace The namespace of the default KB instance.
queryThreadPoolSize The number of concurrent queries that will be processed by the end point. Additional queries will block until a thread is available to execute them. This parameter serves as a throttle on the resource demand on the SPARQL end point(s).
port The port on which jetty will run (1.3.0).
servletContextListenerClass A class that extends BigdataRDFServletContextListener (optional). This offers applications a means to hook the ServletContextListener methods.

Multiple HA Replication Clusters

It is possible to run multiple HA replication clusters in the same network if you pay attention to the following configuration properties.

Each HA Replication Cluster has a distinct logicalServiceId that is shared by all members of the quorum. This is specified in the configuration file. You can have multiple HA Replication Clusters simply by assigning each one a distinct logicalServiceId.

  private static logicalServiceId = "HAJournal-1";

Another important configuration property is the fedname. Services with different fednames will not communicate because they will have different root znodes in Apache ZooKeeper and different discovery groups in Apache River.

  private static fedname = "benchmark";

Deployment

Note: Best practices require that you establish a separate set of machines (or machine images) running a ZooKeeper ensemble.

The deployment procedure will start the following services on each server. These services will be executed within a single JVM using the Apache River ServiceStarter. Scripts are provided to start and stop the ServiceStarter, which will start and stop the services running inside of that JVM. It is possible to configure the ServiceStarter to start other services if you need to run additional services on each node.

  1. ClassServer (for downloadable code required by Reggie).
  2. Reggie (the Apache River service registrar).
  3. HAJournalServer (the bigdata database process).

Basic Deployment

The following pattern may be used to create and configure a basic deployment:

# Checkout the code from SVN.
svn checkout http://svn.code.sf.net/p/bigdata/code/branches/BIGDATA_RELEASE_1_3_0
cd BIGDATA_RELEASE_1_3_0
# Build the deployment artifact (REL.version.tgz).
ant deploy-artifact

## Unpack the release artifact into the installation directory.
cd /var/lib
tar xvfz REL.version.tgz

## Configure basic environment variables.  Obviously, you must use your own parameters for LOCATORS and ZK_SERVERS.
# Name of the federation of services (controls the Apache River GROUPS).
export FEDNAME=my-cluster-1
# Path for local storage for this federation of services.
export FED_DIR=/var/lib/bigdata
# Name of the replication cluster to which this HAJournalServer will belong.
export LOGICAL_SERVICE_ID=HA-Replication-Cluster-1
# Where to find the Apache River service registrars (can also use multicast).
export LOCATORS="jini://bigdata15/,jini://bigdata16/,jini://bigdata17/"
# Where to find the Apache Zookeeper ensemble.
export ZK_SERVERS="bigdata15:2081,bigdata16:2081,bigdata17:2081"

## Start the local services.
dist/bigdata/bin/startHAServices

## Review the startup log for problems.
tail -f HAJournalServer.log | egrep '(ERROR|WARN)'

Custom Deployment

If you want to create a custom configuration, then unpack the REL.version.tgz file, edit the configuration files, and then create a new archive with your custom configuration. This will allow you to deploy a configuration with specific defaults for all of the parameters that were defined above. In addition, you can customize the Configuration file for the HAJournalServer. This allows you to control the configuration of the default KB instance, the IRestorePolicy, the ISnapshotPolicy and all of the other configuration parameters defined on this page. The basic structure of the REL.version.tgz file is as follows:

path description
bin Command line utilities.
bin/startHAServices Script to start the ServiceStarter (used by the init.d style script if installed as a service).
bigdata/bin/disco-tool Command line utility for the com.bigdata.disco.DiscoveryTool.
bin/config Configuration files for the command line utilities in bin/.
lib, lib-dl, lib-ext The JARS for bigdata and its dependencies. lib-dl is the downloadable jars exported by the ClassServer and used by Reggie.
var/config bigdata configuration files.
var/config/build.properties A copy of the top level project build.properties file (unused by the current HA deployment model).
var/config/logging Logging configuration files for log4j (used by bigdata) and java.util.logging (used by Apache River).
var/config/policy JVM policy files.
var/config/policy/policy.all A default policy file with all permissions granted.
var/config/jini Apache River configuration files for bigdata, reggie, etc.
var/config/jini/HAJournal.config Apache River configuration file for the HAJournalServer.
var/config/jini/reggie.config Apache River configuration file for the service registrar (reggie).
var/config/jini/startAll.config Apache River configuration file for the services started by the ServiceStarter (ClassServer, Reggie, HAJournalServer).
etc Files for installation as a Linux service
doc Documentation
doc/LEGAL License files for dependencies.
doc/LICENSE.txt bigdata license file.
doc/NOTICE Copyright NOTICE files.
etc/init.d/bigdataHA HA services start/stop script.
etc/default/bigdataHA Configuration file for the init.d style script.

Durable Data

what environment variable default description
fedname FEDNAME benchmark The name of the federation (also constrains the discovery groups and provides a zk namespace). Only services within the same federation can communicate.
logicalServiceId LOGICAL_SERVICE_ID HAJournal-1 The logical service identifier shared by all members of the quorum. This identifier names the HA replication cluster as a whole.
serviceDir N/A ${FED_DIR}/${LOGICAL_SERVICE_ID}/HAJournalServer The name of the service directory for an HAJournalServer process.
dataDir DATA_DIR serviceDir The directory containing the journal file (the main database backing store). This directory MUST be mounted on a fast disk (SSD or PCIe flash preferred).
haLogDir HALOG_DIR serviceDir/HALog The directory containing the transaction log files (each HALog file is a single transaction). This directory MUST be mounted on a durable disk.
snapshotDir SNAPSHOT_DIR serviceDir/snapshot The directory containing the snapshot files (full backups of the journal). This directory MUST be mounted on a durable disk.

Installation as a service (linux)

The archive also includes an init.d style script:

bigdataHA (start|stop|status|restart)

This script relies on /etc/default/bigdataHA, which MUST define the following variables to specify the location of the installed scripts (a sketch of this file is shown below). These variables SHOULD use absolute path names.

binDir
pidFile
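
The following is a minimal sketch of such a file. The paths and the meaning of pidFile are assumptions based on the Basic Deployment section above; substitute the locations from your own installation.

  # /etc/default/bigdataHA (example values only)
  binDir=/var/lib/dist/bigdata/bin        # directory containing the startHAServices script (assumed path)
  pidFile=/var/lib/bigdata/bigdataHA.pid  # file in which the pid of the ServiceStarter is recorded (assumption)
  # Most of the environment variables described on this page (FEDNAME, LOCATORS,
  # ZK_SERVERS, NSS_PORT, etc.) may also be overridden here.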

This script could be used by an rpm or other installer to install the HA replication cluster as an init.d style service on a linux platform.
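
A manual installation might look like the following sketch, assuming the archive was unpacked under /var/lib as in the Basic Deployment section. The chkconfig call is a RedHat-style example; other platforms would use update-rc.d or an equivalent.

  cd /var/lib                                # installation directory from the Basic Deployment section
  cp etc/init.d/bigdataHA /etc/init.d/bigdataHA
  cp etc/default/bigdataHA /etc/default/bigdataHA
  chmod +x /etc/init.d/bigdataHA
  chkconfig --add bigdataHA                  # RedHat-style; Debian/Ubuntu would use update-rc.d
  /etc/init.d/bigdataHA start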

Load Balancer

Available since 1.3.1.

See the HALoadBalancer page

Ports, Firewalls and Security

The HAJournalServer deployment requires a number of components and ports. This section explains what those ports are, how to configure them, and how to secure them. The table identifies the file in which the configured value is specified. However, nearly all configuration properties can be overridden using environment variables in /etc/default/bigdataHA. Those environment variables are communicated to the JVM that runs the HAJournalServer by the startHAServices script. The environment variable names in /etc/default/bigdataHA and startHAServices are sometimes different due to the difficulty of directly expressing variables with embedded "." characters, such as "jetty.port", in a command line shell. When the name of the environment variable for startHAServices differs, it is given in parentheses, e.g., (NSS_PORT). The translation to the actual variable names is performed by startHAServices when it constructs the java command that launches the HAJournalServer process.

environment variable default config file description
jetty.port (NSS_PORT) 9090 jetty.xml The port at which the embedded jetty server will respond for HTTPD connections to the NanoSparqlServer.
jetty.jmxrmiport 1090 jetty.xml The port at which the jetty server will self-report MBeans over RMI (if enabled in jetty.xml).
RMI_PORT 9080 HAJournal.config The port at which the HAGlue interface will be exported by the HAJournalServer. This interface is used by the HAJournalServer instances to coordinate their activities.
HA_PORT 9090 HAJournal.config The port at which the HAJournalServer will listen for replicated writes. The payload for the write replication messages is transmitted over this port and consists of low-level cache blocks that are written onto the HALog files and onto the backing database. The RMI_PORT is used to communicate the RMI messages that coordinate the replicated writes (among other things).
ZK_SERVERS 2081 HAJournal.config, zoo.cfg The client port for zookeeper (the port at which zookeeper clients will attempt to contact the zookeeper server). See the Zookeeper Administrator's Guide. This port must be open on each machine running a zookeeper server process to accept requests from the HAJournalServer hosts.
N/A 2888:3888 zoo.cfg The ports used by the zookeeper ensemble (server port and leader election port). These are configured on the machines that actually run zookeeper. See the Zookeeper Administrator's Guide. These ports must be open on each machine running a zookeeper server process to accept requests from the other zookeeper server hosts. You need to separately install and configure the zookeeper server(s).
N/A 239.2.11.71 HAJournal.config The multicast address for ganglia (optional). This option may be specified using the Journal properties from within the HAJournal.config file. See GangliaPlugIn for more information.
N/A 8649 HAJournal.config The port for ganglia (optional). This option may be specified using the Journal properties from within the HAJournal.config file. See GangliaPlugIn for more information.
- TODO Securing zookeeper.
- TODO Securing the write replication port.
- TODO Securing the RMI_PORT (HAGlue)
- TODO Securing the LUS.
- TODO Securing the ClassServer.

Monitoring

It is important to establish proper monitoring of the HA replication cluster.

There are several different monitoring techniques that are available:

  1. GET .../status provides detailed reporting on the HA replication cluster.
  2. GET .../status?HA will report {Leader, Follower, or NotReady}. A 404 will be reported if the REST API is not running. A shell sketch based on this request is given below the list.
  3. An RMI against the HAGlue.getHAStatus() method published by the server process may also be used to decide whether a service is a Leader, Follower, or NotReady.
  4. The HAJournalServer self-publishes performance counters to ganglia (if this option is enabled in its Configuration file). When enabled, the published metrics can be reviewed using the tool chain for that cluster monitoring system.
  5. The HAJournalServer self-publishes performance counters through the /counters page (HTTP GET).
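
The following is a minimal shell sketch of a health probe based on item 2 above. The host, port, and URL prefix are assumptions (9090 is the documented default jetty.port; the status path prefix must match your deployment).

  #!/bin/bash
  # Probe the HA status of a single node via GET .../status?HA.
  HOST=${1:-localhost}
  STATUS=$(curl -s -f "http://${HOST}:9090/status?HA")
  case "$STATUS" in
    *Leader*)   echo "$HOST: quorum leader" ;;
    *Follower*) echo "$HOST: follower" ;;
    *NotReady*) echo "$HOST: NotReady (not joined with a met quorum)"; exit 1 ;;
    *)          echo "$HOST: REST API not responding (404 or connection refused)"; exit 2 ;;
  esac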

Critical Alerts

At a minimum, monitoring and alerts should be established for:

  1. counters?path=%2FVolumes - This is the set of storage volumes used by the service. A human operator MUST be alerted when any of these storage volumes is nearly out of disk space. Critical volumes include Data, HALog, and Snapshot. These MUST have sufficient free space for reliable operations.
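
The Volumes counters can be retrieved via GET .../counters?path=%2FVolumes as noted above. As one way to generate the required alert, a simple cron-driven disk-space check can be run on each node. This is a sketch only: the directories are built from the example values in the Durable Data table and must be adjusted for your installation, and the echo is a placeholder for your alerting mechanism.

  #!/bin/bash
  # Alert when the volumes backing the dataDir, haLogDir and snapshotDir are nearly full.
  SERVICE_DIR=/var/lib/bigdata/HA-Replication-Cluster-1/HAJournalServer   # example serviceDir
  THRESHOLD=90                                                            # alert above this percent full
  for dir in "$SERVICE_DIR" "$SERVICE_DIR/HALog" "$SERVICE_DIR/snapshot"; do
    used=$(df -P "$dir" | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
    if [ "$used" -gt "$THRESHOLD" ]; then
      echo "ALERT: $dir is ${used}% full"    # replace with your alerting mechanism
    fi
  done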

Self-Healing Behaviors

The HA replication cluster has several self-healing behaviors, which are described below. In addition, you can establish a policy for automated backups and restore the database from those backups.

RESYNC

During normal operation, low level write replication ensures that each service joined with the met quorum observes all writes on the database. However, if one service is temporarily unavailable, it will automatically catch up (resynchronize) when it rejoins the quorum, including after a restart of the node or the service. The resynchronization protocol examines the current commit point on the local service and the commit point of the quorum consensus. Any missing commit points are automatically replicated from the quorum leader and applied to the local service.

Resynchronization depends on the existence of the corresponding HALog files on the leader for the missing commit points. Those HALog files are normally maintained by ALL services until (a) the quorum goes through a fully met 2-phase commit; and (b) the IRestorePolicy permits the associated commit points to be released. However, if the necessary HALog files are unavailable or corrupt, then the service will perform an automatic REBUILD.

REBUILD

Rebuild provides an online disaster recovery behavior. If the service is not able to resynchronize due to missing or corrupt files (see RESYNC) then a REBUILD will be triggered. The rebuild operation requests a snapshot of the journal from the leader and installs it on the failed service. Once the snapshot has been installed, the service enters RESYNC again to catch up on any new commit points that were not part of the installed snapshot.

HA Backup

HA Backup combines the HALog files that are produced for each commit point on the database with periodic snapshots that capture the state of the database at a specific commit point. HALog files are essentially incremental backups. Snapshots are full backups. While HALog files are automatically replicated to each server, snapshots are taken independently on each server. Snapshots can be taken automatically or scripted, e.g., using a cron job.

How HA Backup Works

An HALog file is produced for each write set. Each HALog file has an opening and a closing root block. These root blocks are identical until the file is closed, at which point the closing root block is written. This provides an atomic decision protocol for identifying when an HALog is complete.

The low level write cache blocks are replicated from the leader along the write pipeline, where each service records them in its local HALog file and also applies them to the backing store. During the 2-phase commit protocol, the leader flushes the write cache which causes the write cache blocks to be replicated to all services in the quorum. The new root block is then prepared by the leader and sent to the joined services in a 2-Phase PREPARE message. If the majority of the joined services vote YES, then the 2-phase COMMIT message instructs them to write the new root block on the HALog and backing store.

Each HALog file serves as an "incremental" backup. It can be replayed and used to advance a database from one commit point to the next. Logically, if all HALog files are preserved, they could be replayed against an empty Journal to reconstruct any desired commit point on the database. In practice, HALogs are supplemented by snapshots of the Journal. Snapshots correspond to full backups. HALogs correspond to incremental backups.

HA Backup creates a compressed snapshot of the Journal file. A read-only transaction is started to prevent recycling of committed state and the root blocks are atomically captured from the journal. A temporary file is created in the snapshot directory and the root blocks and the data from the journal are written onto that file as a compressed stream. If the snapshot is successful, the temporary file is renamed onto the appropriate file name for the captured commit point. Finally, the read-only transaction is released so recycling may resume. The journal remains available for both read and write during this process.

The user can create a backup policy which specifies a period of time (the RESTORE WINDOW) during which committed database states remain restorable. Any commit point in that time span may be restored by starting from the appropriate snapshot and applying incremental HALog files until the desired commit point is reached.

The decision to take a snapshot is made locally by each service. The default snapshot policy is based on the amount of data on the disk in the incremental HALog files and the #of HALog files. The time to restore a given commit point is a direct function of the amount of data that must be replayed from the HALogs, thus the data on the disk dominates this decision making. The node remains fully online while making a snapshot. There is no need to make a global decision about when a node executes a snapshot. However, the node MUST NOT release a snapshot that would be required to reconstruct a commit point protected by its backup policy (the RESTORE WINDOW). Since HALogs represent deltas over snapshots, the HALogs for commit points made after snapshot A but before snapshot B will be released when snapshot A is released.

The combination of local snapshots and HALogs provides the ability to fallback to any previous committed state of the database. Additional measures SHOULD be taken to ensure that the database may be recovered if the site is lost. The following outlines the various directories that are maintained by the HAJournalServer and provides guidance for offsite backup.

Requesting a Snapshot

As described above, the HAJournalServer automatically records an HALog file for each commit point on the database. In addition, you may establish an automated policy for creating full backups (called snapshots) for each node in the HA replication cluster. Snapshots are GZIP copies of the Journal that are consistent with the current committed state of the Journal at the moment that the snapshot was begun. Snapshots do not block any other operations on the database - it remains open for both readers and writers. Snapshots are local to each node and are written into the configured snapshot directory. The snapshots and HALog files required to restore the database to a specific commit point are aged out according to the configured ISnapshotPolicy and the IRestorePolicy.

Snapshots can be triggered by external processes using:

  GET .../status?snapshot[=percentLogSize]

This will initiate a snapshot if none is running and if the accumulated HALogs since the last snapshot exceed some percentage of the size of the journal on the disk. The percentLogSize parameter is a percentage expressed as an integer. It is optional and defaults to ZERO (0) when using this GET request. This makes it easy to trigger full backups from a cron job. This request will NOT trigger a snapshot if one is already running or if one already exists for the current commit point. However, it is advisable to specify a non-zero value for percentLogSize in order to avoid generating unnecessary full backups (HALog files can be used to roll forward from a snapshot so it is not necessary to take a full backup unless the replay time for the HALog files or their size on the disk is substantial.)

Backup Media and Offsite Storage

The service directory, snapshot files, and HALog files SHOULD be copied to backup media on a regular basis, with copies of the backup media stored offsite.

The following files SHOULD NOT be included when copying files to backup media:

  1. serviceDir/bigdata-ha.jnl - This is the backing store for the database. It CAN NOT be correctly copied if there are writers running against the database. The proper procedure is to take periodic snapshots and then copy the snapshots and HALog files to the backup media.
  2. serviceDir/pid - This is the process identifier.
  3. serviceDir/.lock - This is the process lock file.
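
For example, the snapshot and HALog directories can be copied to backup media with rsync, excluding the files listed above. This is a sketch only; the serviceDir and the destination are hypothetical and must be adjusted for your installation.

  SERVICE_DIR=/var/lib/bigdata/HA-Replication-Cluster-1/HAJournalServer   # example serviceDir
  rsync -av \
    --exclude 'bigdata-ha.jnl' \
    --exclude 'pid' \
    --exclude '.lock' \
    "$SERVICE_DIR/" backup-host:/backups/bigdataHA/$(hostname)/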

Offline Restore

A restore from backups is only required if there is a disastrous loss of data for the HA replication cluster or if there is a desire to roll back the database to some specific point in time. During normal operations, the self-healing behaviors of the HA replication cluster will automatically handle issues such as transient down time of one or more services.

  1. The service MUST be offline (not running) when you actually install the restored journal.
  2. Make backups of the snapshots and HALog files before beginning a restore procedure.
  3. If the service is running, then it can delete snapshots and HALog files at any time according to its configured ISnapshotPolicy and IRestorePolicy.

Each snapshot file corresponds to a specific commit point. The files are named for the commit points, so if you know the commit point that you want to restore you can just select the snapshot whose file name is the largest commit counter LTE the desired commit point. For example, if you have the following snapshots:

  snapshotDir/00000000000000000002.jnl.gz
  snapshotDir/00000000000000000020.jnl.gz
  snapshotDir/00000000000000000055.jnl.gz

and you wished to roll back the database to commit point 40, then you would start with the snapshot:

  snapshotDir/00000000000000000020.jnl.gz

and then apply the HALog files until you reach the desired commit point. The HARestore utility can be used to automatically apply HALog files in order to roll forward to a specific commit point. The snapshot files are just GZIP copies of the journal. If all you want to do is restore a specific snapshot, you can just uncompress the snapshot and copy it over the existing journal file.

BEFORE you restart the service, remove any snapshots or HALog files whose commit counter is GT the restore commit point. If you fail to do this, then the HAJournalServer will automatically apply HALog files for the next commit point (if they exist) on startup. You SHOULD NOT leave snapshots or HALog files in place that correspond to a future history when your goal is to roll back the database to an older commit point.
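
The following sketch restores exactly the state captured in one snapshot, using the example file names above. The paths and the init.d invocation are assumptions that must be adjusted for your installation. To roll forward past the snapshot, leave the HALog files for the desired commit points in place (or use the HARestore utility) instead of deleting them.

  # 1. Stop the service before touching the journal (required).
  /etc/init.d/bigdataHA stop

  # 2. Make safety copies of the snapshots and HALogs first.
  SERVICE_DIR=/var/lib/bigdata/HA-Replication-Cluster-1/HAJournalServer   # example serviceDir
  cp -a "$SERVICE_DIR/snapshot" "$SERVICE_DIR/HALog" /some/safe/location/

  # 3. Install the chosen snapshot over the existing journal (snapshots are GZIP copies of the journal).
  gunzip -c "$SERVICE_DIR/snapshot/00000000000000000020.jnl.gz" > "$SERVICE_DIR/bigdata-ha.jnl"

  # 4. Remove snapshots and HALog files whose commit counter is GT the restore point,
  #    otherwise the server will roll forward past the desired commit point on startup.

  # 5. Restart the service.
  /etc/init.d/bigdataHA start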

Additional Resources

The following is a list of additional resources providing more depth on the HA replication cluster. In addition to these resources, the main RMI entry point into the HA replication cluster is the HAGlue interface.

  1. http://www.bigdata.com/bigdata/whitepapers/semtech_ha_deck.pdf (Slide deck on the HA architecture)
  2. http://www.bigdata.com/whitepapers/Bigdata-HA-Quorum-Detailed-Design.pdf (Detailed design for the zookeeper integration)
  3. http://docs.google.com/presentation/d/1IdKQaBouV-a3Bjblk8PtXk-L29Rf_8VjInw7xWKwMDA/edit?usp=sharing (State Transitions for the HAJournalServer)