[RecommendationEngine] Using database replication for offline recommendations #mysql #data

Ecommerce in the real world

With all things analytics based they can slow down databases with the huge amount of reads/writes that are performed. Initial tests with the Recommendation Engine demo showed that using a JDBC data model was slow on a per user recommendation against a csv file based dataset.

Ultimately with all these things we don’t want to annoy or distrupt the customer experience on the website side of things or the point of sale.

Database Replication

Replication is preferred in terms of having the master database deal with all the realworld tasks (ecommerce, point of sale and so on) and a replicated slave database for the recommendation engine to the processing.

The diagram below shows the “live” web server (with the ecommerce site) and the live database in white, the slave server with the replicated database and the recommendation engine in the blue.  With this setup the slave can do the analytics on the customer data while not impacting the live server to the customers.  



Now for the wake up call…

Replication is communication between two database servers (the walkthrough I’ve done is for MySQL), what happens on the master is copied to the slave.

There is one big issue to keep in mind and that’s how the two servers communicate with each other. Straight master database user to slave database user is a dangerous way of doing things and doesn’t do a good job in protecting the data. Ideally each database server should only run locally and use secure shell (ssh) to communicate with each other.

This means using a ssh tunnel from the slave to the master to get the replication working.

The walk through

A few assumptions. Firstly that you’re using MySQL for your database needs. Secondly that you are running on some form of Unix/Linux based server. This should work on a Windows based system but I’ve never tried it, nor shall I in the near future.

First things first, on the slave server create a ssh key, passwordless. This will save to the .ssh/ folder in a file called id_rsa and id_rsa.pub.

$ ssh-keygen -t rsa

Now copy this to the master server. If your ssh is running under a different port number you are better off to use the server address in quotes with the port number within the quotes. The ssh-copy-id is a script and not an actual program.

$ ssh-copy-id 'user@master.myserver.com -p [port]'

Now create the ssh tunnel from the slave.myserver.com to the master.

$ ssh -p [port] master.myserver.com -L 3305: -N

To find out if it’s working you can run a mysql client from the slave server to see if you can access the master database.

$ mysql -h -P 3305 -uYourUser -p

On the master server edit the /etc/mysql/my.cnf file and ensure the following are in place. Replication requires binary logging is enabled for MySQL. For other database servers you’ll have to read the documentation.


Now create a user specific for the replication slave to access. Then grant this user permission to access the master db as a replicated slave.

CREATE USER 'repuser'@'' IDENTIFIED BY 'reppass';GRANT REPLICATION SLAVE ON *.* TO 'repuser'@'';

On the slave DB edit the /etc/mysql/my.cnf. You can name the databases you want to replicate with the “replicate-do-db” setting. You can list as many as you want.


Make sure the table structure is in place on the slave as well as the master. Now restart both servers and attempt to insert data into the master server and make sure the results are appeari
ng on the slave.

The recommendation engine

On the slave server you can now use Sqoop and the recommendation engine to append the transactions and recommend products back to the user.

The synchronisation works on way, the master db records an entry and the slave replicates it. Writing the recommendations to the slave won’t work their way back to the master.

One way to do this though is to rig the recommendation engine to write out a csv file of userid/itemid pairs to a file and as the ssh tunnel is in place import these back to the master db.


2 responses to “[RecommendationEngine] Using database replication for offline recommendations #mysql #data”

  1. I’d be wary of using ssh tunnels to host replication traffic as it’s introducing another failure point that probably isn’t necessary. Typically, slaves are in the same datacentre as the master(s) and you should just use separate internal private subnets for replication and not worry about encryption or compression. This also makes it easier to fail over a slave if the master goes down.If the slave is remote, eg: dedicated for batch or analytical purposes as above, then a VPN is a more stable connectivity solution. ssh tunnels have a tendency to collapse over time I’ve found.If you must use tunnels then you might want -oTCPKeepAlive=yes as an argument to ssh as well.

  2. Thanks John, all valid points as ever. This was more a documented experiment as "you must do it like this". I just had two VPS servers kicking about to play with :)Cheers

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: