Decentralized Application

Architecture & Development

Setup Docker on Ubuntu 14.04

Overview
We are entering an age of open cloud infrastructure and application portability. Applications and services must scale in and out on demand and run across multiple host environments. To achieve this, applications need to run inside containers, and those containers must work across several hosts and be portable to any cloud environment.

Docker is a container-based software framework for automating the deployment of applications; it makes it easy to partition a single host into multiple containers. However, many applications require resources beyond a single host, and real-world deployments need multiple hosts for resilience, fault tolerance and easy scaling.
Installation
These instructions set up the Docker version packaged with Ubuntu 14.04 (docker.io), not the latest release from Docker. Run the following commands.
  • sudo apt-get update
  • sudo apt-get install docker.io
  • sudo ln -sf /usr/bin/docker.io /usr/local/bin/docker
  • sudo sed -i '$acomplete -F _docker docker' /etc/bash_completion.d/docker.io

To set up the latest Docker release instead, add the Docker repository key to your local keychain and run the install commands.
  • $ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 36A1D7869245C8950F966E92D8576A8BA88D21E9
  • $ sudo sh -c "echo deb https://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"
  • $ sudo apt-get update
  • $ sudo apt-get install lxc-docker

After installing Docker you can search for community images. For example, to search for Debian images:
  • docker search debian
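
To go one step further, you can pull one of those images and start a container from it. A minimal sketch (the image name is just an example):
  • docker pull debian
  • docker run -it debian /bin/bash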

Setup ActiveMQ, Zookeeper, and Replicated LevelDB running on JDK 8 and CentOS

This is a step-by-step guide to setting up ActiveMQ with replicated LevelDB persistence backed by ZooKeeper. CentOS is used for the servers; ZooKeeper coordinates the LevelDB replication to support master/slave ActiveMQ brokers. Three VMware instances are used, each with 2 CPU cores, 2 GB RAM and 20 GB of disk space. For simplicity, stop the iptables service (firewall) in CentOS. If iptables is required, you need to open a set of ports; the port numbers are listed in the pre-install section.

Overview
The ActiveMQ cluster environment includes the following:
1 Three VM instances with CentOS and JDK 8.
2 Three ActiveMQ instances.
3 Three Zookeeper instances.



Pre-install

The following ports must be opened in the iptables host firewall (a sample set of iptables rules is shown after this list).
  • Zookeeper
    • 2181 – the port that clients will use to connect to the ZK ensemble
    • 2888 – port used by ZK for quorum election
    • 3888 – port used by ZK for leader election
  • ActiveMQ
    • 61616 – default Openwire port
    • 8161 – Jetty port for web console
    • 61619 – LevelDB peer-to-peer replication port for ActiveMQ slaves.
  • To check Iptables status.
    • service iptables status
  • To stop iptables service
    • service iptables save
    • service iptables stop
    • chkconfig iptables off
  • To start again 
    • service iptables start
    • chkconfig iptables on
  • Set up a proper hostname by editing the following files:
    • Update the HOSTNAME value in /etc/sysconfig/network. E.g.: HOSTNAME=msgq1.dev.int
    • Add the host name with the IP address of the machine in /etc/hosts. E.g.: 192.168.163.160   msgq1.dev.int  msgqa1
    • Restart the instance and repeat the same process on each instance.
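
If you prefer to keep iptables running rather than stopping it, a minimal sketch of rules for the ports listed above (run as root on each node, then persist them):
  • iptables -I INPUT -p tcp --dport 2181 -j ACCEPT
  • iptables -I INPUT -p tcp --dport 2888 -j ACCEPT
  • iptables -I INPUT -p tcp --dport 3888 -j ACCEPT
  • iptables -I INPUT -p tcp --dport 61616 -j ACCEPT
  • iptables -I INPUT -p tcp --dport 8161 -j ACCEPT
  • iptables -I INPUT -p tcp --dport 61619 -j ACCEPT
  • service iptables save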

Installation

Java JDK
  • Download JDK 8 from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
  • Extract the jdk-8*.tar.gz archive to a folder of your choice (I used ~/Dev/server). This creates the folder ~/Dev/server/jdk1.8.0_20.
  • Set the JAVA_HOME environment variable.
  • To set it for all users, edit /etc/profile and add the following line: export JAVA_HOME=HOME_DIR/Dev/server/jdk1.8.0_20

Zookeeper
  • Download ZooKeeper from http://zookeeper.apache.org/
  • Extract the file into a folder of your choice. I have used ~/Dev/server/.
  • Create a soft link named zookeeper to the extracted directory.
  • For a reliable ZooKeeper service, ZooKeeper should be deployed in cluster mode, known as an ensemble. As long as a majority of the ensemble is up, the service will be available.
  • Go to the conf directory and create the ZooKeeper configuration.
  • cd /conf
  • Copy zoo_sample.cfg to zoo.cfg.
  • Make sure the file has following lines
    • tickTime=2000
      initLimit=5
      syncLimit=2
      dataDir=~/Dev/server/data/zk
      clientPort=2181
  • Add following lines into zoo.cfg at the end.
    • server.1=zk1_IPADDRESS:2888:3888
      server.2=zk2_IPADDRESS:2888:3888
      server.3=zk3_IPADDRESS:2888:3888
    • zk1, zk2 & zk3 are IP addresses for the ZK servers. 
    • Port 2181 is used to communicate with client
    • Port 2888 is used by peer ZK servers to communicate among themselves (Quorum port) 
    • Port 3888 is used for leader election (Leader election port).
  • The last three lines of the server.id=host:port:port format specifies that there are three nodes in the ensemble. In an ensemble, each ZooKeeper node must have a unique ID between 1 and 255. This ID is defined by creating a file named myid in the dataDir directory of each node. For example, the node with the ID 1 (server.1=zk1:2888:3888) will have a myid file at /home/sthuraisamy/Dev/server/data/zk with the text 1 inside it.
  • Create the myid file in the data directory with value 1 for zk1 (server.1), and with 2 and 3 for the other ZK servers. A quick way to verify the ensemble is shown after this step.
    • echo 1 > myid
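
Once all three nodes are configured, a quick way to verify the ensemble (commands are run from the zookeeper directory; the ruok check assumes nc is installed):
  • bin/zkServer.sh start
  • bin/zkServer.sh status      (one node should report leader, the others follower)
  • echo ruok | nc localhost 2181      (a healthy node answers imok)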
ActiveMQ
  • Download the ActiveMQ distribution from http://apache.mirror.nexicom.net/activemq/5.10.0/apache-activemq-5.10.0-bin.tar.gz
  • Extract the file into a folder of your choice. I have used ~/Dev/server/.
  • Create a soft link named activemq to the extracted directory.
  • Do the following:
    • cd /bin
    • chmod 755 activemq
    • /bin/activemq start
  • To confirm that ActiveMQ is listening on port 61616, run the following command, or check the log file and confirm that the port-listening messages are present.
    • netstat -an|grep 61616
  • In the ActiveMQ config file, the following bean classes define the settings:
    • PropertyPlaceholderConfigurer
    • Credentials
    • Broker section
      • constantPendingMessageLimitStrategy: limits the number of messages kept in memory for slow consumers.
    • For other settings to handle slow consumers, refer to http://activemq.apache.org/slow-consumer-handling.html
    • The persistence adapter defines the storage used to keep messages.
    • For better performance:
      • Use NIO: refer to http://activemq.apache.org/configuring-transports.html#ConfiguringTransports-TheNIOTransport

    • Replicated LevelDB store using ZooKeeper: http://activemq.apache.org/replicated-leveldb-store.html
    • These settings are applied in ActiveMQ after ZooKeeper is set up. Replace the default persistence adapter in conf/activemq.xml with the replicated LevelDB store. On msgq1 it looks like the following (only the attributes used in my setup are shown; the full file is listed at the end of this section):

      <persistenceAdapter>
          <replicatedLevelDB
              zkAddress="192.168.163.160:2181,192.168.163.161:2181,192.168.163.162:2181"
              directory="~/Dev/server/activemq/data/leveldb"
              hostname="192.168.163.160"/>
      </persistenceAdapter>

      • hostname should be assigned the separate IP address of each instance.


My approach was to set up the software on a single VM instance in VMware Fusion and then create two clones, giving three servers. I have named the instances messageq1, messageq2, and messageq3. After starting the instances, confirm that the myid file and the IP addresses in zoo.cfg are set up properly with each new instance's IP address.

After everything is configured:
  • Start the ZooKeeper instances on all three nodes: /bin/zkServer.sh start
  • Start the ActiveMQ instances on all three nodes: /bin/activemq start
  • In my setup the first node started without any issue. After I started the second node, I found the following exception in the log file.

No IOExceptionHandler registered, ignoring IO exception | org.apache.activemq.broker.BrokerService | LevelDB IOException handler.
java.io.IOException: com.google.common.base.Objects.firstNonNull(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
     at org.apache.activemq.util.IOExceptionSupport.create(IOExceptionSupport.java:39)[activemq-client-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.LevelDBClient.might_fail(LevelDBClient.scala:552)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.LevelDBClient.replay_init(LevelDBClient.scala:657)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.LevelDBClient.start(LevelDBClient.scala:558)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.DBManager.start(DBManager.scala:648)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.LevelDBStore.doStart(LevelDBStore.scala:235)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.replicated.MasterLevelDBStore.doStart(MasterLevelDBStore.scala:110)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.apache.activemq.util.ServiceSupport.start(ServiceSupport.java:55)[activemq-client-5.10.0.jar:5.10.0]
     at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore$$anonfun$start_master$1.apply$mcV$sp(ElectingLevelDBStore.scala:226)[activemq-leveldb-store-5.10.0.jar:5.10.0]
     at org.fusesource.hawtdispatch.package$$anon$4.run(hawtdispatch.scala:330)[hawtdispatch-scala-2.11-1.21.jar:1.21]
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_20]
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_20]

After going through ActiveMQ tickets, I found that this issue has already been reported: https://issues.apache.org/jira/browse/AMQ-5225. The workaround described in the ticket solves the problem:
  1. remove pax-url-aether-1.5.2.jar from the lib directory
  2. comment out the log query section in conf/activemq.xml


To confirm that ActiveMQ is listening for requests:
On the master, run netstat -an | grep 61616 and confirm the port is in LISTEN mode.
On the slaves, run netstat -an | grep 6161 and the output should show activity on the replication port 61619 rather than a listener on 61616.

Post-Install
For Zookeeper, set the Java heap size. This is very important to avoid swapping, which will seriously degrade ZooKeeper performance. To determine the correct value, use load tests, and make sure you are well below the usage limit that would cause you to swap. Be conservative - use a maximum heap size of 3GB for a 4GB machine.
Increase the open file limit to 51200. E.g.: ulimit -n 51200.
Review linux network setting parameters : http://www.nateware.com/linux-network-tuning-for-2013.html#.VA8pN2TCMxo 
Review ActiveMQ transports configuration settings : http://activemq.apache.org/configuring-transports.html
Review ActiveMQ persistence configuration settings : http://activemq.apache.org/persistence.html
Review zookeeper configuration settings : http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_configuration

Reference
tickTime: the length of a single tick, which is the basic time unit used by ZooKeeper, as measured in milliseconds. It is used to regulate heartbeats, and timeouts. For example, the minimum session timeout will be two ticks.
initLimit: Amount of time, in ticks, to allow followers to connect and sync to a leader. Increase this value as needed if the amount of data managed by ZooKeeper is large.
syncLimit: Amount of time, in ticks, to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped.
clientPort: The port to listen for client connections; that is, the port that clients attempt to connect to.
dataDir: The location where ZooKeeper will store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.


ActiveMQ configuration file from msgq1
<beans
    xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
  http://activemq.apache.org/schema/core http://activemq.apache.org/schema/core/activemq-core.xsd">

    <!-- Allow the use of system properties as variables in this configuration file -->
    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="locations">
            <value>file:${activemq.conf}/credentials.properties</value>
        </property>
    </bean>

    <!-- logQuery bean (scope="singleton" init-method="start" destroy-method="stop");
         comment this section out when applying the AMQ-5225 workaround described above -->

    <broker xmlns="http://activemq.apache.org/schema/core" dataDirectory="${activemq.data}">

        <!-- destinationPolicy (with constantPendingMessageLimitStrategy), managementContext
             and systemUsage (memoryUsage / storeUsage / tempUsage) sections as in the
             stock activemq.xml -->

        <persistenceAdapter>
            <replicatedLevelDB
                 zkAddress="192.168.163.160:2181,192.168.163.161:2181,192.168.163.162:2181"
                 directory="~/Dev/server/activemq/data/leveldb"
                 hostname="192.168.163.160"/>
        </persistenceAdapter>

        <!-- transportConnectors section as in the stock activemq.xml -->

        <shutdownHooks>
            <bean xmlns="http://www.springframework.org/schema/beans"
                  class="org.apache.activemq.hooks.SpringContextHook"/>
        </shutdownHooks>

    </broker>

    <!-- import of jetty.xml for the web console, as in the stock configuration -->

</beans>

ZK Configuration file
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/sthuraisamy/Dev/server/data/zk
clientPort=2181

server.1=192.168.163.160:2888:3888
server.2=192.168.163.161:2888:3888
server.3=192.168.163.162:2888:3888

In the ActiveMQ 5.10 web console you can view and delete pending messages in a queue.

Streaming Data

Big data is a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional applications. The data sets contain not only structured data but also unstructured data. Big data has three characteristics:
  • Volume
  • Velocity
  • Variety

The diagram below illustrates these characteristics.

[Figure: the three Vs of big data]


Recently, more and more data has become easily available as streams. A stream data item can be classified along the 5Ws (and one How) data dimensions:

  • What is the data? Video, image, text, number
  • Where did the data come from? From Twitter, a smart phone, a hacker
  • When did the data occur? The timestamp of the data incidence.
  • Who received the data? Friend, bank account, victim
  • Why did the data occur? Sharing photos, finding new friends, spreading a virus
  • How was the data transferred? By Internet, by email, online transfer

[Figure: 5W1H classification of stream data]


Conventional data processing technologies are unable to process this volume of data within a tolerable elapsed time. In-memory databases also have key problems: larger data sets may not fit into memory, and moving all data sets onto a centralized machine is too expensive. To process data as they arrive, the paradigm has changed from the conventional "one-shot" data processing approach to elastic, virtualized, datacenter/cloud-based data processing frameworks that can mine continuous, high-volume, open-ended data streams.

The framework contains three main components.

  • Data Ingestion: Accepts data from multiple sources such as social networks, online services, etc.
  • Data Analytics: Consists of many systems to read, analyze, clean and normalize the data.
  • Data Storage: Provides storage and indexing for the data sets.

The following diagram explains these components.

[Figure: streaming data processing framework]



PCI DSS 3.0

The Payment Card Industry Data Security Standard (PCI DSS) was developed by the following payment brands.
  • American Express
  • Discover Financial Services
  • JCB International
  • MasterCard Worldwide
  • Visa.

PCI DSS mandates a set of requirements and processes for security management, policies, procedures, network architecture, software design and critical protective measures. The requirements must be met by all organizations (merchants and service providers) that transmit, process or store payment card data, or that directly or indirectly affect the security of cardholder data. If an organization uses a third party to manage cardholder data, the organization is responsible for ensuring that this third party is compliant with the PCI DSS. The requirements are published and controlled by the independent PCI Security Standards Council (SSC), which defines the qualifications for, and trains, Qualified Security Assessors (QSAs), Internal Security Assessors (ISAs), PCI Forensic Investigators (PFIs), PCI Professionals (PCIPs), Qualified Integrators and Resellers (QIRs), and Approved Scanning Vendors (ASVs).

Key definitions and acronyms in the PCI DSS:
  • Acquirer: Bank, which acquires merchants.
  • Payment brand: Visa, MasterCard, Amex, Discover, JCB.
  • Merchant: Sells products to cardholders.
  • Service provider: A business entity, directly or indirectly involved in the processing, storage, transmission and switching of cardholder data.
  • PAN (Primary Account Number): The 16 digit payment card number.
  • TPPs (Third Party Processors): Who process payment card transactions.
  • DSEs (Data Storage Entities): Who store or transmit payment card data.
  • QSA (Qualified Security Assessor): Someone who is trained and certified to carry out PCI DSS compliance assessments.
  • ISA (Internal Security Assessor): Someone who is trained and certified to conduct internal security assessments.
  • ASV (Approved Scanning Vendor): An organization that is approved as competent to carry out the security scans required by PCI DSS.
  • PFI (PCI Forensic Investigator): An individual trained and certified to investigate and contain information security breaches involving cardholder data.
  • CDE (Cardholder Data Environment): Any network that possesses cardholder data or sensitive authentication data.

PCI DSS applies to all processes, people and technology, and all system components, including network components, servers, or applications that are included in or connected to the cardholder data environment. It also applies to telephone recording technology used by call centres that accept payment card transactions.

Service provider compliance levels are described in the table below:
[Table: service provider compliance levels]

PCI DSS Procedures

Install and maintain a firewall configuration to protect cardholder data

  • Establish and implement firewall and router configuration standards.
  • Build firewall and router configurations that restrict connections between untrusted networks and any system components in the cardholder data environment.
  • Prohibit direct public access between the Internet and any system component in the cardholder data environment.
  • Install personal firewall software on any mobile and/or employee-owned devices that connect to the Internet when outside the network.
  • Ensure that security policies and operational procedures for managing firewalls are documented, in use, and known to all affected parties.
  • Maintain current network and data flow diagrams.

Do not use vendor-supplied defaults for system passwords and other security parameters
  • Always change vendor-supplied defaults and remove or disable unnecessary default accounts before installing a system on the network.
  • Develop configuration standards for all system components. Ensure that these standards address all known security vulnerabilities and are consistent with industry-accepted system hardening standards.
  • Encrypt all non-console administrative access using strong cryptography. Use technologies such as SSH, VPN, or SSL/TLS for web-based management and other non-console administrative access.
  • Maintain an inventory of system components that are in scope for PCI DSS.
  • Ensure that security policies and operational procedures for managing vendor defaults and other security parameters are documented, in use, and known to all affected parties.
  • Shared hosting providers must protect each entity’s hosted environment and cardholder data.

Protect stored cardholder data
  • Keep cardholder data storage to a minimum by implementing data retention and disposal policies, procedures and processes.
  • Do not store sensitive authentication data after authorization (even if encrypted). If sensitive authentication data is received, render all data unrecoverable upon completion of the authorization process.
  • Mask PAN when displayed (the first six and last four digits, at maximum), such that only personnel with a legitimate business need can see the full PAN.
  • Render PAN unreadable anywhere it is stored (including on portable digital media, backup media, and in logs).
  • Document and implement procedures to protect keys used to secure stored cardholder data against disclosure and misuse.
  • Fully document and implement all key-management processes and procedures for cryptographic keys used for encryption of cardholder data.
  • Ensure that security policies and operational procedures for protecting stored cardholder data are documented, in use, and known to all affected parties.

Encrypt transmission of cardholder data across open, public networks
  • Use strong cryptography and security protocols (for example, SSL/TLS, IPSEC, SSH, etc.) to safeguard sensitive cardholder data during transmission over open, public networks.
  • Never send unprotected PANs by end-user messaging technologies (for example, e-mail, instant messaging, chat etc.).
  • Ensure that security policies and operational procedures for encrypting transmissions of cardholder data are documented, in use, and known to all affected parties.

Protect all systems against malware and regularly update anti-virus software or programs
  • Ensure that all anti-virus mechanisms are maintained.
  • Ensure that anti-virus mechanisms are actively running and cannot be disabled or altered by users, unless specifically authorized by management on a case-by-case basis for a limited time period.
  • Ensure that security policies and operational procedures for protecting systems against malware are documented, in use, and known to all affected parties.

Develop and maintain secure systems and applications
  • Establish a process to identify security vulnerabilities, using reputable outside sources for security vulnerability information, and assign a risk ranking.
  • Ensure that all system components and software are protected from known vulnerabilities by installing applicable vendor-supplied security patches. Install critical security patches within one month of release.
  • Develop internal and external software applications securely (including web-based administrative access to applications).
  • Follow change control processes and procedures for all changes to system components.
  • Address common coding vulnerabilities in software development processes.
  • For public-facing web applications, address new threats and vulnerabilities on an ongoing basis and ensure these applications are protected against known attacks.
  • Ensure that security policies and operational procedures for developing and maintaining secure systems and applications are documented, in use, and known to all affected parties.

Restrict access to cardholder data by business need-to-know
  • Limit access to system components and cardholder data to only those individuals whose job requires such access.
  • Establish an access control system for systems components that restricts access based on a user’s need to know, and is set to “deny all” unless specifically allowed.
  • Ensure that security policies and operational procedures for restricting access to cardholder data are documented, in use, and known to all affected parties.

Identify and authenticate access to system components
  • Define and implement policies and procedures to ensure proper user identification management for non-consumer users and administrators on all system components.
  • In addition to assigning a unique ID, ensure proper user-authentication management for non-consumer users and administrators on all system components.
  • Incorporate two-factor authentication for remote network access originating from outside the network by personnel (including users and administrators) and all third parties, (including vendor access for support or maintenance).
  • Document and communicate authentication procedures and policies.
  • Do not use group, shared or generic IDs, passwords, or other authentication methods.
  • Where other authentication mechanisms are used (for example, physical or logical security tokens, smart cards, certificates etc.), use of these mechanisms must be assigned to an individual account and only the intended account can use that mechanism.
  • All access to any database containing cardholder data (including access by applications, administrators, and all other users) is restricted.
  • Ensure that security policies and operational procedures for identification and authentication are documented, in use, and known to all affected parties.

Restrict physical access to cardholder data
  • Use appropriate facility entry controls to limit and monitor physical access to systems in the cardholder data environment.
  • Develop procedures to easily distinguish between onsite personnel and visitors.
  • Control physical access for onsite personnel to the sensitive areas.
  • Implement procedures to identify and authorize visitors.
  • Physically secure all media.
  • Maintain strict control over the internal or external distribution of any kind of media
  • Maintain strict control over the storage and accessibility of media.
  • Destroy media when it is no longer needed for business or legal reasons.
  • Protect devices that capture payment card data via direct physical interaction with the card from tampering and substitution.

Track and monitor all access to network resources and cardholder data
  • Implement audit trails to link all access to system components to each individual user.
  • Implement automated audit trails for all system components so that events can be reconstructed.
  • Using time-synchronization technology, synchronize all critical system clocks and times, and protect how time is acquired, distributed and stored.
  • Secure audit trails so they cannot be altered.
  • Review logs and security events for all system components to identify anomalies or suspicious activity.
  • Retain audit trail history for at least one year, with a minimum of three months immediately available for analysis.
  • Ensure that security policies and operational procedures for monitoring all access to network resources and cardholder data are documented, in use, and known to all affected parties.

Regularly test security systems and processes
  • Implement processes to test for the presence of wireless access points (802.11), and detect and identify all authorized and unauthorized wireless access points on a quarterly basis.
  • Run internal and external network vulnerability scans at least quarterly and after any significant change in the network.
  • Implement a methodology for penetration testing.
  • Use intrusion-detection and/or intrusion-prevention techniques to detect and/or prevent intrusions into the network.
  • Deploy a change-detection mechanism to alert personnel to unauthorized modification of critical system files, configuration files, or content files; and configure the software to perform critical file comparisons at least weekly.
  • Ensure that security policies and operational procedures for security monitoring and testing are documented, in use, and known to all affected parties.

Maintain a policy that addresses information security for all personnel
  • Establish, publish, maintain and disseminate a security policy.
  • Implement a risk assessment process.
  • Develop usage policies for critical technologies and define proper use of these technologies.
  • Ensure that the security policy and procedures clearly define information security responsibilities for all personnel.
  • Assign information security management responsibilities to an individual or team.
  • Implement a formal security awareness programme to make all personnel aware of the importance of cardholder data security.
  • Screen potential personnel prior to hire to minimize the risk of attacks from internal sources.
  • Maintain and implement policies and procedures to manage service providers with whom cardholder data is shared, or that could affect the security of cardholder data.
  • Additional requirement for service providers: Service providers acknowledge in writing to customers that they are responsible for the security of cardholder data the service provider possesses or otherwise stores, processes or transmits on behalf of the customer, or to the extent that they could impact the security of the customer’s cardholder data environment.
  • Implement an incident response plan. Be prepared to respond immediately to a system breach.

Reliability design for Cloud applications

One of the backbones of software reliability is avoiding faults. In software reliability engineering, there are four major approaches to improving system reliability.

  • Fault forecasting: Provides the predictive approach to software reliability engineering. Measures of the forecasted dependability can be obtained by modelling or by using experience from previously deployed systems. Forecasting is a front-end product development life cycle exercise; it is done during system exploration and requirements definition.
  • Fault prevention: Aims to prevent the introduction of faults, e.g., by constraining the design processes by means of rules. Prevention occurs during the product development phases of a project where the requirements, design, and implementation are occurring.
  • Fault removal: Aims to detect the presence of faults, and then to locate and remove them. Fault removal begins at the first opportunity that faults injected into the product are discovered. At the design phase, requirements phase products are passed to the design team. This is the first opportunity to discover faults in the requirements models and specifications. Fault removal extends into implementation and through installation.
  • Fault tolerance: Provides the intrinsic ability of a software system to continuously deliver service to its users in the presence of faults. This approach to software reliability addresses how to keep a system functioning after the faults in the delivered system manifest themselves. Fault tolerance relies primarily on error detection and error correction, with the latter being either backward recovery (e.g., retry), forward recovery (e.g., exception handling) or compensation recovery (e.g., majority voting). From the middle phases of the software development life cycle through product delivery and maintenance, reliability efforts focus on fault tolerance.

Fault activities

When moving to the cloud environment, applications deployed in the cloud are usually distributed into multiple components, and fault prevention and fault removal techniques alone are not sufficient. Another approach to building reliable systems is software fault tolerance, which employs functionally equivalent components to tolerate faults. The software fault tolerance approach takes advantage of the redundant resources in the cloud environment and makes the system more robust by masking faults instead of removing them.

Cloud computing platforms typically prefer to build reliability into the software. The software should be designed for failure and assume that components will misbehave, fail or go away from time to time. Reliability should be built into the application as well as the data; that is, a component itself or an external module should monitor the service components and execute the recovery process. Multiple copies of data are maintained so that if you lose any individual machine the system continues to function (in the same way that if you lose a disk in a RAID array the service is uninterrupted). Large-scale services will ideally also replicate data in multiple locations, so that if a rack, a row of racks or even an entire datacenter were to fail the service would still be uninterrupted.

ActiveMQ HA Tuning

Performance is based on the following factors:
  • The network topology
  • Transport protocols used
  • Quality of service
  • Hardware, network, JVM and operating system
  • Number of producers, number of consumers
  • Distribution of messages across destinations along with message size

The possible bottlenecks are
  • Overall system
  • Network latencies
  • Disk IO
  • Threading overheads
  • JVM optimizations

Async publishing
When an ActiveMQ message producer sends a non-persistent message, it is dispatched asynchronously (fire and forget), but for persistent messages the publisher blocks until it gets a notification that the message has been processed by the broker (saved to the store, queued to be dispatched to any active consumers, etc.). Messages are dispatched with the delivery mode set to persistent by default (which is required by the JMS spec). So if you are sending messages on a topic, the publisher will block by default (even if there are no durable subscribers on the topic) until the broker has returned a notification. If you are looking for good performance with topic messages, either set the delivery mode on the publisher to non-persistent, or set the useAsyncSend property on the ActiveMQ ConnectionFactory to true.
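
A minimal sketch of both options (the broker URL and destination name are placeholders):

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class AsyncSendExample {
        public static void main(String[] args) throws JMSException {
            ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
            factory.setUseAsyncSend(true);                  // do not block on persistent sends
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createTopic("example.topic"));
            producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT); // alternative: skip persistence
            producer.send(session.createTextMessage("hello"));
            connection.close();
        }
    }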

Pre-fetch sizes for Consumer
ActiveMQ will push as many messages to the consumer as fast as possible, where they will be queued for processing by an ActiveMQ Session. The maximum number of messages that ActiveMQ will push to a consumer without the consumer processing a message is set by the pre-fetch size. You can improve throughput by running ActiveMQ with larger pre-fetch sizes. Pre-fetch sizes are determined by the ActiveMQPrefetchPolicy bean, which is set on the ActiveMQ ConnectionFactory. The default values are listed below, followed by a configuration sketch.
  • queue: 1000
  • queue browser: 500
  • topic: 32767
  • durable topic: 1000
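
A short sketch of overriding these values programmatically (the numbers are only examples):

    import org.apache.activemq.ActiveMQConnectionFactory;
    import org.apache.activemq.ActiveMQPrefetchPolicy;

    public class PrefetchExample {
        public static void main(String[] args) {
            ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
            ActiveMQPrefetchPolicy prefetch = new ActiveMQPrefetchPolicy();
            prefetch.setQueuePrefetch(2000);          // larger queue prefetch for fast consumers
            prefetch.setTopicPrefetch(32767);
            prefetch.setDurableTopicPrefetch(1000);
            factory.setPrefetchPolicy(prefetch);
        }
    }

The same values can also be set per connection on the broker URL, e.g. tcp://localhost:61616?jms.prefetchPolicy.queuePrefetch=2000.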

Optimized acknowledge
When consuming messages in auto-acknowledge mode (set when creating the consumer's session), ActiveMQ can acknowledge receipt of messages back to the broker in batches to improve performance. The batch size is 65% of the prefetch limit for the consumer. Also, if message consumption is slow, the batch will be sent every 300 ms. You switch batch acknowledgement on by setting the optimizeAcknowledge property on the ActiveMQ ConnectionFactory to true.

Straight through session consumption
By default, a consumer's session will dispatch messages to the consumer in a separate thread. If you are using consumers with auto acknowledge, you can increase throughput by passing messages straight through the Session to the consumer; to do so, set the alwaysSessionAsync property on the ActiveMQ ConnectionFactory to false.
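
A short sketch combining the two ConnectionFactory settings described above:

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ConsumerTuningExample {
        public static void main(String[] args) {
            ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
            factory.setOptimizeAcknowledge(true);   // batch acknowledgements (auto-ack sessions)
            factory.setAlwaysSessionAsync(false);   // dispatch straight through the session
        }
    }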

File based persistence
A file-based persistence store (KahaDB by default, or the LevelDB store used above) can be used to increase throughput for persistent messages.

Performance test tools
  • http://activemq.apache.org/activemq-performance-module-users-manual.html
  • http://activemq.apache.org/jmeter-performance-tests.html

Kafka Design

Existing messaging systems carried so much complexity that it limited their performance, scalability and manageability. To overcome this, LinkedIn (www.linkedin.com) decided to build Kafka to address its need for monitoring activity stream data and operational metrics such as CPU and I/O usage and request timings.

While developing Kafka, the main focus was to provide the following
  • An API for producers and consumers to support custom implementation
  • Low overhead for network and storage with message persistence
  • High throughput supporting millions of messages
  • Distributed and highly scalable architecture

In a very basic structure, a producer publishes messages to a Kafka topic, which is created on a Kafka broker acting as a Kafka server. Consumers then subscribe to the Kafka topic to get the messages.
Important Kafka design facts are as follows:

  • The fundamental backbone of Kafka is caching messages and storing them on the filesystem. In Kafka, data is immediately written to the OS kernel page cache; flushing of data to the disk is configurable.
  • Kafka provides long retention of messages even after consumption, allowing consumers to re-consume if required.
  • Kafka uses a message set to group messages, which reduces network overhead.
  • Unlike most messaging systems, where metadata about consumed messages is kept at the server level, in Kafka the state of consumed messages is maintained at the consumer level. This also addresses issues such as:
    • Losing messages due to failure
    • Multiple deliveries of the same message
  • By default, consumers store this state in ZooKeeper, but Kafka also allows storing it within other storage systems used for Online Transaction Processing (OLTP) applications.
  • In Kafka, producers and consumers work on the traditional push-and-pull model, where producers push the message to a Kafka broker and consumers pull the message from the broker.
  • Kafka does not have any concept of a master and treats all the brokers as peers. This approach facilitates addition and removal of a Kafka broker at any point, as the metadata of brokers are maintained in ZooKeeper and shared with producers and consumers.
  • In Kafka 0.7.x, ZooKeeper-based load balancing allows producers to discover the broker dynamically. A producer maintains a pool of broker connections, and constantly updates it using ZooKeeper watcher callbacks. But in Kafka 0.8.x, load balancing is achieved through Kafka metadata API and ZooKeeper can only be used to identify the list of available brokers.
  • Producers also have the option to choose between asynchronous and synchronous mode for sending messages to a broker (a producer sketch is shown below).

The above notes are taken from the “Apache Kafka” book.
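
As a rough sketch of the sync/async producer point, here is a minimal producer using the old 0.8.x Java producer API; the broker list, topic name and the async setting are just examples.

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class SimpleKafkaProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // example brokers
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("producer.type", "async");        // "sync" for synchronous sends
            props.put("request.required.acks", "1");    // wait for the leader's ack

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("example-topic", "hello kafka"));
            producer.close();
        }
    }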

Load balancer in scalable architecture

In any SaaS or web application, incoming traffic from end users normally hits a load balancer as the first tier. The role of the load balancer is to spread the request load across the next lower tier. Load balancers are either software-based or hardware-based, and most provide different rules for distributing the load. The following are two of the most common options.
  • Round robin: With round-robin routing, the load balancer distributes inbound requests one at a time to an application server for processing, one after another. Provided that most requests can be serviced in an approximately even amount of time, this method results in a relatively well-balanced distribution of work across the available cluster of application servers. The drawback of this approach is that each individual transaction or request, even if it is from the same user session, will potentially go to a different application server for processing.

  • Sticky routing: When a load balancer implements sticky routing, the intent is to process all transactions or requests for a given user session on the same application server. In other words, when a user logs in, an application server is selected for processing by the load balancer, and all subsequent requests for that user are routed to the original application server. This allows the application to retain session state in the application server, that is, information about the user and their actions as they continue to work in their session. Sticky routing is very useful in some scenarios, for example tracking a secure connection for the user without the need to authenticate repeatedly with each individual request (potentially a very expensive operation). The session state can also yield a better experience for the end user, by utilizing the session history of actions or other information about the user to tailor application responses. Given that the load balancer normally stores only information about the user connection and the assigned application server for the connection, sticky routing still meets the definition of a stateless component. A toy sketch of both routing rules follows.
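
The sketch below is purely illustrative and not tied to any load balancer product: round robin rotates through the server list, while sticky routing hashes the session ID so a given session always lands on the same server.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    class Balancer {
        private final List<String> servers;
        private final AtomicInteger next = new AtomicInteger();

        Balancer(List<String> servers) { this.servers = servers; }

        // Round robin: hand out servers in turn.
        String roundRobin() {
            return servers.get(Math.floorMod(next.getAndIncrement(), servers.size()));
        }

        // Sticky routing: the same session ID always maps to the same server.
        String sticky(String sessionId) {
            return servers.get(Math.floorMod(sessionId.hashCode(), servers.size()));
        }
    }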

Cloud Modernization Process: Common Mistakes

Integrating legacy systems with cloud-based applications is a complex matter with ample room for error. The most frequent mistakes businesses make, which are often costly, poorly functioning and even damaging, include:

Mistake #1. Replacing the whole system:
Some enterprises choose to avoid the challenges of integration by creating a new system that replaces the full functionality of the old one. This is the most costly, difficult, and risk-prone option, but it does offer a long-term solution and may provide a system that is sufficiently agile to respond to changing business needs. Despite that potential pay-off, complete replacement requires a large, up-front investment for development, poses difficulties in duplicating behaviour of the legacy system, and increases the risk of complete software project failure.

Mistake #2. Wrapping existing legacy applications:
Often, organizations decide to wrap their legacy assets in shiny, more modern interfaces. These interfaces allow the use of a more flexible service-orientated architecture (SOA) approach, but they do not actually help make the system more flexible or easier to maintain. Certainly, wrapping allows increased access to the legacy system by other systems and the potential to replace individual parts on a piecemeal basis.
Although wrapping presents a modest up-front cost and relatively low risk, it fails to solve the core problem of legacy systems; enterprises still need to maintain outdated assets and still suffer from a lack of agility. To make matters worse, the look and feel of the wrappers are rarely elegant and quickly become dated themselves.


Mistake #3. Giving up and living with what you have:
Integration is difficult, and perhaps it shouldn’t be surprising that so many IT leaders go down this road. Despite the appearance of avoiding risk by avoiding change, giving up does not offer any hope of alleviating the legacy problem and provides only stagnation. Living with what you have frees you from up-front costs, but it denies you the ability to reduce maintenance expenses, increase operational efficiencies, or increase business agility that could boost your enterprise’s competitive advantage.

Mistake #4. Having the IT team create the utilities necessary to link individual applications:
This scenario occurs frequently but generally for a short time. Enterprises quickly discover that manually writing interfaces to accommodate obsolete platforms and programming languages is resource-intensive and prone to error.

Mistake #5: Implementing highly complex middleware solutions:
When considering a cloud integration platform, be sure there are not too many moving parts. If the middleware platform itself is so complex that you have to choose from among dozens of different products, and then integrate the middleware to itself before you can even begin to integrate your legacy applications with the cloud, you have a problem. If the middleware requires the use of one or more programming or scripting languages, you have another problem. A unified integration platform will simplify legacy application integration with the cloud.

Hadoop Platform Architecture

The Hadoop platform contains the following major modules:
  • Storage
  • Programming
  • Data Platform
  • Database
  • Provisioning
  • Coordination
  • Workflow Scheduler
  • Data Mining
  • Data Integration

[Figure: Hadoop platform architecture]

Log Management with Flume


  • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
  • Flume allows you to configure your Flume installation from a central point, without having to ssh into every machine, update a configuration variable and restart a daemon or two. You can start, stop, create, delete and reconfigure logical nodes on any machine running Flume from any command line in your network with the Flume jar available.
  • Flume also has centralized liveness monitoring. We've heard a couple of stories of Scribe processes silently failing, but lying undiscovered for days until the rest of the Scribe installation starts creaking under the increased load. Flume allows you to see the health of all your logical nodes in one place (note that this is different from machine liveness monitoring; often the machine stays up while the process might fail).
  • Flume supports three distinct types of reliability guarantees, allowing you to make tradeoffs between resource usage and reliability. In particular, Flume supports fully ACKed reliability, with the guarantee that all events will eventually make their way through the event flow.
  • Flume's also really extensible - it's really easy to write your own source or sink and integrate most any system with Flume. If rolling your own is impractical, it's often very straightforward to have your applications output events in a form that Flume can understand (Flume can run Unix processes, for example, so if you can use shell script to get at your data, you're golden).

Enabling two-factor authentication for cloud applications

Two-factor authentication provides more security to make sure your accounts don't get hacked. Passwords, unfortunately, aren't as secure as they used to be, and having a strong password may not help either: humans are the weakest link, and even a strong password can be compromised. Two-factor authentication solves this problem. It is a simple feature that asks for more than just your password: it requires both "something you know" (like a password) and "something you have" (like your phone). After you enter your password, you'll get a second code sent to your phone, and only after you enter it will you get into your account. A lot of sites have recently implemented it, including many social sites, business applications, etc. Here are some services that support two-factor authentication.
  • Google/Gmail: Google's two-factor authentication sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • Apple: Apple's two-factor authentication sends you a 4-digit code via text message or Find My iPhone notifications when you attempt to log in from a new machine.
  • Facebook: Facebook's two-factor authentication, called "Login Approvals," sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • Twitter: Twitter's two-factor authentication sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • Dropbox: Dropbox's two-factor authentication sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • Evernote: Free Evernote users will need to use an authenticator app like Google Authenticator for Android, iOS, and BlackBerry, though premium users can also receive a code via text message to log into a new machine.
  • PayPal: PayPal's two-factor authentication sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • Yahoo! Mail: Yahoo's two-factor authentication sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • Amazon Web Services: Amazon's web services, like Amazon S3 or Glacier storage, support two-factor authentication via authenticator apps, like the Google Authenticator app for Android, iOS, and BlackBerry.
  • LinkedIn: LinkedIn's two-factor authentication sends you a 6-digit code via text message when you attempt to log in from a new machine.
  • WordPress: WordPress supports two-factor authentication via the Google Authenticator app for Android, iOS, and BlackBerry.

Third-party vendors such as Duo and Authy provide REST APIs so you can enable two-factor authentication in your own web applications as well.
Evan Hahn's list has more information on how to enable and integrate them.
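
Most of the authenticator apps mentioned above implement TOTP (RFC 6238). As a rough illustration of what happens under the hood, here is a minimal TOTP sketch; the secret below is the RFC test value, not anything a real service would use.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class Totp {
        // Generate a 6-digit TOTP code for the given shared secret and Unix time.
        static int generate(byte[] secret, long unixSeconds) throws Exception {
            long counter = unixSeconds / 30;                 // 30-second time step
            byte[] msg = new byte[8];
            for (int i = 7; i >= 0; i--) {                   // counter as 8-byte big-endian
                msg[i] = (byte) (counter & 0xff);
                counter >>= 8;
            }
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secret, "HmacSHA1"));
            byte[] hash = mac.doFinal(msg);
            int offset = hash[hash.length - 1] & 0x0f;       // dynamic truncation (RFC 4226)
            int binary = ((hash[offset] & 0x7f) << 24)
                       | ((hash[offset + 1] & 0xff) << 16)
                       | ((hash[offset + 2] & 0xff) << 8)
                       |  (hash[offset + 3] & 0xff);
            return binary % 1000000;                         // 6-digit code
        }

        public static void main(String[] args) throws Exception {
            byte[] secret = "12345678901234567890".getBytes("US-ASCII"); // RFC 6238 test secret
            System.out.printf("%06d%n", generate(secret, System.currentTimeMillis() / 1000L));
        }
    }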

NoSQL Systems

There are so many NoSQL systems these days that it's hard to get a quick overview of the major trade-offs involved when evaluating relational and non-relational systems in non-single-server environments. Three primary concerns must be balanced when choosing a data management system:
  • Consistency means that each client always has the same view of the data.
  • Availability means that all clients can always read and write.
  • Partition tolerance means that the system works well across physical network partitions.
According to the CAP theorem, you can only pick two. So how does this all relate to NoSQL systems? One of the primary goals of NoSQL systems is to bolster horizontal scalability. To scale horizontally, you need strong network partition tolerance, which requires giving up either consistency or availability. NoSQL systems typically accomplish this by relaxing relational abilities and/or loosening transactional semantics. The diagram below shows where several NoSQL systems fall in this space.

[Figure: NoSQL systems positioned against the CAP theorem]

Cloud Characteristics

The NIST Draft – Cloud Computing Synopsis and Recommendations defines Cloud characteristics as:
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

How do these characteristics influence run-time behavior and determine whether PaaS offerings are cloud washed or cloud native?

Measured service or pay per use
The first Cloud characteristic, measured service, enables pay-as-you-go consumption models and subscription to metered services. Resource usage is monitored, and the system generates bills based on a charging model. To close the perception gap between business end-users and IT teams, the Cloud solution should bill for business value or business metrics instead of billing for IT resources. Business end-users do not easily correlate business value with an invoice for CPU time, network I/O, or data storage bytes. In contrast, business focused IT teams communicate value and charges based on number of users, processed forms, received marketing pieces, or sales transactions. A cloud native PaaS supports monitoring, metering, and billing based on business oriented entities.

Rapid Elasticity
A stateful monolithic application server cluster connected to a relational database does not scale efficiently or with rapid elasticity. Dynamic discoverability and rapid provisioning can instantiate processing and message nodes across a flexible and distributed topology. Applications exposing stateless services (or where state is transparently cached and available to instances) will seamlessly expand and contract to execute on available resources. A cloud native PaaS will interoperate with cloud management components to coordinate spinning up and tearing down instances based on user, message, and business transaction load in addition to raw infrastructure load (i.e. CPU and memory utilization).

Resource Pooling
Development and operations teams are familiar with resource pooling. Platform environments commonly pool memory, code libraries, database connections, and resource bundles for use across multiple requests or application instances. But because hardware isolation has traditionally been required to enforce quality of service and security, hardware resource utilization has traditionally been extremely low (~5-15%). While virtualization is often used to increase application-machine density and raise machine utilization, virtualization efforts often result in only ~50-60% utilization. With PaaS-level multi-tenancy, deterministic performance, and application-container-level isolation, an organization could possibly shrink its hardware footprint by half. Sophisticated PaaS environments allocate resources and limit usage based on policy and context. The environment may limit usage by throttling messages, time-slicing resource execution, or queuing demand.
Integration and SOA run-time infrastructure delivers effective resource pools. As teams start to pool resources beyond a single Cloud environment, integration is required to merge disparate identities, entitlements, policies, and resource models. As teams start to deliver application capabilities as Cloud services, a policy aware SOA run-time infrastructure pools service instances, manages service instance lifecycle, and mediates access.

On-demand self-service
On-demand self-service requires infrastructure automation to flexibly assign workloads and decrease provisioning periods. If teams excessively customize an environment, they will increase time to market, lower resource pooling, and create a complex environment, which is difficult to manage and maintain. Users should predominantly subscribe to standard platform service offerings, and your team should minimize exceptions. Cloud governance is an important Cloud strategy component.