<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>eNovance</title>
	<atom:link href="http://www.enovance.com/fr/blog/flux_rss" rel="self" type="application/rss+xml" />
	<link>http://www.enovance.com</link>
	<description>Cloud &#38; Managed Services Provider</description>
	<lastBuildDate>Wed, 09 May 2012 08:54:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>OpenStack Swift eventual consistency analysis</title>
		<link>http://www.enovance.com/fr/blog/4781/openstack-swift-eventual-consistency-analysis-and-bottlenecks</link>
		<comments>http://www.enovance.com/%postname#comments</comments>
		<pubDate>Wed, 25 Apr 2012 13:23:09 +0000</pubDate>
		<dc:creator>Julien Danjou</dc:creator>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cloud storage]]></category>
		<category><![CDATA[data availability]]></category>
		<category><![CDATA[eNovance]]></category>
		<category><![CDATA[openstack]]></category>
		<category><![CDATA[openstack object storage]]></category>
		<category><![CDATA[openstack swift]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[storage]]></category>
		<guid isPermaLink="false">http://www.enovance.com/?p=4781</guid>

		<description><![CDATA[Swift is the software behind the OpenStack Object Storage service. This service provides a simple storage service for applications using RESTful interfaces, providing maximum data availability and storage capacity. I explain here how some parts of the storage and replication in Swift work. If you don't know Swift and want to read a more "shallow" overview first, you can read John Dickinson's Swift Tech Overview...]]></description>
	
		<content:encoded><![CDATA[<p style="text-align: justify;"><a href="https://launchpad.net/swift">Swift</a> is the software behind the <a href="http://openstack.org/projects/storage/">OpenStack Object Storage</a> service. </p>
<p style="text-align: justify;">This service offers simple storage to applications through <a href="http://docs.openstack.org/api/openstack-object-storage/1.0/content/">RESTful interfaces</a>, aiming for maximum data availability and storage capacity.</p>
<p style="text-align: justify;">I explain here how some parts of the storage and replication in Swift work.</p>
<p style="text-align: justify;">If you don&#8217;t know Swift and want to read a more &#8220;shallow&#8221; overview first, you can read John Dickinson&#8217;s <a href="http://programmerthoughts.com/openstack/swift-tech-overview/">Swift Tech Overview</a>.</p>
<p class="Titre-Niveau-1" style="text-align: justify;">How Swift storage works</p>
<p style="text-align: justify;">In terms of the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a>, Swift chose <strong>availability</strong> and <strong>partition tolerance</strong> and dropped <strong>consistency</strong>. That means that you&#8217;ll always get your data, which is spread over many places, but in some odd cases (like a server overload or failure) you could get an old version of it (or no data at all). This compromise is made to allow maximum availability and scalability of the storage platform.</p>
<p style="text-align: justify;">But there are mechanisms built into Swift to minimize the potential data inconsistency window: they are responsible for data replication and consistency.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">The <a href="http://swift.openstack.org/">official Swift documentation</a> explains the internal storage in its own way, but I&#8217;m going to give my own explanation here.</p>
<p class="Titre-Niveau-2" style="text-align: justify;">Consistent hashing</p>
<p style="text-align: justify;">Swift uses the principle of <a href="http://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a>. It builds what it calls a <em>ring</em>. A ring represents the space of all possible computed hash values, divided into equal parts. Each part of this space is called a <em>partition</em>.</p>
<p style="text-align: justify;">The following diagram (stolen from the <a href="http://wiki.basho.com/">Riak</a> project) shows the principle nicely:</p>
<p>&nbsp;</p>
<p><a href="http://www.enovance.com/wp-content/uploads/2012/04/riak-ring.png"><img class="size-full wp-image-4771 aligncenter" src="http://www.enovance.com/wp-content/uploads/2012/04/riak-ring.png" alt="riak ring" width="350" height="229" /></a></p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">In a simple world, if you wanted to store some objects and distribute them on 4 nodes, you would split your hash space into 4. You would have 4 partitions, and computing <em>hash(object) modulo 4</em> would tell you where to store your object: on node 0, 1, 2 or 3.</p>
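<p style="text-align: justify;">The modulo placement above can be sketched in a few lines of Python. This is only an illustration: the <code>node_for</code> helper, the choice of MD5 and the object names are assumptions made for the example, not Swift code. It also counts how many objects would land on a different node if a 5<sup>th</sup> node were added, which is the weakness of this simple scheme.</p>

```python
import hashlib


def node_for(name: str, node_count: int) -> int:
    """Return the index of the node storing `name` under naive modulo placement."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


# Hypothetical object names, only used to measure how placement changes.
names = [f"object-{i}" for i in range(10_000)]

# Count the objects whose node changes when the cluster grows from 4 to 5 nodes.
moved = sum(node_for(n, 4) != node_for(n, 5) for n in names)
print(f"{moved / len(names):.0%} of objects change node when going from 4 to 5 nodes")
```

<p style="text-align: justify;">With a uniform hash, an object keeps its node only when <em>hash modulo 20</em> is 0, 1, 2 or 3, so about four objects out of five would have to move.</p>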
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">But since you want to be able to extend your storage cluster to more nodes without breaking the whole hash mapping and moving everything around, you need to build a lot more partitions. Let&#8217;s say we&#8217;re going to build 2<sup>10</sup> partitions. Since we have 4 nodes, each node will have <code>2<sup>10</sup> ÷ 4 = 256</code> partitions. If we ever want to add a 5<sup>th</sup> node, it&#8217;s easy: we just have to re-balance the partitions and move 1⁄5 of the partitions from each node to this 5<sup>th</sup> node. That means all our nodes will end up with <code>2<sup>10</sup> ÷ 5 ≈ 205</code> partitions. We can also define a <em>weight</em> for each node, in order for some nodes to get more partitions than others.</p>
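<p style="text-align: justify;">Here is a toy sketch of that re-balancing, assuming partitions are handed out round-robin and ignoring weights. The <code>partition_for</code> and <code>add_node</code> helpers are hypothetical names for this example and are not how Swift builds its ring. The point is that an object always maps to the same partition, and only the partition-to-node table changes.</p>

```python
import hashlib
from collections import Counter

PARTITION_COUNT = 2 ** 10  # partition count chosen in the text


def partition_for(name):
    """Map an object name to a partition; this never changes when nodes are added."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % PARTITION_COUNT


def add_node(assignment, new_node):
    """Return a new partition -> node list where `new_node` took its fair share."""
    assignment = list(assignment)
    node_count = len(set(assignment)) + 1
    target = len(assignment) // node_count  # partitions the new node should receive
    load = Counter(assignment)
    moved = 0
    for partition, owner in enumerate(assignment):
        if moved == target:
            break
        # Only take partitions from nodes still holding more than their fair share.
        if load[owner] > target:
            assignment[partition] = new_node
            load[owner] -= 1
            moved += 1
    return assignment


# 4 nodes, partitions handed out round-robin: 256 partitions each.
before = [p % 4 for p in range(PARTITION_COUNT)]
after = add_node(before, new_node=4)

moved = sum(a != b for a, b in zip(before, after))
print(sorted(Counter(after).items()))  # partitions held by each node
print(f"{moved} of {PARTITION_COUNT} partitions moved")
print("object-42 lives on node", after[partition_for("object-42")])
```

<p style="text-align: justify;">In this sketch the new node receives 204 partitions and the 4 original nodes keep 205 each, so only about one fifth of the partitions move, instead of most of the objects.</p>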
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">With 2<sup>10</sup> partitions, we can have up to 2<sup>10</sup> nodes in our cluster. Yeepee.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">For reference, Gregory Holt, one of the Swift authors, also wrote <a href="http://www.tlohg.com/p/building-consistent-hashing-ring.html">a post explaining the ring</a>.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">Concretely, when building a Swift ring, you&#8217;ll have to say how many partitions you want, and this is what this value is really about.</p>
<p class="Titre-Niveau-2" style="text-align: justify;">Data duplication</p>
<p style="text-align: justify;">Now, to ensure availability and partition tolerance (as seen in the <em>CAP theorem</em>) we also want to store replicas of our objects. By default, Swift stores 3 copies of every object, but that&#8217;s configurable.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">In that case, we need to store each partition defined above not only on 1 node, but also on 2 others. So Swift adds another concept: zones. A zone is an isolated space that does not depend on other zones, so in case of an outage in one zone, the other zones are still available. Concretely, a zone is likely to be a disk, a server, or a whole cabinet, depending on the size of your cluster. It&#8217;s up to you to choose anyway.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">Consequently, each partition is no longer mapped to 1 host only, but to N hosts. Each node will therefore store this number of partitions:</p>
<p style="text-align: justify;"> </p>
<p><strong>number of partitions stored on one node = number of replicas × total number of partitions ÷ number of nodes</strong></p>
<p>&nbsp;</p>
<p>Examples:</p>
<p>We split the ring into 2<sup>10</sup> = 1024 partitions. We have 3 nodes. We want 3 replicas of data.</p>
<p>→ Each node will store a copy of the full partition space: <code>3 × 2<sup>10</sup> ÷ 3 = 2<sup>10</sup> = 1024 partitions</code>.</p>
<p>We split the ring into 2<sup>11</sup> = 2048 partitions. We have 5 nodes. We want 3 replicas of data.</p>
<p>→ Each node will store <code>2<sup>11</sup> × 3 ÷ 5 ≈ 1229 partitions</code>.</p>
<p>We split the ring into 2<sup>11</sup> = 2048 partitions. We have 6 nodes. We want 3 replicas of data.</p>
<p>→ Each node will store <code>2<sup>11</sup> × 3 ÷ 6 = 1024 partitions</code>.</p>
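<p style="text-align: justify;">The same arithmetic can be checked with a small helper. The function name <code>partitions_per_node</code> and its parameters are made up for this example; it assumes all nodes have the same weight, so the result is an average.</p>

```python
def partitions_per_node(part_power, replicas, nodes):
    """Average number of partition copies held by one node (equal weights assumed)."""
    return replicas * 2 ** part_power / nodes


# (partition power, replicas, nodes) for the three examples above.
for part_power, replicas, nodes in [(10, 3, 3), (11, 3, 5), (11, 3, 6)]:
    per_node = partitions_per_node(part_power, replicas, nodes)
    print(f"2^{part_power} partitions, {replicas} replicas, {nodes} nodes: {per_node:.1f} per node")
```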
<p class="Titre-Niveau-2">Three rings to rule them all</p>
<p>In Swift, there are 3 categories of things to store: <em>accounts</em>, <em>containers</em> and <em>objects</em>.</p>
<p>&nbsp;</p>
<p style="text-align: justify;">An <strong>account</strong> is what you&#8217;d expect it to be, a user account. An account contains <strong>containers</strong> (the equivalent of Amazon S3&#8217;s buckets). Each container can contain user-defined keys and values (just like a hash table or a dictionary): values are what Swift calls <strong>objects</strong>.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">Swift wants you to build 3 different and independent rings to store its 3 kinds of things (<em>accounts</em>, <em>containers</em> and <em>objects</em>).</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">Internally, the first two categories are stored as <a href="http://www.sqlite.org/">SQLite</a> databases, whereas the last one is stored using regular files.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">Note that these 3 rings can be stored and managed on 3 completely different sets of servers.</p>
<p><a href="http://www.enovance.com/wp-content/uploads/2012/04/openstack-swift-storage.png"><img class="size-full wp-image-4772 aligncenter" src="http://www.enovance.com/wp-content/uploads/2012/04/openstack-swift-storage.png" alt="" width="680" height="462" /></a></p>
<p class="Titre-Niveau-1" style="text-align: justify;">Data replication</p>
<p style="text-align: justify;">Now that we have our storage theory in place (accounts, containers and objects distributed into partitions, themselves stored into multiple zones), let&#8217;s move on to replication in practice.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">When you put something in one of the 3 rings (be it an account, a container or an object) it is uploaded into all the zones responsible for the ring partition it belongs to. This upload into the different zones is the responsibility of the <em>swift-proxy</em> daemon.</p>
<p>&nbsp;</p>
<p><a href="http://www.enovance.com/wp-content/uploads/2012/04/openstack-swift-replication.png"><img class="size-full wp-image-4773 aligncenter" src="http://www.enovance.com/wp-content/uploads/2012/04/openstack-swift-replication.png" alt="" width="670" height="238" /></a></p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">But if one of the zones is failing, you can&#8217;t upload your copies to all zones at upload time. So you need a mechanism to be sure the failing zone will catch up to a correct state at some point.</p>
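<p style="text-align: justify;">A toy model makes the problem visible. The <code>Zone</code> class and the <code>proxy_put</code> function below are hypothetical stand-ins written for this example. They ignore everything the real proxy does beyond trying each zone, such as how many successful writes it requires before answering the client.</p>

```python
class Zone:
    """A toy zone: a dict of stored objects plus an online/offline flag."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.store = {}


def proxy_put(zones, key, value):
    """Write `key` to every reachable zone; return the names of the zones that missed it."""
    missed = []
    for zone in zones:
        if zone.online:
            zone.store[key] = value
        else:
            missed.append(zone.name)
    return missed


zones = [Zone("z1"), Zone("z2"), Zone("z3")]
zones[2].online = False  # simulate an outage in the third zone

missed = proxy_put(zones, "photo.jpg", b"v1")
print("zones that missed the write:", missed)
```

<p style="text-align: justify;">After this upload, two zones hold the object and the third one holds nothing, even once it comes back online. Something else has to bring it up to date.</p>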
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">That&#8217;s the role of the <em>swift-{container,account,object}-replicator</em> processes. These processes <strong>run on each node that is part of a zone</strong> and replicate their contents to the nodes of the other zones.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">When they run, they walk through all the contents from all the partitions on the whole file system and for each partition, issue a special <em>REPLICATE</em> HTTP request to all the other zones responsible for that same partition. The other zone responds with information about the local state of the partition. That allows the replicator process to decide if the remote zone has an up-to-date version of the partition.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">In the case of accounts and containers, it doesn&#8217;t check at the partition level, but checks each account/container contained inside each partition.</p>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">If something is not up-to-date, it will be pushed using <em>rsync</em> by the replicator process. This is why you&#8217;ll read that the replication updates are <em>&#8220;push based&#8221;</em> in the Swift documentation.</p>
<p style="text-align: justify;"> </p>
<pre><code># Pseudo code describing replication process for accounts
# The principle is exactly the same for containers
for account in accounts:
    # Determine the partition used to store this account
    partition = hash(account) % number_of_partitions
    # The number of zones is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send an HTTP REPLICATE command to the remote swift-account-server process
        version_of_account = zone.send_HTTP_REPLICATE_for(account)
        if version_of_account &lt; account.version():
            account.sync_to(zone)</code></pre>
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">This replication process is <em>O(number of accounts × number of replicas)</em>. The more accounts you have and the more replicas you want for your data, the longer the replication of your accounts takes. The same rule applies to containers.</p>
<p style="text-align: justify;"> </p>
<pre><code># Pseudo code describing replication process for objects
for partition in partitions_storing_objects:
    # The number of zones is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send an HTTP REPLICATE command to the remote swift-object-server process
        version_of_partition = zone.send_HTTP_REPLICATE_for(partition)
        if version_of_partition &lt; partition.version():
            # Use rsync to synchronize the whole partition
            # and all its objects
            partition.rsync_to(zone)</code></pre>
<p>&nbsp;</p>
<p style="text-align: justify;">This replication process is <em>O(number of object partitions × number of replicas)</em>. The more object partitions you have and the more replicas you want for your data, the longer the replication of your objects takes.</p>
<p style="text-align: justify;">I think this is something important to know when deciding how to build your Swift architecture. Choose the number of replicas, partitions and nodes carefully.</p>
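<p style="text-align: justify;">To make that cost concrete, here is a runnable toy version of the object replication loop. It is a sketch under simplifying assumptions: each copy of the ring is a plain dictionary mapping a partition to a version number, and the hypothetical <code>replicate</code> function counts one request per partition and per remote copy instead of sending a real <em>REPLICATE</em> request or running <em>rsync</em>.</p>

```python
def replicate(local, remotes):
    """Push every partition whose remote copy is older than the local one.

    `local` and each entry of `remotes` map a partition to a version number.
    Returns (requests_sent, partitions_pushed).
    """
    requests = 0
    pushed = 0
    for partition, version in local.items():
        for remote in remotes:
            requests += 1  # stands in for one REPLICATE request
            if version > remote.get(partition, -1):
                remote[partition] = version  # stands in for the rsync push
                pushed += 1
    return requests, pushed


# 1024 partitions and 3 replicas: the local copy plus two remote zones.
local = {p: 1 for p in range(1024)}
up_to_date = dict(local)
stale = dict(local)
stale[7] = 0  # this zone missed one update

requests, pushed = replicate(local, [up_to_date, stale])
print(f"{requests} requests sent, {pushed} partition pushed")
```

<p style="text-align: justify;">Only one partition needed a push here, yet the loop still issued 2048 requests: the number of checks depends on the number of partitions and replicas, not on how much data actually changed.</p>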
<p class="Titre-Niveau-1" style="text-align: justify;">Conclusion</p>
<p style="text-align: justify;">I recommend choosing the different Swift parameters for your setup carefully. Optimizing the replication process comes down to keeping the number of partitions per node to a minimum, which can be done by:</p>
<ul>
<li>
<p>decreasing the number of partitions</p>
</li>
<li>
<p>decreasing the number of replicas</p>
</li>
<li>
<p style="text-align: justify;">increasing the number of nodes</p>
</li>
</ul>
]]></content:encoded>
	
		<wfw:commentRss>http://www.enovance.com/%postname/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>

	</item>
	
	<item>
		<title>Welcome to eNovance&#8217;s blog</title>
		<link>http://www.enovance.com/fr/blog/2357/welcome-on-enovances-blog</link>
		<comments>http://www.enovance.com/%postname#comments</comments>
		<pubDate>Mon, 19 Sep 2011 18:44:36 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Blog]]></category>
		<guid isPermaLink="false">http://dev.enovance.com/?p=2357</guid>

		<description><![CDATA[Coming Soon!]]></description>
	
		<content:encoded><![CDATA[<p>Coming Soon!</p>
]]></content:encoded>
	
		<wfw:commentRss>http://www.enovance.com/%postname/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>

	</item>
	
</channel>
</rss>

