Thursday, December 22, 2011

CRC Errors, Code Violation Errors, Class 3 Discards & Loss of Sync - Why Storage isn't Always to Blame!!!


Storage is often automatically pinpointed as the source of all problems. From System Admins, DBAs and Network guys to Application owners, all are quick to point the finger at SAN Storage at the slightest hint of performance degradation. Not really surprising, considering it’s the common denominator amongst all the silos. On the receiving end of this barrage of accusation is the SAN Storage team, who are then subjected to hours of troubleshooting only to prove that their Storage wasn’t responsible. On this circle goes until the Storage team are faced with a problem they can’t absolve themselves of, even though they know the Storage is working completely fine. With array-based management tools still severely lacking in their ability to pinpoint and solve storage network related problems, and with server-based tools doing exactly that i.e. looking at the server, there is little if anything available to prove that the cause of latency is a slow-draining device such as a flapping HBA, a damaged cable or a failing SFP. Herein lies the biggest paradox: 99% of the time, when unidentifiable SAN performance problems do occur, they are linked to trivial issues such as a failing SFP. In a 10,000 port environment, the million dollar question is ‘where do you begin to look for such a minuscule needle in such a gargantuan haystack?’



                 To solve this dilemma it’s imperative to know what to look for and have the right tools to find them, enabling your SAN storage environment to be a proactive and not a reactive fire-fighting / troubleshooting circus. So what are some of the metrics and signs that should be looked for when the Storage array, application team and servers all report everything as fine yet you still find yourself embroiled in performance problems?

                   Firstly, to understand the context of these metrics / signs and the make-up of FC transmissions, let’s use the analogy of a conversation: the Frames would be the words, the Sequences the sentences, and an Exchange the conversation that they are all part of. With that premise it is important to first address the most basic of physical layer problems, namely Code Violation Errors. Code Violation Errors are the consequence of bit errors caused by corruption occurring in the sequence – i.e. any character corruption. A typical cause of this would be a failing HBA that starts to suffer from optic degradation prior to its complete failure. I also recently came across Code Violation Errors at one site where several SAN ports had been left enabled after their servers had been decommissioned. Some might ask: what’s the problem if nothing is connected to them? In fact this scenario was creating millions of Code Violation Errors, causing a CPU overhead on the SAN switch and subsequent degradation. With mission critical applications connected to the same SAN switch, performance problems became rife, and without the identification of the Code Violation Errors this could have led to weeks of troubleshooting with no success.
The build-up of Code Violation Errors becomes even more troublesome as they eventually lead to what is referred to as a Loss of Sync. A Loss of Sync is usually indicative of incompatible speeds between points, and again this is typical of optic degradation in the SAN infrastructure. For example, if an SFP is failing its optic signal will degrade and it will no longer sustain the 4Gbps it is set to run at. Case in point: a transmitting device such as an HBA is set at 4Gbps while the receiving end i.e. the SFP (unbeknownst to the end user) has degraded down to 1Gbps. Severe performance problems will occur as the two points constantly struggle with their incompatible speeds. Hence it is imperative to be alerted of any Loss of Sync, as ultimately they are also an indication of an imminent Loss of Signal i.e. when the HBA or SFP is flapping and about to fail. This leads to the nightmare scenario of an unplanned path failure in your SAN storage environment, and worse still a possible outage if failover cannot occur.

                   One of the biggest culprits, and a sure-fire route to resolving performance problems, is to look for what are termed CRC errors. CRC errors usually indicate some kind of physical problem within the FC link and are indicative of code violation errors that have led to corruption inside the FC data frame. They are usually caused by a flapping SFP or a very old / bent / damaged cable; once the receiver detects a CRC error it rejects the frame, which then has to be resent. As an analogy, imagine a newspaper delivery boy who, while cycling to his destination, loses some of the pages of the paper prior to delivery. Upon delivery the receiver would ask for the newspaper to be redelivered with the missing pages, so the delivery boy has to cycle back, find the missing pages and bring back the newspaper as a whole. In the context of a CRC error, a Frame that should typically take only a few milliseconds to deliver could take up to 60 seconds to be rejected and resent. Such response times can be catastrophic to a mission critical application and its underlying business. By gaining an insight into CRC errors and their root cause, one can immediately pinpoint which bent cable or old SFP is responsible and proactively replace it long before it starts to cause poor application response times or, even worse, a loss to your business.
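None of this is array-side information, which is exactly why the array tools miss it; the counters live on the switch ports. As a minimal sketch, assuming a Brocade FOS fabric (port 12 is just an example, and command names and counter columns vary by vendor and firmware):

switchname:admin> porterrshow
(summary table of per-port error counters: enc in / enc out for encoding / code violation errors, crc err, loss sync, loss sig, disc c3)
switchname:admin> portstatsshow 12
(detailed counters for a single suspect port)
switchname:admin> sfpshow 12
(SFP Rx/Tx power readings, useful for spotting a degrading optic before it fails completely)

Trending these counters over time, rather than glancing at a single snapshot, is what turns a 10,000 port haystack into a short list of suspect links.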

                The other FC SAN gremlin is what is termed a Class 3 discard. Of the various classes of service for data transport defined by the Fibre Channel ANSI standard, the most commonly used is Class 3. Ideal for high throughput, Class 3 is essentially a datagram service based on frame switching and is a connectionless service. Class 3’s main advantage comes from not giving an acknowledgement when a frame has been rejected or busied by a destination device or fabric. The benefits of this are that it significantly reduces the overhead on the transmitting device and allows more bandwidth to be available for transmission than would otherwise be the case. Furthermore, the lack of acknowledgements removes the potential delays between devices caused by round-trips of information transfers. As for data integrity, Class 3 leaves this to higher-level protocols such as TCP, since Fibre Channel itself does not recover corrupted or missing frames. Hence any discovery of a corrupted packet by the higher-level protocol on the receiving device initiates a retransmission of the sequence. All of this sounds great until the non-acknowledgement of frames starts to bring about Class 3’s disadvantage: inevitably a fabric will become busy with traffic and will consequently discard frames, hence the name Class 3 discards. The receiving device’s higher-level protocol then requests retransmission of the affected sequences, which degrades the device and fabric throughput.

              Another cause of Class 3 discards is zoning conflicts, where a frame has been transmitted but cannot reach its destination, so the SAN initiates a Class 3 discard. This comes from legacy configurations or zoning mistakes, where for example a decommissioned Storage system was never unzoned from a server (or vice versa), leading to frames being continuously discarded and throughput degraded as sequences are retransmitted. This then results in performance problems, potential application degradation and automatic finger pointing at the Storage system for a problem that can’t automatically be identified. By resolving the zoning conflict and spreading the load of the SAN throughput across the right ports, the heavy traffic or zoning issues which cause the Class 3 discards can be quickly removed, bringing immediate performance and throughput improvements. By gaining an insight into the occurrence and volume of Class 3 discards, huge performance problems can be remediated before they are felt, and here is yet another reason why the Storage shouldn’t automatically be blamed.
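The same port counters expose this condition as well. A brief sketch, again assuming a Brocade FOS fabric (the commands exist in FOS, but treat the workflow as illustrative):

switchname:admin> porterrshow
(a steadily climbing disc c3 column identifies the ports discarding Class 3 frames)
switchname:admin> zoneshow
(review the effective zoning configuration)
switchname:admin> cfgshow
(review the defined configuration for zones that still reference decommissioned devices)

A port whose disc c3 counter keeps incrementing while its zones reference a device that no longer exists is exactly the kind of trivial, invisible fault described above.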

           These are just some of the metrics / signs to look for which can ultimately save you from weeks of troubleshooting and guessing. By first acknowledging these metrics, identifying when they occur and proactively eliminating their causes, the SAN storage environment will quickly evolve into a healthy, proactive and optimized one. Furthermore, by eliminating each of these issues you also eliminate their consequent problems, such as application slowdown, poor response times, unplanned outages and long drawn-out troubleshooting exercises that eventually lead to finger-pointing fights. Ideally what will occur is a paradigm shift where, instead of application owners complaining to the Storage team, the Storage team proactively identifies problems before they are felt. Herein lies the key to making the ‘always blaming the Storage’ syndrome a thing of the past.

Tuesday, December 13, 2011

Enterprise Computing: Why Thin Provisioning Is Not The Holy Grail for Utilisation

Thin Provisioning (Dynamic Provisioning, Virtual Provisioning, or whatever you prefer to call it) is being heavily touted as a method of reducing storage costs.  Whilst at the outset it seems to provide some significant storage savings, it isn’t the answer for all our storage ills.
 
What is it?
 
Thin Provisioning (TP) is a way of reducing storage allocations by virtualising the storage LUN.  Only the sectors of the LUN which have been written to are actually placed on physical disk.  This has the benefit of reducing wastage in instances where more storage is provisioned to a host than is actually needed.  Look at the following figure.  It shows five typical 10GB LUNs, allocated from an array.  In a “normal” storage configuration, those LUNs would be allocated to a host and configured with a file system.  Invariably, the file systems will never be run at 100% utilisation (just try it!) as this doesn’t work operationally and also because users typically order more storage than they actually require, for many reasons.  Typically, host volumes can be anywhere from 30-50% utilised, and in an environment where the entire LUN is reserved for the host this results in 50-70% wastage.
 
Now, contrast this to a Thin Provisioned model.  Instead of dedicating the physical LUNs to a host, they now form a storage pool; only the data which has actually been written is stored onto disk.  This has two benefits; either the storage pool can be allocated smaller than the theoretical capacity of the now virtual LUNs, or more LUNs can be created from the same size storage pool.  Either way, the physical storage can be used much more efficiently and with much less waste.
There are some obvious negatives to the TP model.  It is possible to over-provision LUNs and as data is written to them, exhaust the shared storage pool.  This is Not A Good Thing and clearly requires additional management techniques to ensure this scenario doesn’t happen and sensible standards for layout and design to ensure a rogue host writing lots of data can’t impact other storage users.
 
The next problem with TP in this representation is the apparent concentration of risk and performance of many virtual LUNs to a smaller number of physical devices.  In my example, the five LUNs have been stored on only three physical LUNs.  This may represent a potential performance bottleneck and consequently vendors have catered for this in their implementations of TP.  Rather than there being large chunks of storage provided from fixed volumes, TP is implemented using smaller blocks (or chunks) which are distributed across all disks in the pool.  The third image visualises this method of allocation.
 
So each vendor’s implementation of TP uses a different block size.  HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB.  The reasons for this are many and varied and for legacy hardware are a reflection of the underlying hardware architecture.
Unfortunately, the file systems that are created on thin provisioned LUNs typically don’t have a matching block size.  Windows NTFS, for example, will use a default cluster size of only 4KB even for large disks unless explicitly overridden by the user.  The mismatch between the TP block size and the file system block size causes a major problem as data is created, amended and deleted over time on these systems.  To understand why, we need to examine how file systems are created on disk.
 
The fourth graphic shows a snapshot from one of the logical drives in my desktop PC.  This volume hasn’t been defragmented for nearly 6 months and consequently many of the files are fragmented and not stored on disk in contiguous blocks.  Fragmentation is seen as a problem for physical disks as the head needs to move about frequently to retrieve fragmented files and that adds a delay to the read and write times to and from the device.  In a SAN environment, fragmentation is less of an issue as the data is typically read and written through cache, negating most of the physical issues of moving disk heads.  However fragmentation and thin provisioning don’t get along very well and here’s why.
 
The Problem of Fragmentation and TP
 
When files are first created on disk, they will occupy contiguous sections of space.  If this data resides on TP LUNs, then a new block will be assigned to a virtual TP LUN as soon as a single filesystem block is created.  For a Windows system using 4KB blocks on USP storage, this means 42MB each time.  This isn’t a problem as the file continues to be expanded; however, it is unlikely this file will end neatly on a 42MB boundary.  As more files are created and deleted, each 42MB block will become partially populated with 4KB filesystem blocks, leaving “holes” in the filesystem which represent unused storage.  Over time, a TP LUN will experience storage utilisation “creep” as new blocks are “touched” and therefore written onto physical disk.  Even if data is deleted from an entire 42MB chunk, it won’t be released by the array as data is usually “logically deleted” by the operating system.  De-fragmenting a volume makes the utilisation creep issue worse; it writes to unused space in order to consolidate files.  Once written, these new areas of physical disk space are never reclaimed. 
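A quick bit of arithmetic shows how lopsided that mismatch is, using the HDS 42MB page and the 4KB NTFS cluster quoted above:

42MB TP page / 4KB NTFS cluster = 10,752 filesystem clusters per page
1 x 4KB file written into an untouched region = 42MB committed on physical disk
i.e. roughly 0.01% of the newly committed physical space holds real data until the rest of that page fills up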
 
So what’s the solution?
 
Fixing the TP Problem
  
Making TP useful requires a feature that is already available in the USP arrays as Zero Page Reclaim and 3Par arrays as Thin Built In.  When an entire “empty” TP chunk is detected, it is automatically released by the system (in HDS’s case at the touch of a button).  So, for example as fat LUNs are migrated to thin LUNs, unused space can be released.
This feature doesn’t help however with traditional file systems that don’t overwrite deleted data with binary zeros.  I’d suggest two possibilities to cure this problem:
  • Secure Defrag.  As defragmentation products re-allocate blocks, they should write binary zeros to the released space.  Although this is time consuming, it would ensure deleted space could be reclaimed by the array.
  • Freespace Consolidation. File system free space is usually tracked by maintaining a chain of freespace blocks.  Some defragmentation tools can consolidate this chain.  It would be an easy fix to simply write binary zeros over each block as it is consolidated up.
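Neither of these exists today as a packaged feature as far as I’m aware, but the idea can be approximated by hand. A hedged sketch, assuming a Windows host and the Sysinternals SDelete utility (the drive letter is an example):

C:\> sdelete -z E:
(writes binary zeros over the volume’s free space so that empty TP chunks become reclaimable)

Run after a defragmentation pass and followed by the array’s reclaim operation (Zero Page Reclaim on the USP, for instance), this gets you most of the way to the behaviour described above.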
One alternative solution from Symantec is to use their Volume Manager software, which is now “Thin Aware”.  I’m slightly skeptical about this as a solution as it places requirements on the operating system to deploy software or patches just to make storage operate efficiently.  It takes me back to Iceberg and IXFP….
Summary
So in summary, Thin Provisioning can be a Good Thing, however over time, it will lose its shine.  We need fixes that allow deleted blocks of data to be consolidated and returned to the storage array for re-use.  Then TP will deliver on what it promises.
Footnote
Incidentally, I’m surprised HDS haven’t made more noise about Zero Page Reclaim.  It’s a TP feature that to my knowledge EMC haven’t got on DMX or V-Max.

Source: http://thestoragearchitect.com/

Thin provisioning

Thin provisioning Introduction

Thin provisioning, sometimes called "over-subscription", is an important emerging storage technology. This article defines thin provisioning, describes how it works, identifies some challenges for the technology, and suggests where it will be most useful.

If applications run out of storage space, they crash. Therefore, storage administrators commonly install more storage capacity than required to avoid any potential application failures. This practice provides 'headroom' for future growth and reduces the risk of application failures. However, it requires the installation of more physical disk capacity than is actually used, creating waste.
Thin provisioning software allows higher storage utilization by eliminating the need to install physical disk capacity that goes unused. Figure 1 shows how storage administrators typically allocate more storage than is needed for applications -- planning ahead for growth and ensuring applications won't crash because they run out of disk space. In Figure 1, volume A has only 100 GB of physical data but has been allocated much more than that based on growth projections (500 GB in this example). The unused storage allocated to the volume cannot be used by other applications. In many cases the full 500 GB is never used and is essentially wasted. This is sometimes referred to as "stranded storage."
In most implementations, thin provisioning provides storage to applications from a common pool of storage on an as required basis. Thin provisioning works in combination with storage virtualization, which is essentially a prerequisite to effectively utilize the technology. With thin provisioning, a storage administrator allocates logical storage to an application as usual, but the system releases physical capacity only when it is required. When utilization of that storage approaches a predetermined threshold (e.g. 90%), the array automatically provides capacity from a virtual storage pool which expands a volume without involving the storage administrator. The volume can be over allocated as usual, so the application thinks it has plenty of storage, but now there is no stranded storage. Thin provisioning is on-demand storage that essentially eliminates allocated but unused capacity.
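As one concrete illustration of that threshold-driven growth (a sketch only, assuming a NetApp Data ONTAP 7-mode system of the kind covered elsewhere on this blog; the volume name and sizes are examples), volume autosize grows a flexible volume in fixed increments as it approaches full, up to a defined ceiling:

netapp1> vol autosize flex0 -m 600g -i 20g on
(grow flex0 in 20GB increments as it nears capacity, up to a 600GB maximum)

The application sees a volume that simply never fills up, while physical capacity is consumed only as data is actually written.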
There are some challenges with thin provisioning technology, and some areas where it is not currently recommended:
  • The data that is deleted from a volume needs to be reclaimed, which can add to storage controller overhead and increase cost.
  • File systems (e.g. Microsoft NTFS) that write to previously unused blocks instead of reusing released blocks cause volumes to expand to their maximum allocated size before any storage is reused. This negates the benefits of thin provisioning.
  • Applications that spread metadata across the entire volume will obviate the advantages of thin provisioning.
  • Applications that expect data to be contiguous, and/or optimize I/O performance around that assumption, are not good candidates for thin provisioning.
  • If a host determines that there is sufficient available space, it may allocate it to an application, and the application may use it. This space is virtual, however, and if the array can't provision real new storage fast enough, the application will fail. High performance controllers and good monitoring of over-provisioned storage are required to avoid reduced availability.
As thin provisioning technology matures, applications and file systems will be built and modified to avoid these kinds of problems. The economic justification for thin provisioning is simple: it makes storage allocation automatic, which significantly reduces the storage administrators' work, and it can reduce the amount of storage required to service applications. It also reduces the number of spinning disk drives required and therefore will result in substantial reductions in energy consumption.
Action Item: Thin provisioning can provide some major advantages in increasing overall storage utilization and should be seriously considered when virtualizing a data center. However, users should be aware of the caveats and should examine the storage requirements and management of their applications to identify any that are poor candidates for this approach.

Thursday, August 25, 2011

DATA ONTAP CONFIGURATION PARAMETERS

There are a number of variables or options in Data ONTAP that are important to understand before configuring for thin provisioning.
LUN Reservation
LUN reservation (not to be confused with SCSI-2 or SCSI-3 logical unit locking reservations) determines when space for the LUN is reserved or allocated from the volume. With reservations enabled (the default), the space is subtracted from the volume total when the LUN is created. For example, if a 20GB LUN is created in a volume having 80GB of free space, the free space drops to 60GB at the time the LUN is created, even though no writes have been performed to the LUN. If reservations are disabled, space is taken out of the volume only as writes to the LUN are performed. If the 20GB LUN was created without LUN space reservation enabled, the free space in the volume would remain at 80GB and would only go down as data was written to the LUN.
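A short sketch of what this looks like from the command line (Data ONTAP 7-mode syntax; the LUN paths and sizes are examples and exact options vary by release):

netapp1> lun create -s 20g -t windows /vol/vol1/lun0
(reservation enabled by default: 20GB is subtracted from the volume immediately)
netapp1> lun create -s 20g -t windows -o noreserve /vol/vol1/lun1
(no reservation: volume free space only drops as data is written to the LUN)
netapp1> lun set reservation /vol/vol1/lun1 enable
(the reservation can also be toggled after creation)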

Guarantees
With flexible volumes, introduced with Data ONTAP 7.0, there is the concept of space guarantees, which allow the user to determine when space is reserved or allocated from the containing aggregate.
Volume: A guarantee of “volume” ensures that the amount of space required by the FlexVol volume is always available from its aggregate. This is the default setting for FlexVol volumes. With the space guarantee set to “volume”, the space is subtracted, or reserved, from the aggregate’s available space at volume creation time. The space is reserved from the aggregate regardless of whether it is actually used for data storage or not.
The example below shows the creation of a 20GB volume. The df commands showing the space usage of the aggregate before and after the volume create command display how the 20GB is removed from the aggregate as soon as the volume is created, even though no data has actually been written to the volume.

netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 0GB 85GB 0%
aggr0/.snapshot 4GB 0GB 4GB 0%
netapp1> vol create flex0 aggr0 20g
Creation of volume 'flex0' with size 20g on hosting aggregate 'aggr0' has
completed.
netapp1> df -g /vol/flex0
Filesystem total used avail capacity Mounted on
/vol/flex0/ 16GB 0GB 16GB 0% /vol/flex0/
/vol/flex0/.snapshot 4GB 0GB 4GB 0% /vol/flex0/.snapshot
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 20GB 65GB 23%
aggr0/.snapshot 4GB 0GB 4GB 0%

Since the space has already been reserved from the aggregate, write operations to the volume will not cause more space from the aggregate to be used. 
None: A FlexVol volume with a guarantee of “none” reserves no space from the aggregate during volume creation. Space is first taken from the aggregate when data is actually written to the volume. The example below shows how, in contrast to the example above with the volume guarantee, the volume creation does not reduce used space in the aggregate. Even LUN creation, which by default has space reservation enabled, does not reserve space out of the aggregate. Write operations to space-reserved LUNs in a volume with guarantee=none will fail if the containing aggregate does not have enough available space. LUN reservation assures that the LUN has space in the volume, but guarantee=none doesn’t assure that the volume has space in the aggregate.
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 0GB 85GB 0%
aggr0/.snapshot 4GB 0GB 4GB 0%
netapp1>
netapp1> vol create noneflex -s none aggr0 20g
Creation of volume 'noneflex' with size 20g on hosting aggregate
'aggr0' has completed.
netapp1>
netapp1> df -g /vol/noneflex
Filesystem total used avail capacity Mounted on
/vol/noneflex/ 16GB 0GB 16GB 0% /vol/noneflex/
/vol/noneflex/.snapshot 4GB 0GB 4GB 0% /vol/noneflex/.snapshot
netapp1>
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 0GB 85GB 0%
aggr0/.snapshot 4GB 0GB 4GB 0%
netapp1> lun create -s 10g -t windows /vol/noneflex/foo
Mon Nov 24 15:17:28 EST [array1: lun.vdisk.spaceReservationNotHonored:notice]:
Space reservations in noneflex are not being honored, either because the volume
space guarantee is set to 'none' or the guarantee is currently disabled due to
lack of space in the aggregate.
lun create: created a LUN of size: 10.0g (10742215680)
netapp1>
netapp1> df -g /vol/noneflex
Filesystem total used avail capacity Mounted on
/vol/noneflex/ 16GB 10GB 6GB 0% /vol/noneflex/
/vol/noneflex/.snapshot 4GB 0GB 4GB 0% /vol/noneflex/.snapshot
netapp1>
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 0GB 85GB 0%
aggr0/.snapshot 4GB 0GB 4GB 0%

File: With guarantee=file the aggregate assures that space is always available for overwrites to space-reserved LUNs. Fractional reserve, a volume-level option discussed later in this paper, is set to 100% and is not adjustable with this type of guarantee. The “file” guarantee is basically the same as the “none” guarantee, with the exception that space reservations for LUNs and space-reserved files are honored. The example below looks the same as the previous example with guarantee=none, except that in this example the LUN creation takes space from the aggregate because it is a space-reserved object. Since the space reservation is honored, the “lun create” command also doesn’t issue the warning shown in the previous example.
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 0GB 85GB 0%
aggr0/.snapshot 4GB 0GB 4GB 0%
netapp1>
netapp1> vol create noneflex -s file aggr0 20g
Creation of volume 'noneflex' with size 20g on hosting aggregate
'aggr0' has completed.
cnrl1>
netapp1> df -g /vol/noneflex
Filesystem total used avail capacity Mounted on
/vol/noneflex/ 16GB 0GB 16GB 0% /vol/noneflex/
/vol/noneflex/.snapshot 4GB 0GB 4GB 0% /vol/noneflex/.snapshot
netapp1>
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 0GB 85GB 0%
aggr0/.snapshot 4GB 0GB 4GB 0%
netapp1>
netapp1> lun create -s 10g -t windows /vol/noneflex/foo
lun create: created a LUN of size: 10.0g (10742215680)
netapp1>
netapp1> df -g /vol/noneflex
Filesystem total used avail capacity Mounted on
/vol/noneflex/ 16GB 10GB 6GB 0% /vol/noneflex/
/vol/noneflex/.snapshot 4GB 0GB 4GB 0% /vol/noneflex/.snapshot
netapp1>
netapp1> df -A -g aggr0
Aggregate total used avail capacity
aggr0 85GB 10GB 75GB 12%
aggr0/.snapshot 4GB 0GB 4GB 0%
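
The guarantee is simply a volume option, so it can also be changed after creation. A brief sketch (7-mode syntax, using the volume from the examples above); note that switching back to a “volume” guarantee only takes full effect if the aggregate has enough free space to honour it:

netapp1> vol options noneflex guarantee volume
(switch the volume to a full “volume” guarantee)
netapp1> vol options noneflex
(display the volume's current options, including the guarantee setting)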


Wednesday, January 19, 2011

EMC VPLEX architecture

Here you will find a good overview on the architecture of the EMC VPLEX product.

How to change a Solaris zone netmask?

It is not recommended to edit the /etc/zones/<zonename>.xml file directly; use zonecfg instead:
zonecfg -z email-zone
zonecfg:email-zone> remove net address=174.1.130.132
zonecfg:email-zone> add net
zonecfg:email-zone:net> set address=174.1.130.232/24
zonecfg:email-zone:net> set physical=bge0
zonecfg:email-zone:net> end
zonecfg:email-zone> commit
zonecfg:email-zone> exit

Use the address/prefix-length format for the new IP to configure a specific subnet mask, e.g. 174.1.130.232/24.
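The committed change typically only takes effect on the zone's network interface the next time the zone boots, so the simplest way to apply it is:

# zoneadm -z email-zone reboot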

Storage administration terminology

There is nothing more frustrating than trying to learn a new technology and not being able to keep up with basic instruction. This is partially representative of my own journey into storage, as historically I’ve focused on servers, virtualization, and operating systems. Storage administration is a peculiar segment of IT, and there are a number of terms and acronyms that go along with it. Here are some of the terms that I have taken in during my journey:
  1. Thin provisioning:
Disks are expensive, and many administrators make the case to provision volumes at any level as thin provisioned. Capacity is presented to the host as if fully allocated, but space that has not yet been written to does not consume physical disk.
  2. Hot spots:
A hot spot is an area of disk that receives a large amount of I/O from the storage consumers. This can be due to a number of logical unit numbers (LUNs) residing on a single disk or simply a very busy single LUN. The objective of a storage administrator is to mitigate the hot spots across the available storage.
  3. Short stroking:
This is somewhat of a trick play in storage administration, in that a drive can be made to perform better than in a typical configuration. Short stroking uses only a small area of each hard drive in a large array so that the heads have a shorter distance to traverse. While this increases the performance of the drive, there is usually a lot of wasted space.
  4. Zero Page Reclaim (ZPR):
For thin provisioning environments, this is where the storage controller goes and harvests zeroes and returns them to a master available storage pool. Depending on the architecture or use of storage products, this may be done exclusively by the storage processor or as an enhancement to new products. One example is VMware’s vSphere 4.1, which now offers support for vSphere APIs for Array Integration (VAAI) with selected storage products. This effectively lets the storage processor do the zero-handling or bulk zero I/O operations instead of the vSphere host, as an optimization for disk use and storage network traffic.
  5. Wide striping:
This practice involves having a higher number of drives in use for a LUN to achieve greater throughput. Basically, 12 drives in an array providing a LUN will provide better aggregate throughput than two or three of those same drives.
  6. Cheap and deep:
This is typically used to describe storage that is very large in terms of GB or TB, yet slow and inexpensive. In today’s data centers, that gravitates toward SATA drives that are 1 TB, 2 TB, or larger.
  7. Rotational storage:
This term refers to traditional hard drives with moving platters and a head that seeks to regions of the disk. This is also known, on a more casual level, as “spinning rust”. The alternative is solid state (or enterprise flash) storage, which has no moving parts.
  8. Storage tiering:
Whether automated or storage administrator-driven, this is putting workloads on disk resources that perform to their requirements. In most environments there are two storage tiers, SAS and SATA drives, with the SAS drives being the higher-performance (and higher-priced) drives.
  9. Slow-Spin or No-Spin:
Denotes a tier of storage that is very slow, such as 5400 or 7200 RPM drives, or even in a powered-down state. This can include a tape storage solution.
Storage administration is filled with a number of terms that denote how disk resources are managed, provisioned and consumed. Do you have some storage jargon that you use in your daily administration? Share your terms below.

Tuesday, January 4, 2011

Comparisons of SVM vs VXVM

Characteristic-by-characteristic comparison of Solstice DiskSuite (SDS) and Veritas Volume Manager (VxVM). For each item below, the SDS behaviour is listed first and the VxVM behaviour second.

Availability
SDS: Free with a server license, pay for workstation. Sun seems to have an on-again, off-again relationship about whether future development will continue to occur with this product, but it is currently available for Solaris 8 and will continue to be supported in that capacity. The current word is that future development is on again. Salt liberally. (Just to further confuse things, SDS is free for Solaris 8 up to 8 CPUs, even for commercial use.)
VxVM: Available from Sun or directly from Veritas (pricing may differ considerably). Also excellent educational pricing. Free with a storage array or A5000 (but striping cannot be used outside the array device).
Installation
SDS: Relatively easy. You must do special things, in exactly the right order, to mirror the root disk, swap and other disks.
VxVM: Easy. Follow the on-screen prompts and let it do its reboots.
Upgrading
SDS: Easy: remove patches, remove packages, add new packages.
VxVM: Slightly more complex, but well documented. There are several ways to do it.
Replacing a failed root mirror
SDS: Very easy: replace the disk and resynchronize.
VxVM: Very easy: replace the disk and resynchronize.
Replacing a failed primary root disk
SDS: Relatively easy: boot off the mirror, replace the disk, resync, boot off the primary.
VxVM: Easy to slightly complex depending on setup. Well documented; 11 steps or fewer.
Replacing a failed data disk in a redundant (mirrored or RAID-5) volume
SDS: Trivial.
VxVM: Trivial.
Extensibility / number of volumes
SDS: Traditionally relatively easy, but EXTREMELY limited by the use of the hard partition table on disk; the total number of volumes on a typical system is very limited because of this. If you have a lot of disks you can still create a lot of metadevices. The default is 256 max, but this can be increased by setting nmd=XXXX in /kernel/drv/md.conf and then rebooting. Schemes for managing metadevice naming for large numbers of devices are available, but clunky and occasionally contrived. NOTE: SDS 4.2.1+ (available from Solaris 7) removes the reliance upon the disk VTOC for making metadevices through 'soft partitions'.
VxVM: Trivial. No limitations will be encountered by most people; the number of volumes is potentially limitless.
Moving a volume
SDS: Difficult unless special planning and precautions have been taken in laying out the proper partitions and disk labels beforehand. Somewhat hidden by the GUI.
VxVM: Trivial. On redundant volumes it can be done on the fly.
Growing a volume (see the command sketch at the end of this comparison)
SDS: A volume can be extended in two different ways: it can be concatenated with another region of space somewhere else, or, if there is contiguous space following ALL of the partitions of the original volume, the stripe can be extended. Using concatenation you could grow a 4-disk stripe by 2 additional disks (e.g. a 4-disk stripe concatenated with a 2-disk stripe).
VxVM: A volume can be extended in two different ways: the columns of the stripe can be extended for RAID-0/5, simple single-disk volumes can be grown directly, and in VxVM >= 3.0 a volume can be re-laid out (the number of columns in a RAID-5 stripe can be reconfigured on the fly!). Contiguous space is not required. In VxVM < 3.0, if you are increasing the size of a stripe you must have space on as many disks as the original number of disks in the stripe: you can't 'grow' a 4-disk stripe by adding two more disks (though you could re-lay it out), but you could by adding four. Extremely flexible.
Shrinking a volume (only possible with a VxFS filesystem!)
SDS: Difficult. You must adjust all disk or soft partitions manually.
VxVM: Trivial. vxresize can shrink the filesystem and the volume in one command.
Relayout of a volume (e.g. changing a 4-disk RAID-5 volume to a 5-disk volume)
SDS: Requires a dump/restore of the data.
VxVM: Available on the fly for VxVM >= 3.0.
Logging
SDS: A metatrans device may be used, which provides a log-based addition on top of a UFS filesystem. This transaction log, if used, should be mirrored! (Loss of the log results in a filesystem that may be corrupted beyond even fsck repair.) Using a UFS+ logging filesystem instead of a trans device is a better alternative; UFS+ logging is available in Solaris 7 and above.
VxVM: VxVM has RAID-5 logs and mirror/DRL logs. Logging, if used, need not be mirrored, and the volume can continue operating if the log fails. Having one is highly recommended for crash recovery. Logs are infinitesimally small, typically one disk cylinder or so. The SDS logs are really more equivalent to a VxFS log at the filesystem level, but it is worth mentioning the additional capabilities of VxVM in this regard. UFS+ with logging can also be used on a VxVM volume. There are many kinds of purpose-specific logs for things like fast mirror resync, volume replication, database logging, etc.
Performance
Your mileage may vary. SDS seems to excel at simple RAID-0 striping, but seems to be only marginally faster than VxVM, and VxVM seems to gain that back when using large interleaves. For best results, benchmark YOUR data with YOUR app and pay very close attention to your data size and your stripe unit / interleave size. RAID-5 on VxVM is almost always faster by 20-30%. Links: archive1, archive2
Notifications (see also)
SDS: SNMP traps are used for notification. You must have something set up to receive them. Notifications are limited in scope.
VxVM: Email is used to notify you when a volume is being moved because of bad blocks, using hot relocation or sparing. The notification is very good.
Sparing
SDS: Hot spare disks may be designated for a diskset, but this must be done at the slice level.
VxVM: Hot spare disks may be designated for a disk group, or extra space on any disk can be used for dynamic hot relocation without the need to reserve a spare.
Terminology
SDS diskset = VxVM disk group; SDS metadevice = VxVM volume; SDS trans device ~ VxVM log. VxVM has subdisks, which are units of data (e.g. a column of a stripe) that have no SDS equivalent. VxVM plexes are groupings of subdisks (e.g. into a stripe) that have no real SDS equivalent. VxVM volumes are groupings of plexes (e.g. a data plex and a log plex, or two plexes for a 0+1 volume).
GUI
Most people prefer the VxVM GUI, though there are a few who prefer the (now 4-year-old) SDS GUI. SDS has been renamed SVM in Solaris 9 and the GUI is supposedly much improved. VxVM has gone through 3-4 GUI incarnations. Disclaimer: I *never* use the GUI.
Command line usage
SDS: metareplace, metaclear to delete, metainit for volumes, metadb for state databases, etc.
VxVM: vxassist is used for creation of all volume types; vxsd, vxvol and vxplex operate on the appropriate VxVM objects (see Terminology above). Generally there are many more vx-specific commands, but normal usage rarely requires 20% of these except for advanced configurations (special initializations, alternate pathing, etc.).
Device database / configuration copies
SDS: Kept in special, replicated partitions which you must set up on disk and configure via metadb. /etc/opt/SUNWmd and /etc/system contain the boot/metadb information and the description of the volumes. Lose these and you have big problems. NOTE: in Solaris 9 SVM, configuration copies are now kept on the metadisks themselves with the data, like VxVM.
VxVM: Kept in the private region on each disk. Disks can move about and the machine can be reinstalled without having to worry about losing data in volumes.
Typical usage
SDS: Simple mirroring of the root disk, and simple striping of disks where the situation is relatively stagnant (e.g. just a bunch of disks with RAID-0 and no immediate scaling or mobility concerns). Scales well for a small number of volumes, but poorly for a large number of smaller volumes.
VxVM: Enterprise ready. Data mobility, scalability and configuration are all extensively addressed. Replacing a failed encapsulated root disk is more complicated than it needs to be; see Sun's best-practices paper for a better way. Other alternatives exist.
Best features
SDS: Simple/simplistic: root/swap mirroring and simple striping are a no-brainer, free or nearly so. Easier to fix by hand (without immediate support) when something goes fubar (VxVM is much more complex to understand under the hood).
VxVM: Extensible, good error notifications, extremely configurable, relayout on the fly with VxVM >= 3.0, nice integration with VxFS, best scalability. Excellent edu pricing.
Worst features
SDS: The configuration syntax (meta*), and configuration information stored on the host system (< Solaris 9). Metadb/slices -- a remnant from SunOS 4 days! -- need to be completely redone; naming is inflexible and limited. The limit on the number of metadevices has kernel-hack workarounds but is still very limiting. The required mirroring of trans logs is inconvenient, though mitigated by using native UFS+ with logging in Solaris 7 and above. The lack of drive-level hot sparing (see Sparing) is extremely inconvenient.
VxVM: Expensive for enterprises and big servers; root mirroring and recovery from primary root disk failure with an encapsulated root disk are too complex (but well documented) (should be fixed in VxVM 4.0); somewhat steep learning curve for advanced usage. Recovery from administrative SNAFUs (involving restore and single-user mode) on a mirrored root disk can be troublesome.
Tips
SDS: Keep backups of your configuration in case of corruption. Regular use of metastat, metastat -p and prtvtoc can help.
VxVM: Regular use of vxprint -ht is useful for disaster recovery. There are also several different disaster recovery scripts available.
Using VxVM for data and SDS for root mirroring
Many people do this. There are trade-offs. On the one hand you have added simplicity in the management of your root disks by not having to deal with VxVM encapsulation, which can ease recovery and upgrades. On the other hand, you now have the added complexity of having to maintain a separate rootdg volume somewhere else, or use a simple slice (which, by the way, neither Sun nor Veritas will support if there are problems). You also have the added complexity of managing two completely separate storage/volume management products and their associated nuances and patches. In the end it boils down to preference. There is no right or wrong answer here, though some will say otherwise. ;) Veritas VxVM 4.0 removes the requirement for rootdg.
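
As a feel for the day-to-day syntax difference when growing storage (a sketch only: the metadevice, disk, disk group and volume names are examples, and options vary by version):

SDS/SVM: concatenate another slice onto metadevice d10, then grow the UFS filesystem on it
# metattach d10 c2t3d0s0
# growfs -M /data /dev/md/rdsk/d10
VxVM: grow the volume and its VxFS filesystem together in one step
# vxresize -g datadg datavol +10g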

Red Hat Linux Tasks

Setup telnet server

# rpm -ivh xinetd.????                  insert latest revision
# rpm -ivh telnet-server.????           insert latest revision

edit the telnet config file
# cd /etc/xinetd.d
# vi telnet                             change disable option to "no"

restart the xinetd service
# service xinetd restart
Setup Virtual IP Address

# cd /etc/sysconfig/network-scripts
# cp ifcfg-eth0 ifcfg-eth0:1
Edit and update the following options; the other options should be OK
# vi ifcfg-eth0:1
 
DEVICE=eth0:1
IPADDR=192.168.0.6
Restart the network services, remember this will restart all network interfaces
# service network restart
 
Allow root access

# cd /etc
# vi securetty
Add the following lines at the bottom of the file; this allows 5 sessions, but you can add more.
pts/1
pts/2
pts/3
pts/4
pts/5
 
Setup NTP

# cd /etc/ntp

edit the ntpservers file and add your ntpservers
clock.redhat.com
clock2.redhat.com
Start the ntp service and check ntp
# service ntpd start
# ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*mail01.tjgroup. 192.43.244.18 2 u 32 64 37 37.598 -465.66 2.783
ns1.pulsation.f 194.2.0.58 3 u 23 64 37 28.774 -478.17 0.862
+enigma.wiredgoa 204.123.2.5 2 u 30 64 37 161.413 -475.88 1.307
LOCAL(0) LOCAL(0) 10 l 29 64 37 0.000 0.000 0.004
Start the ntp service on reboot
# chkconfig ntpd --list
ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off

# chkconfig --levels 2345 ntpd on
# chkconfig ntpd --list
ntpd 0:off 1:off 2:on 3:on 4:on 5:on 6:off

Solaris Zones Basics-2

Zone States
Configured: Configuration has been completed and storage has been committed. Additional configuration is still required.
Incomplete: A zone is in this state while it is being installed or uninstalled.
Installed: The zone has a confirmed configuration (zoneadm is used to verify the configuration) and the Solaris packages have been installed; even though it has been installed, it still has no virtual platform associated with it.
Ready (active): The zone's virtual platform is established. The kernel creates the zsched process, the network interfaces are plumbed and the filesystems are mounted. The system also assigns a zone ID at this state, but no processes are associated with the zone yet.
Running (active): A zone enters this state when the first user process is created. This is the normal state for an operational zone.
Shutting down + Down (active): Normal states while a zone is being shut down.

Files and Directories
Zone configuration files: /etc/zones
Zone index file: /etc/zones/index

Note: used by /lib/svc/method/svc-zones to start and stop zones

Cheat sheet
Creating a zone: zonecfg -z <zone>
(see "creating a zone" for more details; a minimal end-to-end example is also sketched after this cheat sheet)
Deleting a zone from the global system:
## halt the zone first, then uninstall it
zoneadm -z <zone> halt
zoneadm -z <zone> uninstall

## now you can delete it
zonecfg -z <zone> delete -F
Display a zone's current configuration: zonecfg -z <zone> info
Display the zone name: zonename
Create a zone creation file: zonecfg -z <zone> export

Verify a zone: zoneadm -z <zone> verify
Install a zone: zoneadm -z <zone> install
Ready a zone: zoneadm -z <zone> ready
Boot a zone: zoneadm -z <zone> boot
Reboot a zone: zoneadm -z <zone> reboot
Halt a zone: zoneadm -z <zone> halt
Uninstall a zone: zoneadm -z <zone> uninstall -F
View zones: zoneadm list -cv

Log in to a zone: zlogin <zone>
Log in to a zone's console: zlogin -C <zone> (use ~. to exit)
Log in to a zone in safe mode (recovery): zlogin -S <zone>

Add/remove a package (global zone only): # pkgadd -G -d . <package>
(if the -G option is missing, the package will be added to all zones)
Add/remove a package (non-global zone): # pkgadd -Z -d . <package>
(if the -Z option is missing, the package will be added to all zones)
Query packages in all non-global zones: # pkginfo -Z
Query packages in a specified zone: # pkginfo -z <zone>

List processes in a zone: # ps -z <zone>
List the IPCs in a zone: # ipcs -z <zone>
Process grep in a zone: # pgrep -z <zone>
List the ptree in a zone: # ptree -z <zone>
Display all filesystems: # df -Zk
Display zone process information: # prstat -Z
# prstat -z <zone>
Note:
-Z reports information about processes and zones
-z reports information about a particular zone
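
For completeness, here is the minimal end-to-end example referred to in the cheat sheet (the zone name, zonepath, address and NIC are examples only; adjust for your system):

# zonecfg -z testzone
zonecfg:testzone> create
zonecfg:testzone> set zonepath=/zones/testzone
zonecfg:testzone> add net
zonecfg:testzone:net> set address=192.168.0.10/24
zonecfg:testzone:net> set physical=bge0
zonecfg:testzone:net> end
zonecfg:testzone> commit
zonecfg:testzone> exit
# zoneadm -z testzone install
# zoneadm -z testzone boot
# zlogin -C testzone          (use ~. to exit the console)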