Planet Linux Australia

Subscribe to Planet Linux Australia feed
Planet Linux Australia - http://planet.linux.org.au
Updated: 27 min 21 sec ago

Jeremy Visser: Spot the ADSL2+ upgrade

Mon 06th Feb 2012 23:02

Oh yeah, it should also be great for gaming:

…not.

I don’t have an ADSL2+ modem (this SpeedStream 4200 isn’t syncing above the theoretical ADSL1 maximums) so I need to go buy a proper one. Most likely a NetComm NB6, as I’m fairly familiar with that model and I only need it to function in bridge mode.

Categories: thinktime

Jeremy Visser: Trying out Cherokee

Mon 06th Feb 2012 08:02

Just a small heads–up: if you’re accessing any sites hosted on this server over SSL (i.e. this blog) or IPv6, then you’ll be served via Cherokee.

Non–SSL IPv4 requests are still being served via Apache (for now) because that is the lowest common denominator that I would like to break the least.

A couple of interesting things have happened since I have been trying this out.

One was a problem that cropped up on my Apache mod_gnutls setup, where the Apache workers were using 100% CPU. Upon stumbling on this thread I realise that the BerkeleyDB–based GnuTLS cache had become corrupted. Had I not found the fix, I would have used that as an excuse to completely move over to Cherokee.

The other was that I have been playing with FastCGI for the first time. Cherokee makes it a breeze to set up FastCGI, so I thought — why not? In particular, WordPress now takes 0.3 seconds to render a page, rather than 0.9 seconds with Apache + mod_suphp + PHP CGI.

Unfortunately there does not appear to be an equivalent of suPHP for Cherokee (or FastCGI in general). suPHP itself won’t work as it behaves as an Apache module (mod_suphp) that in turn executes the PHP CGI. That’s a bit of a fizzer for me because I need some of the sites I host to be able to write to their own files — but at the same time, not affect anybody else’s (thus chown’ing as www-data would be rather suboptimal).

The current version of Cherokee I am running (1.2.2) has a bug where it does not handle Chrome’s SSL false start technique. Thanks to ‘nosey’ on #cherokee for finding that one.

Overall I have found Cherokee to be less stable than Apache. This is to be expected, it being a young whippersnapper. In particular, if it encounters a config it doesn’t like (for example, I get some regex syntax wrong), it likes to shut down entirely, rather than gracefully continuing on with the old config.

Also I have not found Cherokee to be perceptually any faster than Apache thus far. My sites being hosted on a Linode that is 200 msec away obviously doesn’t help in this area. Perhaps I would see more of a difference on a server that is both closer (latency–wise) and busier (hits-per-second–wise).

More than anything this is a bit of fun. Let’s see how we go.

Categories: thinktime

Colin Charles: Managing MySQL with Percona Toolkit by Frédéric Descamps

Mon 06th Feb 2012 03:02

Frédéric Descamps of Percona.

Percona Toolkit is Maatkit & Aspersa combined. Opensource and the tools are very useful for a DBA.

You need Perl, DBI, DBD::mysql, Term::ReadKey. Most tools are written in Perl, and whatever is in Bash is being re-written in Perl. There is also a tarball or RPM or DEB packages.

Know your environment. The hardware & OS are crucial for you to know. How much memory/CPU do you use? Do you use swap? Is this a physical/virtual machine? Do you have free space? What kind of RAID controller? Volumes? Disk? What about the network interfaces? What IO schedulers are used? Which filesystem is the data stored on? To answer all that, just use pt-summary.

Know your MySQL environment. Version? Build? How many databases? Where is the data directory? What about replication? What are key InnoDB settings? Storage engine in use? Index type? Foreign keys? Full text indexes? To answer all this and more use pt-mysql-summary.

pt-slave-find shows you the topology and replication hierarchy of your MySQL replication instances. An inventory of replicas!

Where is my disk I/O going? Use pt-diskstats which is an improved iostat. There is pt-ioprofile but it can be dangerous in production.

Now its time to get more intimate with your database. Let’s try to find the answer to these questions: how are the indexes used? Are there duplicate keys? Which queries are eating most of the resources? You can use pt-duplicate-key-checker to check for duplicate/redundant indexes or foreign keys. pt-index-usage can tell you which indexes are unused. If you think you have bad SQL, check out pt-query-advisor.

You can use pt-query-digest to analyze the slow query log and show a profile of the workload. You mostly use this with slow query logs & tcpdump’s. Be careful when you have dropped packets — results may tend to be fake then!

After all this, its time to maintain your environment.

pt-deadlock-logger checks InnoDB status to log MySQL deadlock information. It needs to run continually to capture things.

pt-fk-error-logger extracts and logs MySQL foreign key errors.

pt-online-schema-change to alter tables. It makes a “shadow copy” and swaps them. Extremely useful for large, long-running ALTER. Facebook uses the same technique.

Validate your upgrades as upgrades are the leading cause of downtime. Are queries using different indexes? Is query execution plan different? New errors? See pt-upgrade for this. Best to run this on a third machine (i.e. the old machine and a new machine to see how it goes).

Verify replication integrity – pt-table-checksum. Perform an online replication consistency check or checksum MySQL tables efficiently on one or many servers. Use it routinely (mandatory for 95% of MySQL users). Put it in a weekly crontab. Repair differences with pt-table-sync.

Repair out-of-sync replicas – pt-table-sync

Measure delay acfurately – pt-heartbeat

Deliberately delay replication – pt-slave-delay

Watch & restart MySQL replication after errors – pt-slave-restart

When there are problems, get the symptoms when it hurts. Look at pt-stalk (wait for a condition to occur them begin collecting data – eg. everytime the threads go over 2,000 you have a problem, so it collects stuff – it calls pt-collect), pt-collect (collect information from a server for some period of time), and pt-sift.

pt-mext looks at many samples of MySQL SHOW GLOBAL STATUS side-by-side. Default STATUS shows counter since the MySQL instances started. It is very helpful to see a delta of recent activity.

The future: pt-query-digest will do query reviews; pt-stalk will do “magical fault detection algorithm”. Its all opensource and its all on Launchpad at lp:percona-toolkit.

Related posts:

  1. Practical MySQL Indexing guidelines by Stéphane Combaudon
  2. Replication features of 2011 by Sergey Petrunia
  3. MySQL synchronous replication in practice with Galera by Oli Sennhauser
Categories: thinktime

Russell Coker: Reliability of RAID

Mon 06th Feb 2012 02:02

ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6.

Of course most systems in the field aren’t using even RAID-6. All the most economical hosting options involve just RAID-1 and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct, you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it will be theoretically possible to reconstruct the data in some situations by determining which disk should have it’s data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads).

For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn’t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers.

Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of “consumer” disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with “consumer” disks that it is very improbable that I’m not getting such errors. So the question is, how can I discover such errors and fix them?

In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn’t send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn’t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn’t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won’t know which disk is failing.

Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used, in one case the kernel will check 2.7TB of RAID even when there’s only 470G in use!

The BTRFS Filesystem

The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct!

I’ve just done a quick test of this. I created a filesystem with the command “mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?” and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem – that worked correctly although I was disappointed that it didn’t report any errors, I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command “btrfs scrub start .” and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command “btrfs scrub status .” just gave me a count of the corrected errors and didn’t mention which device had the errors.

It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration.

BTRFS is rumored to have performance problems, I will test this but don’t have time to do so right now. Anyway I’m not always particularly concerned about performance, I have some systems where reliability is important enough to justify a performance loss.

BTRFS and Xen

The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4.

This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.

BTRFS and DRBD

The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead.

It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!

Comparing Risks

I don’t want to use BTRFS in production now because of the risk of bugs. While it’s unlikely to have really serious bugs it’s theoretically possible that as bug could deny access to data until kernel code is fixed and it’s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it’s almost certain that I will lose small amounts of data and it’s most probable that I have silently lost data on many occasions without realising.

Related posts:

  1. Some RAID Issues I just read an interesting paper titled An Analysis of...
  2. ECC RAM is more useful than RAID A common myth in the computer industry seems to be...
  3. Software vs Hardware RAID Should you use software or hardware RAID? Many people claim...
Categories: thinktime

Colin Charles: Replication features of 2011 by Sergey Petrunia

Mon 06th Feb 2012 02:02

Sergey Petrunia of the MariaDB project & Monty Program.

MySQL 5.5 GA at the end of 2010. MariaDB 5.3 RC towards the end of 2011 (beta in June 2011).

MySQL 5.5 is merged to Percona Server 5.5 which included semi-sync replication, slave fsync options, atuomatic relay log recovery, RBR slave type conversions (question if this is useful or not), individual log flushing (very useful, but not many using), replication heartbeat, SHOW RELAYLOG EVENTS. About 2/3rds of the audience use MySQL 5.5 in production, with only 2 people using semi-sync replication.

MariaDB 5.3 brings replication features brings group commit in the binary log, which is merged into Percona Server 5.5. Checksums for binlog events which is merged from MySQL 5.6. Sergey goes in-depth about the group commit for the binary log. To find out a little more about MariaDB replication changes, see Replication in the Knowledgebase.

There are several implementations of group commit. Facebook started it, followed by MariaDB & Oracle. Percona 5.5 is GA so the feature is there, its not in MySQL 5.6 (yet?), and MariaDB 5.3 is where its at. Seems like the MariaDB implementation is the best so far – refer to the Facebook benchmark performed by Mark Callaghan.

Annotated RBR poses a compatibility problem. MariaDB 5.3 has annotate_rows, while MySQL 5.6 has rows_query event. They are different events. So you cannot have a MariaDB 5.3 master and a MySQL 5.6 slave at this moment. So MySQL 5.6 will have a flag to mark “ignorable” binlog events which will be merged into MariaDB and this will make binary logs compatible again.

There is now also optimized RBR for tables with no primary key.

MySQL 5.6 also has crash-safe slave (replication information stored in tables). Crash-safe master (binary log recovery if the server starts & sees the binary log is corrupted). Parallel event execution is something that is new in MySQL 5.6 which is the most important feature for Sergey.

Pre-heating: There is mk-slave-prefetch (famous quote: “Please don’t use mk-slave-prefetch on #MySQL unless you are Facebook.”). There is replication booster by Yoshinori Matsunobu. There is a Python version of mk-slave-prefetch that Facebook uses.

Related posts:

  1. MariaDB 5.3 query optimizer by Sergey Petrunia
  2. Where is MariaDB today?
  3. Building simple & complex replication clusters with Tungsten Replicator by Giuseppe Maxia
Categories: thinktime

Colin Charles: MySQL Creatively in a Sandbox by Giuseppe Maxia

Mon 06th Feb 2012 01:02

Giuseppe Maxia of Continuent and long time creator of MySQL Sandbox.

Only works on Unix-like servers. Works with MySQL, Percona & MariaDB servers. MySQL server has the data directory, the port and the socket – you can’t share these.

To use it: make_sandbox foo.tar.gz. Then just do ./use.

$SANDBOX_HOME is ~/sandboxes. You can also create ~/opt/mysql/ and if you have MySQL 5.0.91 binary in that directory, you can just do “sb 5.1.91″.

Sandbox has features to start replication systems as well. You can have varying master/slave setups with varying versions as well (good idea to test from MySQL -> MariaDB master->slave for migration).

You can now also play with tungsten-sandbox, which is a great way to start playing with Tungsten Replicator (see documentation and tungsten-toolbox). There is apparently also a MySQL Cluster sandbox tool that someone is working on.

 

Related posts:

  1. Building simple & complex replication clusters with Tungsten Replicator by Giuseppe Maxia
  2. MySQL HA reloaded by Ivan Zoratti
  3. MySQL synchronous replication in practice with Galera by Oli Sennhauser
Categories: thinktime

Colin Charles: Optimizing your InnoDB buffer pool usage by Steve Hardy

Mon 06th Feb 2012 01:02

Steve Hardy of Zarafa.

Work that has been done to make Zarafa better. Why do you optimise your buffer pool? To decrease your I/O load. How can you do it? Buy more RAM, page compression, less (smaller) data, rearrange data.

MariaDB or Percona Server allows you to inspect your buffer pool (unsure if this is now available in MySQL 5.6). Giuseppe in the audience says this is available in MySQL 5.6, but Steve used this on MariaDB 5.2.

Strategies to fix it: Make records smaller. Remove indexes if you can use others almost as efficiently. Make records that are accessed around the same time have a higher chance of being on the same page. Use page compression. Buy more RAM. Try Batched Key Access (BKA) in MariaDB 5.3+.

Best to view the presentation since there are specific examples that speak about how Zarafa solves their problems like a user trying to sort their email, etc.

Related posts:

  1. Batched Key Access Join Previews available
  2. MariaDB 5.3 query optimizer by Sergey Petrunia
  3. MariaDB/MySQL users in Paris & Brussels
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Jeremy Malcolm: This is a test

Mon 06th Feb 2012 00:02
This is a test, please ignore it
Categories: thinktime

Colin Charles: Practical MySQL Indexing guidelines by Stéphane Combaudon

Mon 06th Feb 2012 00:02

Stéphane Combaudon of Dailymotion.

Index: separate data structure to speed up SELECTs. Think of index in a book. In MySQL, key=index. Consider that indexes are trees.

InnoDB’s clustered index – data is stored with the Primary Key (PK) so PK lookups are fast. Secondary keys hold the PK values. Designing InnoDB PK’s with care is critical for performance.

An index can filter and/or sort values. An index can contain all the fields needed for the query you don’t need to go to the table (a covering index).

MySQL only uses 1 index per table per query (not 100% true – OR clauses), so think of a composite index when you can. Can’t index TEXT fields (use a prefix). Same for BLOBs and long VARCHARs.

Indexes: speed up queries, increases the size of your dataset, slows down writes. How big is the write slowdown? Simple test by Stephane, for in-memory workloads he says adding 2 keys makes performance 2x worse; for on-disk workloads he says its 40x worse. Never neglect the slowdown of your writes when you have an index. There is a graph in the slidedeck.

What is a bad index? Unused indexes. Redundant indexes. Duplicate indexes.

Indexing is not an exact science, but guessing is probably not the best way to design indexes. Always check your assumptions – EXPLAIN does not tell you everything, time your queries with different index combinations, SHOW PROFILES is often valuable. Slow query log is a good place to start.

Many slides with examples, so I hope Stephane posts the deck soon. If possible, try to sort & filter (an index is not always the best for sorting).

InnoDB’s clustered index is always covering. SELECT by PK is the fastest access with InnoDB.

An index can give you 3 benefits: filtering, sorting, covering.

See Userstats v2 - you need Percona Server or MariaDB 5.2+. See also pt-duplicate-key-checker to find redundant indexes easily. See also pt-index-usage to help answer questions not covered by userstats.

Related posts:

  1. Sphinx user stories by Stéphane Varoqui
  2. MySQL Full Text Search by Alex Rubin
  3. Morning sessions at MySQL MiniConf
Categories: thinktime

Colin Charles: MySQL synchronous replication in practice with Galera by Oli Sennhauser

Mon 06th Feb 2012 00:02

Oli Sennhauser of FromDual.

Synchronous multi-master replication with the Galera plugin. Your application connects to the load balancer and it redirects read/write traffic to the various MySQL Galera nodes. Tested a setup with 17 SQL nodes and you can have even more. Scaling reads and also a little bit for scaling writes is what Galera is good for.

If one node fails, the other two nodes still communicates with each other and the load balancer is aware of the failed node.

Why Galera? There is master-slave replication but its not multi-master, and its asynchronous and you can get inconsistencies. There is master-master replication but its asynchronous and can have inconsistencies and conflicts if you write on both nodes. MHA/MMM/Tungsten are not providing new technology but are based on the MySQL replication technology. MySQL Cluster is another solution but its not InnoDB storage & your need new know-how for Cluster. Also Cluster has problems with fast JOINs. Active/Passive failover clustering, but too often you have resources idling. Schooner being closed & expensive is hard to know much about what they’re doing.

Galera is synchronous & based on InnoDB (others should in theory be possible). Active-active real multi-master topology. True parallel replication on row level. Cluster speaks with each other. There is no slave lag. Won’t lose transactions. Read/write scalability, write throughput can be improved but can’t scale in the way like MySQL Cluster.

Disadvantages? Its not native MySQL binaries/sources but a patch. Codership provides binaries. Higher probability of deadlocks. When you do a full sync (like when a node comes back after downtime), one node is blocked. This is why the minimum you need a 3-node cluster. Also if you do a full sync with a database larger than 50GB, the recommended method is to use mysqldump (which can be very slow). You can use rsync. Percona is working on xtrabackup to do a full sync between nodes.

Setup: 3 nodes are recommended. Or just 2 nodes and one for garbd (Galera Arbitrator Daemon). 2 nodes works but pay attention to a split brain scenario. Go to the Codership website, download their binaries and wsrep (the Galera plug-in). Create your own user on all nodes (don’t use the default root user). You then need to configure my.cnf (there have been discussions for a galera.conf, but Oli just uses my.cnf). Galera works only with InnoDB, so in my.cnf make the default storage engine InnoDB (don’t for example, by accident have MyISAM tables).

The demo has a strange Galera start script, but its not been easy to work. Just start MySQL usually like you would do.

SST is Snapshot State Transfer (SST). Its the initial full sync between the 1st and the other node. SST blocks the donor node (hence why you need 3 nodes). With Galera v2.0, there is also incremental state transfer. It should be GA in February 2012. You can get deltas as opposed to the full sync. You can configure which will be the donor node.

Currently there are 27 variables about Galera in v1.1. You can do it just by doing SHOW GLOBAL VARIABLES LIKE ‘wsrep%’;. The plugin itself, wsrep_provider_options has plenty of options & plenty of room for tuning. SHOW GLOBAL STATUS LIKE ‘wsrep%’; currently has 38 status information fields in Galera v1.1.

For load balancing, you can do it in your application (on your own). You can also use Connector/J which provides load balancing. There is also a PHP MySQLnd that works.

Related posts:

  1. MySQL HA reloaded by Ivan Zoratti
  2. Ticketmaster thrives on MySQL Replication
  3. New MySQL 5.6 Features by Oli Sennhauser
Categories: thinktime

Colin Charles: Optimising SQL applications by using client side tools by Mark Riddoch

Sun 05th Feb 2012 23:02

Mark Riddoch of SkySQL.

This was a talk about the future in general. What people would like. Etc. Not about something that exists yet, hence the sparse notes.

Trace statements are good for the “why”. You move on to the debugger, but what is a useful SQL debugger? Profilers addresses the “when”.

SQL developer tools: manually run queries (traditional route for development, effective way to test SQL statements, some indication of performance), server logs (alerts developers of serious issues like the slow query log – identify poorly written queries, requires server access, not effective in a cloud environment (?)), external monitoring (network sniffing of connection packets, Ethereal dissectors – allow individual connections to be traced, no server access, privileged network access, complex to interpret, possibility of packet loss), intrusive tools (insert “proxy” between client & server to intercept all traffic, imposes delay & requires setup modification).

Client-side tools – client modification or hook via connectors (query logging, profiling). The Java connector has profiling. Should there be a connector slow query log? So there’s no requirement for server access and its per client rather than per server.

Plans: roll out connector query logs for Java, C, and scripting languages built on Connector/C. SQL Parser for report generation/fuzzy matching. Create post-processing tools. Query log comparisons – detect fluctuation in execution time, plan flip alerting, replay log.

Related posts:

  1. Open source tools to run a small-medium sized business
  2. Leaving Summer of Code applications to the last minute?
  3. Google Native Client & Chrome about:flags
Categories: thinktime

Pages