Copyable available on GitHub

People actually download and use Copyable, and they tend to use it in scenarios I haven’t used it in. This results in bug reports and patch submissions. So far, these have been given to me by e-mail or by blog comment, neither of which is a particularly great way of receiving them. So after receiving another one today, I finally got around to putting Copyable on GitHub.

The version I put up includes several enhancements from the latest release:

  • It uses FormatterServices.GetUninitializedObject and hence does not depend on a parameterless constructor or custom instance provider (but you can of course still create an instance provider if you want to control object initialization)
  • The bug with copy semantics for already visited objects submitted by Walter Oesch has been fixed
  • The bug with inherited fields found by Alex, and the patch submitted for it, has been incorporated
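
To make these three changes a bit more concrete, here is a rough sketch of a reflection-based deep copy. It is not Copyable's actual source: DeepCopySketch, SketchNode and ReferenceComparer are names made up for illustration, and the sketch assumes plain object graphs made of classes, structs, strings and one-dimensional arrays. It shows the three ideas from the list: creating the copy without calling a constructor, reusing the copy of an object that has already been visited, and walking up the base types so inherited private fields are copied too.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.Serialization;

// Hypothetical demo type, only here so the sketch can be tried out.
public class SketchNode
{
    public int Value;
    public SketchNode Next;
}

public static class DeepCopySketch
{
    // Compares keys by reference, so Equals overrides cannot merge distinct objects.
    private sealed class ReferenceComparer : IEqualityComparer<object>
    {
        bool IEqualityComparer<object>.Equals(object x, object y) { return ReferenceEquals(x, y); }
        int IEqualityComparer<object>.GetHashCode(object obj) { return RuntimeHelpers.GetHashCode(obj); }
    }

    public static T DeepCopy<T>(T source)
    {
        return (T)CopyObject(source, new Dictionary<object, object>(new ReferenceComparer()));
    }

    private static object CopyObject(object source, Dictionary<object, object> visited)
    {
        if (source == null) return null;
        Type type = source.GetType();

        // Primitives, enums and strings are immutable, so they can be shared.
        if (type.IsPrimitive || type.IsEnum || type == typeof(string)) return source;

        // Already visited: return the existing copy so shared references and cycles survive.
        object existing;
        if (visited.TryGetValue(source, out existing)) return existing;

        if (type.IsArray)
        {
            Array original = (Array)source;
            if (original.Rank != 1)
                throw new NotSupportedException("This sketch handles one-dimensional arrays only.");
            Array clone = (Array)original.Clone();
            visited.Add(source, clone);
            for (int i = original.GetLowerBound(0); i <= original.GetUpperBound(0); i++)
                clone.SetValue(CopyObject(original.GetValue(i), visited), i);
            return clone;
        }

        // No constructor is run, so no parameterless constructor is needed.
        object copy = FormatterServices.GetUninitializedObject(type);
        visited.Add(source, copy);

        // Walk up the hierarchy: private fields of base classes are only
        // visible when asking each base type for its declared fields.
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        for (Type t = type; t != null; t = t.BaseType)
        {
            foreach (FieldInfo field in t.GetFields(flags))
                field.SetValue(copy, CopyObject(field.GetValue(source), visited));
        }
        return copy;
    }
}
```

Copying a two-node cycle with this sketch gives a new two-node cycle: the copy's Next.Next points back at the copy, not at the original. Delegates, handles and runtime types are deliberately left out, and a real framework has to treat them specially.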

Bleeding edge Copyable can be found at http://github.com/havard/copyable. The clone URL is git://github.com/havard/copyable.git. Now go fix your own bugs! Or even better, enhance the framework.

Minimalistic MapReduce in .NET 4.0 with the new Task Parallel Library (TPL)

Among the new features in .NET 4.0 are several additions by the Parallel Computing Platform Team. As I wandered through the documentation of the Task library with cloud computing and parallelism buzz in the back of my head, I got the idea of using tasks to create a minimalistic MapReduce. Here’s the result, a rather crude and simple, but efficient MapReduce for you to play with and utilize!

What is MapReduce?

For those of you who don’t know what MapReduce is: MapReduce is a simplified interface for parallel data processing. MapReduce was initially described by the Google engineers Jeffrey Dean and Sanjay Ghemawat in the 2004 paper titled MapReduce: Simplified data processing on large clusters.

MapReduce processes data by splitting the processing into a set of transformations. In functional programming, this is called the “map” function: it maps, or transforms, an input to an output. The results of the transformations are then combined into a single result. In functional programming, this is called the “reduce” function: it reduces a set of values to a single value. On a side note, LINQ has equivalent functions, but the names are different, presumably to make them more familiar to people with SQL knowledge. In LINQ, map is called Select, and reduce is called Aggregate.
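
To see the two functions side by side, here is a small sequential sketch using Select and Aggregate. The class and method names (LinqMapReduce, SumOfSquares) are made up for illustration, and nothing here runs in parallel; it only shows the map and reduce roles.

```csharp
using System;
using System.Linq;

public static class LinqMapReduce
{
    public static int SumOfSquares(params int[] values)
    {
        return values
            .Select(v => v * v)              // "map": transform each input
            .Aggregate(0, (a, b) => a + b);  // "reduce": combine into one value (seed 0)
    }
}
```

Calling LinqMapReduce.SumOfSquares(1, 2, 3, 4, 5) returns 55.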

In short, to process a huge set of data, you split the data into chunks and process each chunk in parallel. This produces a set of intermediate results, which is then reduced to a single result.

Implementing a minimalistic MapReduce in .NET 4.0

The signature of my MapReduce function is


static Task<TResult> Start<TInput, TPartial, TResult>(
  Func<TInput, TPartial> map, 
  Func<TPartial[], TResult> reduce, 
  params TInput[] inputs);

In other words, to start a MapReduce run, you supply a map function, a reduce function, and a set of inputs. Each input will be turned into an intermediate result (of type TPartial). Inputs are transformed concurrently. When all inputs are transformed, the reduce function is called to transform the partial results into a final result (of type TResult). Cool!

The map part is implemented by starting a task for each supplied input using Task.Factory.StartNew().


Task.Factory.StartNew(() => map(input));

The reduce part is implemented as a continuation of all the map tasks, meaning that the reduce task waits for all the map tasks to complete, and then executes. This is achieved using Task.Factory.ContinueWhenAll.


Task.Factory.ContinueWhenAll(
  mapTasks, 
  tasks => PerformReduce(reduce, tasks));

As you can see, the implementation is minimalistic and simple, and usage is likewise.
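
Putting the two pieces together, a complete version could look roughly like the sketch below. This is a reconstruction from the fragments above, not necessarily the downloadable source: the class name MapReduceSketch is made up, and the reduce continuation simply collects each task's Result, which is my guess at what a helper like PerformReduce does.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class MapReduceSketch
{
    public static Task<TResult> Start<TInput, TPartial, TResult>(
        Func<TInput, TPartial> map,
        Func<TPartial[], TResult> reduce,
        params TInput[] inputs)
    {
        // One map task per input. Select hands each lambda its own 'input',
        // which sidesteps the shared foreach loop variable of C# 4.
        Task<TPartial>[] mapTasks = inputs
            .Select(input => Task.Factory.StartNew(() => map(input)))
            .ToArray();

        // Runs once every map task has finished. Reading Result rethrows
        // any exception thrown by a map function.
        // Note: ContinueWhenAll rejects an empty task array.
        return Task.Factory.ContinueWhenAll<TPartial, TResult>(
            mapTasks,
            tasks => reduce(tasks.Select(t => t.Result).ToArray()));
    }
}
```

The partial results arrive in the same order as the inputs, since the array of map tasks keeps the input order regardless of which task finishes first.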

Here’s a simple example using MapReduce to calculate the root mean square (RMS) of a set of values:


var task = MapReduce.Start<int, int, double>(
  i => i * i,
  s => Math.Sqrt(s.Aggregate((a, b) => a + b) / 5.0), // 5.0 avoids integer division
  1, 2, 3, 4, 5);
// Wait for result
task.Wait();
// Prints 3.3166...
Console.WriteLine(task.Result);

Actual applications of MapReduce are of course far more interesting than this simple example.

Applications of MapReduce

MapReduce can essentially be applied to any problem where you need a number of things to be done in parallel. It can even be applied in cases where you don’t need a final result. Just return an arbitrary value as the result (or even better, implement a variant of my MapReduce which uses Action<T>).
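
Such an Action<T> variant could look roughly like this. It is only a sketch under my own assumptions: the names MapOnly and Start are made up, and the continuation does nothing except wait on the finished tasks so that exceptions from the actions surface on the returned task.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class MapOnly
{
    public static Task Start<TInput>(Action<TInput> action, params TInput[] inputs)
    {
        // One task per input, no partial results to keep.
        Task[] tasks = inputs
            .Select(input => Task.Factory.StartNew(() => action(input)))
            .ToArray();

        // All tasks are complete here; WaitAll only rethrows their exceptions.
        return Task.Factory.ContinueWhenAll(tasks, completed => Task.WaitAll(completed));
    }
}
```

Usage is the same as before, minus the reduce function: MapOnly.Start(path => Process(path), paths).Wait(), where Process stands in for whatever work you need done.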

A few obvious use cases:

  • Distributed search
  • Distributed sort
  • Tokenization
  • Indexing
  • Log processing
  • Machine learning
  • General artificial intelligence
  • General data mining
  • Large scale image processing

The list goes on and on; these are just a few things off the top of my head.

You can grab the source code for MapReduce here. Since this is done in .NET 4.0, it requires Visual Studio 2010 Beta 2 or later.

As usual, play around with it, have fun, and let me know if you find it useful!

Extension methods for copying or cloning objects

C# 3.0 includes a new feature known as extension methods, and fiddling with it triggered the idea of creating a mechanism for copying or cloning (virtually) any .NET object or graph of objects. The manifestation of that idea has become a rather decent little framework for copying objects. It performs a deep copy as automatically as it possibly can, and provides mechanisms to easily solve many of the cases which cannot be covered automatically. It is great for copying your custom object hierarchies, and saves you the pain of a solution like implementing ICloneable for an entire hierarchy of objects. Click here to grab it now, and read on for a presentation.

Let’s start off with a few words on extension methods. They are best explained through an example. Let’s say we want to be able to calculate area given size. Wouldn’t it be nice to be able to add GetArea to the already existing Size class? Well, let’s do so!

public static class ExtensionMethods
{
  public static int GetArea(this Size size)
  {
    return size.Width * size.Height;
  }
}

As you can see, the new syntax simply lets you tell the compiler that the this parameter of the method is a Size. This means that the method is an extension of the Size class.
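
Calling the extension looks like calling an instance method, but the compiler turns it into an ordinary static call. The sketch below repeats GetArea so it compiles on its own; SizeExtensions and ExtensionDemo are illustrative names, and Size is System.Drawing.Size.

```csharp
using System;
using System.Drawing;

public static class SizeExtensions
{
    // Same method as in the post, repeated so the snippet is self-contained.
    public static int GetArea(this Size size)
    {
        return size.Width * size.Height;
    }
}

public static class ExtensionDemo
{
    public static void Run()
    {
        Size size = new Size(3, 4);

        int viaExtension = size.GetArea();            // extension syntax
        int viaStatic = SizeExtensions.GetArea(size); // what it compiles to

        Console.WriteLine(viaExtension); // 12
        Console.WriteLine(viaStatic);    // 12
    }
}
```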

As mentioned, I had the idea of extending the very base of the C# class hierarchy (System.Object) with a method for copying or cloning “any” object. Obviously, the method cannot automatically copy any object, since it cannot possibly know how to construct an object from an arbitrary class. Hence, a small framework needed to be created. The goals were to:

  • Enable copying of many objects automatically.
  • Enable copying of virtually any object with very little effort.
  • Automate and hide away as much as possible (The KISS Principle).

The result is Copyable (pun intended).

The Copyable framework

Copyable is a small framework for copying (or cloning, if you will) objects. The straightforward way of using it is to just reference the assembly it’s in from your project, and start copying!

SomeType instance = new SomeType();
// ...do lots of stuff to the object...
SomeType copy = instance.Copy(); // Create a deep copy

The instance copy is now a deep copy of instance, no matter how complex the object graph for instance is. The relations in the copy graph are the same as in instance, but all objects in the copy object graph are copies of those in instance.

For the automated copy to work, though, one of the following statements must hold for instance:

  • Its type must have a parameterless constructor, or
  • It must be a Copyable, or
  • It must have an IInstanceProvider registered for its type.

Besides the Copy method, the Copyable class and the IInstanceProvider interface are the two major building blocks of the Copyable framework. Each of these blocks enables copying of objects that cannot be copied automatically.

The Copyable base class

Copyable is an abstract base class for objects that can be copied. To create a copyable class, you simply subclass Copyable and call its constructor with the arguments of your constructor.

class MyClass : Copyable
{
  public MyClass(int a, double b, string c)
    : base(a, b, c)
  {
  }
}

The code above makes MyClass a copyable class. Note that if MyClass had had a parameterless constructor, subclassing Copyable would not be necessary.

MyClass can now be copied just like the previous example.

MyClass a = new MyClass(1, 2.0, "3");
MyClass b = a.Copy();

The introduction of the Copyable base class solves many problems, but not all. Let’s say you wanted to copy a System.Drawing.SolidBrush. This class does not have a parameterless constructor, which means it cannot be copied “automatically” by the framework. Also, you cannot alter it so that it subclasses Copyable. So, what do you do? You create an instance provider.

The IInstanceProvider interface

An instance provider is defined by the interface IInstanceProvider. As the name clearly states, the implementation is a provider of instances. One instance provider can provide instances of one given type. The Copyable framework automatically detects IInstanceProvider implementations in all assemblies in its application domain, so all you need to do to create a working instance provider is to define it. No registration or other additional operations are required. To simplify the implementation of instance providers and the IInstanceProvider interface, an abstract class InstanceProvider is included in the framework.

public class SolidBrushProvider : InstanceProvider<SolidBrush>
{
  public override SolidBrush CreateTypedCopy(SolidBrush s)
  {
    return new SolidBrush(s.Color);
  }
}

This implementation will be used automatically by the Copyable framework. NOTE: To be usable, the instance provider MUST have a parameterless constructor.
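
For the curious, automatic detection of this kind is usually done by scanning the loaded assemblies with reflection. The sketch below illustrates the general technique, not Copyable's actual code: IProviderSketch, StringProviderSketch and ProviderScanner are made-up stand-ins. It also shows why the parameterless constructor matters: the scanner has no arguments to pass when it creates the provider.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for an instance provider interface.
public interface IProviderSketch
{
    Type Provided { get; }
}

// Hypothetical provider, discovered only because it is defined.
public sealed class StringProviderSketch : IProviderSketch
{
    public Type Provided { get { return typeof(string); } }
}

public static class ProviderScanner
{
    public static Dictionary<Type, IProviderSketch> FindProviders()
    {
        var providers = new Dictionary<Type, IProviderSketch>();
        foreach (Assembly assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            Type[] types;
            try { types = assembly.GetTypes(); }
            catch (ReflectionTypeLoadException e)
            {
                // Some types failed to load; keep the ones that did.
                types = e.Types.Where(t => t != null).ToArray();
            }

            foreach (Type type in types)
            {
                if (type.IsAbstract || type.IsInterface || type.ContainsGenericParameters) continue;
                if (!typeof(IProviderSketch).IsAssignableFrom(type)) continue;
                // Without a public parameterless constructor there is no way to create it.
                if (type.GetConstructor(Type.EmptyTypes) == null) continue;

                var provider = (IProviderSketch)Activator.CreateInstance(type);
                providers[provider.Provided] = provider;
            }
        }
        return providers;
    }
}
```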

The instance provider pattern does not solve the case where you want different initial states for your SolidBrush instances depending on which context you use them for copying. For those cases, an overload of Copy() exists which takes an already created instance as an argument. This argument will become the copy.

SolidBrush instance = new SolidBrush(Color.Red);
instance.Color = Color.Black;
SolidBrush copy = new SolidBrush(Color.Red);
instance.Copy(copy); // Create a deep copy

In this example, copy is now of the color Color.Black.

Limitations and pitfalls

Although this solution works in most cases, it’s not a silver bullet. Be careful when you copy classes that hold unmanaged resources such as handles. If these classes are designed on the premise that their resources are exclusive to them, they will manage them as they see fit. Imagine that you copied a class which holds a handle, disposed one of the instances, and continued using the other. The handle will (probably) be freed by the disposed instance, and the remaining instance will generate an access violation by attempting to read or write freed memory.

That’s it! The Copyable framework can be downloaded from here. For those interested in reading more on extension methods, MSDN provides an excellent explanation in the C# Programming Guide, and Scott Guthrie has an introductory article here.

Enjoy Copyable, and please let me know if you find it useful or come across any problems with it.

UPDATE 2009-12-11: Due to popular demand, I have made the source code for Copyable available under the MIT license. The source can be downloaded here.

UPDATE 2010-01-31: The requirement of parameterless constructors has been removed in the latest version of Copyable available on GitHub. A new release will follow soon.