Action recognition in video remains an active research topic. However, current approaches based on supervised learning are hampered by the paucity of high-quality training data. Although a great number of videos are now available on the internet, they are labeled with neither the accuracy nor the temporal granularity required to train such methods. This has motivated interest in semi-supervised approaches that can effectively employ a small amount of labeled data in conjunction with a much larger pool of unlabeled video. We present a complete action recognition system that learns accurate classifiers from as few as three labeled instances of each class. We represent video clips using two complementary sets of features, local appearance and motion, each aggregated into a bag-of-words representation. We extend the concept of co-training to the multi-class setting by co-training a pair of classifiers, one per feature set, for each target action. These classifiers mine the unlabeled video clips to identify additional instances of each action, from which we train a final, highly accurate classifier. Experiments show that the proposed system significantly outperforms existing semi-supervised approaches, such as the transductive SVM, on real-world YouTube videos.
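The per-class co-training loop described above can be illustrated with a minimal sketch. It is not the system's implementation: the nearest-centroid scorer, the synthetic two-dimensional "appearance" and "motion" vectors, the class centres in `MEANS`, and all function names are assumptions made for illustration, standing in for real bag-of-words histograms and the actual classifiers.

```python
import random

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def confidence(x, pos, neg):
    """One-vs-rest score in a single view: larger means closer to the positives."""
    return sq_dist(x, centroid(neg)) - sq_dist(x, centroid(pos))

def co_train(labeled, unlabeled, classes, rounds=5):
    """Grow the labeled set by letting each view's classifier label one clip per turn.

    labeled:   list of ((appearance_vec, motion_vec), label)
    unlabeled: list of (appearance_vec, motion_vec)
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for c in classes:            # one classifier pair per target action
            for view in (0, 1):      # 0 = appearance, 1 = motion
                if not pool:
                    return labeled
                pos = [x[view] for x, y in labeled if y == c]
                neg = [x[view] for x, y in labeled if y != c]
                # Pick the unlabeled clip this view is most confident about.
                i = max(range(len(pool)),
                        key=lambda j: confidence(pool[j][view], pos, neg))
                # The pseudo-labeled clip becomes training data for BOTH views.
                labeled.append((pool.pop(i), c))
    return labeled

def predict(labeled, x, classes):
    """Final classifier: nearest class centroid on the concatenated views."""
    def full(v):
        return list(v[0]) + list(v[1])
    cents = {c: centroid([full(xi) for xi, y in labeled if y == c]) for c in classes}
    return min(classes, key=lambda c: sq_dist(full(x), cents[c]))

# Hypothetical class centres for (appearance view, motion view); synthetic data only.
MEANS = {
    0: ((0.0, 0.0), (0.0, 4.0)),
    1: ((4.0, 0.0), (0.0, 0.0)),
    2: ((0.0, 4.0), (4.0, 0.0)),
}

def make_clip(label, rng):
    """Draw a synthetic clip: one noisy vector per view around the class centre."""
    return tuple([m + rng.gauss(0, 0.5) for m in mean] for mean in MEANS[label])

rng = random.Random(0)
classes = [0, 1, 2]
seed = [(make_clip(c, rng), c) for c in classes for _ in range(3)]  # three labels per class
pool = [make_clip(c, rng) for c in classes for _ in range(20)]      # unlabeled clips
enlarged = co_train(seed, pool, classes)
test = [(make_clip(c, rng), c) for c in classes for _ in range(20)]
accuracy = sum(predict(enlarged, x, classes) == y for x, y in test) / len(test)
print(len(seed), "->", len(enlarged), "labeled clips; test accuracy", accuracy)
```

In this sketch each view adds only its single most confident clip per turn, which keeps pseudo-label noise low at the cost of slow growth; how many clips to add per round and when to stop are design choices that the sketch fixes arbitrarily.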