## Unsupervised learning


**Unsupervised learning**
David Kauchak
CS 451 – Fall 2013

**Administrative**
- Final project
- No office hours today

**Supervised learning**
[Diagram: labeled examples feed into a model/predictor, which produces labels.]
Supervised learning: given labeled examples.

**Unsupervised learning**
Unsupervised learning: given data, i.e. examples, but no labels.

**Unsupervised learning**
Given some examples without labels, do something!

**Unsupervised learning applications**
Learn clusters/groups without any labels:
- customer segmentation (i.e. grouping)
- image compression
- bioinformatics: learn motifs
- find important features
- …

**Unsupervised learning: clustering**
[Diagram: raw data → extract features (f1, f2, f3, …, fn) → group into classes/clusters.]
No "supervision": we're only given data and want to find natural groupings.

**Unsupervised learning: modeling**
Most frequently, when people think of unsupervised learning they think of clustering. Another category is learning probabilities/parameters for models without supervision:
- Learn a translation dictionary
- Learn a grammar for a language
- Learn the social graph

**Clustering**
Clustering: the process of grouping a set of objects into classes of similar objects. Applications?

**Gene expression data**
[Figure: clustered gene expression data. Data from Garber et al., PNAS (98), 2001.]

**Clustering in search advertising**
Find clusters of advertisers and keywords:
- Keyword suggestion
- Performance estimation
[Diagram: advertisers connected by bids to bidded keywords; ~10M nodes.]

**Clustering applications**
Find clusters of users:
- Targeted advertising
- Exploratory analysis
Clusters of the web graph:
- Distributed PageRank computation
[Diagram: who-messages-who IM/text/Twitter graph; ~100M nodes.]

**Data visualization**
[Figure: Wise et al., "Visualizing the non-visual", PNNL ThemeScapes, Cartia. Mountain height = cluster size.]

**A data set with clear cluster structure**
[Figure: a two-dimensional data set with clear cluster structure.]
What are some of the issues for clustering? What clustering algorithms have you seen/used?

**Issues for clustering**
Representation for clustering:
- How do we represent an example (features, etc.)?
- Similarity/distance between examples
Flat clustering or hierarchical?
Number of clusters:
- Fixed a priori?
- Data driven?

**Clustering Algorithms**
Flat algorithms (usually start with a random, possibly partial, partitioning and refine it iteratively):
- K-means clustering
- Model-based clustering
- Spectral clustering
Hierarchical algorithms:
- Bottom-up, agglomerative
- Top-down, divisive
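To make the bottom-up flavor concrete, here is a minimal sketch of agglomerative clustering with single-link merging in pure Python. The tuple-of-floats point representation, Euclidean distance, and stopping at k clusters are assumptions made for the sketch, and the naive all-pairs search is roughly O(n³), so this is illustrative rather than something you'd run at scale.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points (tuples of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Cluster-to-cluster distance: the closest pair across the two clusters."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, k):
    """Bottom-up clustering: start with singletons, merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the two closest clusters...
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # ...and merge them.
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, k=2))
```

A top-down (divisive) algorithm runs in the other direction: start with everything in one cluster and recursively split.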
**Hard vs. soft clustering**
Hard clustering: each example belongs to exactly one cluster.
Soft clustering: an example can belong to more than one cluster (probabilistic).
- Makes more sense for applications like creating browsable hierarchies
- You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

**K-means**
The most well-known and popular clustering algorithm:
Start with some initial cluster centers.
Iterate:
- Assign/cluster each example to the closest center
- Recalculate centers as the mean of the points in a cluster

**K-means: assign points to nearest center**
[Animation: points are assigned to the nearest center, centers are recalculated, and the process repeats. No changes: done.]

**K-means**
Iterate:
- Assign/cluster each example to the closest center
- Recalculate centers as the mean of the points in a cluster
How do we do this? For the assignment step, iterate over each point:
- get the distance to each cluster center
- assign the point to the closest center (hard clustering)
What distance measure should we use?

**Distance measures**
Euclidean distance: good for spatial data.
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

**Clustering documents (e.g. wine data)**
One feature for each word; the value is the number of times that word occurs. Documents are points or vectors in this space.

**When Euclidean distance doesn't work**
Which document is closest to q using Euclidean distance? Which do you think should be closer?

**Issues with Euclidean distance**
The Euclidean distance between q and d2 is large, but the distribution of terms in the query q and the distribution of terms in the document d2 are very similar. This is not what we want!

**cosine similarity**
Correlated with the angle between the two vectors:
$\text{sim}(x, y) = \cos \theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$

**cosine distance**
Cosine similarity is a similarity between 0 and 1, with similar things near 1 and dissimilar things near 0. We want a distance measure, cosine distance:
$d(x, y) = 1 - \text{sim}(x, y)$
- good for text data and many other "real world" data sets
- computationally friendly, since we only need to consider features that have non-zero values in both examples

**K-means**
Iterate:
- Assign/cluster each example to the closest center
- Recalculate centers as the mean of the points in a cluster
Where are the cluster centers, and how do we calculate them?

**K-means**
Mean of the points in the cluster:
$\mu(C) = \frac{1}{|C|} \sum_{x \in C} x$
where $C$ is the set of points assigned to the cluster.

**K-means loss function**
K-means tries to minimize what is called the "k-means" loss function:
$L = \sum_{i=1}^{n} d(x_i, \mu_{k(i)})^2$
that is, the sum of the squared distances from each point $x_i$ to its associated cluster center $\mu_{k(i)}$.

**Minimizing k-means loss**
Iterate:
1. Assign/cluster each example to the closest center
2. Recalculate centers as the mean of the points in a cluster
Does each step of k-means move towards reducing this loss function (or at least not increasing it)?

**Minimizing k-means loss**
This isn't quite a complete proof/argument, but:
1. For a fixed set of centers, any other assignment would end up with a larger loss.
2. The mean of a set of values minimizes the squared error.
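Here is a minimal k-means sketch in pure Python that prints the loss after each assignment step, so you can watch it never increase. Seeding the centers from the first k points, squared Euclidean distance, and the empty-cluster guard are assumptions made for the sketch, not choices prescribed by the slides.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, max_iters=100):
    centers = [points[i] for i in range(k)]  # naive seeding: first k points
    assignments = None
    for it in range(max_iters):
        # Step 1: assign each point to its closest center (hard clustering).
        new_assignments = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                           for p in points]
        loss = sum(sq_dist(p, centers[c])
                   for p, c in zip(points, new_assignments))
        print(f"iteration {it}: loss = {loss:.3f}")
        if new_assignments == assignments:  # converged: partition unchanged
            break
        assignments = new_assignments
        # Step 2: recalculate each center as the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # guard against empty clusters
                centers[c] = mean(members)
    return centers, assignments

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
print(kmeans(points, k=2))
```

Each printed loss is at most the previous one, for exactly the two reasons above: reassignment can only lower (or keep) the loss for fixed centers, and moving a center to the mean can only lower (or keep) the loss for fixed assignments.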
**Minimizing k-means loss**
Iterate:
1. Assign/cluster each example to the closest center
2. Recalculate centers as the mean of the points in a cluster
Does this mean that k-means will always find the minimum loss/clustering?

**Minimizing k-means loss**
No! It will find a minimum. Unfortunately, the k-means loss function is generally not convex, and for most problems it has many, many minima. We're only guaranteed to find one of them.

**K-means variations/parameters**
Start with some initial cluster centers.
Iterate:
- Assign/cluster each example to the closest center
- Recalculate centers as the mean of the points in a cluster
What are some other variations/parameters we haven't specified?

**K-means variations/parameters**
- Initial (seed) cluster centers
- Convergence criteria:
  - a fixed number of iterations
  - partitions unchanged
  - cluster centers don't change
- K itself!
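If you would rather not hand-roll these choices, scikit-learn's `KMeans` exposes each of them directly. A sketch, assuming scikit-learn is installed (`make_blobs` just generates toy data; the specific parameter values here are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy 2-D data

km = KMeans(
    n_clusters=4,   # K, fixed a priori
    init="random",  # initial (seed) cluster centers ("k-means++" is the default)
    n_init=10,      # random restarts: keep the run with the lowest loss
    max_iter=300,   # convergence: cap on the number of iterations
    tol=1e-4,       # convergence: stop when the centers barely move
)
labels = km.fit_predict(X)
print(km.inertia_)  # the k-means loss of the best run
```

Running several random restarts (`n_init`) and keeping the lowest-loss result is the standard hedge against the many local minima discussed above.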