$$\frac{1}{n}\sum_{i=1}^{n} u_i\, y_i\, \overline{W}\cdot\overline{X_i} \;\geq\; \frac{1}{n}\sum_{i=1}^{n} u_i - \xi \qquad \forall\, \overline{U}=(u_1,\ldots,u_n) \in \{0,1\}^n$$
The optimization formulation (OP2) differs from (OP1) in that it has only one slack variable ξ, but 2^n constraints, each of which corresponds to the sum of a subset of the constraints in (OP1). It can be shown that a one-to-one correspondence exists between the solutions of (OP1) and (OP2).
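For ease of reference, the two programs can be restated compactly. This is a sketch based on the standard structural SVM construction; the bias term is omitted and the regularization constant C is carried over from the earlier statement of (OP1):

```latex
% (OP1): soft-margin SVM with one slack variable per training example
\begin{align*}
\text{(OP1)}\quad & \min_{\overline{W},\;\xi_i \ge 0} \;\; \tfrac{1}{2}\|\overline{W}\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i \\
& \text{subject to } \; y_i\,\overline{W}\cdot\overline{X_i} \;\ge\; 1 - \xi_i \qquad \forall i \in \{1,\ldots,n\}
\end{align*}

% (OP2): a single slack variable shared by 2^n averaged constraints
\begin{align*}
\text{(OP2)}\quad & \min_{\overline{W},\;\xi \ge 0} \;\; \tfrac{1}{2}\|\overline{W}\|^2 + C\,\xi \\
& \text{subject to } \; \frac{1}{n}\sum_{i=1}^{n} u_i\,y_i\,\overline{W}\cdot\overline{X_i} \;\ge\; \frac{1}{n}\sum_{i=1}^{n} u_i - \xi
\qquad \forall\, \overline{U} \in \{0,1\}^n
\end{align*}
```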
Lemma 13.5.1 A one-to-one correspondence exists between solutions of (OP1) and (OP2), with equal values of $\overline{W} = \overline{W}^*$ in both models, and $\xi^* = \frac{\sum_{i=1}^{n} \xi_i^*}{n}$.
Proof: We will show that if the same value of W is fixed for (OP1) and (OP2), then it leads to the same objective function value. The first step is to derive the slack variables in terms of this value of W for (OP1) and (OP2). For problem (OP1), it can be derived from the slack constraints that the optimal value of ξ_i is ξ_i = max{0, 1 − y_i W · X_i}, because this minimizes the slack penalty. For problem (OP2), a similar result can be obtained for ξ:
$$\xi = \max_{u_1 \ldots u_n} \left[ \frac{1}{n}\sum_{i=1}^{n} u_i \;-\; \frac{1}{n}\sum_{i=1}^{n} u_i\, y_i\, \overline{W}\cdot\overline{X_i} \right] \qquad (13.24)$$
Because this function is linearly separable in ui, one can push the maximum inside the summation, and independently optimize for each ui:
$$\xi = \sum_{i=1}^{n} \max_{u_i}\; u_i \left[ \frac{1}{n} - \frac{1}{n}\, y_i\, \overline{W}\cdot\overline{X_i} \right] \qquad (13.25)$$
For optimality, the value of u_i should be picked as 1 only for the positive values of (1/n) − (1/n) y_i W · X_i, and as 0 otherwise. Therefore, one can show the following:
$$\xi = \sum_{i=1}^{n} \max\left\{0,\; \frac{1}{n} - \frac{1}{n}\, y_i\, \overline{W}\cdot\overline{X_i}\right\} \qquad (13.26)$$

$$\;\;= \frac{1}{n}\sum_{i=1}^{n} \max\left\{0,\; 1 - y_i\, \overline{W}\cdot\overline{X_i}\right\} \;=\; \frac{\sum_{i=1}^{n} \xi_i}{n}. \qquad (13.27)$$
This one-to-one correspondence between the optimal values of W in (OP1) and (OP2) implies that the two optimization problems are equivalent.
Thus, by determining the optimal solution to problem (OP2), it is possible to determine the optimal solution to (OP1) as well. Of course, it is not yet clear why (OP2) is a better formulation than (OP1). After all, problem (OP2) contains an exponential number of constraints, and it seems intractable even to enumerate the constraints, let alone solve them.
Even so, the optimization formulation (OP2) does have some advantages over (OP1). First, a single slack variable measures the feasibility of all the constraints. This implies that all constraints can be expressed in terms of (W, ξ). Therefore, if one were to solve the optimization problem with only a subset of the 2^n constraints, and the remaining constraints were satisfied to a precision of ε by (W, ξ), then it is guaranteed that (W, ξ + ε) is feasible for the full set of constraints.
The key is to never use all the constraints explicitly. Rather, a small subset WS of the 2^n constraints is used as the working set. We start with an empty working set WS. The corresponding optimization problem is solved, and the most violated constraint among the constraints not in WS is added to the working set. The vector U for the most violated constraint is relatively easy to find. This is done by setting u_i to 1 if y_i W · X_i < 1, and to 0 otherwise. Therefore, the iterative steps for adding to the working set WS are as follows:
1. Determine the optimal solution (W, ξ) for the objective function of (OP2) using only the constraints in the working set WS.
2. Determine the most violated constraint among the 2^n constraints of (OP2) by setting u_i to 1 if y_i W · X_i < 1, and to 0 otherwise.
3. Add the most violated constraint to WS.
The termination criterion is the case where the most violated constraint is violated by no more than ε. This provides an approximate solution to the problem, depending on the desired precision level ε. A sketch of this working-set loop is shown below.
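The following is a minimal Python sketch of the working-set procedure, not the SVMPerf implementation itself. The quadratic subproblem over the constraints in WS is solved here with a generic constrained optimizer; the function names, the precision parameter eps, and the regularization constant C are illustrative assumptions.

```python
# Cutting-plane / working-set training for the structural formulation (OP2).
# A minimal sketch; names, parameters, and the inner solver are illustrative.
import numpy as np
from scipy.optimize import minimize

def train_op2(X, y, C=1.0, eps=1e-3, max_iters=100):
    """X: (n, d) feature matrix, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    working_set = []          # list of 0/1 vectors u, one per selected constraint
    w, xi = np.zeros(d), 0.0

    def solve_subproblem():
        # Minimize (1/2)||w||^2 + C*xi subject to the constraints in the working set.
        def objective(z):
            w_, xi_ = z[:d], z[d]
            return 0.5 * np.dot(w_, w_) + C * xi_
        cons = [{'type': 'ineq',          # xi >= 0
                 'fun': lambda z: z[d]}]
        for u in working_set:
            cons.append({'type': 'ineq',  # (1/n) sum u_i y_i w.x_i >= (1/n) sum u_i - xi
                         'fun': lambda z, u=u:
                             (u * y * (X @ z[:d])).sum() / n - u.sum() / n + z[d]})
        z0 = np.concatenate([w, [xi]])
        res = minimize(objective, z0, constraints=cons, method='SLSQP')
        return res.x[:d], res.x[d]

    for _ in range(max_iters):
        # Most violated constraint: u_i = 1 wherever the margin is violated.
        u = (y * (X @ w) < 1.0).astype(float)
        violation = u.sum() / n - (u * y * (X @ w)).sum() / n
        if violation <= xi + eps:     # no constraint violated by more than eps
            break
        working_set.append(u)
        w, xi = solve_subproblem()
    return w, xi
```

Under these assumptions, train_op2(X, y) returns a weight vector and slack value that satisfy every constraint of (OP2) to within eps and, by Lemma 13.5.1, approximately solve (OP1) as well.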
This algorithm has several desirable properties. It can be shown that the time required to solve the problem for a constant-size working set WS is O(n·s), where n is the number of training examples and s is the number of nonzero attributes per example. This is important for the text domain, where the number of nonzero attributes per document is small. Furthermore, the algorithm usually terminates in a small constant number of iterations. Therefore, the working set WS never exceeds a constant size, and the entire algorithm terminates in O(n·s) time.
13.6 Novelty and First Story Detection
The problem of first story detection is a popular one in the context of temporal text stream mining applications. The goal is to determine novelties from the underlying text stream based on the history of previous text documents in the stream. This problem is particularly important in the context of streams of news documents, where a first story on a new topic needs to be reported as soon as possible.
A simple approach is to compute the maximum similarity of the current document with all previous documents, and to report documents with very low maximum similarity values as novelties. Alternatively, the inverse of the maximum similarity value can be continuously reported as a streaming novelty score or alarm level. The major problem with this approach is that the stream size increases continuously with time, and one has to compute the similarity with all previous documents. One possibility is to use reservoir sampling to maintain a constant-size sample of documents. The inverse of the maximum similarity of an incoming document to the documents in this sample is reported as its novelty score. The major drawback of this approach is that the similarity between individual pairs of documents is often not a stable representation of the aggregate trends. Text documents are sparse, and pairwise similarity often does not capture the impact of synonymy and polysemy.
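The following is a minimal sketch of this reservoir-based scoring scheme, assuming documents arrive as dense tf-idf (or raw term-frequency) vectors; the function names, the reservoir size, and the convention of reporting an infinite score for a document with no overlap are illustrative assumptions.

```python
# Streaming novelty scoring with a reservoir sample and cosine similarity.
# A minimal sketch; the vector representation and parameter names are illustrative.
import random
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def novelty_scores(doc_stream, reservoir_size=1000, seed=0):
    """Yield a novelty score (inverse of maximum similarity) for each incoming vector."""
    rng = random.Random(seed)
    reservoir, seen = [], 0
    for doc in doc_stream:
        # Score the incoming document against the reservoir of earlier documents.
        max_sim = max((cosine(doc, old) for old in reservoir), default=0.0)
        yield 1.0 / max_sim if max_sim > 0 else float('inf')
        # Standard reservoir sampling keeps a uniform sample of the history.
        seen += 1
        if len(reservoir) < reservoir_size:
            reservoir.append(doc)
        else:
            j = rng.randrange(seen)
            if j < reservoir_size:
                reservoir[j] = doc
```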
13.6.1 Micro-clustering Method
The micro-clustering method can be used to maintain online clusters of the text documents. The idea is that micro-clustering simultaneously determines the clusters and the novelties in the underlying text stream. The basic micro-clustering method is described in Sect. 12.4 of Chap. 12. The approach maintains k different cluster centroids, or cluster digests. For an incoming document, its similarity to all the centroids is computed. If the largest of these similarities exceeds a user-defined threshold, then the document is added to the corresponding cluster. The word frequencies of that centroid are updated by adding the frequencies of the words in the document to it. After each such update, only the r most frequent words in the centroid are retained, where the typical value of r varies between 200 and 400. On the other hand, when the incoming document is not sufficiently similar to one of the centroids, then it is reported
as a novelty, or as a first story. A new cluster containing the singleton document is created. To make room for the new centroid, one of the old centroids needs to be removed. This is achieved by maintaining the last update time of each cluster and removing the most stale (least recently updated) cluster. This algorithm provides the ability to report the novelties in the text stream in an online fashion. The bibliographic notes contain pointers to more detailed versions of this method.
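A minimal sketch of this micro-clustering loop is shown below. The similarity threshold, the digest size r, the data structures, and all names are illustrative assumptions rather than the exact scheme referenced in the bibliographic notes.

```python
# Micro-clustering based first-story detection on a text stream.
# A minimal sketch; threshold, digest size r, and data structures are illustrative.
from collections import Counter
import math

def cosine_counts(c1, c2):
    # Cosine similarity between two word-frequency Counters.
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

def micro_cluster_stream(doc_stream, k=50, threshold=0.3, r=300):
    """doc_stream yields (timestamp, Counter of word frequencies); yields first stories."""
    clusters = []   # each cluster: {'digest': Counter, 'last_update': timestamp}
    for t, doc in doc_stream:
        sims = [cosine_counts(doc, c['digest']) for c in clusters]
        if sims and max(sims) >= threshold:
            # Assign the document to the closest cluster and update its digest.
            best = max(range(len(clusters)), key=lambda i: sims[i])
            clusters[best]['digest'].update(doc)
            # Retain only the r most frequent words of the digest.
            clusters[best]['digest'] = Counter(dict(
                clusters[best]['digest'].most_common(r)))
            clusters[best]['last_update'] = t
        else:
            # Not similar enough to any centroid: report a first story / novelty.
            yield t, doc
            if len(clusters) >= k:
                # Remove the most stale cluster to make room for the new one.
                stale = min(range(len(clusters)),
                            key=lambda i: clusters[i]['last_update'])
                clusters.pop(stale)
            clusters.append({'digest': Counter(doc), 'last_update': t})
```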
13.7 Summary
The text domain is sometimes challenging for mining purposes because of its sparse and high-dimensional nature. Therefore, specialized algorithms need to be designed.
The first step is the construction of a bag-of-words representation of the text data. Several preprocessing steps need to be applied, such as stop-word removal, stemming, and the removal of digits from the representation. For Web documents, preprocessing techniques are also required to remove the anchor text and to extract text from the main block of the page.
Algorithms for problems such as clustering and classification need to be modified as well. For example, density-based methods are rarely used for clustering text. The k-means methods, hierarchical methods, and probabilistic methods can be suitably modified to work for text data. Two popular methods include the scatter/gather approach and the probabilistic EM-algorithm. The co-clustering method is also commonly used for text data. Topic modeling can be viewed as a probabilistic modeling approach that shares characteristics of both dimensionality reduction and clustering. The problem of novelty detection is closely related to text clustering. Streaming text clustering algorithms can be used for novelty detection. Data points that do not fit in any cluster are reported as novelties.
Among the classification methods, decision trees are not particularly popular for text data. On the other hand, instance-based methods, Bayes methods, and SVM methods are used more commonly. Instance-based methods need to be modified to account for the noise effects of synonymy and polysemy. The multinomial Bayes model is particularly popular for text classification of long documents. Finally, the SVMPerf method is commonly used for efficient text classification with support vector machines.
13.8 Bibliographic Notes
An excellent book on text mining may be found in [377]. This book covers both informa-tion retrieval and mining problems. Therefore, issues such as preprocessing and similarity computation are covered well by this book. Detailed surveys on text mining may be found in [31]. Discussions of the tree matching algorithm may be found in [357, 542].
The scatter/gather approach discussed in this chapter was proposed in [168]. The importance of projecting out infrequent words for efficient document clustering was discussed in [452]. The PLSA discussion is adopted from the paper by Hofmann [271]. The LDA method is a further generalization, proposed in [98]. A survey on topic modeling may be found in [99]. Co-clustering methods for text were discussed in [171, 172, 437]. The co-clustering problem is also studied more generally as biclustering in the context of biological data. A general survey on biclustering methods may be found in [374]. General surveys on text clustering may be found in [31, 32].
The text classification problem has been explored extensively in the literature. The LSA approach was discussed in [184]. Centroid-based text classification was discussed in [249]. A detailed description of different variations of the Bayes model may be found in [31, 33].