Data Mining: The Textbook



This factorization is already directly in the form we want. Therefore, the user and item factor matrices are defined as follows:

\[ F_{\text{user}} = U \tag{18.20} \]
\[ F_{\text{item}} = V \tag{18.21} \]

The main difference from the analysis of Sect. 6.8 is in how the optimization objective function is set up for incomplete matrices. Recall that the matrices U and V are determined by optimizing the following objective function:

\[ J = \|D - U V^T\|^2 \tag{18.22} \]

Here, $\|\cdot\|$ represents the Frobenius norm. In this case, because the ratings matrix D is only partially specified, the optimization is performed only over the specified entries, rather than all the entries. Therefore, the basic form of the optimization problem remains very similar, and it is easy to use any off-the-shelf optimization solver to determine U and V. The bibliographic notes contain pointers to relevant stochastic gradient descent methods.
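Written out over the specified entries only, the objective and one stochastic gradient descent step can be sketched as follows. The symbols $S$ (the set of specified positions in D), $\bar{u}_i$ and $\bar{v}_j$ (the $i$th row of U and the $j$th row of V), $e_{ij}$ (the error on one entry), and $\alpha$ (the step size) are notation assumed here for illustration; they do not appear in the text above.

```latex
\[
J = \sum_{(i,j) \in S} \left( d_{ij} - \bar{u}_i \cdot \bar{v}_j \right)^2,
\qquad
e_{ij} = d_{ij} - \bar{u}_i \cdot \bar{v}_j
\]
% One update for a single specified entry (i, j); both right-hand sides use
% the values from before the step, and the constant factor 2 from the
% derivative is absorbed into the step size \alpha:
\[
\bar{u}_i \leftarrow \bar{u}_i + \alpha \, e_{ij} \, \bar{v}_j,
\qquad
\bar{v}_j \leftarrow \bar{v}_j + \alpha \, e_{ij} \, \bar{u}_i
\]
```

Unspecified entries contribute nothing to $J$, so the cost of one pass over the data grows with the number of specified ratings rather than with the full size of D.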

A regularization term $\lambda(\|U\|^2 + \|V\|^2)$ containing the squared Frobenius norms of U and V may be added to J to reduce overfitting. The regularization term is particularly important when the number of specified entries is small. The value of the parameter λ is determined using cross-validation.
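The following is a minimal sketch of this idea in plain Python, not the book's own implementation. It assumes the specified entries are given as (user, item, rating) triples, and the names `factorize` and `squared_error`, the toy hyperparameters (`k`, `lam`, `lr`, `epochs`), and the small random initialization are all illustrative choices. The regularization shrinkage is applied once per visited entry, which is a common stochastic gradient descent variant rather than an exact gradient of the regularized objective.

```python
import random

def squared_error(ratings, U, V):
    """Sum of squared errors over the specified entries only."""
    total = 0.0
    for i, j, d in ratings:
        pred = sum(u * v for u, v in zip(U[i], V[j]))
        total += (d - pred) ** 2
    return total

def factorize(ratings, n_users, n_items, k=2, lam=0.05, lr=0.02,
              epochs=500, seed=0):
    """Fit D ~ U V^T by SGD using only the specified (user, item, rating) triples."""
    rng = random.Random(seed)
    # Small random initialization (an assumption; other schemes are possible).
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    entries = list(ratings)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(entries)
        for i, j, d in entries:
            err = d - sum(u * v for u, v in zip(U[i], V[j]))
            for s in range(k):
                u, v = U[i][s], V[j][s]
                # Error-driven step plus shrinkage from the lambda term.
                U[i][s] += lr * (err * v - lam * u)
                V[j][s] += lr * (err * u - lam * v)
    return U, V
```

A missing rating for user i and item j can then be estimated as the dot product of `U[i]` and `V[j]`. In practice, `lam` would be chosen by cross-validation, as noted above, rather than fixed by hand.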

This method is more convenient than SVD for determining the factorized matrices because the optimization objective can be set up in a seamless way for an incompletely specified matrix, no matter how sparse it might be. When the ratings are nonnegative, it is also possible to use nonnegative forms of matrix factorization. As discussed in Sect. 6.8, the nonnegative version of matrix factorization provides a number of interpretability advantages. Other forms of factorization, such as probabilistic matrix factorization and maximum margin matrix factorization, are also used. Most of these variants differ only in minor variations of the objective function (e.g., Frobenius norm minimization or likelihood maximization) and the constraints (e.g., nonnegativity) of the underlying optimization problem. These differences often translate to variants of the same stochastic gradient descent approach.

18.6 Web Usage Mining

The usage of the Web leads to a significant amount of log data. There are two primary types of logs that are commonly collected:

Web server logs: These correspond to the user activity on Web servers. Typically, logs are stored in a standardized format, known as the NCSA common log format, to facilitate ease of use and analysis by different programs. A few variants of this format, such as the NCSA combined log format and the extended log format, store a few extra fields. Nevertheless, the number of variants of the basic format is relatively small. An example of a Web log entry is as follows:

98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.pdf HTTP/1.1" 200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25" "retailer.net"
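An entry of this shape can be split into named fields with a regular expression. The sketch below is one possible parser and covers only the layout of the example entry above. The function name `parse_log_line` and the field names are illustrative assumptions. In particular, `extra` is a placeholder name for the final quoted field ("retailer.net" in the example), whose meaning the entry itself does not state.

```python
import re
from datetime import datetime

# Common log format fields, followed by the optional referrer and user-agent
# fields of the combined format and one optional trailing quoted field.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
    r'(?: "(?P<extra>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means that no byte count was logged.
    record["size"] = None if record["size"] == "-" else int(record["size"])
    # Timestamp layout: day/month/year:hour:minute:second zone-offset.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record
```

Applied to the example entry, the parser yields the client address as `host`, the requested document as `path`, and the HTTP status code and response size as integers, which is the form most downstream usage-mining steps need.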

614 CHAPTER 18. MINING WEB DATA
