Subjects: Mathematics >> Mathematics (General) submitted time 2022-12-04
Abstract: This paper identifies two causes of overfitting under the cross-entropy loss: boundary samples occupy an ever larger share as the normal vector grows longer, and boundary samples do not fit their probability density function well.
Peer Review Status: Awaiting Review
Subjects: Mathematics >> Mathematics (General) submitted time 2022-10-10
Abstract: In the Weibo app there are many nearly identical images whose only differences are watermarks and resolution. To find the most similar image efficiently, this paper proposes an algorithm named multi-level fingerprint, which consists of five character strings and three vectors. On a dataset of 1 million images from the Weibo app, multi-level fingerprint achieves a precision of 97.69% and a QPS of 345.
Peer Review Status: Awaiting Review
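The abstract does not reveal what the five character strings and three vectors contain, so as a hedged illustration of the general idea of image fingerprinting for near-duplicate detection, here is a standard difference hash (dHash) with Hamming-distance matching; it is not the paper's algorithm.

```python
# Hedged sketch: a standard difference hash, NOT the paper's
# multi-level fingerprint, whose components are not described here.
def dhash(pixels):
    """Compute a difference hash from a 2D list of grayscale pixels.

    Each bit records whether a pixel is brighter than its right neighbour.
    """
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; a small distance means near-duplicates."""
    return sum(x != y for x, y in zip(a, b))

# A watermark or resolution change perturbs only a few bits, so
# near-duplicate images stay close in Hamming distance.
original = [[10, 20, 30], [40, 30, 20]]
watermarked = [[10, 40, 30], [40, 30, 20]]  # small local change
print(hamming(dhash(original), dhash(watermarked)))  # → 1
```

Matching by Hamming distance over short binary codes is what makes this kind of lookup fast enough for millions of images.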
Subjects: Mathematics >> Computational Mathematics submitted time 2020-10-19
Abstract: Before entering a recommender system, an entity name must be embedded into a vector. Popular models such as word2vec are based on the principle that words in the same syntactic position should be embedded into similar vectors. However, a sequence of entity names has no syntactic structure, which leads to low-quality name vectors. Based on the principle that neighbouring names should be embedded into similar vectors, this paper proposes a novel algorithm named name2vec. Name2vec has several new features: vector length equal to 1; a relative weight that solves the low-frequency problem; and an optimization objective of mean squared error rather than cross entropy. The quality of an embedding is measured by the similarity of entity names. On three datasets from WEIBO.COM, name2vec outperforms word2vec.
Peer Review Status: Awaiting Review
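The abstract states the principle (neighbouring names get similar unit-length vectors, trained with a mean-squared-error objective) but not name2vec's actual update rule or relative weighting, so the following is only a minimal sketch of that principle under those assumptions: one neighbour pair pulled together by gradient descent on the squared error, with renormalization to unit length.

```python
import math
import random

# Hedged sketch of the stated principle only; the real name2vec update
# rule and its relative weighting are not given in the abstract.
random.seed(0)

def normalize(v):
    """Rescale a vector to unit length (vector length equals 1)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

names = ["alice", "bob", "carol"]
dim, lr = 4, 0.5
vec = {n: normalize([random.gauss(0, 1) for _ in range(dim)]) for n in names}

# "alice" and "bob" co-occur as neighbours; pull their vectors together
# by gradient descent on the squared error, renormalizing each step.
for _ in range(50):
    a, b = vec["alice"], vec["bob"]
    grad = [ai - bi for ai, bi in zip(a, b)]   # gradient of ||a - b||^2 / 2 in a
    vec["alice"] = normalize([ai - lr * g for ai, g in zip(a, grad)])
    vec["bob"] = normalize([bi + lr * g for bi, g in zip(b, grad)])

cos = sum(x * y for x, y in zip(vec["alice"], vec["bob"]))
print(round(cos, 3))  # → 1.0: neighbouring names end up with similar vectors
```

Because both vectors are kept at unit length, cosine similarity reduces to a dot product, which is why the squared-error objective directly controls the similarity of neighbouring names.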
Subjects: Mathematics >> Computational Mathematics submitted time 2019-11-26
Abstract: This paper proposes a novel clustering algorithm named Sliding Means, aiming to take the place of the k-means algorithm widely used in internet applications. Sliding Means can handle very large datasets and automatically determine the number of clusters. With the help of shuffled samples, bad initial centroids have little chance of being selected. Sliding Means is also able to drop bad centroids on the fly. On the iris and optdigits datasets, Sliding Means achieves better performance (Adjusted Rand Index) than k-means by 9.93% and 5.17%, respectively.
Peer Review Status: Awaiting Review
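The abstract does not spell out Sliding Means' update rule, so as context here is a minimal sketch of the k-means baseline it aims to replace, including the one mechanism the abstract does describe: shuffling the samples so that bad initial centroids are unlikely to be selected.

```python
import random

# Hedged sketch: the k-means baseline, not Sliding Means itself.
random.seed(1)

def kmeans_1d(points, k, iters=20):
    pts = points[:]
    random.shuffle(pts)                       # shuffle before picking centroids
    centroids = pts[:k]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; a centroid whose
        # cluster went empty is where one could be dropped on the fly.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated 1-D blobs around 0 and 10.
data = ([random.gauss(0, 0.2) for _ in range(30)] +
        [random.gauss(10, 0.2) for _ in range(30)])
cents = kmeans_1d(data, k=2)
print([round(c, 1) for c in cents])           # one centroid near each blob
```

Note that plain k-means requires k to be fixed in advance, which is exactly the limitation the abstract says Sliding Means removes by determining the number of clusters automatically.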
Subjects: Mathematics >> Computational Mathematics submitted time 2019-04-10
Abstract: This paper proposes a novel method named Polyhedron Regression (PR) for click-through-rate prediction, aiming to take the place of Factorization Machines (FM). PR constructs a convex polyhedron from hyperplanes to separate positive samples from negative samples. PR has an intuitive geometric interpretation and a Lipschitz-continuous surface, and converges to the global optimum from arbitrary initial values. Compared with FM, PR has better classification accuracy, interpretability, and surface smoothness on three artificial datasets. With comparable parameters and computation, PR achieves a better AUC than FM on the Avazu and Criteo datasets.
Peer Review Status: Awaiting Review
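The training rule of Polyhedron Regression is not given in the abstract, but its geometric picture can be sketched: a convex polyhedron is an intersection of half-spaces, and a sample is classified positive exactly when it satisfies every half-space constraint.

```python
# Hedged sketch of the geometric idea only, not the PR training rule.
def inside_polyhedron(x, halfspaces):
    """halfspaces: list of (w, b); the polyhedron is {x : w.x <= b for all}."""
    return all(sum(wi * xi for wi, xi in zip(w, x)) <= b
               for w, b in halfspaces)

# The unit square as the intersection of four half-spaces.
square = [((1, 0), 1), ((-1, 0), 0), ((0, 1), 1), ((0, -1), 0)]
print(inside_polyhedron((0.5, 0.5), square))  # → True: inside
print(inside_polyhedron((1.5, 0.5), square))  # → False: violates x <= 1
```

This is why the decision region is convex by construction: any intersection of half-spaces is convex, which gives the method its direct geometric interpretation.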
Subjects: Mathematics >> Computational Mathematics submitted time 2018-04-03
Abstract: This paper identifies two causes of overfitting in logistic regression: boundary samples occupy an ever larger share as the normal vector grows longer, and boundary samples do not fit their probability density function well. Building on this insight into overfitting, I propose an acceleration method for logistic regression that achieves a training speedup of 38.25 on the MNIST dataset and 5.61 on the CIFAR10 dataset.
Peer Review Status: Awaiting Review
Subjects: Mathematics >> Computational Mathematics submitted time 2018-03-22
Abstract: This paper identifies two causes of overfitting in logistic regression: boundary samples occupy an ever larger share as the normal vector grows longer, and boundary samples do not fit their probability density function well. Building on this insight into overfitting, I propose an acceleration method for logistic regression that achieves a training speedup of 38.25 on the MNIST dataset and 5.61 on the CIFAR10 dataset.
Peer Review Status: Awaiting Review
Subjects: Mathematics >> Theoretical Computer Science submitted time 2017-11-17
Abstract: This paper proposes a new linear classification method named Focusing Classification, with the goal of taking the place of Logistic Regression. Focusing Classification has several advantages: the length of its normal vector is bounded, it has an intuitive geometric interpretation, and its initial parameter values are close to the optimal values. Numerical experiments on the MNIST dataset demonstrate that Focusing Classification outperforms Logistic Regression in normal-vector length, accuracy, and rate of convergence. With its initial parameter values alone, Focusing Classification attains an accuracy of 97.31%.
Peer Review Status: Awaiting Review