CDN-simulation

The 2017 AdaptSize paper applied admission control based on file size, which greatly improved the CDN cache hit rate. However, it did not consider other features of a request, such as the file format, geographic location, the transfer protocol, or the type of browser/user agent. All of these may hint at the service the user is currently using, which helps predict the probability that the same file will be requested again in the near future. By extracting and learning from these features, we can make the contents of our hot cache more likely to be accessed, that is, achieve a higher hit rate than the AdaptSize baseline.

For example, if we are using Facebook or TikTok to browse friends' feeds, our requests are very likely to be mostly images and videos, and short videos are often only a few megabytes. Suppose another user downloads a compressed archive of a few megabytes from the official Apache website. We can reasonably assume that the former is far more popular than the latter, so our CDN server would prefer to cache the former. In another scenario, a 2016 NBA video is occasionally replayed in 2019, but the more popular videos tend to be the more recent ones. We want to identify such low-popularity files and avoid caching them.

Another common problem is that content providers sometimes cheat by forging requests so that their content stays resident in the CDN cache. Genuine request patterns look different from forged ones, and a good admission mechanism should recognize cheating and score that content low. We therefore propose using machine learning for predictive admission over this higher-dimensional feature space. Compared with the 2017 AdaptSize paper, there is an XX% performance improvement; compared with the LSTM seq2seq prediction in the 2018 DeepCache paper, our model has lower computational complexity.
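The idea of feature-based admission can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the project's actual model: the `Request` fields, the hand-set weights (which stand in for whatever a trained classifier would learn), the logistic scoring function, and the `AdmissionLRU` class are all hypothetical. It shows only the general shape of the approach: score each missed object from several features and admit it to an LRU cache only when the predicted chance of reuse passes a threshold.

```python
import math
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    key: str            # object identifier
    size: int           # object size in bytes
    content_type: str   # e.g. "video", "image", "archive"
    age_days: float     # time since the object was published


# Hypothetical, hand-set weights standing in for a trained model.
BIAS = 0.5
TYPE_WEIGHTS = {"video": 1.0, "image": 0.8, "archive": -1.5}
SIZE_WEIGHT = -0.3      # applied to log(1 + size in MB)
AGE_WEIGHT = -1.0       # applied to age in years


def reuse_probability(req: Request) -> float:
    """Logistic score: estimated chance the object is requested again soon."""
    z = BIAS
    z += TYPE_WEIGHTS.get(req.content_type, 0.0)
    z += SIZE_WEIGHT * math.log1p(req.size / 1e6)
    z += AGE_WEIGHT * (req.age_days / 365.0)
    return 1.0 / (1.0 + math.exp(-z))


class AdmissionLRU:
    """LRU cache that admits a missed object only if its score passes a threshold."""

    def __init__(self, capacity_bytes: int, threshold: float = 0.5):
        self.capacity = capacity_bytes
        self.threshold = threshold
        self.used = 0
        self.items = OrderedDict()  # key -> size, least recently used first

    def request(self, req: Request) -> bool:
        """Serve one request; return True on a hit, False on a miss."""
        if req.key in self.items:
            self.items.move_to_end(req.key)  # refresh recency on a hit
            return True
        admit = (req.size <= self.capacity
                 and reuse_probability(req) >= self.threshold)
        if admit:
            # Evict least recently used objects until the new one fits.
            while self.used + req.size > self.capacity:
                _, evicted_size = self.items.popitem(last=False)
                self.used -= evicted_size
            self.items[req.key] = req.size
            self.used += req.size
        return False


if __name__ == "__main__":
    cache = AdmissionLRU(capacity_bytes=10_000_000)
    clip = Request("clip", 3_000_000, "video", age_days=1)
    tarball = Request("release.tar.gz", 3_000_000, "archive", age_days=1)
    for req in (clip, clip, tarball, tarball):
        print(req.key, "hit" if cache.request(req) else "miss")
```

With these placeholder weights, a recent short video and a same-sized archive receive different scores even though a size-only threshold would treat them identically. In a real system the weights would come from a model trained on request logs, and further features (user agent, protocol, location) would be added in the same way.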