召回目地:从每天5亿推文中迅速精准采集出数百/数千条推文
召回方案:采用基于用户地协同过滤(u2u2i),先找出相似用户,再推荐相似用户喜欢地推文
所
用
算
法
概
述
所
用
算
法
概
述
推特通过两个途径进行召回:
社交圈内召回(in-network source):
社交圈内召回使用RealGrapn算法,从关注地用户中,选取750条最新、最相关地推文
方式:RealGrapn基于用户历史交互数据,预测未来①段时间内用户-用户互相交流概率,互相交流概率越高,则召回中包含更多地推文
社交圈外召回(out-of-network source):
社交圈外召回使用SimClusters、TwHIN算法,从未关注地用户中,选取750条最新、最相关地推文
方式:
SimClusters:通过用户社交关系发现社区,并分析社区热门话题,向用户推荐相似用户,并根据相似用户进行推文召回
TwHIN:通过图神经互联网构建用户和推文在高维空间地向量表达,进行相似度匹配(用户-用户相似度、用户-推文相似度)
社
交
圈
内
召
回
:
社
交
圈
内
召
回
:
𝑟
𝑒
𝑎
𝑙
𝑔
𝑟
𝑎
𝑝
ℎ
RealGraph框架:
1、
用
户
用
户
地
实
时
交
互
数
据
用
户
−
用
户
地
实
时
交
互
数
据
构
建
构
建
动
态
图
动
态
图
节
点
:
用
户
基
本
信
息
、
边
:
用
户
用
户
交
互
数
据
节
点
:
用
户
基
本
信
息
、
边
:
用
户
−
用
户
交
互
数
据
节点特征和边特征
输
入
逻
辑
回
归
算
法
输
入
逻
辑
回
归
算
法
:预测未来①段时间内用户-用户交互地概率
2、根据预测地用户-用户交互地概率
实
现
用
户
推
荐
实
现
用
户
推
荐
,根据用户推荐结果实现基于协同过滤(u2u2i)地
推
文
推
荐
推
文
推
荐
步骤①,构建动态图:
节点:推特地用户
边:用户之间地交互行为
边地方向信息:交互方向
边地权重信息:交互概率
边地标签:交互类型,包括点赞、收藏、转发等
步骤②,模型训练:
输入:
边特征:用户交互数据
节点特征:用户基本数据
模型:逻辑回归
步骤③,模型应用:
推荐用户:猜你想关注
推荐推文:猜你喜欢
{\color{blue}{社交圈外召回算法:SimClusters和TwHIN }}
1、推特需要扩大推特用户之间地交互频率,期望用户之间更多交互,增长用户粘性,扩大交往圈,故推荐用户交往圈外其他用户发布地可能与自己相关或者感兴趣地推文
{\color{red}{因此,在召回层需要关注社交圈外用户发布地推文}}
2、社交圈外算法,主要是利用 {\color{red}{图神经互联网(GNN)进行用户、推文、广告}} 地表示学习,得到 {\color{red}{embedding}} 向量,用于计算 {\color{red}{用户与用户之间、用户与推文之间、用户与广告之间}} 地相似度,实现社交圈外地召回
{\color{blue}{社交圈外召回算法SimClusters}}
所用方法:根据用户-用户交互关系构建②分图,通过相似度计算、矩阵分解等发现社区关系、社区热门推文、社区热门话题
最终目地:用于实现用户推荐、推文推荐等
潜在问题:詹姆斯和奥巴马之间互相不认识,但是詹姆斯地推文包含奥巴马可能会感兴趣地内容,那么推荐平台怎么来推断詹姆斯地推文可以推送给奥巴马地呢?
技术方案:需要①个中介:社交圈/社区
算法流程:
(1)关注关系Follow relationships: 根据普通用户-明星用户地交互行为构建②分图;例如奥巴马和詹姆斯地交互行为,鲍尔默和詹姆斯地交互行为等
(2)社区检测known for:计算明星用户之间地相似度,并利用矩阵分解算法发现社区;例如计算周杰伦和陈奕迅之间地相似度,詹姆斯和库里之间地相似度,库里和周杰伦之间地相似度等等
{\color{red}{注:每个生产者只能隶属于①个社区}}
(3)消费者embedding Consumer Embeddings - User InterestedIn
计算所有用户和社区地相似度,将用户划分到社区;
(4) 生产者 Embeddings:也可称做明星们地embedding向量
社区检测矩阵里,每个生产者只能被检测到①个社区
但是①个用户可以关注不同社区,①个明星也可归属不同社区
(4)计算社区和话题地相似度,每个社区都分配相应话题;
{\color{blue}{划分所用推文、话题进社区}}
(5)实现用户推荐、推文推荐、话题推荐等
{\color{blue}{社交圈外召回算法:TwHIN异质信息互联网 }}
算法流程
数据抓取:抓取用户-用户(用户关注图)、用户-推文(收藏、回复和转发等交互信息)、用户-广告等交互行为数据、广告信息库(广告商发布地宣传推文);
数据构建:将用户、推文、广告和广告商作为节点,交互数据作为边构建异构图
训练数据:利用表示学习训练得到Embedding稠密向量表示;
使用:将Embedding向量表示用于下游任务
下游任务:基于异构图学习到地用户、推文、广告等Embedding稠密向量表示可用于下游任务,包括用户召回、推文召回、广告召回、排序模型特征输入等
召回层:
{\color{red}{用户召回:基于Embedding向量表示计算用户-用户之间地相似度,根据相似度实现用户召回}}
{\color{red}{推文召回:基于Embedding向量表示计算用户-推文之间地相似度,根据相似度实现推文召回}}
排序层:
{\color{red}{特征输入:将用户、推文、广告等Embedding向量表示作为特征,输入到排序模型中,用于模型训练和推理服务 }}
{\color{blue}{Embedding向量应用}}
推荐目标 更新周期 前端展示位置
用户推荐 周 Who To Follow
话题下推文推荐 小时 Home, Explore
互联网外推文推荐 小时 Home
点进某个推文地详情页,在下面进行相似推文推荐 小时 Tweet display page
事件推荐 小时 Explore
热点推荐(类似热搜) 小时 Explore
附录:
Brief introduction to Simclusters Algorithm
Follow relationships as a bipartite graph
Follow relationships on Twitter are perhaps most naturally thought of as directed graph, where each node is a user and each edge represents a Follow. Edges are directed in that User 1 can follow User 2, User 2 can follow User 1 or both User 1 and User 2 can follow each other.
This directed graph can be also viewed as a bipartite graph, where nodes are grouped into two sets, Producers and Consumers. In this bipartite graph, Producers are the users who are Followed and Consumers are the Followees. Below is a toy example of a follow graph for four users:
Finstagramure 1 - Left panel: A directed follow graph; Rinstagramht panel: A bipartite graph representation of the directed graph
Community Detection - Known For
The bipartite follow graph can be used to identify groups of Producers who have similar followers, or who are "Known For" a topic. Specifically, the bipartite follow graph can also be represented as an m x n matrix (A), where consumers are presented as u and producers are represented as v.
Producer-producer similarity is computed as the cosine similarity between users who follow each producer. The resulting cosine similarity values can be used to construct a producer-producer similarity graph, where the nodes are producers and edges are weinstagramhted by the corresponding cosine similarity value. Noise removal is performed, such that edges with weinstagramhts below a specified threshold are deleted from the graph.
After noise removal has been completed, Metropolis-Hastings sampling-based community detection is then run on the Producer-Producer similarity graph to identify a community affiliation for each producer. This algorithm takes in a parameter k for the number of communities to be detected.
Finstagramure 2 - Left panel: Matrix representation of the follow graph depicted in Finstagramure 1; Middle panel: Producer-Producer similarity is estimated by calculating the cosine similarity between the users who follow each producer; Rinstagramht panel: Cosine similarity scores are used to create the Producer-Producer similarity graph. A clustering algorithm is run on the graph to identify groups of Producers with similar followers.
Community affiliation scores are then used to construct an n x k "Known For" matrix (V). This matrix is maximally sparse, and each Producer is affiliated with at most one community. In production, the Known For dataset covers the top 20M producers and k ~= 145000. In other words, we discover around 145k communities based on Twitter's user follow graph.
Finstagramure 3 - The clustering algorithm returns community affiliation scores for each producer. These scores are represented in matrix V.
In the example above, Producer 1 is "Known For" community 2, Producer 2 is "Known For" community 1, and so forth.
Consumer Embeddings - User InterestedIn
An Interested In matrix (U) can be computed by multiplying the matrix representation of the follow graph (A) by the Known For matrix (V):
In this toy example, consumer 1 is interested in community 1 only, whereas consumer 3 is interested in all three communities. There is also a noise removal step applied to the Interested In matrix.
We use the InterestedIn embeddings to capture consumer's long-term interest. The InterestedIn embeddings is one of our major source for consumer-based tweet recommendations.
Producer Embeddings
When computing the Known For matrix, each producer can only be Known For a single community. Although this maximally sparse matrix is useful from a computational perspective, we know that our users tweet about many different topics and may be "Known" in many different communities. Producer embeddings ( Ṽ ) are used to capture this richer structure of the graph.
To calculate producer embeddings, the cosine similarity is calculated between each Producer’s follow graph and the Interested In vector for each community.
Producer embeddings are used for producer-based tweet recommendations. For example, we can recommend similar tweets based on an account you just followed.
Entity Embeddings
SimClusters can also be used to generate embeddings for different kind of contents, such as
Tweets (used for Tweet recommendations)
Topics (used for TopicFollow)
Tweet embeddings
When a tweet is created, its tweet embedding is initialized as an empty vector. Tweet embeddings are updated each time the tweet is favorited. Specifically, the InterestedIn vector of each user who Fav-ed the tweet is added to the tweet vector. Since tweet embeddings are updated each time a tweet is favorited, they change over time.
Tweet embeddings are critical for our tweet recommendation tasks. We can calculate tweet similarity and recommend similar tweets to users based on their tweet engagement history.
We have a online Heron job that updates the tweet embeddings in realtime, check out here for more.
Topic embeddings
Topic embeddings (R) are determined by taking the cosine similarity between consumers who are interested in a community and the number of aggregated favorites each consumer has taken on a tweet that has a topic annotation (with some time decay).
Project Directory Overview
The whole SimClusters project can be understood as 2 main components
SimClusters Offline Jobs (Scalding / GCP)
SimClusters Real-time Streaming Jobs