{{Merge|Multitask optimization|date=August 2024}}
{{short description|Solving multiple machine learning tasks at the same time}}
'''Multi-task learning''' (MTL) is a subfield of [[machine learning]] in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.<ref>Baxter, J. (2000). A model of inductive bias learning" ''Journal of Artificial Intelligence Research'' 12:149--198, [http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume12/baxter00a.pdf On-line paper]</ref><ref>[[Sebastian Thrun|Thrun, S.]] (1996). Is learning the n-th thing any easier than learning the first?. In Advances in Neural Information Processing Systems 8, pp. 640--646. MIT Press. [http://citeseer.ist.psu.edu/thrun96is.html Paper at Citeseer]</ref><ref name=":2">{{Cite journal|url = http://www.cs.cornell.edu/~caruana/mlj97.pdf|title = Multi-task learning|last = Caruana|first = R.|date = 1997|journal = Machine Learning|doi = 10.1023/A:1007379606734|volume=28|pages=41–75|doi-access = free}}</ref>
Inherently, multi-task learning is a [[multi-objective optimization]] problem having [[Trade-off|trade-offs]] between different tasks.<ref>Sener, O., & Koltun, V. (2018). Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems 31 (NeurIPS 2018). https://proceedings.neurips.cc/paper/2018/hash/432aca3a1e345e339f35a30c8f65edce-Abstract.html</ref>
Early versions of MTL were called "hints".<ref>Suddarth, S., Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. EURASIP Workshop. Neural Networks pp. 120-129. Lecture Notes in Computer Science. Springer.</ref><ref>{{cite journal | last1 = Abu-Mostafa | first1 = Y. S. | year = 1990 | title = Learning from hints in neural networks | journal = Journal of Complexity | volume = 6 | issue = 2| pages = 192–198 | doi=10.1016/0885-064x(90)90006-y| doi-access = free }}</ref>


In a widely cited 1997 paper, Rich Caruana gave the following characterization:<blockquote>Multitask Learning is an approach to [[inductive transfer]] that improves [[Generalization error|generalization]] by using the domain information contained in the training signals of related tasks as an [[inductive bias]]. It does this by learning tasks in parallel while using a shared [[Representation learning|representation]]; what is learned for each task can help other tasks be learned better.<ref name=":2"/></blockquote>


In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones; for example, an English speaker may find that all emails in Russian are spam, while a Russian speaker would not. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.{{Citation needed|date=October 2023}} Further examples of settings for MTL include [[multiclass classification]] and [[multi-label classification]].<ref name=":1">{{Cite arXiv|eprint = 1504.03101|title = Convex Learning of Multiple Tasks and their Structure|last = Ciliberto|first = C.|date = 2015 |class = cs.LG}}</ref>


Multi-task learning works because [[Regularization (mathematics)|regularization]] induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents [[overfitting]] by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is if the tasks share significant commonalities and are generally slightly undersampled.<ref name=":bmdl"/> However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.<ref name=":bmdl"/><ref name=":3">Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M., (2012) Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf</ref>
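
As an illustration, the following sketch implements a simple mean-regularized multi-task least squares (one possible instantiation, in the spirit of the mean-regularized method listed under Software below; the data, step size, and penalty weight are hypothetical). Each task keeps its own weight vector, but all tasks are shrunk toward the task mean rather than toward zero:

<syntaxhighlight lang="python">
import numpy as np

def mean_regularized_mtl(Xs, ys, lam=1.0, lr=0.05, iters=2000):
    """Fit one linear model per task while penalizing deviation of each
    task's weights from the across-task mean (illustrative sketch; the
    mean is treated as fixed within each gradient step)."""
    T, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((T, d))
    for _ in range(iters):
        w_bar = W.mean(axis=0)
        for t in range(T):
            resid = Xs[t] @ W[t] - ys[t]
            grad = Xs[t].T @ resid / len(ys[t]) + lam * (W[t] - w_bar)
            W[t] -= lr * grad
    return W

# Three related, undersampled tasks drawn around a common weight vector.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
Xs = [rng.normal(size=(8, 5)) for _ in range(3)]
ys = [X @ (w_true + 0.1 * rng.normal(size=5)) for X in Xs]
W = mean_regularized_mtl(Xs, ys)
</syntaxhighlight>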


==Methods==
The key challenge in multi-task learning is how to combine learning signals from multiple tasks into a single model. This depends strongly on how well different tasks agree with one another, or contradict each other. There are several ways to address this challenge:


===Task grouping and overlap===
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a [[linear combination]] of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with [[Sparse array|sparsity]], overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.<ref>Kumar, A., & Daume III, H., (2012) Learning Task Grouping and Overlap in Multi-Task Learning. http://icml.cc/2012/papers/690.pdf</ref> Task relatedness can be imposed a priori or learned from the data.<ref name=":1"/><ref>Jawanpuria, P., & Saketha Nath, J., (2012) A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf</ref> Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.<ref name=":bmdl">Hajiramezanali, E. & Dadaneh, S. Z. & Karbalayghareh, A. & Zhou, Z. & Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. {{ArXiv|1810.09433}}</ref><ref>Zweig, A. & Weinshall, D. Hierarchical Regularization Cascade for Joint Learning. Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta GA, June 2013. http://www.cs.huji.ac.il/~daphna/papers/Zweig_ICML2013.pdf</ref> For example, sample relevance across tasks can be learned explicitly to guarantee the effectiveness of joint learning across multiple domains.<ref name=":bmdl"/>
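
The grouping-by-sparsity idea can be made concrete with a toy sketch (all names and values here are hypothetical): each task's parameter vector is a sparse combination of columns of a shared basis, and the overlap of nonzero coefficients reveals the groups:

<syntaxhighlight lang="python">
import numpy as np

# Shared basis: columns are latent components common to all tasks.
L = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 1.0, 0.0]])              # shape (d, b)

# Sparse per-task coefficients: row t selects the basis elements task t uses.
S = np.array([[0.9, 0.0, 0.0],               # task 0: basis element 0
              [1.1, 0.0, 0.0],               # task 1: basis element 0
              [0.0, 0.0, 0.8]])              # task 2: basis element 2
W = S @ L.T                                  # task parameter vectors (rows)

# Support overlap: tasks with common nonzero coefficients form a group.
support = (np.abs(S) > 1e-8).astype(int)
print(support @ support.T)                   # tasks 0 and 1 group together
</syntaxhighlight>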


===Exploiting unrelated tasks===
One can attempt to learn a group of principal tasks using a group of auxiliary tasks, unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods that build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be [[orthogonal]]. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.<ref name=":3"/>
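
One way to encode such a penalty is to penalize the squared Frobenius norm of the cross-product of the two groups' representations, which vanishes exactly when the representations are orthogonal (a minimal sketch; shapes and weights are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

def orthogonality_penalty(U, V):
    """||U^T V||_F^2: zero exactly when the column spaces of the two
    group representations are mutually orthogonal."""
    return np.sum((U.T @ V) ** 2)

rng = np.random.default_rng(0)
U = rng.normal(size=(10, 3))   # shared representation, principal tasks
V = rng.normal(size=(10, 3))   # shared representation, auxiliary tasks

# Added (with a weight) to the task losses during training, this term
# discourages the two groups from sharing directions in feature space.
print(orthogonality_penalty(U, V))
</syntaxhighlight>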


=== Transfer of knowledge ===
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large-scale machine learning projects such as the deep [[convolutional neural network]] [[GoogLeNet]],<ref>{{Cite book|arxiv = 1409.4842 |doi = 10.1109/CVPR.2015.7298594 |isbn = 978-1-4673-6964-0|chapter = Going deeper with convolutions |title = 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |pages = 1–9 |year = 2015 |last1 = Szegedy |first1 = Christian |last2 = Liu |first2 = Wei |last3 = Jia |first3 = Yangqing |last4 = Sermanet |first4 = Pierre |last5 = Reed |first5 = Scott |last6 = Anguelov |first6 = Dragomir |last7 = Erhan |first7 = Dumitru |last8 = Vanhoucke |first8 = Vincent |last9 = Rabinovich |first9 = Andrew |s2cid = 206592484 }}</ref> an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Alternatively, the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.<ref>{{Cite web|url = https://www.mit.edu/~9.520/fall15/slides/class24/deep_learning_overview.pdf|title = Deep Learning Overview|last = Roig|first = Gemma|access-date = 2019-08-26|archive-date = 2016-03-06|archive-url = https://web.archive.org/web/20160306020712/http://www.mit.edu/~9.520/fall15/slides/class24/deep_learning_overview.pdf|url-status = dead}}</ref>
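
Both reuse patterns can be sketched in a few lines of PyTorch (a sketch, assuming the torchvision GoogLeNet weights are available; the class count and input are placeholders):

<syntaxhighlight lang="python">
import torch
import torchvision

# GoogLeNet pre-trained on ImageNet, used as the source of representations.
model = torchvision.models.googlenet(weights="DEFAULT")

# Feature extraction: freeze the pre-trained backbone.
for p in model.parameters():
    p.requires_grad = False

# Fine-tuning: replace the final layer for a new task with, say, 10 classes;
# only the new head (whose parameters require gradients) is then trained.
model.fc = torch.nn.Linear(model.fc.in_features, 10)

model.eval()
x = torch.randn(1, 3, 224, 224)   # dummy image batch
logits = model(x)                 # 1 x 10 task-specific outputs
</syntaxhighlight>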


=== Multiple non-stationary tasks ===
Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed ''Group online adaptive learning'' (GOAL).<ref>Zweig, A. & Chechik, G. Group online adaptive learning. Machine Learning, DOI 10.1007/s10994-017-5661-5, August 2017. http://rdcu.be/uFSv</ref> Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from previous experience of another learner to quickly adapt to their new environment. Such group-adaptive learning has numerous applications, from predicting [[Financial modeling|financial time-series]], through content recommendation systems, to visual understanding for adaptive autonomous agents.

=== Multi-task optimization ===

[[Multitask optimization]]: In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models.<ref>{{Cite journal |last1=Standley |first1=Trevor |last2=Zamir |first2=Amir R. |last3=Chen |first3=Dawn |last4=Guibas |first4=Leonidas |last5=Malik |first5=Jitendra |last6=Savarese |first6=Silvio |date=2020-07-13 |title=Which Tasks Should Be Learned Together in Multi-task Learning? |url=https://proceedings.mlr.press/v119/standley20a.html |journal=International Conference on Machine Learning (ICML)|pages=9120–9132 |arxiv=1905.07553 }}</ref> Commonly, MTL models employ task-specific modules on top of a joint feature representation obtained using a shared module. Since this joint representation must capture useful features across all tasks, MTL may hinder individual task performance if the different tasks seek conflicting representations, i.e., if the gradients of different tasks point in opposing directions or differ significantly in magnitude. This phenomenon is commonly referred to as negative transfer. To mitigate this issue, various MTL optimization methods have been proposed. Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics. These methods include subtracting the projection of conflicting gradients,<ref>{{Cite journal |last1=Yu |first1=Tianhe |last2=Kumar |first2=Saurabh |last3=Gupta |first3=Abhishek |last4=Levine |first4=Sergey |last5=Hausman |first5=Karol |last6=Finn |first6=Chelsea |date=2020 |title=Gradient Surgery for Multi-Task Learning |url=https://proceedings.neurips.cc/paper/2020/file/3fe78a8acf5fda99de95303940a2420c-Paper.pdf |journal=Advances in Neural Information Processing Systems |arxiv=2001.06782 }}</ref> applying techniques from game theory,<ref>{{Cite journal |last1=Navon |first1=Aviv |last2=Shamsian |first2=Aviv |last3=Achituve |first3=Idan |last4=Maron |first4=Haggai |last5=Kawaguchi |first5=Kenji |last6=Chechik |first6=Gal |last7=Fetaya |first7=Ethan |date=2022 |title=Multi-Task Learning as a Bargaining Game |url=https://proceedings.mlr.press/v162/navon22a.html |journal=International Conference on Machine Learning |pages=16428–16446 |arxiv=2202.01017 }}</ref> and using Bayesian modeling to get a distribution over gradients.<ref>{{Cite arXiv |last1=Achituve |first1=Idan |last2=Diamant |first2=Idit |last3=Netzer |first3=Arnon |last4=Chechik |first4=Gal |last5=Fetaya |first5=Ethan |date=2024 |title=Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning |class=cs.LG |eprint=2402.04005 }}</ref>
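
A minimal sketch of the first of these aggregation heuristics, projecting away the conflicting component of each task gradient before summing (in the spirit of gradient surgery; the vectors here are toy placeholders):

<syntaxhighlight lang="python">
import numpy as np

def deconflict(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), subtract from
    g_i its projection onto g_j; otherwise return g_i unchanged."""
    dot = g_i @ g_j
    if dot < 0:
        g_i = g_i - (dot / (g_j @ g_j)) * g_j
    return g_i

def combine_task_gradients(grads, seed=0):
    """De-conflict each task gradient against the others (in random
    order), then sum the results into a joint update direction."""
    rng = np.random.default_rng(seed)
    combined = []
    for i, g in enumerate(grads):
        g = g.copy()
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g = deconflict(g, grads[j])
        combined.append(g)
    return np.sum(combined, axis=0)

g1, g2 = np.array([1.0, 0.5]), np.array([-0.8, 1.0])  # conflicting gradients
update = combine_task_gradients([g1, g2])
</syntaxhighlight>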


== Mathematics ==


==== RKHSvv concepts ====
Suppose the training data set is <math>\mathcal{S}_t =\{(x_i^t,y_i^t)\}_{i=1}^{n_t}</math>, with <math>x_i^t\in\mathcal{X}</math>, <math>y_i^t\in\mathcal{Y}</math>, where {{mvar|t}} indexes task, and <math>t \in 1,...,T</math>. Let <math>n=\sum_{t=1}^Tn_t </math>. In this setting there is a consistent input and output space and the same [[loss function]] <math> \mathcal{L}:\mathbb{R}\times\mathbb{R}\rightarrow \mathbb{R}_+ </math> for each task. This results in the regularized machine learning problem:
{{NumBlk|:|<math display="block" id="1"> \min_{f \in \mathcal{H}}\sum _{t=1} ^T \frac{1}{n_t} \sum _{i=1} ^{n_t} \mathcal{L}(y_i^t, f_t(x_i^t))+\lambda ||f||_\mathcal{H} ^2 </math>|{{EquationRef|1}}}}
where <math> \mathcal{H} </math> is a vector-valued reproducing kernel Hilbert space with functions <math> f:\mathcal X \rightarrow \mathcal{Y}^T </math> having components <math> f_t:\mathcal{X}\rightarrow \mathcal {Y} </math>.


==== Separable kernels ====
The form of the kernel {{math|&Gamma;}} induces both the representation of the [[feature space]] and the structuring of the output across tasks. A natural simplification is to choose a ''separable kernel'', which factors into separate kernels on the input space {{mathcal|X}} and on the tasks <math> \{1,...,T\} </math>. In this case the kernel relating scalar components <math> f_t </math> and <math> f_s </math> is given by <math display="inline"> \gamma((x_i,t),(x_j,s )) = k(x_i,x_j)k_T(s,t)=k(x_i,x_j)A_{s,t} </math>. For vector-valued functions <math> f\in \mathcal H </math> we can write <math>\Gamma(x_i,x_j)=k(x_i,x_j)A</math>, where {{mvar|k}} is a scalar reproducing kernel, and {{mvar|A}} is a symmetric positive semi-definite <math>T\times T</math> matrix. Henceforth denote <math> S_+^T=\{\text{PSD matrices} \} \subset \mathbb R^{T \times T} </math>.
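
Concretely, under a separable kernel the joint kernel matrix over all (task, example) pairs is a [[Kronecker product]] of the task matrix and the scalar kernel matrix, as in this small sketch (hypothetical data):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))                          # n = 4 inputs
sq_dists = np.sum((X[:, None] - X[None]) ** 2, -1)
K = np.exp(-sq_dists)                                # scalar (RBF) kernel
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                           # T = 2 task matrix (PSD)

# Joint kernel over all (task, example) pairs, task-major ordering:
Gamma = np.kron(A, K)                                # (nT) x (nT), here 8 x 8
</syntaxhighlight>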


This factorization property, separability, implies that the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by {{mvar|A}}. Methods for non-separable kernels {{math|&Gamma;}} are a current field of research.


For the separable case, the representer theorem reduces to <math display="inline">f(x)=\sum _{i=1} ^N k(x,x_i)Ac_i</math>. The model output on the training data is then {{mvar|KCA}}, where {{mvar|K}} is the <math>n \times n</math> empirical kernel matrix with entries <math display="inline">K_{i,j}=k(x_i,x_j)</math>, and {{mvar|C}} is the <math>n \times T</math> matrix of rows <math>c_i</math>.


With the separable kernel, equation {{EquationNote|1}} can be rewritten as
{{NumBlk|:|<math display="block" id="1"> \min _{C\in \mathbb{R}^{n\times T}} V(Y,KCA) + \lambda tr(KCAC^{\top})</math>|{{EquationRef|P}}}}


where {{mvar|V}} is a (weighted) average of {{mathcal|L}} applied entry-wise to {{mvar|Y}} and {{mvar|KCA}}. (The weight is zero if <math> Y_i^t </math> is a missing observation).
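
For the squared loss <math display="inline">V(Y,KCA)=\tfrac 1 n \|Y-KCA\|_F^2</math>, setting the gradient of ({{EquationNote|P}}) to zero yields the Sylvester-type condition <math display="inline">KCA+\lambda n C=Y</math> (assuming invertible {{mvar|K}} and {{mvar|A}}), which can be solved in closed form through eigendecompositions of {{mvar|K}} and {{mvar|A}}. A sketch under these assumptions:

<syntaxhighlight lang="python">
import numpy as np

def solve_separable_mtl(K, A, Y, lam):
    """Closed-form C for min_C ||Y - K C A||_F^2 / n + lam * tr(K C A C^T),
    using the stationarity condition K C A + lam * n * C = Y (sketch;
    assumes K and A are symmetric positive definite)."""
    n = K.shape[0]
    evK, U = np.linalg.eigh(K)             # K = U diag(evK) U^T
    evA, V = np.linalg.eigh(A)             # A = V diag(evA) V^T
    Yt = U.T @ Y @ V                       # rotate into the eigenbases
    Ct = Yt / (np.outer(evK, evA) + lam * n)
    return U @ Ct @ V.T

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
K = B @ B.T + np.eye(6)                    # positive-definite kernel matrix
A = np.array([[1.0, 0.3],
              [0.3, 1.0]])                 # task-structure matrix
Y = rng.normal(size=(6, 2))
C = solve_separable_mtl(K, A, Y, lam=0.1)
</syntaxhighlight>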


Note the second term in {{EquationNote|P}} can be derived as follows:


:<math>\begin{align}
\|f\|^2_\mathcal{H} &= \left\langle \sum _{i=1} ^n k(\cdot,x_i)Ac_i, \sum _{j=1} ^n k(\cdot ,x_j)Ac_j \right\rangle_{\mathcal H } \\
&= \sum _{i,j=1} ^n \langle k(\cdot,x_i)A c_i, k(\cdot ,x_j)Ac_j\rangle_{\mathcal H } & \text{(bilinearity)} \\
&= \sum _{i,j=1} ^n \langle k(x_i,x_j)A c_i, c_j\rangle_{\mathbb R^T } & \text{(reproducing property)} \\
&= \sum _{i,j=1} ^n k(x_i,x_j) c_i^\top A c_j = tr(KCAC^\top )
\end{align}</math>


==== Known task structure ====
There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.


{{math_theorem|name=Regularizer|1=With the separable kernel, it can be shown (below) that <math display="inline">||f||^2_\mathcal{H} = \sum_{s,t=1}^T A^\dagger _{t,s} \langle f_s, f_t \rangle _{\mathcal H_k} </math>, where <math>A^\dagger _{t,s} </math> is the <math> t,s </math> element of the pseudoinverse of <math> A </math>, and <math>\mathcal H_k </math> is the RKHS based on the scalar kernel <math> k </math>, and <math display="inline"> f_t(x)=\sum _{i=1} ^n k(x,x_i)A_t^\top c_i </math>. This formulation shows that <math>A^\dagger _{t,s} </math> controls the weight of the penalty associated with <math display="inline">\langle f_s, f_t \rangle _{\mathcal H_k} </math>. (Note that <math display="inline">\langle f_s, f_t \rangle _{\mathcal H_k} </math> arises from <math display="inline">||f_t||_{\mathcal H_k} = \langle f_t, f_t \rangle _{\mathcal H_k} </math>.)


{{Proof|
<math>\begin{align}
\|f\|^2_\mathcal{H} &= \left\langle \sum _{i=1} ^n \gamma ((x_i,t_i),\cdot )c_i^{t_i}, \sum _{j=1} ^n \gamma ((x_j,t_j), \cdot )c_j^{t_j}\right\rangle_{\mathcal H } \\
&=\sum _{i,j=1} ^n c_i^{t_i} c_j^{t_j} \gamma ((x_i,t_i),(x_j,t_j)) \\
&=\sum _{i,j=1} ^n \sum _{s,t=1} ^T c_i^{t} c_j^{s} k(x_i,x_j)A_{s,t} \\
&=\sum _{i,j=1} ^n k(x_i,x_j) \langle c_i, A c_j\rangle_{\mathbb R^T} \\
&=\sum _{i,j=1} ^n k(x_i,x_j) \langle c_i, A A^\dagger A c_j\rangle_{\mathbb R^T} \\
&=\sum _{i,j=1} ^n k(x_i,x_j) \langle Ac_i, A^\dagger A c_j\rangle_{\mathbb R^T} \\
&=\sum _{i,j=1} ^n \sum _{s,t=1} ^T (Ac_i)^t (A c_j)^s k(x_i,x_j) A^\dagger_{s,t} \\
&= \sum _{s,t=1} ^T A^\dagger_{s,t} \langle \sum _{i=1} ^n k(x_i,\cdot )(Ac_i)^t, \sum _{j=1} ^n k(x_j,\cdot )(A c_j)^s \rangle _{\mathcal H_k} \\
&= \sum _{s,t=1} ^T A^\dagger_{s,t} \langle f_t, f_s \rangle _{\mathcal H_k}
\end{align}</math>
}}}}
{{math_theorem|name=Output metric|An alternative output metric on <math>\mathcal Y^T </math> can be induced by the inner product <math>\langle y_1,y_2 \rangle _\Theta=\langle y_1,\Theta y_2 \rangle_{\mathbb R^T} </math>. With the squared loss there is an equivalence between the separable kernels <math>k(\cdot,\cdot)I_T </math> under the alternative metric, and <math>k(\cdot,\cdot)\Theta </math> under the canonical metric.}}


{{math_theorem|name=Output mapping|Outputs can be mapped as <math>L:\mathcal Y^T \rightarrow \tilde{\mathcal Y} </math> to a higher dimensional space to encode complex structures such as trees, graphs and strings. For linear maps {{mvar|L}}, with appropriate choice of separable kernel, it can be shown that <math>A=L^\top L</math>.}}


===== Task structure examples =====
Via the regularizer formulation, one can represent a variety of task structures easily.
* Letting <math display="inline">A^\dagger = \gamma I_T + ( \gamma - \lambda)\frac {1} T \mathbf{1}\mathbf{1}^\top </math> (where <math>I_T </math> is the ''T''&times;''T'' identity matrix, and <math display="inline">\mathbf{1}\mathbf{1}^\top </math> is the ''T''&times;''T'' matrix of ones) is equivalent to letting {{math|&gamma;}} control the variance <math display="inline">\sum_t || f_t - \bar f|| _{\mathcal H_k} </math> of tasks from their mean <math display="inline">\frac 1 T \sum_t f_t </math>. For example, blood levels of some biomarker may be taken on {{mvar|T}} patients at <math>n_t</math> time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
* Letting <math> A^\dagger = \alpha I_T +(\alpha - \lambda )M </math>, where <math> M_{t,s} = \frac 1 {|G_r|} \mathbb I(t,s\in G_r) </math>, is equivalent to letting <math> \alpha </math> control the variance measured with respect to a group mean: <math> \sum _{r} \sum _{t \in G_r } ||f_t - \frac 1 {|G_r|} \sum _{s\in G_r} f_s|| </math>. (Here <math> |G_r| </math> is the cardinality of group ''r'', and <math> \mathbb I </math> is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
* Letting <math> A^\dagger = \delta I_T + (\delta -\lambda)L </math>, where <math> L=D-M</math> is the [[Laplacian matrix|Laplacian]] for the graph with [[adjacency matrix]] ''M'' giving pairwise similarities of tasks. This is equivalent to giving a larger penalty to the distance separating tasks ''t'' and ''s'' when they are more similar (according to the weight <math> M_{t,s} </math>); i.e., <math>\delta </math> regularizes <math> \sum _{t,s}||f_t - f_s ||_{\mathcal H _k }^2 M_{t,s} </math>.
* All of the above choices of ''A'' also induce the additional regularization term <math display="inline">\lambda \sum_t ||f_t|| _{\mathcal H_k} ^2 </math> which penalizes complexity in ''f'' more broadly; the sketch below constructs these matrices explicitly.
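
The following sketch constructs the regularizer matrices from the examples above (a toy setting with ''T''&nbsp;=&nbsp;4 tasks; all scalar weights are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

T, gamma, lam, alpha, delta = 4, 1.0, 0.1, 1.0, 0.1
I, ones = np.eye(T), np.ones((T, T))

# Variance from the overall mean (first example).
A_dag_mean = gamma * I + (gamma - lam) * ones / T

# Variance within known groups (second example): groups {0, 1} and {2, 3}.
M = np.zeros((T, T))
for group in ([0, 1], [2, 3]):
    for t in group:
        for s in group:
            M[t, s] = 1.0 / len(group)
A_dag_group = alpha * I + (alpha - lam) * M

# Graph Laplacian penalty (third example), from a task-similarity matrix.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Lap = np.diag(W.sum(axis=1)) - W
A_dag_graph = delta * I + (delta - lam) * Lap
</syntaxhighlight>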




===== Special cases =====
'''[[Regularization by spectral filtering|Spectral penalties]]''' - Dinuzzo ''et al.''<ref>{{Cite journal|url = http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Dinuzzo_54.pdf|title = Learning output kernels with block coordinate descent.|last = Dinuzzo|first = Francesco|date = 2011|journal = Proceedings of the 28th International Conference on Machine Learning (ICML-11)|archive-url = https://web.archive.org/web/20170808223410/http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Dinuzzo_54.pdf|archive-date = 2017-08-08|url-status = dead}}</ref> suggested setting ''F'' as the Frobenius norm <math> \sqrt{tr(A^\top A)}</math>. They optimized {{EquationNote|Q}} directly using block coordinate descent, not accounting for difficulties at the boundary of <math>\mathbb R^{n\times T} \times S_+^T</math>.


'''Clustered tasks learning''' - Jacob ''et al.''<ref>{{Cite journal|title = Clustered multi-task learning: A convex formulation|last = Jacob|first = Laurent|date = 2009|journal = Advances in Neural Information Processing Systems|bibcode = 2008arXiv0809.2085J|arxiv = 0809.2085}}</ref> suggested learning ''A'' in the setting where ''T'' tasks are organized in ''R'' disjoint clusters. In this case let <math> E\in \{0,1\}^{T\times R}</math> be the matrix with <math> E_{t,r}=\mathbb I (\text{task }t\in \text{group }r)</math>. Setting <math> M = I - E^\dagger E^T</math>, and <math> U = \frac 1 T \mathbf{11}^\top </math>, the task matrix <math> A^\dagger </math> can be parameterized as a function of <math> M </math>: <math> A^\dagger(M) = \epsilon _M U+\epsilon_B (M-U)+\epsilon (I-M) </math>, with terms that penalize the mean, the between-cluster variance, and the within-cluster variance of the task predictions, respectively. The set of allowed ''M'' is not convex, but there is a convex relaxation <math> \mathcal S_c = \{M\in S_+^T:I-M\in S_+^T \land tr(M) = r \} </math>. In this formulation, <math> F(A)=\mathbb I(A(M)\in \{A:M\in \mathcal S_C\}) </math>.


===== Generalizations =====
'''Non-separable kernels''' - Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.

==Applications==

===Spam filtering===
Using the principles of MTL, techniques for collaborative [[spam filtering]] that facilitate personalization have been proposed. In large-scale open-membership email systems, most users do not label enough messages for an individual local [[classifier (mathematics)|classifier]] to be effective, while the data is too noisy to be used for a global filter across all users. A hybrid global/individual classifier can be effective at absorbing the influence of users who label emails very diligently from the general public. This can be accomplished while still providing sufficient quality to users with few labeled instances.<ref>Attenberg, J., Weinberger, K., & Dasgupta, A. Collaborative Email-Spam Filtering with the Hashing-Trick. http://www.cse.wustl.edu/~kilian/papers/ceas2009-paper-11.pdf</ref>
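
The hashing-trick construction referenced above can be sketched as follows (hypothetical tokens and dimensions; each token is hashed once globally and once prefixed by the user ID, so a single linear model learns a shared weight plus a per-user correction):

<syntaxhighlight lang="python">
import numpy as np

DIM = 2 ** 20  # size of the shared hashed feature space

def featurize(tokens, user_id, dim=DIM):
    """Hash each token twice: once globally and once per-user.
    (A real system would use a deterministic hash function.)"""
    x = np.zeros(dim)
    for tok in tokens:
        x[hash(tok) % dim] += 1.0                  # global, shared feature
        x[hash(f"{user_id}_{tok}") % dim] += 1.0   # personalized feature
    return x

# A linear classifier over this space acts as a hybrid global/individual
# filter: shared evidence dominates for users with few labeled emails.
x = featurize(["money", "transfer", "urgent"], user_id="u42")
</syntaxhighlight>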

===Web search===
Using boosted [[decision trees]], one can enable implicit data sharing and regularization. This learning method can be used on web-search ranking data sets. One example is to use ranking data sets from several countries. Here, multi-task learning is particularly helpful as data sets from different countries vary largely in size because of the cost of editorial judgments. It has been demonstrated that learning various tasks jointly can lead to significant improvements in performance with surprising reliability.<ref>Chapelle, O., Shivaswamy, P., & Vadrevu, S. Multi-Task Learning for Boosting with Application to Web Search Ranking. http://www.cse.wustl.edu/~kilian/papers/multiboost2010.pdf</ref>

=== RoboEarth ===
In order to facilitate transfer of knowledge, IT infrastructure is being developed. One such project, RoboEarth, aims to set up an open source internet database that can be accessed and continually updated from around the world. The goal is to facilitate a cloud-based interactive knowledge base, accessible to technology companies and academic institutions, which can enhance the sensing, acting and learning capabilities of robots and other artificial intelligence agents.<ref name="RoboEarth">[http://www.roboearth.org/motivation Description of RoboEarth Project]</ref>


==Software package==
A Matlab package called Multi-Task Learning via StructurAl Regularization (MALSAR) <ref>Zhou, J., Chen, J. and Ye, J. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2012. http://www.public.asu.edu/~jye02/Software/MALSAR. [http://www.public.asu.edu/~jye02/Software/MALSAR/Manual.pdf On-line manual]</ref> implements the following multi-task learning algorithms: Mean-Regularized Multi-Task Learning,<ref>Evgeniou, T., & Pontil, M. (2004). [https://web.archive.org/web/20171212193041/https://pdfs.semanticscholar.org/1ea1/91c70559d21be93a4d128f95943e80e1b4ff.pdf Regularized multi–task learning]. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117).</ref><ref>{{cite journal | last1 = Evgeniou | first1 = T. | last2 = Micchelli | first2 = C. | last3 = Pontil | first3 = M. | year = 2005 | title = Learning multiple tasks with kernel methods | url = http://jmlr.org/papers/volume6/evgeniou05a/evgeniou05a.pdf | journal = Journal of Machine Learning Research | volume = 6 | page = 615 }}</ref> Multi-Task Learning with Joint Feature Selection,<ref>{{cite journal | last1 = Argyriou | first1 = A. | last2 = Evgeniou | first2 = T. | last3 = Pontil | first3 = M. | year = 2008a | title = Convex multi-task feature learning | journal = Machine Learning | volume = 73 | issue = 3| pages = 243–272 | doi=10.1007/s10994-007-5040-8| doi-access = free }}</ref> Robust Multi-Task Feature Learning,<ref>Chen, J., Zhou, J., & Ye, J. (2011). [https://www.academia.edu/download/44101186/Integrating_low-rank_and_group-sparse_st20160325-15067-1mftmbg.pdf Integrating low-rank and group-sparse structures for robust multi-task learning]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}}. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.</ref> Trace-Norm Regularized Multi-Task Learning,<ref>Ji, S., & Ye, J. (2009). [http://www.machinelearning.org/archive/icml2009/papers/151.pdf An accelerated gradient method for trace norm minimization]. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 457–464).</ref> Alternating Structural Optimization,<ref>{{cite journal | last1 = Ando | first1 = R. | last2 = Zhang | first2 = T. | year = 2005 | title = A framework for learning predictive structures from multiple tasks and unlabeled data | url = http://www.jmlr.org/papers/volume6/ando05a/ando05a.pdf | journal = The Journal of Machine Learning Research | volume = 6 | pages = 1817–1853 }}</ref><ref>Chen, J., Tang, L., Liu, J., & Ye, J. (2009). [http://leitang.net/papers/ICML09_CASO.pdf A convex formulation for learning shared structures from multiple tasks]. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 137–144).</ref> Incoherent Low-Rank and Sparse Learning,<ref>Chen, J., Liu, J., & Ye, J. (2010). [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783291/ Learning incoherent sparse and low-rank patterns from multiple tasks]. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1179–1188).</ref> Robust Low-Rank Multi-Task Learning, Clustered Multi-Task Learning,<ref>Jacob, L., Bach, F., & Vert, J. (2008). [https://hal-ensmp.archives-ouvertes.fr/docs/00/32/05/73/PDF/cmultitask.pdf Clustered multi-task learning: A convex formulation]. Advances in Neural Information Processing Systems, 2008</ref><ref>Zhou, J., Chen, J., & Ye, J. (2011). [http://papers.nips.cc/paper/4292-clustered-multi-task-learning-via-alternating-structure-optimization.pdf Clustered multi-task learning via alternating structure optimization]. Advances in Neural Information Processing Systems.</ref> and Multi-Task Learning with Graph Structures.


==See also==
{{div col}}
* [[Artificial intelligence]]
* [[Artificial neural network]]
* [[Automated machine learning]] (AutoML)
* [[Evolutionary computation]]
* [[General game playing]]
* [[Human-based genetic algorithm]]
* [[Kernel methods for vector output]]
* [[Machine learning]]
* [[Multitask optimization]]
* [[Robot learning]]
* [[Robotics]]
* [[Transfer learning]]
{{div col end}}


==References==


==External links==
* [https://web.archive.org/web/20041118134329/http://big.cs.uiuc.edu/webpage/cumulativeLearning/cumulativeLearning.html The Biosignals Intelligence Group at UIUC]
* [http://www.cse.wustl.edu/~kilian/research/multitasklearning/multitasklearning.html Washington University in St. Louis Department of Computer Science]


===Software===
* [http://www.public.asu.edu/~jye02/Software/MALSAR/index.html The Multi-Task Learning via Structural Regularization Package]
* [https://web.archive.org/web/20131224113826/http://klcl.pku.edu.cn/member/sunxu/code.htm Online Multi-Task Learning Toolkit (OMT)] A general-purpose online multi-task learning toolkit based on [[conditional random field]] models and [[stochastic gradient descent]] training ([[C Sharp (programming language)|C#]], [[.NET Framework|.NET]])


[[Category:Machine learning]]

Revision as of 10:28, 2 September 2024

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.[1][2][3] Inherently, Multi-task learning is a multi-objective optimization problem having trade-offs between different tasks.[4] Early versions of MTL were called "hints".[5][6]

In a widely cited 1997 paper, Rich Caruana gave the following characterization:

Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.[3]

In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones; for example, an English speaker may find that all emails in Russian are spam, while a Russian speaker would not. Yet there is a definite commonality in this classification task across users; for example, one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.[citation needed] Further examples of settings for MTL include multiclass classification and multi-label classification.[7]
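
As a hedged illustration of this joint-training idea, the sketch below shares one encoder across per-user spam heads; the module names, layer sizes, and random data are assumptions made for the example, not details from the cited work.

```python
import torch
import torch.nn as nn

class SharedSpamModel(nn.Module):
    """Hard parameter sharing: one shared encoder, one head per user (task)."""
    def __init__(self, n_features: int, n_users: int, hidden: int = 64):
        super().__init__()
        # Shared representation trained on all users' emails, capturing
        # common spam cues such as money-transfer text.
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        # One binary spam/not-spam head per user captures per-user quirks.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_users)])

    def forward(self, x: torch.Tensor, user: int) -> torch.Tensor:
        return self.heads[user](self.shared(x))

# Joint training: summing per-user losses lets every user's data update the
# shared encoder, so the solutions inform each other.
model = SharedSpamModel(n_features=300, n_users=3)
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [(torch.randn(8, 300), torch.randint(0, 2, (8, 1)).float())
           for _ in range(3)]
loss = sum(loss_fn(model(x, u), y) for u, (x, y) in enumerate(batches))
opt.zero_grad()
loss.backward()
opt.step()
```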

Multi-task learning works because the regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled.[8] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.[8][9]

Methods

The key challenge in multi-task learning is how to combine learning signals from multiple tasks into a single model. This may strongly depend on how well the different tasks agree with each other, or contradict each other. There are several ways to address this challenge:

Task grouping and overlap

Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.[10] Task relatedness can be imposed a priori or learned from the data.[7][11] Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.[8][12] For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.[8]

Exploiting unrelated tasks

One can attempt learning a group of principal tasks using a group of auxiliary tasks unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods which build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal, as sketched below. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.[9]
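
One way to realize such a penalty, as a rough sketch rather than the exact formulation of Romera-Paredes et al., is to penalize the overlap between the projection matrices of the two task groups:

```python
import torch

d, k1, k2 = 100, 5, 5  # input dimension; subspace sizes for each task group
U = torch.randn(d, k1, requires_grad=True)  # representation, principal tasks
V = torch.randn(d, k2, requires_grad=True)  # representation, auxiliary tasks

def orthogonality_penalty(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # ||U^T V||_F^2 vanishes exactly when the two subspaces are orthogonal,
    # pushing each group to screen out the other's idiosyncrasies.
    return (U.T @ V).pow(2).sum()

# Added to the per-group task losses during training.
penalty = orthogonality_penalty(U, V)
penalty.backward()
```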

Transfer of knowledge

Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet,[13] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Alternatively, the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.[14]
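
The two reuse patterns can be sketched as follows; the `pretrained_backbone` here is a stand-in for a network such as GoogLeNet with already-trained weights, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained representation (e.g. a convolutional trunk).
pretrained_backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())

# Pattern 1: feature extractor. Freeze the backbone and train only a new head,
# so the pretrained representation acts as fixed pre-processing.
for p in pretrained_backbone.parameters():
    p.requires_grad = False
feature_extractor = nn.Sequential(pretrained_backbone, nn.Linear(256, 10))

# Pattern 2: fine-tuning. Initialize from the pretrained weights, keep all
# parameters trainable, but give the backbone a smaller learning rate.
for p in pretrained_backbone.parameters():
    p.requires_grad = True
head = nn.Linear(256, 10)
finetune_model = nn.Sequential(pretrained_backbone, head)
opt = torch.optim.Adam([
    {"params": pretrained_backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
```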

Multiple non-stationary tasks

Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed Group online adaptive learning (GOAL).[15] Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from the previous experience of another learner to quickly adapt to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents.

Multi-task optimization

In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models.[16] Commonly, MTL models employ task-specific modules on top of a joint feature representation obtained using a shared module. Since this joint representation must capture useful features across all tasks, MTL may hinder individual task performance if the different tasks seek conflicting representations, i.e., if the gradients of different tasks point in opposing directions or differ significantly in magnitude. This phenomenon is commonly referred to as negative transfer. To mitigate this issue, various MTL optimization methods have been proposed. Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics. These methods include subtracting the projection of conflicting gradients,[17] applying techniques from game theory,[18] and using Bayesian modeling to get a distribution over gradients.[19]
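
The projection idea of [17], for example, can be sketched in a simplified two-task form (an illustration of the principle, not the authors' full algorithm):

```python
import numpy as np

def project_conflicting(g: np.ndarray, g_other: np.ndarray) -> np.ndarray:
    """If two task gradients conflict (negative inner product), remove from
    g its projection onto g_other before the gradients are combined."""
    dot = g @ g_other
    if dot < 0:  # gradients point in opposing directions
        g = g - (dot / (g_other @ g_other)) * g_other
    return g

g1 = np.array([1.0, 2.0])    # gradient of task 1
g2 = np.array([-1.0, 0.5])   # gradient of task 2
update = project_conflicting(g1, g2) + project_conflicting(g2, g1)
```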

Mathematics

Reproducing kernel Hilbert space of vector valued functions (RKHSvv)

The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.[7]

RKHSvv concepts

Suppose the training data set is $\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, with $x_i^t \in \mathcal{X}$, $y_i^t \in \mathcal{Y}$, where $t$ indexes task, and $t \in \{1, \dots, T\}$. Let $n = \sum_{t=1}^{T} n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ for each task. This results in the regularized machine learning problem:

$$\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\big(y_i^t, f_t(x_i^t)\big) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (1)$$

where $\mathcal{H}$ is a vector valued reproducing kernel Hilbert space with functions $f : \mathcal{X} \to \mathbb{R}^T$ having components $f_t : \mathcal{X} \to \mathbb{R}$.

The reproducing kernel for the space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}^T$ is a symmetric matrix-valued function $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$, such that $\Gamma(\cdot, x)c \in \mathcal{H}$ and the following reproducing property holds:

$$\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c \rangle_{\mathcal{H}} \qquad (2)$$

The reproducing kernel gives rise to a representer theorem showing that any solution to equation 1 has the form:

$$f(x) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} \Gamma(x, x_i^t)\, c_i^t \qquad (3)$$

Separable kernels

The form of the kernel Γ induces both the representation of the feature space and structures the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space $\mathcal{X}$ and on the tasks $\{1, \dots, T\}$. In this case the kernel relating scalar components $f_t$ and $f_s$ is given by $\gamma\big((x_i, t), (x_j, s)\big) = k(x_i, x_j)\, A_{s,t}$. For vector valued functions $f \in \mathcal{H}$ we can write $\Gamma(x_i, x_j) = k(x_i, x_j)\, A$, where $k$ is a scalar reproducing kernel, and $A$ is a symmetric positive semi-definite $T \times T$ matrix. Henceforth denote by $S_+^T$ the set of symmetric positive semi-definite $T \times T$ matrices.

This factorization property, separability, implies that the input feature space representation does not vary by task, i.e. there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by $A$. Methods for non-separable kernels $\Gamma$ are a current field of research.

For the separable case, the representer theorem is reduced to $f(x) = \sum_{i=1}^{n} k(x, x_i)\, A c_i$. The model output on the training data is then $KCA$, where $K$ is the $n \times n$ empirical kernel matrix with entries $K_{i,j} = k(x_i, x_j)$, and $C$ is the $n \times T$ matrix whose rows are the coefficient vectors $c_i$.
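
A small numerical sketch of this prediction rule (the Gaussian kernel, sizes, and random coefficients are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 6, 3                          # training points, tasks
X = rng.normal(size=(n, 2))

# Scalar kernel k on inputs; empirical kernel matrix K_ij = k(x_i, x_j).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

C = rng.normal(size=(n, T))          # coefficient matrix with rows c_i
B = rng.normal(size=(T, T))
A = B @ B.T                          # symmetric PSD task-structure matrix

Y_hat = K @ C @ A                    # model output on the training data: KCA
print(Y_hat.shape)                   # (n, T): one value per point per task
```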

With the separable kernel, equation 1 can be rewritten as

$$\min_{C \in \mathbb{R}^{n \times T}} V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) \qquad (P)$$

where $V$ is a (weighted) average of $\mathcal{L}$ applied entry-wise to $Y$ and $KCA$. (The weight is zero if $y_i^t$ is a missing observation.)

Note the second term in P can be derived as follows:
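
A sketch of the derivation, using the representer form (3) with $\Gamma(x_i, x_j) = k(x_i, x_j) A$ and the reproducing property (2):

$$\|f\|_{\mathcal{H}}^2 = \Big\langle \sum_i k(\cdot, x_i) A c_i,\; \sum_j k(\cdot, x_j) A c_j \Big\rangle_{\mathcal{H}} = \sum_{i,j} k(x_i, x_j)\, c_i^\top A c_j = \operatorname{tr}(KCAC^\top)$$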

Known task structure

Task structure representations

There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.

Regularizer — With the separable kernel, it can be shown (below) that $\|f\|_{\mathcal{H}}^2 = \sum_{s,t=1}^{T} A_{s,t}^\dagger \langle f_s, f_t \rangle_{\mathcal{H}_k}$, where $A_{s,t}^\dagger$ is the $(s,t)$ element of the pseudoinverse of $A$, $\mathcal{H}_k$ is the RKHS based on the scalar kernel $k$, and $f_t(x) = \sum_{i=1}^{n} k(x, x_i)(Ac_i)_t$. This formulation shows that $A_{s,t}^\dagger$ controls the weight of the penalty associated with $\langle f_s, f_t \rangle_{\mathcal{H}_k}$. (Note that $\langle f_s, f_t \rangle_{\mathcal{H}_k}$ arises from $\|f_t - f_s\|_{\mathcal{H}_k}^2 = \|f_t\|_{\mathcal{H}_k}^2 + \|f_s\|_{\mathcal{H}_k}^2 - 2\langle f_s, f_t \rangle_{\mathcal{H}_k}$.)

Proof:

$$\|f\|_{\mathcal{H}}^2 = \sum_{i,j} \big\langle k(\cdot, x_i) A c_i,\, k(\cdot, x_j) A c_j \big\rangle_{\mathcal{H}} = \sum_{i,j} k(x_i, x_j)\, c_i^\top A c_j$$

while, with $f_t = \sum_i k(\cdot, x_i)(Ac_i)_t$,

$$\sum_{s,t=1}^{T} A_{s,t}^\dagger \langle f_s, f_t \rangle_{\mathcal{H}_k} = \sum_{i,j} k(x_i, x_j)\, c_i^\top A A^\dagger A\, c_j = \sum_{i,j} k(x_i, x_j)\, c_i^\top A c_j,$$

using $A A^\dagger A = A$. The two expressions coincide, which proves the claim.

Output metric — an alternative output metric on $\mathbb{R}^T$ can be induced by the inner product $\langle y_1, y_2 \rangle_\Theta = \langle y_1, \Theta y_2 \rangle_{\mathbb{R}^T}$. With the squared loss there is an equivalence between the separable kernels $k(\cdot, \cdot)\, I_T$ under the alternative metric and $k(\cdot, \cdot)\, \Theta^\dagger$ under the canonical metric.

Output mapping — Outputs can be mapped as $L : \mathcal{Y}^T \to \tilde{\mathcal{Y}}$ to a higher dimensional space to encode complex structures such as trees, graphs and strings. For linear maps $L$, with appropriate choice of separable kernel, it can be shown that $A = L^\top L$.

Task structure examples

Via the regularizer formulation, one can represent a variety of task structures easily; each of the following choices is constructed concretely in the code sketch after this list.

  • Letting $A^\dagger = \gamma I_T + (\gamma - \lambda)\frac{1}{T}\mathbf{1}\mathbf{1}^\top$ (where $I_T$ is the $T \times T$ identity matrix, and $\mathbf{1}\mathbf{1}^\top$ is the $T \times T$ matrix of ones) is equivalent to letting $\Gamma$ control the variance $\sum_t \|f_t - \bar{f}\|_{\mathcal{H}_k}^2$ of tasks from their mean $\bar{f} = \frac{1}{T}\sum_t f_t$. For example, blood levels of some biomarker may be taken on $T$ patients at $n_t$ time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
  • Letting $A^\dagger = \alpha I_T + (\alpha - \lambda) M$, where $M_{t,s} = \frac{1}{|G_r|}\mathbb{I}(t, s \in G_r)$, is equivalent to letting $\alpha$ control the variance measured with respect to a group mean: $\sum_r \sum_{t \in G_r} \|f_t - \frac{1}{|G_r|}\sum_{s \in G_r} f_s\|_{\mathcal{H}_k}^2$. (Here $|G_r|$ is the cardinality of group $r$, and $\mathbb{I}$ is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
  • Letting $A^\dagger = \delta I_T + (\delta - \lambda) L$, where $L = D - M$ is the Laplacian for the graph with adjacency matrix $M$ giving pairwise similarities of tasks, is equivalent to giving a larger penalty to the distance separating tasks $t$ and $s$ when they are more similar (according to the weight $M_{t,s}$), i.e. $\delta$ regularizes $\sum_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 M_{t,s}$.
  • All of the above choices of $A$ also induce the additional regularization term $\lambda \sum_t \|f_t\|_{\mathcal{H}_k}^2$ which penalizes complexity in $f$ more broadly.
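
To make these choices concrete, the sketch below constructs each $A^\dagger$ in NumPy; the values of $\gamma$, $\alpha$, $\delta$, $\lambda$, the grouping, and the adjacency matrix are illustrative assumptions.

```python
import numpy as np

T, lam = 4, 0.1
gamma, alpha, delta = 1.0, 1.0, 1.0
I = np.eye(T)

# Mean regularization: penalize the variance of tasks around their mean.
A_dag_mean = gamma * I + (gamma - lam) * np.ones((T, T)) / T

# Group regularization: tasks {0, 1} and {2, 3} pulled toward group means.
groups = [[0, 1], [2, 3]]
M = np.zeros((T, T))
for g in groups:
    for t in g:
        for s in g:
            M[t, s] = 1.0 / len(g)
A_dag_group = alpha * I + (alpha - lam) * M

# Graph regularization: Laplacian L = D - W of a task-similarity graph
# (W plays the role of the adjacency matrix called M in the text).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
A_dag_graph = delta * I + (delta - lam) * L
```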

Learning tasks together with their structure

Learning problem P can be generalized to admit learning the task matrix A as follows:

$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) + F(A) \qquad (Q)$$

Choice of $F : S_+^T \to \mathbb{R}_+$ must be designed to learn matrices $A$ of a given type. See "Special cases" below.

Optimization of Q

Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that although Q is not convex jointly in C and A, a related problem is jointly convex.

Specifically, on the convex set $\mathcal{C} = \{(C, A) \in \mathbb{R}^{n \times T} \times S_+^T \mid \operatorname{Range}(C^\top K C) \subseteq \operatorname{Range}(A)\}$, the equivalent problem

$$\min_{(C, A) \in \mathcal{C}} V(Y, KC) + \lambda \operatorname{tr}(A^\dagger C^\top K C) + F(A) \qquad (R)$$

is convex with the same minimum value, and if $(C_R, A_R)$ is a minimizer for R then $(C_R A_R^\dagger, A_R)$ is a minimizer for Q.

R may be solved by a barrier method on a closed set by introducing the following perturbation:

$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KC) + \lambda \operatorname{tr}\big(A^\dagger (C^\top K C + \delta^2 I_T)\big) + F(A) \qquad (S)$$

The perturbation via the barrier $\delta^2 \operatorname{tr}(A^\dagger)$ forces the objective function to be equal to $+\infty$ on the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

S can be solved with a block coordinate descent method, alternating in C and A. This results in a sequence of minimizers $(C_m, A_m)$ in S that converges to the solution in R as $\delta_m \to 0$, and hence gives the solution to Q.
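
A minimal sketch of this alternating scheme, under simplifying assumptions not made in the text (squared loss $V(Y, KC) = \frac{1}{n}\|Y - KC\|_F^2$ and $F(A) = \operatorname{tr}(A)$, which give closed-form block updates):

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm

def block_coordinate_descent(K, Y, lam=0.1, delta=1e-2, iters=50):
    """Alternate closed-form updates of C and A for the perturbed problem S,
    assuming squared loss and F(A) = tr(A)."""
    n, T = Y.shape
    A = np.eye(T)
    for _ in range(iters):
        # C-step: a sufficient condition for stationarity is the Sylvester
        # equation K C + C (n * lam * A^+) = Y.
        C = solve_sylvester(K, n * lam * np.linalg.pinv(A), Y)
        # A-step: min_A lam * tr(A^+ P) + tr(A), with P = C^T K C + delta^2 I,
        # has the closed form A = sqrt(lam) * P^{1/2}.
        P = C.T @ K @ C + delta**2 * np.eye(T)
        A = np.sqrt(lam) * np.real(sqrtm(P))
    return C, A

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
C, A = block_coordinate_descent(K + 1e-8 * np.eye(10), rng.normal(size=(10, 2)))
```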

Special cases

Spectral penalties - Dinuzzo et al[20] suggested setting F as the Frobenius norm $\sqrt{\operatorname{tr}(A^\top A)}$. They optimized Q directly using block coordinate descent, not accounting for difficulties at the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

Clustered tasks learning - Jacob et al[21] suggested to learn A in the setting where T tasks are organized in R disjoint clusters. In this case let $E \in \{0,1\}^{T \times R}$ be the matrix with $E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)$. Setting $M = E E^\dagger$ and $U = \frac{1}{T}\mathbf{1}\mathbf{1}^\top$, the task matrix $A^\dagger$ can be parameterized as a function of $M$: $A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon (I - M)$, with terms that penalize the average, the between-cluster variance, and the within-cluster variance of the task predictions, respectively. The set of feasible M is not convex, but there is a convex relaxation $\mathcal{S}_c = \{M \in S_+^T : I - M \in S_+^T \text{ and } \operatorname{tr}(M) = R\}$. In this formulation, $F(A) = \mathbb{I}(A(M) \in \{A : M \in \mathcal{S}_c\})$.
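
A short sketch of this parameterization (the cluster assignment and the three $\epsilon$ weights are illustrative assumptions):

```python
import numpy as np

T = 4
eps_M, eps_B, eps_W = 0.1, 1.0, 4.0  # penalties: mean, between-, within-cluster

# Cluster membership E: tasks 0,1 in cluster 0; tasks 2,3 in cluster 1.
E = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)
M = E @ np.linalg.pinv(E)            # projection onto the cluster indicators
U = np.ones((T, T)) / T              # projection onto the overall mean

A_dag = eps_M * U + eps_B * (M - U) + eps_W * (np.eye(T) - M)
```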

Generalizations

Non-convex penalties - Penalties can be constructed such that A is constrained to be a graph Laplacian, or such that A has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.

Non-separable kernels - Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.

Software package

A MATLAB package called Multi-tAsk Learning via StructurAl Regularization (MALSAR)[22] implements the following multi-task learning algorithms: Mean-Regularized Multi-Task Learning,[23][24] Multi-Task Learning with Joint Feature Selection,[25] Robust Multi-Task Feature Learning,[26] Trace-Norm Regularized Multi-Task Learning,[27] Alternating Structural Optimization,[28][29] Incoherent Low-Rank and Sparse Learning,[30] Robust Low-Rank Multi-Task Learning, Clustered Multi-Task Learning,[31][32] and Multi-Task Learning with Graph Structures.

See also

References

  1. ^ Baxter, J. (2000). "A model of inductive bias learning". Journal of Artificial Intelligence Research 12:149–198. On-line paper
  2. ^ Thrun, S. (1996). Is learning the n-th thing any easier than learning the first?. In Advances in Neural Information Processing Systems 8, pp. 640–646. MIT Press. Paper at Citeseer
  3. ^ a b Caruana, R. (1997). "Multi-task learning" (PDF). Machine Learning. 28: 41–75. doi:10.1023/A:1007379606734.
  4. ^ Multi-Task Learning as Multi-Objective Optimization Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018), https://proceedings.neurips.cc/paper/2018/hash/432aca3a1e345e339f35a30c8f65edce-Abstract.html
  5. ^ Suddarth, S., Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. EURASIP Workshop. Neural Networks pp. 120-129. Lecture Notes in Computer Science. Springer.
  6. ^ Abu-Mostafa, Y. S. (1990). "Learning from hints in neural networks". Journal of Complexity. 6 (2): 192–198. doi:10.1016/0885-064x(90)90006-y.
  7. ^ a b c Ciliberto, C. (2015). "Convex Learning of Multiple Tasks and their Structure". arXiv:1504.03101 [cs.LG].
  8. ^ a b c d Hajiramezanali, E. & Dadaneh, S. Z. & Karbalayghareh, A. & Zhou, Z. & Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. arXiv:1810.09433
  9. ^ a b Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M., (2012) Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf
  10. ^ Kumar, A., & Daume III, H., (2012) Learning Task Grouping and Overlap in Multi-Task Learning. http://icml.cc/2012/papers/690.pdf
  11. ^ Jawanpuria, P., & Saketha Nath, J., (2012) A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf
  12. ^ Zweig, A. & Weinshall, D. Hierarchical Regularization Cascade for Joint Learning. Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta GA, June 2013. http://www.cs.huji.ac.il/~daphna/papers/Zweig_ICML2013.pdf
  13. ^ Szegedy, Christian; Liu, Wei; Jia, Yangqing; Sermanet, Pierre; Reed, Scott; Anguelov, Dragomir; Erhan, Dumitru; Vanhoucke, Vincent; Rabinovich, Andrew (2015). "Going deeper with convolutions". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–9. arXiv:1409.4842. doi:10.1109/CVPR.2015.7298594. ISBN 978-1-4673-6964-0. S2CID 206592484.
  14. ^ Roig, Gemma. "Deep Learning Overview" (PDF). Archived from the original (PDF) on 2016-03-06. Retrieved 2019-08-26.
  15. ^ Zweig, A. & Chechik, G. Group online adaptive learning. Machine Learning, doi:10.1007/s10994-017-5661-5, August 2017. http://rdcu.be/uFSv
  16. ^ Standley, Trevor; Zamir, Amir R.; Chen, Dawn; Guibas, Leonidas; Malik, Jitendra; Savarese, Silvio (2020-07-13). "Which Tasks Should Be Learned Together in Multi-Task Learning?". International Conference on Machine Learning (ICML): 9120–9132. arXiv:1905.07553.
  17. ^ Yu, Tianhe; Kumar, Saurabh; Gupta, Abhishek; Levine, Sergey; Hausman, Karol; Finn, Chelsea (2020). "Gradient Surgery for Multi-Task Learning" (PDF). Advances in Neural Information Processing Systems. arXiv:2001.06782.
  18. ^ Navon, Aviv; Shamsian, Aviv; Achituve, Idan; Maron, Haggai; Kawaguchi, Kenji; Chechik, Gal; Fetaya, Ethan (2022). "Multi-Task Learning as a Bargaining Game". International Conference on Machine Learning: 16428–16446. arXiv:2202.01017.
  19. ^ Achituve, Idan; Diamant, Idit; Netzer, Arnon; Chechik, Gal; Fetaya, Ethan (2024). "Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning". arXiv:2402.04005 [cs.LG].
  20. ^ Dinuzzo, Francesco (2011). "Learning output kernels with block coordinate descent" (PDF). Proceedings of the 28th International Conference on Machine Learning (ICML-11). Archived from the original (PDF) on 2017-08-08.
  21. ^ Jacob, Laurent (2009). "Clustered multi-task learning: A convex formulation". Advances in Neural Information Processing Systems. arXiv:0809.2085. Bibcode:2008arXiv0809.2085J.
  22. ^ Zhou, J., Chen, J. and Ye, J. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2012. http://www.public.asu.edu/~jye02/Software/MALSAR. On-line manual
  23. ^ Evgeniou, T., & Pontil, M. (2004). Regularized multi–task learning. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117).
  24. ^ Evgeniou, T.; Micchelli, C.; Pontil, M. (2005). "Learning multiple tasks with kernel methods" (PDF). Journal of Machine Learning Research. 6: 615.
  25. ^ Argyriou, A.; Evgeniou, T.; Pontil, M. (2008a). "Convex multi-task feature learning". Machine Learning. 73 (3): 243–272. doi:10.1007/s10994-007-5040-8.
  26. ^ Chen, J., Zhou, J., & Ye, J. (2011). Integrating low-rank and group-sparse structures for robust multi-task learning[dead link]. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.
  27. ^ Ji, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 457–464).
  28. ^ Ando, R.; Zhang, T. (2005). "A framework for learning predictive structures from multiple tasks and unlabeled data" (PDF). The Journal of Machine Learning Research. 6: 1817–1853.
  29. ^ Chen, J., Tang, L., Liu, J., & Ye, J. (2009). A convex formulation for learning shared structures from multiple tasks. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 137–144).
  30. ^ Chen, J., Liu, J., & Ye, J. (2010). Learning incoherent sparse and low-rank patterns from multiple tasks. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1179–1188).
  31. ^ Jacob, L., Bach, F., & Vert, J. (2008). Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems, 2008
  32. ^ Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems.

Software