(VeRA: Vector-based Random Matrix Adaptation)
I am reading the above paper for study purposes.
If I were to fine-tune GPT-3 using the methodology presented in the paper, I would only have 2.8M trainable parameters (r=16).
I think that a small number of trainable parameters in a network (dW in this case) means that the learning capacity of that network is also small, in which case there would be no point in preparing a large amount of data for training.
However, the paper doesn't seem to claim that the authors' contribution is to reduce the amount of data needed for training.
Is it because it's a self-evident fact that reducing the number of trainable parameters reduces the amount of data required?
Or is my understanding (about the number of trainable parameters / network learning capacity / ..) wrong (I'm almost suspecting it's this)?
I look forward to your advice.
Say I want to train a model based on Mistral or Llama with ~10k SFT examples. Should I start from the base model or the chat model?
Also, when considering continued pre-training, which one is better? Thanks.
I am reading section 5.11.2, where the authors explain how deep learning can deal with high-dimensional data, in contrast to classical machine learning algorithms. However, I can't follow the bold part of the excerpt.
Can someone elaborate on what the authors mean?
If the function additionally behaves differently
in different regions, it can become extremely complicated to describe with a set of
training examples. If the function is complicated (we want to distinguish a huge
number of regions compared to the number of examples), is there any hope to
generalize well?
The answer to both of these questions—whether it is possible to represent
a complicated function efficiently, and whether it is possible for the estimated
function to generalize well to new inputs—is yes. The key insight is that a very
large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long
as we introduce some dependencies between the regions via additional assumptions
about the underlying data generating distribution.
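One way to read the O(2^k) versus O(k) claim, offered as a toy sketch rather than the book's own construction: if we assume the function factors into k independent binary features, each feature can be estimated from a few examples, yet the combinations of those features label 2^k distinct regions. The thresholds and the `region_id` helper below are assumptions made for illustration.

```python
from itertools import product

def region_id(x, thresholds):
    """Map a point to a region: one bit per coordinate (above/below its threshold)."""
    return tuple(int(xi > t) for xi, t in zip(x, thresholds))

k = 10
# Assumption: each of the k thresholds is estimated independently,
# which needs on the order of k examples in total, not 2**k.
thresholds = [0.5] * k

# Every corner of the unit hypercube falls into a different region.
corners = list(product([0.0, 1.0], repeat=k))
regions = {region_id(c, thresholds) for c in corners}

print(len(thresholds), "learned parameters")
print(len(regions), "distinguishable regions")
```

The "dependency between regions" here is the assumption that the same k thresholds apply everywhere; without it, each region would need its own examples.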
Hey, I'm working on my college thesis in deep learning and decided to build a computer for it. But I'm a bit unsure about which hardware to choose, especially which GPU would suit my work best to get decent performance with YOLO since I'm a student on a budget. Any tips?
Hi everyone,
I’m working on my master’s dissertation, generating synthetic images of colon polyps using diffusion models. I’ve been getting some okay results with OpenAI’s guided diffusion model, but I’m curious if there are other models I should test. I’m doing initial training on HyperKvasir and subsequently fine-tuning on a custom dataset. I use a 512GB A100 for training on HyperKvasir. Due to data restrictions, I am limited to a 12GB RTX 2080 Ti for fine-tuning. I exclusively use PyTorch.
While I have some experience with deep learning, I’m keen to hear the recommendations of more experienced deep learning practitioners. Are there any other diffusion models or alternative approaches that you recommend testing? I currently use FID as my metric.
Any insights or recommendations would be greatly appreciated. Thanks in advance for your help!
Best, Erik
Hi guys, I was just starting my first lab for Full Stack Deep Learning.
However, I was getting connection errors on the first cell in the colab notebook as below.
Does anyone know how to fix that? Thanks : )
I am working on image classification (10 classes) using a mixture of experts (MoE).
Steps:
i) I train 5 experts, each on 2 classes (e.g., expert 1 on classes 1 and 2, expert 2 on classes 3 and 4, and so on)
ii) then I freeze the expert parameters
iii) then I train the gating network
This is the architecture I am using.
Can you suggest a better method or any improvements?
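To make the three steps concrete, here is a minimal sketch of how the outputs could be combined at prediction time. It assumes each frozen expert emits two scores (one per class it was trained on) and the gating network emits five scores; the `combine` function and the example numbers are hypothetical, not taken from the architecture mentioned above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine(gate_logits, expert_logits):
    """gate_logits: 5 scores from the gating network.
    expert_logits: 5 pairs of scores, one pair per frozen expert.
    Returns a 10-way probability vector."""
    gate = softmax(gate_logits)
    probs = []
    for g, pair in zip(gate, expert_logits):
        # Expert i only knows classes 2i and 2i+1; the gate decides how much
        # probability mass that expert's pair of classes receives.
        probs.extend(g * p for p in softmax(pair))
    return probs

# Hypothetical outputs for a single image.
gate_logits = [0.1, 2.0, -1.0, 0.3, 0.0]
expert_logits = [[1.0, -1.0], [0.2, 1.5], [0.0, 0.0], [2.0, 1.0], [-0.5, 0.5]]
probs = combine(gate_logits, expert_logits)
predicted_class = max(range(10), key=probs.__getitem__)
print(predicted_class)
```

Under this reading, the gate is effectively solving a 5-way problem (which pair of classes the image belongs to), and each expert only has to separate its own two classes.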
Morning everyone,
About 2 weeks ago, I found out about LISP; I am a complete beginner to programming as a whole (I only found out about it because of Patrick Collison). I started questioning why I was learning LISP of all languages, and came to the conclusion that the language was primitive.
I stumbled across DL when searching for real-world applications of AI. My overarching goal is to understand Deep Learning by Ian Goodfellow. Why? I have a genuine interest in learning so I can build practical solutions for the world.
I keep thinking about how AI is going to be smarter than all humans in a couple of years, so I wish to join the fight to preserve the light of consciousness.
I keep hearing about many resources, from courses like Stanford's CS230 and Ng's Deep Learning to textbooks like Deep Learning, Neural Networks and Machine Learning.
However, I only understand calculus at a high-school level. I don't know anything about PyTorch, TensorFlow, Python, or C++, and I have no understanding of what these mean; I only know these words from stumbling onto Lex Fridman's podcasts with Karpathy, Andrew Ng, Hinton and others.
My question today is: What formula do I begin with that will best set me up for Real-world applications of Deep Learning?
I understand this much: Python, Differential Calculus, Linear Algebra, Basic Statistics are key concepts, but I figured Reddit could point me to the right sources.
PS--I prefer textbooks that I can buy from Amazon.
Thank you!
Hi everyone, I am working on a U-Net project that incorporates PSO for rainfall forecasting from radar imagery. I have the code, but I don't fully understand it yet and don't know whether it is really correct. I hope you can take a look and help. This is my code. Thanks, everyone.
Hi everyone! I have AMD and Nvidia GPUs (4080/7900 XTX). I'd like to split one quantized model (70B parameters) between the two video adapters to improve text generation speed.
The only solution I found was to use Vulkan and llama.cpp, but it is still quite slow (probably due to the characteristics of Vulkan). Are there alternative ways to share one LLM between GPUs from different vendors?
I tried a compiled version of llama.cpp (built with the -DLLAMA_VULKAN=1 flag) on Ubuntu 22.04 with the Vulkan SDK installed. The output of Llama 3 70B (q3, q4) on the two GPUs was significantly (about 4 times) slower than running models that typically only run on CUDA (for example, the CUDA-based text-generation-webui with llama.cpp), even though the model with Q3 quantization fits entirely in the video memory of the two adapters.
I would like to know whether there is another, faster way to run text generation on two GPUs from different manufacturers (without using the Vulkan SDK). Or am I using Vulkan wrong?
Hi everyone !
I'm making a Tetris bot with reinforcement learning and I'm not sure which approach I should take:
I don't want my NN to output the keys corresponding to the moves; what I want is for my neural network to score a grid.
Basically, I can extract some key values from a grid into a single vector (such as the height of each column, the number of filled rows, ...). I compute multiple grids corresponding to the outcome of "slamming" the tetromino down at multiple x coordinates, and then I want to move to the position whose grid has the best score.
But is this a regression problem?
My model just has to learn to output a single number, the score of a single grid. I get the score for every grid, then pick the grid with the best score.
If it is, can I properly tune the loss? The reward comes only from the final move that I make, so many of the predictions are never properly corrected.
Or is it a ranking problem?
My model should learn to pick the best of all the grids fed as input.
I've tried to find out whether "ranking" can be done in PyTorch, but I can't seem to find a way; I don't know how to search for a proper framework for it.
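The selection step described above can be sketched as follows. This is only an illustration under assumptions: the grid encoding (rows of 0/1 cells, top row first), the `features` and `best_placement` helpers, and the linear scorer standing in for the neural network are all hypothetical.

```python
def features(grid):
    """grid: list of rows (top to bottom), each a list of 0/1 cells.
    Returns the height of each column followed by the number of full rows."""
    rows, cols = len(grid), len(grid[0])
    heights = []
    for c in range(cols):
        filled = [r for r in range(rows) if grid[r][c]]
        heights.append(rows - filled[0] if filled else 0)
    full_rows = sum(1 for row in grid if all(row))
    return heights + [full_rows]

def score(grid, weights):
    # Stand-in for the neural network: a linear scorer on the feature vector.
    return sum(w * f for w, f in zip(weights, features(grid)))

def best_placement(candidates, weights):
    """candidates: dict mapping x position -> grid that results from the drop."""
    return max(candidates, key=lambda x: score(candidates[x], weights))

# Two hypothetical outcomes of dropping the same piece at x=0 and x=1.
candidates = {
    0: [[0, 0, 0], [1, 0, 0], [1, 1, 0]],
    1: [[0, 0, 0], [0, 0, 0], [1, 1, 1]],
}
weights = [-1.0, -1.0, -1.0, 5.0]  # penalize height, reward full rows
print(best_placement(candidates, weights))
```

Whether the scorer is trained as regression or as ranking, this inference step stays the same: score every candidate grid, then take the argmax. For the pairwise ranking variant, PyTorch does provide `torch.nn.MarginRankingLoss`, which compares the scores of two inputs.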
Thanks for your time !
Hey folks, I'm diving into using deep learning for real-time chart predictions, but I'm wondering about its real-time capabilities. As a newbie, can someone explain if deep learning can handle real-time tasks effectively?
The Vision Language Group at IIT Roorkee has written comprehensive summaries of deep learning papers from various prestigious conferences like NeurIPS, CVPR, ICCV, ICML 2016-24. A few notable examples include:
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, CVPR'23
- Segment Anything, ICCV'23
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, ICLR'23
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, NeurIPS'22
- An Image is Worth 16X16 Words: Transformers for Image Recognition at Scale, ICLR'21
- Big Bird: Transformers for Longer Sequences, NeurIPS'20
If you find the summaries useful, you can contribute summaries of your own. The collection will be updated regularly with summaries of more papers from leading conferences.
Hi all :)
Do you have any experience using GNNs (especially spatio temporal GNNs) for the following graph structure.
A graph G consists of dynamic nodes and edges G_d = (V_d(t), E_d(t)) and static nodes and edges G_s = (V_s, E_s). G is then the union of these plus some additional dynamic edges between V_d(t) and V_s. So I have two types of nodes here: some of them are always constant over time in terms of their location, and some are dynamic (we can ignore the features of the nodes and edges for now). Is there any paper or architecture out there that uses this property efficiently?
I want to make sure that the static nodes capture the full temporal information and use the dynamic nodes just as additional spatial information. The goal is to build an RL agent that selects the best location for a data center (one of the static nodes) to minimize the latency of the system (given by client locations).
Hi all, I am trying to get a grasp on quantization. I believe I get the basics, but many points still escape my understanding. Most papers are not "comprehensive" and most articles lack rigor. I have too many tabs open now and am starting to get a bit lost in the rabbit hole. So, I'd appreciate it if someone could shed some light on these points.
- Is this flow accurate for inference: 1) you have the quantized weights, 2) quantize the input embeddings, 3) do the calculations, 4) dequantize the (low-precision) output to get 32-bit embeddings?
- In quantization-aware training:
  - The weights are represented (and updated) in 32-bit. The training process also keeps a copy of quantized weights. The weights and embeddings are both fake-quantized right before/during the forward pass. So there are two sets of outputs - one based on 32-bit weights and another based on quantized weights. Right?
  - Are the loss and gradient calculated based on the unquantized output or the quantized output? How exactly does the training "account for" the quantization - is the quantized error simply added to the 32-bit error?
  - Is there any "dequantizing" happening during the training? Where/why?
- In the case of a model quantized to 1-bit:
  - The multiplication operations become addition/subtraction. So, the embeddings do not need to be converted into 1-bit and can be used in higher-precision. Is this understanding correct?
  - Do the weights and activations have to be quantized to the same precision? In the BitNet paper, I read that the activations are quantized to 8-bit while the weights are 1-bit. Is this a special effect of 1-bit weights reducing multiplication to addition?
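The operations named in these questions can be sketched in a few lines. This is a minimal illustration assuming symmetric per-tensor quantization with a single scale; real libraries often add zero points, per-channel scales, and wider integer accumulators, and all function names below are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization: returns integers and a float scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

def int_dot(weights_q, w_scale, acts, num_bits=8):
    """Inference flow: quantize activations, integer dot product, rescale."""
    acts_q, a_scale = quantize(acts, num_bits)
    acc = sum(w * a for w, a in zip(weights_q, acts_q))  # integer accumulate
    return acc * w_scale * a_scale  # dequantize the accumulated result

def fake_quantize(values, num_bits=8):
    """Quantize, then immediately dequantize: floats snapped to the integer grid."""
    q, scale = quantize(values, num_bits)
    return dequantize(q, scale)

def sign_dot(sign_weights, acts):
    """With weights in {-1, +1}, each product is just +a or -a."""
    return sum(a if w > 0 else -a for w, a in zip(sign_weights, acts))

weights = [0.5, -1.0, 0.25]
acts = [1.0, 2.0, -4.0]
weights_q, w_scale = quantize(weights)        # step 1: done ahead of time
approx = int_dot(weights_q, w_scale, acts)    # steps 2 to 4
exact = sum(w * a for w, a in zip(weights, acts))
print(exact, approx)
```

`fake_quantize` is the quantize-then-dequantize round trip that quantization-aware training commonly applies in the forward pass, so the loss is computed on values that already carry the rounding error. `sign_dot` shows why activations can stay at a higher precision than 1-bit weights: no multiplication by the weight is needed.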
Thanks for taking the time! I'll probably have a couple of follow up questions too.
Do I need to be good at math in order to understand Andrej Karpathy's "Neural Networks: Zero to Hero" course? Or is all the necessary math explained in the course? I only know basic algebra and was wondering whether that is enough to start.
Hi everyone,
I'm working on a project where I need to embed n×m data into a latent space for clustering purposes. The goal is to identify similar embeddings and label them (unsupervised learning). I'm considering using either a fully connected autoencoder or a variational autoencoder (VAE) for this task.
From what I understand:
- Fully Connected Autoencoder:
  - Disadvantages: No probabilistic interpretation of the latent space, potentially less robust embeddings.
- Variational Autoencoder (VAE):
  - Advantages: Provides a probabilistic interpretation of the latent space, includes a regularization term (KL divergence) to ensure a desirable latent space structure, can generate new data samples.
Given these pros and cons, which approach would you recommend for my use case of clustering similar embeddings? Are there specific considerations or alternative methods I should be aware of for efficiently embedding and clustering this type of tabular data?
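For reference, the KL regularization term mentioned above has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. The sketch below assumes the encoder returns a mean vector and a log-variance vector per sample; the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

# A code that already matches the prior costs nothing...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ...while moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

A plain autoencoder has no such term, so its loss is the reconstruction error alone; in a VAE this term is added to the reconstruction loss and pulls every embedding toward the origin, which may or may not help keep clusters separated.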
Thanks in advance for your insights!
I have a dataset for multiple-choice question answering with 8k rows and 5 themes, most of them on research publications, research overviews, standards, specifications... What format should I use? Should I pair each question with one answer and a true/false label, or concatenate them? What type of model should I use: classical ML, or fine-tuning a BERT variant? Any advanced tips?
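The two formats being weighed can be sketched like this. The row layout and the field names (`question`, `options`, `answer_index`) are assumptions, since the actual dataset schema is not shown.

```python
from string import ascii_uppercase

def to_pairs(row):
    """One (question, option, label) example per answer option: binary classification."""
    return [
        (row["question"], option, int(i == row["answer_index"]))
        for i, option in enumerate(row["options"])
    ]

def to_concatenated(row):
    """One example per question: all options in the text, label = correct index."""
    options = " ".join(
        f"({ascii_uppercase[i]}) {o}" for i, o in enumerate(row["options"])
    )
    return f"{row['question']} {options}", row["answer_index"]

# Hypothetical row.
row = {
    "question": "Which layer normalizes activations?",
    "options": ["Dropout", "LayerNorm", "ReLU"],
    "answer_index": 1,
}
print(to_pairs(row))
print(to_concatenated(row))
```

With the pair format, a BERT-style model scores each (question, option) pair and the highest-scoring option wins; with the concatenated format, the model predicts the index of the correct option directly.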