r/deeplearning


Can I fine-tune an LLM with a small amount of data?

https://arxiv.org/abs/2310.11454 (VeRA: Vector-based Random Matrix Adaptation)
I am reading the above paper for study purposes.

If I were to fine-tune GPT-3 using the methodology presented in the paper, I would only have 2.8M trainable parameters (r=16).
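For concreteness, here is a minimal PyTorch sketch of a VeRA-style adapted linear layer (my own naming and shapes, not code from the paper): the pretrained weight and the randomly initialised low-rank matrices A and B are frozen, and only the two scaling vectors d and b are trained, which is why the trainable-parameter count stays so small. In the paper, A and B are additionally shared across all adapted layers.

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    """Sketch of a VeRA-style adapter: h = W0 x + Lambda_b B Lambda_d A x."""

    def __init__(self, base: nn.Linear, A: torch.Tensor, B: torch.Tensor, r: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weights W0
        self.register_buffer("A", A)          # frozen random matrix, shape (r, in_features)
        self.register_buffer("B", B)          # frozen random matrix, shape (out_features, r)
        self.d = nn.Parameter(torch.ones(r))                   # trainable scaling vector, size r
        self.b = nn.Parameter(torch.zeros(base.out_features))  # trainable scaling vector, size out_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.A.T) * self.d       # Lambda_d A x
        delta = (delta @ self.B.T) * self.b   # Lambda_b B Lambda_d A x
        return self.base(x) + delta           # W0 x + adaptation
```

Per adapted layer, only r + out_features numbers are trained, which is how the total for a very large model can end up in the low millions.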

I think that a small number of trainable parameters in a network (dW in this case) means that the learning capacity of that network is also small, in which case there would be no point in preparing a large amount of data for training.

However, the paper doesn't seem to claim that the authors' contribution is to reduce the amount of data needed for training.

Is it because it's a self-evident fact that reducing the number of trainable parameters reduces the amount of data required?

Or is my understanding (of the relationship between the number of trainable parameters and a network's learning capacity) wrong? I almost suspect it is this.

I look forward to your advice.



Understanding why Deep Learning works from Goodfellow's book

I am reading section 5.11.2 of the Deep Learning book, where the authors explain how deep learning can deal with high-dimensional data in contrast to classical machine learning algorithms. However, I can't follow the bold part of the excerpt below.

Can someone elaborate what the authors mean?

"If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well? The answer to both of these questions—whether it is possible to represent a complicated function efficiently, and whether it is possible for the estimated function to generalize well to new inputs—is yes. The key insight is that a very large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long as we introduce some dependencies between the regions via additional assumptions about the underlying data generating distribution."
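As a toy illustration of that last sentence (my own example, not from the book): a function on k binary inputs assigns a value to each of the 2^k input configurations, but if we assume it is linear in the bits, which is one such dependency between regions, then k + 1 examples determine it everywhere.

```python
import itertools
import numpy as np

# Toy illustration: a function over k binary inputs distinguishes 2^k configurations,
# yet if we assume it is linear in the bits, k + 1 examples pin it down exactly.
k = 10
rng = np.random.default_rng(0)
w_true, b_true = rng.normal(size=k), rng.normal()

def f(X):
    return X @ w_true + b_true

# k + 1 training examples: the all-zeros input plus the k one-hot inputs.
X_train = np.vstack([np.zeros(k), np.eye(k)])
y_train = f(X_train)

# Recover the parameters by least squares (exact here, since the system is full rank).
coef, *_ = np.linalg.lstsq(np.hstack([X_train, np.ones((k + 1, 1))]), y_train, rcond=None)
w_hat, b_hat = coef[:k], coef[k]

# Check the fit on all 2^k = 1024 binary inputs.
X_all = np.array(list(itertools.product([0.0, 1.0], repeat=k)))
print(np.allclose(X_all @ w_hat + b_hat, f(X_all)))  # True
```

Deep models rely on a different but analogous assumption: the target is a composition of simpler factors, so a parameter and example budget that grows roughly like O(k) can still carve the input space into exponentially many regions.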





Recommendations for Diffusion Models for Colon Polyp Generation

Hi everyone,

I'm working on my master's dissertation, generating synthetic images of colon polyps using diffusion models. I've been getting some okay results with OpenAI's guided diffusion model, but I'm curious whether there are other models I should test. I'm doing initial training on HyperKvasir and subsequently fine-tuning on a custom dataset. I use a 512GB A100 for training on HyperKvasir; due to data restrictions, I am limited to a 12 GB RTX 2080 Ti for fine-tuning. I exclusively use PyTorch.

While I have some experience with deep learning, I’m keen to hear the recommendations of more experienced deep learning practitioners. Are there any other diffusion models or alternative approaches that you recommend testing? I currently use FID as my metric.
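Since FID is the metric in use, here is a minimal sketch of computing it with torchmetrics (the random tensors below are placeholders for real polyp images and diffusion samples; with the default settings images are expected as uint8 in [0, 255], and far more than 16 images per side are needed for a stable estimate):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder tensors standing in for real polyp images and diffusion samples:
# uint8 images in [0, 255], shape (N, 3, H, W).
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # 2048-dim InceptionV3 features
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # lower is better
```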

Any insights or recommendations would be greatly appreciated. Thanks in advance for your help!

Best, Erik







Complete Beginner Formula

Morning everyone,

About 2 weeks ago, I found out about LISP; I am a complete beginner to programming as a whole (I only found out about it because of Patrick Collison). I started questioning why I was learning LISP of all languages, and came to the conclusion that the language was primitive.

I stumbled across deep learning when searching for real-world applications of AI; my overarching goal is to understand Deep Learning by Ian Goodfellow. Why? I have a genuine interest in learning so I can build practical solutions for the world.

I keep thinking about how AI is going to be smarter than all humans in a couple of years, so I wish to join the fight to preserve the light of consciousness.

I keep hearing about many resources, from courses like Stanford's CS230 and Ng's Deep Learning to textbooks like Deep Learning, Neural Networks and Machine Learning.

However, I only have an understanding of high-school-level calculus; I don't know anything about PyTorch, TensorFlow, Python, or C++, and I only know of these words because I stumbled onto Lex Fridman's podcasts with Karpathy, Andrew Ng, Hinton, and others.

My question today is: what formula do I begin with that will best set me up for real-world applications of deep learning?
I understand this much: Python, differential calculus, linear algebra, and basic statistics are key, but I figured Reddit could point me to the right sources.

PS--I prefer textbooks that I can buy from Amazon.

Thank you!



Split an LLM between two GPUs of different vendors

Hi everyone! I have an Nvidia and an AMD GPU (RTX 4080 / RX 7900 XTX). I'd like to split one quantized model (70B parameters) across the two cards to improve text-generation speed.

The only solution I found was to use Vulkan and llama.cpp, but it is still quite slow (probably due to limitations of the Vulkan backend). Are there alternative ways to share one LLM between GPUs from different vendors?

I tried a version of llama.cpp compiled with the -DLLAMA_VULKAN=1 flag, running on Ubuntu 22.04 with the Vulkan SDK installed. Generation with Llama 3 70B (Q3, Q4 quantizations) on the two GPUs was significantly (about 4x) slower than running models that only use CUDA (for example, the CUDA build of llama.cpp in text-generation-webui), even though the Q3-quantized model fits entirely in the combined VRAM of the two cards.

Is there a faster way to run text generation on two GPUs from different manufacturers (without the Vulkan SDK)? Or am I using Vulkan wrong?
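For reference, the layer split between the two cards can be controlled explicitly in llama.cpp; below is a sketch using the llama-cpp-python bindings. This assumes a build compiled with the backend you want (for example CMAKE_ARGS="-DLLAMA_VULKAN=1" at install time), the model path and split ratio are placeholders, and exact option behaviour can vary between releases and backends.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-70b-q3.gguf",  # hypothetical path to your quantized model
    n_gpu_layers=-1,                   # offload all layers to the GPUs
    tensor_split=[0.55, 0.45],         # rough proportion per device; check which index maps to which card
    n_ctx=4096,
)
out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```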


Is it a regression or a ranking problem?

Hi everyone!

I'm making a Tetris bot with reinforcement learning and I'm not sure which approach I should take:

I don't want my NN to output the keys corresponding to the moves; what I want is for my neural network to score a grid.

Basically, I can extract some key values from a grid into a single vector (like the height of each column, the number of filled rows, ...). I compute the candidate grids corresponding to the outcome of slamming the tetromino down at each possible x coordinate, and then I want to play the move whose resulting grid has the best score.

But is this a regression problem?
My model just has to learn to output a single number, the score of a given grid; I get the score for every candidate grid and then pick the grid with the best score.
If it is, can I define the loss properly, given that the reward comes only from the final move I actually make, so many of the predictions never receive a correction?

Or is it a ranking problem, since my model should learn to pick the best of all the grids fed as input?
I've tried to find out whether ranking can be done in PyTorch, but I can't seem to find a way; I lack the knowledge to search for a proper framework for it.
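For reference, the "score a grid" setup described above is usually treated as learning a value function over afterstates, which can be trained as plain regression (for example toward a temporal-difference target); pairwise ranking is also possible in PyTorch through losses such as nn.MarginRankingLoss. A minimal sketch of the scoring network and the move selection (feature count, layer sizes, and names are placeholders):

```python
import torch
import torch.nn as nn

# Maps a grid-feature vector to a single scalar score; the agent plays the
# candidate placement with the highest score.
class GridScorer(nn.Module):
    def __init__(self, n_features: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

scorer = GridScorer()
candidate_features = torch.randn(20, 8)   # one row per candidate placement
scores = scorer(candidate_features)
best_move = torch.argmax(scores).item()   # pick the placement with the best score
```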

Thanks for your time !



Deep Learning Paper Summaries

The Vision Language Group at IIT Roorkee has written comprehensive summaries of deep learning papers from prestigious conferences such as NeurIPS, CVPR, ICCV, and ICML (2016-24). A few notable examples include:

If you find the summaries useful, you can contribute summaries of your own. The repo will be continually updated with summaries of more papers from leading conferences.


Partly dynamic graphs for GNNs

Hi all :)

Do you have any experience using GNNs (especially spatio-temporal GNNs) on the following graph structure?

A graph G consists of a dynamic part G_d = (V_d(t), E_d(t)) and a static part G_s = (V_s, E_s). G is then the union of these plus some additional dynamic edges between V_d(t) and V_s. So I have two types of nodes here: some are constant over time in terms of their location, and some are dynamic (we can ignore node and edge features for now). Is there any paper or architecture out there that exploits this property efficiently?

I want the static nodes to capture the full temporal information and to use the dynamic nodes only as additional spatial information. The goal is to build an RL agent that selects the best location for a data center (one of the static nodes) to minimize the latency of the system (determined by the client locations).
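For what it's worth, one way to represent a single time step of such a graph is PyTorch Geometric's HeteroData with separate node types for the static and dynamic parts; the node/edge type names, feature sizes, and random indices below are placeholders:

```python
import torch
from torch_geometric.data import HeteroData

# One snapshot of the graph at time t, with two node types.
snapshot = HeteroData()
snapshot["static"].x = torch.randn(10, 16)       # V_s: fixed locations, e.g. candidate data-center sites
snapshot["dynamic"].x = torch.randn(50, 16)      # V_d(t): client nodes at time t

snapshot["static", "s2s", "static"].edge_index = torch.randint(0, 10, (2, 20))    # E_s
snapshot["dynamic", "d2d", "dynamic"].edge_index = torch.randint(0, 50, (2, 80))  # E_d(t)
snapshot["dynamic", "d2s", "static"].edge_index = torch.stack([
    torch.randint(0, 50, (120,)),                # source: dynamic node indices
    torch.randint(0, 10, (120,)),                # target: static node indices
])                                               # dynamic-to-static edges at time t
```

A spatio-temporal model could then consume a sequence of such snapshots and keep recurrent state only on the "static" node type, which matches the goal of letting the static nodes accumulate the temporal information.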


Help understanding some finer points of quantization

Hi all, I am trying to get a grasp on quantization. I believe I get the basics, but many points still escape my understanding. Most papers are not "comprehensive" and most articles lack rigor. I have too many tabs open now and am starting to get a bit lost in the rabbit hole. So I'd appreciate it if someone could shed some light on these points.

  1. Is this flow accurate for inference: 1) you have the quantized weights, 2) quantize the input embeddings, 3) do the calculations, 4) dequantize the (low-precision) output to get 32-bit embeddings?

  2. In quantization-aware training (QAT):

    1. The weights are represented (and updated) in 32-bit, and the training process also keeps a quantized copy of them. The weights and embeddings are both fake-quantized right before/during the forward pass, so there are two sets of outputs: one based on the 32-bit weights and another based on the quantized weights. Right? (A minimal fake-quantization sketch follows this list.)

    2. Are the loss and gradient calculated based on the unquantized output or the quantized output? How exactly does the training "account for" the quantization: is the quantization error simply added to the 32-bit error?

    3. Is there any "dequantizing" happening during the training? Where/why?

  3. In the case of a model quantized to 1-bit:

    1. The multiplication operations become addition/subtraction. So, the embeddings do not need to be converted into 1-bit and can be used in higher-precision. Is this understanding correct?

    2. Do the weights and activations have to be quantized to the same precision? In the BitNet paper, I read that the activations are quantized to 8-bit while the weights are 1-bit. Is this a special effect of 1-bit weights reducing multiplication to addition?
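Regarding the QAT questions above, here is a minimal sketch (my own simplification, not taken from any particular paper) of fake quantization with a straight-through estimator: the forward pass sees weights rounded to a simulated int8 grid, while the backward pass treats the rounding as identity, so the 32-bit master weights keep receiving gradients and, in this scheme, no separate "quantized error" term is added to the loss.

```python
import torch

# Fake quantization with a straight-through estimator (STE):
# forward pass uses simulated-int8 weights, backward pass passes gradients
# straight through to the fp32 master weights.
class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        return torch.round(w / scale).clamp(-128, 127) * scale  # simulate int8 on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # STE: treat round() as identity w.r.t. the weights

w = torch.randn(4, 4, requires_grad=True)      # fp32 master weights
scale = w.detach().abs().max() / 127           # simple per-tensor scale
y = FakeQuantize.apply(w, scale).sum()
y.backward()
print(w.grad)                                  # non-zero: the fp32 weights are still trainable
```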

Thanks for taking the time! I'll probably have a couple of follow up questions too.


Does Andrej Karpathy's "Neural Networks: Zero to Hero" course have math prerequisites, or does he explain the necessary math in his videos?

Do I need to be good at math in order to understand Andrej Karpathy's "Neural Networks: Zero to Hero" course, or is all the necessary math explained in his videos? I only know basic algebra and was wondering whether that is enough to start.


Autoencoder for Embedding Tabular Data for Clustering?

Hi everyone,

I'm working on a project where I need to embed an n x m table of data into a latent space for clustering purposes. The goal is to identify similar embeddings and label them (unsupervised learning). I'm considering using either a fully connected autoencoder or a variational autoencoder (VAE) for this task.

From what I understand:

  • Fully Connected Autoencoder:

    • Disadvantages: No probabilistic interpretation of the latent space, potentially less robust embeddings.

  • Variational Autoencoder (VAE):

    • Advantages: Provides a probabilistic interpretation of the latent space, includes a regularization term (KL divergence) to ensure a desirable latent space structure, can generate new data samples.

Given these pros and cons, which approach would you recommend for my use case of clustering similar embeddings? Are there specific considerations or alternative methods I should be aware of for efficiently embedding and clustering this type of tabular data?
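As a baseline, here is a minimal sketch of the fully connected autoencoder option followed by clustering of the latent codes (the data, layer sizes, latent dimension, number of clusters, and the choice of k-means are all assumptions, not a recommendation of one option over the other):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Fully connected autoencoder for an (n x m) table; cluster the latent codes afterwards.
class TabularAE(nn.Module):
    def __init__(self, m: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, m))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

X = torch.randn(1000, 20)                  # placeholder for your n x m data
model = TabularAE(m=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):                       # full-batch reconstruction training
    recon, _ = model(X)
    loss = nn.functional.mse_loss(recon, X)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, Z = model(X)                        # latent embeddings
labels = KMeans(n_clusters=5, n_init=10).fit_predict(Z.numpy())  # cluster the embeddings
```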

Thanks in advance for your insights!