Semantic Memorization

This repository is for EleutherAI's Semantic Memorization project, which defines a unique taxonomy for memorized sequences based on factors that influence memorization. For details on how the likelihood of a sequence being memorized depends on its taxonomic category, please see our paper Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon.

Contents

  • Motivation
  • Taxonomy
  • Reproducing Results
  • Citation Details

Motivation

Memorization in language models is typically treated as a homogeneous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.

Taxonomy

Our taxonomy, illustrated above, defines three types of LM memorization based on colloquial descriptions of human memorization. Humans recite direct quotes that they commit to memory through repeated exposure, so LMs recite highly duplicated sequences. Humans reconstruct a passage by remembering a general pattern and filling in the gaps, so LMs reconstruct inherently predictable boilerplate templates. Humans sporadically recollect an episodic memory or fragment after a single exposure, so LMs recollect other sequences seen rarely during training.
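As a rough illustration of how these three categories could be operationalized from per-sequence statistics, consider the sketch below; the threshold, argument names, and function are hypothetical placeholders rather than the definitions used in the paper.

```python
def taxonomic_label(duplicate_count: int, is_templated: bool,
                    duplication_threshold: int = 10) -> str:
    """Assign an illustrative taxonomic category to a memorized sequence.

    `duplicate_count` is how often the sequence occurs in the training corpus,
    `is_templated` flags inherently predictable (repeating/incrementing)
    sequences, and the threshold of 10 is a placeholder, not the paper's value.
    """
    if duplicate_count > duplication_threshold:
        return "recitation"      # highly duplicated sequences
    if is_templated:
        return "reconstruction"  # predictable boilerplate templates
    return "recollection"        # everything else, seen rarely during training
```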

Reproducing Results

Filters

Code vs Natural Language

To train the natural language vs. code classifier, we used Hugging Face's training pipeline on randomly sampled, equal-weight subsets of bookcorpus and github-code. The following hyperparameters were used during training (a rough sketch of this setup follows the list below):

  • learning_rate: 1e-07
  • train_batch_size: 256
  • eval_batch_size: 1024
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • training_steps: 1000
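What such a run might look like with the Hugging Face `Trainer`, using the hyperparameters above, is sketched here; the base checkpoint, subset sizes, and column handling are assumptions, not the exact setup behind the released classifier.

```python
from datasets import concatenate_datasets, load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical base checkpoint; the repository's classifier may start elsewhere.
MODEL_NAME = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Equal-weight subsets: natural language (label 0) vs. code (label 1).
books = load_dataset("bookcorpus", split="train[:50000]", trust_remote_code=True)
books = books.map(lambda _: {"label": 0}).select_columns(["text", "label"])
code = load_dataset("codeparrot/github-code", split="train[:50000]", trust_remote_code=True)
code = code.rename_column("code", "text").map(lambda _: {"label": 1})
data = concatenate_datasets([books, code.select_columns(["text", "label"])]).shuffle(seed=42)
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True)

# The Adam betas/epsilon listed above match the Trainer's default optimizer settings.
args = TrainingArguments(
    output_dir="nl-vs-code-classifier",
    learning_rate=1e-7,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=1024,
    seed=42,
    lr_scheduler_type="linear",
    max_steps=1000,
)
Trainer(model=model, args=args, train_dataset=data, tokenizer=tokenizer).train()
```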

Following this, we used this script to compute classifier probabilities for each sequence.
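For illustration, applying the trained classifier to new sequences might look roughly like this; the checkpoint path and helper function are placeholders, not part of the repository's scripts.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the fine-tuned checkpoint produced by the training run above.
CKPT = "nl-vs-code-classifier"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

def code_probability(text: str) -> float:
    """Return the classifier's probability that `text` is code (label 1)."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(code_probability("def add(a, b):\n    return a + b"))
```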

Highly Duplicated Filter

To replicate the duplication results, run the following scripts in sequence.
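The scripts themselves are linked from the repository; conceptually, the filter comes down to counting how often each sequence occurs in the training corpus, roughly as in the hypothetical sketch below.

```python
from collections import Counter

def duplicate_counts(sequences):
    """Map each distinct token sequence to its number of occurrences in the corpus."""
    return Counter(tuple(seq) for seq in sequences)

# Hypothetical usage: sequences whose count exceeds a threshold feed the
# "recitation" (highly duplicated) category of the taxonomy.
corpus = [(1, 2, 3), (4, 5, 6), (1, 2, 3)]
print(duplicate_counts(corpus)[(1, 2, 3)])  # 2
```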

Semantic and Textual Matches Filter

To replicate the semantic and textual matches filter, run the following scripts in sequence:

  • Create sentence embeddings for the various datasets with this script.
  • Compute semantic filter counts with this script.
  • Compute textual match counts with this script. For textual matches, we also need to create query-only sentences for each partition, since we compare the Levenshtein distance between queries for this filter. This can be achieved with this script. (A rough sketch of both matching steps follows this list.)
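The sketch below shows, under stated assumptions, what the two kinds of matching could look like: `sentence-transformers` for the embedding-based semantic counts and the `Levenshtein` package for the edit-distance-based textual counts. The model name, thresholds, and function names are illustrative, not the repository's.

```python
import Levenshtein  # pip install python-Levenshtein
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; the repository's scripts may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_match_count(query: str, candidates: list[str], threshold: float = 0.8) -> int:
    """Count candidates whose cosine similarity to `query` exceeds the threshold."""
    q_emb = model.encode(query, convert_to_tensor=True)
    c_emb = model.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, c_emb)[0]
    return int((sims > threshold).sum())

def textual_match_count(query: str, other_queries: list[str], max_distance: int = 10) -> int:
    """Count queries within `max_distance` Levenshtein edits of `query`."""
    return sum(Levenshtein.distance(query, other) <= max_distance for other in other_queries)
```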

Token frequencies

To replicate the token frequency results, run this script. The full list of token frequencies can be found on Hugging Face for the standard and deduplicated datasets.
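At its core, this filter tallies corpus-wide token counts and aggregates them per sample; a minimal sketch of the counting step, with hypothetical inputs, is shown below.

```python
from collections import Counter

def token_frequencies(tokenized_corpus):
    """Count how often each token id appears across an iterable of token-id sequences."""
    freq = Counter()
    for sequence in tokenized_corpus:
        freq.update(sequence)
    return freq

# Hypothetical usage on a tiny toy corpus of token ids.
print(token_frequencies([[5, 7, 5], [7, 9]]))  # Counter({5: 2, 7: 2, 9: 1})
```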

Combining Filters

To combine all the existing filters, run the combine metrics script. You will need to set up an appropriate JDK and install all requirements to run the script. Filter results can be found in this Hugging Face dataset.
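The JDK requirement suggests the combine step runs on Spark; if so, joining the per-sequence filter outputs might look roughly like the PySpark sketch below, where every path and column name is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combine-filters").getOrCreate()

# Hypothetical per-filter outputs, each keyed by a sequence id.
duplication = spark.read.parquet("filters/duplication")       # sequence_id, duplicate_count
code_prob = spark.read.parquet("filters/code_probability")    # sequence_id, code_probability
token_freq = spark.read.parquet("filters/token_frequencies")  # sequence_id, mean_token_frequency

combined = (
    duplication
    .join(code_prob, on="sequence_id", how="outer")
    .join(token_freq, on="sequence_id", how="outer")
)
combined.write.mode("overwrite").parquet("filters/combined")
```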

Note: Filters for templating (incrementing and repeating) as well as Huffman coding length are calculated while the filters are combined.
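As a sketch of what those combined-stage quantities measure, the hypothetical helpers below flag repeating and incrementing templates and compute a Huffman coding length (a compressibility proxy: fewer bits means a more predictable sequence). The definitions are illustrative, not the ones used in the combine script.

```python
import heapq
from collections import Counter

def is_repeating(tokens) -> bool:
    """True if the sequence is one token repeated throughout."""
    return len(set(tokens)) == 1

def is_incrementing(tokens) -> bool:
    """True if consecutive values increase by a constant positive step (e.g. numbered lists)."""
    diffs = {b - a for a, b in zip(tokens, tokens[1:])}
    return len(diffs) == 1 and diffs.pop() > 0

def huffman_coding_length(tokens) -> int:
    """Total Huffman code length, in bits, of a token sequence."""
    counts = Counter(tokens)
    if len(counts) <= 1:
        return len(tokens)  # degenerate case: one distinct symbol, one bit per token
    heap = list(counts.values())
    heapq.heapify(heap)
    total_bits = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total_bits += a + b  # each merge contributes its weight to the total code length
        heapq.heappush(heap, a + b)
    return total_bits

print(huffman_coding_length([1, 1, 1, 2, 3]))  # repetitive sequences compress to fewer bits
```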

Training Taxonomic Model

To train the taxonomic model and launch the greedy taxonomic search, run this script.
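The sketch below is only a guess at the general shape of that step, assuming a scikit-learn classifier over the combined filter features and a greedy forward feature search; the model family, feature names, and file path are all assumptions rather than the script's actual configuration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical combined-filter table: one row per sequence with a binary
# `memorized` label, a taxonomic `category`, and numeric filter columns.
df = pd.read_parquet("filters/combined_with_labels.parquet")
FEATURES = ["duplicate_count", "code_probability", "mean_token_frequency",
            "huffman_coding_length"]

def greedy_feature_search(data, features, target="memorized"):
    """Greedily add whichever feature most improves cross-validated accuracy."""
    selected, best_score = [], 0.0
    while True:
        scores = {
            f: cross_val_score(LogisticRegression(max_iter=1000),
                               data[selected + [f]], data[target], cv=3).mean()
            for f in features if f not in selected
        }
        if not scores:
            break
        feature, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break
        selected.append(feature)
        best_score = score
    return selected, best_score

# Fit one predictive model per taxonomic category, since factors influence
# memorization differently depending on the category.
for category, group in df.groupby("category"):
    print(category, greedy_feature_search(group, FEATURES))
```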

Plots

  • To replicate the results on taxonomic model performance and the plots of model weights, refer to this notebook.
  • For the results on correlation coefficients, refer to this notebook.
  • For the plot of optimal thresholds for the code classifier, refer to this notebook.

Citation Details
