Python150kExtractor

Python150k dataset

Steps to reproduce

Download the parsed Python dataset (py150.tar.gz), unpack it, and place the contents under PYTHON150K_DIR:

# Replace with desired path.
>>> PYTHON150K_DIR=/path/to/data/dir
>>> mkdir -p $PYTHON150K_DIR
>>> cd $PYTHON150K_DIR
>>> wget http://files.srl.inf.ethz.ch/data/py150.tar.gz
...
>>> tar -xzvf py150.tar.gz
...
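For environments without wget or tar, the same download-and-unpack step can be sketched in Python using only the standard library. This is a minimal sketch, not part of the repository: the helper names `download` and `unpack` are hypothetical, and only the URL is taken from the command above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
PY150_URL = "http://files.srl.inf.ethz.ch/data/py150.tar.gz"


def download(url, dest):
    """Fetch url into dest unless dest already exists (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def unpack(archive, target_dir):
    """Extract a .tar.gz archive into target_dir and list what is there."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return sorted(p.name for p in target_dir.iterdir())


# Example call (not executed here; replace the path as in the shell version):
#   data_dir = Path("/path/to/data/dir")
#   unpack(download(PY150_URL, data_dir / "py150.tar.gz"), data_dir)
```

Skipping the download when the archive already exists keeps reruns cheap, since the file is large.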

Extract samples to DATA_DIR (run this from the Python150kExtractor directory, where extract.py lives):

# Replace with desired path.
>>> DATA_DIR=$(pwd)/data/default
>>> SEED=239
>>> python extract.py \
    --data_dir=$PYTHON150K_DIR \
    --output_dir=$DATA_DIR \
    --seed=$SEED
...
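The README does not show how extract.py uses `--seed`. As an illustration only, the sketch below shows the general idea of a seeded shuffle that makes a dataset split reproducible. The function name `seeded_split` and the split fractions are assumptions, not values taken from extract.py.

```python
import random


def seeded_split(items, seed, valid_fraction=0.1, test_fraction=0.1):
    """Shuffle items with a private RNG and cut them into three splits.

    The fractions are hypothetical defaults chosen for illustration.
    """
    rng = random.Random(seed)  # private RNG, leaves the global state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "valid": shuffled[:n_valid],
        "test": shuffled[n_valid:n_valid + n_test],
        "train": shuffled[n_valid + n_test:],
    }


samples = [f"sample_{i}" for i in range(100)]
first = seeded_split(samples, seed=239)
second = seeded_split(samples, seed=239)
print(first == second)  # the same seed reproduces the same split
print({name: len(part) for name, part in first.items()})
```

Passing the same seed to every stage, as the commands here do with SEED, is what lets the reported results be tied to `seed=239`.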

Preprocess for training:

>>> ./preprocess.sh $DATA_DIR
...

Train, running the script from the parent directory:

>>> cd ..
>>> DESC=default
>>> CUDA=0
>>> ./train_python150k.sh $DATA_DIR $DESC $CUDA $SEED
...
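The extract, preprocess, and train steps above can be chained from one place. The following is a rough sketch of such a driver, assuming the scripts are invoked exactly as shown in this README. The names `build_commands` and `run_all` are hypothetical, and `data_dir` should be an absolute path because the training step runs from the parent directory.

```python
import subprocess


def build_commands(python150k_dir, data_dir, seed=239, desc="default", cuda=0):
    """Return (argv, working_directory) pairs mirroring the README commands."""
    return [
        (["python", "extract.py",
          f"--data_dir={python150k_dir}",
          f"--output_dir={data_dir}",
          f"--seed={seed}"], "."),
        (["./preprocess.sh", str(data_dir)], "."),
        # The README does `cd ..` before training, hence the parent directory.
        (["./train_python150k.sh", str(data_dir), desc, str(cuda), str(seed)], ".."),
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, cwd in commands:
        print(f"[cwd={cwd}] {' '.join(argv)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, cwd=cwd, check=True)


run_all(build_commands("/path/to/data/dir", "/path/to/output/data/default"))
```

The dry run is the default so that the command lines can be inspected before anything is launched.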

Test results (seed=239)

Best scores

setup#2: batch_size=64
setup#3: embedding_size=256,use_momentum=False
setup#4: batch_size=32,embedding_size=256,embeddings_dropout_keep_prob=0.5,use_momentum=False
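Each setup above is written as a comma-separated list of `key=value` pairs. Assuming these are overrides applied on top of the default configuration (the README does not list the default values, so none are filled in here), a small parser could turn them into typed dictionaries. The name `parse_overrides` is hypothetical.

```python
import ast


def parse_overrides(spec):
    """Parse 'k1=v1,k2=v2' into a dict, converting numbers and booleans."""
    overrides = {}
    for pair in spec.split(","):
        key, _, raw = pair.partition("=")
        try:
            value = ast.literal_eval(raw.strip())  # 64 -> int, False -> bool
        except (ValueError, SyntaxError):
            value = raw.strip()  # keep anything else as a plain string
        overrides[key.strip()] = value
    return overrides


# Setup strings copied from the list above.
SETUPS = {
    "setup#2": "batch_size=64",
    "setup#3": "embedding_size=256,use_momentum=False",
    "setup#4": ("batch_size=32,embedding_size=256,"
                "embeddings_dropout_keep_prob=0.5,use_momentum=False"),
}

for name, spec in SETUPS.items():
    print(name, parse_overrides(spec))
```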

params	Precision	Recall	F1	ROUGE-2	ROUGE-L
default	0.37	0.27	0.31	0.06	0.38
setup#2	0.40	0.31	0.34	0.08	0.41
setup#3	0.36	0.31	0.33	0.09	0.38
setup#4	0.33	0.25	0.28	0.05	0.34
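As a sanity check on the table, F1 can be recomputed from the Precision and Recall columns, assuming F1 here is the usual harmonic mean of the two (the README does not state the formula). Because the table shows values rounded to two decimals, the recomputed number can differ from the reported one by about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
ROWS = {
    "default": (0.37, 0.27, 0.31),
    "setup#2": (0.40, 0.31, 0.34),
    "setup#3": (0.36, 0.31, 0.33),
    "setup#4": (0.33, 0.25, 0.28),
}

for name, (precision, recall, reported) in ROWS.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed={recomputed:.3f} reported={reported:.2f}")
```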

Ablation studies

params	Precision	Recall	F1	ROUGE-2	ROUGE-L
default	0.37	0.27	0.31	0.06	0.38
no ast nodes (5th epoch)	0.27	0.16	0.20	0.02	0.28
no token split (4th epoch)	0.60	0.09	0.15	0.00	0.60
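The README does not define "token split". In code2seq-style preprocessing it usually means breaking identifiers into lowercase subtokens on underscores and camelCase boundaries, and the sketch below illustrates that idea only. The function `split_subtokens` is hypothetical and is not taken from extract.py or preprocess.sh.

```python
import re

# Matches, in order: an acronym followed by a capitalized word (HTTP|Server),
# a capitalized or lowercase word, a trailing acronym, or a run of digits.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_subtokens(identifier):
    """Split an identifier into lowercase subtokens."""
    parts = []
    for chunk in identifier.split("_"):  # snake_case boundaries first
        parts.extend(_SUBTOKEN.findall(chunk))  # then camelCase boundaries
    return [part.lower() for part in parts]


for name in ["get_file_name", "getFileName", "HTTPServer", "__init__"]:
    print(name, "->", split_subtokens(name))
```

Without such splitting, a predicted name would have to match the whole identifier as a single token, which may be related to the very different precision and recall in the "no token split" row.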

