By Bijan Bowen
Making Runescape GPT Using NanoGPT - Training A Model From Scratch
I have been wanting to get more experience with the core concepts of language models, but I wasn't entirely sure where to begin. Thanks to a Reddit mention of the nanoGPT repository on GitHub, I was able to gain fantastic hands-on experience training a language model from scratch on my own dataset and my own machine. Since many of the higher-powered models we use are built by others and mostly hosted online, being able to construct and play with a miniature version of my own was an extremely refreshing experience in a field where the pace of technical development can often feel overwhelming.
The nanoGPT repo includes a simple example of training a tiny GPT on a dataset of Shakespeare's works. While I wanted to follow the repo's original instructions, I must have some lingering high-school-English PTSD, because I did not want to use any Shakespeare-related text. I figured it would be a good exercise to bring my own dataset into this experiment, so after a bit of deliberation, I decided to create a little 'Runescape (2007) GPT', trained to regurgitate terminology pertaining to the wonderful game of Old School Runescape.
In order to produce a semi-coherent model, I needed a robust dataset that would likely be far too time-consuming to gather by hand. Thanks to some awesome Python libraries, I was able to write a small script to automate the cumbersome process of data gathering and preparation. To keep things simple, I made the script comb through 10 URLs from the OSRS wiki, gather sentences from each page, separate them into lines, remove any duplicates, and finally, output the results to a txt file. Fortunately, the script worked well on the first try, and after about 10 seconds, I was given a list of 866 unique Runescape-related sentences.
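Here is a minimal sketch of the kind of scraping script I am describing, using the requests and BeautifulSoup libraries; the specific wiki URLs, the sentence-splitting regex, and the output filename are illustrative stand-ins rather than the exact ones from my script.

```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical list of OSRS wiki pages to pull text from (ten in total).
URLS = [
    "https://oldschool.runescape.wiki/w/Old_School_RuneScape",
    "https://oldschool.runescape.wiki/w/Grand_Exchange",
    # ... eight more wiki pages ...
]

sentences = []
for url in URLS:
    html = requests.get(url, headers={"User-Agent": "osrs-gpt-dataset-script"}).text
    soup = BeautifulSoup(html, "html.parser")
    # Pull the article paragraphs and split them into rough sentences.
    for p in soup.select("p"):
        text = p.get_text(" ", strip=True)
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip())

# Drop duplicates while preserving order, then write one sentence per line.
unique = list(dict.fromkeys(sentences))
with open("input.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(unique))
```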
Now that I had my dataset, the rest of the experiment was really as simple as following along with the nanoGPT repository. I ran the prepare script, which tokenizes the input text with the GPT-2 tokenizer; ran the training script on the .bin outputs of the prepare script; and finally ran the sample script against the trained checkpoint to see my newly trained model in action. I changed a few parameters in the sample script, mainly to limit each generation to 150 tokens and cap the number of samples at 2, so it wouldn't just spit out walls of OSRS-related text at me. To briefly touch on training: I used only one of my Nvidia RTX 3090 Ti GPUs, and given how small the dataset and model were, training took minutes at most, especially since the loss dropped quickly in a way that may have indicated overfitting.
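For a sense of what that prepare step actually does, here is a rough sketch modeled on nanoGPT's data/shakespeare/prepare.py; the input/output paths and the 90/10 train/validation split here are my assumptions, not necessarily the exact values I used.

```python
import numpy as np
import tiktoken

with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Encode the raw text with the GPT-2 BPE tokenizer and split it into
# training and validation portions.
enc = tiktoken.get_encoding("gpt2")
n = len(data)
train_ids = enc.encode_ordinary(data[: int(n * 0.9)])
val_ids = enc.encode_ordinary(data[int(n * 0.9):])

# nanoGPT's train.py reads the token ids back from uint16 .bin files.
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")
```

The training and sampling scripts then just point at these .bin files and the resulting checkpoint; the 150-token and 2-sample caps I mentioned correspond to the max_new_tokens and num_samples settings near the top of the sample script.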
Overall, the model did work! The sample script showed that the model had indeed learned Runescape-related terminology and could generate fairly lucid OSRS-related sentences, given the small training set, that would make sense to someone familiar with the game. While the model likely overfit (that is, it memorized much of the specific training data because the dataset was so small), it was still a successful experiment that gave me wonderful hands-on insight into building and training a small language model of my own.
You can view the video for this article on my YouTube channel.