# Namesformer

Before we get into the lecture, you can play with the trained model here: [Namesformer Streamlit app](https://namesformer.streamlit.app/).

Inspired by Andrej Karpathy's [makemore](https://www.youtube.com/watch?v=PaCmpygFfXo&t=131s) lecture, which covers English name generation.

The code was written almost entirely with ChatGPT, with minimal corrections. My first query was:

```
I am preparing a lecture for my students on AI basics. They already know how to use attention in PyTorch to create self-attention layers. What I want to explain them is how to make a simplest possible transformer architecture (with minimal amount of code).
 As a dataset I will use a csv with names:
    john
    peter
    mike
    ...
And the goal will be to generate more names that sound name-like.
Give me an implementation with PyTorch trying to keep it as minimal as possible.
```

After that I had to ask for a couple of corrections, such as avoiding the Transformer layer, adding comments, and fixing a bug in token indexing. All were relatively easy to spot, and in less than an hour this notebook was generating plausible-sounding names.

I decided to replace the original dataset, since I found a list of Lithuanian names that is easy to extract from [vardai.vlkk.lt](https://vardai.vlkk.lt) using the following code snippet:

```python
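import requests
from bs4 import BeautifulSoup

names = []
# One page per initial letter ('c-2', 's-2', 'z-2' are extra pages in the site's URL scheme)
for key in ['a', 'b', 'c', 'c-2', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
            'm', 'n', 'o', 'p', 'r', 's', 's-2', 't', 'u', 'v', 'z', 'z-2']:
    url = f'https://vardai.vlkk.lt/sarasas/{key}/'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Stop early if a page fails to load
    soup = BeautifulSoup(response.text, 'html.parser')
    # Name entries are anchors carrying exactly this class attribute
    links = soup.find_all('a', class_='names_list__links names_list__links--man')
    names += [name.text for name in links]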
```
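The cells that follow read the names back with `pd.read_csv('vardai.txt')['name']`, so the scraped list has to end up in a file with a `name` header. That step is not shown in the notebook. A minimal sketch of it, assuming a one-column CSV; the helper `save_names` and the file name `vardai_example.txt` are hypothetical:

```python
import csv

def save_names(names, path='vardai.txt'):
    """Write one name per row under a 'name' header (hypothetical helper)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['name'])  # header expected by pd.read_csv(...)['name']
        writer.writerows([n.strip()] for n in names)

# Toy stand-in for the scraped list
example_names = ['Abas', 'Rimantas']
save_names(example_names, 'vardai_example.txt')
```

Writing with an explicit UTF-8 encoding matters here because the real names contain accented characters.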

If you want to play with English names, download them from [here](https://github.com/karpathy/makemore/blob/master/names.txt) and use *names.txt* instead of *vardai.txt*. That file has no header row, so read it with `pd.read_csv('names.txt', names=['name'])`.

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

In [14]:
names = pd.read_csv('vardai.txt')['name'].values
names

array(['Ãbas', 'Ãbdijus', 'Abdònas', ..., 'Žilvynas', 'Žimantas',
       'Žydrunas'], dtype=object)

In [15]:
len(names)

3495

Let's add a space at the end of each name to mark where the name ends.

In [16]:
names += ' '

In [17]:
names[0]

'Ãbas '

Note that this dataset is not simple, since it uses accent marks and capital letters. Let's intentionally keep it this way and see if the model can figure it out.

Our transformer will be based on self-attention.
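As a reminder of what self-attention computes for next-letter prediction, here is a rough single-head sketch with a causal mask, so that each position only attends to itself and earlier characters. It is independent of the `nn.TransformerEncoderLayer` used in the notebook; the function name, the random weight matrices, and the sizes are all illustrative assumptions.

```python
import math
import torch

torch.manual_seed(0)

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where position i only sees positions <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # each (seq_len, d)
    scores = q @ k.T / math.sqrt(k.size(-1))             # (seq_len, seq_len)
    seq_len = x.size(0)
    # True above the diagonal marks "future" positions that must stay hidden
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float('-inf'))
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v, weights

seq_len, d = 4, 8                                        # toy sizes
x = torch.randn(seq_len, d)                              # stand-in for embedded characters
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

Printing `weights` shows a lower-triangular matrix: the first character can only look at itself, while the last one can look at the whole prefix.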

In [26]:
dataset.vocab_size

83

In [27]:
len(dataset.int_to_char)

83
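The vocabulary is character-level: every distinct character in the data, including capitals and accented letters, gets its own integer id. A toy sketch of the same idea on three made-up entries (the real vocabulary of 83 characters comes from the full list, not from this example):

```python
toy_names = ['Abas', 'Rimas', 'rasa']  # illustrative strings, not the real dataset

# Every distinct character becomes a token; ' ' marks the end of a name
chars = sorted(set(''.join(toy_names) + ' '))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}

encoded = [char_to_int[c] for c in 'Rimas ']
decoded = ''.join(int_to_char[i] for i in encoded)
```

Note that `'R'` and `'r'` receive different ids, which is why keeping capital letters makes the task slightly harder for the model.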

In [18]:
# Adjusted NameDataset
class NameDataset(Dataset):
    def __init__(self, csv_file):
        self.names = pd.read_csv(csv_file)['name'].values  # Read from the path passed in
        self.chars = sorted(list(set(''.join(self.names) + ' ')))  # ' ' serves as the end-of-name/padding character
        self.char_to_int = {c: i for i, c in enumerate(self.chars)}
        self.int_to_char = {i: c for c, i in self.char_to_int.items()}
        self.vocab_size = len(self.chars)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx] + ' '  # Adding padding character at the end
        encoded_name = [self.char_to_int[char] for char in name]
        return torch.tensor(encoded_name)

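# Custom collate function for padding
def pad_collate(batch):
    # padding_value=0 assumes ' ' has index 0, which holds when it is the
    # smallest character in the sorted vocabulary
    padded_seqs = pad_sequence(batch, batch_first=True, padding_value=0)
    input_seq = padded_seqs[:, :-1]   # Everything except the last character
    target_seq = padded_seqs[:, 1:]   # Same sequence shifted left by one: the next character
    return input_seq, target_seq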

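# Minimal Transformer Model
class MinimalTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, forward_expansion):
        super(MinimalTransformer, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # Learned positional encoding for sequences of up to 100 characters
        self.positional_encoding = nn.Parameter(torch.randn(1, 100, embed_size))
        # batch_first=True because our tensors are shaped (batch, sequence, embedding);
        # without it the layer would attend across the batch instead of across characters
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=num_heads,
            dim_feedforward=forward_expansion * embed_size, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=1)
        self.output_layer = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        x = self.embed(x) + self.positional_encoding[:, :seq_len, :]
        # Causal mask: -inf above the diagonal so a position cannot attend to later characters
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=x.device), diagonal=1)
        x = self.transformer_encoder(x, mask=causal_mask)
        x = self.output_layer(x)
        return x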

# Training Loop
def train_model(model, dataloader, epochs=10):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())

    for epoch in range(epochs):
        model.train()  # Ensure the model is in training mode
        total_loss = 0.0
        batch_count = 0

        for batch_idx, (input_seq, target_seq) in enumerate(dataloader):
            optimizer.zero_grad()
            output = model(input_seq)
            loss = criterion(output.transpose(1, 2), target_seq)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            batch_count += 1

        average_loss = total_loss / batch_count
        print(f'Epoch {epoch+1}, Average Loss: {average_loss}')


csv_file = 'vardai.txt'
dataset = NameDataset(csv_file)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=pad_collate)
model = MinimalTransformer(vocab_size=dataset.vocab_size, embed_size=128, num_heads=8, forward_expansion=4)
train_model(model, dataloader)



Epoch 1, Average Loss: 1.5661300399086693
Epoch 2, Average Loss: 1.3030138059095904
Epoch 3, Average Loss: 1.274652126160535
Epoch 4, Average Loss: 1.2546222600069914
Epoch 5, Average Loss: 1.2314461583440954
Epoch 6, Average Loss: 1.22686504017223
Epoch 7, Average Loss: 1.2192441631447186
Epoch 8, Average Loss: 1.2121069810607217
Epoch 9, Average Loss: 1.2129392515529285
Epoch 10, Average Loss: 1.2083363023671236
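The training targets come from the shift-by-one trick in `pad_collate`: the input is the padded name without its last token, and the target is the same name without its first token. A plain-Python sketch of that logic on toy ids, where the function name `pad_and_shift` and the id values are made up for illustration:

```python
def pad_and_shift(batch, pad_id=0):
    """Pad id lists to equal length, then split into inputs and next-token targets."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    inputs = [seq[:-1] for seq in padded]   # everything except the last token
    targets = [seq[1:] for seq in padded]   # the same sequence shifted left by one
    return inputs, targets

# Toy ids: 0 is the end/padding token, as in the notebook
inputs, targets = pad_and_shift([[5, 3, 7, 0], [4, 2, 0]])
# inputs  -> [[5, 3, 7], [4, 2, 0]]
# targets -> [[3, 7, 0], [2, 0, 0]]
```

Because the padding id is the same as the end-of-name id, padded positions also contribute to the loss: the model is additionally trained to predict a space after a space.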


Now let's generate a name by repeatedly predicting the next letter.

In [19]:
def sample(model, dataset, start_str='a', max_length=20):
    model.eval()  # Switch to evaluation mode
    with torch.no_grad():
        # Convert start string to tensor
        chars = [dataset.char_to_int[c] for c in start_str]
        input_seq = torch.tensor(chars).unsqueeze(0)  # Add batch dimension
        
        output_name = start_str
        for _ in range(max_length - len(start_str)):
            output = model(input_seq)
            
            # Get the last character from the output
            probabilities = torch.softmax(output[0, -1], dim=0)
            # Sample a character from the probability distribution
            next_char_idx = torch.multinomial(probabilities, 1).item()
            next_char = dataset.int_to_char[next_char_idx]
            
            if next_char == ' ':  # Assume ' ' is your end-of-sequence character
                break
            
            output_name += next_char
            # Update the input sequence for the next iteration
            input_seq = torch.cat([input_seq, torch.tensor([[next_char_idx]])], dim=1)
        
        return output_name

# After training your model, generate a name starting with a specific letter
for _ in range(10):
    generated_name = sample(model, dataset, start_str='R')
    print(generated_name)

Regontas
Rongonis
Rẽšvijus
Ralijijus
Rugaustinas
Rámeñtas
Raùcius
Rigū̃sintas
Rùntis
Rorū́lis


Not bad! Note that the last generated name is not in our names list.

In [20]:
generated_name

'Rorū́lis'

In [21]:
generated_name + ' ' in names

False

Let's train for longer.

In [22]:
train_model(model, dataloader, epochs=200)

Epoch 1, Average Loss: 1.2126153149388053
Epoch 2, Average Loss: 1.2008646195585078
Epoch 3, Average Loss: 1.2044431323354894
Epoch 4, Average Loss: 1.2084777745333586
Epoch 5, Average Loss: 1.2039412173357877
Epoch 6, Average Loss: 1.2010385670445183
Epoch 7, Average Loss: 1.2084186375141144
Epoch 8, Average Loss: 1.2009209258989855
Epoch 9, Average Loss: 1.1919255847280676
Epoch 10, Average Loss: 1.196088556809859
Epoch 11, Average Loss: 1.1934009887955406
Epoch 12, Average Loss: 1.1910438266667454
Epoch 13, Average Loss: 1.1883929929950021
Epoch 14, Average Loss: 1.18849547884681
Epoch 15, Average Loss: 1.1841346204280854
Epoch 16, Average Loss: 1.1886381100524555
Epoch 17, Average Loss: 1.1862260487946596
Epoch 18, Average Loss: 1.1879008596593683
Epoch 19, Average Loss: 1.1960965817624873
Epoch 20, Average Loss: 1.182948770306327
Epoch 21, Average Loss: 1.1842969905246388
Epoch 22, Average Loss: 1.184790024432269
Epoch 23, Average Loss: 1.181914219531146
Epoch 24, Average Loss: 1.

In [23]:
for _ in range(10):
    generated_name = sample(model, dataset, start_str='R')
    print(generated_name)

Ritãperdas
Rãlantas
Rãtardas
Ripõnas
Rìlmiktẽras
Rárvydas
Rìnis
Rarntìndas
Rusktaugas
Rĩšvas


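If we want the model to be more creative, we can add temperature (creativity) control: the logits are divided by a temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it.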

In [24]:
def sample(model, dataset, start_str='a', max_length=20, temperature=1.0):
    assert temperature > 0, "Temperature must be greater than 0"
    model.eval()  # Switch model to evaluation mode
    with torch.no_grad():
        # Convert start string to tensor
        chars = [dataset.char_to_int[c] for c in start_str]
        input_seq = torch.tensor(chars).unsqueeze(0)  # Add batch dimension
        
        output_name = start_str
        for _ in range(max_length - len(start_str)):
            output = model(input_seq)
            
            # Apply temperature scaling
            logits = output[0, -1] / temperature
            probabilities = torch.softmax(logits, dim=0)
            
            # Sample a character from the probability distribution
            next_char_idx = torch.multinomial(probabilities, 1).item()
            next_char = dataset.int_to_char[next_char_idx]
            
            if next_char == ' ':  # Assume ' ' is your end-of-sequence character
                break
            
            output_name += next_char
            # Update the input sequence for the next iteration
            input_seq = torch.cat([input_seq, torch.tensor([[next_char_idx]])], dim=1)
        
        return output_name

# Example usage with different temperatures
print('More confident:')
for _ in range(10):
    print(' ', sample(model, dataset, start_str='R', temperature=0.5))  # More confident

print('\nMore diverse/creative:')
for _ in range(10):
    print(' ', sample(model, dataset, start_str='R', temperature=1.5))  # More diverse

More confident:
  Rìnartas
  Rèntas
  Rĩgis
  Rùrìldas
  Ròrijus
  Rìnis
  Rarìtas
  Rėnas
  Relinijus
  Rìnijus

More diverse/creative:
  Rūdrìmildas
  Ratėnas
  Revýdridas
  Rõr̃mijus
  Rýgmas
  Ròtotrdas
  Rãan
  Riguòmas
  Ridmistis
  Rìrhez
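To see why the temperature changes how adventurous the samples are, it helps to look at the probabilities directly. A small sketch in plain Python, with made-up logits for three candidate characters (the helper name `softmax_with_temperature` is an assumption, not part of the notebook):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature most of the probability mass sits on the top-scoring character, while at a high temperature the three options move closer together, so unusual letters get sampled more often.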


Here we go, we have a Lithuanian name generator!

In [25]:
import json

torch.save(model, '../namesformer_app/namesformer_model.pt')
    
with open('../namesformer_app/int_to_char.json', 'w') as f:
    json.dump(dataset.int_to_char, f)

with open('../namesformer_app/char_to_int.json', 'w') as f:
    json.dump(dataset.char_to_int, f)
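One detail to keep in mind on the loading side: JSON object keys are always strings, so the integer ids in `int_to_char` come back as `'0'`, `'1'`, and so on, and need converting before they can be used with the model's output indices. A sketch with a toy mapping standing in for `dataset.int_to_char`:

```python
import json

# Toy mapping standing in for dataset.int_to_char
int_to_char = {0: ' ', 1: 'a', 2: 'b'}

restored_raw = json.loads(json.dumps(int_to_char))
# The ids come back as string keys, so convert them to int again
int_to_char_restored = {int(k): v for k, v in restored_raw.items()}
```

Also note that `torch.save(model, ...)` pickles the whole model object, so the code that loads it needs the `MinimalTransformer` class definition to be importable.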