
Add TD learing agent #26

Merged: 1 commit into jserv:main on Feb 19, 2024
Conversation

ndsl7109256
Contributor

TD learning is a classic Reinforcement Learning algorithm. We can pretrain the model and store it. When playing against the AI, the AI agent can simply look up the pretrained model to determine the best moves.

Ref:
https://en.wikipedia.org/wiki/Temporal_difference_learning
https://github.com/agrawal-rohit/tic-tac-toe-ai-bots
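For context, a minimal sketch of the tabular TD(0) value update this approach relies on — the names `state_value`, `ALPHA`, and `GAMMA` here are illustrative, not the PR's actual code:

```c
#include <assert.h>

#define ALPHA 0.1f /* learning rate (illustrative value) */
#define GAMMA 0.9f /* discount factor (illustrative value) */

/* Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
 * After training, state_value is the lookup table the agent consults. */
float td_update(float *state_value, int s, int s_next, float reward)
{
    state_value[s] +=
        ALPHA * (reward + GAMMA * state_value[s_next] - state_value[s]);
    return state_value[s];
}
```

At play time no update is needed; the agent just picks the move leading to the successor state with the highest stored value.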

Owner

@jserv jserv left a comment

I am enthusiastic about incorporating the reinforcement learning (RL) algorithm into this project. However, before proceeding with the integration, a few adjustments are required.

  1. Maximize the reuse of existing code. This may necessitate separating the board display from the game logic to ensure a modular approach, facilitating the integration of the proposed RL/TD algorithm without replicating code.
  2. Opt for a 4x4 board size, or possibly larger, rather than the traditional 3x3 format. My aim is to slightly elevate the complexity.

@visitorckw
Collaborator

I am concerned that the hash function may risk overflow with a larger board size. Perhaps switching to a Zobrist hash would be a good choice?

@ndsl7109256
Contributor Author

  1. OK. I will try to refine the code.
  2. It indeed supports a 4x4 board now, but not 5x5 or larger, due to the inefficient state hash function. The current hash function maintains the board state as a ternary number, e.g.:

o |  |
--------
  |  |
--------
  |  | x

would be hashed as $100000002_{3}$.
For a 3x3 board, this requires $3^{9}$ states, and many of them are unreachable (only 5478 states are actually reachable).
For a 5x5 board, it requires $3^{25} = 847288609443$ states, which might cause an overflow.
Maybe using Zobrist hashing would be a better way, but I am not sure how large a board it would support.

ref : https://informatika.stei.itb.ac.id/~rinaldi.munir/Matdis/2021-2022/Makalah2021/Makalah-Matdis-2021%20(148).pdf
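The ternary scheme described above can be sketched as follows — a simplified illustration assuming ' '/'o'/'x' cell values and index 0 as the most significant digit, not the PR's exact code:

```c
#include <assert.h>

#define N_GRIDS 9 /* 3x3 board for illustration */

/* Each cell is one base-3 digit: ' ' -> 0, 'o' -> 1, 'x' -> 2.
 * With a 32-bit int this overflows once 3^(number of cells) exceeds
 * INT_MAX, which is why 5x5 (3^25 states) is out of reach. */
int table_to_hash(const char *table)
{
    int h = 0;
    for (int i = 0; i < N_GRIDS; i++) {
        int d = (table[i] == 'o') ? 1 : (table[i] == 'x') ? 2 : 0;
        h = h * 3 + d;
    }
    return h;
}
```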

Contributor

@paul90317 paul90317 left a comment

It seems that you use a random player as the AI's opponent during training, which does not make sense. You could let the AI player play against itself or the original AI instead.

train.c Outdated
}
int cur_state_hash = hash(table);
if (turn == playerO) {
int move = get_action_O(table);
Contributor Author

I thought I used another AI player, which is also trained with TD learning (get_action_X and get_action_O are quite the same). Is there a bug here?

However, I am curious why using a random player as an opponent does not make sense. I consider the opponent's moves as 'environment state transitions', and I believe that state transitions in an RL environment can exhibit randomness. Could you explain your idea in more detail?

Contributor

@paul90317 paul90317 Feb 3, 2024

My mistake; you used epsilon-greedy. But is epsilon-greedy needed in a minimax game?
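For reference, a hypothetical sketch of epsilon-greedy action selection over a state-value table — the names and the epsilon handling are assumptions, not the PR's code:

```c
#include <assert.h>
#include <stdlib.h>

/* With probability eps pick a random available move (explore);
 * otherwise pick the move with the highest estimated value (exploit). */
int get_action_epsilon_greedy(const int *moves, int cnt, const float *q,
                              float eps)
{
    if ((float) rand() / RAND_MAX < eps)
        return moves[rand() % cnt]; /* explore */
    int best = moves[0];
    for (int i = 1; i < cnt; i++)   /* exploit: argmax over q */
        if (q[moves[i]] > q[best])
            best = moves[i];
    return best;
}
```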

@visitorckw
Collaborator

> 2. It indeed opts for 4x4 board now, but not for 5x5 and more and it's due to the inefficient state hash function now. [...] Maybe using Zobrist Hashing would be a better way but I am not sure how large board it would support.

AFAIK, most Go AI software also utilizes the Zobrist hash (Go's board size is 19x19), and its calculation is quite fast (requiring only one XOR operation after each move). However, the Zobrist hash, unlike your hash function, cannot guarantee the absence of collisions (although the probability is low enough to be generally negligible in practice). Therefore, if it is used for efficiency improvements in algorithms like minimax, the algorithm may shift from guaranteeing discovery of the optimal solution to no longer providing such a guarantee.
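A minimal sketch of Zobrist hashing for illustration (the board size and piece encoding are assumptions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N_CELLS 16 /* 4x4 board (assumption) */
#define N_PIECES 2 /* 'O' and 'X' */

static uint64_t zobrist_table[N_CELLS][N_PIECES];

/* Assign one random 64-bit key per (cell, piece) pair at startup. */
void zobrist_init(void)
{
    for (int i = 0; i < N_CELLS; i++)
        for (int j = 0; j < N_PIECES; j++)
            zobrist_table[i][j] = ((uint64_t) rand() << 32) ^ (uint64_t) rand();
}

/* Placing (or undoing) a piece updates the hash with a single XOR. */
uint64_t zobrist_move(uint64_t h, int cell, int piece)
{
    return h ^ zobrist_table[cell][piece];
}
```

Because the same XOR both applies and undoes a move, incremental updates are cheap, at the cost of a small collision probability.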

@visitorckw
Collaborator

Additionally, I think we can split it into two files, one focusing on game mechanics and the other on AI decision-making.

README.md (outdated, resolved)
td_agent.c (outdated, resolved)
util.h (outdated, resolved)
Owner

@jserv jserv left a comment

Hide the implementation details during modularization.


Owner

@jserv jserv left a comment

Use git rebase -i to rework git commits, so that we can review in a straightforward way.

agents/negamax.c (outdated, resolved)
agents/td_agent.c (outdated, resolved)
game.h (outdated, resolved)
main.c (outdated, resolved)
Makefile (outdated, resolved)
train.c (outdated, resolved)
train.c (outdated, resolved)
@paul90317
Contributor

paul90317 commented Feb 9, 2024

The functions available_moves and get_available_moves are almost the same. And I'm wondering why you use get_score as the reward instead of the result of the game.

@ndsl7109256
Contributor Author

> The functions available_moves and get_available_moves are almost the same. And I'm wondering why you use get_score as the reward instead of the result of the game.

  1. I have noticed the similarity between available_moves and get_available_moves. I thought the available_moves used in negamax.c could be replaced with get_available_moves in a later PR.
  2. I think using the board score from get_score as the reward, or using the game's result, are both reasonable reward designs. I used the board score because it offers more information than the game result alone. (Indeed, I tried using 1 for a win and -1 for a loss as the reward, but it did not perform well.) Moreover, since negamax also uses the board score, perhaps we could compare the two algorithms in the future by aligning them on the same scoring method.

@jserv jserv requested a review from visitorckw February 9, 2024 14:23
train.c Outdated
#include <time.h>
#include "agents/temporal_difference.h"
#include "game.h"
#include "util.h"
Collaborator

@visitorckw visitorckw Feb 9, 2024

When executing `make train`, the following compilation error occurred:

```
train.c:12:10: fatal error: util.h: No such file or directory
   12 | #include "util.h"
      |          ^~~~~~~~
compilation terminated.
```

train.c Outdated
void store_state_value()
{
FILE *fptr = NULL;
if ((fptr = fopen("state_value.bin", "wb")) == NULL)
Collaborator

Consider using a macro to specify the file name instead of hard coding the file name directly in the code.
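Something along these lines — the macro name MODEL_NAME is the reviewer's suggestion elsewhere in this review, and the helper function is illustrative:

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_NAME "state_value.bin" /* single point of change for the name */

/* Hypothetical helper: all opens go through the macro, so renaming
 * the model file is a one-line change. */
FILE *open_model(const char *mode)
{
    return fopen(MODEL_NAME, mode);
}
```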

(nxt_reward + GAMMA * (*agent).state_value[nxt_state_hash]);
}

// TODO: find a more efficient hash
Collaborator

Based on the previous discussion regarding the hash function, overflow issues will be encountered with a 5x5 board size and larger. Therefore, I would like to mention this in the comments as well.

agents/temporal_difference.c (outdated, resolved)
if (!((*agent).state_value))
exit(1);
for (unsigned int i = 0; i < state_num; i++)
(*agent).state_value[i] = get_score(hash_to_table(i), player);
Collaborator

Ditto.

int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
int max_act = -1;
float max_q = -DBL_MAX;
Collaborator

Since a single-precision floating-point number is used here, we should use FLT_MAX instead of DBL_MAX.

char OX = (*agent).player;
printf("[ ");
for (int i = 0; i < move_cnt; i++) {
char *try_table = malloc(sizeof(char) * N_GRIDS);
Collaborator

@visitorckw visitorckw Feb 9, 2024

We should check the return value of malloc.
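A sketch of the pattern being asked for (the function name and shape are illustrative, not the PR's code):

```c
#include <assert.h>
#include <stdlib.h>

/* Copy a board, propagating allocation failure to the caller
 * instead of assuming malloc always succeeds. */
char *dup_board(const char *table, int n_grids)
{
    char *copy = malloc(n_grids);
    if (!copy)
        return NULL; /* caller decides how to handle the failure */
    for (int i = 0; i < n_grids; i++)
        copy[i] = table[i];
    return copy;
}
```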

agents/temporal_difference.c (resolved)
train.c Outdated
x = 1; \
for (int i = 0; i < N_GRIDS; i++) \
x *= 3; \
}
Collaborator

@visitorckw visitorckw Feb 9, 2024

This macro is defined once in temporal_difference.h and once in train.c. Is it possible to consolidate them so that it is only defined once?
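One way to consolidate it, sketched under the assumption that both files can include a shared header; the guard shown is illustrative:

```c
#include <assert.h>

#define N_GRIDS 9 /* 3x3 board for illustration */

/* Defined once in a shared header; an include guard (or #pragma once)
 * keeps repeated inclusion harmless. The do-while wrapper makes the
 * macro behave like a single statement. */
#ifndef CALC_STATE_NUM
#define CALC_STATE_NUM(x)                 \
    do {                                  \
        x = 1;                            \
        for (int i = 0; i < N_GRIDS; i++) \
            x *= 3;                       \
    } while (0)
#endif
```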

@visitorckw
Collaborator

Please correct the typo 'learing' to 'learning' in the commit message.

exit(1);
}
long offset = (agent->player == 'O') ? 0 : state_num * sizeof(float);
fseek(fptr, offset, SEEK_SET);
Collaborator

Since fseek can also fail, we should check the return value of fseek.
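A short sketch of the check; the error message wording follows the "Failed to ..." convention requested elsewhere in this review, and the function name is illustrative:

```c
#include <assert.h>
#include <stdio.h>

/* fseek returns nonzero on failure, so treat it like any fallible call. */
int seek_model(FILE *fptr, long offset)
{
    if (fseek(fptr, offset, SEEK_SET) != 0) {
        perror("Failed to seek in state value file");
        return -1;
    }
    return 0;
}
```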

game.h Outdated
@@ -1,6 +1,6 @@
#pragma once

-#define BOARD_SIZE 4
+#define BOARD_SIZE 3
Collaborator

It seems like you changed the default setting again. I think we would like to keep the default setting to a 4x4 board size.

main.c Outdated
unsigned int state_num = 1;
CALC_STATE_NUM(state_num);
init_td_agent(&agent, state_num, 'O');
load_model(&agent, state_num, "state_value.bin");
Collaborator

Since we introduced the MODEL_NAME macro to allow for name changes, we shouldn't hard code the file names here. Otherwise, any name changes would lead to errors.

Owner

@jserv jserv left a comment

Rebase the latest main branch.

@jserv jserv requested a review from paul90317 February 15, 2024 17:47
Makefile (outdated, resolved)
agents/temporal_difference.c (outdated, resolved)
Makefile Outdated
@@ -17,10 +21,17 @@ OBJS := \
zobrist.o \
agents/negamax.o \
main.o

Owner

Minimize the necessary changes. Remove this blank line here.

README.md Outdated
$./train
```

Build the game:
Owner

You should explain the functionality of the TD program.

{
FILE *fptr = fopen(model_path, "rb");
if (!fptr) {
printf("could not open state value table, train first\n");
Owner

The error messages should start with "Failed to." That is, "Failed to open state value table file."

exit(1);
}
if (fread(agent->state_value, state_num * sizeof(float), 1, fptr) != 1) {
printf("could not load\n");
Owner

Ditto. The message should be "Failed to load the model of ..."

int move_cnt = 0;
int *available_moves = malloc(N_GRIDS * sizeof(int));
if (!available_moves)
exit(1);
Owner

Instead of invoking exit without any prompt, you should return something to inform the caller.

float *state_value;
} td_agent_t;

int hash(char *table);
Owner

If you are going to expose the symbol hash, you should refine its naming scheme to avoid unintended name conflicts.

} td_agent_t;

int hash(char *table);
char *hash_to_table(int hash);
Owner

The function hash_to_table seems to be used internally, right? If so, don't expose it.

#include <time.h>
#include "agents/temporal_difference.h"
#include "game.h"

Owner

Remove a blank line.

@jserv jserv requested a review from visitorckw February 15, 2024 22:43
return ret;
}

char *hash_to_table(int hash)
Collaborator

Since we are not going to expose the function, we should add the 'static' modifier to restrict the scope that can access it.


// TODO: Find a more efficient hash, we could not store 5x5 or larger board,
// Since we could have 3^25 states and it might overflow.
int table_to_hash(char *table)
Collaborator

Ditto.

perror("Failed to allocate memory");
exit(1);
}
for (int i = 0; i < move_cnt; i++) {
Contributor

@paul90317 paul90317 Feb 18, 2024

Declare a macro like list_for_each in sysprog21/lab0-c/list.h to replace get_available_moves, so that get_action_exploit can be done in place.

int get_action_epsilon_greedy(char *table, td_agent_t *agent)
{
int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
Contributor

Ditto, make get_action_epsilon_greedy in-place.

Contributor Author

I didn't get how to replace get_available_moves with a list_for_each-style macro here. Is there another way to get a random move by iterating the board directly?

Contributor

@paul90317 paul90317 Feb 18, 2024

I'm sorry; I didn't think it through. We can't obtain a random empty space directly. We need to complete it in three steps:

  1. Get the number of empty spaces n with for_each_empty_grid.
  2. Get a random count i in [0, n).
  3. Get the board index of the i-th empty space with for_each_empty_grid.

This method has the same time complexity as the original one but is not very elegant.
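The three steps above can be sketched like this; the macro body is a guess at a for_each_empty_grid-style helper, not the project's actual definition:

```c
#include <assert.h>
#include <stdlib.h>

#define N_GRIDS 16 /* 4x4 board (assumption) */

/* Iterate over the indices of empty cells (' ') on the board. */
#define for_each_empty_grid(i, table)  \
    for (int i = 0; i < N_GRIDS; i++)  \
        if ((table)[i] == ' ')

int random_empty_grid(const char *table)
{
    int n = 0;
    for_each_empty_grid (i, table) /* step 1: count the empty cells */
        n++;
    if (!n)
        return -1;
    int target = rand() % n;       /* step 2: random index in [0, n) */
    for_each_empty_grid (i, table) /* step 3: walk to the target-th one */
        if (target-- == 0)
            return i;
    return -1;
}
```

The two passes avoid materializing an index array, at the cost of scanning the board twice.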

Contributor Author

I've moved get_available_moves into train.c and rewritten it with for_each_empty_grid. Thanks for your advice!

train.c Outdated
return available_moves[rand() % move_cnt];
}
int act = get_action_exploit(table, agent);
epsilon *= GAMMA;
Contributor

It seems that you are decaying epsilon by GAMMA; however, GAMMA is a parameter of the Markov decision model. Does it have another meaning here, or should you use a separate parameter like decay_rate instead of GAMMA?

train.c Outdated
}
}

void store_state_value()
Contributor

Mark internal functions like store_state_value as static.

int max_act = -1;
float max_q = -FLT_MAX;
float *state_value = agent->state_value;
char OX = agent->player;
Contributor

There should be a better name for the variable OX.

@visitorckw
Collaborator

When executing `make clean`, it should also delete dependency files such as train.d and td.d. Additionally, I'm unsure whether state_value.bin should be deleted by `make clean`.

@visitorckw
Collaborator

The changes look good to me now.
Thanks for all the great work.
Let's see if anyone else has any further comments.

@ndsl7109256
Contributor Author

@visitorckw Thanks for your detailed review as well. It helps a lot!

@ndsl7109256 ndsl7109256 requested a review from jserv February 19, 2024 03:06
Contributor

@paul90317 paul90317 left a comment

Additionally, could you also replace the usage of the available_moves function in negamax.c? Thank you :)

x *= 3; \
}

#define for_each_empty_grid(i, table) \
Contributor

Rename it to for_each_empty_cell and move it to game.h

Contributor Author

I'm not sure about the purpose of renaming it to for_each_empty_cell; I mainly used the term grid to align with N_GRIDS. Could you explain your idea?

Contributor

> Tic-tac-toe is played on a three-by-three grid by two players, who alternately place the marks X and O in one of the nine spaces in the grid. (Wikipedia)

> grid: a pattern or structure made from horizontal and vertical lines crossing each other to form squares. (Cambridge Dictionary)

In my opinion, we can call the game board a grid, but I'm not sure whether a single cell on the board can be called a grid. I also think N_GRIDS is not a good name; it should be N_CELLS.

static int get_action_epsilon_greedy(char *table, td_agent_t *agent)
{
int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
Contributor

> I'm sorry that I didn't think clearly. We can't obtain the random space straightly. [...] This method has the same time complexity as the original one but is not very elegant.

However, I still prefer my proposed method to get a random empty cell because it's in-place.

Contributor Author

I understand your thought process. However, your method requires a two-pass loop, and I'm not sure which way is better here.

Contributor

If someday we want to represent the board as a bit array instead of a char array to save memory, get_available_moves may become the bottleneck, because it always outputs an int array bigger than the board.

Contributor Author

I got it; that indeed makes sense. I think you could submit these changes in another PR, since they offer a different angle for improving this project, and your proposal is worth addressing in detail.

Contributor

Okay, thank you for bringing this to my attention. It's not appropriate to request these changes in this code review. I'll submit another PR to apply them. Thank you for communicating this.

Contributor

@paul90317 paul90317 left a comment

This PR is fine with me; I will apply the other changes in another PR.

@jserv jserv merged commit 77299c0 into jserv:main Feb 19, 2024
2 checks passed