Add TD learing agent #26
Conversation
I am enthusiastic about incorporating the reinforcement learning (RL) algorithm into this project. Yet, before proceeding with the integration, there are specific adjustments required.
- Maximize the reuse of existing code. This may necessitate separating the board display from the game logic to ensure a modular approach, facilitating the integration of the proposed RL/TD algorithm without replicating code.
- Opt for a 4x4 board size, or possibly larger, rather than the traditional 3x3 format. My aim is to slightly elevate the complexity.
I am concerned that the hash function may pose a risk of overflow on a larger board. Perhaps switching to a Zobrist hash would be a good choice?
train.c
Outdated
```c
}
int cur_state_hash = hash(table);
if (turn == playerO) {
    int move = get_action_O(table);
```
It seems that you use a random player as the AI's opponent during training, which does not make sense. You can let the AI player play against itself or against the original AI instead.
I thought I used another AI player that is also trained with TD learning (`get_action_X` and `get_action_O` are nearly identical). Is there a bug here?
However, I am curious about why using a random player as an opponent does not make sense. I consider the opponent's moves as 'environment state transitions,' and I believe that state transitions in the RL environment can exhibit randomness. Could you explain more about your idea?
I was mistaken; you used epsilon-greedy. But is epsilon-greedy needed in a minimax game?
AFAIK, most Go AI software also utilizes Zobrist hashing (Go's board size is 19x19), and its calculation is quite fast (requiring only one XOR operation after each move). However, Zobrist hashing, unlike your hash function, cannot guarantee the absence of collisions (although the probability is low enough to be generally negligible in practice). Therefore, if it is used for efficiency improvements in algorithms like minimax, the search may shift from guaranteeing discovery of the optimal solution to no longer providing such a guarantee.
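For reference, a minimal sketch of how Zobrist hashing could look for this board. The names, the two-piece encoding, and the use of `rand()` are illustrative, not taken from the PR:

```c
#include <stdint.h>
#include <stdlib.h>

#define ZB_N_CELLS 16 /* 4x4 board; illustrative */

/* One random bitstring per (cell, piece) pair; piece 0 = 'O', 1 = 'X'. */
static uint64_t zb_keys[ZB_N_CELLS][2];

static void zobrist_init(void)
{
    for (int i = 0; i < ZB_N_CELLS; i++)
        for (int p = 0; p < 2; p++)
            /* rand() is a placeholder; a proper 64-bit PRNG is preferable */
            zb_keys[i][p] = ((uint64_t) rand() << 32) ^ (uint64_t) rand();
}

/* Incremental update: placing (or removing) a piece is a single XOR. */
static uint64_t zobrist_toggle(uint64_t h, int cell, int piece)
{
    return h ^ zb_keys[cell][piece];
}
```

The hash stays 64 bits regardless of board size, which is why it sidesteps the 3^N overflow, at the cost of possible (rare) collisions.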
Additionally, I think we can split it into two files, one focusing on game mechanics and the other on AI decision-making.
Hide the implementation details during modularization.
Use `git rebase -i` to rework the git commits, so that we can review them in a straightforward way.
train.c
Outdated
```c
#include <time.h>
#include "agents/temporal_difference.h"
#include "game.h"
#include "util.h"
```
When executing `make train`, the following compilation error occurred:

```
train.c:12:10: fatal error: util.h: No such file or directory
   12 | #include "util.h"
      |          ^~~~~~~~
compilation terminated.
```
train.c
Outdated
```c
void store_state_value()
{
    FILE *fptr = NULL;
    if ((fptr = fopen("state_value.bin", "wb")) == NULL)
```
Consider using a macro to specify the file name instead of hard coding the file name directly in the code.
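A minimal sketch of the suggestion. The macro name `MODEL_NAME` follows the later discussion in this thread; the helper name is hypothetical:

```c
#include <stdio.h>
#include <stdlib.h>

#define MODEL_NAME "state_value.bin" /* single point of change for the file name */

/* Hypothetical helper: every open site refers to MODEL_NAME, so renaming
 * the model file touches exactly one line. */
static FILE *open_model_for_write(void)
{
    FILE *fptr = fopen(MODEL_NAME, "wb");
    if (!fptr) {
        perror("Failed to open " MODEL_NAME);
        exit(1);
    }
    return fptr;
}
```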
agents/temporal_difference.c
Outdated
```c
    (nxt_reward + GAMMA * (*agent).state_value[nxt_state_hash]);
}

// TODO: find a more efficient hash
```
Based on the previous discussion regarding the hash function, overflow issues will be encountered with a 5x5 board size and larger. Therefore, I would like to mention this in the comments as well.
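A quick check of the overflow claim, computing the state count in 64 bits (the helper name is illustrative): 3^16 = 43,046,721 fits comfortably in a 32-bit `int`, while 3^25 = 847,288,609,443 far exceeds `INT_MAX`.

```c
#include <stdint.h>

/* Number of tic-tac-toe states 3^cells, computed in 64 bits so that
 * board sizes up to 6x6 (3^36) do not overflow. */
static uint64_t n_states(int cells)
{
    uint64_t n = 1;
    for (int i = 0; i < cells; i++)
        n *= 3;
    return n;
}
```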
agents/temporal_difference.c
Outdated
```c
if (!((*agent).state_value))
    exit(1);
for (unsigned int i = 0; i < state_num; i++)
    (*agent).state_value[i] = get_score(hash_to_table(i), player);
```
Ditto.
agents/temporal_difference.c
Outdated
```c
int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
int max_act = -1;
float max_q = -DBL_MAX;
```
Since a single-precision floating-point number is used here, we should use `FLT_MAX` instead of `DBL_MAX`.
agents/temporal_difference.c
Outdated
```c
char OX = (*agent).player;
printf("[ ");
for (int i = 0; i < move_cnt; i++) {
    char *try_table = malloc(sizeof(char) * N_GRIDS);
```
We should check the return value of `malloc`.
train.c
Outdated
```c
    x = 1;                            \
    for (int i = 0; i < N_GRIDS; i++) \
        x *= 3;                       \
}
```
This macro is defined once in `temporal_difference.h` and once in `train.c`. Is it possible to consolidate them so that it is only defined once?
Please correct the typo 'learing' to 'learning' in the commit message.
agents/temporal_difference.c
Outdated
```c
    exit(1);
}
long offset = (agent->player == 'O') ? 0 : state_num * sizeof(float);
fseek(fptr, offset, SEEK_SET);
```
Since fseek can also fail, we should check the return value of fseek.
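A minimal sketch of the check (the helper name is hypothetical). `fseek` returns nonzero on failure, so the result can be propagated instead of silently reading from the wrong offset:

```c
#include <stdio.h>

/* Hypothetical helper: seek to this player's slice of the model file,
 * reporting failure instead of continuing at an undefined position. */
static int seek_to_offset(FILE *fptr, long offset)
{
    if (fseek(fptr, offset, SEEK_SET) != 0) {
        perror("Failed to seek in state value table file");
        return -1;
    }
    return 0;
}
```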
game.h
Outdated
```diff
@@ -1,6 +1,6 @@
 #pragma once

-#define BOARD_SIZE 4
+#define BOARD_SIZE 3
```
It seems like you changed the default setting again. I think we would like to keep the default setting of a 4x4 board size.
main.c
Outdated
```c
unsigned int state_num = 1;
CALC_STATE_NUM(state_num);
init_td_agent(&agent, state_num, 'O');
load_model(&agent, state_num, "state_value.bin");
```
Since we introduced the MODEL_NAME macro to allow for name changes, we shouldn't hard code the file names here. Otherwise, any name changes would lead to errors.
Rebase onto the latest `main` branch.
Makefile
Outdated
```diff
@@ -17,10 +21,17 @@ OBJS := \
 	zobrist.o \
 	agents/negamax.o \
 	main.o

```
Minimize the necessary changes. Remove this blank line here.
README.md
Outdated
```
$ ./train
```

Build the game:
You should explain the functionality of the TD program.
agents/temporal_difference.c
Outdated
```c
{
    FILE *fptr = fopen(model_path, "rb");
    if (!fptr) {
        printf("could not open state value table, train first\n");
```
The error messages should start with "Failed to". That is, "Failed to open the state value table file."
agents/temporal_difference.c
Outdated
```c
    exit(1);
}
if (fread(agent->state_value, state_num * sizeof(float), 1, fptr) != 1) {
    printf("could not load\n");
```
Ditto. The message should be "Failed to load the model of ..."
agents/temporal_difference.c
Outdated
```c
int move_cnt = 0;
int *available_moves = malloc(N_GRIDS * sizeof(int));
if (!available_moves)
    exit(1);
```
Instead of invoking `exit` without any message, you should return something to indicate the failure to the caller.
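One possible shape for this, returning `NULL` on failure instead of exiting (the helper name is hypothetical, and `N_GRIDS` is fixed here for self-containment):

```c
#include <stdio.h>
#include <stdlib.h>

#define N_GRIDS 16 /* 4x4 board; illustrative */

/* Hypothetical helper: return NULL on allocation failure so the caller
 * can decide how to recover, instead of terminating the whole program. */
static int *alloc_moves(void)
{
    int *moves = malloc(N_GRIDS * sizeof(int));
    if (!moves)
        fprintf(stderr, "Failed to allocate the moves array\n");
    return moves; /* NULL signals failure to the caller */
}
```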
agents/temporal_difference.h
Outdated
```c
    float *state_value;
} td_agent_t;

int hash(char *table);
```
If you are going to expose the symbol `hash`, you should refine its naming scheme to avoid unintended name conflicts.
agents/temporal_difference.h
Outdated
```c
} td_agent_t;

int hash(char *table);
char *hash_to_table(int hash);
```
The function `hash_to_table` seems to be used internally, right? If so, don't expose it.
```c
#include <time.h>
#include "agents/temporal_difference.h"
#include "game.h"

```
Remove a blank line.
agents/temporal_difference.c
Outdated
```c
    return ret;
}

char *hash_to_table(int hash)
```
Since we are not going to expose the function, we should add the 'static' modifier to protect the scope that can access the function.
agents/temporal_difference.c
Outdated
```c

// TODO: find a more efficient hash, we could not store 5x5 or larger board,
// Since we could have 3^25 states and it might overflow.
int table_to_hash(char *table)
```
Ditto.
agents/temporal_difference.c
Outdated
```c
    perror("Failed to allocate memory");
    exit(1);
}
for (int i = 0; i < move_cnt; i++) {
```
Declare a macro like `list_for_each` in sysprog21/lab0-c/list.h to replace `get_available_moves`, so that `get_action_exploit` can be in-place.
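One possible expansion of such a macro, assuming empty cells are stored as `' '` in the table (both the macro body and that convention are assumptions, not taken from the PR):

```c
#include <string.h>

#define BOARD_SIZE 4
#define N_GRIDS (BOARD_SIZE * BOARD_SIZE)

/* Iterator over empty cells, in the spirit of list_for_each.
 * Sketch: assumes an empty cell is marked ' '. */
#define for_each_empty_grid(i, table)     \
    for (int i = 0; i < N_GRIDS; i++)     \
        if ((table)[i] == ' ')

/* Usage example: count empty cells in-place, with no scratch array. */
static int count_empty(const char *table)
{
    int n = 0;
    for_each_empty_grid(i, table)
        n++;
    return n;
}
```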
```c
int get_action_epsilon_greedy(char *table, td_agent_t *agent)
{
    int move_cnt = 0;
    int *available_moves = get_available_moves(table, &move_cnt);
```
Ditto; make `get_action_epsilon_greedy` in-place.
I didn't get how to replace `get_available_moves` with a `list_for_each`-style macro here. Is there another way to get a random move by iterating the board directly?
I'm sorry that I didn't think clearly. We can't obtain a random space directly; we need to complete it in three steps:

1. Get the number of empty spaces n with `for_each_empty_grid`.
2. Get a random count i in [0, n).
3. Get the board index of the i-th empty space with `for_each_empty_grid`.
This method has the same time complexity as the original one but is not very elegant.
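The three steps above can be sketched as follows (assuming the `' '` empty-cell convention and this macro shape, which are illustrative):

```c
#include <stdlib.h>

#define BOARD_SIZE 4
#define N_GRIDS (BOARD_SIZE * BOARD_SIZE)

/* Sketch: assumes an empty cell is marked ' '. */
#define for_each_empty_grid(i, table)     \
    for (int i = 0; i < N_GRIDS; i++)     \
        if ((table)[i] == ' ')

/* Two-pass, in-place selection of a uniformly random empty cell.
 * Returns -1 when the board is full. */
static int random_empty_cell(const char *table)
{
    int n = 0;
    for_each_empty_grid(i, table)   /* pass 1: count empty cells */
        n++;
    if (n == 0)
        return -1;
    int target = rand() % n;
    for_each_empty_grid(i, table)   /* pass 2: find the target-th one */
        if (target-- == 0)
            return i;
    return -1; /* unreachable */
}
```

A single-pass alternative would be reservoir sampling over the empty cells, which avoids the second loop at the cost of one `rand()` call per empty cell.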
I've moved `get_available_moves` into train.c and rewritten it with `for_each_empty_grid`. Thanks for your advice!
train.c
Outdated
```c
        return available_moves[rand() % move_cnt];
    }
    int act = get_action_exploit(table, agent);
    epsilon *= GAMMA;
```
It seems that you are decaying `epsilon` by `GAMMA`; however, `GAMMA` is a parameter of the Markov decision process. Does it have another meaning here, or should you use a separate parameter like `decay_rate` instead of `GAMMA`?
train.c
Outdated
```c
    }
}

void store_state_value()
```
Mark internal functions like `store_state_value` as `static`.
agents/temporal_difference.c
Outdated
```c
int max_act = -1;
float max_q = -FLT_MAX;
float *state_value = agent->state_value;
char OX = agent->player;
```
There should be a better name for the variable `OX`.
The changes look good to me now.
@visitorckw Thanks for your detailed review as well. It helps a lot!
Additionally, could you also replace the usage of the `available_moves` function in negamax.c? Thank you :)
```c
    x *= 3; \
}

#define for_each_empty_grid(i, table) \
```
Rename it to `for_each_empty_cell` and move it to game.h.
I'm not sure of the purpose of renaming it to `for_each_empty_cell`; I mainly use the term grid to align with `N_GRIDS`. Could you explain your idea?
> Tic-tac-toe is played on a three-by-three grid by two players, who alternately place the marks X and O in one of the nine spaces in the grid.

(Wikipedia, "Tic-tac-toe")

> grid: a pattern or structure made from horizontal and vertical lines crossing each other to form squares

(Cambridge Dictionary)

In my opinion, we can call a game board a grid, but I'm not sure whether a cell on the board can be called a grid. And I also think `N_GRIDS` is not a good name; it should be `N_CELLS`.
```c
static int get_action_epsilon_greedy(char *table, td_agent_t *agent)
{
    int move_cnt = 0;
    int *available_moves = get_available_moves(table, &move_cnt);
```
> We can't obtain the random space directly. We need to complete it in three steps:
> 1. Get the number of empty spaces n with `for_each_empty_grid`.
> 2. Get a random count i in [0, n).
> 3. Get the board index of the i-th empty space with `for_each_empty_grid`.
>
> This method has the same time complexity as the original one but is not very elegant.

However, I still prefer my proposed method to get a random empty cell because it's in-place.
I understand your thought process. However, your method requires a two-pass loop, and I'm not sure which way is better here.
If someday we want to represent the board as a bit array instead of a char array to save memory, `get_available_moves` may become the bottleneck because it always outputs an int array bigger than the board itself.
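A rough sketch of such a bit-array board, assuming one mask per player (the type and helper names are illustrative, not part of this PR):

```c
#include <stdint.h>

#define N_GRIDS 16 /* 4x4 board; illustrative */

/* Two bitmasks replace the char array: one per player.
 * A cell is empty when its bit is clear in both masks. */
typedef struct {
    uint32_t o, x;
} bitboard_t;

static int bb_is_empty(const bitboard_t *bb, int cell)
{
    uint32_t m = (uint32_t) 1 << cell;
    return !((bb->o | bb->x) & m);
}

static void bb_place(bitboard_t *bb, int cell, char player)
{
    uint32_t m = (uint32_t) 1 << cell;
    if (player == 'O')
        bb->o |= m;
    else
        bb->x |= m;
}
```

With this layout, a 4x4 board fits in 8 bytes instead of 16, and iterating empty cells never needs an auxiliary int array.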
I got it. It indeed makes sense. I thought you could submit these changes with another PR since these changes provide a different aspect to improve this project, and it is worth addressing your proposal in detail.
Okay, thank you for bringing this to my attention. It's not appropriate to request these changes in this code review. I'll submit another PR to apply them. Thank you for communicating this.
This PR is fine for me, I will apply other changes in another PR.
TD learning is a classic Reinforcement Learning algorithm. We can pretrain the model and store it. When playing against the AI, the AI agent can simply look up the pretrained model to determine the best moves.

Ref:
- https://en.wikipedia.org/wiki/Temporal_difference_learning
- https://github.com/agrawal-rohit/tic-tac-toe-ai-bots
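Based on the snippet reviewed earlier (`nxt_reward + GAMMA * state_value[nxt_state_hash]`), the core tabular TD(0) value update can be sketched as follows; the `ALPHA`/`GAMMA` values and the function name are placeholders, not taken from the PR:

```c
#define ALPHA 0.1f  /* learning rate (assumed value) */
#define GAMMA 0.99f /* discount factor (assumed value) */

/* Tabular TD(0): move V(s) toward the bootstrapped target
 * r + GAMMA * V(s'), by a step of size ALPHA. */
static void td_update(float *state_value, int s, int s_next, float reward)
{
    float target = reward + GAMMA * state_value[s_next];
    state_value[s] += ALPHA * (target - state_value[s]);
}
```

Training repeatedly applies this update along self-play trajectories; play time then reduces to a table lookup over afterstates.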