
Add TD learing agent #26

Merged: 1 commit into jserv:main on Feb 19, 2024
Conversation

ndsl7109256
Contributor

TD learning is a classic Reinforcement Learning algorithm. We can pretrain the model and store it. When playing against the AI, the AI agent can simply look up the pretrained model to determine the best moves.

Ref:
https://en.wikipedia.org/wiki/Temporal_difference_learning
https://github.com/agrawal-rohit/tic-tac-toe-ai-bots
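For context, a minimal sketch of the tabular TD(0) value update this approach relies on — the names `state_value`, `ALPHA`, and `GAMMA` here are illustrative, not the PR's actual code:

```c
#include <assert.h>

#define ALPHA 0.1f /* learning rate (illustrative value) */
#define GAMMA 0.9f /* discount factor (illustrative value) */

/* Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
 * After training, state_value is the lookup table the agent consults. */
float td_update(float *state_value, int s, int s_next, float reward)
{
    state_value[s] +=
        ALPHA * (reward + GAMMA * state_value[s_next] - state_value[s]);
    return state_value[s];
}
```

At play time no update is needed; the agent just picks the move leading to the successor state with the highest stored value.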

Owner

@jserv jserv left a comment

I am enthusiastic about incorporating the reinforcement learning (RL) algorithm into this project. However, before proceeding with the integration, a few adjustments are required.

  1. Maximize the reuse of existing code. This may necessitate separating the board display from the game logic to ensure a modular approach, facilitating the integration of the proposed RL/TD algorithm without replicating code.
  2. Opt for a 4x4 board size, or possibly larger, rather than the traditional 3x3 format. My aim is to slightly elevate the complexity.

@visitorckw
Collaborator

I am concerned that the hash function may risk overflow with a larger board size. Perhaps switching to a Zobrist hash would be a good choice?

@ndsl7109256
Contributor Author

  1. OK. I will try to refine the code.
  2. It indeed supports a 4x4 board now, but not 5x5 or larger, due to the inefficient state hash function. The current hash function maintains the board state as a ternary number, e.g.:

o |  |
--------
  |  |
--------
  |  | x

would be hashed as $100000002_{3}$.
For a 3x3 board, this requires $3^{9}$ states, and many of them are unreachable (only 5478 states are actually reachable).
For a 5x5 board, it requires $3^{25} = 847288609443$ states, which might cause an overflow.
Maybe using Zobrist hashing would be a better way, but I am not sure how large a board it would support.

ref : https://informatika.stei.itb.ac.id/~rinaldi.munir/Matdis/2021-2022/Makalah2021/Makalah-Matdis-2021%20(148).pdf
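The ternary scheme described above can be sketched as follows — a simplified illustration assuming ' '/'o'/'x' cell values and index 0 as the most significant digit, not the PR's exact code:

```c
#include <assert.h>

#define N_GRIDS 9 /* 3x3 board for illustration */

/* Each cell is one base-3 digit: ' ' -> 0, 'o' -> 1, 'x' -> 2.
 * With a 32-bit int this overflows once 3^(number of cells) exceeds
 * INT_MAX, which is why 5x5 (3^25 states) is out of reach. */
int table_to_hash(const char *table)
{
    int h = 0;
    for (int i = 0; i < N_GRIDS; i++) {
        int d = (table[i] == 'o') ? 1 : (table[i] == 'x') ? 2 : 0;
        h = h * 3 + d;
    }
    return h;
}
```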

Contributor

@paul90317 paul90317 left a comment

It seems that you use a random player as the AI's opponent during training, which does not make sense. You could let the AI player play against itself or the original AI instead.

train.c Outdated
}
int cur_state_hash = hash(table);
if (turn == playerO) {
int move = get_action_O(table);
Contributor Author

I thought I used another AI player, which is also trained with TD learning (get_action_X and get_action_O are quite the same). Is there a bug here?

However, I am curious why using a random player as an opponent does not make sense. I consider the opponent's moves as 'environment state transitions', and I believe that state transitions in an RL environment can exhibit randomness. Could you explain your idea in more detail?

Contributor

@paul90317 paul90317 Feb 3, 2024

My mistake; you used epsilon-greedy. But is epsilon-greedy needed in a minimax game?
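For reference, a hypothetical sketch of epsilon-greedy action selection over a state-value table — the names and the epsilon handling are assumptions, not the PR's code:

```c
#include <assert.h>
#include <stdlib.h>

/* With probability eps pick a random available move (explore);
 * otherwise pick the move with the highest estimated value (exploit). */
int get_action_epsilon_greedy(const int *moves, int cnt, const float *q,
                              float eps)
{
    if ((float) rand() / RAND_MAX < eps)
        return moves[rand() % cnt]; /* explore */
    int best = moves[0];
    for (int i = 1; i < cnt; i++)   /* exploit: argmax over q */
        if (q[moves[i]] > q[best])
            best = moves[i];
    return best;
}
```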

@visitorckw
Collaborator

> 2. It indeed opts for 4x4 board now, but not for 5x5 and more and it's due to the inefficient state hash function now. [...] Maybe using Zobrist Hashing would be a better way but I am not sure how large board it would support.

AFAIK, most Go AI software also utilizes the Zobrist hash (Go's board size is 19x19), and its calculation is quite fast (requiring only one XOR operation after each move). However, the Zobrist hash, unlike your hash function, cannot guarantee the absence of collisions (although the probability is low enough to be generally negligible in practice). Therefore, if it is used for efficiency improvements in algorithms like minimax, the algorithm may shift from guaranteeing discovery of the optimal solution to no longer providing such a guarantee.
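A minimal sketch of Zobrist hashing for illustration (the board size and piece encoding are assumptions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N_CELLS 16 /* 4x4 board (assumption) */
#define N_PIECES 2 /* 'O' and 'X' */

static uint64_t zobrist_table[N_CELLS][N_PIECES];

/* Assign one random 64-bit key per (cell, piece) pair at startup. */
void zobrist_init(void)
{
    for (int i = 0; i < N_CELLS; i++)
        for (int j = 0; j < N_PIECES; j++)
            zobrist_table[i][j] = ((uint64_t) rand() << 32) ^ (uint64_t) rand();
}

/* Placing (or undoing) a piece updates the hash with a single XOR. */
uint64_t zobrist_move(uint64_t h, int cell, int piece)
{
    return h ^ zobrist_table[cell][piece];
}
```

Because the same XOR both applies and undoes a move, incremental updates are cheap, at the cost of a small collision probability.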

@visitorckw
Collaborator

Additionally, I think we can split it into two files, one focusing on game mechanics and the other on AI decision-making.

README.md (outdated, resolved)
td_agent.c (outdated, resolved)
util.h (outdated, resolved)
Owner

@jserv jserv left a comment

Hide the implementation details during modularization.


Owner

@jserv jserv left a comment

Use git rebase -i to rework git commits, so that we can review in a straightforward way.

agents/negamax.c (outdated, resolved)
agents/td_agent.c (outdated, resolved)
game.h (outdated, resolved)
main.c (outdated, resolved)
Makefile (outdated, resolved)
train.c (outdated, resolved)
train.c (outdated, resolved)
@paul90317
Contributor

paul90317 commented Feb 9, 2024

The functions available_moves and get_available_moves are almost the same. And I'm wondering why you use get_score as the reward instead of the result of the game.

@ndsl7109256
Contributor Author

> The functions available_moves and get_available_moves are almost the same. And I'm wondering why you use get_score as the reward instead of the result of the game.

  1. I have noticed the similarity between available_moves and get_available_moves. I thought the available_moves used in negamax.c could be replaced with get_available_moves in a later PR.
  2. I think using the board score from get_score as the reward, or using the game's result, are both reasonable reward designs. I used the board score because it offers more information than the game result alone. (Indeed, I tried using 1 for a win and -1 for a loss as the reward, but it did not perform well.) Moreover, since negamax also uses the board score, perhaps we could compare the two algorithms in the future by aligning them on the same scoring method.

@jserv jserv requested a review from visitorckw February 9, 2024 14:23
train.c Outdated
#include <time.h>
#include "agents/temporal_difference.h"
#include "game.h"
#include "util.h"
Collaborator

@visitorckw visitorckw Feb 9, 2024

When executing `make train`, the following compilation error occurred:

```
train.c:12:10: fatal error: util.h: No such file or directory
   12 | #include "util.h"
      |          ^~~~~~~~
compilation terminated.
```

train.c Outdated
void store_state_value()
{
FILE *fptr = NULL;
if ((fptr = fopen("state_value.bin", "wb")) == NULL)
Collaborator

Consider using a macro to specify the file name instead of hard coding the file name directly in the code.
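Something along these lines — the macro name MODEL_NAME is the reviewer's suggestion elsewhere in this review, and the helper function is illustrative:

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_NAME "state_value.bin" /* single point of change for the name */

/* Hypothetical helper: all opens go through the macro, so renaming
 * the model file is a one-line change. */
FILE *open_model(const char *mode)
{
    return fopen(MODEL_NAME, mode);
}
```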

(nxt_reward + GAMMA * (*agent).state_value[nxt_state_hash]);
}

// TODO: find a more efficient hash
Collaborator

Based on the previous discussion regarding the hash function, overflow issues will be encountered with a 5x5 board size and larger. Therefore, I would like to mention this in the comments as well.

agents/temporal_difference.c (outdated, resolved)
if (!((*agent).state_value))
exit(1);
for (unsigned int i = 0; i < state_num; i++)
(*agent).state_value[i] = get_score(hash_to_table(i), player);
Collaborator

Ditto.

int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
int max_act = -1;
float max_q = -DBL_MAX;
Collaborator

Since a single-precision floating-point number is used here, we should use FLT_MAX instead of DBL_MAX.

char OX = (*agent).player;
printf("[ ");
for (int i = 0; i < move_cnt; i++) {
char *try_table = malloc(sizeof(char) * N_GRIDS);
Collaborator

@visitorckw visitorckw Feb 9, 2024

We should check the return value of malloc.
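A sketch of the pattern being asked for (the function name and shape are illustrative, not the PR's code):

```c
#include <assert.h>
#include <stdlib.h>

/* Copy a board, propagating allocation failure to the caller
 * instead of assuming malloc always succeeds. */
char *dup_board(const char *table, int n_grids)
{
    char *copy = malloc(n_grids);
    if (!copy)
        return NULL; /* caller decides how to handle the failure */
    for (int i = 0; i < n_grids; i++)
        copy[i] = table[i];
    return copy;
}
```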

agents/temporal_difference.c (resolved)
train.c Outdated
x = 1; \
for (int i = 0; i < N_GRIDS; i++) \
x *= 3; \
}
Collaborator

@visitorckw visitorckw Feb 9, 2024

This macro is defined once in temporal_difference.h and once in train.c. Is it possible to consolidate them so that it is only defined once?
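One way to consolidate it, sketched under the assumption that both files can include a shared header; the guard shown is illustrative:

```c
#include <assert.h>

#define N_GRIDS 9 /* 3x3 board for illustration */

/* Defined once in a shared header; an include guard (or #pragma once)
 * keeps repeated inclusion harmless. The do-while wrapper makes the
 * macro behave like a single statement. */
#ifndef CALC_STATE_NUM
#define CALC_STATE_NUM(x)                 \
    do {                                  \
        x = 1;                            \
        for (int i = 0; i < N_GRIDS; i++) \
            x *= 3;                       \
    } while (0)
#endif
```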

@visitorckw
Collaborator

Please correct the typo 'learing' to 'learning' in the commit message.

exit(1);
}
long offset = (agent->player == 'O') ? 0 : state_num * sizeof(float);
fseek(fptr, offset, SEEK_SET);
Collaborator

Since fseek can also fail, we should check the return value of fseek.
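A short sketch of the check; the error message wording follows the "Failed to ..." convention requested elsewhere in this review, and the function name is illustrative:

```c
#include <assert.h>
#include <stdio.h>

/* fseek returns nonzero on failure, so treat it like any fallible call. */
int seek_model(FILE *fptr, long offset)
{
    if (fseek(fptr, offset, SEEK_SET) != 0) {
        perror("Failed to seek in state value file");
        return -1;
    }
    return 0;
}
```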

game.h Outdated
@@ -1,6 +1,6 @@
#pragma once

-#define BOARD_SIZE 4
+#define BOARD_SIZE 3
Collaborator

It seems like you changed the default setting again. I think we would like to keep the default setting to a 4x4 board size.

main.c Outdated
unsigned int state_num = 1;
CALC_STATE_NUM(state_num);
init_td_agent(&agent, state_num, 'O');
load_model(&agent, state_num, "state_value.bin");
Collaborator

Since we introduced the MODEL_NAME macro to allow for name changes, we shouldn't hard code the file names here. Otherwise, any name changes would lead to errors.

Owner

@jserv jserv left a comment

Rebase the latest main branch.

@jserv jserv requested a review from paul90317 February 15, 2024 17:47
Makefile (outdated, resolved)
agents/temporal_difference.c (outdated, resolved)
Makefile Outdated
@@ -17,10 +21,17 @@ OBJS := \
zobrist.o \
agents/negamax.o \
main.o

Owner

Minimize the necessary changes. Remove this blank line here.

README.md Outdated
$./train
```

Build the game:
Owner

You should explain the functionality of the TD program.

{
FILE *fptr = fopen(model_path, "rb");
if (!fptr) {
printf("could not open state value table, train first\n");
Owner

The error messages should start with "Failed to." That is, "Failed to open state value table file."

exit(1);
}
if (fread(agent->state_value, state_num * sizeof(float), 1, fptr) != 1) {
printf("could not load\n");
Owner

Ditto. The message should be "Failed to load the model of ..."

int move_cnt = 0;
int *available_moves = malloc(N_GRIDS * sizeof(int));
if (!available_moves)
exit(1);
Owner

Instead of invoking exit without any prompt, you should return something to inform the caller.

float *state_value;
} td_agent_t;

int hash(char *table);
Owner

If you are going to expose the symbol hash, you should refine its naming scheme to avoid unintended name conflicts.

} td_agent_t;

int hash(char *table);
char *hash_to_table(int hash);
Owner

The function hash_to_table seems to be used internally, right? If so, don't expose it.

#include <time.h>
#include "agents/temporal_difference.h"
#include "game.h"

Owner

Remove a blank line.

@jserv jserv requested a review from visitorckw February 15, 2024 22:43
return ret;
}

char *hash_to_table(int hash)
Collaborator

Since we are not going to expose the function, we should add the 'static' modifier to restrict the scope that can access it.


// TODO: Find a more efficient hash, we could not store 5x5 or larger board,
// Since we could have 3^25 states and it might overflow.
int table_to_hash(char *table)
Collaborator

Ditto.

perror("Failed to allocate memory");
exit(1);
}
for (int i = 0; i < move_cnt; i++) {
Contributor

@paul90317 paul90317 Feb 18, 2024

Declare a macro like list_for_each in sysprog21/lab0-c/list.h to replace get_available_moves, so that get_action_exploit can be done in place.

int get_action_epsilon_greedy(char *table, td_agent_t *agent)
{
int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
Contributor

Ditto, make get_action_epsilon_greedy in-place.

Contributor Author

I didn't get how to replace get_available_moves with a list_for_each-style macro here. Is there another way to get a random move by iterating the board directly?

Contributor

@paul90317 paul90317 Feb 18, 2024

I'm sorry; I didn't think it through. We can't obtain a random empty space directly. We need to complete it in three steps:

  1. Get the number of empty spaces n with for_each_empty_grid.
  2. Get a random count i in [0, n).
  3. Get the board index of the i-th empty space with for_each_empty_grid.

This method has the same time complexity as the original one but is not very elegant.
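The three steps above can be sketched like this; the macro body is a guess at a for_each_empty_grid-style helper, not the project's actual definition:

```c
#include <assert.h>
#include <stdlib.h>

#define N_GRIDS 16 /* 4x4 board (assumption) */

/* Iterate over the indices of empty cells (' ') on the board. */
#define for_each_empty_grid(i, table)  \
    for (int i = 0; i < N_GRIDS; i++)  \
        if ((table)[i] == ' ')

int random_empty_grid(const char *table)
{
    int n = 0;
    for_each_empty_grid (i, table) /* step 1: count the empty cells */
        n++;
    if (!n)
        return -1;
    int target = rand() % n;       /* step 2: random index in [0, n) */
    for_each_empty_grid (i, table) /* step 3: walk to the target-th one */
        if (target-- == 0)
            return i;
    return -1;
}
```

The two passes avoid materializing an index array, at the cost of scanning the board twice.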

Contributor Author

I've moved get_available_moves into train.c and rewritten it with for_each_empty_grid. Thanks for your advice!

train.c Outdated
return available_moves[rand() % move_cnt];
}
int act = get_action_exploit(table, agent);
epsilon *= GAMMA;
Contributor

It seems that you are decaying epsilon by GAMMA; however, GAMMA is a parameter of the Markov decision model. Does it have another meaning here, or should you use a separate parameter like decay_rate instead of GAMMA?

train.c Outdated
}
}

void store_state_value()
Contributor

Mark internal functions like store_state_value as static.

int max_act = -1;
float max_q = -FLT_MAX;
float *state_value = agent->state_value;
char OX = agent->player;
Contributor

There should be a better name for the variable OX.

@visitorckw
Collaborator

When executing `make clean`, it should also delete dependency files such as train.d and td.d. Additionally, I'm unsure whether state_value.bin should be deleted by `make clean`.

@visitorckw
Collaborator

The changes look good to me now.
Thanks for all the great work.
Let's see if anyone else has any further comments.

@ndsl7109256
Contributor Author

@visitorckw Thanks for your detailed review as well. It helps a lot!

@ndsl7109256 ndsl7109256 requested a review from jserv February 19, 2024 03:06
Contributor

@paul90317 paul90317 left a comment

Additionally, could you also replace the usage of the available_moves function in negamax.c? Thank you :)

x *= 3; \
}

#define for_each_empty_grid(i, table) \
Contributor

Rename it to for_each_empty_cell and move it to game.h

Contributor Author

I'm not sure about the purpose of renaming it to for_each_empty_cell; I mainly used the term grid to align with N_GRIDS. Could you explain your idea?

Contributor

> Tic-tac-toe is played on a three-by-three grid by two players, who alternately place the marks X and O in one of the nine spaces in the grid. (Wikipedia)

> grid: a pattern or structure made from horizontal and vertical lines crossing each other to form squares. (Cambridge Dictionary)

In my opinion, we can call the game board a grid, but I'm not sure whether a single cell on the board can be called a grid. I also think N_GRIDS is not a good name; it should be N_CELLS.

static int get_action_epsilon_greedy(char *table, td_agent_t *agent)
{
int move_cnt = 0;
int *available_moves = get_available_moves(table, &move_cnt);
Contributor

> I'm sorry that I didn't think clearly. We can't obtain the random space straightly. [...] This method has the same time complexity as the original one but is not very elegant.

However, I still prefer my proposed method to get a random empty cell because it's in-place.

Contributor Author

I understand your thought process. However, your method requires a two-pass loop, and I'm not sure which way is better here.

Contributor

If someday we want to represent the board as a bit array instead of a char array to save memory, get_available_moves may become the bottleneck, because it always outputs an int array bigger than the board.

Contributor Author

I got it; that indeed makes sense. I think you could submit these changes in another PR, since they offer a different angle for improving this project, and your proposal is worth addressing in detail.

Contributor

Okay, thank you for bringing this to my attention. It's not appropriate to request these changes in this code review. I'll submit another PR to apply them. Thank you for communicating this.

Contributor

@paul90317 paul90317 left a comment

This PR is fine with me; I will apply the other changes in another PR.

@jserv jserv merged commit 77299c0 into jserv:main Feb 19, 2024
2 checks passed