The SANA project's goal is to create an Islamic-specific database for research purposes. My contribution to this goal is to create a model that predicts a record's category from its abstract and title.
To accomplish this, I have so far split the work into two variants, one that keeps punctuation and one that removes it, to see how much difference in overall noise each creates.
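The two preprocessing variants (punctuation kept versus punctuation removed) can be sketched roughly as below. This is only an illustration, not the project's actual code: the function name `build_text`, the choice to join title and abstract into one string, and the sample record are all assumptions. Note that `string.punctuation` covers ASCII marks only, so Arabic punctuation such as `،` would need to be added separately.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def build_text(title: str, abstract: str, remove_punctuation: bool) -> str:
    """Join title and abstract (an assumed design), optionally without punctuation."""
    text = f"{title} {abstract}"
    if remove_punctuation:
        text = text.translate(_STRIP_PUNCT)
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join(text.split())


# Made-up sample record, for illustration only.
title = "Islamic Finance: An Overview"
abstract = "We review sukuk, takaful, and zakat."

with_punct = build_text(title, abstract, remove_punctuation=False)
without_punct = build_text(title, abstract, remove_punctuation=True)
```

Running both variants through the same downstream model makes it possible to compare them on equal terms.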
Next Steps:
- Create a dictionary or import a list of Arabic names for grouping spelling variants. Ex: Mohammed and Mohamad --> Mohammad
- Try removing only some punctuation marks instead of all of them
- Write machine learning classifiers as a pipeline
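The first two next steps might look something like the sketch below. Everything named here is hypothetical: the `NAME_VARIANTS` map holds only the example from the list and would be replaced by a curated or imported name list, and `DROP_CHARS` is one arbitrary choice of marks to remove. Hyphens and apostrophes are kept in this sketch on the assumption that they carry meaning in transliterated names.

```python
import re

# Hypothetical variant-to-canonical map; keys are lowercase spelling variants.
NAME_VARIANTS = {
    "mohammed": "Mohammad",
    "mohamad": "Mohammad",
}

# Hypothetical set of marks to drop; hyphens and apostrophes are kept on purpose.
DROP_CHARS = '.,;:!?()[]"'

_WORD = re.compile(r"[A-Za-z]+")
_DROP_TABLE = str.maketrans("", "", DROP_CHARS)


def normalize_names(text: str) -> str:
    """Replace each known spelling variant with its canonical form."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return NAME_VARIANTS.get(word.lower(), word)
    return _WORD.sub(swap, text)


def drop_selected_punctuation(text: str) -> str:
    """Remove only the characters listed in DROP_CHARS."""
    return text.translate(_DROP_TABLE)


# Made-up sample text, for illustration only.
sample = "Mohammed and Mohamad (eds.): al-Ghazali's ethics, revisited."
cleaned = drop_selected_punctuation(normalize_names(sample))
```

Keeping each step as a small function that takes and returns a string should make it easier to slot them into a classifier pipeline later.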