INF502: Knowledge Representation and Reasoning
(Assignment 2)
Faculty of Engineering & IT, The British University in Dubai
Topic: knowledge representation in AI and Natural Language Processing
Assignment marked out of: 100%
• Report softcopy via Blackboard in a form suitable for processing by www.turnitin.com
• CD including software and how to install it, report, test dataset and/or training dataset, presentation
and important papers cited
• Hardcopy to Module Coordinator
1. Overview
In this project assignment, you will demonstrate your understanding of knowledge representation in a
Web-based practical AI programming project. Choose only one of the following assignments. It may be a
group assignment with a maximum of two students; the division of work must be clarified in the report.
You may use a Java servlet (preferably using NetBeans 7 with Java EE and Tomcat, which can be downloaded
for free from http://netbeans.org/downloads/) to make your assignment web-based.
Assignment 1: Statistical parsing of Arabic for web user interface. [1]
Assignment 2: Preprocessing of Arabic text: tokenization & POS tagging. [2]
Assignment 3: Developing an Arabic Named Entity Recognition System. [3, 10, 11]
Assignment 4: Processing Arabic Questions using open Source tools [5][6]
Assignment 5: Implementing Recommender systems using data mining and knowledge discovery tools. [7]
Assignment 6: Automatic document summarization. [8]
Proposed Assignment: You could also propose your own project that is related to the Knowledge
Representation and Reasoning module, but it needs approval from me. Your proposal should follow the
same format used in this assignment brief.
You will present your project and demonstrate your work in a lab session. You will also hand in a report,
which includes the following:
1. An introduction with a general description of the problem domain, and the aspects you focus on.
2. A description of your solution, including a description of the algorithm you defined, any clever ideas
you came up with or borrowed, and so on.
3. A discussion of the performance of the system, the problems encountered, error analysis, etc.
4. Conclusion, including suggestions for future enhancements.
The report should be in PDF format and between 12 and 22 pages in size, excluding references. You are
required to use ACL (2012) style (available for LaTeX and Word) in producing the PDF document. These
templates are available at: http://acl2012.org/call/sub01.asp
You may use https://www.sharelatex.com/ for LaTeX documents.
2. Requirements
Students are expected to implement a Java web application (preferably using NetBeans 7 with Java EE and
Tomcat, which can be downloaded for free from http://netbeans.org/downloads/) that, if possible, handles
complex knowledge representation in an AI application.
Please see the Marking Scheme section, which will give you an idea of the marking criteria and their
weights.
Note that the code will need to be extended and revised by other developers, so make sure to include full
and clear comments and documentation.
Assignment Milestones
Milestone 1: Preparation
1. You have acquired all the required training and testing data.
2. You have installed the necessary software (NetBeans, Weka, Bikel Parser, among others).
3. You have run the application on a small sample of data, or created a small “Hello world” application.
Milestone 2: Development
1. You have developed the application with all functions and features.
2. All various components, functions, features and classes are integrated together in one single
application.
3. The program accepts all instances of the training data as input and gives the expected output.
Milestone 3: Testing and Evaluation Deadline:
1. Gold standard is created or acquired.
2. Continuous cycle of testing-development-testing until satisfactory results are obtained. Error analysis
of the results achieved will guide you to the points for improvement. You can refer to your trials and the
methodology you followed.
3. Testing results in terms of standard evaluation metrics are reported with error analysis. Try to
compare with state of the art research.
2.1 Assignment 1: Statistical parsing of Arabic for web user interface
This task consists of training a statistical parser for Arabic and porting it to a web interface that
accepts user input and returns parse results.
1. Training a statistical parser for Arabic: Use the Bikel parser, which is already tuned for Arabic.
The parser can be downloaded from http://www.cis.upenn.edu/~dbikel/software.html and the (Arabic
Treebank) training data from the Software and Resources folder. You may use, or compare with, another
parser (e.g. Stanford Arabic Parser, http://nlp.stanford.edu/software/lex-parser.shtml).
2. Port the parser to the web using a Java servlet: From a web server (using NetBeans and Tomcat)
you should be able to send input to the parser and get output from it to be displayed back in the
server.
3. Take user input and give parse output: User input is free Arabic-script text that is not tokenized,
transliterated or formatted in any way. See how you can format the raw text to get a successful parse
from the parser. Provide a means of presenting the output sentence graphically.
4. Test and Evaluate
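As a possible first step towards the graphical presentation asked for in item 3, the parser output can be
read into a nested structure that a web page can then draw. The following is a minimal sketch, assuming
the parser returns a Penn-Treebank-style bracketed string. It is written in Python for brevity only (the
servlet itself would be Java); the names parse_bracketed and render and the toy tree are illustrative
assumptions, not part of any parser's API.

```python
import re

def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves are plain strings; internal nodes are (label, [children]).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()

def render(node, depth=0):
    """Return an indented, one-node-per-line text rendering of the tree."""
    if isinstance(node, str):
        return "  " * depth + node
    label, children = node
    lines = ["  " * depth + label]
    lines.extend(render(child, depth + 1) for child in children)
    return "\n".join(lines)

# Toy tree in Buckwalter transliteration (made up for illustration).
tree = parse_bracketed("(S (VP (VBP yHtsb) (NP (DT Al) (NN Hkm))))")
print(render(tree))
```

The indented text view is only a stand-in; the same nested tuples could be serialized and handed to any
tree-drawing component in the web interface.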
2.2 Assignment 2: Preprocessing of Arabic text: tokenization & POS tagging
This task consists of using SVM machine learning to pre-process raw Arabic text and produce a
tokenized and part-of-speech (POS) tagged analysis. It is recommended that you use WEKA or RapidMiner
for this task. Refer to the paper Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks
by Mona Diab, Kadri Hacioglu and Daniel Jurafsky, published in HLT-NAACL 04 [2], or to recent work by the
first author. You can download the tool described in this paper from the Software and Tools folder.
1. Design classifiers for tokenization of Arabic text: Arabic words consist of clitics that need to be
separated in the tokenization task.
2. Design classifiers for POS-tagging of Arabic text: Each word should be assigned the right POS
category.
3. Train on the Arabic Treebank: The model will be trained on the Arabic Treebank from the Software
and Resources folder.
4. Test and evaluate.
A demonstration of a similar system can be seen here: http://nlp.ldeo.columbia.edu/amira/
Tokenization Sample Input sentence
ولم يحتسب الحكم المجري ساندور بول ركلة جزاء صحيحة اثر عرقلة ھيسكى داخل المنطقة من قبل اليساندرو نستا
wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA’ SHyHp Avr Erqlp hyskY dAxl AlmnTqp mn qbl
AlysAndrw nstA.
Tokenization Sample Output sentence
w lm yHtsb Al Hkm Al mjry sAndwr bwl rklp jzA’ SHyHp Avr Erqlp hyskY dAxl Al mnTqp mn qbl
AlysAndrw nstA .
POS-Tagging Sample Input sentence
w lm yHtsb Al Hkm Al mjry sAndwr bwl rklp jzA’ SHyHp Avr Erqlp hyskY dAxl Al mnTqp mn qbl
AlysAndrw nstA .
POS-Tagging Sample Output sentence
w/CC lm/RP yHtsb/VBP AlHkm/NN Almjry/JJ sAndwr/NO_FUNC bwl/NNP rklp/NN jzA’/NN SHyHp/JJ
Avr/IN Erqlp/NN hyskY/NO_FUNC dAxl/IN AlmnTqp/NN mn/IN qbl/NN AlysAndrw/NNP nstA/NN
./PUNC
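For the test-and-evaluate step, output in the token/TAG format shown above can be scored against a gold
standard. The following is a minimal sketch in Python, assuming the predicted and gold sequences are
already aligned token by token. The function names are illustrative, the gold line is an excerpt of the
sample output, and the predicted line contains one deliberately wrong tag invented for the example.

```python
def read_tagged(line):
    """Split 'token/TAG token/TAG ...' into a list of (token, tag) pairs."""
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")  # split on the last slash only
        pairs.append((token, tag))
    return pairs

def tagging_accuracy(predicted, gold):
    """Fraction of positions whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("token sequences differ in length")
    if not gold:
        return 0.0
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)

gold = read_tagged("w/CC lm/RP yHtsb/VBP AlHkm/NN Almjry/JJ ./PUNC")
pred = read_tagged("w/CC lm/RP yHtsb/VBP AlHkm/NN Almjry/NN ./PUNC")
print(round(tagging_accuracy(pred, gold), 3))
```

Token-level accuracy is only one possible measure; if the tokenizer and the gold standard segment words
differently, the sequences must be aligned before this comparison is meaningful.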
2.3 Assignment 3: Developing an Arabic Named Entity Recognition System.
In this assignment, you have to build a rule-based Named Entity Recognition system (RBNER) for Arabic
that is capable of identifying one or more of the ENAMEX categories (i.e. Person, Location and Organization
NEs), using the GATE tool [9]. An RBNER system basically consists of a set of linguistic rules (i.e. grammars)
and a set of gazetteers (i.e. dictionaries/keyword lists). A linguistic rule may utilize NE gazetteer(s) in its
structure to support and implement the rule efficiently. You will then need to evaluate the performance of
the rule-based NER system when applied to a standard dataset/corpus (i.e. the ANERcorp dataset, see
footnote 1). It is recommended that you have a look at the following papers: [3, 10, 11].
The system environment: The GATE platform [9], which allows you to implement linguistic rules,
create/add gazetteers and evaluate the produced system.
The NE gazetteers: You need to consider NE gazetteers in the structure of the new linguistic rules.
An example of gazetteers to be considered is ANERGazet (see footnote 2).
The linguistic rules: The rules need to be implemented in the JAPE language. Read the GATE user
manual to learn about JAPE. Reading [3, 10, 11] may also help.
System evaluation: The performance of the system, when applied to the ANERcorp dataset, can be
evaluated using GATE's built-in evaluation tool, AnnotationDiff. The results should be reported in
terms of precision, recall and f-measure.
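To make the three reported measures concrete, the following is a minimal sketch in Python of precision,
recall and f-measure over named-entity annotations. It assumes each annotation is a (start, end, type)
triple and counts exact matches only; it is not GATE's implementation, and the offsets and entities below
are made up for illustration.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, type)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations: two exact matches, one wrong type, one spurious entity.
gold = {(0, 2, "Person"), (5, 6, "Location"), (9, 11, "Organization")}
pred = {(0, 2, "Person"), (5, 6, "Organization"),
        (9, 11, "Organization"), (14, 15, "Location")}

p, r, f = prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here two of four predictions are correct (precision 2/4) and two of three gold entities are found
(recall 2/3), so the balanced f-measure is 4/7.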
Footnotes 1, 2: ANERcorp and ANERGazet are available to download from
http://www1.ccls.columbia.edu/~ybenajiba/downloads.html
2.4 Assignment 4: Processing Arabic Questions using open Source tools
In this assignment you will use any open source tool such as QANUS (can be downloaded from
http://www.qanus.com/download/) or OpenEphyra (can be downloaded from
http://sourceforge.net/projects/openephyra/) or any other tool to develop a Question Answering system. Refer
to the papers in references [5] and [6] to learn more about Question Answering systems and related tasks and
tools. Then, you can use a standard set of questions (both English and Arabic) from
TREC (http://www.emi.ac.ma/bouzoubaa/download/) or CLEF
(http://www1.ccls.columbia.edu/~ybenajiba/downloads.html).
1. Processing of questions: Process the English questions using the open source tool and predict the
classes of the questions. Processing of questions involves word segmentation and POS tagging.
2. Modification of the source code: Modify the source code of the tool to process Arabic questions.
3. Test and evaluate: Compare the performance of the tool on the English and Arabic question
sets.
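To illustrate what predicting the class of a question and comparing performance can look like, the
following is a minimal keyword-rule baseline in Python. It is not how QANUS or OpenEphyra classify
questions; the class names, the rules and the sample questions are all assumptions made for this sketch.

```python
import re

# Hypothetical coarse answer-type classes keyed on the English question word.
RULES = [
    (r"^(who|whom|whose)\b", "PERSON"),
    (r"^where\b", "LOCATION"),
    (r"^when\b", "DATE"),
    (r"^how (many|much)\b", "NUMBER"),
]

def classify_question(question):
    """Return a coarse expected-answer class for an English question."""
    text = question.strip().lower()
    for pattern, label in RULES:
        if re.search(pattern, text):
            return label
    return "OTHER"

def accuracy(predicted, gold):
    """Share of questions whose predicted class equals the gold class."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

questions = ["Who wrote Hamlet?", "Where is Dubai?",
             "How many teams played?", "What is GATE?"]
gold = ["PERSON", "LOCATION", "NUMBER", "DEFINITION"]
predicted = [classify_question(q) for q in questions]
print(predicted, accuracy(predicted, gold))
```

Running the same accuracy function over an English and an Arabic question set (with Arabic rules or a
trained classifier in place of RULES) gives one simple basis for the comparison requested in step 3.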
2.5 Assignment 5: Implementing Recommender systems using data mining and knowledge discovery
tools
Recommender Systems are software tools and techniques that provide suggestions for items likely to be of
use to a user, such as what items to buy, what music to listen to, or what online news to read. A recommender
system normally focuses on generating recommendations for a specific type of item based on some
recommendation technique. You can find out more about recommender systems in [7]. In this assignment, you
are required to perform the following tasks for a recommender system:
1. Recommender Algorithms: Compile and compare at least four recommender algorithms.
2. Data set and Tools: Identify the data set for the recommender system. You can use your Facebook/
LinkedIn/Instagram friend list, a list of books on Amazon, YouTube video lectures, an online music
store or any other data of your choice. Select any data mining tool useful for recommender systems,
such as recommenderlab, RapidMiner, KNIME or Weka.
3. Implementation of Recommender Algorithm: Implement the best algorithm identified in step 1
using a tool of your choice from step 2. A sample implementation can be found in
[https://bib.irb.hr/datoteka/596976.rcomm2012_recommenders.pdf].
4. Results: Present the results and interesting patterns.
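As one example of the kind of algorithm that could be compared in step 1, the following is a minimal
sketch in Python of user-based collaborative filtering with cosine similarity. It is one variant among
many (similarity is computed over co-rated items only), the ratings are made up, and the function names
are illustrative rather than taken from any of the tools listed above.

```python
from math import sqrt

# Made-up ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 3, "C": 5, "D": 4},
    "u3": {"A": 1, "B": 5, "D": 2},
}

def cosine(a, b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted mean rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [(item, score) for score, item in ranked[:top_n]]

print(recommend("u1", ratings))
```

For user u1 the only unrated item is D, and its predicted rating falls between the ratings given by u2
and u3, weighted towards u2 because u2's tastes are closer to u1's.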
2.6 Assignment 6: Automatic document summarization
Document summarization is the technique of identifying and extracting important information from text
documents. The output of document summarization is usually significantly smaller than the original
document and is not longer than half of the original document under any circumstances. In this assignment
you are required to do the following tasks:
1. Summarization Algorithms: Discuss at least three document summarization techniques.
2. Implementation of summarization algorithm: Implement one of the document summarization
techniques using Perl, Java or Python. Optionally, you can use an automatic summarization tool such as
Mead [http://www.summarization.com/mead/].
3. Results: Rate the summaries produced by the program/tool. Present the summarization
results.
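As an illustration of one simple technique that could be discussed in step 1, the following is a minimal
sketch in Python of frequency-based extractive summarization. It assumes sentences end in ".", "!" or "?",
does no stop-word removal, and approximates the half-length limit by keeping at most half of the sentences
(at least one); the function name and the sample text are made up.

```python
import re
from collections import Counter

def summarize(text, max_ratio=0.5):
    """Return the highest-scoring sentences in their original order."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    if not sentences:
        return []

    def words(sentence):
        return re.findall(r"\w+", sentence.lower())

    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        toks = words(sentence)
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    # Sentence-count budget; a single-sentence text is returned unchanged.
    budget = max(1, int(len(sentences) * max_ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:budget])]

text = "Cats chase mice. Dogs chase cats. Birds sing. Cats and dogs are pets."
print(summarize(text))
```

Each sentence is scored by the average document frequency of its words, so sentences built from the most
repeated words are kept; a real system would at least filter stop words and measure length in words
rather than sentences.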
3. Guidelines for Report
Below are guidelines on how to write up your report for the final project. Of course, for a short class project,
not all of the sections may be relevant. However, you may use them as a general guide in structuring your final
report.
A “standard” experimental AI paper consists of the following sections:
1. Introduction
Motivate and abstractly describe the problem you are addressing and how you are addressing it. What is the
problem? Why is it important? What is your basic approach? A short discussion of how it fits into related
work in the area is also desirable. Summarize the basic results and conclusions that you will present.
2. Problem Definition and Algorithm
2.1 Task Definition
Precisely define the problem you are addressing (i.e. formally specify the inputs and outputs). Elaborate on
why this is an interesting and important problem. Include a simple, specific example that provides the input
and output, shows how the output is related to the input, specifies the desired/achieved properties of the
output, and illustrates the basic terms used.
2.2 Algorithm Definition
Describe in reasonable detail the algorithm (rules) you are using to address this problem. A pseudo-code
description of the algorithm you are using is frequently useful. Trace through a concrete example, showing
how your algorithm processes this example. The example should be complex enough to illustrate all of the
important aspects of the problem but simple enough to be easily understood. If possible, an intuitively
meaningful example is better than one with meaningless symbols.
3. Experimental Evaluation
3.1 Methodology
What are the criteria you are using to evaluate your method? What specific hypotheses does your experiment
test? Describe the experimental methodology that you used. What are the dependent and independent
variables? What is the training/test data that was used, and why is it realistic or interesting? Exactly what
performance data did you collect and how are you presenting and analyzing it? Comparisons to competing
methods that address the same problem are particularly useful.
3.2 Results
Present the quantitative results of your experiments. Graphical data presentation, such as graphs and
histograms, is frequently better than tables. What are the basic differences revealed in the data? Are they
statistically significant?
3.3 Discussion
Is your hypothesis supported? What conclusions do the results support about the strengths and weaknesses
of your method compared to other methods? How can the results be explained in terms of the underlying
properties of the algorithm and/or the data?
4. Related Work
Answer the following questions for each piece of related work that addresses the same or a similar problem.
What are their problem and method? How are your problem and method different? Why are your problem and
method better?
5. Future Work
What are the major shortcomings of your current method? For each shortcoming, propose additions or
enhancements that would help overcome it.
6. Conclusion
Briefly summarize the important results and conclusions presented in the paper. What are the most
important points illustrated by your work? How will your results improve future research and applications
in the area?
Bibliography & Citations
Be sure to include a standard, well-formatted, comprehensive bibliography with citations from the text
referring to previously published papers in the scientific literature that you utilized or are related to your
work. Always use a consistent citation style for your references. The standard style used around the
university is the Harvard Style. However, I will accept any other standard style (e.g. APA style) as long as it
is used consistently.
Try to make your report EASY to read.
• Be sure to include an overview in the beginning, which outlines what the report will be describing, in a
section-by-section fashion.
• Include simple examples (or better, a single simple example throughout), to help illustrate the ideas.
• A picture is worth (at least) a thousand words. Use figures, flow-charts, graphs, whenever appropriate.
• The material should be structured, and flow. It should NOT be a core-dump of everything you happened
to read when you were looking at things related to X. Readers (read “the people who will assign your
grade!”) get annoyed by having to wade through irrelevant material.
• If you are giving a high-level description of an algorithm, be sure to explicitly state its input and output.
• Many algorithms have a flow of information, from one subroutine to another. Provide one or more
figures, to make the ideas clear.
• Also, proof-read your report. As a grader, I find it very irritating to read a report that has pages of
easy-to-fix typos, illegible figures, missing citations, etc. And you really don’t want to irritate the person who
is assigning your grade…
• If you are describing a precise algorithm, you should give the actual formulas, using terms that are
well-defined, in the report.
• Your report should be self-contained. You are allowed to copy figures from other sources (if they are
properly credited). But if you do, be sure to define the terms that appear in that figure!
• Save trees – hand in a 2-sided version. And use section numbers, and page numbers!
The submission must be accompanied by a CD containing your code; also include a tutorial and a user manual
which will help the user to run the agent-based system. An agreed dataset should be provided. Use your
creativity to make the submission better.
4. Demonstration
The demonstration times for individual teams will be posted later in the semester. It is planned that the
demonstrations will take place around the submission deadline. Please make your appointment.
5. Academic Integrity
Copying or paraphrasing someone’s work (code included), or permitting your own work to be copied or
paraphrased, even if only in part, is not allowed and will result in disciplinary action according to
university policy. Any resources or ideas borrowed from other sources should be explicitly referenced in the
text and in the bibliography.
6. Marking Scheme:
The grading will be broken down based on the following criteria:
Deliverable / Criterion / Max / Actual

Software (based on system usability)
• Demonstration of system; supporting data are provided. Max: 3%   Actual:
• Coding style, readability, comments etc. Max: 2%   Actual:
Total for Software: 5%

Report (based on quality of report)

Introduction and Literature Review
• Articulation of research issue/problem
• Coherence of the research aim(s) and objectives
• Relevance and importance of the research issue
• Criteria for the proposed solution
Max: 10%   Actual:
• Explanation of constraints
• Organisation and logical sequence of the contents of the dissertation
• Comprehensive and correct citation of references and/or bibliography
• Appropriate written style and use of language
• General quality of presentation
• Supporting documents are provided
Max: 5%   Actual:
• Comprehensive, rigorous and critical review of the literature
• Appropriateness of theoretical concepts employed
Max: 10%   Actual:
Subtotal: 25%

Research Methodology, and data Analysis, interpretation and discussion
• Relevant and effective research methodology
• Rigour of application of the methods of investigation
• Identification of a solution and exploration of alternatives
• Development of an application
Max: 20%   Actual:
• Quality and depth of research
• Appropriateness of methods of analysis
• Rigour of application of the methods of analysis
• Reproducibility of results
Max: 20%   Actual:
Subtotal: 40%

Creativity, and Conclusion(s)
• Producing results close to or exceeding those in published research. If there is no relevant
published research, then this score will be used for accuracy and coverage sufficient to support
interpretation and conclusion.
• Testing in multiple conditions
• Design and implementation that demonstrates software engineering skills and completeness
Max: 20%   Actual:
• Consistency of conclusion(s) with research objectives
• Consistency of conclusion(s) with findings and discussion
• Comprehension of the implications of the conclusion(s)
• Appropriateness of recommendations on the basis of the conclusion(s)
• Value of the research and its contribution to knowledge and/or practice
Max: 10%   Actual:
Subtotal: 30%

Total for the report: 95%
Total: 100%
References
[1] Daniel M. Bikel. 2004. A Distributional Analysis of a Lexicalized Statistical Parsing Model. In the
proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing
(EMNLP 2004)
[2] Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2004. Automatic Tagging of
Arabic Text: From Raw Text to Base Phrase Chunks. In HLT-NAACL 04.
[3] Abdallah, S., Shaalan, K., Shoaib, M., Integrating Rule-based System with Classification for Arabic
Named Entity Recognition, Lecture Notes in Computer Science, Computational Linguistics and
Intelligent Text Processing, 7181: 311-322, 2012.
[4] Mohammed Attia, Antonio Toral, Lamia Tounsi, Monica Monachini and Josef van Genabith. 2010.
‘An automatically built Named Entity lexicon for Arabic’. LREC 2010. Valletta, Malta.
[5] J-P Ng and M-Y Kan, “QANUS: An Open Source Question-Answering Platform”, 2010,
http://wing.comp.nus.edu.sg/~junping/docs/qanus.pdf
[6] Nico Schlaefer, “A Semantic Approach to Question Answering”.
VDM Verlag Dr. Mueller, ISBN 3836450739, 2007.
[7] Recommender Systems, http://www.cc.uah.es/drg/courses/datamining/IntroRecSys.pdf
[8] Dipanjan Das and André F.T. Martins, “A Survey on Automatic Text Summarization”,
Literature Survey for the Language and Statistics II course at Carnegie Mellon University, 2007
[9] Cunningham H, Maynard D, Bontcheva K, Tablan V, Aswani N, Roberts I et al. Text Processing
with GATE (Version 6). University of Sheffield Department of Computer Science, 2011.
[10] Oudah, M. and Shaalan, K. (2013). Person Name Recognition Using the Hybrid Approach. Lecture
Notes in Computer Science, Natural Language Processing and Information Systems, Springer Berlin
Heidelberg, vol. 7934, pages 237–248.
[11] Shaalan, K., Oudah, M., A Hybrid Approach to Arabic Named Entity Recognition, Journal of
Information Science (JIS), 40(1): 67-87, SAGE Publications Ltd, UK.