updated to UD release 2.7 and adopting conllup format #10

Open
wants to merge 2 commits into
base: master
2 changes: 1 addition & 1 deletion README.md
@@ -110,12 +110,12 @@ Please email your questions or comments to [Yunyao Li](http://researcher.watson.
* [Marina Danilevsky](http://researcher.watson.ibm.com/researcher/view.php?person=us-mdanile), IBM Research - Almaden
* [Yunyao Li](http://researcher.watson.ibm.com/researcher/view.php?person=us-yunyaoli), IBM Research - Almaden
* [Huaiyu Zhu](http://researcher.watson.ibm.com/researcher/view.php?person=us-huaiyu), IBM Research - Almaden
* [Alexandre Rademaker](http://researcher.ibm.com/researcher/view.php?person=br-alexrad), IBM Research - Brazil

### Contributors

* Xinyu Guan, Yale University
* Tomer Mahlin, IBM Systems Division, Israel
* [Alexandre Rademaker](http://researcher.ibm.com/researcher/view.php?person=br-alexrad), IBM Research - Brazil
* Vishwajeet Kumar, IIT Bombay
* [Fei Xia](http://faculty.washington.edu/fxia), University of Washington
* Chenguang (Ray) Wang, Amazon
298 changes: 153 additions & 145 deletions UP_English-EWT/README.org

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion UP_English-EWT/conll-to-conllu.awk
@@ -15,5 +15,5 @@ new_sent == 1 && NF > 0 {
new_sent == 0 && NF > 1 {
for(i=j=9; i < NF; i+=1) {$j = $j"/"$(i+1)}
# ID form lemma upos xpos feats head deprel deps misc
print $3, $4, $4, $5, "_", "_", "_", "_", $6, "Framefile=" $7 "|" "Roleset=" $8 "|" "Args=" $9
print $3+1, $4, $4, $5, "_", "_", 1, "_", $6, "Framefile=" $7 "|" "Roleset=" $8 "|" "Args=" $9
}
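The changed `print` line above shifts the token ID from zero-based to one-based (`$3+1`) and writes a constant `1` in the HEAD column. A minimal Python sketch of the same row mapping follows. It assumes whitespace-separated input fields laid out as the awk script indexes them and a tab as output separator (the script's `OFS` is not visible in this hunk). The function name `conll_row_to_conllu` and the sample row are hypothetical.

```python
def conll_row_to_conllu(line):
    """Map one token row to a 10-column CoNLL-U row, mirroring the awk print."""
    f = line.split()
    if len(f) < 9:
        raise ValueError("expected at least 9 fields")
    # awk: for(i=j=9; i < NF; i+=1) {$j = $j"/"$(i+1)}  joins fields 9..NF with "/"
    args = "/".join(f[8:])
    token_id = int(f[2]) + 1  # awk: $3+1 (zero-based index becomes one-based ID)
    misc = "Framefile=" + f[6] + "|Roleset=" + f[7] + "|Args=" + args
    # ID form lemma upos xpos feats head deprel deps misc
    cols = [str(token_id), f[3], f[3], f[4], "_", "_", "1", "_", f[5], misc]
    return "\t".join(cols)  # assumption: tab-separated output


if __name__ == "__main__":
    print(conll_row_to_conllu("src 0 2 barked VERB _ bark bark.01 V _"))
```

The form is written twice because the awk line uses `$4` for both FORM and LEMMA.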
Original file line number Diff line number Diff line change
@@ -22,14 +22,23 @@ function join2(array, sep1, sep2) {
result = ""
for(i in array)
if(result == "")
result = i sep2 array[i]
if(array[i] != "")
result = i sep2 array[i]
else
result = i sep2
else
result = result sep1 i sep2 array[i]
if(array[i] != "")
result = result sep1 i sep2 array[i]
else
result = result sep1 i
return result
}
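The reworked `join2` serializes a key/value map, treating entries with an empty value separately. As shown in the diff, the first entry with an empty value is emitted as the key followed by `sep2`, while later ones are emitted as the bare key. The Python sketch below emits the bare key in both cases, which is an assumption about the intent, not something the diff confirms. The name `join_misc` is hypothetical. Note also that awk's `for (i in array)` visits keys in an unspecified order, whereas a Python dict keeps insertion order.

```python
def join_misc(pairs, sep1="|", sep2="="):
    """Join a dict into 'k=v|k2=v2', writing a bare key when the value is empty."""
    parts = []
    for key, value in pairs.items():
        if value != "":
            parts.append(key + sep2 + value)
        else:
            parts.append(key)  # assumption: bare key, with no trailing separator
    return sep1.join(parts)


if __name__ == "__main__":
    print(join_misc({"SpaceAfter": "No", "Flag": ""}))
```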


BEGIN {OFS = "\t";}
BEGIN {
OFS = "\t";
print "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PB:PRED PB:ARGS"
}

$0 ~ /^#/ {print; next}

@@ -39,20 +48,26 @@ $0 ~ /^[0-9]/ && NF == 10 {
delete b;

str2map(pmisc,"|","=",a)
frame = a["Framefile"]
delete a["Framefile"]
if ( frame == "-") frame = "_"

role = a["Roleset"]
delete a["Roleset"]
if ( role == "-" ) role = "_"
if(role=="-" || role == "") role = "_"

args=a["Args"]
if(args=="-" || args == "") args = "_"

delete a["Roleset"]
delete a["Framefile"]
delete a["Args"]
split(args,b,/\//)
margs = join1(b,"\t")
delete a["_"]
delete a["-"]

misc = join2(a,"|","=")
if (misc == "") misc = "_"
split(args,b,/\//)
margs = join1(b,"|")

if (length(a)>0)
misc = join2(a,"|","=")
else
misc = "_"

print $1,$2,$3,$4,$5,$6,$7,$8,$9,misc,role,margs
next
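The rule above turns a 10-column CoNLL-U row into a 12-column row matching the `global.columns` header: `Roleset` moves from MISC to `PB:PRED`, `Args` moves to `PB:ARGS` with `/` replaced by `|`, and `Framefile` is dropped. A rough Python sketch of that conversion follows. The helpers `str2map` and `join1` are not shown in this diff, so their behavior here (parse `k=v` pairs; join values with a separator) is assumed, and the names `split_misc` and `to_conllup` are hypothetical.

```python
def split_misc(misc):
    """Parse 'k=v|k2=v2' into a dict; a bare key (such as '_') maps to ''."""
    out = {}
    for item in misc.split("|"):
        if item:
            key, _, value = item.partition("=")
            out[key] = value
    return out


def to_conllup(cols):
    """Return the 12 CoNLL-U Plus columns for one 10-column token row."""
    if len(cols) != 10:
        raise ValueError("expected 10 columns")
    a = split_misc(cols[9])
    role = a.pop("Roleset", "")
    args = a.pop("Args", "")
    a.pop("Framefile", None)  # dropped, as in the new awk code
    a.pop("_", None)
    a.pop("-", None)
    if role in ("-", ""):
        role = "_"
    if args in ("-", ""):
        args = "_"
    pb_args = "|".join(args.split("/"))
    if a:
        misc = "|".join(k + "=" + v if v else k for k, v in a.items())
    else:
        misc = "_"
    return cols[:9] + [misc, role, pb_args]


if __name__ == "__main__":
    row = ["3", "barked", "bark", "VERB", "VBD", "_", "0", "root", "_",
           "SpaceAfter=No|Framefile=bark|Roleset=bark.01|Args=ARG0:1/ARGM-TMP:5"]
    print("\t".join(to_conllup(row)))
```

Rows whose MISC column is just `_` come out with `_` in all three trailing columns.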
47,753 changes: 24,416 additions & 23,337 deletions UP_English-EWT/en_ewt-up-dev.conllu → UP_English-EWT/en_ewt-up-dev.conllup

Large diffs are not rendered by default.

48,203 changes: 24,695 additions & 23,508 deletions UP_English-EWT/en_ewt-up-test.conllu → UP_English-EWT/en_ewt-up-test.conllup

Large diffs are not rendered by default.

395,278 changes: 200,095 additions & 195,183 deletions UP_English-EWT/en_ewt-up-train.conllu → UP_English-EWT/en_ewt-up-train.conllup

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion UP_English-EWT/make.sh
@@ -2,7 +2,7 @@
sbcl --load merge.lisp --eval "(in-package :merge-pb)" --eval "(main)" --eval "(sb-ext:quit)"

for f in en_ewt-up-{dev,test,train}.conllu.new; do
awk -f conllu-to-conll.awk $f > $(basename $f .conllu.new).conllu;
awk -f conllu-to-conllup.awk $f > $(basename $f .conllu.new).conllup;
done
rm en_ewt-up-{dev,test,train}.conllu.new
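The loop above derives each output name by stripping the `.conllu.new` suffix with `basename` and appending `.conllup`. A small Python sketch of that naming step, for illustration only (the function name `conllup_name` is hypothetical):

```python
from pathlib import Path


def conllup_name(path, suffix=".conllu.new"):
    """Mimic `$(basename $f .conllu.new).conllup` for one input path."""
    name = Path(path).name  # basename drops any directory part
    # basename strips the suffix only when it matches and is not the whole name
    if name.endswith(suffix) and name != suffix:
        name = name[: -len(suffix)]
    return name + ".conllup"


if __name__ == "__main__":
    for split in ("dev", "test", "train"):
        print(conllup_name("en_ewt-up-" + split + ".conllu.new"))
```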

36 changes: 31 additions & 5 deletions UP_English-EWT/merge.lisp
@@ -14,6 +14,8 @@

(ql:quickload '(:str :cl-conllu :cl-ppcre))

(declaim (optimize (debug 2)))

(defpackage :merge-pb
(:use :cl :cl-conllu :cl-ppcre))

@@ -36,6 +38,9 @@
(progn (setf (cdr cell) value) alist)
(acons key value alist))))

(defun alist-remove (alist key)
(remove key alist :key #'car :test #'equal))

(defun update-token-misc (tk alist)
(setf (token-misc tk)
(format nil "~{~a~^|~}"
@@ -205,16 +210,37 @@
(car a)))))
(parse-args (loop for r from 0 below rt collect (aref args 0 r c)))))


(loop for tk in (sentence-tokens s)
do (let ((al (token-misc-alist tk))
(vs (loop for c from 0 below ct collect (or (aref args 1 (1- (token-id tk)) c) "_"))))
(update-token-misc tk
(alist-update al "Args" (format nil "~{~a~^/~}" vs)))))))))
do (let* ((al (token-misc-alist tk))
(a1 (alist-remove al "Framefile"))
(a2 (alist-remove a1 "Args"))
(a3 (remove-if (lambda (p) (and (equal "Roleset" (car p)) (equal "-" (cdr p)))) a2)))
(update-token-misc tk a3)))

(loop for p in preds
for i from 0 below ct
do (let* ((al (token-misc-alist (car p)))
(args (loop for r from 0 below rt
when (and (aref args 1 r i)
(not (equal "V" (aref args 1 r i))))
collect (format nil "~a:~a" (aref args 1 r i) (1+ r))))
(a1 (if (> (length args) 0)
(alist-update al "Args" (format nil "~{~a~^/~}" args))
(alist-update al "Args" "_"))))
(update-token-misc (car p) a1)))

;; (loop for tk in (sentence-tokens s)
;; do (let ((al (token-misc-alist tk))
;; (vs (loop for c from 0 below ct collect (or (aref args 1 (1- (token-id tk)) c) "_"))))
;; (update-token-misc tk
;; (alist-update al "Args" (format nil "~{~a~^/~}" vs)))))
))))
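The new loop over `preds` stores, on each predicate token, an `Args` value made of `label:token-id` pairs joined by `/`. It skips the `V` label and falls back to `_` when no argument remains. A minimal Python sketch of that string construction, assuming one list of labels per predicate with `None` where a token carries no role (the name `predicate_args` and the sample labels are hypothetical):

```python
def predicate_args(labels):
    """Build 'ARG0:1/ARG1:4' from per-token labels for one predicate.

    labels[r] is the label on token r+1, or None. "V" marks the predicate
    itself and is skipped, as in the Lisp loop.
    """
    parts = [
        label + ":" + str(r + 1)  # (1+ r): token IDs are one-based
        for r, label in enumerate(labels)
        if label is not None and label != "V"
    ]
    return "/".join(parts) if parts else "_"


if __name__ == "__main__":
    print(predicate_args(["ARG0", None, "V", "ARG1"]))
```

The awk conversion step later rewrites the `/` separators as `|` when it moves this value into the `PB:ARGS` column.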


(defun main ()
(let* ((sets (make-hash-table :test #'equal))
(up (cl-conllu:read-conllu "propbank-all.conllu"))
(up (cl-conllu:read-conllu #P"propbank-all.conllu"))
(ud (reduce (lambda (r fn)
(let ((sents (cl-conllu:read-conllu fn)))
(setf (gethash fn sets)