How to run:
Create a folder and place wiki_indexer.py and wiki_search.py in it.
FOR INDEXING:
RUN : python3 wiki_indexer.py [path to the folder containing the XML files]
ex: python3 wiki_indexer.py wiki/dump/
FOR SEARCHING:
RUN : python3 wiki_search.py [path to queries.txt]
Format of queries.txt:
3, t:World Cup i:2019 c:Cricket
value of k -> 3 (top 3 documents)
query -> 't:World Cup i:2019 c:Cricket'
OUTPUT:
Results are written to queries_op.txt.
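The query line format described above could be parsed roughly as follows. This is only a sketch: the function name parse_query_line is hypothetical, and it assumes every field prefix is a single lowercase letter followed by a colon, as in the example line.

```python
import re

def parse_query_line(line):
    """Split 'k, query' into (k, raw query text, {field: text}).

    The field dict is empty for a plain (non-field) query.
    """
    k_part, _, query = line.partition(",")
    k = int(k_part.strip())
    query = query.strip()
    # Assumption: a field query looks like 'x:some words y:more words'.
    parts = re.split(r"\b([a-z]):", query)
    if len(parts) == 1:
        return k, query, {}
    fields = {}
    # re.split with a capture group alternates: text, prefix, text, prefix, ...
    for name, text in zip(parts[1::2], parts[2::2]):
        fields[name] = text.strip()
    return k, query, fields
```

For the example line, this gives k = 3 and one entry per field prefix (t, i and c).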
Indexing:
Parse each XML file.
Create index files of 1 lakh (100,000) words each.
After parsing, we get multiple files of 1 lakh words each. Each file is sorted internally, but the files are not
sorted relative to one another.
Apply merge sort to every pair of files.
After merging we get one huge file. A huge file is expensive to open and close repeatedly,
so we split the big file into chunks of 1000 each,
and for the starting word of each chunk
we make an entry in a secondary file (intuition: to make search efficient).
The split files are stored in './index/'.
Along with that, I have created one pickle file for titles, which holds [pageNo -> title].
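The chunking and secondary-file idea above can be illustrated with a small in-memory sketch. The function names and the use of binary search (bisect) are assumptions for illustration, not necessarily what wiki_indexer.py does. The real index keeps the chunks as files under './index/'.

```python
from bisect import bisect_right

def build_secondary_index(sorted_words, chunk_size):
    """Split a sorted word list into chunks and record each chunk's first word."""
    chunks = [sorted_words[i:i + chunk_size]
              for i in range(0, len(sorted_words), chunk_size)]
    first_words = [chunk[0] for chunk in chunks]
    return chunks, first_words

def find_chunk(first_words, word):
    """Return the index of the only chunk that could hold `word`.

    Returns -1 when `word` sorts before the first chunk's first word.
    """
    return bisect_right(first_words, word) - 1
```

With this layout, a search needs to open only one chunk per query word instead of scanning the whole merged index.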
Searching:
Uses tf-idf as the ranking mechanism.
take query
check whether the query is a field query or not:
    if not a field query:
        tokenize(data) [lowercase -> stemming -> stopwords]
        for each word:
            get the posting list for the word
            get the documents where the word is present
            get the frequency of the word in each document
            calculate tf-idf for each docid
        get the top-k docids according to their tf-idf scores
        get the title for each corresponding docid
    else:
        for each field perform the same operations (but get only the docids where the word is present in that field)
        get the list for each field
        perform intersection of the lists and produce the result
        if the result has fewer than top-k documents:
            perform union
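The ranking and the field-query fallback above can be sketched as follows. The exact tf-idf formula is not stated in this readme, so the sketch assumes the common form tf * log(N / df). The names rank_tfidf and combine_field_results, and the in-memory postings dict {word: {docid: frequency}}, are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def rank_tfidf(query_words, postings, total_docs, k):
    """Return the k docids with the highest summed tf-idf score."""
    scores = defaultdict(float)
    for word in query_words:
        docs = postings.get(word, {})
        if not docs:
            continue  # word does not occur in the index
        # Assumed weighting: raw term frequency times log(N / df).
        idf = math.log(total_docs / len(docs))
        for doc_id, freq in docs.items():
            scores[doc_id] += freq * idf
    return heapq.nlargest(k, scores, key=scores.get)

def combine_field_results(per_field_docs, k):
    """Intersect the per-field docid lists; fall back to their union if too few."""
    sets = [set(docs) for docs in per_field_docs]
    if not sets:
        return set()
    common = set.intersection(*sets)
    if len(common) < k:
        return set.union(*sets)
    return common
```

The docids returned can then be mapped to titles through the [pageNo -> title] pickle file.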
Posting list format:
coalici:d3666b3 d3973b1 d4955b1
The word 'coalici' occurs in 3 documents, with docIDs [3666, 3973, 4955].
The number of occurrences of 'coalici' in docID '3666' is 3.
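A line in this format could be decoded with a sketch like the one below. The function name is hypothetical, and the sketch assumes that each posting is 'd' plus the docID, followed by one or more letter-plus-count pairs. The meaning of the letter (for example 'b') is not defined in this readme, so it is kept only as an opaque tag.

```python
import re

# 'd' + docID, followed by one or more <letter><count> pairs (assumed layout).
POSTING_RE = re.compile(r"d(\d+)((?:[a-z]\d+)+)")
TAG_RE = re.compile(r"([a-z])(\d+)")

def parse_posting_line(line):
    """Turn 'word:d3666b3 d3973b1' into (word, {docid: {tag: count}})."""
    word, _, rest = line.strip().partition(":")
    postings = {}
    for doc_id, tags in POSTING_RE.findall(rest):
        postings[int(doc_id)] = {tag: int(n) for tag, n in TAG_RE.findall(tags)}
    return word, postings
```

The number of documents containing the word is then len(postings), which is the document frequency needed for tf-idf.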
Further improvements:
multi-threading for each XML file
k-way parallel merge sort
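As a rough idea of the k-way merge, Python's heapq.merge can combine any number of sorted runs in one pass. The sketch below is single-threaded (it does not cover the parallel part), works on in-memory lists of 'word:postings' lines instead of files, and the function name kway_merge is hypothetical.

```python
import heapq
from itertools import groupby

def kway_merge(sorted_runs):
    """Merge sorted runs of 'word:postings' lines, joining postings of equal words."""
    def split(line):
        word, _, postings = line.rstrip("\n").partition(":")
        return word, postings

    # One lazy stream of (word, postings) pairs per run.
    streams = [map(split, run) for run in sorted_runs]
    merged = heapq.merge(*streams, key=lambda pair: pair[0])
    # Equal words from different runs arrive adjacently; concatenate their postings.
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        yield word + ":" + " ".join(postings for _, postings in group)
```

Because heapq.merge reads its inputs lazily, the same approach would work with open file objects, so no run has to be loaded fully into memory.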