1 00:00:00,000 --> 00:00:03,370 It's 15:00, let's get started.
2 00:00:03,860 --> 00:00:07,000 Thanks for coming, my name is Michael Stapelberg.
3 00:00:07,000 --> 00:00:09,420 This talk is about Debian Code Search.
4 00:00:09,420 --> 00:00:12,360 Before we start: who uses Debian?
5 00:00:12,360 --> 00:00:20,150 Most of you. Maybe I should have chosen a different title; the topic is actually not Debian-specific.
6 00:00:23,310 --> 00:00:26,470 Here we have the contents of this talk.
7 00:00:27,370 --> 00:00:30,640 It looks like a lot, but we can definitely cover it.
8 00:00:30,640 --> 00:00:35,270 First of all, I want to explain the motivation, i.e. why you would need a code search engine.
9 00:00:35,270 --> 00:00:42,450 Afterwards, I will briefly explain how search engines work in general.
10 00:00:42,450 --> 00:00:49,760 Then I will cover the basis of my work, the codesearch tools by Russ Cox.
11 00:00:49,760 --> 00:00:54,880 I will cover why it makes sense to do this project in the context of Debian,
12 00:00:54,880 --> 00:00:58,070 then the general architecture,
13 00:00:58,070 --> 00:01:00,840 what an inverted index is,
14 00:01:00,840 --> 00:01:02,840 how we implemented ranking,
15 00:01:02,840 --> 00:01:04,840 what kinds of optimizations I did,
16 00:01:04,840 --> 00:01:06,840 and finally we will have some time for questions.
17 00:01:06,840 --> 00:01:12,910 I should mention that Debian Code Search is also the subject of my bachelor thesis,
18 00:01:12,910 --> 00:01:21,710 which was mostly a lucky coincidence. I would have done it anyway, but the time was right and I found a professor to support my project.
19 00:01:23,200 --> 00:01:27,910 Let's talk about motivation: why would you search for source code?
20 00:01:27,910 --> 00:01:35,660 With open source projects, I frequently run into the problem that the documentation is really bad.
21 00:01:35,660 --> 00:01:42,550 Maybe there is no documentation at all, or the docs are out of date, or they are just not very friendly to newcomers.
22 00:01:42,550 --> 00:01:46,460 Two examples are OpenSSL and XCB.
23 00:01:46,460 --> 00:01:51,790 Let's take OpenSSL: how many of you have worked with OpenSSL?
24 00:01:51,790 --> 00:01:54,980 And who found it easy to get into OpenSSL?
25 00:01:54,980 --> 00:01:56,980 Exactly! Nobody.
26 00:01:56,980 --> 00:02:01,000 OpenSSL is incredibly complex. It does a lot.
27 00:02:01,000 --> 00:02:07,110 All of it is documented, but to understand even the simple things, you need to read all of the docs.
28 00:02:07,110 --> 00:02:14,600 So it is much, much simpler to read a tutorial or look at what other programs are doing, i.e. to work with code examples,
29 00:02:14,600 --> 00:02:19,390 instead of reading through all the documentation, which is just not feasible in this case.
30 00:02:19,390 --> 00:02:23,880 Another, different example is XCB.
31 00:02:23,880 --> 00:02:27,320 With XCB, nothing is documented. Why is that?
32 00:02:27,320 --> 00:02:35,380 XCB, first of all, stands for X C Bindings. It is a library that allows you to talk to the X server from a C program.
33 00:02:35,380 --> 00:02:41,180 XCB is the intended successor of Xlib; it has partly succeeded at that, but that is not very interesting right now.
34 00:02:41,180 --> 00:02:48,500 Previously, with Xlib, humans wrote and maintained all the code.
35 00:02:48,500 --> 00:02:53,590 With XCB, they thought: that is not very clever, it led to subtle errors,
36 00:02:53,590 --> 00:02:58,020 so instead, they created a protocol description of the X11 protocol.
37 00:02:58,020 --> 00:03:02,370 This is an XML file from which source code is automatically generated,
38 00:03:02,370 --> 00:03:07,060 for C, for Python, for Lua, whatever. That is one nice advantage;
39 00:03:07,060 --> 00:03:16,750 the other advantage is that with generated code, you cannot make the same mistakes that have happened in the past.
40 00:03:16,750 --> 00:03:21,290 However, that protocol description contains no documentation.
41 00:03:21,290 --> 00:03:28,860 So, if you as a programmer look at this library, you will see lots and lots of functions.
42 00:03:28,860 --> 00:03:32,220 They are all generated and come with zero documentation.
43 00:03:32,220 --> 00:03:38,420 That means you have to know the X11 protocol already, or perhaps have worked with Xlib in the past, before you can even use this project.
44 00:03:38,420 --> 00:03:44,150 When working with XCB, it helped me a lot to look at the source code of other programs.
45 00:03:44,150 --> 00:03:56,040 So this is the first point: I want to find real-life code examples so that I can avoid dealing with bad documentation.
46 00:03:56,040 --> 00:04:01,560 The other big point is tracing program flow across project boundaries.
47 00:04:01,560 --> 00:04:10,470 Assume I have some application, say evince, on my computer and find a bug. What do I do?
48 00:04:10,470 --> 00:04:19,330 Well, I download the source code, locate the bug in the source, and usually realize:
49 00:04:19,330 --> 00:04:22,270 ah, evince does a library call into another project.
50 00:04:22,270 --> 00:04:30,850 So I start from scratch: download that library, locate the line, realize it calls yet another library.
51 00:04:30,850 --> 00:04:38,520 And it goes on like that. It is annoying to download the whole source code when all you want to look at is just a tiny fraction of it.
52 00:04:38,520 --> 00:04:46,280 Also, jumping around between different projects is much easier with a search engine, where you just enter the name of a function and get taken to that function.
53 00:04:46,280 --> 00:04:57,990 So this is the second reason. There are more, but I think it is enough to explain these two. Those who want to read more can read my bachelor thesis.
54 00:04:59,970 --> 00:05:08,510 The next question is, of course: why a new search engine? You would think somebody had already built such a thing.
55 00:05:09,830 --> 00:05:14,700 Of course there are some code search engines, but they are language-specific.
56 00:05:14,700 --> 00:05:17,440 Two examples are Nullege and Sourcerer.
57 00:05:17,440 --> 00:05:28,150 Nullege is a Python search engine, and it really understands Python. So you can navigate through hierarchies of classes and such, which is cool.
58 00:05:28,150 --> 00:05:33,300 But it also implies that Nullege cannot handle non-Python source code.
59 00:05:33,300 --> 00:05:37,800 And I don't want to be restricted to one language; I want it to work for all of them.
60 00:05:37,800 --> 00:05:42,010 With Sourcerer, it's the same issue, except Sourcerer is Java-specific.
61 00:05:42,010 --> 00:05:49,680 The other class of search engines is the kind that is not up-to-date or contains very little software in general.
62 00:05:49,680 --> 00:05:53,750 Examples of that are Koders and Ohloh Code, which by now belong together.
63 00:05:53,750 --> 00:06:01,590 They have been migrating their system for quite some time, and don't update their index in the meantime.
64 00:06:01,590 --> 00:06:06,460 It is super frustrating when you search for something and find old code.
65 00:06:06,460 --> 00:06:13,560 Another example is Krugle, which, like a few others, has a company behind it.
66 00:06:13,560 --> 00:06:23,950 What this company wants to do is sell a code search engine to other companies. To demonstrate their product, they run a demo version filled with open source software.
67 00:06:23,950 --> 00:06:30,600 The problem is that Krugle, in particular, contains only 450 OSS projects.
68 00:06:30,600 --> 00:06:35,100 That is a ridiculously small amount. Whatever you search for, you won't find it.
69 00:06:35,100 --> 00:06:40,990 The other problem: there have been search engines which were pretty good, e.g. Google Code Search, which I used,
70 00:06:40,990 --> 00:06:46,900 but it is not available anymore, because Google wanted to focus on different projects and just closed it.
71 00:06:46,900 --> 00:06:52,310 Furthermore, none of the previously mentioned search engines are open source themselves.
72 00:06:52,310 --> 00:07:00,980 It is always good to have an OSS project, but especially when doing something for an OSS project like Debian, it's even more important that the tools are OSS.
73 00:07:00,980 --> 00:07:06,600 Aside from all that, it's very interesting to build your own search engine. I learnt a lot.
74 00:07:08,410 --> 00:07:12,140 So, even if there already are search engines, it's worthwhile to build another one.
75 00:07:12,140 --> 00:07:17,640 So how does a search engine work in general? Who of you has implemented something in that area?
76 00:07:17,640 --> 00:07:23,440 Two people, alright. There are a few technical terms you should know, so I'll briefly explain them.
77 00:07:23,440 --> 00:07:28,120 The first one is the corpus, which describes the set of data that is being searched.
78 00:07:30,110 --> 00:07:33,590 Usually, when thinking of a search engine, you think of a web search engine.
79 00:07:33,590 --> 00:07:38,000 For a web search engine, the corpus is simply the set of all web sites.
80 00:07:38,000 --> 00:07:44,800 However, web sites change all the time, so a web search engine needs a crawler.
81 00:07:44,800 --> 00:07:50,170 The crawler permanently downloads web sites, constantly expanding the corpus.
82 00:07:50,170 --> 00:07:53,000 But this is not an inherent property of search engines.
84 00:07:53,020 --> 00:07:57,970 A corpus could also be, in your local library, the set of all books.
85 00:07:57,970 --> 00:08:03,840 And when new books arrive, an employee at the library enters them into the system. Then you don't need a crawler.
86 00:08:03,840 --> 00:08:09,530 So there are different ways to construct your corpus, but now you know what the term itself means.
87 00:08:09,530 --> 00:08:16,830 The second term is the query, meaning a query to your search engine.
88 00:08:16,830 --> 00:08:23,440 This is very simple and every one of you has dealt with it: it is what you enter, for example, into the Google search box.
89 00:08:23,440 --> 00:08:30,860 However, not only the query itself is sent to the search engine; it usually gets expanded.
90 00:08:30,860 --> 00:08:35,840 As an example, you could imagine that, when you do a web search in Germany,
91 00:08:35,840 --> 00:08:48,770 your search engine could say: okay, the query was "kebab house", I will add Germany, or even Karlsruhe, to the query, so that it knows which documents to return first.
92 00:08:48,770 --> 00:08:56,310 So, "query" means the query for the internal system, and it can differ from what you entered.
93 00:08:56,310 --> 00:09:01,540 You also need an inverted index,
94 00:09:01,540 --> 00:09:09,120 which is simply a data structure leading from a query to the locations in your corpus. I will cover how that works in a second.
95 00:09:09,120 --> 00:09:19,360 And finally, the ranking is very important, because nobody would be happy with a results page in no particular order.
96 00:09:19,360 --> 00:09:24,150 Obviously, you want the thing you are most likely looking for to appear at the very top.
97 00:09:24,150 --> 00:09:32,210 This is a pretty hard problem. I will cover later how Debian Code Search does it.
98 00:09:32,210 --> 00:09:37,350 By the way, if there are any questions, please just raise your hand.
99 00:09:39,320 --> 00:09:50,280 The basis of my work is the codesearch tools by Russ Cox. He is the author of Google Code Search, which he wrote as his internship project in 2006.
100 00:09:50,280 --> 00:10:03,990 After Google Code Search was taken offline in 2012, he published an article in February 2012, in which he explains how it worked.
101 00:10:03,990 --> 00:10:11,530 The article not only explains how it works, but also comes with source code.
102 00:10:11,530 --> 00:10:24,630 He re-implemented his ideas, and his code makes a few simplifications, so it is not production-ready.
103 00:10:24,630 --> 00:10:30,290 But it provides a good basis, and you understand how it was meant to work.
104 00:10:30,290 --> 00:10:37,360 I want to show a quick demo. Let's first have a quick look at the article itself.
105 00:10:37,360 --> 00:10:43,180 It is called "Regular Expression Matching with a Trigram Index, or, How Google Code Search worked".
106 00:10:43,180 --> 00:10:46,820 In it, he describes how it all worked.
107 00:10:46,820 --> 00:10:56,890 Now, let's have a look at how those tools work. Assume I have a directory containing source code.
108 00:10:58,340 --> 00:11:02,630 Yes, this is open source under the BSD license, very friendly.
109 00:11:02,630 --> 00:11:14,530 Now let's demonstrate the tools. I have a directory called "i3" and can run the tool "cindex",
110 00:11:14,530 --> 00:11:24,770 which creates an index. I pass "src" as a command line argument, and now the tool has created an index.
111 00:11:24,860 --> 00:11:32,910 What I can do now is run "csearch" with a search term like "i3Font",
112 00:11:33,210 --> 00:11:41,860 and I will get the results. So it searched the directory I previously indexed and displayed the results.
113 00:11:41,860 --> 00:11:50,090 So this is the basis I started with. In the end, Debian Code Search is much more than that, though.
114 00:11:53,040 --> 00:12:11,550 Most people know Debian. It is a free, non-commercial Linux distribution. So it is a good place to host such a project if you don't want it to be taken offline after a few years.
115 00:12:14,370 --> 00:12:20,160 In case I cannot maintain the project at some point in time, somebody else can step up.
116 00:12:20,160 --> 00:12:24,400 Personally, I have been a Debian Developer since March 2012.
117 00:12:24,400 --> 00:12:38,360 And one of the reasons that led me to doing this with Debian is that we have a lot of software in Debian: about 17,000 source packages, resulting in even more binary packages.
118 00:12:38,360 --> 00:12:42,300 All in all, they contain 129 GiB of source code.
119 00:12:42,300 --> 00:12:48,810 Another advantage is that Debian contains only free software. That means we don't need a crawler.
120 00:12:48,810 --> 00:12:54,810 So we don't need to download software and figure out its license, or find out whether we may redistribute it at all.
121 00:12:57,080 --> 00:13:01,960 So a lot of what you normally need for a web search engine is simply not an issue here.
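As an aside to the cindex/csearch demo above: the same index can be queried from a Go program. The following is only a minimal sketch, assuming the import paths and API names of Russ Cox's published codesearch tools (github.com/google/codesearch), such as index.Open, index.File, RegexpQuery and PostingQuery; verify them against the actual code base.

    package main

    import (
        "fmt"

        "github.com/google/codesearch/index"
        "github.com/google/codesearch/regexp"
    )

    func main() {
        // Compile the search pattern with the codesearch regexp package.
        re, err := regexp.Compile("i3Font")
        if err != nil {
            panic(err)
        }
        // Translate the regexp into a trigram query (ANDs/ORs of trigrams).
        q := index.RegexpQuery(re.Syntax)
        // Open the index that a previous "cindex src" run wrote to disk.
        ix := index.Open(index.File())
        // PostingQuery returns candidate file IDs only; the files still
        // have to be grepped afterwards to confirm actual matches.
        for _, fileid := range ix.PostingQuery(q) {
            fmt.Println(ix.Name(fileid))
        }
    }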
122 00:13:02,950 --> 00:13:08,170 Furthermore, in Debian, we have plenty of machine-readable metadata.
123 00:13:08,170 --> 00:13:11,770 Quite recent are the machine-readable copyright files.
124 00:13:11,770 --> 00:13:16,910 Obviously, we also have the package names and their descriptions.
125 00:13:16,910 --> 00:13:27,660 But what's even better is that we have usage numbers from the popularity contest: we can see how many popcon participants have installed a specific package.
126 00:13:27,660 --> 00:13:35,540 That's how we can figure out which packages are more important than others. More on that later.
127 00:13:42,690 --> 00:13:51,050 Now let's look at the architecture on a high level. The best way is to give a quick demo.
128 00:13:56,660 --> 00:14:06,770 What we see here is the index page of Debian Code Search, which is quite simple and just contains a search box.
129 00:14:06,770 --> 00:14:20,770 There are a few links at the top: about, FAQ, etc. Those are static pages; the search engine only handles input to the search box. Now, question: what should I search for?
130 00:14:28,570 --> 00:14:41,290 Here we have the results page. It took a bit of time, but not incredibly long. A grep over the entire Debian archive would be much slower, so that's a win.
131 00:14:41,380 --> 00:14:53,110 What we see here are the results. The function we searched for is defined in the libc, which is a package that a lot of people have installed, so it shows up at the very top.
132 00:14:53,110 --> 00:15:08,830 The results page is structured as you would expect. There are more results, e.g. in the python package, and then you can navigate to the next page.
133 00:15:15,090 --> 00:15:24,810 As a short aside: what you wanted to search for is fine, but have a look at what other people enter into search engines once you launch them.
134 00:15:24,810 --> 00:15:31,060 I was surprised at first, especially considering the number of hits.
135 00:15:31,060 --> 00:15:43,270 The first hit, by the way, was caused by somebody writing a blog post about the issues with Douglas Crockford's license. In that post, he linked to this query as an example.
136 00:15:43,270 --> 00:15:51,620 As for the other queries, I just think that this is what people enter once you present them with an input box.
137 00:15:51,620 --> 00:16:07,810 This table was created to evaluate the cache hit ratio. The most important improvement after launch was to enable aggressive caching; otherwise the search engine would have been overloaded.
138 00:16:07,810 --> 00:16:14,580 Now, to the architecture. What you just saw works like this:
139 00:16:14,580 --> 00:16:21,140 At the top left, you see the HTTP frontend, which in our case is just an nginx server.
140 00:16:21,140 --> 00:16:31,930 What nginx does for us is deliver static pages like the about and FAQ pages. Furthermore, it can load-balance the requests going into the system.
141 00:16:31,930 --> 00:16:50,050 That is, if I have multiple of these backend stacks, I can split the load 50/50. That allows me to disable one of the stacks, update it, then re-enable it.
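The load-balancing idea can be sketched in a few lines of Go. This is not the actual nginx/dcs-web setup, just a minimal illustration with two hypothetical backend stacks, where the frontend serves the static pages itself and splits search queries 50/50 so that one stack can be taken down for an update:

    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync/atomic"
    )

    func main() {
        // Hypothetical backend addresses; in production, nginx does this job.
        var stacks []*httputil.ReverseProxy
        for _, addr := range []string{"http://localhost:28080", "http://localhost:28081"} {
            u, err := url.Parse(addr)
            if err != nil {
                panic(err)
            }
            stacks = append(stacks, httputil.NewSingleHostReverseProxy(u))
        }
        var n uint64
        // Static pages (about, FAQ, ...) are served directly.
        http.Handle("/", http.FileServer(http.Dir("static")))
        // Search queries alternate between the backend stacks (50/50 split).
        http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
            i := atomic.AddUint64(&n, 1) % uint64(len(stacks))
            stacks[i].ServeHTTP(w, r)
        })
        panic(http.ListenAndServe(":8080", nil))
    }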
142 00:16:50,050 --> 00:17:08,680 nginx then sends the query to dcs-web, a process which queries the different backends and then renders the results page we just looked at.
143 00:17:08,680 --> 00:17:31,300 We have the index backend, which is sharded for various reasons. This is the inverted index; it is kept in memory and allows us to go from a search query to the corresponding positions in the Debian archive.
144 00:17:31,300 --> 00:17:44,920 After dcs-web has queried the index backend, it only knows which files should be searched. That is, the index merely narrows down the set of files to search.
145 00:17:44,920 --> 00:17:54,600 Afterwards, dcs-web gets information from a PostgreSQL database, which it uses to decide the order in which the files are searched.
146 00:17:54,600 --> 00:18:07,790 And finally, the query is passed to the source backend, which has access to the entire Debian archive. It searches the files and returns the actual matches, resulting in the page you just saw.
147 00:18:09,760 --> 00:18:23,470 Now to the inverted index. Assume we have two documents, d1.txt and d2.txt. d1 contains "the weather is nice", d2 contains "the weather forecast lies".
148 00:18:23,470 --> 00:18:30,530 In an inverted index, you want to go from a word to its posting list.
149 00:18:30,530 --> 00:18:47,070 So you go through the first document, word by word, and store, for each word, which documents it appears in. The posting list for "nice", for example, contains just d1.txt.
150 00:18:47,070 --> 00:19:00,950 "weather", however, appears in both documents. This is a very simple index over a tiny number of documents, but that is how it works.
151 00:19:00,950 --> 00:19:11,400 Note that we only use whole words, no punctuation. So if you have ever asked yourself why web search engines don't care about punctuation: this is the reason.
152 00:19:16,500 --> 00:19:25,190 Right, two words are missing on this slide because I was careless.
153 00:19:25,190 --> 00:19:36,460 That is a good remark. Many search engines completely ignore words that are too frequent, because they would have incredibly long posting lists that don't really help you find what you are looking for.
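A minimal Go sketch of the word-level inverted index just described, using the two example documents (duplicate handling and persistence omitted):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        names := []string{"d1.txt", "d2.txt"}
        texts := map[string]string{
            "d1.txt": "the weather is nice",
            "d2.txt": "the weather forecast lies",
        }
        // The inverted index maps each word to its posting list,
        // i.e. the documents the word appears in.
        index := make(map[string][]string)
        for _, name := range names {
            for _, word := range strings.Fields(texts[name]) {
                index[word] = append(index[word], name)
            }
        }
        fmt.Println(index["nice"])    // [d1.txt]
        fmt.Println(index["weather"]) // [d1.txt d2.txt]
    }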
154 00:19:36,460 --> 00:19:44,550 The thing is that we cannot just use a classical inverted index for a regular expression search engine.
155 00:19:44,550 --> 00:19:50,890 I should note that both Google Code Search and Debian Code Search support regular expressions.
156 00:19:50,890 --> 00:19:58,900 This is a mighty tool for developers searching for code, but it also makes it harder to build the search engine in the first place.
157 00:19:58,900 --> 00:20:19,200 The problem is: consider ".*bar". We could look up "bar" in our inverted index, but that doesn't match "foobar". So you would have to apply the regular expression to all keys in the index before you even get to the posting lists, and that is not feasible.
158 00:20:34,410 --> 00:20:41,820 So this example explains why you cannot simply search for regular expressions.
159 00:20:41,820 --> 00:21:01,440 Another example is searching for "..._free". What should we do? Search for all combinations? Especially once you consider Unicode, this results in combinatorial explosion, and everybody agrees that one cannot do that.
160 00:21:01,440 --> 00:21:21,570 Instead, the approach is to reduce the number of files to search. So we somehow want to go from a regular expression to a query that excludes a lot of files, and then take a closer look at the remaining files.
161 00:21:21,570 --> 00:21:30,950 How does that work? You take the source code files and split them into trigrams.
162 00:21:30,950 --> 00:21:53,480 It looks like this: taking the identifier "XCreateWindow", we take "XCr", "Cre", and so on. So we split the entire source code into trigrams, including punctuation.
163 00:21:53,480 --> 00:22:02,650 "Cre" appears in two documents, so its posting list contains two entries, just like in the example you saw earlier.
164 00:22:02,650 --> 00:22:13,010 We then transform the regular expression query we receive into a trigram query.
165 00:22:13,010 --> 00:22:26,930 In this case, we search for Deb AND ebi AND bia and so on. The trigram query engine only supports AND, OR and subqueries.
166 00:22:26,930 --> 00:22:36,230 Those interested in how all the special cases are dealt with are welcome to read Russ Cox's original article.
167 00:22:36,230 --> 00:22:58,590 Some people might ask: why trigrams, and not 2-grams or 4-grams? With 2-grams, too many of them are common, so the index doesn't tell you anything. With 4-grams, there are too many distinct entries, so the index gets too big.
168 00:22:58,590 --> 00:23:02,870 So trigrams are the sweet spot.
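Before the demo, here is a small Go sketch of the trigram split and the resulting AND query. It is simplified to bytes; as mentioned later in the talk, the real index uses 32-bit entries so that Unicode works:

    package main

    import (
        "fmt"
        "strings"
    )

    // trigrams returns all 3-byte substrings of s.
    func trigrams(s string) []string {
        var ts []string
        for i := 0; i+3 <= len(s); i++ {
            ts = append(ts, s[i:i+3])
        }
        return ts
    }

    func main() {
        fmt.Println(trigrams("XCreateWindow"))
        // A literal query becomes the AND over its trigrams:
        fmt.Println(strings.Join(trigrams("Debian"), " AND "))
        // Output: Deb AND ebi AND bia AND ian
    }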
169 00:23:02,870 --> 00:23:14,990 Now for another short demo. What I want to show is how to get from a trigram to a posting list.
170 00:23:14,990 --> 00:23:31,310 Let's take "i3F" as our trigram. This perl one-liner just prints the ASCII value of each character, in hex.
171 00:23:31,310 --> 00:23:52,470 So 0x69, 0x33, 0x46. These can be passed to a debugging tool which looks up the trigram in the first index shard. Within 20 µs, I get a list of files containing "i3F".
172 00:23:53,410 --> 00:24:16,200 When doing this with all 6 index shards, I get a long list of files. But when ANDing in "fon", and so on, the list gets shorter and shorter. Afterwards, I essentially just grep the remaining files.
173 00:24:24,120 --> 00:24:31,580 The question was: are the numbers 16-bit or 32-bit? The answer: 32-bit, to support Unicode.
174 00:24:35,090 --> 00:24:38,040 Yes, you can search for Unicode.
175 00:24:38,040 --> 00:24:44,680 Now to the ranking.
176 00:24:47,850 --> 00:24:52,270 That depends on the encoding of the source code.
177 00:24:54,240 --> 00:24:59,380 Yeah, Latin-1 is a subset of UTF-8 [not correct].
178 00:25:04,220 --> 00:25:24,130 No, it does not convert encodings. It assumes everything is UTF-8. So everything in Latin-1 that is within the subset shared with UTF-8, I hope that is clearer now, just works. But Latin-1-encoded source is very rare; why would you do such a thing?
179 00:25:27,600 --> 00:25:44,180 Usually, comments are in English, hence ASCII and therefore Unicode-clean; but if you have localized comments, please save them as UTF-8. Any more questions?
180 00:25:52,020 --> 00:25:59,000 The question was: are the sources indexed, too? Only the sources are indexed. I don't get the question yet.
181 00:26:04,870 --> 00:26:13,430 Okay, got it. The question is: are only program source code files indexed, or everything that is inside a source package?
182 00:26:14,760 --> 00:26:36,390 Yes, changelogs and such. I wrote the indexer myself and just used a blacklist of things I don't want to index. What the original codesearch tools already provide is recognizing binary files and not indexing them.
183 00:26:36,390 --> 00:26:48,340 So files such as images are not in the index. A few other files, like CHANGELOG and NEWS, are ignored on purpose.
184 00:26:48,340 --> 00:26:53,060 Does that answer your question? Any more questions?
185 00:26:55,030 --> 00:27:03,340 We have now seen how we get from the search query to the list of potential results.
186 00:27:03,340 --> 00:27:20,470 Now comes the ranking. We have to decide which files to search first. It's not enough to search all files and sort afterwards, because, depending on the search query, searching all files is not feasible.
187 00:27:20,470 --> 00:27:32,330 So which results to show on top is something we decide later on, but we already have to take the ranking into account now.
188 00:27:32,330 --> 00:27:43,860 I evaluated the ranking by picking search queries and manually defining the expected results.
189 00:27:43,860 --> 00:27:53,580 For example: we have "XCreateWindow" as a search query, and what I want to see in my results is libx11, in this file, in this line.
190 00:27:53,580 --> 00:28:02,620 Then I tested each ranking idea I had and looked at how it performed.
191 00:28:02,620 --> 00:28:14,180 The result is a weighted sum, whose largest component is popularity contest data.
192 00:28:14,180 --> 00:28:33,110 As I explained, popcon is an opt-in mechanism which reports which packages are installed. I can see that e.g. libc is installed on this many machines, whereas python is only installed on that many machines.
193 00:28:33,110 --> 00:28:50,720 This is the most important part. Then we have the reverse dependencies. If I have a library such as libx11, and another library or program depends on libx11, that is a reverse dependency of libx11.
194 00:28:50,720 --> 00:29:08,410 That is, the longer the list of reverse dependencies, the more important the software. This is very similar to the popularity contest, but it's not the same. I tested it, and it brings a different quality to the results.
195 00:29:08,410 --> 00:29:16,290 Then we have the file path: if the search term is already contained in the file path, that match gets a boost.
196 00:29:16,290 --> 00:29:29,910 Have a look at how many libraries are structured, e.g. libc: each function is in a separate file with a corresponding name. libc is not the only project structured this way, so it's helpful to have this in our ranking.
197 00:29:30,340 --> 00:29:38,910 The remaining 14% consist of the scope, the exact word match and the package name match.
198 00:29:38,910 --> 00:29:49,870 Scope means having a look at the indentation of the file: from the indentation, we derive the scope of the result.
199 00:29:49,870 --> 00:30:03,580 That is, if you have a result in a line which is not indented at all, the result is at the top-level scope, so a class definition or a function definition.
200 00:30:03,580 --> 00:30:15,100 If it's indented by one level, it is, for example, a variable definition. If it's indented even further, it might only be a helper variable and not very important.
201 00:30:15,100 --> 00:30:25,720 So you can say: the higher in the scope, the more important the match. This is not very scientific or exact, but it works.
202 00:30:25,720 --> 00:30:45,040 The second signal is the exact word match. As an example, when searching for "XCreateWindow", I prefer "XCreateWindow(" over "XCreateWindowEvent". So if the result is contained within another word, it's not as good as when it appears standalone.
203 00:30:45,040 --> 00:30:58,920 The package name match is essentially the same thing as the file path: if the search term is already in the package name, the result is better. This happens rarely, but when it happens, it's a strong signal.
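In code, such a weighted sum could look like the following Go sketch. The signal names and weights are illustrative assumptions, constrained only by what the talk states: popcon dominates, and scope, exact word match and package name match together account for the remaining 14%:

    package main

    import "fmt"

    // signals for one search result; all values normalized to 0..1.
    type signals struct {
        popcon    float64 // popcon installations of the package
        rdeps     float64 // number of reverse dependencies
        pathMatch float64 // search term appears in the file path
        scope     float64 // higher for matches at lower indentation
        exactWord float64 // match is a standalone word
        pkgMatch  float64 // search term appears in the package name
    }

    // rank combines the signals; the weights are made up for illustration,
    // chosen so that the last three sum to the 14% mentioned in the talk.
    func rank(s signals) float64 {
        return 0.50*s.popcon + 0.20*s.rdeps + 0.16*s.pathMatch +
            0.06*s.scope + 0.05*s.exactWord + 0.03*s.pkgMatch
    }

    func main() {
        libc := signals{popcon: 1, rdeps: 1, pathMatch: 1, scope: 1, exactWord: 1}
        fmt.Printf("libc result: %.2f\n", rank(libc))
    }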
204 00:30:58,920 --> 00:31:14,810 Now I briefly want to cover two optimizations I performed. The first one concerns posting list decoding. A posting list, we recall, is the list of documents for a certain word in our inverted index.
205 00:31:14,810 --> 00:31:22,010 Posting lists are stored using variable-length integer delta encoding.
206 00:31:22,010 --> 00:31:37,770 This sounds more complex than it is. Think of two documents, identified by ID. So you don't have d1.txt and d2.txt, but, say, 450 and 452, which makes the index much smaller.
207 00:31:37,770 --> 00:31:51,260 Now you want to store those numbers, but you don't want to spend 32 bits on each of them, because the index should be small enough to keep entirely in RAM.
208 00:31:51,260 --> 00:32:02,100 So what you do is split the numbers into blocks of 7 bits and use the eighth bit to indicate that another block follows, a lot like UTF-8.
209 00:32:02,100 --> 00:32:09,770 So, when encoding 450, you get 0x83 and 0x42.
210 00:32:09,770 --> 00:32:21,160 And delta encoding means we interpret each number relative to the previous one. The delta between 450 and 452 is 2, so we just store a 2.
211 00:32:21,160 --> 00:32:30,070 The underlying assumption is that documents in the same source package sit close to each other in the index, too, so this saves space.
212 00:32:30,070 --> 00:32:49,600 This decoding was implemented naively. Debian Code Search is written in Go, because Russ Cox's codesearch tools are in Go and because I like the language.
213 00:32:49,600 --> 00:33:02,280 The Go implementation is understandable, but not highly optimized. So I decided to re-implement it in C, and that is significantly faster.
214 00:33:02,280 --> 00:33:19,280 Question: are the posting lists sorted, because of the delta encoding? Yes, I think they are, precisely for that reason.
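A minimal Go sketch of this encoding, reproducing the numbers above: the most significant 7-bit block comes first, so 450 encodes to 0x83 0x42, and a sorted posting list stores only the deltas:

    package main

    import "fmt"

    // encodeVarint encodes n in 7-bit blocks; the eighth bit of every block
    // except the last signals that another block follows.
    func encodeVarint(n uint32) []byte {
        var buf [5]byte
        i := len(buf) - 1
        buf[i] = byte(n & 0x7f) // last block: continuation bit clear
        for n >>= 7; n > 0; n >>= 7 {
            i--
            buf[i] = byte(n&0x7f) | 0x80 // another block follows
        }
        return buf[i:]
    }

    // encodePostingList delta-encodes a sorted list of document IDs:
    // each ID is stored as the difference to its predecessor.
    func encodePostingList(docids []uint32) []byte {
        var out []byte
        prev := uint32(0)
        for _, id := range docids {
            out = append(out, encodeVarint(id-prev)...)
            prev = id
        }
        return out
    }

    func main() {
        fmt.Printf("% x\n", encodeVarint(450))                     // 83 42
        fmt.Printf("% x\n", encodePostingList([]uint32{450, 452})) // 83 42 02
    }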
215 00:33:19,280 --> 00:33:35,690 The second optimization is a query planner. This table is sorted by the number of documents in each trigram's posting list. Let me explain what's what:
217 00:33:35,770 --> 00:33:46,350 Assuming we have the query "XCreateWindow", these are all the trigrams this query contains.
218 00:33:46,350 --> 00:34:00,360 The second column shows how many documents are in each posting list. So, "XCr" is in 763 documents, "teW" in 5732, and "ate" is very frequent: it appears in 419,000 documents.
219 00:34:00,360 --> 00:34:02,360 The i column is just an index.
220 00:34:02,360 --> 00:34:26,920 Then we have the number of documents in the intersection. For this query we compute "XCr" AND "teW" and so on. So we start with 763 documents, then we go down to 266, 263, and so on. You can already see that not much changes down here.
221 00:34:26,920 --> 00:34:38,830 The following column makes this very explicit. In the first row we don't have a delta yet, then 497 documents change, then 3, 2, 1 and no change after that.
222 00:34:39,040 --> 00:34:46,280 The second-to-last column is the delta to the final result, which is big in the beginning and quickly goes down to zero.
223 00:34:46,280 --> 00:34:57,930 The last column is the decoding time for each trigram individually. A short posting list is decoded in 12 µs, but the time is much higher for longer posting lists.
224 00:34:57,930 --> 00:35:07,010 The original code used all trigrams in the order in which they appeared in the index and then ANDed them all.
225 00:35:07,010 --> 00:35:19,690 The observation is that if you sort the trigrams ascending by the number of documents in their posting lists, after a certain point nothing changes anymore.
226 00:35:19,690 --> 00:35:37,760 Therefore, my algorithm heuristically waits until very little changes in the result anymore, then aborts. It is better to search a few more files than to decode and process all the remaining, long posting lists.
227 00:35:37,760 --> 00:35:41,360 This works and makes the search engine faster.
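A Go sketch of that heuristic: intersect the shortest posting lists first and abort once the candidate set barely shrinks. The cutoff value of 2 is an arbitrary assumption for this sketch, not the value Debian Code Search uses:

    package main

    import (
        "fmt"
        "sort"
    )

    // intersect returns the IDs common to two sorted posting lists.
    func intersect(a, b []uint32) []uint32 {
        var out []uint32
        for i, j := 0, 0; i < len(a) && j < len(b); {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    // candidates ANDs the posting lists, cheapest first, and aborts early:
    // grepping a few extra files is cheaper than decoding the long lists.
    func candidates(lists [][]uint32) []uint32 {
        sort.Slice(lists, func(i, j int) bool { return len(lists[i]) < len(lists[j]) })
        result := lists[0]
        for _, l := range lists[1:] {
            next := intersect(result, l)
            if len(result)-len(next) < 2 { // barely any change: stop here
                return next
            }
            result = next
        }
        return result
    }

    func main() {
        lists := [][]uint32{
            {1, 2, 3, 4, 5, 6, 7, 8, 9}, // frequent trigram
            {2, 3, 5, 7},                // rare trigram
            {2, 3, 4, 5, 7, 8},
        }
        fmt.Println(candidates(lists)) // [2 3 5 7]
    }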
228 00:35:41,360 --> 00:35:56,100 To conclude, I want to state that Debian Code Search is useful to developers in their day-to-day life. Many people have told me that they like to use it and that it's helpful for them.
229 00:35:56,100 --> 00:36:12,930 I would like to see patches. There are many ideas which would further simplify life for developers and other users. If you would like to help, please talk to me.
230 00:36:12,930 --> 00:36:21,370 And just in case you have a machine with an unused 8 GiB of RAM and a 160 GiB SSD connected to the internet, I could make use of that.
231 00:36:22,520 --> 00:36:33,700 Question: why does the posting list decoding time vary, instead of growing as we go further down the table?
232 00:36:35,550 --> 00:36:39,920 I think that's not an error in the table.
233 00:36:39,920 --> 00:36:57,390 Possibly the posting lists just contain different values. The number of documents in a posting list is not the same as the length of the posting list in bytes.
234 00:36:57,390 --> 00:37:00,650 That's why they vary.
235 00:37:03,430 --> 00:37:09,170 Exactly, Debian Code Search has been open source since January 2013.
236 00:37:14,480 --> 00:37:28,450 The question is: how often do we re-index, and can it be updated incrementally? Currently, we update every Friday.
237 00:37:28,450 --> 00:37:40,050 The problem is that an update takes a long time. Updating incrementally unfortunately doesn't work.
238 00:37:40,050 --> 00:37:45,240 The published codesearch tools don't do that.
239 00:37:45,240 --> 00:37:50,080 One could think about tackling this issue, but it's a lot of work for relatively little benefit.
240 00:37:50,080 --> 00:37:58,730 And, as I said, an update takes a long time. We don't have a lot of CPU resources; I host it all on my own server, so just one machine.
241 00:37:58,730 --> 00:38:14,060 An update takes about 8 hours. I could update the index daily and have a slow search engine for 8 out of 24 hours, or have less fresh results but a faster search engine, which currently is more important.
242 00:38:22,970 --> 00:38:29,270 The question was whether I have automated benchmarks, so that I can see what effect changes have on performance. Yes, I have that.
243 00:38:29,270 --> 00:38:44,470 The tool uses the same queries that I also used for the evaluation of my ranking.
244 00:39:03,060 --> 00:39:11,890 The note was: complicated regular expressions could take a lot of time to process. The answer is:
245 00:39:11,890 --> 00:39:25,340 yes and no. It works faster than you might think, but the more complex your query is, the longer the processing takes.
246 00:39:25,340 --> 00:39:33,220 You can definitely think of queries which take a long time to process.
247 00:39:55,660 --> 00:40:21,280 The question was whether we could go from the very simple result list we currently have, open a file, and click on an identifier to see, for example, all its callers.
248 00:40:21,280 --> 00:40:29,200 This is a bit hard, because the search engine currently is language-independent. To do that, you would at least need to recognize identifiers.
249 00:40:29,200 --> 00:40:46,760 You could do that, but suddenly you have a lot more metadata. So, yes, of course that is technically possible, but it's a lot of work. I personally don't intend to do it, but I don't mind other people trying to implement it.
250 00:40:50,280 --> 00:41:05,570 The remark is that the original use-case of following program flow across projects would get even easier with such a feature, which is correct.
251 00:41:18,550 --> 00:41:33,710 The remark was that there are tools like ctags which already recognize identifiers. One of the search engines I mentioned in the beginning just uses ctags to build its index.
252 00:41:48,060 --> 00:41:58,600 The question is: how do queries below three characters work with a trigram index? The answer is: they don't. I think that's the reason why many search engines have a three-character minimum length.
253 00:41:58,600 --> 00:42:07,890 In that case, the site just says the query is invalid and links you to the FAQ.
254 00:42:13,760 --> 00:42:35,470 The question was: how are the Debian packages stored, are they unpacked? The answer is yes: before building the index, I need to unpack the packages, and I just leave them on disk like that. That way, I can search and display the individual files.
255 00:42:47,380 --> 00:43:04,820 The remark is that there are packages, like the Linux kernel, which contain just another tarball, which in turn contains the actual source. I hadn't thought of that, but I think only a handful of packages are affected. We would need to modify the unpacker to handle that case.
256 00:43:04,820 --> 00:43:14,450 This is a little bit risky in terms of security, but, given that the package is already in Debian, it should be reasonably clean.
257 00:43:32,910 --> 00:43:40,620 The question was whether popcon still works without atime. I don't know.
258 00:43:40,620 --> 00:43:53,050 Personally, I have disabled atime and use the popularity contest. I think packages which only I can have installed do show up, so it must work, but I'm not entirely sure.
259 00:43:58,790 --> 00:44:03,500 The implementation is entirely in Go, except for one or two shell scripts that unpack the sources.
260 00:44:08,680 --> 00:44:13,560 The question is how much effort it would be to reimplement it in C++.
261 00:44:18,750 --> 00:44:27,660 Or, actually: how big is Debian Code Search and how many dependencies does it have? I can show two things to answer that.
262 00:44:31,340 --> 00:44:41,960 Another remark is that I reimplemented parts of it in C. But I only reimplemented one little file in C, so that's negligible.
263 00:44:41,960 --> 00:45:02,220 So here is the source code, see github.com/Debian/dcs. It is nearly self-contained. There is a document which describes how to build a Debian package from it that you can just install on your server.
264 00:45:02,220 --> 00:45:19,780 What you can see here is how to set up your Go development environment and what dependencies you need: a PostgreSQL driver, a Go parser for Debian control files, and the original codesearch tools from Russ Cox; so, three external dependencies.
265 00:45:19,780 --> 00:45:25,180 Apart from that, what we can do real quick:
266 00:45:31,480 --> 00:45:46,770 Oh, this doesn't work. sloccount doesn't know Go, so I have to guess. It's not easy to port Debian Code Search to C++, especially while keeping its architecture.
267 00:45:46,770 --> 00:45:59,100 Quite a bit of it depends on Go's way of handling concurrency, i.e. channels and so on. The architecture uses a lot of Go's features, so the language is definitely a good fit.
268 00:45:59,100 --> 00:46:06,860 You certainly could do it, but what would be your motivation?
269 00:46:06,860 --> 00:46:10,160 Yeah, but why?
270 00:46:15,940 --> 00:46:20,650 Okay, so it's purely theoretical, without a good reason.
271 00:46:23,010 --> 00:46:28,790 Then again: the right tool for the right job. Any more questions?
272 00:46:37,050 --> 00:46:50,680 The question is: does the indexer handle stable, unstable, and so on? This is a frequently asked question; we index sid, i.e. unstable.
273 00:46:50,680 --> 00:47:03,700 If you indexed more, a single query could return multiple results from the same package, but in different versions, which makes the results page less useful.
274 00:47:03,700 --> 00:47:22,540 Otherwise, you'd need to select which distributions you want to search in, which implies that you need multiple indexes. So this makes everything more complex. But in case you have a really good reason for doing this, please talk to us.
275 00:47:31,920 --> 00:47:36,290 The question was: are package-specific patches in the index? Yes, they are.
276 00:47:37,320 --> 00:47:40,450 Also, the entire debian/ directory.
277 00:47:40,450 --> 00:47:48,370 So if you want to search for things that only appear in debian/rules, you can do that.
278 00:47:51,460 --> 00:48:01,090 Is anything within debian/ excluded? Good question, let's look at the source.
279 00:48:01,090 --> 00:48:16,560 There is dcsindex, which is not very complicated. It has a hard-coded whitelist and blacklist.
280 00:48:16,560 --> 00:48:31,930 Ah, changelog and readme, but only when they are not inside debian/. So within debian/, we don't exclude anything.
281 00:48:31,930 --> 00:48:40,890 I mean, it's called Debian Code Search, so it had better find Debian-specific stuff.
282 00:48:51,470 --> 00:48:58,230 Exactly. My work is built on top of cindex and csearch, which I showed earlier.
283 00:49:00,890 --> 00:49:11,080 The question is: are there any alternatives? Yeah, most likely. But I think none of the alternatives do what Russ Cox did, i.e. regular expression search on a big index.
284 00:49:11,080 --> 00:49:30,190 Or rather, on a big corpus. Also, I found his approach attractive. I share his views and like Go, so without looking for alternatives, I just went for it.
285 00:49:30,190 --> 00:49:32,190 Any more questions?
286 00:49:38,500 --> 00:49:51,900 Alright. In case you think of any more questions, feel free to talk to me later on. Thanks for the many questions and your attention!