1 00:00:00,000 --> 00:00:03,370 It's 15:00, let's get started.
2 00:00:03,860 --> 00:00:07,000 Thanks for coming, my name is Michael Stapelberg.
3 00:00:07,000 --> 00:00:09,420 This talk is about Debian Code Search.
4 00:00:09,420 --> 00:00:12,360 Before we start: who uses Debian?
5 00:00:12,360 --> 00:00:20,150 Most of you. Maybe I should have chosen a different title; the topic is actually not Debian-specific.
6 00:00:23,310 --> 00:00:26,470 Here we have the contents of this talk.
7 00:00:27,370 --> 00:00:30,640 It looks like a lot, but we can definitely cover it.
8 00:00:30,640 --> 00:00:35,270 First of all, I want to explain the motivation, i.e. why you would need a code search engine.
9 00:00:35,270 --> 00:00:42,450 Afterwards, I will briefly explain how search engines work in general.
10 00:00:42,450 --> 00:00:49,760 Then I will cover the basis of my work, the codesearch tools by Russ Cox.
11 00:00:49,760 --> 00:00:54,880 I will cover why it makes sense to do this project in the context of Debian,
12 00:00:54,880 --> 00:00:58,070 then the general architecture,
13 00:00:58,070 --> 00:01:00,840 what an inverted index is,
14 00:01:00,840 --> 00:01:02,840 how we implemented ranking,
15 00:01:02,840 --> 00:01:04,840 what kinds of optimizations I did,
16 00:01:04,840 --> 00:01:06,840 and finally we will have some time for questions.
17 00:01:06,840 --> 00:01:12,910 I should mention that Debian Code Search is also the subject of my bachelor thesis,
18 00:01:12,910 --> 00:01:21,710 which was mostly a lucky coincidence. I would have done it anyway, but the time was right and I found a professor to support my project.
19 00:01:23,200 --> 00:01:27,910 Let's talk about motivation: why would you search for source code?
20 00:01:27,910 --> 00:01:35,660 With open source projects, I frequently run into the problem that the documentation is really bad.
21 00:01:35,660 --> 00:01:42,550 Maybe there is no documentation at all, or the docs are out of date, or they are just not very friendly to newcomers.
22 00:01:42,550 --> 00:01:46,460 Two examples are OpenSSL and XCB.
23 00:01:46,460 --> 00:01:51,790 Let's take OpenSSL: how many of you have worked with OpenSSL?
24 00:01:51,790 --> 00:01:54,980 And who found it easy to get into OpenSSL?
25 00:01:54,980 --> 00:01:56,980 Exactly! Nobody.
26 00:01:56,980 --> 00:02:01,000 OpenSSL is incredibly complex. It does a lot.
27 00:02:01,000 --> 00:02:07,110 All of it is documented, but to understand even the simple things, you need to read all of the docs.
28 00:02:07,110 --> 00:02:14,600 So it is much, much simpler to read a tutorial or look at what other programs are doing, i.e. to work with code examples,
29 00:02:14,600 --> 00:02:19,390 instead of reading through all the documentation, which is just not feasible in this case.
30 00:02:19,390 --> 00:02:23,880 Another, different example is XCB.
31 00:02:23,880 --> 00:02:27,320 With XCB, nothing is documented. Why is that?
32 00:02:27,320 --> 00:02:35,380 XCB, first of all, stands for X C Bindings. It is a library that allows you to talk to the X server from a C program.
33 00:02:35,380 --> 00:02:41,180 XCB is the intended successor of Xlib; it has partly succeeded at that, but that is not very interesting right now.
34 00:02:41,180 --> 00:02:48,500 Previously, with Xlib, humans wrote and maintained all the code.
35 00:02:48,500 --> 00:02:53,590 With XCB, they thought: that is not very clever, it led to subtle errors,
36 00:02:53,590 --> 00:02:58,020 so instead, they created a protocol description of the X11 protocol.
37 00:02:58,020 --> 00:03:02,370 This is an XML file from which source code is automatically generated,
38 00:03:02,370 --> 00:03:07,060 for C, for Python, for Lua, whatever. That is one nice advantage;
39 00:03:07,060 --> 00:03:16,750 the other advantage is that with generated code, you cannot make the same mistakes that have happened in the past.
40 00:03:16,750 --> 00:03:21,290 However, that protocol description contains no documentation.
41 00:03:21,290 --> 00:03:28,860 So, if you as a programmer look at this library, you will see lots and lots of functions.
42 00:03:28,860 --> 00:03:32,220 They are all generated and come with zero documentation.
43 00:03:32,220 --> 00:03:38,420 That means you have to know the X11 protocol already, or perhaps have worked with Xlib in the past, before you can even use this project.
44 00:03:38,420 --> 00:03:44,150 When working with XCB, it helped me a lot to look at the source code of other programs.
45 00:03:44,150 --> 00:03:56,040 So this is the first point: I want to find real-life code examples so that I can avoid dealing with bad documentation.
46 00:03:56,040 --> 00:04:01,560 The other big point is tracing program flow across project boundaries.
47 00:04:01,560 --> 00:04:10,470 Assume I have some application, say evince, on my computer and find a bug. What do I do?
48 00:04:10,470 --> 00:04:19,330 Well, I download the source code, locate the bug in the source, and usually realize:
49 00:04:19,330 --> 00:04:22,270 ah, evince does a library call into another project.
50 00:04:22,270 --> 00:04:30,850 So I start from scratch: download that library, locate the line, realize it calls yet another library.
51 00:04:30,850 --> 00:04:38,520 And it goes on like that. It is annoying to download the whole source code when all you want to look at is just a tiny fraction of it.
52 00:04:38,520 --> 00:04:46,280 Also, jumping around between different projects is much easier with a search engine, where you just enter the name of a function and get taken to that function.
53 00:04:46,280 --> 00:04:57,990 So this is the second reason. There are more, but I think it is enough to explain these two. Those who want to read more can read my bachelor thesis.
54 00:04:59,970 --> 00:05:08,510 The next question is, of course: why a new search engine? You would think somebody had already built such a thing.
55 00:05:09,830 --> 00:05:14,700 Of course there are some code search engines, but they are language-specific.
56 00:05:14,700 --> 00:05:17,440 Two examples are Nullege and Sourcerer.
57 00:05:17,440 --> 00:05:28,150 Nullege is a Python search engine, and it really understands Python. So you can navigate through hierarchies of classes and such, which is cool.
58 00:05:28,150 --> 00:05:33,300 But it also implies that Nullege cannot handle non-Python source code.
59 00:05:33,300 --> 00:05:37,800 And I don't want to be restricted to one language; I want it to work for all of them.
60 00:05:37,800 --> 00:05:42,010 With Sourcerer, it's the same issue, except Sourcerer is Java-specific.
61 00:05:42,010 --> 00:05:49,680 The other class of search engines is the kind that is not up-to-date or contains very little software in general.
62 00:05:49,680 --> 00:05:53,750 Examples of that are Koders and Ohloh Code, which by now belong together.
63 00:05:53,750 --> 00:06:01,590 They have been migrating their system for quite some time, and don't update their index in the meantime.
64 00:06:01,590 --> 00:06:06,460 It is super frustrating when you search for something and find old code.
65 00:06:06,460 --> 00:06:13,560 Another example is Krugle, which, like a few others, has a company behind it.
66 00:06:13,560 --> 00:06:23,950 What this company wants to do is sell a code search engine to other companies. To demonstrate their product, they run a demo version filled with open source software.
67 00:06:23,950 --> 00:06:30,600 The problem is that Krugle, in particular, contains only 450 OSS projects.
68 00:06:30,600 --> 00:06:35,100 That is a ridiculously small amount. Whatever you search for, you won't find it.
69 00:06:35,100 --> 00:06:40,990 The other problem: there have been search engines which were pretty good, e.g. Google Code Search, which I used,
70 00:06:40,990 --> 00:06:46,900 but it is not available anymore, because Google wanted to focus on different projects and just closed it.
71 00:06:46,900 --> 00:06:52,310 Furthermore, none of the previously mentioned search engines are open source themselves.
72 00:06:52,310 --> 00:07:00,980 It is always good to have an OSS project, but especially when doing something for an OSS project like Debian, it's even more important that the tools are OSS.
73 00:07:00,980 --> 00:07:06,600 Aside from all that, it's very interesting to build your own search engine. I learnt a lot.
74 00:07:08,410 --> 00:07:12,140 So, even if there already are search engines, it's worthwhile to build another one.
75 00:07:12,140 --> 00:07:17,640 So how does a search engine work in general? Who of you has implemented something in that area?
76 00:07:17,640 --> 00:07:23,440 Two people, alright. There are a few technical terms you should know, so I'll briefly explain them.
77 00:07:23,440 --> 00:07:28,120 The first one is the corpus, which describes the set of data that is being searched.
78 00:07:30,110 --> 00:07:33,590 Usually, when thinking of a search engine, you think of a web search engine.
79 00:07:33,590 --> 00:07:38,000 For a web search engine, the corpus is simply the set of all web sites.
80 00:07:38,000 --> 00:07:44,800 However, web sites change all the time, so a web search engine needs a crawler.
81 00:07:44,800 --> 00:07:50,170 The crawler permanently downloads web sites, constantly expanding the corpus.
82 00:07:50,170 --> 00:07:53,000 But this is not an inherent property of search engines.
84 00:07:53,020 --> 00:07:57,970 A corpus could also be, in your local library, the set of all books.
85 00:07:57,970 --> 00:08:03,840 And when new books arrive, an employee at the library enters them into the system. Then you don't need a crawler.
86 00:08:03,840 --> 00:08:09,530 So there are different ways to construct your corpus, but now you know what the term itself means.
87 00:08:09,530 --> 00:08:16,830 The second term is the query, meaning a query to your search engine.
88 00:08:16,830 --> 00:08:23,440 This is very simple and every one of you has dealt with it: it is what you enter, for example, into the Google search box.
89 00:08:23,440 --> 00:08:30,860 However, not only the query itself is sent to the search engine; it usually gets expanded.
90 00:08:30,860 --> 00:08:35,840 As an example, you could imagine that, when you do a web search in Germany,
91 00:08:35,840 --> 00:08:48,770 your search engine could say: okay, the query was "kebab house", I will add Germany, or even Karlsruhe, to the query, so that it knows which documents to return first.
92 00:08:48,770 --> 00:08:56,310 So, "query" means the query for the internal system, and it can differ from what you entered.
93 00:08:56,310 --> 00:09:01,540 You also need an inverted index,
94 00:09:01,540 --> 00:09:09,120 which is simply a data structure leading from a query to the locations in your corpus. I will cover how that works in a second.
95 00:09:09,120 --> 00:09:19,360 And finally, the ranking is very important, because nobody would be happy with a results page in no particular order.
96 00:09:19,360 --> 00:09:24,150 Obviously, you want the thing you are most likely looking for to appear at the very top.
97 00:09:24,150 --> 00:09:32,210 This is a pretty hard problem. I will cover later how Debian Code Search does it.
98 00:09:32,210 --> 00:09:37,350 By the way, if there are any questions, please just raise your hand.
99 00:09:39,320 --> 00:09:50,280 The basis of my work is the codesearch tools by Russ Cox. He is the author of Google Code Search, which he wrote as his internship project in 2006.
100 00:09:50,280 --> 00:10:03,990 After Google Code Search was taken offline in 2012, he published an article in February 2012, in which he explains how it worked.
101 00:10:03,990 --> 00:10:11,530 The article not only explains how it works, but also comes with source code.
102 00:10:11,530 --> 00:10:24,630 He re-implemented his ideas, and his code makes a few simplifications, so it is not production-ready.
103 00:10:24,630 --> 00:10:30,290 But it provides a good basis, and you understand how it was meant to work.
104 00:10:30,290 --> 00:10:37,360 I want to show a quick demo. Let's first have a quick look at the article itself.
105 00:10:37,360 --> 00:10:43,180 It is called "Regular Expression Matching with a Trigram Index, or, How Google Code Search worked".
106 00:10:43,180 --> 00:10:46,820 In it, he describes how it all worked.
107 00:10:46,820 --> 00:10:56,890 Now, let's have a look at how those tools work. Assume I have a directory containing source code.
108 00:10:58,340 --> 00:11:02,630 Yes, this is open source under the BSD license, very friendly.
109 00:11:02,630 --> 00:11:14,530 Now let's demonstrate the tools. I have a directory called "i3" and can run the tool "cindex",
110 00:11:14,530 --> 00:11:24,770 which creates an index. I pass "src" as a command line argument, and now the tool has created an index.
111 00:11:24,860 --> 00:11:32,910 What I can do now is run "csearch" with a search term like "i3Font",
112 00:11:33,210 --> 00:11:41,860 and I will get the results. So it searched the directory I previously indexed and displayed the results.
113 00:11:41,860 --> 00:11:50,090 So this is the basis I started with. In the end, Debian Code Search is much more than that, though.
114 00:11:53,040 --> 00:12:11,550 Most people know Debian. It is a free, non-commercial Linux distribution. So it is a good place to host such a project if you don't want it to be taken offline after a few years.
115 00:12:14,370 --> 00:12:20,160 In case I cannot maintain the project at some point in time, somebody else can step up.
116 00:12:20,160 --> 00:12:24,400 Personally, I have been a Debian Developer since March 2012.
117 00:12:24,400 --> 00:12:38,360 And one of the reasons that led me to doing this with Debian is that we have a lot of software in Debian: about 17,000 source packages, resulting in even more binary packages.
118 00:12:38,360 --> 00:12:42,300 All in all, they contain 129 GiB of source code.
119 00:12:42,300 --> 00:12:48,810 Another advantage is that Debian contains only free software. That means we don't need a crawler.
120 00:12:48,810 --> 00:12:54,810 So we don't need to download software and figure out its license, or find out whether we may redistribute it at all.
121 00:12:57,080 --> 00:13:01,960 So a lot of what you normally need for a web search engine is simply not an issue here.
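As an aside to the cindex/csearch demo above: the same index can be queried from a Go program. The following is only a minimal sketch, assuming the import paths and API names of Russ Cox's published codesearch tools (github.com/google/codesearch), such as index.Open, index.File, RegexpQuery and PostingQuery; verify them against the actual code base.

    package main

    import (
        "fmt"

        "github.com/google/codesearch/index"
        "github.com/google/codesearch/regexp"
    )

    func main() {
        // Compile the search pattern with the codesearch regexp package.
        re, err := regexp.Compile("i3Font")
        if err != nil {
            panic(err)
        }
        // Translate the regexp into a trigram query (ANDs/ORs of trigrams).
        q := index.RegexpQuery(re.Syntax)
        // Open the index that a previous "cindex src" run wrote to disk.
        ix := index.Open(index.File())
        // PostingQuery returns candidate file IDs only; the files still
        // have to be grepped afterwards to confirm actual matches.
        for _, fileid := range ix.PostingQuery(q) {
            fmt.Println(ix.Name(fileid))
        }
    }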
122 00:13:02,950 --> 00:13:08,170 Furthermore, in Debian, we have plenty of machine-readable metadata.
123 00:13:08,170 --> 00:13:11,770 Quite recent are the machine-readable copyright files.
124 00:13:11,770 --> 00:13:16,910 Obviously, we also have the package names and their descriptions.
125 00:13:16,910 --> 00:13:27,660 But what's even better is that we have usage numbers from the popularity contest: we can see how many popcon participants have installed a specific package.
126 00:13:27,660 --> 00:13:35,540 That's how we can figure out which packages are more important than others. More on that later.
127 00:13:42,690 --> 00:13:51,050 Now let's look at the architecture on a high level. The best way is to give a quick demo.
128 00:13:56,660 --> 00:14:06,770 What we see here is the index page of Debian Code Search, which is quite simple and just contains a search box.
129 00:14:06,770 --> 00:14:20,770 There are a few links at the top: about, FAQ, etc. Those are static pages; the search engine only handles input to the search box. Now, question: what should I search for?
130 00:14:28,570 --> 00:14:41,290 Here we have the results page. It took a bit of time, but not incredibly long. A grep over the entire Debian archive would be much slower, so that's a win.
131 00:14:41,380 --> 00:14:53,110 What we see here are the results. The function we searched for is defined in the libc, which is a package that a lot of people have installed, so it shows up at the very top.
132 00:14:53,110 --> 00:15:08,830 The results page is structured as you would expect. There are more results, e.g. in the python package, and then you can navigate to the next page.
133 00:15:15,090 --> 00:15:24,810 As a short aside: what you wanted to search for is fine, but have a look at what other people enter into search engines once you launch them.
134 00:15:24,810 --> 00:15:31,060 I was surprised at first, especially considering the number of hits.
135 00:15:31,060 --> 00:15:43,270 The first hit, by the way, was caused by somebody writing a blog post about the issues with Douglas Crockford's license. In that post, he linked to this query as an example.
136 00:15:43,270 --> 00:15:51,620 As for the other queries, I just think that this is what people enter once you present them with an input box.
137 00:15:51,620 --> 00:16:07,810 This table was created to evaluate the cache hit ratio. The most important improvement after launch was to enable aggressive caching; otherwise the search engine would have been overloaded.
138 00:16:07,810 --> 00:16:14,580 Now, to the architecture. What you just saw works like this:
139 00:16:14,580 --> 00:16:21,140 At the top left, you see the HTTP frontend, which in our case is just an nginx server.
140 00:16:21,140 --> 00:16:31,930 What nginx does for us is deliver static pages like the about and FAQ pages. Furthermore, it can load-balance the requests going into the system.
141 00:16:31,930 --> 00:16:50,050 That is, if I have multiple of these backend stacks, I can split the load 50/50. That allows me to disable one of the stacks, update it, then re-enable it.
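The load-balancing idea can be sketched in a few lines of Go. This is not the actual nginx/dcs-web setup, just a minimal illustration with two hypothetical backend stacks, where the frontend serves the static pages itself and splits search queries 50/50 so that one stack can be taken down for an update:

    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync/atomic"
    )

    func main() {
        // Hypothetical backend addresses; in production, nginx does this job.
        var stacks []*httputil.ReverseProxy
        for _, addr := range []string{"http://localhost:28080", "http://localhost:28081"} {
            u, err := url.Parse(addr)
            if err != nil {
                panic(err)
            }
            stacks = append(stacks, httputil.NewSingleHostReverseProxy(u))
        }
        var n uint64
        // Static pages (about, FAQ, ...) are served directly.
        http.Handle("/", http.FileServer(http.Dir("static")))
        // Search queries alternate between the backend stacks (50/50 split).
        http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
            i := atomic.AddUint64(&n, 1) % uint64(len(stacks))
            stacks[i].ServeHTTP(w, r)
        })
        panic(http.ListenAndServe(":8080", nil))
    }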
142 00:16:50,050 --> 00:17:08,680 nginx then sends the query to dcs-web, a process which queries the different backends and then renders the results page we just looked at.
143 00:17:08,680 --> 00:17:31,300 We have the index backend, which is sharded for various reasons. This is the inverted index; it is kept in memory and allows us to go from a search query to the corresponding positions in the Debian archive.
144 00:17:31,300 --> 00:17:44,920 After dcs-web has queried the index backend, it only knows which files should be searched. That is, the index merely narrows down the set of files to search.
145 00:17:44,920 --> 00:17:54,600 Afterwards, dcs-web gets information from a PostgreSQL database, which it uses to decide the order in which the files are searched.
146 00:17:54,600 --> 00:18:07,790 And finally, the query is passed to the source backend, which has access to the entire Debian archive. It searches the files and returns the actual matches, resulting in the page you just saw.
147 00:18:09,760 --> 00:18:23,470 Now to the inverted index. Assume we have two documents, d1.txt and d2.txt. d1 contains "the weather is nice", d2 contains "the weather forecast lies".
148 00:18:23,470 --> 00:18:30,530 In an inverted index, you want to go from a word to its posting list.
149 00:18:30,530 --> 00:18:47,070 So you go through the first document, word by word, and store, for each word, which documents it appears in. The posting list for "nice", for example, contains just d1.txt.
150 00:18:47,070 --> 00:19:00,950 "weather", however, appears in both documents. This is a very simple index over a tiny number of documents, but that is how it works.
151 00:19:00,950 --> 00:19:11,400 Note that we only use whole words, no punctuation. So if you have ever asked yourself why web search engines don't care about punctuation: this is the reason.
152 00:19:16,500 --> 00:19:25,190 Right, two words are missing on this slide because I was careless.
153 00:19:25,190 --> 00:19:36,460 That is a good remark. Many search engines completely ignore words that are too frequent, because they would have incredibly long posting lists that don't really help you find what you are looking for.
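A minimal Go sketch of the word-level inverted index just described, using the two example documents (duplicate handling and persistence omitted):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        names := []string{"d1.txt", "d2.txt"}
        texts := map[string]string{
            "d1.txt": "the weather is nice",
            "d2.txt": "the weather forecast lies",
        }
        // The inverted index maps each word to its posting list,
        // i.e. the documents the word appears in.
        index := make(map[string][]string)
        for _, name := range names {
            for _, word := range strings.Fields(texts[name]) {
                index[word] = append(index[word], name)
            }
        }
        fmt.Println(index["nice"])    // [d1.txt]
        fmt.Println(index["weather"]) // [d1.txt d2.txt]
    }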
154 00:19:36,460 --> 00:19:44,550 The thing is that we cannot just use a classical inverted index for a regular expression search engine.
155 00:19:44,550 --> 00:19:50,890 I should note that both Google Code Search and Debian Code Search support regular expressions.
156 00:19:50,890 --> 00:19:58,900 This is a mighty tool for developers searching for code, but it also makes it harder to build the search engine in the first place.
157 00:19:58,900 --> 00:20:19,200 The problem is: consider ".*bar". We could look up "bar" in our inverted index, but that doesn't match "foobar". So you would have to apply the regular expression to all keys in the index before you even get to the posting lists, and that is not feasible.
158 00:20:34,410 --> 00:20:41,820 So this example explains why you cannot simply search for regular expressions.
159 00:20:41,820 --> 00:21:01,440 Another example is searching for "..._free". What should we do? Search for all combinations? Especially once you consider Unicode, this results in combinatorial explosion, and everybody agrees that one cannot do that.
160 00:21:01,440 --> 00:21:21,570 Instead, the approach is to reduce the number of files to search. So we somehow want to go from a regular expression to a query that excludes a lot of files, and then take a closer look at the remaining files.
161 00:21:21,570 --> 00:21:30,950 How does that work? You take the source code files and split them into trigrams.
162 00:21:30,950 --> 00:21:53,480 It looks like this: taking the identifier "XCreateWindow", we take "XCr", "Cre", and so on. So we split the entire source code into trigrams, including punctuation.
163 00:21:53,480 --> 00:22:02,650 "Cre" appears in two documents, so its posting list contains two entries, just like in the example you saw earlier.
164 00:22:02,650 --> 00:22:13,010 We then transform the regular expression query we receive into a trigram query.
165 00:22:13,010 --> 00:22:26,930 In this case, we search for Deb AND ebi AND bia and so on. The trigram query engine only supports AND, OR and subqueries.
166 00:22:26,930 --> 00:22:36,230 Those interested in how all the special cases are dealt with are welcome to read Russ Cox's original article.
167 00:22:36,230 --> 00:22:58,590 Some people might ask: why trigrams, and not 2-grams or 4-grams? With 2-grams, too many of them are common, so the index doesn't tell you anything. With 4-grams, there are too many distinct entries, so the index gets too big.
168 00:22:58,590 --> 00:23:02,870 So trigrams are the sweet spot.
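Before the demo, here is a small Go sketch of the trigram split and the resulting AND query. It is simplified to bytes; as mentioned later in the talk, the real index uses 32-bit entries so that Unicode works:

    package main

    import (
        "fmt"
        "strings"
    )

    // trigrams returns all 3-byte substrings of s.
    func trigrams(s string) []string {
        var ts []string
        for i := 0; i+3 <= len(s); i++ {
            ts = append(ts, s[i:i+3])
        }
        return ts
    }

    func main() {
        fmt.Println(trigrams("XCreateWindow"))
        // A literal query becomes the AND over its trigrams:
        fmt.Println(strings.Join(trigrams("Debian"), " AND "))
        // Output: Deb AND ebi AND bia AND ian
    }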
169 00:23:02,870 --> 00:23:14,990 Now for another short demo. What I want to show is how to get from a trigram to a posting list.
170 00:23:14,990 --> 00:23:31,310 Let's take "i3F" as our trigram. This perl one-liner just prints the ASCII value of each character, in hex.
171 00:23:31,310 --> 00:23:52,470 So 0x69, 0x33, 0x46. These can be passed to a debugging tool which looks up the trigram in the first index shard. Within 20 µs, I get a list of files containing "i3F".
172 00:23:53,410 --> 00:24:16,200 When doing this with all 6 index shards, I get a long list of files. But when ANDing in "fon", and so on, the list gets shorter and shorter. Afterwards, I essentially just grep the remaining files.
173 00:24:24,120 --> 00:24:31,580 The question was: are the numbers 16-bit or 32-bit? The answer: 32-bit, to support Unicode.
174 00:24:35,090 --> 00:24:38,040 Yes, you can search for Unicode.
175 00:24:38,040 --> 00:24:44,680 Now to the ranking.
176 00:24:47,850 --> 00:24:52,270 That depends on the encoding of the source code.
177 00:24:54,240 --> 00:24:59,380 Yeah, Latin-1 is a subset of UTF-8 [not correct].
178 00:25:04,220 --> 00:25:24,130 No, it does not convert encodings. It assumes everything is UTF-8. So everything in Latin-1 that is within the subset shared with UTF-8, I hope that is clearer now, just works. But Latin-1-encoded source is very rare; why would you do such a thing?
179 00:25:27,600 --> 00:25:44,180 Usually, comments are in English, hence ASCII and therefore Unicode-clean; but if you have localized comments, please save them as UTF-8. Any more questions?
180 00:25:52,020 --> 00:25:59,000 The question was: are the sources indexed, too? Only the sources are indexed. I don't get the question yet.
181 00:26:04,870 --> 00:26:13,430 Okay, got it. The question is: are only program source code files indexed, or everything that is inside a source package?
182 00:26:14,760 --> 00:26:36,390 Yes, changelogs and such. I wrote the indexer myself and just used a blacklist of things I don't want to index. What the original codesearch tools already provide is recognizing binary files and not indexing them.
183 00:26:36,390 --> 00:26:48,340 So files such as images are not in the index. A few other files, like CHANGELOG and NEWS, are ignored on purpose.
184 00:26:48,340 --> 00:26:53,060 Does that answer your question? Any more questions?
185 00:26:55,030 --> 00:27:03,340 We have now seen how we get from the search query to the list of potential results.
186 00:27:03,340 --> 00:27:20,470 Now comes the ranking. We have to decide which files to search first. It's not enough to search all files and sort afterwards, because, depending on the search query, searching all files is not feasible.
187 00:27:20,470 --> 00:27:32,330 So which results to show on top is something we decide later on, but we already have to take the ranking into account now.
188 00:27:32,330 --> 00:27:43,860 I evaluated the ranking by picking search queries and manually defining the expected results.
189 00:27:43,860 --> 00:27:53,580 For example: we have "XCreateWindow" as a search query, and what I want to see in my results is libx11, in this file, in this line.
190 00:27:53,580 --> 00:28:02,620 Then I tested each ranking idea I had and looked at how it performed.
191 00:28:02,620 --> 00:28:14,180 The result is a weighted sum, whose largest component is popularity contest data.
192 00:28:14,180 --> 00:28:33,110 As I explained, popcon is an opt-in mechanism which reports which packages are installed. I can see that e.g. libc is installed on this many machines, whereas python is only installed on that many machines.
193 00:28:33,110 --> 00:28:50,720 This is the most important part. Then we have the reverse dependencies. If I have a library such as libx11, and another library or program depends on libx11, that is a reverse dependency of libx11.
194 00:28:50,720 --> 00:29:08,410 That is, the longer the list of reverse dependencies, the more important the software. This is very similar to the popularity contest, but it's not the same. I tested it, and it brings a different quality to the results.
195 00:29:08,410 --> 00:29:16,290 Then we have the file path: if the search term is already contained in the file path, that match gets a boost.
196 00:29:16,290 --> 00:29:29,910 Have a look at how many libraries are structured, e.g. libc: each function is in a separate file with a corresponding name. libc is not the only project structured this way, so it's helpful to have this in our ranking.
197 00:29:30,340 --> 00:29:38,910 The remaining 14% consist of the scope, the exact word match and the package name match.
198 00:29:38,910 --> 00:29:49,870 Scope means having a look at the indentation of the file: from the indentation, we derive the scope of the result.
199 00:29:49,870 --> 00:30:03,580 That is, if you have a result in a line which is not indented at all, the result is at the top-level scope, so a class definition or a function definition.
200 00:30:03,580 --> 00:30:15,100 If it's indented by one level, it is, for example, a variable definition. If it's indented even further, it might only be a helper variable and not very important.
201 00:30:15,100 --> 00:30:25,720 So you can say: the higher in the scope, the more important the match. This is not very scientific or exact, but it works.
202 00:30:25,720 --> 00:30:45,040 The second signal is the exact word match. As an example, when searching for "XCreateWindow", I prefer "XCreateWindow(" over "XCreateWindowEvent". So if the result is contained within another word, it's not as good as when it appears standalone.
203 00:30:45,040 --> 00:30:58,920 The package name match is essentially the same thing as the file path: if the search term is already in the package name, the result is better. This happens rarely, but when it happens, it's a strong signal.
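In code, such a weighted sum could look like the following Go sketch. The signal names and weights are illustrative assumptions, constrained only by what the talk states: popcon dominates, and scope, exact word match and package name match together account for the remaining 14%:

    package main

    import "fmt"

    // signals for one search result; all values normalized to 0..1.
    type signals struct {
        popcon    float64 // popcon installations of the package
        rdeps     float64 // number of reverse dependencies
        pathMatch float64 // search term appears in the file path
        scope     float64 // higher for matches at lower indentation
        exactWord float64 // match is a standalone word
        pkgMatch  float64 // search term appears in the package name
    }

    // rank combines the signals; the weights are made up for illustration,
    // chosen so that the last three sum to the 14% mentioned in the talk.
    func rank(s signals) float64 {
        return 0.50*s.popcon + 0.20*s.rdeps + 0.16*s.pathMatch +
            0.06*s.scope + 0.05*s.exactWord + 0.03*s.pkgMatch
    }

    func main() {
        libc := signals{popcon: 1, rdeps: 1, pathMatch: 1, scope: 1, exactWord: 1}
        fmt.Printf("libc result: %.2f\n", rank(libc))
    }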
204 00:30:58,920 --> 00:31:14,810 Now I briefly want to cover two optimizations I performed. The first one concerns posting list decoding. A posting list, we recall, is the list of documents for a certain word in our inverted index.
205 00:31:14,810 --> 00:31:22,010 Posting lists are stored using variable-length integer delta encoding.
206 00:31:22,010 --> 00:31:37,770 This sounds more complex than it is. Think of two documents, identified by ID. So you don't have d1.txt and d2.txt, but, say, 450 and 452, which makes the index much smaller.
207 00:31:37,770 --> 00:31:51,260 Now you want to store those numbers, but you don't want to spend 32 bits on each of them, because the index should be small enough to keep entirely in RAM.
208 00:31:51,260 --> 00:32:02,100 So what you do is split the numbers into blocks of 7 bits and use the eighth bit to indicate that another block follows, a lot like UTF-8.
209 00:32:02,100 --> 00:32:09,770 So, when encoding 450, you get 0x83 and 0x42.
210 00:32:09,770 --> 00:32:21,160 And delta encoding means we interpret each number relative to the previous one. The delta between 450 and 452 is 2, so we just store a 2.
211 00:32:21,160 --> 00:32:30,070 The underlying assumption is that documents in the same source package sit close to each other in the index, too, so this saves space.
212 00:32:30,070 --> 00:32:49,600 This decoding was implemented naively. Debian Code Search is written in Go, because Russ Cox's codesearch tools are in Go and because I like the language.
213 00:32:49,600 --> 00:33:02,280 The Go implementation is understandable, but not highly optimized. So I decided to re-implement it in C, and that is significantly faster.
214 00:33:02,280 --> 00:33:19,280 Question: are the posting lists sorted, because of the delta encoding? Yes, I think they are, precisely for that reason.
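A minimal Go sketch of this encoding, reproducing the numbers above: the most significant 7-bit block comes first, so 450 encodes to 0x83 0x42, and a sorted posting list stores only the deltas:

    package main

    import "fmt"

    // encodeVarint encodes n in 7-bit blocks; the eighth bit of every block
    // except the last signals that another block follows.
    func encodeVarint(n uint32) []byte {
        var buf [5]byte
        i := len(buf) - 1
        buf[i] = byte(n & 0x7f) // last block: continuation bit clear
        for n >>= 7; n > 0; n >>= 7 {
            i--
            buf[i] = byte(n&0x7f) | 0x80 // another block follows
        }
        return buf[i:]
    }

    // encodePostingList delta-encodes a sorted list of document IDs:
    // each ID is stored as the difference to its predecessor.
    func encodePostingList(docids []uint32) []byte {
        var out []byte
        prev := uint32(0)
        for _, id := range docids {
            out = append(out, encodeVarint(id-prev)...)
            prev = id
        }
        return out
    }

    func main() {
        fmt.Printf("% x\n", encodeVarint(450))                     // 83 42
        fmt.Printf("% x\n", encodePostingList([]uint32{450, 452})) // 83 42 02
    }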
215 00:33:19,280 --> 00:33:35,690 The second optimization is a query planner. This table is sorted by the number of documents in each trigram's posting list. Let me explain what's what:
217 00:33:35,770 --> 00:33:46,350 Assuming we have the query "XCreateWindow", these are all the trigrams this query contains.
218 00:33:46,350 --> 00:34:00,360 The second column shows how many documents are in each posting list. So, "XCr" is in 763 documents, "teW" in 5732, and "ate" is very frequent: it appears in 419,000 documents.
219 00:34:00,360 --> 00:34:02,360 The i column is just an index.
220 00:34:02,360 --> 00:34:26,920 Then we have the number of documents in the intersection. For this query we compute "XCr" AND "teW" and so on. So we start with 763 documents, then we go down to 266, 263, and so on. You can already see that not much changes down here.
221 00:34:26,920 --> 00:34:38,830 The following column makes this very explicit. In the first row we don't have a delta yet, then 497 documents change, then 3, 2, 1 and no change after that.
222 00:34:39,040 --> 00:34:46,280 The second-to-last column is the delta to the final result, which is big in the beginning and quickly goes down to zero.
223 00:34:46,280 --> 00:34:57,930 The last column is the decoding time for each trigram individually. A short posting list is decoded in 12 µs, but the time is much higher for longer posting lists.
224 00:34:57,930 --> 00:35:07,010 The original code used all trigrams in the order in which they appeared in the index and then ANDed them all.
225 00:35:07,010 --> 00:35:19,690 The observation is that if you sort the trigrams ascending by the number of documents in their posting lists, after a certain point nothing changes anymore.
226 00:35:19,690 --> 00:35:37,760 Therefore, my algorithm heuristically waits until very little changes in the result anymore, then aborts. It is better to search a few more files than to decode and process all the remaining, long posting lists.
227 00:35:37,760 --> 00:35:41,360 This works and makes the search engine faster.
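A Go sketch of that heuristic: intersect the shortest posting lists first and abort once the candidate set barely shrinks. The cutoff value of 2 is an arbitrary assumption for this sketch, not the value Debian Code Search uses:

    package main

    import (
        "fmt"
        "sort"
    )

    // intersect returns the IDs common to two sorted posting lists.
    func intersect(a, b []uint32) []uint32 {
        var out []uint32
        for i, j := 0, 0; i < len(a) && j < len(b); {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    // candidates ANDs the posting lists, cheapest first, and aborts early:
    // grepping a few extra files is cheaper than decoding the long lists.
    func candidates(lists [][]uint32) []uint32 {
        sort.Slice(lists, func(i, j int) bool { return len(lists[i]) < len(lists[j]) })
        result := lists[0]
        for _, l := range lists[1:] {
            next := intersect(result, l)
            if len(result)-len(next) < 2 { // barely any change: stop here
                return next
            }
            result = next
        }
        return result
    }

    func main() {
        lists := [][]uint32{
            {1, 2, 3, 4, 5, 6, 7, 8, 9}, // frequent trigram
            {2, 3, 5, 7},                // rare trigram
            {2, 3, 4, 5, 7, 8},
        }
        fmt.Println(candidates(lists)) // [2 3 5 7]
    }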
228 00:35:41,360 --> 00:35:56,100 To conclude, I want to state that Debian Code Search is useful to developers in their day-to-day life. Many people have told me that they like to use it and that it's helpful for them.
229 00:35:56,100 --> 00:36:12,930 I would like to see patches. There are many ideas which would further simplify life for developers and other users. If you would like to help, please talk to me.
230 00:36:12,930 --> 00:36:21,370 And just in case you have a machine with an unused 8 GiB of RAM and a 160 GiB SSD connected to the internet, I could make use of that.
231 00:36:22,520 --> 00:36:33,700 Question: why does the posting list decoding time vary, instead of growing as we go further down the table?
232 00:36:35,550 --> 00:36:39,920 I think that's not an error in the table.
233 00:36:39,920 --> 00:36:57,390 Possibly the posting lists just contain different values. The number of documents in a posting list is not the same as the length of the posting list in bytes.
234 00:36:57,390 --> 00:37:00,650 That's why they vary.
235 00:37:03,430 --> 00:37:09,170 Exactly, Debian Code Search has been open source since January 2013.
236 00:37:14,480 --> 00:37:28,450 The question is: how often do we re-index, and can it be updated incrementally? Currently, we update every Friday.
237 00:37:28,450 --> 00:37:40,050 The problem is that an update takes a long time. Updating incrementally unfortunately doesn't work.
238 00:37:40,050 --> 00:37:45,240 The published codesearch tools don't do that.
239 00:37:45,240 --> 00:37:50,080 One could think about tackling this issue, but it's a lot of work for relatively little benefit.
240 00:37:50,080 --> 00:37:58,730 And, as I said, an update takes a long time. We don't have a lot of CPU resources; I host it all on my own server, so just one machine.
241 00:37:58,730 --> 00:38:14,060 An update takes about 8 hours. I could update the index daily and have a slow search engine for 8 out of 24 hours, or have less fresh results but a faster search engine, which currently is more important.
242 00:38:22,970 --> 00:38:29,270 The question was whether I have automated benchmarks, so that I can see what effect changes have on performance. Yes, I have that.
243 00:38:29,270 --> 00:38:44,470 The tool uses the same queries that I also used for the evaluation of my ranking.
244 00:39:03,060 --> 00:39:11,890 The note was: complicated regular expressions could take a lot of time to process. The answer is:
245 00:39:11,890 --> 00:39:25,340 yes and no. It works faster than you might think, but the more complex your query is, the longer the processing takes.
246 00:39:25,340 --> 00:39:33,220 You can definitely think of queries which take a long time to process.
247 00:39:55,660 --> 00:40:21,280 The question was whether we could go from the very simple result list we currently have, open a file, and click on an identifier to see, for example, all its callers.
248 00:40:21,280 --> 00:40:29,200 This is a bit hard, because the search engine currently is language-independent. To do that, you would at least need to recognize identifiers.
249 00:40:29,200 --> 00:40:46,760 You could do that, but suddenly you have a lot more metadata. So, yes, of course that is technically possible, but it's a lot of work. I personally don't intend to do it, but I don't mind other people trying to implement it.
250 00:40:50,280 --> 00:41:05,570 The remark is that the original use-case of following program flow across projects would get even easier with such a feature, which is correct.
251 00:41:18,550 --> 00:41:33,710 The remark was that there are tools like ctags which already recognize identifiers. One of the search engines I mentioned in the beginning just uses ctags to build its index.
252 00:41:48,060 --> 00:41:58,600 The question is: how do queries below three characters work with a trigram index? The answer is: they don't. I think that's the reason why many search engines have a three-character minimum length.
253 00:41:58,600 --> 00:42:07,890 In that case, the site just says the query is invalid and links you to the FAQ.
254 00:42:13,760 --> 00:42:35,470 The question was: how are the Debian packages stored, are they unpacked? The answer is yes: before building the index, I need to unpack the packages, and I just leave them on disk like that. That way, I can search and display the individual files.
255 00:42:47,380 --> 00:43:04,820 The remark is that there are packages, like the Linux kernel, which contain just another tarball, which in turn contains the actual source. I hadn't thought of that, but I think only a handful of packages are affected. We would need to modify the unpacker to handle that case.
256 00:43:04,820 --> 00:43:14,450 This is a little bit risky in terms of security, but, given that the package is already in Debian, it should be reasonably clean.
257 00:43:32,910 --> 00:43:40,620 The question was whether popcon still works without atime. I don't know.
258 00:43:40,620 --> 00:43:53,050 Personally, I have disabled atime and use the popularity contest. I think packages which only I can have installed do show up, so it must work, but I'm not entirely sure.
259 00:43:58,790 --> 00:44:03,500 The implementation is entirely in Go, except for one or two shell scripts that unpack the sources.
260 00:44:08,680 --> 00:44:13,560 The question is how much effort it would be to reimplement it in C++.
261 00:44:18,750 --> 00:44:27,660 Or, actually: how big is Debian Code Search and how many dependencies does it have? I can show two things to answer that.
262 00:44:31,340 --> 00:44:41,960 Another remark is that I reimplemented parts of it in C. But I only reimplemented one little file in C, so that's negligible.
263 00:44:41,960 --> 00:45:02,220 So here is the source code, see github.com/Debian/dcs. It is nearly self-contained. There is a document which describes how to build a Debian package from it that you can just install on your server.
264 00:45:02,220 --> 00:45:19,780 What you can see here is how to set up your Go development environment and what dependencies you need: a PostgreSQL driver, a Go parser for Debian control files, and the original codesearch tools from Russ Cox; so, three external dependencies.
265 00:45:19,780 --> 00:45:25,180 Apart from that, what we can do real quick:
266 00:45:31,480 --> 00:45:46,770 Oh, this doesn't work. sloccount doesn't know Go, so I have to guess. It's not easy to port Debian Code Search to C++, especially while keeping its architecture.
267 00:45:46,770 --> 00:45:59,100 Quite a bit of it depends on Go's way of handling concurrency, i.e. channels and so on. The architecture uses a lot of Go's features, so the language is definitely a good fit.
268 00:45:59,100 --> 00:46:06,860 You certainly could do it, but what would be your motivation?
269 00:46:06,860 --> 00:46:10,160 Yeah, but why?
270 00:46:15,940 --> 00:46:20,650 Okay, so it's purely theoretical, without a good reason.
271 00:46:23,010 --> 00:46:28,790 Then again: the right tool for the right job. Any more questions?
272 00:46:37,050 --> 00:46:50,680 The question is: does the indexer handle stable, unstable, and so on? This is a frequently asked question; we index sid, i.e. unstable.
273 00:46:50,680 --> 00:47:03,700 If you indexed more, a single query could return multiple results from the same package, but in different versions, which makes the results page less useful.
274 00:47:03,700 --> 00:47:22,540 Otherwise, you'd need to select which distributions you want to search in, which implies that you need multiple indexes. So this makes everything more complex. But in case you have a really good reason for doing this, please talk to us.
275 00:47:31,920 --> 00:47:36,290 The question was: are package-specific patches in the index? Yes, they are.
276 00:47:37,320 --> 00:47:40,450 Also, the entire debian/ directory.
277 00:47:40,450 --> 00:47:48,370 So if you want to search for things that only appear in debian/rules, you can do that.
278 00:47:51,460 --> 00:48:01,090 Is anything within debian/ excluded? Good question, let's look at the source.
279 00:48:01,090 --> 00:48:16,560 There is dcsindex, which is not very complicated. It has a hard-coded whitelist and blacklist.
280 00:48:16,560 --> 00:48:31,930 Ah, changelog and readme, but only when they are not inside debian/. So within debian/, we don't exclude anything.
281 00:48:31,930 --> 00:48:40,890 I mean, it's called Debian Code Search, so it had better find Debian-specific stuff.
282 00:48:51,470 --> 00:48:58,230 Exactly. My work is built on top of cindex and csearch, which I showed earlier.
283 00:49:00,890 --> 00:49:11,080 The question is: are there any alternatives? Yeah, most likely. But I think none of the alternatives do what Russ Cox did, i.e. regular expression search on a big index.
284 00:49:11,080 --> 00:49:30,190 Or rather, on a big corpus. Also, I found his approach attractive. I share his views and like Go, so without looking for alternatives, I just went for it.
285 00:49:30,190 --> 00:49:32,190 Any more questions?
286 00:49:38,500 --> 00:49:51,900 Alright. In case you think of any more questions, feel free to talk to me later on. Thanks for the many questions and your attention!