Erk Subasi's shared items
There are an increasing number of systems that attempt to allow the user to specify a probabilistic model in a high-level language — for example, declare a (Bayesian) generative model as a hierarchy of various distributions — then automatically run training and inference algorithms on a data set. Now, you could always learn a good math library, and implement every model from scratch, but the motivation for this approach is you’ll avoid doing lots of repetitive and error-prone programming. I’m not yet convinced that any of them completely achieve this goal, but it would be great if they succeeded and we could use high-level frameworks for everything.
Everyone seems to know about only a few of them, so here’s a meager attempt to list together a bunch that can be freely downloaded. There is one package that is far more mature and has been around much longer than the rest, so let’s start with:
- BUGS: Bayesian inference Using Gibbs Sampling. Specify a generative model, then it does inference with a Gibbs sampler, which lets it handle a wide variety of different sorts of models. The classic version has extensive GUI diagnostics for convergence and the like. BUGS can also be used from R. (The model definition language itself is R-like but not actually R.)
BUGS has had many users from a variety of fields. There are many books and zillions of courses and other resources showing how to use it to do Bayesian statistical data analysis. BUGS is supposed to be too slow once you get to thousands of parameters. The original implementation, WinBUGS, is written in Delphi, a variant of Pascal (!); its first release was in 1996. There are also two alternative open-source implementations (OpenBUGS, JAGS).
This is clearly very mature and successful software. Any new attempts to make something new should be compared against BUGS.
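To make "does inference with a Gibbs sampler" concrete, here is a minimal hand-written sketch of what such a system automates. It is not BUGS code or BUGS output; the model (normal data with unknown mean and precision, with conjugate priors), the hyperparameter defaults, and the function name `gibbs_normal` are all assumptions chosen for illustration.

```python
import random


def gibbs_normal(y, n_iter=2000, burn_in=500,
                 mu0=0.0, tau0=1e-3, a=1e-3, b=1e-3, seed=0):
    """Gibbs sampler for y_i ~ Normal(mu, 1/tau).

    Assumed priors (hypothetical defaults, deliberately vague):
      mu  ~ Normal(mu0, 1/tau0)
      tau ~ Gamma(shape=a, rate=b)
    Returns a list of (mu, tau) draws taken after burn-in.
    """
    rng = random.Random(seed)
    n = len(y)
    sum_y = sum(y)
    mu, tau = 0.0, 1.0  # arbitrary starting point
    samples = []
    for it in range(n_iter):
        # Full conditional of mu given tau: a normal distribution.
        precision = tau0 + n * tau
        mean = (tau0 * mu0 + tau * sum_y) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        # Full conditional of tau given mu: a gamma distribution.
        sse = sum((v - mu) ** 2 for v in y)
        rate = b + 0.5 * sse
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        tau = rng.gammavariate(a + n / 2.0, 1.0 / rate)
        if it >= burn_in:
            samples.append((mu, tau))
    return samples
```

Even for this two-parameter model you have to derive both conditionals by hand and get the gamma parameterization right, which is exactly the repetitive, error-prone work these systems try to take over.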
Next are systems that are much newer, generally less than several years old. Their languages all fall broadly into the category of probabilistic graphical models, but there are plenty of differences and specializations and assumptions that are a project in itself to understand. In lieu of doing a real synthesis, I’ll just list them with brief explanations.
- Factorie focuses on factor graphs and discriminative undirected models. Claims to scale to millions of parameters. Written in Scala. New as of 2009. Its introductory paper is interesting. From Andrew McCallum’s group at UMass Amherst.
- Infer.NET. I only just learned of it. New as of 2008. Focuses on message-passing inference. Written in C#. From MSR Cambridge. I actually can’t tell whether you get its source code in the download. All other systems here are clearly open source (except WinBUGS, but OpenBUGS is a real alternative).
- Church. Very new (as of 2009?), without much written about it yet. Focuses on generative models. Seems small/limited compared to the first three. Written in Scheme. From MIT.
- PMTK: Probabilistic Modeling Toolkit. I actually have no idea whether it does model specification-driven inference, but the author’s previous similar-looking toolkit (BNT) is fairly well-known, so it’s in this list. Written in Matlab. From Kevin Murphy.
- HBC: Hierarchical Bayesian Compiler. Similar idea to BUGS, though see the webpage for a clear statement of its somewhat different goals. It compiles the Gibbs sampler to C, so it’s much faster. Seems to be unmaintained. Written in Haskell. From Hal Daume.
Finally, there are a few systems that seem to be more specialized. I certainly haven’t listed all of them; see the Factorie paper for a list of a few others.
- Alchemy: an implementation of the Markov Logic Network formalism, an undirected graphical model over log-linear-weighted first-order logic. So, unlike BUGS and the above systems, there are no customized probability distributions for anything; everything is a Boltzmann (log-linear) distribution. At least, that’s how I understood it from the original paper. The FOL is essentially a language the user uses to define log-linear features. Alchemy then runs training algorithms to fit their weights to data.
From Pedro Domingos’ group at the University of Washington. Written in C++. I’ve heard people complain that Alchemy is too slow. But in fairness, all these systems are slower than customized implementations.
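As a tiny illustration of "everything is a log-linear distribution", the sketch below enumerates the possible worlds for a single weighted formula and normalizes exp(weight × satisfied). It is not Alchemy's input syntax or API; the two predicates, the formula Smokes ⇒ Cancer, and the weight of 1.5 are all made up for the example.

```python
import math
from itertools import product

WEIGHT = 1.5  # hypothetical weight for the formula Smokes => Cancer


def implies(a, b):
    """Truth value of the material implication a => b."""
    return (not a) or b


def world_probabilities(weight=WEIGHT):
    """Log-linear distribution over worlds (smokes, cancer) for one person.

    Each world's unnormalized score is exp(weight * n), where n is 1 if the
    formula holds in that world and 0 otherwise.
    """
    worlds = list(product([False, True], repeat=2))
    scores = {w: math.exp(weight * int(implies(*w))) for w in worlds}
    z = sum(scores.values())  # partition function
    return {w: s / z for w, s in scores.items()}
```

The world that violates the formula (smokes, no cancer) is not impossible, just less probable than the other three, and raising the weight pushes its probability toward zero. Fitting that weight from data is the training step the paragraph above mentions.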
- Dyna is specialized for dynamic programming. The formalism is weighted Horn clauses (weighted Prolog). Implements agenda-based training/inference algorithms that generalize Baum-Welch, PCFG chart parsers, and the like. Written in C++, compiles to C++. Seems unmaintained. From Jason Eisner’s group at Johns Hopkins.
Since it only does dynamic programs, Dyna usefully supports a much more limited set of models than the above systems. But I expect that means it can train and infer with models that the above would be hopeless to handle, since dynamic programming gives you big-O efficiency gains over more general algorithms. (But on the other hand, even dynamic programming can be too generic and slow compared to direct, customized implementations. That’s the danger of all these systems, of course.)
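To see the kind of big-O gain dynamic programming buys, here is a small sketch that computes an HMM sequence likelihood two ways: the forward algorithm (time linear in sequence length) and brute-force enumeration of every state path (exponential). This is plain Python, not Dyna syntax, and the function names and any toy parameters you feed it are assumptions for illustration.

```python
from itertools import product


def forward_likelihood(obs, states, start, trans, emit):
    """P(obs) via the forward algorithm: O(len(obs) * len(states)**2)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def brute_force_likelihood(obs, states, start, trans, emit):
    """P(obs) by summing over all len(states)**len(obs) state paths."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        total += p
    return total
```

Both functions return the same number, but the second one becomes unusable after a few dozen observations. A general-purpose sampler or message-passing engine would not necessarily discover the efficient recurrence on its own, which is the bet a dynamic-programming-only system makes.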
- BLOG: first-order logic with probability, though a fairly different formalism than MLNs. Focuses on problems with unknown objects and unknown numbers of objects. I personally don’t understand the use case very well. Its name stands for “Bayesian logic,” which seems like an unfairly broad characterization given all the other work in this area. From Brian Milch. Seems unmaintained? Written in Java.
An interesting axis of variation of all these is whether the model specification language is Turing-complete or not, and to what extent training and inference can be combined with external code.
- Turing-complete: Factorie, Infer.NET, Church, and Dyna are all Turing-complete. The modeling languages of the first three are embedded in general-purpose programming languages (Scala, C#, and Scheme respectively). Dyna is Turing-complete in two different ways: it has a complete Prolog-ish engine, which is technically Turing-complete but a total pain to do anything non-Prolog-y in; and it also compiles to C++.
- Not Turing-complete: BUGS, HBC, Alchemy/MLN, and BLOG use specialized mini-languages. BUGS’ and HBC’s languages are essentially the same as standard probabilistic model notation, though BUGS is imperative. Alchemy and BLOG are logic variants.
- Compiles to Turing-complete: HBC compiles to C, and Dyna compiles to C++, which are then intended to be hacked up and/or embedded in larger programs. I imagine this is a maintainability nightmare, but could be fine for one-off projects.
Another interesting variation is to what extent the systems handle probabilistic relations. BUGS and HBC don’t really try at all beyond plates; Alchemy, BLOG, and Factorie basically specialize in this; Dyna kind of does in a way; and the rest I can’t tell.
In summary, lots of interesting variation here. Given how many of these things are new and changing, this area will probably look much different in a few years.
When we perform a skilled movement such as reaching for an object, we can make use of prior information, for example about the location of the object in space. This helps us to prepare the movement, and we gain improved accuracy and speed during movement execution. Here, we investigate how prior information affects the motor cortical representation of movements during preparation and execution. We trained two monkeys in a delayed reaching task and provided a varying degree of prior information about the final target location. We decoded movement direction from multiple single-unit activity recorded from M1 (primary motor cortex) in one monkey and from PMd (dorsal premotor cortex) in a second monkey. Our results demonstrate that motor cortical cells in both areas exhibit individual encoding characteristics that change dynamically in time and dependent on prior information. On the population level, the information about movement direction is at any point in time accurately represented in a neuronal ensemble of time-varying composition. We conclude that movement representation in the motor cortex is not a static one, but one in which neurons dynamically allocate their computational resources to meet the demands defined by the movement task and the context of the movement. Consequently, we find that the decoding accuracy decreases if the precise task time, or the previous information that was available to the monkey, were disregarded in the decoding process. An optimal strategy for the readout of movement parameters from motor cortex should therefore take into account time and contextual parameters.
Well, what Norvig actually talks about is a ‘phase transition’ in computer science. Two things make it possible: (1) huge amounts of data and (2) processing speed.
One of the nicest examples he gives: learning algorithm X is the best with 1 million examples while another algorithm Y ranks third, but when the same algorithms are run on a data set of 1 billion examples, Y becomes the best one.
Another good example is scene completion: the algorithm did not give meaningful results with 10,000 images, so the researchers tried 100,000 images, again with no good results, then 1,000,000 images, again nothing, but with 10,000,000 images it worked very well! So some kind of phase transition, or a quantum leap, is going on here. The situation is similar to Google Image Search, where they were trying to find canonical images, e.g. the image that best represents ‘Mona Lisa’ and not some variation of it. By taking pairs of images, comparing their features, calculating a distance, arranging the data as a graph, and running a PageRank-like algorithm on that graph, they were able to find the images that best represent a given set of keywords.
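The canonical-image idea can be sketched as a PageRank-style power iteration over a similarity graph. This is only a guess at the general shape, not Google's actual method: the function name, the damping factor, and the requirement that similarities be given as a symmetric dict of dicts (in which every image has at least one neighbor) are assumptions.

```python
def rank_by_similarity(sim, damping=0.85, iterations=50):
    """Rank nodes of a weighted similarity graph by power iteration.

    sim[u][v] is a positive similarity between images u and v; every image
    is assumed to have at least one neighbor. Each node spreads its rank
    to its neighbors in proportion to similarity.
    """
    nodes = sorted(sim)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(sim[v].values()) for v in nodes}
    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            incoming = sum(
                rank[u] * sim[u][v] / out_weight[u]
                for u in nodes
                if v in sim[u]
            )
            new_rank[v] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank
```

An image that is strongly similar to many others collects rank from all of them, so the "central" picture of a cluster ends up on top, while crops, parodies and sketches that mostly resemble the original rank below it.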
It is always fun and revealing to listen to Norvig. If you are interested in cutting-edge research in machine learning, pattern recognition and machine translation, I recommend this video enthusiastically, especially the parts where Norvig shows single-page Python source code for word segmentation and typo checking programs (the first is about 97% correct, running on a laptop with a data set of about 1.7 billion words; the second is about 75% correct, again running on a not-very-high-end laptop). He also covers the MapReduce programming paradigm and some mistaken claims about the model, showing how it helps with parallel programming over very large amounts of data.
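For a feel of how a word segmenter can fit on a single page, here is a minimal sketch in the same spirit: pick the split that maximizes the product of unigram word probabilities, with memoization. It is not Norvig's code from the talk; the toy `COUNTS` table, the unknown-word penalty, and `MAX_WORD_LEN` are invented for the example (the real thing draws its counts from a corpus of over a billion words).

```python
import math
from functools import lru_cache

# Hypothetical toy counts standing in for a large corpus.
COUNTS = {"choose": 40, "chooses": 10, "spain": 30, "pain": 20}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20


def word_logprob(word):
    """Log probability of a word; unknown words are penalized by length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (log probability, words) for the best segmentation of text."""
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidate = (word_logprob(first) + rest_score, (first,) + rest_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best
```

With these toy counts, `segment("choosespain")[1]` prefers "choose spain" over "chooses pain" simply because the first pair of words is more frequent. The algorithm is trivial; the quality comes from the size of the count table, which is the point of the talk.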
I don’t like it at all when services that deliver value on the internet are offered completely free. But my not liking it changes nothing: free service is now the general rule. Things were already heading that way, and Facebook gave the rule its name. The fact that the incredible service offered by a giant like Facebook is completely free has utterly finished off many networking sites that earned revenue from paid memberships, and in fact many other service sites as well.
A new debate on this subject began when Malcolm Gladwell, whose The Tipping Point is a bedside book, published a critique of Chris Anderson’s new book Free. Seth Godin defended Anderson, saying “Malcolm is wrong.” Once others joined the debate, a serious round of mental gymnastics got under way.
Fred Wilson, whom I enjoy following, defends the Freemium concept, which can be described as “free for the majority, paid for those who want extras” and which we also use at cember.net. In his post titled “Freemium and Freeconomics” he again explains his thinking on the subject using the Facebook example. Wilson argues that Facebook is a perfect example of the model he calls Freeconomics, and that an advertising-based model built around free membership and free service is definitely successful.
On the opposing side, David Semeria argues on his blog that this model destroys both itself and those around it, and has named it “Kamikaze Marketing”. Although that is the right approach, I disagree on the self-destruction part, for the reason I explain below.
Freeconomics proponents start from the observation that marginal costs are approaching zero. But when quantities this large are involved, small marginal costs still add up to large costs. For that reason, I think the model Wilson calls freeconomics is really a more likeable version of a concept that already exists: dumping.
Dumping is offering a product or service to consumers at very small margins, or even below cost, or even completely free. It is mostly used by the “big” to wipe the “small” out of the market. In a situation where everyone has to hold their breath, the weak drown first. That is what dumping tries to do. If anti-dumping rules are not enforced, the strong eventually wipe out the weak entirely, and afterwards new competitors do not easily emerge. Under the appearance of the internet’s libertarian environment, the anti-dumping rules applied in all other sectors are not (yet) in force, and the system we call freeconomics is creating a medium in which only the strong will survive.
Facebook’s 2009 advertising revenue is projected to be 475 million dollars. By offering a free service to a very large audience, Facebook can generate revenue in other ways. Meanwhile, smaller sites in America, in Turkey, and in the rest of the world cannot generate that revenue. After a while they run out of breath and drown. The typical result of dumping. Facebook has finished off other sites, or left them in their death throes, not only in social networking but also with the many other features it has developed. Google did the same to search engines, mail services, and beyond that, specialized software. Even a giant like Microsoft is now floundering in the breath-holding contest.
In short, just as Carrefour and Real finished off corner shops and small local markets, freeconomics will keep finishing off local sites. That is what globalization is, after all. I wrote this post not to complain about freeconomics, but only to help put a clear name on what is happening. What we need to do as a sector is to understand what the game is and focus on creating a “big” player accordingly, making sure we strengthen our own breath before others try to drown us.
Only then can we be not just a market on the internet, but a player too.
Unfortunately, the forex market has not escaped the impact of global deleveraging and the failure of Lehman Brothers in 2008. Central banks from around the world have released their semi-annual foreign exchange surveys, and according to all of the reports, forex trading volume decreased significantly between April 2008 and April 2009. Investors large and small have reduced risk, and carry trades have been unwound aggressively. The lack of participation may explain why the major currency pairs have been stuck in a range since the beginning of May. In New York, for example, forex spot trading volume fell to the lowest level in more than 3 years.
London remains the most active forex trading center followed by NY and Tokyo. The EUR/USD is still the most actively traded currency pair by far.
Here are some stats (all data are in billions of U.S. dollars):
London (link to report)
- Britain is the world’s biggest FX trading hub with over a third of global turnover.
- Average daily turnover in forex products fell 20% since October 2008 to $1,356B, down 25% from April 2008
- Majority of decline was attributed to less activity in spot FX which fell 28%
- The most heavily traded currency pair was euro/dollar, which accounted for 32% of total turnover.
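The London percentages can be turned back into approximate earlier turnover levels with a line of arithmetic. The sketch below assumes each decline is measured relative to the earlier survey's figure; since the published numbers are rounded, the results are rough estimates rather than the reported values, and the function name is made up for the example.

```python
def implied_prior_level(current, pct_decline):
    """Earlier level implied by a current level and a percentage decline."""
    return current / (1.0 - pct_decline / 100.0)


london_april_2009 = 1356.0  # billions of USD, from the London figures above

# A 20% fall since October 2008 implies roughly 1,695B back then.
london_october_2008 = implied_prior_level(london_april_2009, 20.0)

# A 25% fall since April 2008 implies roughly 1,808B a year earlier.
london_april_2008 = implied_prior_level(london_april_2009, 25.0)
```

Put differently, London's average daily turnover dropped by roughly 450 billion dollars in a year, with most of that drop coming in the final six months.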
New York (link to report)
- Daily FX market turnover fell 26.3% to $527B, the lowest level since October 2005
- Spot transactions dropped 25.2%, Option trades fell 48.4%
Most Heavily Traded Currencies (Spot Transactions) in NY
Tokyo (link to report)
- Daily FX market turnover fell 16 percent
Singapore (link to report)
- Daily Foreign Exchange turnover down 21 percent compared to October 2008
Canada (link to report)
- Daily Foreign Exchange turnover down 11.3 percent, lowest volume since April 2007
Google has released a series of YouTube interviews with their lead engineers. Embedded below is one about MapReduce. The four engineers interviewed include the inventors of MapReduce. Some quotes:
6:17 – If we haven’t had to deal with [machine] failures… we would have probably never implemented MapReduce. Because without having to support failures, the rest of the machine code is just not that complicated.
7:20 – (Interviewer) What do you feel the technology [MapReduce] isn’t applicable for?… (Sanjay Ghemawat, Google Fellow) you can always squint at [a problem] in the right way… you can usually find a way to express it as a MapReduce…, but sometimes you have to squint at things in a pretty strange way to do this… For example, suppose you want to compute the cross correlation of every single pair of web pages in terms of saying what is the similarity… I can run a pass where I just sort of magnify the input into the cross product of the inputs and then I can apply a function on each pair in there saying how similar it is. Your intermediate data will be quadratic in the size of the input, so you probably don’t want to do it that way. So you’ll have to think a bit more carefully what your intermediate data is in that case… There’s a lot of thinking at the application level if you want to use MapReduce in that scenario. [Emphasis mine]
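The quadratic blow-up Sanjay describes is easy to see in a toy version of the naive approach. The sketch below is plain in-memory Python, not the MapReduce API; the function names are invented, Jaccard word overlap is just a stand-in for whatever similarity function you would really use, and unordered pairs are emitted instead of the full cross product, which is still quadratic.

```python
from itertools import combinations


def map_all_pairs(pages):
    """Naive 'map' step: emit one record per unordered pair of pages.

    For n pages this yields n * (n - 1) / 2 intermediate records.
    """
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(pages.items()), 2):
        yield (id_a, id_b), (text_a, text_b)


def jaccard(text_a, text_b):
    """Word-set overlap, used here as a placeholder similarity measure."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def pairwise_similarity(pages):
    """Apply the similarity function to every emitted pair."""
    return {
        key: jaccard(text_a, text_b)
        for key, (text_a, text_b) in map_all_pairs(pages)
    }
```

Four pages produce 6 pair records, a thousand pages produce 499,500, and a billion pages produce about 5 × 10^17, each carrying two full page texts. That is why the application has to be smarter about which pairs ever get materialized.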
18:14 – (Matt Austern, SW engr) One of the core implementation issues in MapReduce is how you get the intermediate data from the Mappers to the Reducers. Every Mapper writes to every Reducer, and so it ends up making very heavy use of the network… (Interviewer) If you really want to provide a lot of computing, it’s very easy, one would think, to just buy lots more microprocessors… but the issue is communication between them… (Jerry Zhao, SW engr) Communication is not only the limit. How to coordinate the communication channel itself is also an interesting problem.
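The "every Mapper writes to every Reducer" pattern can be shown with a single-process toy word count. This is a sketch of the shuffle idea only, not Google's implementation or any real framework's API; the function names and the CRC32-based partitioner are assumptions.

```python
import zlib
from collections import defaultdict


def map_words(chunk):
    """Map step: emit (word, 1) for every word in an input chunk."""
    for word in chunk.split():
        yield word, 1


def partition(key, num_reducers):
    """Choose a reducer for a key with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_word_count(chunks, num_reducers=2):
    """Run map, shuffle and reduce in memory.

    Returns (counts, links), where links is the set of (mapper, reducer)
    pairs that exchanged at least one record during the shuffle.
    """
    inboxes = [defaultdict(list) for _ in range(num_reducers)]
    links = set()
    for mapper_id, chunk in enumerate(chunks):
        for key, value in map_words(chunk):
            reducer_id = partition(key, num_reducers)
            inboxes[reducer_id][key].append(value)  # shuffle: mapper to reducer
            links.add((mapper_id, reducer_id))
    counts = {}
    for inbox in inboxes:
        for key, values in inbox.items():
            counts[key] = sum(values)  # reduce step
    return counts, links
```

With M mappers and R reducers, `links` can grow to M × R connections, because any mapper may see any key. In a real cluster each of those is a network transfer that has to be scheduled and retried on failure, which is the coordination problem the engineers are describing.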
20:17 – MapReduce was originally designed as a batch processing system for large quantity of data. But we see our users are using MapReduce for relatively small set of data but have very strict latency requirement.
This is probably beside the point, but everyone in the video except maybe Sanjay sounds really scripted and robotic…
Enough complaining about the broken bits of Business Intelligence; it's time to highlight the things that are good and right in the industry. Like most industries, the renewal and innovation occurs at the fringe, beyond the comfort zone of established vendors.
I've created five categories and a catch-all to capture the solutions and companies (not so much technologies) that are leading the next generation of Business Intelligence. The categories are:
- Analyst tools
- Dashboards
- Targeted solutions
- Open-source and free
- Advanced visualizations
- Other stuff
Naturally I've focused on areas of Juice expertise and focus -- not coincidentally, the places where we feel BI has neglected end-users. According to a study by the Business Application Research Center, BI end-user adoption sits at a lowly 8%.
I'm happy to take your suggestions (and update the post) for things I've missed in these categories or for entirely new categories.
Analyst tools
Tools that make it easy for analysts to pull data from multiple sources, analyze, visualize and share it.
Winner: Tableau, the reigning king of visual analytics tools, has added more web-based functionality to allow for online sharing and collaboration.

Runner-up: Good Data has arrived on the market with a web-first platform designed to democratize analytics. I had a chance to get a demo from the management team and was impressed with the ease of use and high-quality data presentation.

Dashboards
"A frequently updated analytical display that is clear and concise" (via a recent post)...and not likely to draw the rage of Stephen Few.
Winner: BonaVista Systems wants to make Excel a "first choice dashboard tool." From the humble position of sparkline plug-in vendor, BonaVista has taken a leadership role in encouraging more effective dashboard design.

Runner-up (tie): Two BI companies, Qlikview and Microstrategy, seem to be following BonaVista's lead. Unfortunately, they may only be dipping in a toe as I found just a couple examples that break from the traditional over-glossy, gauge-riddled dashboard interface.
Targeted solutions
Companies that serve a narrow slice of the BI world extremely well. The desire to be all things to all people has been an Achilles Heel of the BI industry. The general purpose BI platforms often prove too broad and too generic to serve the unique problems of specific industries or functional areas.
Winner: Wall Street on Demand is a brilliant, below-the-radar provider of information solutions to the financial sector. Their sparse, articulate marketing text and few screenshots hint at a company that knows exactly what it does and delivers high-quality BI solutions. I wish I knew more.

Runner-up (multiple): The following are just a few companies that have focused on an industry or functional segment to deliver targeted BI solutions:
- Quantivo for customer behavior analytics
- Visual I|O for pharmaceuticals
- LucidEra for sales pipeline reporting and analytics
Open-source and free
(I know there is a difference.)
Winner: Pentaho offers an open-source end-to-end BI suite that is a competitive alternative to the big guys. Of course, the implementation isn't necessarily cheap or easy.

Runner-up: If anything should scare the BI industry, it is the possibility of a Google Analytics model extended into more general data analysis and visualization tools. Google Fusion Tables may just be the tip of the iceberg.

Advanced visualizations
Bringing leading-edge visualization techniques out of academia and into the business world.
Winner: Many Eyes continues to impress with high-quality visualizations. They are easy to create and clean in design and usability. Impress your boss with a slick visualization in your next presentation.

Runner-up (tie): Openviz / Advanced Visual Systems and Panopticon appear to be the two BI vendors battling it out for leadership in advanced visualization solutions. Unlike Many Eyes, these guys lack Tufte-esque sophistication in infoviz design. That said, there is a big difference between creating a one-off New York Times-quality visualization and delivering a toolset that is re-usable in many different situations.
Other stuff to be admired
Free charts with good default design. InetSoft's Style Chart and Google Charts offer free, embeddable charts.
Jargon-free BI marketing. With few exceptions, BI web sites are densely populated with those awful stock-photography people sitting around conference tables (or worse, the ethnically-diverse V-formation marching at you) and meaningless business jargon and techno-babble. I really appreciate Blink Logic's web site with its straight talk and clean, readable design.
Beyond the desktop. RoamBI has a great-looking iPhone application that is designed to "transform your data into insightful, interactive visualizations delivered to the iPhone." It makes the Oracle and Qlikview iPhone apps look old-school.