Author - Terence Yeoh
- The Aerospace Corporation
Work - Failure Analysis
- Microfocus X-ray System
Scipy tomography - Used array facilities and existing scripts
Electron Tomography - Compute sinogram using scipy array packages
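
As a rough illustration of computing a sinogram with array operations, the sketch below rotates a 2-D slice through a set of angles and sums along one axis for each projection. The phantom, the angle spacing, and the function name `sinogram` are assumptions made for the example, not details from the talk.

```python
import numpy as np
from scipy import ndimage


def sinogram(image, angles_deg):
    """Return an (n_angles, width) array of parallel-beam projections."""
    rows = []
    for angle in angles_deg:
        # Rotate the slice, then sum along the beam direction (axis 0).
        rotated = ndimage.rotate(image, angle, reshape=False, order=1)
        rows.append(rotated.sum(axis=0))
    return np.array(rows)


# Hypothetical test object: a small off-centre square in an empty field.
phantom = np.zeros((64, 64))
phantom[20:30, 35:45] = 1.0

angles = np.arange(0, 180, 3)  # 60 projection angles, 3 degrees apart
sino = sinogram(phantom, angles)
print(sino.shape)
```

Each row of `sino` is one projection; stacking the rows by angle gives the sinogram that a reconstruction step would consume.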
Plasma Transients - Energized states of matter with radically different chemical, electrical, and physical attributes
- have non-linear voltage spikes that appear random
- Need statistical analysis of events
- Tin whisker electrical shorts can develop into sustained plasmas
- When these develop they can kill a spacecraft
- They were tasked with evaluating space shuttle risks
- Initial assessment by NASA indicated low risk
- Testing showed risk much higher
Wavelets - Wavelet decomposition breaks the complex waveform into manageable bites
- freq/time maps
- separate random noise without throwing out noise envelopes
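
To make the decomposition idea concrete, here is a minimal hand-rolled Haar transform in plain Python. It is a sketch only: the waveform is invented, and in practice a library routine such as PyWavelets' `pywt.wavedec` would normally replace these helper functions.

```python
import math


def haar_step(signal):
    """One Haar level: return (approximation, detail) coefficient lists."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return [approx_N, detail_N, ..., detail_1].

    len(signal) must be divisible by 2**levels.
    """
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


# Hypothetical waveform: a flat baseline with one sharp spike at sample 5.
waveform = [0.0] * 16
waveform[5] = 8.0

coeffs = haar_decompose(waveform, levels=3)

# Each detail level is one row of a coarse frequency/time map; the finest
# level (last entry) localises the spike in time.
finest = coeffs[-1]
spike_pair = max(range(len(finest)), key=lambda i: abs(finest[i]))
print(spike_pair)  # index of the sample pair that contains the spike
```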
SciPy's GA code - Used for automated signal processing
- clustering algorithms are necessary for sorting data
Additional Modules - Delegate.py
- PyWavelets
- Psyco
Openmosix - Single system image clustering solution
- Cluster acts as one large multiprocessor system
- Useful for Monte Carlo simulations or GA code
SciPy's Genetic Algorithm code - specify population size, islanding, migrants
- determine fitness function
- iterate over generations, specify exit criteria
- Fitness function is written in python
- They have used this code for automated alignment
- remove jitter from slice to slice
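
The steps above (population size, islands, migrants, a Python fitness function, generations, exit criteria) can be sketched as a toy island-model GA. This is not SciPy's GA interface; every name and parameter here is hypothetical, and the fitness function is a stand-in for an alignment score that peaks at an unknown offset.

```python
import random


def evolve(fitness, n_islands=3, pop_size=20, n_migrants=2,
           max_generations=200, target=0.999, seed=0):
    """Toy island-model GA that maximises fitness(x) for x in [0, 1]."""
    rng = random.Random(seed)
    islands = [[rng.random() for _ in range(pop_size)] for _ in range(n_islands)]
    for generation in range(1, max_generations + 1):
        for i, pop in enumerate(islands):
            # Selection: keep the fitter half as parents.
            parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                child = (a + b) / 2 + rng.gauss(0, 0.05)  # crossover + mutation
                children.append(min(1.0, max(0.0, child)))
            islands[i] = parents + children
        # Migration: each island's best replace the next island's worst (ring).
        for i, pop in enumerate(islands):
            migrants = sorted(pop, key=fitness, reverse=True)[:n_migrants]
            neighbour = islands[(i + 1) % n_islands]
            neighbour.sort(key=fitness)  # worst individuals first
            neighbour[:n_migrants] = migrants
        best = max((x for pop in islands for x in pop), key=fitness)
        if fitness(best) >= target:  # exit criterion
            break
    return best, generation


def fitness(offset):
    """Hypothetical alignment score: highest when the offset is 0.3."""
    return 1.0 - abs(offset - 0.3)


best_offset, generations = evolve(fitness)
print(round(best_offset, 3), generations)
```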
Psyco - A specializing compiler for python
- Useful for accelerating scipy's fitness function
- speedups come at the cost of memory space
- compiled functions do not release memory until after the program is complete
PyWavelets - all popular wavelets supported
- supports all useful transforms
Plasma Transients - inherently random events
- best suited for wavelet analysis
- Neural networks did not seem to work
- They want to do arbitrary waveform analysis
- take plasmas regardless of media and find the most important features
Outline - regulatory genomics intro
- a bioinformatics solution
- project maintenance
- sociological considerations
- future plans
You can build two bodies out of the same types of material - Same materials, different architecture
- regulatory regions control gene expression
Looking for segments of DNA - regulatory regions
- binding sites
- seem impossible to find
- genomes are big, regulatory regions are small
- no obvious statistical signature unlike protein coding regions
- no good way to test predictions without doing experiments
- slow, expensive, difficult
Comparative sequence analysis works well - look for regions conserved between two or more species
- can use very simple sequence matching algorithm
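
A very simple matching algorithm of the kind mentioned above might compare every fixed-width window of one sequence against every window of the other and keep the pairs that are nearly identical. The sketch below assumes ungapped windows and a 90% identity threshold; the sequences and the function name are invented for illustration.

```python
def conserved_windows(seq_a, seq_b, window=10, min_matches=9):
    """Return (i, j, matches) for every ungapped window pair with at
    least min_matches identical positions out of `window`."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        wa = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            wb = seq_b[j:j + window]
            matches = sum(a == b for a, b in zip(wa, wb))
            if matches >= min_matches:
                hits.append((i, j, matches))
    return hits


# Hypothetical sequences sharing one conserved block ("GATTACAGGC")
# embedded in otherwise unrelated flanking sequence.
species_a = "TTTTTTTT" + "GATTACAGGC" + "AAAAAAAA"
species_b = "CCCCC" + "GATTACAGGC" + "TGTGTGTG"

hits = conserved_windows(species_a, species_b)
for i, j, matches in hits:
    print(i, j, matches)
```

The quadratic scan is fine for short regions; plotting the hits as (i, j) points gives the dot-plot style view that makes conserved blocks visible by eye.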
FamilyRelations - Use for displaying data
- correlations done by eye
Cartwheel - A system that lets biologists:
- establish sequence analyses with custom params
- run them on someone else's computer server
- visualize and interact with the result via a client GUI
- Aimed at bench biologists
- Intended to be extensible
- all open source
- use python, linux, Quixote (web server)
Motif searching is the cutting edge
Galaxy - http://g2.bx.psu.edu
- an open-source framework for integrating tools and databases into a workspace
- a web-based service integrating many popular tools for comparative genomics
How Galaxy integrates existing command line tools - HTML inputs generated from abstract parameter description
- tool help generated from a simple text format
- template for generating command line from parameter values
- output datasets generated by tool
- functional tests to be run with the full stack in place
- running functional tests for a specific tool on the command line
- test results on the command line and in an HTML report
- template language for building complex operations
- conditional groups, grouping constructs can be nested
- describe interface at a very high level
- command line tool expects a configuration file
- configuration file is generated based upon user input
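
The integration pattern above (abstract parameter description, generated HTML inputs, command line built from a template) can be sketched in a few lines. The dictionary layout, the tool name `seqfilter`, and both helper functions are hypothetical; this is not Galaxy's actual tool format.

```python
import shlex
from string import Template

# Hypothetical abstract description of a tool; not Galaxy's real schema.
tool = {
    "command": Template(
        "seqfilter --input $input --min-length $min_length --out $output"
    ),
    "params": {
        "input": {"type": "data", "label": "Sequences"},
        "min_length": {"type": "integer", "label": "Minimum length", "default": 50},
    },
}


def html_inputs(params):
    """Generate one HTML form field per abstract parameter."""
    fields = []
    for name, spec in params.items():
        kind = "number" if spec["type"] == "integer" else "text"
        value = spec.get("default", "")
        fields.append(
            f'<label>{spec["label"]} '
            f'<input type="{kind}" name="{name}" value="{value}"></label>'
        )
    return "\n".join(fields)


def command_line(tool, values, output):
    """Fill the command template, shell-quoting every user-supplied value."""
    quoted = {key: shlex.quote(str(val)) for key, val in values.items()}
    return tool["command"].substitute(quoted, output=shlex.quote(output))


form = html_inputs(tool["params"])
cmd = command_line(tool, {"input": "reads.fa", "min_length": 75}, "filtered.fa")
print(form)
print(cmd)
```

The same substitution step could write a configuration file instead of a command line for tools that expect one.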
Job execution in Galaxy - flexible execution environment
- jobs are submitted as JobRunner
- Pluggable queuing policies
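
One way to picture a pluggable queuing policy is a queue that delegates the choice of the next job to an interchangeable policy object. The class names below are invented for this sketch and do not correspond to Galaxy's own classes.

```python
class Job:
    def __init__(self, job_id, user):
        self.job_id = job_id
        self.user = user


class FifoPolicy:
    """Dispatch jobs strictly in submission order."""

    def pick(self, pending):
        return pending[0]


class FairSharePolicy:
    """Prefer the user who has had the fewest jobs dispatched so far."""

    def __init__(self):
        self.dispatched = {}

    def pick(self, pending):
        job = min(pending, key=lambda j: self.dispatched.get(j.user, 0))
        self.dispatched[job.user] = self.dispatched.get(job.user, 0) + 1
        return job


class JobQueue:
    def __init__(self, policy):
        self.policy = policy  # any object with a pick(pending) method
        self.pending = []

    def submit(self, job):
        self.pending.append(job)

    def dispatch(self):
        job = self.policy.pick(self.pending)
        self.pending.remove(job)
        return job


def run_order(policy):
    queue = JobQueue(policy)
    for job_id, user in [("a1", "alice"), ("a2", "alice"),
                         ("a3", "alice"), ("b1", "bob")]:
        queue.submit(Job(job_id, user))
    return [queue.dispatch().job_id for _ in range(4)]


fifo_order = run_order(FifoPolicy())
fair_order = run_order(FairSharePolicy())
print(fifo_order, fair_order)
```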
Metagenomics - mapping "reads" onto protein databases
- Reliable Virtual Resources
- Python framework for parallel and distributed programming
- prototype parallel programming systems
- River Core
- Discovery
- Process naming and creation
- message passing
- state management
- River Extensions
- RPC/RMI, Trickle, MPI, MapReduce
- Benefits
- small, easy to use core interface
- written in python
- dynamic typing for rapid prototyping
- python goodies
- heterogeneous (Python as VM)
- state capture at language VM level
- integrated check point migration
- Motivation
- parallel programming is still hard
- the future: more cores, larger clusters
- apps will have to utilize multiple processors
- apps will have to tolerate failures
- find the next set of programming models
- incrementally improve current models
River Goals - Extend python's rapid development capabilities to parallel systems
- facilitate short design/development cycle
River Concepts - Virtual machines (VMs)
- Virtual resources (VRs)
- Named with UUIDs
- code, data, thread, and message queue
- discover/allocate/deploy
- flexible code execution
- super flexible messaging
- Only need to specify a destination
- pass arbitrary attributes and values
- must be able to be pickled
- state management
- designed from the beginning
- encapsulate local state in VR
- only hooks to outside UUIDs
- per-VR queues hold in-transit messages
- Transparent migration and check pointing
- internal and external support
- Coordinated checkpointing
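
The messaging model described above (UUID-named resources, per-VR queues, free-form attributes that must be picklable, receive by matching) can be mimicked in a single process. This toy is an assumption-laden sketch: `ToyVR`, `REGISTRY`, and the method names are invented and are not River's API.

```python
import pickle
import uuid

REGISTRY = {}  # stand-in for the network: maps UUID -> ToyVR


class ToyVR:
    def __init__(self):
        self.uuid = str(uuid.uuid4())  # VRs are named with UUIDs
        self.queue = []                # per-VR queue of in-transit messages
        REGISTRY[self.uuid] = self

    def send(self, dest, **attrs):
        # Only the destination is required; all other attributes are free-form.
        message = dict(attrs, src=self.uuid, dest=dest)
        wire = pickle.dumps(message)   # raises if a value cannot be pickled
        REGISTRY[dest].queue.append(pickle.loads(wire))

    def recv(self, **pattern):
        """Return the first queued message whose attributes match `pattern`,
        or None if no queued message matches."""
        for index, message in enumerate(self.queue):
            if all(message.get(k) == v for k, v in pattern.items()):
                return self.queue.pop(index)
        return None


a, b = ToyVR(), ToyVR()
a.send(b.uuid, tag="log", text="starting")
a.send(b.uuid, tag="result", value=42)

# Matching lets the receiver pull the message it wants, out of arrival order.
msg = b.recv(tag="result")
print(msg["value"], len(b.queue))
```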
River Implementation - One net thread, one control VR
- control handles VR create, state
- TCP-based, connection caching (scalable)
- broadcast-based discovery
- super flexible messaging
- queue matching
- serialization (Pickle)
State Implementation - Use stackless python
- CPython without the execution stack
- Keep soft VR state separate from hard VR state
- two VR classes: VR and VRI (internal)
- VR has a reference to host VRI
- Stackless: run VR as a tasklet in a VRI thread
- generate atomic system calls (VRI calls)
- state capture: unlink VRI reference from VR
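
The soft/hard split can be illustrated without Stackless: keep movable application state in one object, host-bound resources in another, and drop the link between them before serialising. The sketch below captures data only, not a running execution stack (which is what Stackless tasklets add), and all class and method names are hypothetical.

```python
import pickle


class HostVRI:
    """Hard state: bound to one machine (sockets, threads, and so on)."""

    def __init__(self, hostname):
        self.hostname = hostname
        self.connection = object()  # stand-in for an unpicklable resource


class SoftVR:
    """Soft state: application data that may move between hosts."""

    def __init__(self, host):
        self.host = host            # reference to the hosting VRI
        self.state = {"counter": 0}

    def work(self):
        self.state["counter"] += 1

    def capture(self):
        """Unlink the host reference, then serialise only the soft state."""
        self.host = None
        return pickle.dumps(self.state)

    @classmethod
    def restore(cls, blob, host):
        """Rebuild the VR on a (possibly different) host from a checkpoint."""
        vr = cls(host)
        vr.state = pickle.loads(blob)
        return vr


vr = SoftVR(HostVRI("node-a"))
for _ in range(3):
    vr.work()

checkpoint = vr.capture()                                 # checkpoint on node-a
migrated = SoftVR.restore(checkpoint, HostVRI("node-b"))  # resume on node-b
migrated.work()
print(migrated.host.hostname, migrated.state["counter"])
```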
Remote Invocation - remote access and invocation (RAI)
- RPC, RMI, and remote data access
- Create and access functions, objects, data on remote VRs
- built on top of the River core
Trickle - simple task farming language
- put code/data on remote VMs
- execute sequentially or in parallel
- parallel invocation
- fork/join paradigm
- dynamic scheduling
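
The fork/join pattern with dynamic scheduling can be shown with the standard library's `concurrent.futures` as a stand-in; this is not Trickle's syntax, and it runs on local threads rather than remote VMs. The task function and input chunks are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def word_count(text):
    """The task each worker runs on its chunk of data."""
    return len(text.split())


chunks = ["a b c", "d e", "f g h i", "j"]
results = [None] * len(chunks)

with ThreadPoolExecutor(max_workers=2) as pool:
    # Fork: submit every task. An idle worker takes the next pending task,
    # so the assignment of tasks to workers is decided at run time.
    futures = {pool.submit(word_count, chunk): i for i, chunk in enumerate(chunks)}
    # Join: collect results as they finish, in whatever order that is.
    for future in as_completed(futures):
        results[futures[future]] = future.result()

total = sum(results)
print(results, total)
```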
River MPI (rMPI) - partial implementation of MPI 1.2 in River
- most p-to-p and collectives
- easy to read and understand
- modeled on the C MPI interface
- experiment with different algorithms
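
As an example of the kind of algorithm experiment a readable implementation makes easy, the sketch below compares two ways of performing a reduce collective, simulated in one process. It makes no MPI calls and is not rMPI code; it only counts the sequential communication steps each strategy would need.

```python
import operator


def linear_reduce(values, op):
    """Root combines values from every other rank in turn: n - 1 steps."""
    acc, steps = values[0], 0
    for value in values[1:]:
        acc = op(acc, value)
        steps += 1
    return acc, steps


def tree_reduce(values, op):
    """Pairs of ranks combine in parallel each round: ceil(log2 n) rounds."""
    vals, rounds = list(values), 0
    while len(vals) > 1:
        paired = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])  # the odd rank out waits for the next round
        vals, rounds = paired, rounds + 1
    return vals[0], rounds


# Pretend each of 8 ranks contributes one value.
rank_values = list(range(1, 9))

linear_total, linear_steps = linear_reduce(rank_values, operator.add)
tree_total, tree_rounds = tree_reduce(rank_values, operator.add)
print(linear_total, linear_steps, tree_total, tree_rounds)
```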
machine learning - task
- supervised learning
- use labels to help the learning algorithm
- does not need to deal with large data sets
- does need to deal with large data structures
Topics - Density Functional Theory calculations
- Scipy nonlinear solvers
- Finite Element Method using python-petsc and libmesh
- SymPy - the symbolic manipulation package in Python
Slides on the scipy web site
Exploring Scientific Algorithms - Scientists often know how to model their problems in software
- It's exploring the algorithms that is hard
- Need to make the exploration easier
- Enthought has more ideas than solutions
- Use Traits to make algorithms interactive
- complex problems, simple algorithms
- oil drilling
- only have seismic data to work with and what geoscientists have told them
- given physics for every type of area.
- simple formulas for each area
- combine simple formulas to build up algorithms
- Have created a stochastic modeling tool
- Approach to providing a better analysis
- code blocks are sets of executable instructions
- blocks are executed in contexts
- contexts with events
- events fire when data changes
- interacting with a variable
- dependency analysis
- how does a variable change affect the model
- create a "shadow" context to refer back to the original context for all static values
- function context accepts only functions
- data context accepts everything else
- prevents data context from getting cluttered
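
The block/context ideas above can be sketched with plain Python: a dict subclass that fires events when a value changes, a code block re-executed when one of its inputs changes, and a shadow context built with `collections.ChainMap`. This does not use Traits or any Enthought API; the formula, variable names, and input set are assumptions for illustration.

```python
from collections import ChainMap


class EventContext(dict):
    """A namespace that notifies listeners whenever a value changes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.listeners = []

    def __setitem__(self, name, value):
        changed = name not in self or self[name] != value
        super().__setitem__(name, value)
        if changed:
            for listener in self.listeners:
                listener(name)


# A "code block": executable instructions plus the inputs they depend on.
# In a real tool the input set would come from dependency analysis.
BLOCK = "pore_volume = thickness * area * porosity"
BLOCK_INPUTS = {"thickness", "area", "porosity"}


def run_block(context):
    """Execute the block with `context` as its namespace."""
    exec(BLOCK, {}, context)


data = EventContext(thickness=10.0, area=2.0, porosity=0.5)


def rerun_on_input_change(name):
    # Only re-execute when a variable the block actually reads has changed.
    if name in BLOCK_INPUTS:
        run_block(data)


data.listeners.append(rerun_on_input_change)
run_block(data)
first = data["pore_volume"]

data["porosity"] = 0.25          # event fires, block re-runs
second = data["pore_volume"]

# Shadow context: overrides live in the front dict, static values are
# looked up in the original context, which is left untouched.
shadow = ChainMap({}, data)
shadow["porosity"] = 0.125
run_block(shadow)
print(first, second, shadow["pore_volume"], data["pore_volume"])
```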
Text mining - treat text as data
- text in the sense of literature
- gaining prominence
- commercial tools exist
Research Informatics within Biogen Idec - I want to know everything about gene xyz
- name is a central identifier and search key but
- names are really complicated
- origins in other species, environments
- biologist's sense of humor
- duplicate common words
- ambiguous/nonlinear
- small differences have a large impact on meaning
- problem with simple lexical searches
Concrete Problem Statement - Offer the capability to automate the expansion of a scientist's literature search: increase recall while maintaining precision.
- Develop good set of synonymous protein names
- extract all citations
- extract articles and abstracts
- do fuzzy search
- count up and rate the matches
- Couldn't do this with simple regex searching
- Developed a multiple-pass approach
- index source text
- search text
- tabulate
- This is a parallel problem
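
The multiple-pass approach above (index the source text, fuzzy-search it for synonyms, tabulate the matches) might look roughly like this with the standard library's `difflib`. The synonym list, abstracts, identifiers, and similarity cutoff are all invented for the sketch; the speakers' actual implementation is not described in these notes.

```python
import difflib
import re
from collections import Counter

# Hypothetical synonym list for one protein; real lists would be curated.
SYNONYMS = ["tumor necrosis factor", "TNF-alpha", "cachectin"]

# Hypothetical abstracts keyed by made-up citation identifiers.
ABSTRACTS = {
    "cit-1": "Tumour necrosis factor levels rose after treatment.",
    "cit-2": "We measured TNF alpha and IL-6 in serum.",
    "cit-3": "Kinase activity was unaffected.",
}


def normalise(text):
    """Lower-case and collapse punctuation so 'TNF-alpha' ~ 'TNF alpha'."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def index_text(doc):
    """Pass 1: index every 1- to 3-word phrase in the document."""
    words = normalise(doc).split()
    return {" ".join(words[i:i + n])
            for n in (1, 2, 3) for i in range(len(words) - n + 1)}


def search(phrases, synonyms, cutoff=0.85):
    """Pass 2: fuzzy-match each synonym against the indexed phrases."""
    hits = Counter()
    for synonym in synonyms:
        target = normalise(synonym)
        for phrase in phrases:
            if difflib.SequenceMatcher(None, target, phrase).ratio() >= cutoff:
                hits[synonym] += 1
    return hits


# Pass 3: tabulate matches per citation.
table = {cit: search(index_text(text), SYNONYMS) for cit, text in ABSTRACTS.items()}
for cit, hits in table.items():
    print(cit, dict(hits))
```

Each citation is processed independently of the others, which is why the work divides naturally across processors.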