SciPy Conference 2007 -- Day 2
Last edited August 17, 2007
More by Christopher Hanley »
"Evolving Wavelets using Scipy: Analysis of Controlled Plasma Transients"

Author 
  •   Terence Yeoh
  • The Aerospace Corporation
Work 
  •  Failure Analysis
  • Microfocus X-ray System
    • Computed tomography
Scipy tomography 
  •  Used array facilities and existing scripts
Electron Tomography 
  •  Compute sinogram using scipy array packages
Plasma Transients 
  •  Energized states of matter with radically different chemical, electrical, and physical attributes
  • have non-linear voltage spikes that appear random
  • Need statistical analysis of events
  • Tin whisker electrical shorts can develop into sustained plasmas
    • When these develop they can kill a spacecraft
    • They were tasked with evaluating space shuttle risks
      • Initial assessment by NASA indicated low risk
      • Testing showed risk much higher
Wavelets 
  •  Wavelet decomposition breaks the complex waveform into manageable bites
    • freq/time maps
    • separate random noise without throwing out noise envelopes
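The decomposition idea can be sketched with the simplest wavelet, the Haar wavelet, in plain Python. This is only an illustration under stated assumptions: the talk used PyWavelets, whose API is not shown here, and the 16-sample trace with a single spike is made-up stand-in data, not a measured plasma transient.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(samples):
    """One Haar level: pairwise sums (approximation) and differences (detail)."""
    if len(samples) % 2:
        raise ValueError("sample count must be even")
    pairs = [(samples[i], samples[i + 1]) for i in range(0, len(samples), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail

def haar_decompose(samples, levels):
    """Return [approx_L, detail_L, ..., detail_1], coarsest band first."""
    approx, details = list(samples), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]

def haar_reconstruct(coeffs):
    """Invert haar_decompose by undoing one level at a time."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt.extend(((a + d) / SQRT2, (a - d) / SQRT2))
        approx = rebuilt
    return approx

# Made-up trace: flat baseline with one narrow spike at sample 5.
trace = [0.0] * 16
trace[5] = 8.0

coeffs = haar_decompose(trace, levels=3)
finest = coeffs[-1]  # highest-frequency band, one coefficient per sample pair
spike_pair = max(range(len(finest)), key=lambda i: abs(finest[i]))
```

Each band keeps its position in time, so the spike stays localized (here in the sample pair `spike_pair`) instead of being smeared across a whole spectrum, which is what makes the freq/time map useful for transient events.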
SciPy's GA code 
  •  Used for automated signal processing
  • clustering algorithms are necessary for sorting data
Additional Modules 
  •  Delegate.py
    • forking code
  • PyWavelets
    • wavelet transforms
  • Psyco
    • "JIT" compiled Code
Openmosix 
  •  Single system image clustering solution
  • Cluster acts as one large multiprocessor system
  • Useful for Monte Carlo simulations or GA code
SciPy's Genetic Algorithm code 
  •  specify population size, islanding, migrants
  • determine fitness function
  • iterate over generations, specify exit criteria
  • Fitness function is written in python
  • They have used this code for automated alignment
    • remove jitter from slice to slice
    • used to be done by hand
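The workflow listed above (population, fitness function written in Python, generations, exit criteria) can be sketched as a small standalone genetic algorithm. This is not SciPy's GA module or its API. The image, the search range, and every name below are assumptions made for illustration, and islanding and migrants are not modeled. The toy task mirrors the alignment use case: recover the (dx, dy) jitter between a reference slice and a shifted copy.

```python
import math
import random

N = 16          # slice is N x N with wrap-around (illustrative size)
MAX_SHIFT = 4   # each gene is searched in [-MAX_SHIFT, MAX_SHIFT]

def make_reference():
    """Smooth synthetic slice so the fitness landscape has a single peak."""
    return [[math.cos(2 * math.pi * x / N) + math.cos(2 * math.pi * y / N)
             for x in range(N)] for y in range(N)]

def shift(image, dx, dy):
    """Circularly shift an image by (dx, dy) pixels."""
    return [[image[(y - dy) % N][(x - dx) % N] for x in range(N)]
            for y in range(N)]

def make_fitness(reference, jittered):
    def fitness(individual):
        dx, dy = individual
        restored = shift(jittered, -dx, -dy)  # undo the candidate jitter
        # Negative squared error: 0 is a perfect alignment.
        return -sum((restored[y][x] - reference[y][x]) ** 2
                    for y in range(N) for x in range(N))
    return fitness

def evolve(fitness, pop_size=20, generations=100, mutation_rate=0.5, seed=0):
    rng = random.Random(seed)
    pop = [(rng.randint(-MAX_SHIFT, MAX_SHIFT), rng.randint(-MAX_SHIFT, MAX_SHIFT))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        if fitness(ranked[0]) > -1e-9:       # exit criterion: aligned
            break
        parents = ranked[:pop_size // 2]     # truncation selection
        children = [ranked[0]]               # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [a[0], b[1]]             # crossover: dx from a, dy from b
            if rng.random() < mutation_rate:
                gene = rng.randrange(2)
                step = rng.choice((-1, 1))
                child[gene] = max(-MAX_SHIFT, min(MAX_SHIFT, child[gene] + step))
            children.append(tuple(child))
        pop = children
    return max(pop, key=fitness)

reference = make_reference()
jittered = shift(reference, 3, -2)           # made-up jitter to recover
best = evolve(make_fitness(reference, jittered))
```

The fitness function is the only problem-specific part, which is why writing it in Python is convenient and also why it dominates the run time.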
Psyco 
  •  A specializing compiler for python
  • Useful for accelerating scipy's fitness function
  • speedups come at the cost of memory space
  • compiled functions do not release memory until the program exits
PyWavelets 
  •  all popular wavelets supported
  • supports all useful transforms
Plasma Transients 
  • inherently random events
  • best suited for wavelet analysis
  • Neural networks did not seem to work
  • They want to do arbitrary waveform analysis
    • take plasmas regardless of media and find the most important features
"The Cartwheel Project: Python tools for regulatory genomics"

Author 
Outline 
  • regulatory genomics intro
  • a bioinformatics solution
  • project maintenance
  • sociological considerations
  • future plans
You can build two bodies out of the same types of material 
  • Same materials, different architecture
  • regulatory regions control gene expression
Looking for segments of DNA 
  • regulatory regions
  • binding sites
    • seem impossible to find
    • genomes are big, regulatory regions are small
    • no obvious statistical signature unlike protein coding regions
    • no good way to test predictions without doing experiments
      • slow, expensive, difficult
Comparative sequence analysis works well 
  •  look for regions conserved between two or more species
  • can use very simple sequence matching algorithm
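A very simple matching algorithm of the kind mentioned above can be sketched as a sliding-window comparison. This is an illustrative assumption, not Cartwheel's or FamilyRelations' actual code; the two sequences, the window size, and the threshold are made up.

```python
def conserved_windows(seq_a, seq_b, window=8, min_identity=0.9):
    """Return (i, j, identity) for every window pair meeting the threshold.

    i and j are start offsets in seq_a and seq_b; identity is the fraction
    of matching positions inside the window.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            win_b = seq_b[j:j + window]
            matches = sum(1 for x, y in zip(win_a, win_b) if x == y)
            identity = matches / window
            if identity >= min_identity:
                hits.append((i, j, identity))
    return hits

# Made-up sequences sharing the 10-base segment CATGCAAATG.
species_a = "TTGACCATGCAAATGGATCCA"
species_b = "CGCATGCAAATGCTTAGG"

exact_hits = conserved_windows(species_a, species_b, window=8, min_identity=1.0)
```

The comparison is quadratic in sequence length, so it suits a server-side job rather than a bench biologist's desktop. Lowering `min_identity` trades specificity for tolerance of point mutations.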
FamilyRelations 
  •  Use for displaying data
  • correlations done by eye
Cartwheel 
  •  A system that lets biologists:
    • establish sequence analyses with custom params
    • run them on someone else's computer server
    • visualize and interact with the result via a client GUI
  • Aimed at bench biologists
  • Intended to be extensible
  • all open source
    • use python, linux, Quixote (web server)
 Motif searching is the cutting edge
"The Galaxy platform for accessible and reproducible scientific research"

Author 
  •   James Taylor
  • NYU
Galaxy
  •  http://g2.bx.psu.edu
  • an open-source framework for integrating tools and databases into a workspace
  • a web-based service integrating many popular tools for comparative genomics
How Galaxy integrates existing command line tools 
  •  HTML inputs generated from abstract parameter description
    • user doesn't write HTML
  • tool help generated from a simple text format
  • template for generating command line from parameter values
  • output datasets generated by tool
  • functional tests to be run with the full stack in place
    • running functional tests for a specific tool on the command line
    • test results on the command line and in an HTML report
  • template language for building complex operations
    • conditional groups, grouping constructs can be nested
    • describe interface at a very high level
  • command line tool expects a configuration file
    • configuration file is generated based upon user input
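The pattern described above (one abstract parameter description driving both the HTML form and the command line) can be sketched with the standard library. This is not Galaxy's tool configuration format or template language; the tool name, parameter names, and dictionary layout are hypothetical.

```python
import html
import shlex
from string import Template

# Hypothetical tool description: parameters plus a command-line template.
TOOL = {
    "name": "filter_intervals",
    "params": [
        {"name": "input", "type": "data", "label": "Input intervals"},
        {"name": "min_length", "type": "integer", "label": "Minimum length",
         "default": 100},
    ],
    "command": Template(
        "filter_intervals.py --input $input --min-length $min_length --output $output"
    ),
}

def render_form(tool):
    """Generate HTML inputs from the parameter description."""
    rows = []
    for param in tool["params"]:
        input_type = "number" if param["type"] == "integer" else "text"
        rows.append(
            '<label>{} <input type="{}" name="{}" value="{}"></label>'.format(
                html.escape(param["label"]),
                input_type,
                html.escape(param["name"]),
                html.escape(str(param.get("default", ""))),
            )
        )
    return "<form>\n" + "\n".join(rows) + "\n</form>"

def build_command(tool, values, output_path):
    """Fill the command template from user-supplied values."""
    args = {}
    for param in tool["params"]:
        raw = values.get(param["name"], param.get("default"))
        if raw is None:
            raise ValueError("missing parameter: " + param["name"])
        if param["type"] == "integer":
            raw = int(raw)                   # validate before it reaches a shell
        args[param["name"]] = shlex.quote(str(raw))
    args["output"] = shlex.quote(output_path)
    return tool["command"].substitute(args)

form = render_form(TOOL)
command = build_command(TOOL, {"input": "my data.bed", "min_length": "250"}, "out.bed")
```

Because the form and the command both come from `TOOL`, the tool author never writes HTML, and a functional test can call `build_command` with fixed values and compare the resulting output dataset.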
Job execution in Galaxy 
  •  flexible execution environment
  • jobs are submitted to a JobRunner
  • Pluggable queuing policies
Phylogenomic tools 
Metagenomics 
  •  mapping "reads" onto protein databases
    • need to do much faster
"River: A Foundation for the Rapid Development of Reliable Parallel Programming Systems"

Authors 
  • RIVER 
  •  Reliable Virtual Resources
  • Python framework for parallel and distributed programming
    • prototype parallel programming systems
  • River Core
    • Discovery
    • Process naming and creation
    • message passing
    • state management
  • River Extensions
    • RPC/RMI, Trickle, MPI, MapReduce
  • Benefits
    • small, easy to use core interface
    • written in python
    • dynamic typing for rapid prototyping
    • python goodies
      • heterogeneous (Python as VM)
      • state capture at language VM level
        • integrated check point migration
  • Motivation
    • parallel programming is still hard
    • the future: more cores, larger clusters
    • apps will have to utilize multiple processors
    • apps will have to tolerate failures
    • find the next set of programming models
    • incrementally improve current models
River Goals 
  • Extend python's rapid development capabilities to parallel systems
  • facilitate short design/development cycle
River Concepts 
  •  Virtual machines (VMs)
    • Python + River Core
  • Virtual resources (VRs)
    • Named with UUIDs
    • code, data, thread, and message queue
  • discover/allocate/deploy
  • flexible code execution
  • super flexible messaging
    • Only need to specify a destination
    • pass arbitrary attributes and values
      • must be able to be pickled
  • state management
    • designed from the beginning
    • encapsulate local state in VR
      • only hooks to outside UUIDs
    • per-VR queues hold in-transit messages
    • Transparent migration and check pointing
    • internal and external support
  • Coordinated checkpointing
    • algorithm is extensible
River Implementation 
  •  One net thread, one control VR
    • control handles VR create, state
  • TCP-based, connection caching (scalable)
  • broadcast-based discovery
  • super flexible messaging
    • queue matching
    • serialization (Pickle)
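The messaging model in the notes (specify only a destination, attach arbitrary picklable attributes, match against a per-VR queue) can be sketched as follows. This is not River's API: the `Mailbox` class, the `send` function, and the in-memory `network` dictionary are hypothetical stand-ins, and the receive call returns `None` instead of blocking.

```python
import pickle
import uuid
from collections import deque

class Mailbox:
    """Per-VR queue of delivered messages (illustrative stand-in)."""

    def __init__(self):
        self.uuid = uuid.uuid4().hex   # VRs are named with UUIDs
        self._queue = deque()

    def deliver(self, wire_bytes):
        # Messages arrive serialized; unpickle them into attribute dicts.
        self._queue.append(pickle.loads(wire_bytes))

    def recv(self, **pattern):
        """Remove and return the first message whose attributes match pattern."""
        for i, message in enumerate(self._queue):
            if all(message.get(key) == value for key, value in pattern.items()):
                del self._queue[i]
                return message
        return None   # a real system would block here instead

def send(network, dest, **attributes):
    """Send arbitrary attributes to dest; every value must be picklable."""
    message = dict(attributes, dest=dest)
    network[dest].deliver(pickle.dumps(message))

worker = Mailbox()
network = {worker.uuid: worker}   # stands in for discovery plus TCP transport

send(network, worker.uuid, tag="result", value=42)
send(network, worker.uuid, tag="status", text="ok")

status = worker.recv(tag="status")   # matched out of arrival order
result = worker.recv(tag="result")
```

Matching on attributes rather than on arrival order lets a receiver pull out the message it is waiting for while leaving unrelated messages queued, which is also what makes the queue contents easy to capture as part of a checkpoint.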
 State Implementation
  • Use stackless python
    • CPython without the execution stack
  • Keep soft VR state separate from hard VR state
    • two VR classes: VR and VRI (internal)
    • VR has a reference to host VRI
  • Stackless: run VR as a tasklet in a VRI thread
    • generate atomic system calls (VRI calls)
    • state capture: unlink VRI reference from VR

Remote Invocation 
  •  remote access and invocation (RAI)
    • RPC, RMI, and remote data access
  • Create and access functions, objects, data on remote VRs
  • built on top of the River core
Trickle 
  •  simple task farming language
  • put code/data on remote VMs
  • execute sequentially or in parallel
  • parallel invocation
    • fork/join paradigm
    • dynamic scheduling
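The fork/join paradigm with dynamic scheduling can be sketched with the standard library's thread pool. This is not Trickle's syntax or API, and local threads stand in for remote VMs; the function name `fork_join` is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fork_join(func, work_items, workers=4):
    """Fork one task per item, then join and return results in input order."""
    items = list(work_items)
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: each free worker picks up the next pending task.
        futures = {pool.submit(func, item): index
                   for index, item in enumerate(items)}
        # Join: collect results as they finish, in whatever order that is.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [results[index] for index in range(len(items))]

squares = fork_join(lambda n: n * n, [1, 2, 3, 4, 5], workers=2)
```

Submitting more tasks than workers is what gives dynamic scheduling: a worker that finishes early takes the next task, so uneven task lengths do not leave workers idle.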
River MPI (rMPI) 
  •  partial implementation of MPI 1.2 in River
  • most p-to-p and collectives
  • easy to read and understand
  • modeled on the C MPI interface
  • experiment with different algorithms
"Elefant (Efficient Learning, Large-scale, Inference, and Optimization Toolkit"

Author 
machine learning 
  •  task
    • training
      • use training data sets
    • testing
      • use test data sets
  • supervised learning
    • use labels to help the learning algorithm
  • does not need to deal with large data sets
  • does need to deal with large data structures
"Using Python for electronic structure calculations, nonlinear solvers, FEM and symbolic manipulatio

Author 
  •   Ondrej Certik
 Topics
  • Density Functional Theory calculations
  • Scipy nonlinear solvers
  • Finite Element Method using python-petsc and libmesh
  • SymPy - the symbolic manipulation package in Python
Slides on the scipy web site
"The Block Canvas and Contexts: A rapid approach to developing scientific workflows"

Author 
  • Eric Jones
Exploring Scientific Algorithms 
  •  Scientists often know how to model their problems in software
  • It's exploring the algorithms that is hard
  • Need to make the exploration easier
  • Enthought has more ideas than solutions
    • Use Traits to make algorithms interactive
  • complex problems, simple algorithms
    • oil drilling
      • only have seismic data to work with and what geoscientists have told them
      • given physics for every type of area.
        • simple formulas for each area
      • combine simple formulas to build up algorithms
      • Have created a stochastic modeling tool
  • Approach to providing a better analysis
    • code blocks are sets of executable instructions
    • blocks are executed in contexts
    • contexts with events
      • events fire when data changes
    • interacting with a variable
      • dependency analysis
        • how does a variable change affect the model
    • create a "shadow" context to refer back to the original context for all static values
    • function context accepts only functions
    • data context accepts everything else
      • prevents data context from getting cluttered
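The block and context ideas above can be sketched in plain Python: a code block is a string of statements, the function context is the global namespace, and the data context is a mapping that fires events when a value changes. This is not Enthought's implementation. The class name, the `slow_down` formula, and the numbers are hypothetical, and `collections.ChainMap` is used here as one possible way to build a shadow context.

```python
from collections import ChainMap

class DataContext(dict):
    """Namespace that notifies listeners when a value changes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.listeners = []

    def __setitem__(self, name, value):
        changed = name not in self or self[name] != value
        super().__setitem__(name, value)
        if changed:
            for listener in self.listeners:
                listener(name, value)

def run_block(source, functions, data):
    """Execute a code block: functions are globals, data is the local namespace."""
    exec(compile(source, "<block>", "exec"), functions, data)

# Function context holds only callables (hypothetical formula).
functions = {"slow_down": lambda velocity, porosity: velocity * (1.0 - porosity)}

# Data context holds everything else.
data = DataContext(base_velocity=4000.0, porosity=0.25)
events = []
data.listeners.append(lambda name, value: events.append((name, value)))

BLOCK = "velocity = slow_down(base_velocity, porosity)\n"
run_block(BLOCK, functions, data)

# Shadow context: overrides one input, reads the rest from the original.
shadow = ChainMap(DataContext(porosity=0.5), data)
run_block(BLOCK, functions, shadow)
```

Running the same block in the shadow context writes `velocity` into the shadow's own layer, so a what-if run does not disturb the original context. Re-running a block from inside a change listener is the hook that would make a variable interactive.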
"Building a Protein Name Thesaurus"

Author 
  •   Bill Smith
Text mining 
  •  treat text as data
  • text in the sense of literature
  • gaining prominence
  • commercial tools include tools
Research Informatics within Biogen Idec 
  •  I want to know everything about gene xyz
  • name is a central identifier and search key but
    • names are really complicated
    • origins in other species, environments
    • biologist's sense of humor
    • duplicate common words
    • ambiguous/nonlinear
      • small differences have a large impact on meaning
  • problem with simple lexical searches
Concrete Problem Statement 
 Offer the capability to automate the expansion of a scientist's literature search. Increase recall while maintaining precision.
  • Develop good set of synonymous protein names
  • extract all citations
  • extract articles and abstracts
  • do fuzzy search
  • count up and rate the matches
  • Couldn't do this with simple regex searching
  • Developed a multiple-pass approach
    • index source text
    • search text
    • tabulate
  • This is a parallel problem
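The multiple-pass approach (index, search, tabulate) with fuzzy matching can be sketched with the standard library. This is not the speaker's implementation. The document IDs, the sample text, the threshold, and the use of `difflib` for fuzzy scoring are assumptions for illustration, and only single-token names are handled.

```python
import re
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and drop punctuation so 'TNF-alpha' and 'TNFalpha' compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def index_text(documents):
    """Pass 1: map each normalized token to the documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
            index[normalize(token)].append(doc_id)
    return index

def search(index, synonyms, threshold=0.85):
    """Pass 2: fuzzy-match every synonym against every indexed token."""
    hits = []
    for synonym in synonyms:
        target = normalize(synonym)
        for token, doc_ids in index.items():
            score = SequenceMatcher(None, target, token).ratio()
            if score >= threshold:
                hits.extend((doc_id, synonym, token, score) for doc_id in doc_ids)
    return hits

def tabulate(hits):
    """Pass 3: count matches per document."""
    return Counter(doc_id for doc_id, _, _, _ in hits)

# Made-up abstracts with hypothetical IDs.
documents = {
    "doc-a": "TNF-alpha levels rose; TNFalpha was elevated.",
    "doc-b": "Unrelated text about actin filaments.",
}

table = tabulate(search(index_text(documents), ["TNF-alpha"]))
```

Each document can be indexed independently and each synonym searched independently, which is one way to see why the notes call this a parallel problem.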
"Volume Rendering with Python Molecular Viewer"

Author 
MGLTools 
  •  Multiple tools available