Saturday, February 16, 2008

Visualisation of DNA

After some attention has been caought by the wonderfull Rainbow DNA project, I have decided to join the club! Here is a very simplistic, far more useless way of visualising DNA: turtle graphics.
I really cannot put up a website with comlete renderings of the DNA using turtle graphics, but I uploaded two sample images to my flikr account. I wanted to find out if turtle graphics could reveal diffrent sets of patterns as those perceivable with a color plot of basepairs.

Attached is source code so you can see what the program does. You can also reuse the part that proceses the contents of "gbk" files containing genome data.

The code is just a hack done at night while I was waiting in a starbucks for a friend to pick me up, so if you think this code is a mess - you earned your degree, I was just curios after I found out about the rainbow dna project.

Well the idea of the program is very simple:
1. Initialise the turtle to be in center of the screen
2. read the next basepair, for each base encountered look up the turtle rotation
3. rotate the turtle
4. draw 5 pixels
5. goto step 2 until finished with a gene

So here are the results, not surprisingly very unspectacular. If you want a good representation for the contents of the human genome, well ... look into a mirror. All other reps. just look ridiculous in comparison.

I think it's funny, that one could argue that the dna seems to be like a multi-quine: not only that it conatins the code that creates organisms to reproduce itself through regular biological reproduction, it also encodes a brain with the ability to create turtle renderings of itself...

In the following image rendering the following rules were applied. Whenever an "a" (as of a, c, t, g) is encountered turn the turle by -180 degree, on "c" -60 degrees, "t" 60 deg and "g" 180 deg.

turtle dna rendering heyll-mode

In the next rendering the following rules apply. Whenever an "a" turns the turle by 23 degree, on "c" 42 degrees, "t" 128 deg and "g" 15 deg.

turtle dna rendering boese-mode

Ok now here is the code.
You might want to get the gbk files. Have a look at my delicious account, I have stored a link to an ftp server where a gbk file for every chromosome of the human genome can be found.

For my experiments I used parts of the X chromosome.


(require (lib "turtles.ss" "graphics"))


(load "boyer-moore.scm")
(load "lazy-streams.scm")
(load "list-utils.scm")

;; read a line
(define LS #\newline)

(define (read-line port)
(let loop ((line '())
(c (read-char port)))
(if (eof-object? c)
(reverse (cons c line))
(if (eqv? c LS)
(reverse line)
(loop (cons c line) (read-char port)) ) )))

(define (lazy-line-stream port)
(define current-line (read-line port))
(define result-stream
(cons-lazy-stream current-line (lazy-line-stream port)))
(if (eof-object? (car current-line))
the-empty-lazy-stream
result-stream))

(define (filter-bases char-list)
(filter
(lambda (c) (or (eqv? c #\a) (eqv? c #\c) (eqv? c #\g) (eqv? c #\t)))
char-list))

(define (process-gbk port info-block-func base-pair-func post-draw)
(define (loop-base-pairs line-stream)
(if (empty-lazy-stream? line-stream)
(display "stream exhausted(while basepair parsing).")
(if (and (eqv? (car (lazy-head line-stream)) #\/) (eqv? (cadr (lazy-head line-stream)) #\/))
(begin
(post-draw)
(loop-header (lazy-tail line-stream) '()))
(begin
(base-pair-func (filter-bases (lazy-head line-stream)))
(loop-base-pairs (lazy-tail line-stream))))))
(define (loop-header line-stream info-block-A)
(if (empty-lazy-stream? line-stream)
(begin (newline)
(display "stream exhausted(while parsing header).")
(newline))
;; ok there's more stuff to read so. Find the ORIGIN string indicating the start of a DNA string
(if (equal? #f (>>boyer-moore (string->list "ORIGIN") (lazy-head line-stream)))
(begin
(loop-header (lazy-tail line-stream) `(,@info-block-A ,(lazy-head line-stream))))
;; ok found the ORIGIN string
(begin
(newline)
(display "found beginning of base pair sequence.")
(newline)
(info-block-func info-block-A)
(loop-base-pairs (lazy-tail line-stream))))))
;; well the file is always assumed to start with a header
(loop-header (lazy-line-stream port) '()))


;; simple function that just displays the base pairs
(define (simple-base-pair-displayer L)
(display L)
(newline))

;; now some turtle functions
;; simple turtle moving and turning

;
;(define base-table ;; pun intended
; '((#\a 23)
; (#\c 42)
; (#\t 128)
; (#\g 15)))
;(define angle-factor 1)
;(define step-len 5)

;(define base-table ;; pun intended
; '((#\a -3)
; (#\c -1)
; (#\t 1)
; (#\g 3)))
;(define angle-factor 60)
;(define step-len 5)

(define base-table ;; pun intended
'((#\a -2)
(#\c -1)
(#\t 1)
(#\g 2)))
(define angle-factor 60)
(define step-len 4)

;(define base-table ;; pun intended
; '((#\a 0)
; (#\c 1)
; (#\t 2)
; (#\g 3)))
;(define angle-factor 90)
;(define step-len 4)


(turtles #t)

;;simple turtle func that will draw a line
(define (basepair-drawer L)
(define (draw-loop L)
(if (not (equal? L '()))
(begin
(turn (* angle-factor (cadr (assoc (car L) base-table))))
(draw step-len))))
(draw-loop L))

(define (info-block-func info-block)
(display "GOT AN INFO BLOCK, STARTING NEW RENDERING")
;(display info-block)
(newline)
(clear))

(define (post-draw)
(display "FINISHED DRAWING")
(newline)
(sleep 5))


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; main prog
(define (start)
(call-with-input-file "ref_chrX.gbk"
(lambda (port)
(process-gbk port info-block-func basepair-drawer post-draw))))
(start)

Boyer Moore Search Algorithm

I needed to find a String in a text file, so I wrote(rather hacked) a scheme imlementation of the boyer moore string search algoritm.

This is just a hack. But it is commented. What do you think?
(I decided to use this blog also as my cut-paste-source from now on.)


;; searches for string using boyer moore algorithm
(define (>>boyer-moore needle haystack)
(define needle-len (string-length needle))
(define hs-len (string-length haystack))
(define r-needle-list (reverse (string->list needle)))
;; two tables are build
;; compute the bad character shift table
;; it contains the number of chars to skip, if a character is encountered that is not the last of the search string.
;; (this table is only used after the search cursor was replaced)
(define bad-char-shift-table
(let loop
((shift 0)
(nlist r-needle-list)
(table '()))
(if (eq? nlist '())
table
(if (assv (car nlist) table)
(loop (+ 1 shift) (cdr nlist) table)
(loop (+ 1 shift) (cdr nlist) `((,(car nlist) ,shift) ,@table))))))
;; the good char table contains the number of chars to skip forwar, if a substring starting from the
;; end of a needle was matched befor a mismatch occurs
;; it contains the next possible position from the current search position where a
;; string match might end...
(define good-suffix-shift-table
;; for every reverse substring define a shift value
(let char-pattern-loop
((pattern '())
(pattern-len 0)
(nlist r-needle-list)
(table '()))
(if (eq? nlist '())
table
(char-pattern-loop
`( ,@pattern ,(car nlist))
(+ 1 pattern-len)
(cdr nlist)
`(,@table
(,pattern ,(let loop
((shift 0)
(unmatched (car nlist)))
(if (equal? (ncar pattern-len (ncdr shift r-needle-list)) (ncar (- needle-len shift) pattern))
(if (eqv? shift needle-len)
shift
(if (>= (+ shift pattern-len) needle-len)
shift
(if (eqv? (car (ncdr shift nlist)) unmatched)
;; ok nicht gefunden weiter schieben/suchen
(loop (+ 1 shift) unmatched)
shift)))
(loop (+ 1 shift) unmatched)))))))))
;; searching at a position
(define (search-needle-at index)
(letrec ((sub-hs (reverse (string->list (substring haystack index (+ needle-len index))))) ;; den kaefer erstma aufn ruecken drehen...
(first-char (car sub-hs)))
(if (eqv? first-char (car r-needle-list)) ;; first time is special
;; if the fist char matches, proceed with subpattern search
(let ((common (common-sublist sub-hs r-needle-list)))
(if (= (car common) needle-len)
0 ;; found
(cadr (assoc (cdr common) good-suffix-shift-table))))
;;if the first char did not match, look up shift in bad-char-shift table
(let ((shift (assv first-char bad-char-shift-table)))
(if (eq? shift #f)
needle-len ;; return the needle length if nothing better could be found in the bad-char jump table
(cadr shift)))))) ;; ...otherwise return the value obtained from the table

;; search mainloop
(let main-loop ((current-index 0))
(if (> (+ needle-len current-index) hs-len)
#f
(let ((minimum-chars-to-skip (search-needle-at current-index)))
(if (= 0 minimum-chars-to-skip)
current-index ;; juhu found string!
(main-loop (+ current-index minimum-chars-to-skip)))))))


(>>boyer-moore "ANPANMAN" "NNNNNAXPANPANMANANMAN")


Some utility definitions are missing from the above code, these are:


;; returns the rest of the list after removing n elements
(define (ncdr n list)
(if (eqv? n 0)
list
(if (eq? list '())
list
(ncdr (- n 1) (cdr list)))))

;; returns the fist n items of the list
(define (ncar n list)
(let loop ((result '())
(rest list)
(c n))
(if (eqv? c 0)
result
(if (eq? rest '())
result
(loop `(,(car rest) ,@result) (cdr rest) (- c 1) )))))

;; return the common begining sublist of two lists
(define (common-sublist listA listB)
(let loop
((listC '())
(restA listA)
(restB listB)
(count 0))
(if (or (eq? restA '()) (eq? restB '()))
(cons count listC)
(if (eqv? (car restA) (car restB))
(loop `(,@listC ,(car restA)) (cdr restA) (cdr restB) (+ 1 count))
(cons count listC)))))



The above code might be complete bullsh*t, I dont know I just hacked it down while reading the wikipedia article of the algorithm. I didn't bother to lookup a reference implementation...
Also it was like 4:00 am when I hacked it...(apologies accepted?)