A blog about Software Development, Astronomy and Everything

JDBM 3.0 alpha 1 released

I am proud to announce the first alpha of JDBM3. JDBM is an embedded Java key-value database with more than 10 years of history. It provides Java collections (maps, sets and lists) backed by disk storage. And it has unbeatable performance and simplicity.

The main change from JDBM2 is write performance. JDBM is now probably the fastest Java DB ever. It inserts a million records per second. It can create a multi-terabyte store with 1e11 records overnight. The instance cache now scales up to 64 GB of RAM. And it uses memory-mapped buffers to maximize disk IO speed.

JDBM3 also introduces a new, dead simple API. Everything from JDBM1 is gone (package protected). JDBM3 exports only two public classes. The new release brings many more features, but is also simpler to use.

List of main new features

Compact serialization

JDBM has serialization with very little overhead. Compared to Java serialization it uses 100x less space. It also stores class definitions outside of records, in a single place. Serialization in this alpha seems to be working fine, but some corner cases (Externalizable, inner classes) are not handled correctly.

Write performance improvements

I spent a huge amount of time making sure JDBM is fast with transactions disabled. Most of the slow hot spots were identified and optimized away. Mapped byte buffers are now used, so random inserts have minimal penalty. JDBM can now truly insert a million records per second (at least on my 5GHz computer with an SSD drive). But JDBM makes 200 000 records/s even on an old laptop with a slow disk.

API simplification

The API was greatly simplified. 'RecordManager' was renamed to 'DB'. There is a new builder for configuration, no more verbose properties. BTree/HTree and direct recid access are now obsolete (and gone), replaced by collections. JDBM was merged into a single package and most of the internal stuff is package protected. I may have gone too far, so I am open to discussion about making some old APIs public again.

Cache improvements

There are two new cache types using hard and weak references. The 'hard' cache has very little overhead (it does not have to maintain a reference queue) and scales well up to a 64GB heap. JDBM now periodically checks free memory and if it is less than 25%, it clears the reference cache. This is necessary for the hard cache, and I found GC to be slow and unreliable with a huge heap. The MRU cache is now the default safe option.
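To make the free memory check more concrete, here is a minimal sketch of a hard reference cache that clears itself when free heap drops below a threshold. It is not the JDBM3 code; the class name `HardRefCache` and its methods are made up for this illustration, and the real cache is driven by a periodic check rather than by an explicit call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the JDBM3 implementation. Not thread-safe.
class HardRefCache {
    // Plain strong references: no reference queue to maintain.
    private final Map<Long, Object> cache = new HashMap<>();

    void put(long recid, Object value) { cache.put(recid, value); }

    Object get(long recid) { return cache.get(recid); }

    int size() { return cache.size(); }

    // Fraction of the maximum heap that is still available to the JVM.
    static double freeFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) (rt.maxMemory() - used) / rt.maxMemory();
    }

    // Clears the cache when the free fraction is below the threshold.
    // Returns true if the cache was cleared.
    boolean clearIfLowMemory(double threshold) {
        if (freeFraction() < threshold) {
            cache.clear();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        HardRefCache cache = new HardRefCache();
        cache.put(1L, "some record");
        // 0.25 corresponds to the 25% limit mentioned in the post.
        boolean cleared = cache.clearIfLowMemory(0.25);
        System.out.println("free fraction: " + freeFraction() + ", cleared: " + cleared);
    }
}
```

In a real store the check would run from a timer or every N operations, since calling it on each access would cost more than it saves.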

Mapped memory disk buffer

JDBM now uses a memory-mapped disk buffer. This is a very fast and advanced way to access disk. JDBM3 now has nearly zero data copying and unbeatable performance. The mapped buffer is on by default, but there is an option to fall back to RandomAccessFile.
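For readers who have not used mapped buffers, the following sketch writes a value through a `MappedByteBuffer` and reads it back through `RandomAccessFile`, the two access paths mentioned above. It uses only standard `java.nio` calls; the class `MappedWriteDemo` and its methods are invented for this example and say nothing about how JDBM3 lays out its files.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch, not the JDBM3 storage code.
class MappedWriteDemo {

    // Writes a long at the given offset through a memory-mapped buffer,
    // then reads it back with plain RandomAccessFile calls.
    static long writeMappedReadPlain(File file, int offset, long value) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            long end = offset + 8L;
            if (raf.length() < end) {
                raf.setLength(end); // make sure the mapped region exists in the file
            }
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, end);
            buf.putLong(offset, value); // absolute write, no intermediate byte[] copy
            buf.force();                // flush the mapped region to disk
            raf.seek(offset);
            return raf.readLong();      // both APIs use big-endian byte order by default
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that works on a temporary file.
    static long roundTrip(long value) {
        try {
            File file = File.createTempFile("mapped-sketch", ".bin");
            file.deleteOnExit();
            return writeMappedReadPlain(file, 16, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L));
    }
}
```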

New collections

JDBM3 adds TreeSet and HashSet, which are basically maps without values. It also adds LinkedList, which is a completely new structure. Secondary maps, StorageSet and some other stuff from JDBM2 are gone, but I am open to discussion about implementing them again.

New storages

If you specify 'null' instead of a filename, JDBM will store all data in memory. Data access is then very fast, but the data will be lost when the JVM exits. Other storage options (encryption, zip, write overlay) are in progress and do not work well in this alpha.

Maven2

JDBM is now fully mavenized. I will add it to public repositories when it reaches beta.

Defragmentation

Defrag is now much better. It reorders records so that collections are stored on the same pages. This makes trees much faster.

Large values stored outside of tree

Large values are no longer inlined in trees. If a value is bigger than 32 bytes, it is stored as a separate record and loaded lazily.

Readonly store

It is now possible to open a store in read-only mode. This uses different locking, so you can open one file from several JVM instances in read-only mode. It also means that data is read faster, as JDBM does not have to create defensive copies.

Usage

JDBM is located in a GitHub repository. You can download a compiled jar file here. There is no javadoc yet, just follow the two simple public classes or the example on the main page.

This release is usable, but contains bugs and many TODOs! Its main purpose is to get feedback from the community.

Future and other stuff

I got married and reevaluated my open source activities. JDBM is very compact, has good code quality, is successful and has huge potential. Exactly as a personal hobby project should be. So I suspended my other projects and now work solely on JDBM.

As you may have noticed, I renamed the 'jdbm' package to 'net.kotek.jdbm'. I found the previous package name too generic. The new package name should reduce fragmentation (there are already 10 JDBM forks). It should be clearly visible where the 'official' page is.

I expect JDBM3 to reach beta stage in about 6 months. At that point JDBM will turn into a regular project with documentation, a bug tracking system, a Maven2 repo etc. There will also be a feature freeze.

I believe JDBM3 has the potential to be used by millions of people. So the final version should be much better tested than JDBM2. I have an automatic test suite which hammers JDBM with random data for several weeks (or months). But before the final JDBM release I will try to get sponsors; some extra hardware and ads could really speed up the release.

At the current speed I expect to have the final JDBM 3 in two years.

· 2012/01/18 · Jan Kotek

Pixy2 updated

Pixy 2 System is an astronomical image examination and object identification application. It processes raw images and makes basic corrections. It can also identify asteroids and comets and find variable stars. It was developed by Seiichi Yoshida; I took the liberty of updating it a bit.

The original version was last updated in 2007, more than four years ago. The features of this program are impressive, so I tried to run it. But it depended on some outdated packages from an ancient version of the Java Runtime.

I gave it a few hours and updated the Pixy2 sources so it runs on a recent JVM (1.6+). I also moved the project source code to GitHub, so now anyone can easily contribute new features.

Here is what I did:

  • I removed the dependency on an external XML parser. One is already bundled with recent JREs, so there is no need for an external library
  • There was an external dependency on an image library. I refactored the code to use the image library bundled with the JRE
  • I dropped support for FITS (the image library bundled with the JRE is not that powerful).
  • A bit of refactoring to remove class name and keyword conflicts
  • Fixed a bunch of compilation warnings

Pixy2 is now polished and shiny, but it could still use some improvements. My plan is to add support for the UCAC3 catalog and some online catalogs. And in time I would like to make Pixy2 part of a planetarium.

You can find the updated version in the repository, or you can download the binary package.

· 2011/11/01 · Jan Kotek

JDBM 3 is coming

A few weeks ago I started work on JDBM 3 at GitHub. The main goal is to improve simplicity and performance. JDBM3 is packed with new features and changes. The difference from JDBM2 is even bigger than between JDBM1 and JDBM2. But there is still the policy 'no test left behind', so we should enjoy great stability similar to previous releases.

I have already started work and some of the features below are actually already implemented (serialization, lazy tree values…). I expect to have the first alpha version in January 2012 (all features implemented with usable stability). The jar file should remain very small, around 200KB.

Serialization

The most significant change is object serialization. JDBM2 used a very primitive, space-efficient serialization for a few base classes (Long, Integer, ArrayList…). For the rest of the classes it used Java serialization. Serialized data usually contains two sections: class metadata (class and field names and types…) and the serialized data itself. Java serialization stores class metadata with each record, and this creates a huge space overhead. It is more efficient to store class metadata in a single place and just reference it from each record. In JDBM3 I am going to reimplement Java serialization to do exactly this. As a result, space usage will be dramatically reduced. The new serialization will be completely transparent to the user and behave exactly like normal Java serialization (Serializable, Externalizable etc…). This may look like a huge step, but most of it is already implemented in the JDBM3 GitHub repository.
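The overhead of per-record class metadata is easy to measure. The sketch below serializes one small object with standard Java serialization and then writes the same fields with a one-byte class id standing in for metadata stored elsewhere. The `Person` class, the class id and the record layout are assumptions made for this illustration; they are not the JDBM3 format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the JDBM3 serializer.
class SerializationOverhead {

    // Example record type, used only for this illustration.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // Standard Java serialization: the class description travels with the object.
    static int javaSerializedSize(Person p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compact form: a small id that refers to metadata kept in one place,
    // followed by the field values only.
    static int compactSize(Person p, int classId) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeByte(classId);
                out.writeUTF(p.name);
                out.writeInt(p.age);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Person p = new Person("Jan", 30);
        System.out.println("java serialization: " + javaSerializedSize(p) + " bytes");
        System.out.println("compact record:     " + compactSize(p, 1) + " bytes");
    }
}
```

The exact ratio depends on the class, but the gap grows with the length of class and field names, since those are repeated in every Java-serialized record.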

Improved defragmentation

The current defragmentation does not rearrange records, it just reclaims unused space. The new defrag will reorganize records so that tree nodes are located on the same pages. This should significantly improve tree read operations.

Large value stored outside tree

Currently all values are stored inside tree nodes. Even for a simple lookup this means loading all values in the node. If values are big (around 1 KB), it slows down tree operations. In JDBM3 values larger than 32 bytes will be serialized into a separate record and only the reference id will be stored as part of the tree. 'PrimaryStoreMap' is no longer necessary and will be removed.
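The idea can be sketched with a tree whose slots hold either the value itself or the id of a separate record. Everything here (`LazyValueTree`, `Slot`, the in-memory map standing in for the record store) is hypothetical and only illustrates the 32 byte rule; the real tree nodes are serialized differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the JDBM3 BTree.
class LazyValueTree {
    static final int MAX_INLINE_SIZE = 32; // threshold taken from the post

    // A slot holds either the value bytes inline or the recid of a separate record.
    static final class Slot {
        final byte[] inline; // non-null for small values
        final long recid;    // meaningful only when inline == null
        Slot(byte[] inline, long recid) { this.inline = inline; this.recid = recid; }
    }

    private final TreeMap<String, Slot> tree = new TreeMap<>();
    private final Map<Long, byte[]> records = new HashMap<>(); // stand-in for the record store
    private long nextRecid = 1;

    void put(String key, byte[] value) {
        Slot old = tree.get(key);
        if (old != null && old.inline == null) {
            records.remove(old.recid); // release the record of the replaced large value
        }
        if (value.length > MAX_INLINE_SIZE) {
            long recid = nextRecid++;
            records.put(recid, value);
            tree.put(key, new Slot(null, recid));
        } else {
            tree.put(key, new Slot(value, 0));
        }
    }

    // Large values are fetched from the record store only when actually requested.
    byte[] get(String key) {
        Slot slot = tree.get(key);
        if (slot == null) return null;
        return slot.inline != null ? slot.inline : records.get(slot.recid);
    }

    boolean isInline(String key) {
        Slot slot = tree.get(key);
        return slot != null && slot.inline != null;
    }

    int separateRecordCount() { return records.size(); }

    public static void main(String[] args) {
        LazyValueTree t = new LazyValueTree();
        t.put("small", new byte[10]);
        t.put("large", new byte[1024]);
        System.out.println("small inline: " + t.isInline("small"));
        System.out.println("large inline: " + t.isInline("large"));
    }
}
```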

RecordManager builder

JDBM2 uses properties to provide settings for RecordManager. This is very verbose and does not work with IDE hints. So JDBM3 replaces properties with a new RecordManagerBuilder class. An example:

  RecordManager recman = new RecordManagerBuilder("file.db")
     .enableWeakCache().readonly().build();

New collections

JDBM currently provides HashMap and TreeMap collections. JDBM3 will introduce HashSet, TreeSet and LinkedList.

Weak cache

There are small improvements in the cache. A weak reference cache is added (we already have Soft). I am thinking about adding a hard reference cache, but I have no valid use case for it.

Read only store

It will be possible to open RecordManager in read-only mode. In this mode all insert/update/delete methods will throw 'OperationNotSupportedException'. A read-only store will not be locked and can be opened by multiple JVM instances.

Alternative storages

JDBM3 will introduce alternatives to traditional file storage. In-memory storage will keep all data in RAM (useful for testing or bulk imports). In-jar read-only storage will read all data from a compressed jar file, so the user can deploy a database over Web Start or a Java applet. It will be easy to copy a database from one store to another, so you may do a bulk import in memory and package it directly into a zip file.

RecordManager write overlay

Want to write into read-only storage (a jar file)? For this case JDBM3 introduces the Write Overlay RecordManager. In this mode the original read-only RecordManager is wrapped with a proxy, which stores all modifications in a second storage. For the user it behaves exactly like a single writable record manager. I expect this to be very useful for testing and deployment on desktop.
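The proxy idea can be shown with two maps: a read-only base and a writable overlay that is consulted first. This is only a sketch; `OverlayStore` is an invented name, records are simplified to strings, and the real RecordManager interface has more operations than get, put and delete.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the JDBM3 write overlay.
class OverlayStore {
    private final Map<Long, String> base;                       // read-only original
    private final Map<Long, String> overlay = new HashMap<>();  // all modifications
    private final Set<Long> deleted = new HashSet<>();          // records removed on top of base

    OverlayStore(Map<Long, String> base) {
        this.base = Collections.unmodifiableMap(base);
    }

    // Reads check deletions first, then the overlay, then the base.
    String get(long recid) {
        if (deleted.contains(recid)) return null;
        String value = overlay.get(recid);
        return value != null ? value : base.get(recid);
    }

    // Writes never touch the base.
    void put(long recid, String value) {
        deleted.remove(recid);
        overlay.put(recid, value);
    }

    void delete(long recid) {
        overlay.remove(recid);
        deleted.add(recid);
    }

    public static void main(String[] args) {
        Map<Long, String> base = new HashMap<>();
        base.put(1L, "original");
        OverlayStore store = new OverlayStore(base);
        store.put(1L, "modified");
        System.out.println(store.get(1L) + " / " + base.get(1L));
    }
}
```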

Two file storage

There were a lot of complaints about JDBM2 using 8 files for storage. I would love to put everything into a single file, but it is not possible. JDBM3 will have storage in two files: physical records and logical records. Keeping logical records separate greatly improves defragmentation and performance.

Maven 2

JDBM3 will use Maven2 instead of Ant. I will also add JDBM3 to the main Maven repositories.

Faster transactions

JDBM1 had an interesting feature where transactions were grouped and written into the record file at once. This greatly improved write performance. I removed this feature in JDBM2, as it also caused frequent 'OutOfMemoryError's (transactions were stored in memory). In JDBM3 this feature will be reintroduced, but with a fix for memory consumption.

Backups

JDBM will be able to back up the database into a zip file while running.

Space usage statistics

Currently it is hard to tell how much space each structure uses. So JDBM3 will be able to print out some basic statistics about the store: unused space in the store; min, max and avg record size in the store and in each tree; total space consumed by each tree; number of nodes in each tree etc. This feature will also be important for development and performance profiling.
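As a rough idea of the numbers involved, the sketch below computes min, max, average and total over a list of record sizes using the standard `IntSummaryStatistics` class. The class `RecordStats` and the sample sizes are made up; how JDBM3 will collect record sizes from a store is not shown here.

```java
import java.util.IntSummaryStatistics;
import java.util.stream.IntStream;

// Hypothetical sketch, not the JDBM3 statistics code.
class RecordStats {

    // Summarizes record sizes (in bytes) of one store or one tree.
    static IntSummaryStatistics summarize(int[] recordSizes) {
        return IntStream.of(recordSizes).summaryStatistics();
    }

    public static void main(String[] args) {
        int[] sizes = {10, 20, 60}; // sample data for illustration only
        IntSummaryStatistics stats = summarize(sizes);
        System.out.println("records: " + stats.getCount());
        System.out.println("min:     " + stats.getMin());
        System.out.println("max:     " + stats.getMax());
        System.out.println("avg:     " + stats.getAverage());
        System.out.println("total:   " + stats.getSum());
    }
}
```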

Free record sorting

On each insert JDBM needs to find a free slot for the new record. Currently a 'brute force' scan across all free slots is performed. In JDBM3 I would like to keep free records sorted by size. This would improve performance on inserts and updates.
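One way to keep free slots sorted by size is a `TreeMap` keyed by slot size, where `ceilingEntry` finds the smallest slot that fits in logarithmic time instead of a full scan. This is a sketch of that general technique, with an invented `FreeSlotIndex` class; it is an in-memory structure and does not show how the sorted list would be persisted on disk.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the JDBM3 free record manager.
class FreeSlotIndex {
    // Slot size in bytes -> offsets of free slots with exactly that size.
    private final TreeMap<Integer, Deque<Long>> bySize = new TreeMap<>();

    void addFree(long offset, int size) {
        bySize.computeIfAbsent(size, s -> new ArrayDeque<>()).add(offset);
    }

    // Returns the offset of the smallest free slot that can hold 'size' bytes,
    // or -1 if no free slot is large enough.
    long allocate(int size) {
        Map.Entry<Integer, Deque<Long>> entry = bySize.ceilingEntry(size);
        if (entry == null) return -1;
        long offset = entry.getValue().poll();
        if (entry.getValue().isEmpty()) {
            bySize.remove(entry.getKey()); // drop empty size buckets
        }
        return offset;
    }

    public static void main(String[] args) {
        FreeSlotIndex index = new FreeSlotIndex();
        index.addFree(100, 64);
        index.addFree(200, 16);
        index.addFree(300, 32);
        System.out.println(index.allocate(20)); // best fit is the 32 byte slot at offset 300
    }
}
```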

Serializers stored in tree

Currently serializers are not stored inside the tree, but are supplied from outside by the user. Now I think this is a mistake and makes JDBM harder to use. So in JDBM3 serializers will be stored as part of the tree definition in the JDBM store.

Data format strictness

One of the goals is to bring JDBM closer to SQL in terms of format definition and data consistency. In JDBM2 there is no difference between creating a new tree and loading an existing one. Now I think this is a mistake, and JDBM should have a stricter definition of data structures. So there will be separate methods for creating a new tree and for loading existing trees. Secondary trees will also have stricter definitions. I will also add something similar to SQL constraints.

Single package

JDBM is now contained in a single package 'jdbm'. Subpackages (recman, helper, btree and htree) were merged into a single folder. Some classes were renamed to fit better into the new structure (e.g. btree prefix). Internal classes (pages, free disk manager) should not be visible to the user and are now package protected. Also, JDBM has only a few classes (~50), so there is no need for subpackages.

BTree and HTree completely replaced by maps

There were two ways to manipulate trees in JDBM2. Now the BTree and HTree classes cannot be used directly; a Map wrapper must be used instead. This makes the API simpler and reduces code size. HTree now implements the Map interface directly. BTree is package protected and still uses a TreeMap wrapper.

Future

JDBM 3 is the last major version for years to come. I have neither the desire nor the resources to move JDBM into the clouds of clustering and super concurrency. JDBM will remain a simple and fast storage for desktop and Android. In the future I will concentrate on tooling and small improvements. For example, I would love to have Spring support or a GUI application to analyze a record store. A .NET port would also be great.

· 2011/10/23 · Jan Kotek

Scala problems

I love Scala; it lifted my capabilities and made me incredibly productive. But after 18 months I have found a few problems.

Compilation speed

Scala compilation is just bloody slow. My 'small' project with 200+ classes takes about 2 minutes to compile on a quad-core system. That is about 10x slower than similar code in Java. The Scala compiler does a lot of additional stuff (type inference, implicit conversions), so it kind of makes sense.

Incremental compilation just does not work yet

Eclipse (or any decent IDE) compiles Java as you type and reports errors almost instantly. Scala IDEs have incremental compilation as well, but they are still far behind Java.

The incremental compiler in Idea is just slow; for example, a simple test case can take up to 20 seconds to recompile. It is also unreliable: it often reports missing classes and you need to fully rebuild the project to fix it.

The Eclipse incremental compiler is more interactive and more reliable than the one in Idea. But the Scala Eclipse plugin is currently just a better Notepad hooked to a compiler; I would not call it an IDE yet.

There are some workarounds, but the best one is a change in coding style: just don't run unit tests every 5 minutes like a monkey.

Huge jars to distribute

Scala does a lot of tricks to fit into Java bytecode. For example, each closure generates a new *.class file.

As a result, the size of jar files grows rapidly. You also need to bundle the Scala library with your program. The Scala library source code zip is 1.1 MB; compiled to a jar file it takes 8.5 MB. Size is a problem when distributing desktop and mobile applications.

The workaround is easy, just use ProGuard.

Not possible to return to Java

Scala is often presented as a replacement for Java, but in reality it brings programming to a new level. Scala extends mind and abilities. And after a while it is just not possible to return to Java.

To illustrate: Scala has XML support. Not much, just native XML literals and XPath-like queries wired into the language. And now, after 18 months, I have simply forgotten how XML parsing is done in Java.

Binary incompatibility

Scala changes the class file format nearly every version. 2.7, 2.8 and 2.9 are not compatible with each other. But it gets even worse: the 2.8.1 compiler cannot compile with the 2.8.0 library because it depends on some new classes. With a lot of Scala libraries, an upgrade can become very painful.

Bugs and release cycle

Scala is just not as reliable as Java. There are bugs in the compiler and libraries. I usually discover a new bug every 4 months.

Look at recent version numbers: 2.8.0, then 2.8.1, then 2.9.0 and 2.9.0.1. There is no clear release cycle. 'Bug fix' releases may bring new functionality and new bugs. Scala 2.8.1 actually introduced the worst bug I have found; I spent 2 days hunting it down.

And when a new major version rolls out (2.8 or 2.9), you may forget about updates for the old version. It sucks when compared to PHP, Groovy or Python…

No tight loops

Scala is just not very good for tight loops. Take this code:

for(i <- 0 until 100){}

It translates to:

  1. first a new object is created: Range(0, 100)
  2. a new Integer instance is created in each iteration
  3. a bunch of methods is called, creating overhead on the method stack.

All this creates huge overhead, which makes the loop many times slower. To get the same speed as with Java, one has to use this construct:

var i = 0
while(i<100){

 i += 1
}

It is not about actors

Scala is somehow strongly associated with Actors. But they are just a tiny, tiny speck in the Scala world. Scala lets me write better Swing code, integrate tightly with JDBC, do much better XML processing, program Android…

Yet some people have a fetish for Actors. BTW I actually worked on a large application which uses the Actor model. It is a real pain in the arse.

· 2011/07/06 · Jan Kotek

Scala Fast Compiler and Intellij Idea

Scala compilation in Idea is painfully slow on large projects. FSC (the fast Scala compiler) should speed it up, but is just not reliable. Solution? Just start FSC from an Ant script and use it from Idea.

If a previous FSC instance crashed, it may have left a mess behind. So before starting FSC it is good to delete its settings. This idea comes from Krzysztof Białek. He provides a shell script to start FSC, but why not use Ant:

    <target name="sfc" description="Starts Scala Fast Compilation server" depends="">

        <!-- delete previous config dir, there may be mess in there -->
        <delete dir="${java.io.tmpdir}/scala-devel"/>

        <!-- define classpath -->
        <path id="fsc-classpath">
            <fileset dir="lib">
                <include name="*.jar"/>
            </fileset>
            <fileset dir="tools/buildlib">
                <include name="*.jar"/>
            </fileset>
        </path>

        <!--
        fsc usually quits after N minutes of inactivity,
        start it again a few times to extend time
        -->
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>
        <java classname="scala.tools.nsc.CompileServer" maxmemory="512m" fork="true" >
            <classpath refid="fsc-classpath"/>
        </java>


    </target>

You will need to change the classpath to match your project. FSC needs scala-compiler.jar and scala-library.jar. FSC exits after some time of inactivity, so it is restarted multiple times. After you start the Ant task, leave it running and use Idea.

In Idea go to File > Settings > Compiler > Scala Compiler > Use fsc (fast scalac). When this is enabled, Scala compilation will become much faster.

Update: another source of problems is a wrongly configured hostname. FSC resolves your hostname into an IP address and tries to connect to that IP. It does not use '127.0.0.1'. On Linux check the content of '/etc/hostname' (in my case artemis), and try to ping that name

 $ ping artemis

If this fails, you need to add this host manually as the first line of '/etc/hosts'

 127.0.0.1     artemis

· 2011/04/21 · Jan Kotek
