"DFLib" (short for "DataFrame Library") is a lightweight, pure Java implementation of DataFrame. DataFrame is a very common structure in data science and Big Data worlds. It provides operations like search, filtering, joins, aggregations, statistical functions, etc., that are reminiscent of SQL (also of Excel), except you run them in your app over dynamic in-memory data sets.

There are DataFrame implementations in Python (pandas), R, Apache Spark, etc. DFLib project’s goal is to provide the same functionality for regular Java applications. It is a simple library that requires no special infrastructure. DFLib core is dependency-free.

The code in this documentation can be run in any Java IDE, as well as (and this is pretty cool!) in a Jupyter notebook. You will need Java 8 or newer. Running in Jupyter will require Java 9+.

Getting Started with DFLib

Include DFLib in a project. Assuming you are using Maven, start by declaring a "BOM" to have a common version for multiple DFLib modules:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.nhl.dflib</groupId>
      <artifactId>dflib-bom</artifactId>
      <version>0.11</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Next include DFLib core as a dependency:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib</artifactId>
</dependency>

Create a DataFrame, do some manipulations with it, print the result:

DataFrame df1 = DataFrame
        .newFrame("a", "b", "c")
        .foldIntStreamByRow(IntStream.range(1, 10000));

DataFrame df2 = df1.selectRows(
        df1.getColumnAsInt(0).indexInt(i -> i % 2 == 0));

System.out.println(Printers.tabular.toString(df2));

When you run this code, the console output will look like this:

   a    b    c
---- ---- ----
   4    5    6
  10   11   12
  16   17   18
...
9982 9983 9984
9988 9989 9990
9994 9995 9996
1666 rows x 3 columns
We’ll omit print statements in all the following examples, and will simply display their output. Details of printing are discussed in the "Printers" chapter.
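To make the arithmetic above concrete, here is a plain-Java sketch (no DFLib involved) of the fold-by-row layout and the even-number filter: 9999 values folded into 3 columns produce 3333 rows, and keeping only the rows whose "a" value is even leaves 1666 of them.

```java
public class FoldByRowSketch {

    // Number of rows produced when "count" values are folded into "cols" columns
    // row by row (assuming count divides evenly by cols, as in the example above)
    static int foldedRows(int count, int cols) {
        return count / cols;
    }

    // First-column values are from, from + cols, from + 2 * cols, ...;
    // count how many of them are even
    static int evenFirstColumnValues(int from, int to, int cols) {
        int count = 0;
        for (int i = from; i < to; i += cols) {
            if (i % 2 == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // IntStream.range(1, 10000) produces 9999 values: 3333 rows of 3 columns
        assert foldedRows(9999, 3) == 3333;
        // selectRows keeps rows whose "a" value (1, 4, 7, ...) is even: 1666 rows
        assert evenFirstColumnValues(1, 10000, 3) == 1666;
        System.out.println("ok");
    }
}
```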

Main Data Structures

The two essential DFLib classes are Series and DataFrame. Series is a 1-dimensional array-like object, and DataFrame is a 2-dimensional table-like object. Columns in a DataFrame are stored as Series. Additionally, we will discuss the Index object that stores DataFrame column names.

Both DataFrame and Series (and Index) are fully immutable, so all operations on them return a new instance. Behind the scenes, the library shares as much data as possible between instances, so copying these objects does not cause significant performance degradation, and in turn makes DFLib fully thread-safe, with multiple concurrent operations possible on the same data structure. Immutability also means that you can have full snapshots of your data at each step of a transformation, simplifying debugging and auditing of data pipelines.

Series

Series is the simplest of the two data structures. You can think of it as a wrapper around an array of values. You can use Series to model sequences of data such as timestamps in a time series. A Series object is parameterized for the type of data that it holds. So there can be e.g. a Series<String> or a Series<LocalDate>. There is also an important category of "primitive" Series (IntSeries, LongSeries, DoubleSeries, BooleanSeries) that are optimized for memory use and arithmetic operations. While less efficient, they can also be used as Series of the corresponding wrapper object type (e.g. IntSeries is also a Series<Integer>).

The Series object is important as a building block of DataFrame, but it also defines a number of useful data manipulation and transformation operations of its own. Those will be covered in the following chapters. Here we’ll discuss how to create Series.

Creating Series from Arrays

Series can be created using the static forData method on the Series interface:

Series<String> s = Series.forData("a", "bcd", "ef", "g");
a
bcd
...
g
4 elements

Primitive Series classes have their own factory methods. E.g.:

IntSeries is = IntSeries.forInts(0, 1, -300, Integer.MAX_VALUE);

Creating Series Incrementally

If we don’t have a full array of data to start with, and instead have code producing a sequence of values of unpredictable length, we can use special mutable "accumulators" to store the sequence and then produce a Series out of it. E.g. the following example reads data from an InputStream line by line as Strings:

// InputStream inputStream = ...
Accumulator<String> accum = new ObjectAccumulator<>();
Scanner scanner = new Scanner(inputStream);
while (scanner.hasNextLine()) {
    accum.add(scanner.nextLine());
}

Series<String> s = accum.toSeries();

Similarly there are IntAccumulator, LongAccumulator, etc.

DataFrame

DataFrame is an in-memory table with rows and columns. Each column has a name, and is represented as a Series object. DataFrame can contain columns of different kinds and is not parameterized for a single type.

There are a few ways to create a DataFrame. Here we’ll show the "builder" API that converts arrays, Streams, Collections, Series and other in-memory data into DataFrames.

Alternatively, a DataFrame can be loaded from (and saved to) a database or a CSV file. Those data sources are discussed in separate documentation chapters.

First example - adding data row by row:

DataFrame df = DataFrame
        .newFrame("name", "age") (1)
        .addRow("Joe", 18)   (2)
        .addRow("Andrus", 45)
        .addRow("Joan", 32)
        .create();
1 Builder is created from column names
2 Passing individual rows one by one

The resulting DataFrame looks like this:

name   age
------ ---
Joe    18
Andrus 45
Joan   32
3 rows x 2 columns

Another example - if we have a single array and need to "fold" it into a two-dimensional DataFrame row-by-row:

DataFrame df = DataFrame
        .newFrame("name", "age") (1)
        .foldByRow("Joe", 18, "Andrus", 45, "Joan", 32); (2)
1 Builder is created from column names
2 Passing a varargs array of values that is folded to match the specified number of columns, row by row.

Same, but folding column-by-column:

DataFrame df = DataFrame
        .newFrame("name", "age")
        .foldByColumn("Joe", "Andrus", "Joan", 18, 45, 32);

You can also create DataFrames from Collections and Streams (folded either by row or by column). Here is an example of how to use Streams of primitive ints, creating a DataFrame made of memory-efficient IntSeries columns:

DataFrame df = DataFrame
        .newFrame("col1", "col2")
        .foldIntStreamByColumn(IntStream.range(0, 10000));
col1 col2
---- ----
   0 5000
   1 5001
   2 5002
...
4997 9997
4998 9998
4999 9999
5000 rows x 2 columns
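As a mental model (plain Java, not DFLib code), folding by column splits the value sequence into contiguous chunks, one per column; with 10000 values and 2 columns, "col1" gets values 0..4999 and "col2" gets 5000..9999, which matches the output above.

```java
public class FoldByColumnSketch {

    // Position in the source sequence of the value at (row, col) when "count"
    // values are folded into "cols" columns column by column
    // (assuming count divides evenly by cols, as in the example above)
    static int valueAt(int row, int col, int count, int cols) {
        int rows = count / cols;
        return col * rows + row;
    }

    public static void main(String[] args) {
        // IntStream.range(0, 10000): value at position p is p itself
        assert valueAt(0, 0, 10000, 2) == 0;     // first row, "col1"
        assert valueAt(0, 1, 10000, 2) == 5000;  // first row, "col2"
        assert valueAt(4999, 1, 10000, 2) == 9999; // last row, "col2"
        System.out.println("ok");
    }
}
```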

Finally, a DataFrame can be created from an array of Series, each Series representing a column:

DataFrame df = DataFrame
        .newFrame("name", "age")
        .columns(
                Series.forData("Joe", "Andrus", "Joan"),
                IntSeries.forInts(18, 45, 32)
        );

This is the fastest and most efficient way to make a DataFrame, as an array of Series is how it is structured internally.

Now that we’ve discussed how to create Series and DataFrames, we’ll talk about printing their contents, and then will continue to the chapters that describe data operations.

Index

Index is somewhat similar to Series. It stores a sequence of String labels, corresponding to column labels in the DataFrame it is attached to. While you normally won’t need to create or manipulate Index outside of a DataFrame, you may use it to access DataFrame column labels.

DataFrame df = DataFrame
        .newFrame("col1", "col2")
        .foldIntStreamByColumn(IntStream.range(0, 10000));

String[] labels = df.getColumnsIndex().getLabels(); (1)
1 Access DataFrame index and read label names.

Printers

When doing data exploration and running data pipelines, it is important to be able to visualize data at every step. More advanced forms of visualization include charts and diagrams, but the simplest thing you can do is print data to the console. Both DataFrame and Series implement the toString() method, which prints their contents as a single line, truncating large data sets in the process. This is good for debugging applications running on a server.

A more human-friendly form of output is produced by a tabular printer. Here is how to use the default tabular printer:

DataFrame df = DataFrame
        .newFrame("col1", "col2", "col3")
        .foldIntStreamByColumn(IntStream.range(0, 10000));

String table = Printers.tabular.toString(df);
System.out.println(table);

It prints the data as follows, displaying at most 6 rows and truncating the rest. The same goes for cell values: values longer than 30 characters are truncated (since the values in our example are short, cell truncation is not visible here).

col1 col2 col3
---- ---- ----
   0 3334 6668
   1 3335 6669
   2 3336 6670
...
3331 6665 9999
3332 6666    0
3333 6667    0
3334 rows x 3 columns

You can change the truncation parameters by creating your own printer:

Printer printer = new TabularPrinter(3, 3); (1)
String table = printer.toString(df);
System.out.println(table);
1 Create a printer that displays at most 3 rows and up to 3 characters in each cell.
c.. c.. c..
--- --- ---
  0 3.. 6..
  1 3.. 6..
...
3.. 6..   0
3334 rows x 3 columns
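One plausible reading of the output above (a guess at the rule implied by the "c.." cells, not DFLib's actual code) is that an overlong value keeps its first max - 2 characters and gets a ".." suffix:

```java
public class TruncationSketch {

    // Truncate a cell value to at most "max" characters, appending ".." when cut
    // (hypothetical helper illustrating the rule implied by the output above)
    static String truncate(String s, int max) {
        return s.length() <= max ? s : s.substring(0, max - 2) + "..";
    }

    public static void main(String[] args) {
        assert truncate("col1", 3).equals("c..");
        assert truncate("abc", 3).equals("abc"); // short values pass through
        System.out.println("ok");
    }
}
```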
If you are using Jupyter Notebook, all the printers are already set up for you. So if the last statement in a Jupyter cell is a DataFrame or a Series, it will be printed as a table in the notebook.

Expressions

DFLib includes a built-in expression language (implemented as a Java "DSL"). It lets you manipulate columns and rows in DataFrames, as well as data in Series. Exp is the interface representing an expression that takes a DataFrame or a Series and produces a Series of the specified type. It also serves as an entry point for creating expressions via its static factory methods.

To use expressions, you’d start by adding a static import of the Exp interface, so that all its factory methods are available directly in the code:

import static com.nhl.dflib.Exp.*;

Now let’s create two simple expressions that return a named and a positional column of the requested type:

StrExp lastExp = $str("last");
DecimalExp salaryExp = $decimal(2);

And now let’s evaluate the expressions:

DataFrame df = DataFrame.newFrame("first", "last", "salary").foldByRow(
        "Jerry", "Cosin", new BigDecimal("120000"),
        "Amanda", "Gabrielly", new BigDecimal("80000"),
        "Joan", "O'Hara", new BigDecimal("95000"));

Series<String> last = lastExp.eval(df);
Series<BigDecimal> salary = salaryExp.eval(df);

This doesn’t look like much (other DFLib APIs would do the same), but this basic abstraction makes it possible to implement a wide range of operations (filtering, sorting, aggregation, etc.). The following chapters will demonstrate various use cases for expressions.

Expressions work on DFLib data structures instead of individual values, so they can achieve the best possible performance for any given operation. Expressions should be the preferred way to manipulate your data instead of the more "direct" APIs.

Now let’s look at various types of expressions…​

Column Expressions

The $str(..) and $decimal(..) expressions in the example above are column lookup expressions. They return a DataFrame column with a given name or at a given (zero-based) position without applying any transformations to the data.

So what is returned when you evaluate a column expression with a Series object? A Series can be thought of as a single (unnamed) column, so a column expression simply returns the Series unchanged, ignoring the implied column name or position.

Factory methods for column expressions are easy to spot in the Exp interface - they all start with a dollar sign. The return type of a column expression is implied from the method name ($str produces a Series<String>, $decimal - a Series<BigDecimal>, $int - a Series<Integer>, and so on).

As of v0.11, DFLib doesn’t do any type conversion of the original column data (the exception being $bool). So you need to select the right type in your code to avoid ClassCastExceptions downstream. This may change in the future.

Complex Expressions

Expressions can be composed of other expressions by invoking static methods of Exp and type-specific instance methods of various Exp subclasses. E.g.:

// "Condition" is an Exp<Boolean> described in more detail below.
Condition c = and(
        $str("last").startsWith("A"),
        $decimal("salary").add($decimal("benefits")).gt(100000.)
);

Numeric Expressions

TODO

Date Expressions

TODO

Conditions

TODO

Sorters

Sorter is a special object that sorts DFLib data structures. Internally, a Sorter uses an expression to retrieve values that comprise the sorting criteria and index them in the specified order. A Sorter can be created from any expression by calling its asc() or desc() methods:

// sort by last name in the ascending order
Sorter s = $str("last").asc();
DFLib allows you to create a Sorter based on any expression type. At runtime, the actual values must be either Java primitives or instances of Comparable, or a ClassCastException will be thrown during sorting.

Selecting Rows and Columns

Before analyzing or transforming a Series or a DataFrame, you often need to narrow it down to just the relevant data. The API for selecting and filtering rows and columns described in this chapter helps with this task.

Selecting Columns

The selectColumns(..) method lets you do a few things at once: select a subset of columns in the DataFrame, reorder columns, and even duplicate columns.

  • Selecting by column name. In the example below we are getting rid of the "middle" column and reordering "first" and "last" columns:

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.selectColumns("last", "first");
last      first
--------- ------
Cosin     Jerry
Gabrielly Amanda
O'Hara    Joan
3 rows x 2 columns
  • Selecting by column position instead of the name:

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.selectColumns(1, 0);

Deleting Columns

DataFrame.dropColumns(..) is an operation complementary to selectColumns. It removes one or more columns, but does not allow reordering the remaining columns.

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.dropColumns("first", "middle"); (1)
1 Drop one or more columns from the DataFrame.
last
---------
Cosin
Gabrielly
O'Hara
3 rows x 1 column

Selecting Rows by Position

  • Selecting a subset of rows by position. As there are no row labels in DFLib yet, selecting by position is the only "direct" selection option. Just like with column selection, it lets you duplicate rows by referencing the same position more than once, as well as reorder rows:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.selectRows(2, 0, 0);
first last
----- ------
Joan  O'Hara
Jerry Cosin
Jerry Cosin
3 rows x 2 columns
  • Selecting using an index. You may precalculate an "index" (an IntSeries) of row positions and then use it for row selection. An index often comes from applying a condition to a Series that is not a part of the DataFrame. E.g.:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

IntSeries rowNumbers = new IntSequenceSeries(0, df.height());

IntSeries selectionIndex = rowNumbers.indexInt(i -> i % 3 == 0); (1)

DataFrame df1 = df.selectRows(selectionIndex);
1 The index is calculated by applying a condition to a Series that is not a part of the DataFrame whose rows we are selecting.
first last
----- -----
Jerry Cosin
1 row x 2 columns

Selecting Rows by Condition

More often than not, we don’t know the specific positions of the rows we would like to select, and instead want to evaluate each row against some condition to decide whether it should be included in the result:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.selectRows(and(
        $str("first").startsWith("J"),
        $str("last").startsWith("O")));
first last
----- ------
Joan  O'Hara
1 row x 2 columns
Unlike positional selection, this flavor does not allow reordering or duplicating rows.

In the example above, the select condition was provided as a Condition object. You can also use a lambda function on a single column:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.selectRows(
        "first",
        (String f) -> f != null && f.startsWith("J"));

Or a lambda function on the entire row:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.selectRows(r ->
        r.get("first").toString().startsWith("J")
                && r.get("last").toString().startsWith("O"));

Finally, a BooleanSeries can also serve as a filter condition (a "mask"). Such a mask is usually the result of a prior logical operation between Series. E.g.:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

Series<String> names = Series.forData("Sandra", "Anton", "Joan");
BooleanSeries mask = names.eq(df.getColumn("first")); (1)

DataFrame df1 = df.selectRows(mask);
1 Mask is calculated by comparing a Series of names with DataFrame column of first names.
first last
----- ------
Joan  O'Hara
1 row x 2 columns

Head and Tail

When you only need the "first N" or "last M" rows of a DataFrame (or values of a Series) you can use head and tail operations. Here is how to get the top 2 rows of a DataFrame:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.head(2); (1)
1 Returns a new DataFrame with two top rows of the original DataFrame.
first  last
------ ---------
Jerry  Cosin
Amanda Gabrielly
2 rows x 2 columns

tail works similarly, but returns the last N rows:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.tail(1);
first last
----- ------
Joan  O'Hara
1 row x 2 columns

Series also define head(..) and tail(..) operations that do what you’d expect from them, returning first or last N values.

Unlike index operations on arrays and lists in Java, head(..) and tail(..) are safe with regard to bounds checking. They DO NOT throw exceptions when the length parameter is bigger than the DataFrame height (or Series size), returning an empty DataFrame (or Series) instead.
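The positional semantics of head and tail can be sketched in plain Java over a List (a sketch of the semantics, not DFLib's implementation; the clamped length simply avoids out-of-bounds exceptions):

```java
import java.util.Arrays;
import java.util.List;

public class HeadTailSketch {

    // First "len" elements, with the length clamped to the data size
    static <T> List<T> head(List<T> data, int len) {
        return data.subList(0, Math.min(len, data.size()));
    }

    // Last "len" elements, with the start index clamped to zero
    static <T> List<T> tail(List<T> data, int len) {
        return data.subList(Math.max(0, data.size() - len), data.size());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Jerry", "Amanda", "Joan");
        assert head(names, 2).equals(Arrays.asList("Jerry", "Amanda"));
        assert tail(names, 1).equals(Arrays.asList("Joan"));
        head(names, 100); // oversized length: no exception is thrown
        System.out.println("ok");
    }
}
```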

Columns Manipulation

Selecting rows and columns, described in the previous chapter, helps to produce a desired subset of data. Here we’ll demonstrate a few extra methods for DataFrame column manipulation:

  • Adding columns with new data that may or may not be derived from the data in other DataFrame columns.

  • Renaming columns

  • Normalizing column data and converting it to a specific Java type

Adding Columns

There are a few ways to add columns to a DataFrame. A rather common case is adding a column by calculating cell values based on data from other columns:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.addColumn(
        concat($str("first"), $val(" "), $str("last")) (1)
                .as("full")  (2)
);
1 Expression producing the column by concatenating values of other columns
2 "as" sets the name of the final expression, which will be the name of the new column

The same can be done using a lambda instead of an expression (though the expression API should be preferred):

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.addColumn(
        "full",                                   (1)
        r -> r.get("first") + " " + r.get("last") (2)
);
1 New column name
2 Operation to produce values for the new column from the existing row. r is a special RowProxy object, providing access to the row being processed.

Multiple columns can be added in a single method call:

DataFrame df1 = df.addColumns(
        $str("last").mapVal(s -> s.charAt(0)).as("last_initial"),
        $str("first").mapVal(s -> s.charAt(0)).as("first_initial")
);
first  last      last_initial first_initial
------ --------- ------------ -------------
Jerry  Cosin     C            J
Amanda Gabrielly G            A
Joan   O'Hara    O            J
3 rows x 4 columns

If you have a Series object of a size matching the DataFrame height, it can be added as a column. E.g., we can add a column that numbers the DataFrame rows:

IntSeries rowNumbers = new IntSequenceSeries(1, df.height() + 1); (1)
DataFrame df1 = df.addColumn(
        "number",
        rowNumbers
);
1 IntSequenceSeries is a special Series that produces a sequence of numbers in the specified range
first  last      number
------ --------- ------
Jerry  Cosin          1
Amanda Gabrielly      2
Joan   O'Hara         3
3 rows x 3 columns

For this particular task (adding row numbers) there’s a shortcut that does the same thing as the code above:

DataFrame df1 = df.addRowNumberColumn("number");

Renaming Columns

Often DataFrames come from external sources (like CSV files or database tables) with column names that require further cleanup to match the expectations of the application. There are a number of ways to rename columns:

  • Rename individual columns:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df
        .renameColumn("first", "first_name")
        .renameColumn("last", "last_name");
first_name last_name
---------- ---------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
  • Specify the new names of all columns at once. This may not be practical for a large number of columns:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.renameColumns("first_name", "last_name");
first_name last_name
---------- ---------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
  • Rename using a map of old names to new names. Names not present in the map will remain unchanged:

Map<String, String> nameMap = new HashMap<>();
nameMap.put("first", "first_name");

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.renameColumns(nameMap);
first_name last
---------- ------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
  • Rename using a function, applied to all column names in turn. This is useful e.g. to convert all names to lower case:

DataFrame df = DataFrame.newFrame("FIRST", "LAST").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.renameColumns(String::toLowerCase);
first last
----- ------
Jerry Cosin
Joan  O'Hara
2 rows x 2 columns

Changing Column Type

Data may come to us in various forms. Often it comes as Strings or binary objects, leaving it up to the application to interpret them. There are a few methods in DataFrame to convert column values from one type to another. The most generic is convertColumn, which converts between any two data types:

DataFrame df = DataFrame.newFrame("year", "sales").foldByRow(
        "2017", "2005365.01",
        "2018", "4355098.75");

DataFrame df1 = df
        .convertColumn(0, (String s) -> Integer.valueOf(s))
        .convertColumn(1, (String s) -> new BigDecimal(s));

Additionally, there are methods that can convert Strings and Numbers to primitive values:

DataFrame df1 = df
        .toIntColumn("year", 0) (1)
        .toDoubleColumn("sales", 0.); (1)
1 The second method argument is the primitive value to use if the value being converted is null

toIntColumn and other such methods are not just syntactic sugar. They convert DataFrame columns to primitive Series, which take significantly less memory and compute most operations faster than Object-based ones. You should always consider using them where possible.

In addition to the methods for numeric primitives, DataFrame has toBooleanColumn, toEnumFromNumColumn and toEnumFromStringColumn methods that serve a similar purpose for booleans and enums.

Cleaning up Data

Data coming into DFLib may be of varying quality and hence may require cleanup before processing. We’ve already discussed the column type conversion operation, which among other things can help normalize data. Another typical case of data requiring cleanup is missing values. Such values are represented as regular Java nulls. There is a variety of reasons why data can be missing. A rather common one is outer joins and concatenation operations, described further down in this document.

In all of these cases, you may have to guess or extrapolate the actual values based on some internal knowledge of the dataset. There are methods in both Series and DataFrame that replace nulls with either fixed values or values from adjacent cells. First, let’s see how this works for Series…​

Filling Nulls in Series

The simplest approach is to replace nulls with a constant value:

Series<String> withNulls = Series.forData("a", "b", null, null, "c");
Series<String> filled = withNulls.fillNulls("X");
a
b
X
X
c
5 elements

Another strategy is to use values adjacent to the cells with nulls:

Series<String> filled = withNulls.fillNullsForward(); (1)
1 fillNullsForward takes a non-null value preceding the null cell and uses it to replace nulls. Similarly, there’s a fillNullsBackwards method that uses the value following the null cell.
a
b
b
b
c
5 elements
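The forward-fill mechanics can be sketched in plain Java: walk the values left to right, remembering the last non-null one (a sketch of the semantics, assuming that nulls preceding the first non-null value stay null):

```java
import java.util.Arrays;

public class FillNullsForwardSketch {

    static String[] fillNullsForward(String[] values) {
        String[] result = values.clone();
        String last = null; // last non-null value seen so far
        for (int i = 0; i < result.length; i++) {
            if (result[i] == null) {
                result[i] = last; // stays null if nothing non-null was seen yet
            } else {
                last = result[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] filled = fillNullsForward(new String[]{"a", "b", null, null, "c"});
        assert Arrays.equals(filled, new String[]{"a", "b", "b", "b", "c"});
        System.out.println("ok");
    }
}
```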

Finally, there’s a way to fill nulls using another Series object as a kind of "template" or "mask":

Series<String> mask = Series.forData("A", "B", "C", "D", "E", "F", "G");
Series<String> filled = withNulls.fillNullsFromSeries(mask);
a
b
C
D
c
5 elements

Notice that the "mask" Series can be longer (or shorter) than the Series where the nulls are replaced. Values between the two Series objects are aligned by position starting at zero.

Filling Nulls in DataFrame

Filling nulls in a DataFrame is done by column, so the API is very similar to the Series.fillNulls* methods, only with an additional String or integer argument specifying the column position.

DataFrame withNulls = DataFrame.newFrame("c1", "c2").foldByRow(
        "1", "1",
        null, null,
        "2", "2");

DataFrame filled = withNulls
        .fillNullsBackwards("c1")
        .fillNullsForward("c2");
c1 c2
-- --
1  1
2  1
2  2
3 rows x 2 columns

Sorting

You can sort values in Series, and sort rows in DataFrames. As is the case everywhere else in DFLib, sorting does not alter the original object, and instead creates a new instance of either Series or DataFrame.

First let’s look at sorting Series…​

Sorting Series

Series provides a sort method to sort its data using a sorter built from an expression:

// sort series by String length
Series<String> s = Series
        .forData("12", "1", "123")
        .sort($str("").mapVal(String::length).asc());

The result is a new Series with sorted data:

1
12
123
3 elements

Alternatively the same can be achieved using a Comparator:

Series<String> s = Series
        .forData("12", "1", "123")
        .sort(Comparator.comparingInt(String::length));

The next example shows how to sort Series in the "natural" order (alphabetically for Strings):

Series<String> s = Series
        .forData("c", "d", "a")
        .sort($str("").asc());
a
c
d
3 elements

Series of primitives have methods to sort values in the natural order without an explicit sorter or comparator:

LongSeries s = LongSeries
        .forLongs(Long.MAX_VALUE, 15L, 0L)
        .sortLong();
                  0
                 15
9223372036854775807
3 elements

Additionally, IntSeries has an optimized method to sort ints with a custom IntComparator.

Next we’ll check how to sort a DataFrame.

Sorting DataFrame

Rows in a DataFrame can be sorted with one or more sorters.

Just like in other places where a Sorter might be used, there is an assumption that the Sorter expression produces values that are either Java primitives or are compatible with java.lang.Comparable (i.e., Strings, numbers, etc).
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Jerry", "Albert", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort(
        $str("first").asc(),
        $str("last").asc());
first  last      middle
------ --------- ------
Amanda Gabrielly null
Jerry  Albert    null
Jerry  Cosin     M
Joan   O'Hara    J
4 rows x 3 columns

Alternatively, sorting can be done by a single column name or an array of columns. The assumption is the same as above - values in the columns used for sorting must be either Java primitives or implement java.lang.Comparable.

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort("first", true); (1)
1 The first argument is the column name (can also be column integer index), the second - a boolean indicating sort direction (true for ascending, false - for descending order).
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Jerry", "Albert", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort(new String[]{"last", "first"}, new boolean[]{true, false});

Concatenating

There are various ways to combine data from multiple Series or DataFrames. The simplest is concatenation, described in this chapter. Series can be concatenated together producing a longer Series. DataFrames can be concatenated either vertically (along the rows axis) or horizontally (along the columns axis).

Concatenating Series

If you have a Series object and want to concatenate it with one or more other Series, you’d use Series.concat(..) method:

Series<String> s1 = Series.forData("x", "y", "z");
Series<String> s2 = Series.forData("a");
Series<String> s3 = Series.forData("m", "n");

Series<String> sConcat = s1.concat(s2, s3);

The result, as expected, contains the values of all Series put together:

x
y
z
a
m
n
6 elements

If you have a collection or an array of Series and want to "glue" them together, you can use the static SeriesConcat.concat(..):

Collection<Series<String>> ss = asList(
        Series.forData("x", "y", "z"),
        Series.forData("a"),
        Series.forData("m", "n"));

Series<String> sConcat = SeriesConcat.concat(ss);

The result is the same as in the previous example. You can also concatenate a Series with itself:

Series<String> s = Series.forData("x", "y");
Series<String> sConcat = s.concat(s);
x
y
x
y
4 elements

Concatenating DataFrames

DataFrame offers two options for concatenation - vertical (stacking DataFrames on top of each other) and horizontal (putting them next to each other). Let’s see some examples…​

DataFrame df1 = DataFrame.newFrame("a", "b").foldByRow(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.newFrame("a", "b").foldByRow(
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(df2); (1)
1 Concatenate this and another DataFrame vertically. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b
- -
1 2
3 4
5 6
7 8
4 rows x 2 columns
DataFrame dfh = df1.hConcat(df2); (1)
1 Concatenate this and another DataFrame horizontally. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b a_ b_
- - -- --
1 2 5  6
3 4 7  8
2 rows x 4 columns
Since both df1 and df2 had the same column names, and a DataFrame’s column names must be unique, an _ suffix was automatically appended to the conflicting columns in the resulting DataFrame. We will see this auto-renaming behavior in other places, such as joins.

So far all our concatenation examples consisted of DataFrames that had matching dimensions (same number and names of columns for vConcat and same number of rows for hConcat). But what if concatenated DataFrames are shaped or named differently?

Concat methods may take an extra "how" parameter to define concatenation semantics. The type of the "how" parameter is JoinType, and can be one of inner (default), left, right, full.

We’ll see JoinType again soon when discussing joins. Concatenation and joins are related fairly closely.

Let’s look at how this works with vConcat:

DataFrame df1 = DataFrame.newFrame("b", "a").foldByRow(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.newFrame("a", "c").foldByRow( (1)
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(JoinType.inner, df2); (2)
1 df1 column names are "b" and "a", while df2’s are "a" and "c".
2 Explicitly passing JoinType.inner to vConcat to demonstrate the point. Since inner is the default, omitting it would not change the outcome.
a
-
2
4
5
7
4 rows x 1 column

As you can see in the result, the behavior of inner join is to only keep columns that are present in both df1 and df2. Columns are matched by name, regardless of their order (though the order of columns is preserved in the result, following the order in each DataFrame being joined, left to right).

Changing the semantics from inner to left gives us all the columns of the leftmost DataFrame; positions in columns missing from one of the concatenated DataFrames are filled with nulls:

DataFrame dfv = df1.vConcat(JoinType.left, df2);
b    a
---- -
1    2
3    4
null 5
null 7
4 rows x 2 columns

We leave it as an exercise for the reader to try right and full joins.
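As a hint, based on the column-union semantics described above, a full concatenation of the same two DataFrames might be sketched as follows (the expected shape is inferred from those semantics, not verified output):

```java
// "full" keeps the union of columns from both DataFrames,
// filling the positions missing from a given source with nulls
DataFrame df1 = DataFrame.newFrame("b", "a").foldByRow(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.newFrame("a", "c").foldByRow(
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(JoinType.full, df2);

// expected: 4 rows and 3 columns ("b", "a", "c"), with nulls in "c"
// for df1 rows and nulls in "b" for df2 rows
```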

hConcat works similarly, except that concatenation is done by row position, as rows have no names:

DataFrame df1 = DataFrame.newFrame("a", "b").foldByRow(
        1, 2,
        3, 4,
        5, 6);

DataFrame df2 = DataFrame.newFrame("c", "d").foldByRow(
        7, 8,
        9, 10);

DataFrame dfv = df1.hConcat(JoinType.left, df2);
a b c    d
- - ---- ----
1 2 7    8
3 4 9    10
5 6 null null
3 rows x 4 columns

Joins

Grouping and Aggregating

Window functions

Working with .csv Files

CSV (comma-separated values) is a very common and rather simple format for working with raw data. CSV files can be manipulated programmatically or manually (via Excel/LibreOffice), loaded to / from databases, etc. DFLib supports reading DataFrames from CSV and storing them to CSV. You need to add an extra dependency to your project to take advantage of this functionality:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib-csv</artifactId>
</dependency>

Once you do that, the Csv class is the entry point to all the .csv-related operations discussed below.

Reading DataFrame from CSV

The simplest API for reading a CSV is this:

DataFrame df = Csv.load("src/test/resources/f1.csv"); (1)
1 The argument to the "load" method can be a filename, a file, or a Reader, so it can be loaded from a variety of sources.

The result is a DataFrame that matches CSV structure:

A B
- --
1 s1
4 s2
2 rows x 2 columns

DFLib made a few assumptions about the data in f1.csv, namely that the first row represents column names, that all columns are Strings, and that all of them need to be included. These assumptions are not necessarily true for many data sets. So there is a longer form of this API that allows you to configure column types, skip rows and columns, or even sample the data without loading the entire CSV in memory. Some examples:

DataFrame df = Csv.loader() (1)
        .skipRows(1) (2)
        .header("x", "y") (3)
        .intColumn("x") (4)
        .load("src/test/resources/f1.csv");
1 Instead of "load" method use "loader".
2 Skip the header row.
3 Provide our own header.
4 Convert the first column to int.
x y
- --
1 s1
4 s2
2 rows x 2 columns
In theory you don’t have to do most of these tweaks via the CSV loader. You can load the raw data, and then use standard DataFrame transformations to shape the result. However, doing it via the loader allows DFLib to optimize both load speed and memory use, so it is usually preferable.
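Other tweaks mentioned above (column selection, sampling) can also be pushed into the loader. The sketch below assumes CsvLoader methods named selectColumns(..) and sampleRows(..); verify the names against your DFLib version:

```java
// Sketch only: selectColumns(..) and sampleRows(..) are assumed method names
DataFrame df = Csv.loader()
        .selectColumns("A")            // read just one of the columns
        .sampleRows(2, new Random(42)) // keep a 2-row sample instead of the whole file
        .load("src/test/resources/f1.csv");
```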

Writing DataFrame to a CSV

Writing to a CSV is equally easy:

Csv.save(df, "target/df.csv"); (1)
1 The argument to the "save" method can be a filename, a file, a Writer (or generally Appendable), so it can be stored to a variety of destinations.

Just like with the loader, CSV saver provides its own set of options, if the defaults are not sufficient:

Csv.saver() (1)
        .createMissingDirs() (2)
        .format(CSVFormat.EXCEL) (3)
        .save(df, "target/csv/df.csv");
1 Instead of "save" method use "saver" to be able to customize the process.
2 If intermediate directories for the output file are missing, create them.
3 Provide an alternative format (either from a collection of known formats, or user-specific).

Working with .avro Files

The Avro binary format is common in data engineering. Its main advantage is that files are saved with an embedded schema, so when they are read back, the fields are converted to the correct original Java types. (This is different from, say, CSV: reading a .csv into a DataFrame by default only produces String columns.) To save or load DataFrames in the Avro format, include the following dependency:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib-avro</artifactId>
</dependency>

Once you do that, Avro class is the entry point to all the operations. With it you can save DataFrames to .avro files and load them back.
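A minimal round trip might be sketched as below. The Avro.save(..) / Avro.load(..) method names are assumptions mirroring the Csv entry point; verify them against the dflib-avro module:

```java
// Assumed API mirroring Csv; verify against dflib-avro before relying on it
DataFrame df = DataFrame.newFrame("id", "name").foldByRow(
        1, "Jerry Cosin",
        2, "Joan O'Hara");

Avro.save(df, "target/df.avro");

// unlike CSV, "id" is restored as a numeric column, not a String
DataFrame loaded = Avro.load("target/df.avro");
```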

Avro Schema

In most cases you don’t need to know anything about Avro schemas. DFLib automatically generates a schema for a DataFrame before it is saved. Alternatively, you can create a custom Schema object based on the Avro specification and pass it to AvroSaver.

TODO…​

Working with Relational Databases

Relational databases are arguably the most important type of data stores out there. Also they happen to map really well to DataFrames. DFLib provides advanced support for loading and saving DataFrames to RDBMS. It supports transactions, auto-generation of certain SQL statements, merging of data on top of the existing data ("create-or-update"), building custom SQL for selects and updates. It knows how to treat different DB "flavors" and does a number of other cool database-related things.

To start using all this, you need to import dflib-jdbc module:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib-jdbc</artifactId>
</dependency>

<!-- Of course you also need to import your preferred JDBC driver -->

JdbcConnector

Once the imports are set up, you’d use the Jdbc class to obtain an instance of JdbcConnector that can later be used for all DB operations. You may already have a preconfigured javax.sql.DataSource, so the simplest way to create a connector is to pass that DataSource to the Jdbc factory method:

JdbcConnector connector = Jdbc.connector(dataSource);

Alternatively you can build everything from scratch:

JdbcConnector connector = Jdbc
        .connector("jdbc:derby:target/derby/mydb;create=true")
        // .driver("com.foo.MyDriver") (1)
        // .userName("root")
        // .password("secret") (2)
        .build();
1 (Optional) Driver name. Some drivers are not properly registered with the DriverManager, and require an explicit declaration.
2 DB account username/password (not needed for our in-memory Derby example, but will definitely be required for a real database).

TableLoader / TableSaver

Once you have the connector, you can start reading and persisting DataFrame data. Many DataFrames map directly to individual tables (or views) defined in the database. For those DFLib provides a set of rather straightforward operations that do not require the user to write SQL (all needed SQL is auto-generated by the framework):

DataFrame df = connector.tableLoader("person").load();
id name             salary
-- ---------------- --------
 1 Jerry Cosin      70000.0
 2 Amanda Gabrielly 85000.0
 3 Joan O'Hara      101000.0
3 rows x 3 columns

Table loader provides a way to customize the load operation. It allows you to select specific columns, set the maximum number of rows to read, sample rows, and even specify a fetch condition as another DataFrame. Some examples:

DataFrame condition = DataFrame.newFrame("salary")
        .addRow(70_000)
        .addRow(101_000)
        .create();

DataFrame df = connector.tableLoader("person")
        .includeColumns("name", "salary")
        .eq(condition)
        .load();
name        salary
----------- --------
Jerry Cosin 70000.0
Joan O'Hara 101000.0
2 rows x 2 columns

What it doesn’t require (unlike the CSV loader) is explicit column types, as the proper value types are inferred from the database metadata.

Table saver allows you to save a DataFrame to a table. Column names in the DataFrame should match the column names in the DB table:

DataFrame df = DataFrame.newFrame("id", "name", "salary")
        .addRow(1, "Jerry Cosin", 70_000)
        .addRow(2, "Amanda Gabrielly", 85_000)
        .addRow(3, "Joan O'Hara", 101_000)
        .create();

connector.tableSaver("person").save(df);

In this scenario the table saver executes an insert for each DataFrame row. But what if there is already existing data in the table? The user has a few options for overwriting or merging the data:

  • Append the data (that’s what we effectively did above).

  • Delete all existing data before doing the insert.

  • Merge the data by comparing DataFrame and DB table data on PK column(s) or an arbitrary set of columns. Insert missing rows, update the existing rows. If the table has more columns than there are columns in the DataFrame, the data in those extra columns is preserved.

Delete before insert example:

connector.tableSaver("person")
        .deleteTableData()
        .save(df);

Merge by PK example (PK columns are detected from DB metadata):

connector.tableSaver("person")
        .mergeByPk()
        .save(df);
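To merge on an arbitrary set of columns instead of the PK, the table saver has a similar method; the name mergeByColumns(..) below is an assumption to be verified against your DFLib version:

```java
// Merge by an explicit column set instead of the PK
// (method name is assumed; check the TableSaver API)
connector.tableSaver("person")
        .mergeByColumns("name")
        .save(df);
```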

SqlLoader / SqlSaver

TODO
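While this chapter is a TODO, here is a rough sketch of loading a DataFrame via custom SQL. It assumes a connector.sqlLoader(..) method whose load(..) binds positional parameters; verify against the dflib-jdbc API:

```java
// Assumed API: sqlLoader(sql).load(params...) with positional bindings
DataFrame df = connector
        .sqlLoader("SELECT name, salary FROM person WHERE salary > ?")
        .load(75_000);
```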

Working in Jupyter Notebook

While DFLib runs inside regular Java applications, it also works perfectly in a Jupyter notebook, which is a very common environment among data scientists, and is usually associated with Python / pandas. Jupyter Notebook is very visual and allows you to debug each step of your data load and transformation pipeline individually, without the need to rerun the whole thing.

Combining DFLib with Jupyter provides a pandas-like notebook experience, only in pure Java. The code developed in a notebook can later be copied and pasted into your application. Java developers should consider using Jupyter as a part of their data project workflow, and DFLib’s design (and its Jupyter integration features described here) makes this rather easy.

Installing Jupyter with Java Kernel

  • You will need at least Java 9 to run Java in Jupyter, as it relies on jshell. Java 11 or later is recommended.

  • Install Jupyter. Follow Jupyter installation instructions. It requires Python. Out of the variety of installation options, the authors of this document are using Mac OS, pyenv, Python 3.7, and pip.

  • Install iJava Jupyter "kernel"

Work with DFLib in Jupyter

Start Jupyter

jupyter notebook

The browser opens. Click on "New > Java". In the document cell enter something like this:

%maven com.nhl.dflib:dflib:0.11
%maven com.nhl.dflib:dflib-jupyter:0.11

import com.nhl.dflib.jupyter.*;
import com.nhl.dflib.*;

DFLibJupyter.init(getKernelInstance());

Click "Shift + Return" to execute the cell. If the are no errors, you can start using DFLib API in the following cells. E.g.:

DataFrame df = DataFrame.newFrame("a", "b", "c").foldIntByColumn(-1, 1, 2, 3, 4, 5, 6);

// the result of the last statement in a cell is printed just below it
df

Change Display Parameters

To control truncation behavior, you can use static methods on DFLibJupyter:

DFLibJupyter.setMaxDisplayRows(10);
DFLibJupyter.setMaxDisplayColumnWidth(50);

Notebooks and Version Control

Jupyter stores the data generated by the notebook in the notebook file itself. This creates a challenge for maintaining notebooks under version control. Assuming you are using Git, you will need to add Git hooks to your project to strip off the data before commit.

TODO: a recipe for excluding the data

Unit Testing