What's New in RapidMiner Studio 9.7.0?

Released: June 2nd, 2020

The following describes the new features, enhancements, and bug fixes in RapidMiner Studio 9.7.0:

New Features

  • Added versioned projects which are tied to RapidMiner Server. You can have as many versioned projects as you like, no limits! The versioning is backed by Git and can be accessed by any regular Git client. This means sharing between Python/R coders and RapidMiner users has never been easier!
    • Added dialog to select which version of a file to keep in case of a conflict in the versioned projects while getting Snapshots from Server.
    • Versioning happens on a project level. As you can now have as many projects as you like, this is the most sensible behavior because most of the time many entries are interconnected in a project. Thus the entire state is saved and can be restored later, without having to worry about dependency versions.
    • Projects support ALL files you may have on your computer! You can put your .py scripts, your .md files, your .png files, your .pdf files, etc. all into a project. They will be neatly displayed in RapidMiner Studio.
    • Of course, all those files can be versioned together, so RapidMiner users and Python coders can share the same Git repository. The Python coders can even use their native Git client to do so, no magic required. This will make collaboration between RapidMiner users and Python coders easier than ever before!
    • Processes in versioned projects can also be run and scheduled on RapidMiner Server as they can for an existing Server central repository
    • All the files live locally on your computer, but are also shared via Git. This gives you the performance of a local repository when working with it during prototyping, but also allows for easy collaboration with your colleagues.
  • Added new panel "Snapshot History" which lets you browse the history of your versioned projects, as well as see the changes you've made since the latest snapshot. It can also be used to restore an earlier state of the project, view past versions of individual files, and restore those past versions.
  • ExampleSets are now written to disk in a new file format: HDF5. This is a well-established format used, e.g., by NASA to store large amounts of data. This also means that Python and RapidMiner Studio can exchange data via HDF5 files much more easily and faster than ever before.
  • Local repositories created with RapidMiner Studio 9.7 or later also support all files you may have on your computer (.py, .jpeg, .pdf, etc.).
  • New operator Target Encoding which can remove nominal attributes with too many values and perform a target encoding (also known as mean encoding) on the remaining attributes
  • Auto Model: some processes (e.g. SVM, FLM, or weight calculations) now use the new Target Encoding instead of one-hot encoding which reduces memory usage and run times
  • Time Series: New operator Integrate to integrate time series with different methods (cumulative sum / left and right Riemann sum / trapezoidal rule)
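
The integration methods named for the Integrate operator can be illustrated with a small, self-contained Java sketch. This is not the operator's implementation: the class and method names are made up for illustration, the series is assumed to be evenly spaced with step width dx, and each method returns only the total (or, for the cumulative sum, the running totals), which may differ from what the operator outputs.

```java
import java.util.Arrays;

/** Hypothetical sketch of four integration schemes for an evenly spaced series. */
final class IntegrateSketch {

    /** Running total of the raw values (no step width applied). */
    static double[] cumulativeSum(double[] y) {
        double[] out = new double[y.length];
        double sum = 0.0;
        for (int i = 0; i < y.length; i++) {
            sum += y[i];
            out[i] = sum;
        }
        return out;
    }

    /** Left Riemann sum: each interval contributes its left value times dx. */
    static double leftRiemann(double[] y, double dx) {
        double sum = 0.0;
        for (int i = 0; i < y.length - 1; i++) {
            sum += y[i] * dx;
        }
        return sum;
    }

    /** Right Riemann sum: each interval contributes its right value times dx. */
    static double rightRiemann(double[] y, double dx) {
        double sum = 0.0;
        for (int i = 1; i < y.length; i++) {
            sum += y[i] * dx;
        }
        return sum;
    }

    /** Trapezoidal rule: each interval contributes the mean of both ends times dx. */
    static double trapezoid(double[] y, double dx) {
        double sum = 0.0;
        for (int i = 0; i < y.length - 1; i++) {
            sum += 0.5 * (y[i] + y[i + 1]) * dx;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] y = {0.0, 1.0, 4.0, 9.0}; // x^2 sampled at x = 0, 1, 2, 3
        System.out.println(Arrays.toString(cumulativeSum(y))); // [0.0, 1.0, 5.0, 14.0]
        System.out.println(leftRiemann(y, 1.0));               // 5.0
        System.out.println(rightRiemann(y, 1.0));              // 14.0
        System.out.println(trapezoid(y, 1.0));                 // 9.5
    }
}
```

For this increasing series the left sum underestimates and the right sum overestimates the exact integral of x^2 over [0, 3], which is 9; the trapezoidal rule lands between the two.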

Enhancements

  • Both local repositories and versioned projects (tied to RM Server) have been completely rebuilt to get rid of many old limitations. Benefits include:
    • Enhanced throughput and performance
    • Better metadata caching
    • Concurrent access support
    • Displaying all files (no matter what they are, e.g. Python scripts, images, ...)
    • Allowing different file types (e.g. data, processes) and folders to share the same name
    • Note: Your existing local repositories have (Legacy) after their name, indicating they still run on the old technology and still have some of the limitations! If you create a new local repository, it will have (Local) after its name and have all the capabilities listed above. You can copy your data over via Studio from the old repository to a new one to migrate.
  • It is now possible to have a folder with the same name as a data entry in the repository (might not work for some old repositories)
  • It is now possible to have a process and a data entry with the same name in the repository (might not work for some old repositories)
  • Replaced Send Mail operator with new version which supports file attachments
  • Improved memory usage for Aggregate and Pivot operators for nominal columns with potentially a lot of unused values
  • Improved handling of whitespace in repository entry names
  • Improved cleanup of temp files, to reduce disk space clutter when Studio runs for a long time, e.g. in a Server environment
  • Made log tables in Result View behave more like other results, adding more actions and a shortcut to the context menu
  • Process background images now use a relative path to the image if possible, instead of an absolute path. This only applies to background images set from now on; it does not work retroactively
  • For binominal attributes the Statistics tab shows the positive and the negative value
  • Renamed RapidMiner Server to RapidMiner AI Hub
  • The Process panel is now opened or moved into the foreground when a process is opened in the Design view, to make it more obvious that something happened
  • Auto Model: remote executions on Server require the central repository as storage location
  • Turbo Prep: only local file based repositories can now be used as temporary repositories for the handover to Auto Model
  • Model Ops: only local repositories or central Server repositories can be used as storage locations for deployed models (also known as "deployment location")
  • Model Ops: keep unused and ID columns in the results after scoring
  • The operators Explain Predictions and Model Simulator now also support grouped models where arbitrary models have been grouped instead of only preprocessing models
  • The operator Explain Predictions now offers a parameter to limit the number of important features also for the "importances" output
  • Time Series
    • Added options to use padding for Fast Fourier Transformation and calculate the frequency of the amplitude value.
    • Added the option to specify negative lags for the Lag operator
    • Added the option to specify a default lag for a set of attributes (selected by an attribute subset selector) to the Lag operator
      • Unfortunately, due to parameter key incompatibilities, the old version of the Lag operator is deprecated and a new version with the same name but a different operator key has been added.
  • H2O
    • Updated H2O library to version 3.30.0.1.
    • Added monotonicity constraints to Gradient Boosted Trees
    • Added weights port to Deep Learning
    • Expanded whitelist of accepted expert parameters, now supports all parameters provided by H2O
    • Deep Learning and Logistic Regression now work with datasets that have nominal columns with only one value

Bugfixes

  • Fixed an issue that could cause Studio startup to never complete
  • Made Studio startup stricter so that it quits the process instead of silently hanging on the splash screen forever
  • Fixed issue that could cause panels to sometimes not open if they had been closed previously in this session
  • Fixed an issue that caused CTAs to not work when HTML5 safe mode was enabled
  • Fixed an issue with back propagation of changes to performance vectors
  • Fixed a problem for JDBC drivers that do not implement a certain set of functionality by adding a fallback (e.g. SQLite writing)
  • Fixed potential cause for complete UI freeze when interacting with a CTA notification banner
  • Fixed an issue with process navigation and property panel if operator names contain HTML
  • Generate Multi-Label Data now works correctly in non-regression mode
  • Fixed memory leak caused by the Visualizations
  • Fixed rare issue where data sets could not be downsampled automatically if license limit was exceeded
  • Fixed an issue in Automatic Feature Engineering that occurred in the feature selection case when all input features were nominal
  • Fixed "Edit Access Rights" dialog for Server repositories not getting the permissions correctly when using Enterprise SSO
  • Fixed an issue that caused Studio to lag and increase memory consumption when using the right-click "Insert operator" popup menu in the Process panel.
  • Fixed moving data entries to a different repository: an existing entry was duplicated instead of being replaced
  • Auto Model: remote executions now show a new submission screen that only allows resetting Auto Model to load the results, which avoids problems with multiple remote submissions within the same session
  • Auto Model: reordering the columns in the column selection table no longer leads to graphics problems
  • Time Series: Fixed a bug in Extract Peaks that caused all "_position" features to have an offset of 1 to the Example number

Known issues

  • One Hot Encoding does not produce the desired results. This will be fixed with the next patch release.

Special notes

  • Columns of type "Integer" that were previously stored as integers are now stored as their double representation. This of course means more range (~53 bit precision), but also means that values are no longer capped. This might have an impact when storing data to disk and rereading it.
  • Columns of type "Date" no longer store the milliseconds due to the new file format. This might have an impact on equality tests and matching when storing data to disk and rereading it.
  • Visualizations that have been created locally for data sets stored in repositories will not be found anymore after the update, causing the result visualization to reset to its default. If you have set up complex visualizations that you absolutely want to restore, you can follow these steps:
    1. Open the data set in the Results view of RapidMiner Studio.
    2. Navigate on your disk via your filesystem explorer into the "USER_HOME/.RapidMiner/internal cache/content mapper" folder. There you can find a folder structure matching your repository names and structure.
    3. Find the exact path to the data set (e.g. "C:/Users/xyz/.RapidMiner/internal cache/content mapper/Local Repository/Charts/Demo/12. Pie")
    4. You should see a very similar path right next to it, either ending in ".ioo" or ".rmhdf5table" (e.g. "C:/Users/xyz/.RapidMiner/internal cache/content mapper/Local Repository/Charts/Demo/12. Pie.ioo")
    5. Go into the folder from step 3 (the one without the .ioo ending), and copy the "pc.json" file from it to the folder from step 4 (the one with the .ioo ending)
    6. Close the data set in the Results view
    7. Open it again. It should now have its configuration back!
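
The first two notes in this section (integers stored as doubles, dates stored without milliseconds) can be made concrete with plain Java. This is a sketch that uses only the standard library and no RapidMiner classes. The class and method names are illustrative, and the date example assumes the milliseconds are truncated rather than rounded, which the notes above do not specify.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch of the two storage effects described in the special notes. */
final class StorageNotesSketch {

    /** True if the integer value can be stored in a double without losing precision. */
    static boolean isExactAsDouble(long value) {
        return new BigDecimal((double) value).compareTo(BigDecimal.valueOf(value)) == 0;
    }

    /** Drops the sub-second part of a timestamp (assumes truncation, not rounding). */
    static Instant withoutMillis(Instant timestamp) {
        return timestamp.truncatedTo(ChronoUnit.SECONDS);
    }

    public static void main(String[] args) {
        long limit = 1L << 53; // 9007199254740992
        // Values beyond the 32-bit int range are fine as doubles ...
        System.out.println(isExactAsDouble(2147483648L)); // true
        System.out.println(isExactAsDouble(limit));       // true
        // ... but not every integer above 2^53 has an exact double representation.
        System.out.println(isExactAsDouble(limit + 1));   // false

        Instant original = Instant.parse("2020-06-02T10:15:30.123Z");
        Instant reread = withoutMillis(original);
        System.out.println(reread);                  // 2020-06-02T10:15:30Z
        System.out.println(reread.equals(original)); // false
    }
}
```

The last line shows why equality tests and matching on date columns can behave differently after data has been stored and read back.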

Development

The introduction of versioned projects (backed by Git) has forced a major redesign of the Repository API. Up until 9.7, a RepositoryLocation was represented by a string like "//RepositoryName/folder/test" and "test" was guaranteed to be unique. It was either a folder, a process, an IOObject (data), or a blob. This is no longer the case!

Since collaboration with Git can introduce naming conflicts which are not actually file-level conflicts (so Git is fine with them), we had to allow these "non-conflicts" into the Repository world as well.

Now a repository location that ends with "test" as the last path element can either depict a folder (RepositoryLocationType#FOLDER), or data (RepositoryLocationType#DATA_ENTRY). Sometimes this is unknown, which is also fine: RepositoryLocationType#UNKNOWN can be used in that case. However, it does not stop there. Since for Git, "test.rmp" and "test.ioo" are also perfectly fine, we had to go one step further and also allow that. Therefore, a RepositoryLocation now also has an expected DataEntry (sub-)type which is used to determine what specific type of a DataEntry to locate (a ProcessEntry, an IOObjectEntry, a ConnectionEntry, or a BinaryEntry).

You can even end up in the undesirable situation of having a "test.ioo" and a "test.rmhdf5table" (both IOObjects) in the same location. Because we cannot determine which IOObject a process should potentially use, these situations must be rectified by the user - the Retrieve operator will throw an error in that case! Looking at the data and renaming one of the entries will work fine, though. This scenario can only happen after a Git pull with the new versioned projects.
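
The duplicate situation described above can be sketched in plain Java. This is a simplified model, not the Retrieve operator or the Repository API: the class and method names are hypothetical, a folder is represented as a list of file names, and a standard IllegalStateException stands in for the real error.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch: refuse to pick an IOObject when two share a name. */
final class DuplicateIOObjectSketch {

    /** The two IOObject suffixes mentioned in the text above. */
    private static final Set<String> IOOBJECT_SUFFIXES =
            new HashSet<>(Arrays.asList("ioo", "rmhdf5table"));

    /** Returns the single IOObject file called name, or fails if it is missing or ambiguous. */
    static String locateIOObject(List<String> fileNames, String name) {
        List<String> matches = fileNames.stream()
                .filter(file -> {
                    int dot = file.lastIndexOf('.');
                    return dot > 0
                            && file.substring(0, dot).equals(name)
                            && IOOBJECT_SUFFIXES.contains(file.substring(dot + 1));
                })
                .collect(Collectors.toList());
        if (matches.isEmpty()) {
            throw new IllegalArgumentException("No IOObject named '" + name + "'");
        }
        if (matches.size() > 1) {
            // The user has to rename one of the entries to resolve this.
            throw new IllegalStateException("Ambiguous IOObject '" + name + "': " + matches);
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        // A process and an IOObject with the same name are not a conflict.
        System.out.println(locateIOObject(Arrays.asList("test.rmp", "test.ioo"), "test")); // test.ioo
        try {
            locateIOObject(Arrays.asList("test.ioo", "test.rmhdf5table"), "test");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```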

In other words, "test" can in our example now be a folder, a process, a data ioobject, a connection entry, or a binary entry. And they can all exist at the very same time in the very same folder. So be sure to specify in the new RepositoryLocationBuilder what exactly you want from the repository, or you may end up getting the first name match it finds, which may be of an unexpected type.
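
To illustrate why the requested type matters, here is a minimal Java sketch of a folder in which several entries share one name. It does not use RepositoryLocationBuilder or any other RapidMiner class: the Kind enum, the Entry class, and both lookup methods are hypothetical stand-ins for the idea.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: name-only lookup versus lookup by name and expected type. */
final class TypedLookupSketch {

    enum Kind { FOLDER, PROCESS, IOOBJECT, CONNECTION, BINARY }

    static final class Entry {
        final String name;
        final Kind kind;

        Entry(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    /** Name-only lookup: returns whichever matching entry happens to come first. */
    static Optional<Entry> locateByName(List<Entry> folder, String name) {
        return folder.stream()
                .filter(entry -> entry.name.equals(name))
                .findFirst();
    }

    /** Typed lookup: only an entry of the expected kind is returned. */
    static Optional<Entry> locate(List<Entry> folder, String name, Kind expected) {
        return folder.stream()
                .filter(entry -> entry.name.equals(name) && entry.kind == expected)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Entry> folder = Arrays.asList(
                new Entry("test", Kind.FOLDER),
                new Entry("test", Kind.PROCESS),
                new Entry("test", Kind.IOOBJECT));

        // Without a type, the caller gets the first name match, here a folder.
        System.out.println(locateByName(folder, "test").get().kind);          // FOLDER
        // With an expected type, the caller gets what was asked for.
        System.out.println(locate(folder, "test", Kind.IOOBJECT).get().kind); // IOOBJECT
        // A kind that does not exist under that name yields nothing.
        System.out.println(locate(folder, "test", Kind.CONNECTION).isPresent()); // false
    }
}
```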

  • Repositories now distinguish between data and folders, and even between different data subtypes (process, ioobject, connection, binary entry) which means you can have a folder called "A" and e.g. a process called "A" at the same time. This has implications for a large number of APIs, most notably:
    • com.rapidminer.repository.Repository interface:
      • locateFolder(String) and locateData(String, Class) have been added and can be implemented; their default implementations point to the RepositoryManager#locateFolder(String) and locateData(String, Class<? extends DataEntry>) methods
      • getIOObjectEntrySubtype(Class<? extends IOObject> ioObjectClass) has been added and can be implemented, the default implementation returns IOObjectEntry.class. This is used for the new file-based repository implementations (Local and versioned Project) that will ultimately have different file suffixes on disk for every distinct IOObject type (instead of all of them sharing the legacy .ioo suffix)
      • isTransient() has been added, defaults to false. This is used to hide temporary repositories from the repositories panel and from the Global Search if true.
      • locate(String) has been deprecated and should not be used anymore because it cannot know whether a file or a folder is requested
    • com.rapidminer.repository.RepositoryManager class:
      • locate(Repository, String, boolean) has been deprecated and replaced with locateFolder(Repository, String, boolean) and locateData(Repository, String, Class<? extends DataEntry>, boolean, boolean)
    • com.rapidminer.repository.Folder interface:
      • containsFolder(String) and containsData(String, Class<? extends DataEntry>) have been added and must be implemented
      • containsEntry(String) has been deprecated and should not be used anymore because it cannot know whether a file or a folder is requested
      • canRefreshChildFolder(String) and canRefreshChildData(String) have been added and must be implemented
      • canRefreshChild(String) has been deprecated and should not be used anymore because it cannot know whether a file or a folder is requested
    • com.rapidminer.operator.Operator class:
      • getParameterAsRepositoryLocation(String) has been deprecated and should not be used anymore because it cannot know whether a file or a folder is requested
      • getParameterAsRepositoryLocationData(String, Class<? extends DataEntry>) has been added for looking for data
      • getParameterAsRepositoryLocationFolder(String) has been added for looking for folders
    • com.rapidminer.repository.RepositoryLocation class:
      • locateEntry() has been deprecated and replaced with locateFolder() and locateData() (same as above)
      • ALL constructors have been deprecated and replaced with a builder: com.rapidminer.repository.RepositoryLocationBuilder
      • getRepositoryLocation(String, Operator) has been deprecated and replaced with getRepositoryLocationFolder(String, Operator) and getRepositoryLocationData(String, Operator, Class<? extends DataEntry>)
      • added getLocationType() and setLocationType(RepositoryLocationType) which are used to specify whether a RepositoryLocation references a folder, a data entry, or that it is not known what it references
      • added getExpectedDataEntryType() and setExpectedDataEntryType(Class<? extends DataEntry>) which are used to specify what data entry (sub-)type is expected. Not used if RepositoryLocationType#FOLDER is expected.
      • added isFailIfDuplicateIOObjectExists() and setFailIfDuplicateIOObjectExists(boolean), which control whether a RepositoryIOObjectEntryDuplicateFoundException is thrown when locateData() is called (an IOObjectEntry is requested), but there are at least two IOObject entry subtypes with the same name (prefix). As this is an undesirable situation, operators will refuse to work with such locations when retrieving data.
      • These changes are very important to adapt to, otherwise you can end up for example getting a folder when expecting a file, or a process when expecting an IOObject!
  • ParameterTypeRepositoryLocation now has a new getter and setter for a predicate to limit the available UI choices for the user when selecting entries. It is used in the RepositoryLocationValueCellEditor if getRepositoryFilter() is not overwritten in it. Note that the operator still has to check the validity of the repository location for its use case, the filter is purely for UI purposes and does not validate the returned value.
    • setRepositoryFilter(Predicate)
    • getRepositoryFilter()
  • Added secure encryption framework to Studio, based on Google Tink. See com.rapidminer.tools.encryption.EncryptionProvider for a starting point. The old CipherTools have been deprecated and must not be used for new encryptions anymore!
    • This means that any access methods to process XML and connections have been deprecated and replaced with a version where you can specify the desired encryption context! Failure to use these new methods may lead to decryption failure of encrypted values in connections, and ParameterTypePassword in processes!
    • See Repository#getEncryptionContext() for getting the encryption context for a repository. The default implementation uses the EncryptionProvider.DEFAULT_CONTEXT which is used by all local repositories. Implement this method if you need a custom encryption key for each of your repository instances.
    • All Process constructors have been deprecated and replaced by a version that takes an encryption context String identifier:
      • Process(String) has been deprecated and replaced by Process(String, String)
      • Process(File, ProgressListener) has been deprecated and replaced by Process(File, String, ProgressListener)
      • Process(Reader) has been deprecated and replaced by Process(Reader, String)
      • Process(InputStream) has been deprecated and replaced by Process(InputStream, String)
      • Process(URL) has been deprecated and replaced by Process(URL, String)
  • ParameterTypePassword and ParameterTypeOAuth have been deprecated and must never be used again in operators! Use the connection framework introduced in version 9.3 instead to avoid having sensitive values in process XML.
  • com.rapidminer.repository.BlobEntry has been deprecated. All new repositories must support the new com.rapidminer.repository.BinaryEntry instead. That is a direct view onto binary content that is not interpreted in any way, shape or form. No magic bytes, nothing. Just like it would be on the filesystem.
  • Added com.rapidminer.repository.gui.BinaryEntryResultRendererRegistry for registering custom renderers to display in the Results view when the new BinaryEntry is passed as a process result.
    • Multiple renderers can be registered per suffix
    • Depends on the suffix of a file, e.g. jpeg
  • Added com.rapidminer.gui.dnd.DropBinaryEntryIntoProcessActionRegistry for registering custom hooks when a user drags the new BinaryEntry into the canvas.
    • Can create an operator or trigger any custom action
    • Depends on the suffix of a file, e.g. py
  • Added com.rapidminer.gui.dnd.DropFileIntoProcessActionRegistry for registering custom hooks when a user drags a binary file from disk into the canvas.
    • Can create an operator or trigger any custom action
    • Depends on the suffix of a file, e.g. py
  • Added com.rapidminer.repository.gui.OpenBinaryEntryActionRegistry for registering custom hooks when the user double-clicks or otherwise opens the new BinaryEntry
    • Depends on the suffix of a file, e.g. py
  • Added com.rapidminer.repository.gui.BinaryEntryIconRegistry for registering custom icons for each binary entry suffix which are shown in the repository panel
    • Depends on the suffix of a file, e.g. py
  • Refactored the com.rapidminer.operator.ports.Port and com.rapidminer.operator.ports.Ports interfaces and sub-interfaces/-classes. Port is now a self referencing generic type, allowing more convenient and type-safe methods
    • There are only input and output ports, they are always opposite, so the types reflect that and the method getOpposite() returns a value of the opposite type; getSource() and getDestination() are now present in both subclasses and might return themselves
    • Connecting and disconnecting can now be done by either side of a connection; there also is a method canConnectTo that checks if two ports can be connected
    • Ports and its implementations were updated to reflect the generic nature of Port. Custom implementations should not be affected at runtime, but might need a small adjustment in code to compile properly
  • Added com.rapidminer.gui.tools.ResourceDockKey#PROPERTY_KEY_NEXT_TO_DOCKABLE and com.rapidminer.gui.tools.ResourceDockKey#PROPERTY_KEY_DEFAULT_FALLBACK_LOCATION which can be used to define where a Dockable should be opened by default
  • Added com.rapidminer.TestUtils which contains utility methods for testing, e.g. a method to do the most barebones setup of RapidMiner required to at least create and use empty processes in unit tests
  • Removed quite a few deprecated methods and classes, which were deprecated for over 10 years at this point. This should have no impact on extensions, unless very old methods annotated with @Deprecated since RapidMiner 5 have been used.
  • Added com.rapidminer.tools.io.NotifyingOutputStreamWrapper, which is an OutputStream wrapper that can execute a runnable after (manual and automatic) close of the stream
  • Added com.rapidminer.tool.TempFileTools with methods to create temporary files which are automatically removed during RapidMiner#cleanup() and RapidMiner#shutdown(). Always use these methods to ensure that unused files are cleaned up.
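
The idea behind the Port refactoring listed above (a self-referencing generic type with exactly two opposite kinds of ports) can be sketched in isolation. The following is a simplified, hypothetical model and not the actual com.rapidminer.operator.ports API: all class names are invented, and only the method names getOpposite(), getSource(), getDestination(), and canConnectTo are taken from the notes above.

```java
/** Demo entry point for the hypothetical port model below. */
final class PortSketchDemo {
    public static void main(String[] args) {
        SketchOutputPort out = new SketchOutputPort("out");
        SketchInputPort in = new SketchInputPort("in");

        in.connectTo(out); // either side may initiate the connection
        System.out.println(out.getOpposite().getName()); // in
        System.out.println(in.getSource().getName());    // out
        System.out.println(out.getSource() == out);      // true

        out.disconnect(); // either side may disconnect as well
        System.out.println(in.isConnected());            // false
        // out.connectTo(new SketchOutputPort("x")) would not compile:
        // an output port only accepts its opposite type, an input port.
    }
}

/** S is the port's own type, O is the opposite type. */
abstract class SketchPort<S extends SketchPort<S, O>, O extends SketchPort<O, S>> {
    private final String name;
    private O opposite;

    SketchPort(String name) {
        this.name = name;
    }

    /** Lets the base class hand out a correctly typed reference to itself. */
    protected abstract S self();

    String getName() { return name; }

    O getOpposite() { return opposite; }

    boolean isConnected() { return opposite != null; }

    boolean canConnectTo(O other) {
        return other != null && !isConnected() && !other.isConnected();
    }

    void connectTo(O other) {
        if (!canConnectTo(other)) {
            throw new IllegalStateException("Cannot connect " + name);
        }
        opposite = other;
        other.attach(self());
    }

    void disconnect() {
        O other = opposite;
        if (other != null) {
            opposite = null;
            other.detach();
        }
    }

    void attach(O other) { opposite = other; }

    void detach() { opposite = null; }
}

final class SketchOutputPort extends SketchPort<SketchOutputPort, SketchInputPort> {
    SketchOutputPort(String name) { super(name); }

    @Override
    protected SketchOutputPort self() { return this; }

    /** An output port is its own source. */
    SketchOutputPort getSource() { return this; }

    SketchInputPort getDestination() { return getOpposite(); }
}

final class SketchInputPort extends SketchPort<SketchInputPort, SketchOutputPort> {
    SketchInputPort(String name) { super(name); }

    @Override
    protected SketchInputPort self() { return this; }

    SketchOutputPort getSource() { return getOpposite(); }

    /** An input port is its own destination. */
    SketchInputPort getDestination() { return this; }
}
```

Because each port type names its opposite in its own type parameters, the compiler can reject connections between two ports of the same kind, and getOpposite() needs no cast.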